We are excited to share that our paper “StructXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues” has been accepted to CVPR 2026!
Congratulations to Zanxi Ruan, Songqun Gao, Qiuyu Kong, Yiming Wang, and Marco Cristani on this exciting achievement!
StructXLIP introduces a structure-centric fine-tuning framework for vision-language models, aiming to improve cross-modal retrieval under long, detailed, and compositionally rich captions. The key idea is to explicitly incorporate multimodal structural cues into the training process, including edge-based visual representations and structure-focused textual descriptions.
By aligning these structural cues at both global and local levels, StructXLIP helps vision-language models better capture object shapes, boundaries, spatial layouts, and fine-grained compositional details. The framework introduces auxiliary structure-centric objectives for global edge-text alignment, local region-phrase matching, and RGB-edge consistency regularization, while keeping the original CLIP-style contrastive learning objective.
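To make the combination of objectives more concrete, here is a minimal NumPy sketch of how a CLIP-style contrastive loss could be summed with three auxiliary structure-centric terms. It is an illustration under stated assumptions, not the paper's implementation: the function names, the loss weights (`w_global`, `w_local`, `w_cons`), the temperature, the use of a contrastive loss for region-phrase matching, and the cosine-distance form of the consistency term are all hypothetical choices made for this example. Please refer to the paper and the official code for the actual formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `a` is the positive for row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_a2b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_a2b + loss_b2a))

def consistency_loss(rgb_emb, edge_emb):
    """Assumed form of the regularizer: mean cosine distance between
    the RGB embedding and the edge-map embedding of the same image."""
    rgb, edge = l2_normalize(rgb_emb), l2_normalize(edge_emb)
    return float(np.mean(1.0 - np.sum(rgb * edge, axis=-1)))

def total_loss(rgb, text, edge, struct_text, regions, phrases,
               w_global=1.0, w_local=1.0, w_cons=1.0):
    """Original CLIP-style objective plus three auxiliary terms.
    The weights are hypothetical hyperparameters for this sketch."""
    base = clip_contrastive_loss(rgb, text)                 # image vs. caption
    global_term = clip_contrastive_loss(edge, struct_text)  # edge map vs. structure text
    local_term = clip_contrastive_loss(regions, phrases)    # region vs. phrase (assumed contrastive)
    cons_term = consistency_loss(rgb, edge)                 # RGB vs. edge consistency
    return base + w_global * global_term + w_local * local_term + w_cons * cons_term

# Toy usage with random embeddings (batch of 8, dimension 16).
rng = np.random.default_rng(0)
emb = [rng.normal(size=(8, 16)) for _ in range(6)]
print(total_loss(*emb))
```

In this sketch the edge and structure-text embeddings appear only in the training loss, so dropping the auxiliary terms leaves a standard image-text model, which mirrors the idea that structural cues are not needed at test time.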

A major advantage of StructXLIP is its practicality: structural cues are used only during fine-tuning, which means the model adds no inference overhead. At test time, it works with standard image-text inputs while benefiting from the stronger structure-aware representations learned during training.

Extensive experiments on multiple retrieval benchmarks, including SKETCHY, INSECT, DOCCI, and DCI, show that StructXLIP consistently improves long-text vision-language alignment and performs strongly compared with recent fine-tuning approaches.
This work highlights the importance of structural information in multimodal representation learning and provides a simple, effective, and plug-and-play way to enhance vision-language models.
Paper: https://arxiv.org/abs/2602.20089
Code: https://github.com/intelligolabs/StructXLIP
Great work by the team!

