🎉 StructXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues Accepted at CVPR 2026!

We are excited to share that our paper “StructXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues” has been accepted to CVPR 2026! 🎉

Congratulations to Zanxi Ruan, Songqun Gao, Qiuyu Kong, Yiming Wang, and Marco Cristani on this exciting achievement! 👏

StructXLIP introduces a structure-centric fine-tuning framework for vision-language models, aiming to improve cross-modal retrieval under long, detailed, and compositionally rich captions. The key idea is to explicitly incorporate multimodal structural cues into the training process, including edge-based visual representations and structure-focused textual descriptions.
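To make "edge-based visual representations" concrete, here is a minimal sketch that turns a tiny grayscale image into an edge map. The Sobel operator is only an assumed stand-in; this post does not say which edge extractor StructXLIP uses, and `sobel_edge_map` is a hypothetical helper name.

```python
# Illustrative only: Sobel is an assumed stand-in for the paper's edge extractor.

def sobel_edge_map(gray):
    """Return the gradient-magnitude map of a 2D grayscale image (list of rows).

    Border pixels are left at 0.0 because the 3x3 kernel does not fit there.
    """
    h, w = len(gray), len(gray[0])
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    edges = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * gray[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * gray[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            edges[y][x] = (gx * gx + gy * gy) ** 0.5
    return edges

# Toy 5x5 image: dark on the left, bright on the right (one vertical boundary).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_map = sobel_edge_map(image)
```

In this toy case the edge map responds only around the dark-to-bright boundary and is zero in the flat bright region, which is the kind of shape and boundary signal a structure-centric view is meant to isolate.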

By aligning these structural cues at both global and local levels, StructXLIP helps vision-language models better capture object shapes, boundaries, spatial layouts, and fine-grained compositional details. The framework introduces auxiliary structure-centric objectives for global edge-text alignment, local region-phrase matching, and RGB-edge consistency regularization, while keeping the original CLIP-style contrastive learning objective.
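The way these objectives could combine is sketched below, under stated assumptions. The original contrastive term and the three auxiliary terms are simply added with weights. The weights, the function names, the use of a symmetric InfoNCE loss for the two alignment terms, and the cosine-based consistency term are all illustrative choices, not the paper's exact formulation.

```python
import math

def cosine_matrix(a, b):
    """Pairwise cosine similarities between two lists of vectors."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    a, b = [unit(v) for v in a], [unit(v) for v in b]
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]

def info_nce(a, b, temperature=0.07):
    """Symmetric CLIP-style contrastive loss; a[i] is the positive for b[i]."""
    sim = cosine_matrix(a, b)
    n = len(sim)

    def one_way(s):
        total = 0.0
        for i in range(n):
            logits = [v / temperature for v in s[i]]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (one_way(sim) + one_way(sim_t))

def consistency(a, b):
    """Mean (1 - cosine similarity) between paired embeddings."""
    sim = cosine_matrix(a, b)
    return sum(1.0 - sim[i][i] for i in range(len(sim))) / len(sim)

def total_loss(rgb, text, edge, struct_text, regions, phrases,
               weights=(1.0, 1.0, 1.0)):
    """Original contrastive loss plus three auxiliary terms (weights are hypothetical)."""
    w_global, w_local, w_cons = weights
    return (info_nce(rgb, text)                        # original CLIP-style objective
            + w_global * info_nce(edge, struct_text)   # global edge-text alignment
            + w_local * info_nce(regions, phrases)     # local region-phrase matching
            + w_cons * consistency(rgb, edge))         # RGB-edge consistency

# Toy embeddings: two perfectly aligned pairs versus the same pairs shuffled.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(aligned, aligned)
loss_shuffled = info_nce(aligned, shuffled)
```

With these toy inputs the aligned pairs give a loss close to zero and the shuffled pairs a much larger one. Every term in `total_loss` rewards this same behavior for a different pair of views.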

A major advantage of StructXLIP is its practicality: structural cues are used only during fine-tuning, so the fine-tuned model incurs no additional inference overhead. At test time, it takes standard image-text inputs while benefiting from the stronger structure-aware representations learned during training. 🚀
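In practice, test-time retrieval needs only ordinary image and text embeddings, with no edge maps or structure-focused captions involved. The sketch below ranks images for a text query by cosine similarity. The embeddings are made-up toy vectors and `rank_images` is a hypothetical helper, not part of the released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def rank_images(text_embedding, image_embeddings):
    """Return image indices sorted from most to least similar to the text query."""
    scores = [cosine(text_embedding, img) for img in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy stand-ins for embeddings produced by the fine-tuned encoders.
image_embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]
query = [0.0, 0.2, 1.0]
ranking = rank_images(query, image_embeddings)
```

This is the same retrieval path a standard CLIP-style model would use, which is why the structural cues add no cost at inference.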

Extensive experiments on multiple retrieval benchmarks, including SKETCHY, INSECT, DOCCI, and DCI, show that StructXLIP consistently improves long-text vision-language alignment and performs strongly against recent fine-tuning approaches.

This work highlights the importance of structural information in multimodal representation learning and provides a simple, effective, and plug-and-play way to enhance vision-language models.

📄 Paper: https://arxiv.org/abs/2602.20089
💻 Code: https://github.com/intelligolabs/StructXLIP

Great work by the team! 👏✨