
We are excited to share that our paper “Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues” has been accepted to ICCV 2025! 🎉
Congratulations to Francesco Taioli, Edoardo Zorzi, Gianni Franchi, Alberto Castellini, Alessandro Farinelli, Marco Cristani, and Yiming Wang on this exciting achievement! 👏
This work introduces Collaborative Instance Object Navigation (CoIN), a new task setting for embodied AI where an agent must locate a specific object instance in an unknown environment while interacting with a human user only when necessary. Unlike standard instance object navigation, where users are expected to provide a detailed target description before navigation starts, CoIN allows users to begin with a minimal instruction such as “Find the picture”. The agent then actively resolves uncertainty through natural, template-free, open-ended dialogues during navigation. 🤖💬
To address this challenge, the authors propose AIUTA (Agent-user Interaction with UncerTainty Awareness), a training-free framework that uses Vision-Language Models and Large Language Models to reason about when and how the agent should ask for help. AIUTA combines two modules. The Self-Questioner lets the agent ask itself visual questions and refine its understanding of detected objects. The Interaction Trigger decides whether to continue exploring, ask the user a clarifying question, or stop when the target is found.

A key contribution of the work is an uncertainty-aware mechanism based on normalized entropy, designed to reduce hallucinations and unreliable visual descriptions from VLMs. By estimating uncertainty at the attribute level, the agent can filter out ambiguous information, ask more informative questions, and avoid unnecessary interactions with the user. 🔍✨
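
To make the idea concrete, here is a minimal sketch of attribute-level filtering with normalized entropy. It is an illustration, not the authors' implementation: the function names, the example attributes, the three-way answer distribution (yes / no / don't know), and the 0.75 threshold are all assumptions chosen for this example. Normalized entropy is the Shannon entropy of the answer distribution divided by log K, where K is the number of possible answers, so it ranges from 0 (fully confident) to 1 (maximally uncertain).

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution divided by log(K), giving a value in [0, 1]."""
    k = len(probs)
    total = sum(probs)
    if k < 2 or total <= 0:
        return 0.0
    # Renormalize so the input does not have to sum exactly to 1.
    entropy = -sum((p / total) * math.log(p / total) for p in probs if p > 0)
    return entropy / math.log(k)


def filter_attributes(attribute_probs, threshold=0.75):
    """Keep only attributes whose answer distribution is certain enough.

    attribute_probs maps a (hypothetical) attribute name to the probabilities
    a VLM assigns to each candidate answer. The threshold is an assumed value.
    """
    return {
        name: probs
        for name, probs in attribute_probs.items()
        if normalized_entropy(probs) <= threshold
    }


# Hypothetical answer probabilities over (yes, no, don't know) for two attributes.
observations = {
    "frame_is_black": [0.90, 0.05, 0.05],  # confident answer
    "has_glass_cover": [0.34, 0.33, 0.33],  # near-uniform, so unreliable
}
reliable = filter_attributes(observations)
print(sorted(reliable))
```

In this sketch the near-uniform attribute is discarded, so only the confident one would feed into the object description or a follow-up question to the user.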
The paper also introduces CoIN-Bench, a new benchmark for collaborative instance object navigation in challenging multi-instance environments. CoIN-Bench supports both real human evaluation and reproducible simulated user-agent interactions, enabling scalable assessment of agents that must distinguish between visually similar object instances.
Experiments show that AIUTA, despite being training-free, achieves strong performance on CoIN-Bench and significantly improves over existing zero-shot navigation baselines. The results demonstrate that uncertainty-aware reasoning can help embodied agents navigate more effectively while minimizing human effort.
This work represents an important step toward more practical and collaborative embodied AI systems, where agents can communicate naturally, reason about uncertainty, and ask for help only when it is truly needed.
🔗 Project Page: https://intelligolabs.github.io/CoIN
Great work by the team! 👏✨

