Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation – accepted (oral presentation) @ IROS24

We are pleased to announce that Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation has been accepted for oral presentation @ IROS24! This work is a collaboration with Istituto Italiano di Tecnologia (IIT) and Fondazione Bruno Kessler.

If you want to learn more, the paper is available on arXiv, and the dataset and code are on the project page.

Idea and results:

Interacting with agents through natural language is a long-term goal of embodied AI, as it is potentially the most intuitive mode of human-robot communication. The emerging research on Vision-and-Language Navigation (VLN) follows this path: it aims to develop embodied agents that, given a natural-language instruction, can reach a target destination in a 3D environment, e.g., “Exit the bedroom and turn left. Walk straight past the grey couch and stop near the rug.”

In real-world scenarios, however, instructions given by humans may contain errors when describing a spatial environment due to inaccurate memory or confusion.

To address this problem:

– We categorize errors in the VLN-CE task, and establish the first benchmark – R2RIE-CE

– We show that state-of-the-art VLN-CE methods are not robust to instruction errors

– We formalize the task of Detection and Localization of Instruction Errors

– We propose a method, Instruction Error Detector & Localizer (IEDL)

– We use IEDL to discover errors in the ground truth annotations of the R2R-CE and RxR-CE datasets
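
To make the detection and localization task more concrete, here is a minimal sketch of what a perturbed sample and its two labels could look like. The direction swap, the `inject_direction_error` helper, and the `Sample` fields are illustrative assumptions, not the benchmark's actual error taxonomy or data format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical perturbation table: swapping a direction word is only one
# plausible kind of instruction error, chosen here for illustration.
DIRECTION_SWAPS = {"left": "right", "right": "left"}


@dataclass
class Sample:
    tokens: List[str]            # instruction split on whitespace
    has_error: bool              # detection target: does the instruction contain an error?
    error_index: Optional[int]   # localization target: position of the wrong token, if any


def inject_direction_error(instruction: str) -> Sample:
    """Swap the first direction word found; return an unchanged sample if there is none."""
    tokens = instruction.split()
    for i, tok in enumerate(tokens):
        core = tok.rstrip(".,")          # word without trailing punctuation
        suffix = tok[len(core):]         # the punctuation that was stripped
        swapped = DIRECTION_SWAPS.get(core.lower())
        if swapped is not None:
            # Note: the replacement is lowercase, so original capitalization is not kept.
            tokens[i] = swapped + suffix
            return Sample(tokens, True, i)
    return Sample(tokens, False, None)


sample = inject_direction_error(
    "Exit the bedroom and turn left. Walk straight past the grey couch and stop near the rug."
)
print(" ".join(sample.tokens))
print("has_error:", sample.has_error, "| error_index:", sample.error_index)
```

In this framing, detection is a binary decision per instruction, while localization asks which token is responsible for the mismatch with the environment.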

Method Overview

A frozen policy follows the instruction and produces a sequence of visual observations. A panoramic encoder and a language encoder then produce, respectively, the trajectory visual features and the instruction features. We feed these two sets of features to a cross-modal multi-layer transformer to produce vision-language aligned features. Finally, two specialized heads perform Instruction Error Detection and Instruction Error Localization, respectively.
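
The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration under strong assumptions, not the IEDL implementation: a single attention step without learned projections stands in for the cross-modal multi-layer transformer, the features and head weights are random placeholders rather than encoder outputs, and all function names (`cross_attend`, `detect_and_localize`) are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query vector attends over keys_values."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(len(q))
        ])
    return attended


def detect_and_localize(instr_feats, traj_feats, w_det, w_loc):
    # Instruction tokens attend to the trajectory; a residual sum gives aligned features.
    attended = cross_attend(instr_feats, traj_feats)
    aligned = [[a + b for a, b in zip(tok, att)] for tok, att in zip(instr_feats, attended)]
    dim = len(aligned[0])
    # Detection head: mean-pool the tokens, then a linear layer and a sigmoid.
    pooled = [sum(tok[d] for tok in aligned) / len(aligned) for d in range(dim)]
    p_error = 1.0 / (1.0 + math.exp(-dot(pooled, w_det)))
    # Localization head: one score per token, normalized over the instruction.
    p_location = softmax([dot(tok, w_loc) for tok in aligned])
    return p_error, p_location


rng = random.Random(0)
DIM, N_TOKENS, N_STEPS = 8, 6, 4
# Random stand-ins for the language-encoder and panoramic-encoder outputs.
instr_feats = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_TOKENS)]
traj_feats = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_STEPS)]
# Random stand-ins for trained head weights.
w_det = [rng.uniform(-1, 1) for _ in range(DIM)]
w_loc = [rng.uniform(-1, 1) for _ in range(DIM)]

p_error, p_location = detect_and_localize(instr_feats, traj_feats, w_det, w_loc)
print("P(instruction contains an error):", round(p_error, 3))
print("Most suspicious token index:", p_location.index(max(p_location)))
```

With untrained weights the outputs are meaningless; the point is only the shape of the computation: one probability per instruction for detection and one distribution over tokens for localization.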