Interactive Robot Action Replanning using Multimodal LLM Trained from Human Demonstration Videos

Chiori Hori, Motonari Kambara, Komei Sugiura, Kei Ota, Sameer Khurana, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

Understanding human actions could allow robots to perform a large spectrum of complex manipulation tasks and make collaboration with humans easier. Recently, multimodal scene understanding using audio-visual Transformers has been used to generate robot action sequences from videos of human demonstrations. However, automatic action sequence generation is not always perfect due to the distribution gap between the training and test environments. To bridge this gap, human intervention can be very effective, such as telling the robot agent what should be done. Motivated by this, we propose an error-correction-based action replanning approach that regenerates better action sequences using (1) automatically generated actions from a pretrained action generator and (2) human error corrections in natural language. We collected single-arm robot action sequences aligned to human action instructions for the cooking video dataset YouCook2. We trained the proposed error-correction-based action replanning model using a pretrained multimodal LLM (AVBLIP-2), which simultaneously generates (a) single-arm robot micro-step action sequences and (b) action descriptions in natural language. To assess error-correction performance, we collected human feedback correcting errors in the automatically generated robot actions. Experiments show that our proposed interactive replanning model, trained in a multitask manner on action sequences and descriptions, outperformed the baseline model across all evaluation scores.
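The replanning step described in the abstract can be pictured as a single conditioned generation call: the multimodal LLM receives the demonstration video, the initially generated action sequence, and the human's natural-language correction, then decodes a revised micro-step action sequence together with its description. The following is a minimal illustrative sketch only; the replan interface, prompt format, and generate call are hypothetical placeholders and not the authors' released code or the AVBLIP-2 API.

    # Illustrative sketch only: the replanner interface below is hypothetical.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ReplanResult:
        actions: List[str]   # revised single-arm micro-step action sequence
        description: str     # natural-language description of the revised plan

    def replan(model, video_features, initial_actions: List[str], correction: str) -> ReplanResult:
        """Condition a multimodal LLM on (video, initial plan, human correction) and
        decode a corrected action sequence plus its description (multitask output)."""
        prompt = (
            "Initial actions: " + "; ".join(initial_actions) + "\n"
            "Human correction: " + correction + "\n"
            "Revised actions and description:"
        )
        # Hypothetical generation call taking visual features and a text prompt.
        output = model.generate(video=video_features, text=prompt)
        actions_text, description = output.split("\n", 1)
        return ReplanResult(actions=actions_text.split("; "), description=description.strip())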

Original language: English
Title of host publication: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
Editors: Bhaskar D. Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350368741
Publication status: Published - 2025
Event: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India
Duration: 2025 Apr 6 - 2025 Apr 11

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Country/Territory: India
City: Hyderabad
Period: 25/4/6 - 25/4/11

Keywords

  • Human-robot collaboration
  • Interactive error correction
  • Multimodal LLM
  • Multimodal scene understanding
  • Robot action generation

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
