Interactive Robot Action Replanning using Multimodal LLM Trained from Human Demonstration Videos

  • Chiori Hori
  • Motonari Kambara
  • Komei Sugiura
  • Kei Ota
  • Sameer Khurana
  • Siddarth Jain
  • Radu Corcodel
  • Devesh Jha
  • Diego Romeres
  • Jonathan Le Roux

Research output: Conference contribution

1 Citation (Scopus)

Abstract

Understanding human actions could allow robots to perform a large spectrum of complex manipulation tasks and make collaboration with humans easier. Recently, multimodal scene understanding using audio-visual Transformers has been used to generate robot action sequences from videos of human demonstrations. However, automatic action sequence generation is not always perfect due to the distribution gap between the training and test environments. To bridge this gap, human intervention could be very effective, such as telling the robot agent what should be done. Motivated by this, we propose an error-correction-based action replanning approach that regenerates better action sequences using (1) automatically generated actions from a pretrained action generator and (2) human error corrections in natural language. We collected single-arm robot action sequences aligned with human action instructions for the cooking video dataset YouCook2. We trained the proposed error-correction-based action replanning model using a pretrained multimodal LLM (AVBLIP-2), generating a pair of (a) single-arm robot micro-step action sequences and (b) action descriptions in natural language simultaneously. To assess the performance of error correction, we collected human feedback correcting errors in the automatically generated robot actions. Experiments show that our proposed interactive replanning model, trained in a multitask manner on action sequences and descriptions, outperformed the baseline model in all types of scores.
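The two-pass scheme in the abstract can be summarized as: a first model generates an action sequence from the demonstration video, and a second pass regenerates the sequence conditioned on that initial plan plus a human correction in natural language. The sketch below illustrates this control flow only; the function names are hypothetical, and a toy rule-based stand-in replaces the multimodal LLM (AVBLIP-2) used in the paper.

```python
# Hedged sketch of the error-correction-based replanning loop.
# A toy string-matching "model" stands in for the pretrained
# multimodal LLM; function names are illustrative, not the authors' API.

def plan_actions(video_features):
    """First pass: generate a micro-step robot action sequence from the
    human demonstration video (stubbed here with a fixed plan)."""
    return ["reach(knife)", "grasp(knife)", "cut(onion)"]

def replan_with_feedback(video_features, initial_actions, correction):
    """Second pass: regenerate the sequence conditioned on both the
    automatically generated actions and a human error correction
    given in natural language."""
    actions = list(initial_actions)
    # Toy interpretation of feedback such as "cut the tomato, not the onion":
    # swap the manipulated object named in the correction.
    if "tomato" in correction:
        actions = [a.replace("onion", "tomato") for a in actions]
    return actions

initial = plan_actions(None)  # None stands in for audio-visual features
revised = replan_with_feedback(None, initial, "cut the tomato, not the onion")
print(revised)  # ['reach(knife)', 'grasp(knife)', 'cut(tomato)']
```

In the actual system both passes are decoded by the same multimodal LLM, which additionally emits a natural-language action description as a second multitask output.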

Original language: English
Title of host publication: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
Editors: Bhaskar D. Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (electronic): 9798350368741
DOI
Publication status: Published - 2025
Event: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India
Duration: 6 Apr 2025 – 11 Apr 2025

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (print): 1520-6149

Conference

Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Country/Territory: India
City: Hyderabad
Period: 6 Apr 2025 – 11 Apr 2025

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
