TY - GEN
T1 - Interactive Robot Action Replanning using Multimodal LLM Trained from Human Demonstration Videos
AU - Hori, Chiori
AU - Kambara, Motonari
AU - Sugiura, Komei
AU - Ota, Kei
AU - Khurana, Sameer
AU - Jain, Siddarth
AU - Corcodel, Radu
AU - Jha, Devesh
AU - Romeres, Diego
AU - Le Roux, Jonathan
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Understanding human actions could allow robots to perform a large spectrum of complex manipulation tasks and make collaboration with humans easier. Recently, multimodal scene understanding using audio-visual Transformers has been used to generate robot action sequences from videos of human demonstrations. However, automatic action sequence generation is not always perfect due to the distribution gap between the training and test environments. To bridge this gap, human intervention can be very effective, such as telling the robot agent what should be done. Motivated by this, we propose an error-correction-based action replanning approach that regenerates better action sequences from (1) actions automatically generated by a pretrained action generator and (2) human error corrections given in natural language. We collected single-arm robot action sequences aligned to human action instructions for the cooking video dataset YouCook2. We trained the proposed error-correction-based action replanning model on a pretrained multimodal LLM (AVBLIP-2), which simultaneously generates a pair of (a) single-arm robot micro-step action sequences and (b) action descriptions in natural language. To assess error-correction performance, we collected human feedback correcting errors in the automatically generated robot actions. Experiments show that our proposed interactive replanning model, trained in a multitask manner on action sequences and descriptions, outperformed the baseline model on all types of scores.
AB - Understanding human actions could allow robots to perform a large spectrum of complex manipulation tasks and make collaboration with humans easier. Recently, multimodal scene understanding using audio-visual Transformers has been used to generate robot action sequences from videos of human demonstrations. However, automatic action sequence generation is not always perfect due to the distribution gap between the training and test environments. To bridge this gap, human intervention can be very effective, such as telling the robot agent what should be done. Motivated by this, we propose an error-correction-based action replanning approach that regenerates better action sequences from (1) actions automatically generated by a pretrained action generator and (2) human error corrections given in natural language. We collected single-arm robot action sequences aligned to human action instructions for the cooking video dataset YouCook2. We trained the proposed error-correction-based action replanning model on a pretrained multimodal LLM (AVBLIP-2), which simultaneously generates a pair of (a) single-arm robot micro-step action sequences and (b) action descriptions in natural language. To assess error-correction performance, we collected human feedback correcting errors in the automatically generated robot actions. Experiments show that our proposed interactive replanning model, trained in a multitask manner on action sequences and descriptions, outperformed the baseline model on all types of scores.
KW - Human-robot collaboration
KW - Interactive error correction
KW - Multimodal LLM
KW - Multimodal scene understanding
KW - Robot action generation
UR - https://www.scopus.com/pages/publications/105003874779
UR - https://www.scopus.com/inward/citedby.url?scp=105003874779&partnerID=8YFLogxK
U2 - 10.1109/ICASSP49660.2025.10887717
DO - 10.1109/ICASSP49660.2025.10887717
M3 - Conference contribution
AN - SCOPUS:105003874779
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
A2 - Rao, Bhaskar D.
A2 - Trancoso, Isabel
A2 - Sharma, Gaurav
A2 - Mehta, Neelesh B.
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Y2 - 6 April 2025 through 11 April 2025
ER -