TY - GEN
T1 - Single-modal incremental terrain clustering from self-supervised audio-visual feature learning
AU - Ishikawa, Reina
AU - Hachiuma, Ryo
AU - Kurobe, Akiyoshi
AU - Saito, Hideo
N1 - Funding Information: This research is supported by JST (JPMJMI19B2). Publisher Copyright: © 2020 IEEE
PY - 2020
Y1 - 2020
N2 - The key to an accurate understanding of terrain is to extract informative features from multi-modal data obtained from different devices. Sensors such as RGB cameras, depth sensors, vibration sensors, and microphones are used to collect this multi-modal data. Many studies have explored ways to use them, especially in the robotics field, and some have successfully introduced single-modal or multi-modal methods. However, in practice, robots can face extreme conditions: microphones do not work well in crowded scenes, and an RGB camera cannot capture terrain well in the dark. In this paper, we present a novel framework that applies a multi-modal variational autoencoder and the Gaussian mixture model clustering algorithm to image and audio data for terrain type clustering. Our method enables terrain type clustering even if one of the modalities (either image or audio) is missing at test time. We compared the clustering accuracy against a conventional multi-modal terrain type clustering method and conducted ablation studies to show the effectiveness of our approach.
AB - The key to an accurate understanding of terrain is to extract informative features from multi-modal data obtained from different devices. Sensors such as RGB cameras, depth sensors, vibration sensors, and microphones are used to collect this multi-modal data. Many studies have explored ways to use them, especially in the robotics field, and some have successfully introduced single-modal or multi-modal methods. However, in practice, robots can face extreme conditions: microphones do not work well in crowded scenes, and an RGB camera cannot capture terrain well in the dark. In this paper, we present a novel framework that applies a multi-modal variational autoencoder and the Gaussian mixture model clustering algorithm to image and audio data for terrain type clustering. Our method enables terrain type clustering even if one of the modalities (either image or audio) is missing at test time. We compared the clustering accuracy against a conventional multi-modal terrain type clustering method and conducted ablation studies to show the effectiveness of our approach.
UR - http://www.scopus.com/inward/record.url?scp=85107187736&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107187736&partnerID=8YFLogxK
U2 - 10.1109/ICPR48806.2021.9412638
DO - 10.1109/ICPR48806.2021.9412638
M3 - Conference contribution
AN - SCOPUS:85107187736
T3 - Proceedings - International Conference on Pattern Recognition
SP - 9399
EP - 9406
BT - Proceedings of ICPR 2020 - 25th International Conference on Pattern Recognition
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th International Conference on Pattern Recognition, ICPR 2020
Y2 - 10 January 2021 through 15 January 2021
ER -