TY - JOUR
T1 - Self-Supervised Audio-Visual Feature Learning for Single-Modal Incremental Terrain Type Clustering
AU - Ishikawa, Reina
AU - Hachiuma, Ryo
AU - Saito, Hideo
N1 - Funding Information:
This work was supported by JST-Mirai Program, Japan, under Grant JPMJMI19B2.
Publisher Copyright:
© 2013 IEEE.
PY - 2021
Y1 - 2021
N2 - The key to an accurate understanding of terrain is extracting informative features from the multi-modal data obtained from different devices. Sensors such as RGB cameras, depth sensors, vibration sensors, and microphones are used to capture the multi-modal data. Many studies, especially in the robotics field, have explored ways to use them, and some have successfully introduced single-modal or multi-modal methods. In practice, however, robots can face extreme conditions: microphones do not work well in crowded scenes, and an RGB camera cannot capture terrain well in the dark. In this paper, we present a novel framework that applies a multi-modal variational autoencoder and Gaussian mixture model clustering to image and audio data for terrain type clustering, forcing the features of the two modalities closer together in the feature space. Our method enables terrain type clustering even if one of the modalities (either image or audio) is missing at test time. We evaluated the clustering accuracy against a conventional multi-modal terrain type clustering method and conducted ablation studies to show the effectiveness of our approach.
AB - The key to an accurate understanding of terrain is extracting informative features from the multi-modal data obtained from different devices. Sensors such as RGB cameras, depth sensors, vibration sensors, and microphones are used to capture the multi-modal data. Many studies, especially in the robotics field, have explored ways to use them, and some have successfully introduced single-modal or multi-modal methods. In practice, however, robots can face extreme conditions: microphones do not work well in crowded scenes, and an RGB camera cannot capture terrain well in the dark. In this paper, we present a novel framework that applies a multi-modal variational autoencoder and Gaussian mixture model clustering to image and audio data for terrain type clustering, forcing the features of the two modalities closer together in the feature space. Our method enables terrain type clustering even if one of the modalities (either image or audio) is missing at test time. We evaluated the clustering accuracy against a conventional multi-modal terrain type clustering method and conducted ablation studies to show the effectiveness of our approach.
KW - Self-supervised
KW - multi-modal learning
KW - terrain type clustering
UR - http://www.scopus.com/inward/record.url?scp=85107176150&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107176150&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2021.3075582
DO - 10.1109/ACCESS.2021.3075582
M3 - Article
AN - SCOPUS:85107176150
SN - 2169-3536
VL - 9
SP - 64346
EP - 64357
JO - IEEE Access
JF - IEEE Access
M1 - 9416486
ER -