TY - JOUR
T1 - Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
AU - Kurobe, Akiyoshi
AU - Nakajima, Yoshikatsu
AU - Kitani, Kris
AU - Saito, Hideo
N1 - Funding Information:
This work was supported in part by JST CREST under Grant JPMJCR19F3 and in part by the JST-Mirai Program, Japan, under Grant JPMJMI19B2.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - The ability to recognize and identify terrain characteristics is an essential function required for many autonomous ground robots such as social robots, assistive robots, autonomous vehicles, and ground exploration robots. Recognizing and identifying terrain characteristics is challenging because similar terrains may have very different appearances (e.g., carpet comes in many colors), while terrains with very similar appearance may have very different physical properties (e.g., mulch versus dirt). To address the inherent ambiguity in vision-based terrain recognition and identification, we propose a multi-modal self-supervised learning technique that switches between audio features extracted from a microphone attached to the underside of a mobile platform and image features extracted by a camera on the platform to cluster terrain types. The terrain cluster labels are then used to train an image-based real-time CNN (Convolutional Neural Network) to predict terrain types. Through experiments, we demonstrate that the proposed self-supervised terrain type recognition method achieves over 80% accuracy, greatly outperforming several baselines and suggesting strong potential for assistive applications.
AB - The ability to recognize and identify terrain characteristics is an essential function required for many autonomous ground robots such as social robots, assistive robots, autonomous vehicles, and ground exploration robots. Recognizing and identifying terrain characteristics is challenging because similar terrains may have very different appearances (e.g., carpet comes in many colors), while terrains with very similar appearance may have very different physical properties (e.g., mulch versus dirt). To address the inherent ambiguity in vision-based terrain recognition and identification, we propose a multi-modal self-supervised learning technique that switches between audio features extracted from a microphone attached to the underside of a mobile platform and image features extracted by a camera on the platform to cluster terrain types. The terrain cluster labels are then used to train an image-based real-time CNN (Convolutional Neural Network) to predict terrain types. Through experiments, we demonstrate that the proposed self-supervised terrain type recognition method achieves over 80% accuracy, greatly outperforming several baselines and suggesting strong potential for assistive applications.
KW - CNN
KW - Ground robots
KW - assistive application
KW - self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85100912315&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85100912315&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2021.3059620
DO - 10.1109/ACCESS.2021.3059620
M3 - Article
AN - SCOPUS:85100912315
SN - 2169-3536
VL - 9
SP - 29970
EP - 29979
JO - IEEE Access
JF - IEEE Access
M1 - 9354792
ER -