Data Collection-Free Masked Video Modeling

Yuchi Ishikawa, Masayoshi Kondo, Yoshimitsu Aoki

Research output: Conference contribution

Abstract

Pre-training video transformers generally requires a large amount of data, which raises significant challenges: data collection costs, and concerns related to privacy, licensing, and inherent biases. Synthesizing data is a promising way to address these issues, yet pre-training solely on synthetic data poses its own challenges. In this paper, we introduce an effective self-supervised learning framework for videos that leverages readily available and less costly static images. Specifically, we define the Pseudo Motion Generator (PMG) module, which recursively applies image transformations to generate pseudo-motion videos from images. These pseudo-motion videos are then used in masked video modeling. Our approach is applicable to synthetic images as well, thus entirely freeing video pre-training from data collection costs and the other concerns associated with real data. Through experiments on action recognition tasks, we demonstrate that this framework allows effective learning of spatio-temporal features through pseudo-motion videos, significantly improving over existing methods that also use static images and partially outperforming those using both real and synthetic videos. These results uncover fragments of what video transformers learn through masked video modeling.
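To make the idea of recursively applying an image transformation concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The transformation (a fixed integer translation with zero fill), the function names `shift_image`, `pseudo_motion_video` and `tube_mask`, and all parameter values are hypothetical choices made for illustration. The masking step assumes VideoMAE-style tube masking, where a single random spatial mask is shared across all temporal positions. The abstract does not specify the paper's exact masking scheme.

```python
import numpy as np


def shift_image(img: np.ndarray, dy: int, dx: int) -> np.ndarray:
    """Translate an H x W x C image by (dy, dx) pixels; uncovered pixels become 0."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    if abs(dy) >= h or abs(dx) >= w:
        return out  # the content has moved completely out of frame
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out


def pseudo_motion_video(image, num_frames=16, max_shift=2, rng=None):
    """Hypothetical PMG-style generator: apply one sampled transform recursively."""
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    frames = [image]
    for _ in range(num_frames - 1):
        # Recursive step: transform the previous frame, not the original image.
        frames.append(shift_image(frames[-1], dy, dx))
    return np.stack(frames)  # shape (T, H, W, C)


def tube_mask(num_time_tokens, grid_h, grid_w, mask_ratio=0.9, rng=None):
    """Assumed VideoMAE-style tube mask: one spatial mask shared over time."""
    rng = np.random.default_rng() if rng is None else rng
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[rng.choice(num_patches, size=num_masked, replace=False)] = True
    # True marks a masked token whose content the model must reconstruct.
    return np.broadcast_to(
        spatial.reshape(grid_h, grid_w), (num_time_tokens, grid_h, grid_w)
    ).copy()


# Usage: turn one random "image" into an 8-frame clip and mask 75% of the tubes.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
video = pseudo_motion_video(img, num_frames=8, rng=rng)
mask = tube_mask(4, 8, 8, mask_ratio=0.75, rng=rng)
print(video.shape, mask.shape, int(mask.sum()))
```

Composing the same small transform frame after frame gives the clip a consistent direction of motion, which is what turns a static image into a pseudo-motion video. A real PMG may sample from a richer set of transformations than the single translation assumed here.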

Original language: English
Host publication title: Computer Vision – ECCV 2024 - 18th European Conference, Proceedings
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 37-56
Number of pages: 20
ISBN (Print): 9783031732461
DOI
Publication status: Published - 2025
Event: 18th European Conference on Computer Vision, ECCV 2024 - Milan, Italy
Duration: 29 Sep 2024 to 4 Oct 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15069 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 18th European Conference on Computer Vision, ECCV 2024
Country/Territory: Italy
City: Milan
Period: 24/9/29 to 24/10/4

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
