Spatiotemporal Video Highlight by Neural Network Considering Gaze and Hands of Surgeon in Egocentric Surgical Videos

Keitaro Yoshida, Ryo Hachiuma, Hisako Tomita, Jingjing Pan, Kris Kitani, Hiroki Kajita, Tetsu Hayashida, Maki Sugimoto

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)


Surgical videos are widely used in medicine to teach surgical skills. Medical students and residents study them to learn faster and to compensate for their limited opportunities to take part in surgery. Recording egocentric video with a wearable camera captures a surgeon's technique in detail. However, most egocentric surgical videos are long; a recording of tumor removal in breast surgery, for example, often reaches 2 h. At that length, finding the important scenes is time consuming, particularly because many surgical videos include nonessential scenes such as sterilization and preparation of tools. Scene estimation by machine learning can extract specific scenes from a long video. It is also important to know where the surgeon is looking, so that the video can zoom in on key elements and let viewers see the incision area and the fine details of the technique. In this study, we aimed to highlight incision scenes in egocentric surgical videos in the spatiotemporal domain, using two neural networks: one for temporal highlights and one for spatial highlights. For the temporal highlights, we designed a neural network that estimates incision scenes by learning gaze speed, hand movements, number of hands, and background movements in egocentric surgical videos. For the spatial highlights, to estimate the important area to zoom in on, we designed a neural network that learns the surgeon's gaze on natural features of surgical scenes and outputs a probability map representing the estimated gaze area. The estimated gaze area was also used to calculate the appropriate zoom-in position and zoom-in ratio. To let users control the highlight parameters according to their preferences, we also built a user interface for selecting the playback speed gain and zoom ratio gain. For the evaluation, we verified the performance of the networks through a quantitative assessment and conducted a user study in which medical doctors viewed an actual surgical video, to obtain a qualitative assessment of the proposed system.
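
The abstract does not give the formulas that turn the network outputs into playback parameters, so the following is only an illustrative sketch of how a gaze probability map and a per-frame incision probability might be mapped to a zoom position, a zoom ratio, and a playback speed. Everything in it is an assumption made for illustration, not the authors' method: the function names `zoom_from_gaze_map` and `playback_speed`, the two-standard-deviation region of interest, the linear speed-up rule, and the `max_zoom` and `max_speed` limits are all hypothetical.

```python
from math import sqrt
from typing import List, Tuple


def zoom_from_gaze_map(
    prob_map: List[List[float]], gain: float = 1.0, max_zoom: float = 4.0
) -> Tuple[float, float, float]:
    """Return (center_x, center_y, zoom_ratio) for a 2D gaze probability map.

    Hypothetical heuristic: zoom so that a box of +/- 2 standard deviations
    around the probability-weighted gaze center fills the frame.
    """
    height, width = len(prob_map), len(prob_map[0])
    total = sum(sum(row) for row in prob_map)
    if total <= 0:
        # No gaze evidence: stay on the frame center without zooming.
        return (width - 1) / 2, (height - 1) / 2, 1.0

    # Zoom position: probability-weighted mean pixel coordinate.
    cx = sum(x * p for row in prob_map for x, p in enumerate(row)) / total
    cy = sum(y * p for y, row in enumerate(prob_map) for p in row) / total

    # Spread of the estimated gaze area along each axis.
    var_x = sum((x - cx) ** 2 * p for row in prob_map for x, p in enumerate(row)) / total
    var_y = sum((y - cy) ** 2 * p for y, row in enumerate(prob_map) for p in row) / total

    # Region of interest, at least one pixel wide and high.
    roi_w = max(1.0, 4 * sqrt(var_x))
    roi_h = max(1.0, 4 * sqrt(var_y))

    # User-selected zoom ratio gain scales the base ratio; clamp to [1, max_zoom].
    base_ratio = min(width / roi_w, height / roi_h)
    zoom = min(max_zoom, max(1.0, base_ratio * gain))
    return cx, cy, zoom


def playback_speed(incision_prob: float, gain: float = 1.0, max_speed: float = 8.0) -> float:
    """Hypothetical rule: play incision frames at 1x and speed up the rest."""
    p = min(1.0, max(0.0, incision_prob))
    return 1.0 + gain * (max_speed - 1.0) * (1.0 - p)
```

Under these assumptions, a sharply peaked gaze map yields a strong zoom and a diffuse map yields none, while a gain of zero disables the speed-up entirely so the video plays at normal speed throughout.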

Original language: English
Article number: 2141001
Journal: Journal of Medical Robotics Research
Issue number: 1
Publication status: Published - 2022 Mar 1

Keywords


  • Scene estimation
  • egocentric video
  • gaze point
  • surgical video
  • video editing

ASJC Scopus subject areas

  • Biomedical Engineering
  • Human-Computer Interaction
  • Computer Science Applications
  • Artificial Intelligence
  • Applied Mathematics

