This paper proposes VISTURE, a system that generates a robot's gestures and speech from video input. VISTURE assumes a situation in which a robot conveys what it saw through its camera to a person who was absent. The contribution of this paper is a case study investigating the expressions that Japanese people use to describe video scenes, the results of which informed the design of VISTURE. In particular, the case study revealed a classification of expressions depicting video scenes: Foreground information, which describes the relevant event of the scene, and Background information, which is not the main point of the description but conveys the overall scene. Foreground and Background are referred to in combination. VISTURE employs this classification to generate human-like expressions. Moreover, we designed a method to determine Foreground and Background, which can generate multiple combinations of expressions. To evaluate the quality of the gestures and speech generated by VISTURE, we investigated people's impressions of a robot performing them. The results showed that the robot was perceived as more likable and capable when it performed gestures.