FIVER is an exploratory workshop on the nascent area of learning from instructional videos. It features topic-focused invited speakers, posters, and a panel discussion aimed at identifying challenges and future efforts in this area.
Video understanding has come a long way in the past decade: from classifying easily segmented human activity against static backgrounds and tracking single, smoothly moving objects without clutter, to large-scale detection and segmentation of actions amidst dense clutter and automatic translation of video into textual descriptions, to name a few. In the process, the core problems of video understanding have progressed, the raw and annotated data available for these problems has substantially increased, and the suite of methods under study has broadened. However, much of this work remains a proxy for an eventual task or application, such as video indexing and search, or agent-based understanding of the environment.
In this workshop, we want to take a step beyond these proxies and into a concrete and grounded task: learning from instructional video. This is a nascent area in the vision, learning, robotics, and broader AI communities, with but a handful of recent papers and datasets published on the topic. The goal of this workshop is to start a conversation around learning from instructional video, with the ultimate plan of organizing a future, larger-scale workshop with an associated challenge. To that end, we invite abstract submissions, host invited speakers, present an organizers' overview of the problem area, and hold a panel discussion addressing questions such as: What are the core problems in learning from instructional video? What are reasonable goals to set for this space in the next few years? What data do we have available now, and what data do we need?
Fine-grained instructional video understanding differs from other tasks in computer vision in that the concepts to be understood have significant non-visual elements. Different fine-grained actions, though semantically distinct, may be visually similar and difficult to distinguish using conventional visual representations alone. Methods that leverage an internal, non-visual representation in which actions are more distinguishable are therefore desirable. I will describe ongoing work that leverages force: it is the forces involved in manipulation, more than its visual characteristics, that define its semantics. Our method uses force information, collected with special gloves and provided during training, to learn a mapping from the visual space to the force space, and as a consequence is able to "hallucinate" forces from video at test time. This force representation is then leveraged for classification, prediction, and segmentation.
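The core idea, learning a visual-to-force mapping at training time so forces can be inferred from video alone at test time, can be sketched as a simple regression. This is an illustrative assumption on my part (a linear ridge regressor with made-up dimensions), not the talk's actual model:

```python
import numpy as np

# Sketch of force hallucination: during training we have paired visual
# features X and glove-measured force signals F; we fit a linear map so
# that at test time forces are predicted from video features alone.
# Dimensions and the ridge form are illustrative assumptions.

def fit_force_regressor(X, F, lam=1e-2):
    """Fit W minimizing ||X W - F||^2 + lam ||W||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ F)

def hallucinate_forces(X, W):
    """Predict ('hallucinate') force signals from visual features."""
    return X @ W

rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 3))            # visual dim 8 -> force dim 3
X_train = rng.normal(size=(200, 8))
F_train = X_train @ W_true + 0.01 * rng.normal(size=(200, 3))

W = fit_force_regressor(X_train, F_train)
X_test = rng.normal(size=(5, 8))
F_pred = hallucinate_forces(X_test, W)      # no gloves needed at test time
print(F_pred.shape)                          # (5, 3)
```

The hallucinated force vectors would then serve as the intermediate representation fed to downstream classification, prediction, or segmentation models.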
Grounding textual phrases in visual content with standalone image-sentence pairs is a challenging task. When we consider grounding in instructional videos, the problem becomes profoundly more complex: the latent temporal structure of instructional videos breaks independence assumptions and necessitates contextual understanding to resolve ambiguous visual-linguistic cues. Furthermore, the cost of dense annotation at video scale makes fully supervised approaches prohibitive.
In this work, we propose to tackle this new task with a weakly-supervised framework for reference-aware visual grounding in instructional videos, where only the temporal alignment between the transcription and the video segment is available for supervision. We introduce the visually grounded action graph, a structured representation capturing the latent dependencies between groundings and references in video.
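One way to picture this weak-supervision setting: within an aligned segment we know some region matches the transcribed phrase, but not which one, so grounding can be scored in a multiple-instance style by taking the best-matching candidate. The following is a minimal sketch under that assumption, with made-up embedding names and dimensions, not the paper's actual framework:

```python
import numpy as np

# Multiple-instance view of weakly-supervised grounding: only the
# transcript-segment alignment is known, so the phrase is grounded to
# whichever candidate region scores highest. Embeddings are illustrative.

def ground_phrase(phrase_emb, region_embs):
    """Return index and score of the best-matching region for a phrase."""
    scores = region_embs @ phrase_emb       # dot-product similarity
    best = int(np.argmax(scores))
    return best, float(scores[best])

rng = np.random.default_rng(2)
phrase_emb = rng.normal(size=32)
region_embs = rng.normal(size=(4, 32))      # 4 candidate regions
region_embs[2] = phrase_emb                 # plant region 2 as the match
best, score = ground_phrase(phrase_emb, region_embs)
print(best)  # 2
```

A trained model would additionally propagate context along the action graph so that references ("it", "the mixture") resolve to earlier groundings rather than being matched in isolation.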
Most current approaches focus on developing frame-based representations of videos. But can we recognize actions from frame-level appearance alone? For example, how do humans recognize the action "opening a book"? We argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects. In this talk, I will present a new representation for videos: space-time region graphs.
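To make the representation concrete, a space-time region graph connects object-region features across frames and mixes context between them. The sketch below is my own illustrative assumption (similarity plus temporal-adjacency edges, one round of normalized message passing), not the talk's exact architecture:

```python
import numpy as np

# Sketch of a space-time region graph: nodes are region features from
# several frames; edges link regions by appearance similarity and by
# temporal adjacency; one round of row-normalized message passing lets
# each region aggregate related context, in the spirit of graph
# convolution. All structural choices here are illustrative assumptions.

def build_graph(regions, frame_ids):
    n = len(regions)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sim = max(regions[i] @ regions[j], 0.0)  # appearance edge
            if abs(frame_ids[i] - frame_ids[j]) == 1:
                sim += 1.0                           # temporal edge
            A[i, j] = sim
    A += np.eye(n)                                   # self-loops
    return A / A.sum(axis=1, keepdims=True)          # row-normalize

def propagate(regions, A):
    """Each region aggregates features from its related regions."""
    return A @ regions

rng = np.random.default_rng(1)
regions = rng.normal(size=(6, 16))         # 6 regions, 16-d features
frame_ids = np.array([0, 0, 1, 1, 2, 2])   # two regions per frame
A = build_graph(regions, frame_ids)
out = propagate(regions, A)
print(out.shape)  # (6, 16)
```

The aggregated region features would then be pooled and classified, so that the prediction reflects how regions change and interact over time rather than single-frame appearance.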
At the end of the talk, I will also discuss two new, exciting datasets that can help move the needle in understanding actions. First, Charades-Ego: a dataset with semantically aligned first-person and third-person videos. Second, a new dataset, Robo-Charades: aligned demonstration videos and robotic kinesthetic trajectories.
Current computer vision work focuses on simple labeling tasks: images and videos are labeled with classes, captions, or regions. Animals and humans use vision to perform much richer tasks: avoiding predators, finding food, mating, and modifying the environment. Often such tasks involve complex interactive behavior described by rules or protocols. Games in general, and board games in particular, are a metaphoric abstraction of interactive behavior governed by rules. Children learn to play games by watching their peers play; they rarely read the rules. In this talk, I will present robots that similarly learn the rules of game play by watching other robots play.
There is a huge opportunity in making robots and smart-home devices intelligent partners to people living in the home. One crucial element is enabling these systems to gain a fine-grained understanding of household activities. For example, knowing about a recipe can enable a robot to help with cooking, and enable a smart home to give contextually relevant answers about the specific step the cooking is at. In this talk, I will present a learning architecture (part of the larger RoboBrain effort) that allows learning and sharing of semantically meaningful and actionable representations. I will then discuss future directions and applications to robotics and smart homes.