Fine-grained Instructional Video undERstanding (FIVER) Workshop

An exploratory workshop focusing on the nascent area of learning from instructional videos. FIVER features topic-focused invited speakers, posters, and a discussion panel aimed at identifying challenges and future efforts in this area.

Video understanding has advanced considerably in the past decade, moving from classifying easily segmented human activity against static backgrounds and tracking single objects moving smoothly without clutter, to large-scale detection and segmentation of action amidst dense clutter and automatic translation of video into textual description, to name a few examples. In the process, the core problems of video understanding have progressed, the raw and annotated data available for these problems have substantially increased, and the suite of methods under study has broadened. However, much of this work remains a proxy for an eventual task or application, such as video indexing and search, or agent-based understanding of the environment.

In this workshop, we want to take a step beyond these proxies and into a concrete, grounded task: learning from instructional video. This is a nascent area in the vision, learning, robotics, and broader AI communities, with but a handful of recent papers and datasets published on the topic. The goal of this workshop is to start a conversation around learning from instructional video, with the ultimate plan of organizing a future, larger-scale workshop with an associated challenge. To that end, we invite abstract submissions, host invited speakers, present an overview of the problem area (by the organizers), and hold a panel discussion around questions such as: What are the core problems in learning from instructional video? What are reasonable goals to set for this space in the next few years? What data do we have available now, and what data do we need?

News, Dates and Updates

Confirmed Speakers

Cornelia Fermüller, University of Maryland, College Park

Leveraging Motoric Information for Fine-Grained Action Understanding

Fine-grained instructional video understanding differs from other tasks in computer vision in that the concepts to be understood have significant non-visual elements. Different fine-grained actions, though semantically distinct, may be visually similar and difficult to distinguish using conventional visual representations alone. Methods that leverage an internal, non-visual representation in which actions are more distinguishable are therefore desirable. I will describe ongoing work that leverages force: more than the visual characteristics of a manipulation, it is the forces involved that define its semantics. Our method uses force information, collected with special gloves and provided during training, to learn a mapping from the visual space to the force space, and is consequently able to "hallucinate" forces from video at test time. This force representation is then leveraged in classification, prediction, and segmentation.

Animesh Garg, Stanford University

Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Video

Grounding textual phrases in visual content with standalone image-sentence pairs is a challenging task. When we consider grounding in instructional videos, the problem becomes profoundly more complex: the latent temporal structure of instructional videos breaks independence assumptions and necessitates contextual understanding to resolve ambiguous visual-linguistic cues. Furthermore, the scale of video data and the density of annotations required make supervised approaches prohibitively costly.
In this work, we propose to tackle this new task with a weakly-supervised framework for reference-aware visual grounding in instructional videos, where only the temporal alignment between the transcription and the video segment is available for supervision. We introduce the visually grounded action graph, a structured representation capturing the latent dependency between grounding and references in video.

Abhinav Gupta, Carnegie Mellon University

Title TBD

Jeffrey Siskind, Purdue University

Learning Grounded Game Play

Current computer vision work focuses on simple labeling tasks: images and video are labeled with classes, captions, or regions. Animals and humans use vision to perform much richer tasks: avoiding predators, finding food, mating, and modifying the environment. Such behavior often involves complex interaction described by rules or protocols. Games in general, and board games in particular, are a metaphoric abstraction of interactive behavior governed by rules. Children learn to play games by watching their peers play; they rarely read the rules. In this talk, I will present robots that similarly learn the rules of game play by watching other robots play.

Ashutosh Saxena, Stanford University

Title TBD


To be posted when available.




Organizers

  • Jason Corso, University of Michigan
  • Ivan Laptev, INRIA
  • Josef Sivic, INRIA and Czech Technical University
  • Luowei Zhou, University of Michigan