2020-21 Magic Grant Profile: Improving Remote Learning via Hierarchical Decomposition of Instructional Videos

Improving Remote Learning via Hierarchical Decomposition of Instructional Videos – one of our 2020-21 Magic Grants – aims to make it easier to create instructional videos with logical navigation. To do this, the team is using algorithms that automatically learn shared action steps from videos across different tasks. With this knowledge, the team is working to “hierarchically” segment these tasks into discrete steps and to provide voice-based navigation commands for jumping to those steps.
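To make the idea concrete, here is one way a hierarchically segmented tutorial could be represented in code. This is a minimal sketch only; the class names, fields, and sample tutorial are our illustration, not the team’s actual data model.

```python
# Hypothetical data model for a hierarchically segmented tutorial.
# Class names, fields, and the example below are illustrative only.
from dataclasses import dataclass, field


@dataclass
class FineStep:
    """A single discrete action, tied to a span of the video."""
    label: str        # e.g. "whisk the eggs"
    start_sec: float
    end_sec: float


@dataclass
class CoarseSection:
    """A higher-level grouping of related fine steps."""
    title: str        # e.g. "prepare the batter"
    steps: list = field(default_factory=list)


@dataclass
class Tutorial:
    title: str
    sections: list = field(default_factory=list)

    def outline(self):
        """Render the two-level structure as an at-a-glance overview."""
        lines = [self.title]
        for section in self.sections:
            lines.append(f"  {section.title}")
            for step in section.steps:
                lines.append(f"    - {step.label} ({step.start_sec:.0f}s)")
        return "\n".join(lines)


tutorial = Tutorial(
    title="Simple pancakes",
    sections=[
        CoarseSection("prepare the batter", [
            FineStep("whisk the eggs", 12.0, 34.0),
            FineStep("fold in the flour", 34.0, 70.0),
        ]),
    ],
)
print(tutorial.outline())
```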

This post is part of a series of interviews with our current 2020-21 Magic Grant teams. Since we are back in Magic Grant application season, we want to showcase some of the great work our current grantees are doing and encourage you to consider applying for a Magic Grant. Here’s the link to the call for proposals, and to our FAQ. And here’s the link to the application itself; applications are due May 1.


Here’s a (lightly) edited version of the interview.

What was the main impetus for your project? Where did the idea come from?

How-to videos have grown increasingly popular over the years. Even before Covid-19 hit, people were turning to them to accomplish all kinds of tasks. However, our own experiences trying to follow how-to videos were often frustrating because of navigation challenges. We (Anh Truong and Chien-Yi Chang) were individually doing research on computer vision for understanding instructional videos and on HCI (human-computer interaction) methods for navigating them, and this project came about at the intersection of those two research areas.

Were you thinking of this before Covid-19 hit and made remote learning so important? And how has your project evolved or changed during or because of the pandemic?

While we were thinking about this before Covid-19 hit, we started with more targeted domains of instructional videos, such as makeup tutorials. Once the pandemic hit, we saw a greater need for an approach that could generalize across many different domains. People were now turning to instructional videos for all sorts of tasks, such as making masks, cutting hair, cooking, and more.

What are some of the biggest milestones you’ve achieved so far?

One of our papers, on segmenting makeup videos, was accepted to SIGCHI (the Special Interest Group on Computer–Human Interaction conference). Additionally, we’ve scraped a large number of instructional videos and text instructions to build a corpus that will help guide fine- and coarse-level segmentation for a set of domains. We have also implemented voice commands that enable hands-free navigation of instructional videos using references to high-level intentions such as steps, tools, and objects.
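As a rough sketch of what hands-free navigation over segment metadata could look like, the snippet below matches a transcribed voice command against hypothetical step labels, tools, and objects and returns a timestamp to seek to. The step data and the simple keyword matching are invented for illustration and are not the team’s actual pipeline.

```python
# Illustrative only: a transcribed command is matched against segment
# metadata (step labels, tools, objects) to find a timestamp to seek to.
# The step data and keyword matching below are invented for this sketch.
STOPWORDS = {"the", "a", "an", "to", "with", "part", "where", "you"}


def resolve_command(utterance, steps):
    """Return the first step whose label, tools, or objects appear in the utterance."""
    words = set(utterance.lower().split()) - STOPWORDS
    for step in steps:
        keywords = set(step["label"].lower().split())
        keywords.update(t.lower() for t in step.get("tools", []))
        keywords.update(o.lower() for o in step.get("objects", []))
        if words & (keywords - STOPWORDS):
            return step
    return None


steps = [
    {"label": "pin the fabric", "tools": ["pins"], "objects": ["fabric"], "start_sec": 42.0},
    {"label": "sew the edges", "tools": ["needle"], "objects": ["fabric"], "start_sec": 95.0},
]

match = resolve_command("jump to the part where you sew", steps)
if match:
    print(f"Seeking to {match['start_sec']:.0f}s: {match['label']}")
```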

What are some of the most challenging aspects of your project?

One of the most challenging aspects of our project is extracting the coarse-level groupings of a tutorial. In many domains, this grouping is implicit and not directly mentioned alongside each fine-level step. We have to figure out, first, what those coarse groupings are for a given domain and, second, how to extract them.
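To give a flavor of the problem, the toy heuristic below guesses coarse groupings by starting a new group whenever consecutive fine steps stop sharing vocabulary. It is an illustration of the challenge, not the method the team is developing, and the sample step labels are invented.

```python
# Toy heuristic (illustration only): start a new coarse group whenever
# consecutive fine steps share no content words, on the idea that a
# vocabulary shift often signals a new phase of the task.
STOPWORDS = {"the", "a", "an", "to", "and", "of", "with", "into"}


def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}


def group_steps(step_labels):
    """Split an ordered list of fine-step labels into coarse groups."""
    groups = []
    for label in step_labels:
        if groups and content_words(label) & content_words(groups[-1][-1]):
            groups[-1].append(label)      # shares vocabulary: same group
        else:
            groups.append([label])        # vocabulary shift: new group
    return groups


labels = [
    "measure the flour",
    "sift the flour into a bowl",
    "crack the eggs",
    "whisk the eggs",
    "fold the eggs into the flour",
]
print(group_steps(labels))
# two groups: the flour steps, then the egg steps
```

In practice a vocabulary shift is only a weak signal, which is part of why the team is building a corpus of videos and text instructions to help guide the coarse-level segmentation.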

What are some of the ideal use cases for what you are developing? Where do you hope to have the biggest impact?

People watch instructional videos for a number of reasons. Some want to follow along in the moment, some watch for entertainment, and some want to passively absorb the skills and roughly approximate them later. We believe our interface can be especially effective when the user is trying to follow along with a tutorial in the moment or is looking to understand which portions of a tutorial are relevant to their task. Even for people who watch tutorials purely for entertainment, we believe our UI (user interface) can give them a high-level overview of what a tutorial entails before they commit to watching it.

What comes next?

Next, we would like to release some of the tutorials we have segmented so that people can use them to complete real-world tasks and give us feedback on them.