Given an audio file and a puppet video, we produce a dubbed result in which the puppet is saying the new audio phrase with proper mouth articulation. Specifically, each syllable of the input audio matches a closed-open-closed mouth sequence in our dubbed result. We present two methods, one semi-automatic appearance-based and one fully automatic audio-based,
Information Maximizing Visual Question Generation
Example questions generated for a set of images and answer categories. Incorrect questions are shown in grey and occur when no relevant question can be generated for a given image and answer category. Authors Ranjay Krishna, Michael Bernstein, Li Fei-Fei Abstract Though image-to-sequence generation models have become overwhelmingly popular in human-computer communications, they suffer from
A Glimpse Far into the Future: Understanding Long-term Crowd Worker Accuracy
A selection of individual workers’ accuracy over time during the question answering task. Each worker remains relatively constant throughout his or her entire lifetime. Authors Kenji Hata, Ranjay Krishna, Li Fei-Fei, Michael Bernstein Abstract Microtask crowdsourcing is increasingly critical to the creation of extremely large datasets. As a result, crowd workers spend weeks or months
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
An overview of the data needed to move from perceptual awareness to cognitive understanding of images. We present a dataset of images densely annotated with numerous region descriptions, objects, attributes, and relationships. Region descriptions (e.g. “girl feeding large elephant” and “a man taking a picture behind girl”) are shown (top). The objects (e.g. elephant), attributes
Visual Relationship Detection with Language Priors
Even though all the images contain the same objects (a person and a bicycle), it is the relationship between the objects that determines the holistic interpretation of the image. Authors Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei Abstract Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. “man
Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval
Actual results using a popular image search engine (top row) and ideal results (bottom row) for the query “a boy wearing a t-shirt with a plane on it”. Authors Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei and Christopher D. Manning Abstract Semantically complex queries which include attributes of objects and relations between objects still
Embracing Error to Enable Rapid Crowdsourcing
(a) Images are shown to workers at 100ms per image. Workers react whenever they see a dog. (b) The true labels are the ground truth dog images. (c) The workers’ keypresses are slow and occur several images after the dog images have already passed. We record these keypresses as the observed labels. (d) Our technique
Image Retrieval using Scene Graphs
Image search using a complex query like “man holding fish and wearing hat on white boat” returns unsatisfactory results in (a). Ideal results (b) include correct objects (“man”, “boat”), attributes (“boat is white”) and relationships (“man on boat”). Authors Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Ayman Shamma, Michael Bernstein, Li Fei-Fei Abstract
Editorial Algorithms: Using Social Media to Discover and Report Local News
Screenshot of CityBeat interface showing the Detected Events List, Event Window, and the Statistics Sidebar. Authors Schwartz, R., Naaman, M., Teodoro, R. Abstract The role of algorithms in the detection, curation and broadcast of news is becoming increasingly prevalent. To better understand this role we developed CityBeat, a system that implements what we call “editorial algorithms”