GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Tomas Soucek, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

Research output: Chapter in Book/Report/Conference proceeding > Conference Contribution (Conference Proceeding)

Abstract

We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.
Original language: English
Title of host publication: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication status: Accepted/In press - 17 Jun 2024
Event: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): CVPR - Seattle, United States
Duration: 17 Jun 2024 to 21 Jun 2024
https://cvpr.thecvf.com

Publication series

Name: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Country/Territory: United States
City: Seattle
Period: 17/06/24 to 21/06/24
