Abstract

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities.
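Since every question is five-way multiple choice, random chance corresponds to 20% accuracy; the snippet below is a minimal scoring sketch for putting the reported numbers in context. It is not the official evaluation code: the annotation file name and the JSON fields (question_id, answer) are hypothetical placeholders.

import json
import random

# Minimal scoring sketch (not the official HourVideo evaluation code).
# File name and JSON fields are hypothetical placeholders.
RANDOM_CHANCE = 1 / 5  # five-way multiple choice -> 20% expected accuracy

def accuracy(predictions: dict, annotations: list) -> float:
    """Fraction of questions whose predicted option letter matches the answer key."""
    correct = sum(predictions.get(qa["question_id"]) == qa["answer"] for qa in annotations)
    return correct / len(annotations)

if __name__ == "__main__":
    with open("hourvideo_questions.json") as f:  # hypothetical file
        annotations = json.load(f)
    # Random-guessing reference baseline (~20% in expectation).
    random_preds = {qa["question_id"]: random.choice("ABCDE") for qa in annotations}
    print(f"random baseline: {accuracy(random_preds, annotations):.1%} (chance = {RANDOM_CHANCE:.0%})")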

HourVideo at a Glance


Baselines


Proposed Task Suite

This table shows all 4 tasks and 18 sub-tasks proposed in HourVideo, along with our handcrafted question prototypes.
Summarization
Key Events/Objects: Summarize the key interactions of the camera wearer in the [supermarket].
Temporal Sequencing: Describe the sequence of activities performed by the camera wearer to [prepare the dessert].
Compare/Contrast: How did the camera wearer's activities in the [apartment] differ from those in the [restaurant]?
Perception
Information Retrieval
• Factual Recall: What [dairy products] did the camera wearer [pick up] in the [supermarket]?
• Sequence Recall: What did the camera wearer do immediately after [weighing tomatoes] at the [supermarket]?
• Temporal Distance: How long after starting to [eat pizza] did the camera wearer [dispose of the pizza box]?
Tracking: List the unique [individuals] the camera wearer interacted with at the [drugstore].
Visual Reasoning
Spatial
• Relationship: Where was the [microwave] placed in relation to the [stove] in the [kitchen]?
• Proximity: Is the [microwave] closer to the [fridge] compared to the [sink]?
• Layout: Which is the correct [IMAGE] depicting the layout of the camera wearer's [apartment]?
Temporal
• Duration: Which activity did the camera wearer spend more time on: [cooking] or [playing the piano]?
• Frequency: Did the camera wearer use the [circular saw] or [crosscut saw] more frequently to [cut wood]?
• Prerequisites: What preparation steps did the camera wearer take before [baking cookies]?
Predictive: What is the most likely activity the camera wearer will do next after [doing laundry]?
Causal: Why did the camera wearer [leave the garage for the second time]?
Counterfactual: What if the camera wearer used the [oven] to [cook mashed potatoes]?
Navigation
Room-to-Room: How did the camera wearer get from the [building entrance] to the [apartment]?
Object Retrieval: How can the camera wearer retrieve the [TV remote] if they are in the [kitchen]?
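For illustration only, the sketch below shows one way these prototypes could be represented programmatically, with the bracketed slots filled in per video to produce concrete questions. The QuestionPrototype class and its fields are hypothetical and are not part of the HourVideo release.

import re
from dataclasses import dataclass

@dataclass
class QuestionPrototype:
    """Hypothetical container for a question prototype; illustrative only."""
    task: str       # e.g. "Perception"
    sub_task: str   # e.g. "Factual Recall"
    template: str   # prototype text with [bracketed] slots

    def slots(self) -> list:
        """List the bracketed slots that must be filled for a specific video."""
        return re.findall(r"\[([^\]]+)\]", self.template)

    def instantiate(self, fills: dict) -> str:
        """Replace each [slot] with a video-specific phrase."""
        question = self.template
        for slot, value in fills.items():
            question = question.replace(f"[{slot}]", value)
        return question

proto = QuestionPrototype(
    task="Perception",
    sub_task="Factual Recall",
    template="What [dairy products] did the camera wearer [pick up] in the [supermarket]?",
)
print(proto.slots())                                    # ['dairy products', 'pick up', 'supermarket']
print(proto.instantiate({"supermarket": "grocery store"}))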

BibTeX

@inproceedings{chandrasegaran2024hourvideo,
  title     = {HourVideo: 1-Hour Video-Language Understanding},
  author    = {Chandrasegaran, Keshigeyan and Gupta, Agrim and Hadzic, Lea M. and Kota, Taran and He, Jimming and Eyzaguirre, Cristobal and Durante, Zane and Li, Manling and Wu, Jiajun and Li, Fei-Fei},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {37},
  year      = {2024},
}

Acknowledgments

This work was supported in part by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), ONR N00014-23-1-2355, and Microsoft, and in part by API credit grants from Google DeepMind and OpenAI. We thank Vishal Dharmadhikari for assistance with setting up Gemini 1.5 evaluations, and Hashem Elezabi and Canon Grace Pham for help with data curation. We thank Chengshu (Eric) Li and Sanjana Srivastava for discussions on navigation questions, and Michael Poli, Daniel Y. Fu, Jing Yu Koh, Stephen Tian, Tristan Thrush, and Ngoc-Trung Tran for their feedback on the manuscript.