Abstract

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities.
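Since every question is five-way multiple choice, random chance corresponds to 20% accuracy; the snippet below is a minimal scoring sketch for putting the reported numbers in context. It is not the official evaluation code: the annotation file name and the JSON fields (question_id, answer) are hypothetical placeholders.

import json
import random

# Minimal scoring sketch (not the official HourVideo evaluation code).
# File name and JSON fields are hypothetical placeholders.
RANDOM_CHANCE = 1 / 5  # five-way multiple choice -> 20% expected accuracy

def accuracy(predictions: dict, annotations: list) -> float:
    """Fraction of questions whose predicted option letter matches the answer key."""
    correct = sum(predictions.get(qa["question_id"]) == qa["answer"] for qa in annotations)
    return correct / len(annotations)

if __name__ == "__main__":
    with open("hourvideo_questions.json") as f:  # hypothetical file
        annotations = json.load(f)
    # Random-guessing reference baseline (~20% in expectation).
    random_preds = {qa["question_id"]: random.choice("ABCDE") for qa in annotations}
    print(f"random baseline: {accuracy(random_preds, annotations):.1%} (chance = {RANDOM_CHANCE:.0%})")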

HourVideo at a Glance


Baselines


Proposed Task Suite

This table shows all 4 tasks and 18 sub-tasks proposed in HourVideo, along with our handcrafted question prototypes.
Summarization
Key Events/Objects: Summarize the key interactions of the camera wearer in the [supermarket].
Temporal Sequencing: Describe the sequence of activities performed by the camera wearer to [prepare the dessert].
Compare/Contrast: How did the camera wearer's activities in the [apartment] differ from those in the [restaurant]?
Perception
Information Retrieval
• Factual Recall: What [dairy products] did the camera wearer [pick up] in the [supermarket]?
• Sequence Recall: What did the camera wearer do immediately after [weighing tomatoes] at the [supermarket]?
• Temporal Distance: How long after starting to [eat pizza] did the camera wearer [dispose of the pizza box]?
Tracking: List the unique [individuals] the camera wearer interacted with at the [drugstore].
Visual Reasoning
Spatial
• Relationship: Where was the [microwave] placed in relation to the [stove] in the [kitchen]?
• Proximity: Is the [microwave] closer to the [fridge] compared to the [sink]?
• Layout: Which is the correct [IMAGE] depicting the layout of the camera wearer's [apartment]?
Temporal
• Duration: Which activity did the camera wearer spend more time on: [cooking] or [playing the piano]?
• Frequency: Did the camera wearer use the [circular saw] or [crosscut saw] more frequently to [cut wood]?
• Prerequisites: What preparation steps did the camera wearer take before [baking cookies]?
Predictive: What is the most likely activity the camera wearer will do next after [doing laundry]?
Causal: Why did the camera wearer [leave the garage for the second time]?
Counterfactual: What if the camera wearer used the [oven] to [cook mashed potatoes]?
Navigation
Room-to-Room: How did the camera wearer get from the [building entrance] to the [apartment]?
Object Retrieval: How can the camera wearer retrieve the [TV remote] if they are in the [kitchen]?
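For illustration only, the sketch below shows one way these prototypes could be represented programmatically, with the bracketed slots filled in per video to produce concrete questions. The QuestionPrototype class and its fields are hypothetical and are not part of the HourVideo release.

import re
from dataclasses import dataclass

@dataclass
class QuestionPrototype:
    """Hypothetical container for a question prototype; illustrative only."""
    task: str       # e.g. "Perception"
    sub_task: str   # e.g. "Factual Recall"
    template: str   # prototype text with [bracketed] slots

    def slots(self) -> list:
        """List the bracketed slots that must be filled for a specific video."""
        return re.findall(r"\[([^\]]+)\]", self.template)

    def instantiate(self, fills: dict) -> str:
        """Replace each [slot] with a video-specific phrase."""
        question = self.template
        for slot, value in fills.items():
            question = question.replace(f"[{slot}]", value)
        return question

proto = QuestionPrototype(
    task="Perception",
    sub_task="Factual Recall",
    template="What [dairy products] did the camera wearer [pick up] in the [supermarket]?",
)
print(proto.slots())                                    # ['dairy products', 'pick up', 'supermarket']
print(proto.instantiate({"supermarket": "grocery store"}))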

BibTeX

@inproceedings{chandrasegaran2024hourvideo,
  title     = {HourVideo: 1-Hour Video-Language Understanding},
  author    = {Chandrasegaran, Keshigeyan and Gupta, Agrim and Hadzic, Lea M. and Kota, Taran and He, Jimming and Eyzaguirre, Cristobal and Durante, Zane and Li, Manling and Wu, Jiajun and Li, Fei-Fei},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {37},
  year      = {2024},
}

Acknowledgments

This work was supported in part by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), ONR N00014-23-1-2355, and Microsoft, and in part by API credit grants from Google DeepMind and OpenAI. We thank Vishal Dharmadhikari for assistance with setting up Gemini 1.5 evaluations, and Hashem Elezabi and Canon Grace Pham for help with data curation. We thank Chengshu (Eric) Li and Sanjana Srivastava for discussions on navigation questions, and Michael Poli, Daniel Y. Fu, Jing Yu Koh, Stephen Tian, Tristan Thrush, and Ngoc-Trung Tran for their feedback on the manuscript.