CrossTimeBench

Name: CrossTimeBench
Creator: CrossTimeBench
License: cc-by-4.0
Keywords: rgb, language, video-classification, question-answering, temporal-reasoning, action-anticipation, temporal-localization

A comprehensive benchmark for evaluating Multimodal Large Language Models on cross-temporal video reasoning in complex multi-event scenarios. It consolidates 12 existing VQA datasets, totaling 2,607 videos and 8,496 QA pairs that require integrating information from multiple temporal segments.

Downloads: 77

Technical Profile

Modalities: rgb, language
Task Types: video-classification, question-answering, temporal-reasoning, action-anticipation, temporal-localization
Data Format: json
License: cc-by-4.0

Part of the CrossTimeBench family

Access

View on HuggingFace

Related Datasets

OmniAction

A large-scale multimodal dataset for proactive robot manipulation with 141,162 episodes covering contextual instruction following through spoken dialogue, environmental sounds, and visual cues. The dataset includes 5,096 distinct speaker timbres, 2,482 non-verbal sound events, and 640 environmental backgrounds across six categories of contextual instructions.

rgb, audio, language

104K downloads, Apr 2026, cc-by-nc-4.0

OmniAction

A large-scale multimodal dataset for proactive robot manipulation comprising 141,162 episodes with cross-modal contextual instructions derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands.

rgb, audio, language

98K downloads, Mar 2026, cc-by-nc-4.0

Xperience-10M

A large-scale egocentric multimodal dataset of human experience containing 10 million interactions and 10,000 hours of synchronized first-person recordings with six video streams, audio, stereo depth, camera pose, hand mocap, full-body mocap, IMU, and hierarchical language annotations for embodied AI, robotics, and world modeling research.

rgb, audio, depth, proprioception, +3 more

89K downloads, Apr 2026, other

ABC-130k

The largest open-source robot teleoperation dataset, containing bimanual manipulation trajectories collected on two-arm YAM stations: 130,822 episodes spanning 3,555 hours of data.

rgb, proprioception

87K downloads, Jun 2026, apache-2.0

OmniAction

A large-scale multimodal dataset for proactive robot manipulation with 141,162 episodes covering contextual instruction following through spoken dialogue, environmental sounds, and visual cues.

rgb, audio, language

86K downloads, Apr 2026, cc-by-nc-4.0

Hy-Embodied-0.5-VLA-Data

A large-scale bimanual manipulation dataset with 2,163 hours of high-fidelity demonstrations collected via a custom fingertip UMI device with optical motion capture, spanning 70+ manipulation tasks for training Vision-Language-Action foundation models.

rgb, proprioception, language

76K downloads, Jun 2026, cc-by-4.0