
VLA-Arena Dataset (L1 - Large Variant)

Name: VLA-Arena Dataset (L1 - Large Variant)
Creator: VLA-Arena
Published: 2025-01-01
License: apache-2.0
Keywords: rgb, proprioception, language, manipulation, grasping, pick_and_place, simulation

An open-source benchmark for systematic evaluation of Vision-Language-Action (VLA) models, featuring 55 tasks at difficulty level 1 with 2,750 human demonstrations across safety, distractor, extrapolation, and long-horizon domains.

Downloads: 77
Episodes: 2,750

Why This Matters for Physical AI

VLA-Arena provides a comprehensive benchmark for evaluating vision-language-action models on safety, generalization, and long-horizon reasoning, all of which are critical for deploying robotic agents in real-world environments.

Technical Profile

Modalities: rgb, proprioception, language
Action Space: end_effector_delta
Environment: simulation
Task Types: manipulation, grasping, pick_and_place
Episodes: 2750
Data Format: RLDS
Annotation Types: language_instructions, action_labels
License: apache-2.0
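
To make the profile above concrete, here is a minimal sketch of how an RLDS-style episode with an end_effector_delta action space might be consumed. RLDS organizes data as episodes containing a sequence of steps; the specific field names (`language_instruction`, `action`, `is_last`) and the position-only three-component action are assumptions for illustration, not the confirmed schema of this dataset. Real delta actions typically also carry rotation and gripper components, so inspect the actual feature spec before relying on this layout.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def integrate_deltas(start: Vec3, deltas: Sequence[Vec3]) -> List[Vec3]:
    """Accumulate per-step position deltas into absolute positions."""
    positions = [start]
    for dx, dy, dz in deltas:
        x, y, z = positions[-1]
        positions.append((x + dx, y + dy, z + dz))
    return positions


# Hypothetical episode in an RLDS-like layout: one instruction, many steps.
# Field names and values are illustrative, not taken from the dataset.
episode = {
    "language_instruction": "pick up the block",
    "steps": [
        {"action": (0.5, 0.0, 0.0), "is_last": False},
        {"action": (0.0, 0.25, 0.0), "is_last": False},
        {"action": (0.0, 0.0, -0.125), "is_last": True},
    ],
}

# Each action is a displacement relative to the current end-effector pose,
# so the absolute trajectory is the running sum of the deltas.
deltas = [step["action"] for step in episode["steps"]]
trajectory = integrate_deltas((0.0, 0.0, 0.0), deltas)
print(trajectory[-1])
```

Because the actions are relative, replaying an episode needs a known starting pose; the origin used here is a placeholder.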

Part of the VLA-Arena family

Access

View on HuggingFace


Related Datasets

OmniAction

A large-scale multimodal dataset for proactive robot manipulation comprising 141,162 episodes with cross-modal contextual instructions derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands.

rgb, audio, language

114K downloads, Mar 2026, cc-by-nc-4.0

Xperience-10M

A large-scale egocentric multimodal dataset of human experience containing 10 million interactions and 10,000 hours of synchronized first-person recordings with six video streams, audio, stereo depth, camera pose, hand mocap, full-body mocap, IMU, and hierarchical language annotations for embodied AI, robotics, and world modeling research.

rgb, audio, depth, proprioception, +3 more

111K downloads, Apr 2026, other

OmniAction

A large-scale multimodal dataset for proactive robot manipulation with 141,162 episodes covering contextual instruction following through spoken dialogue, environmental sounds, and visual cues. The dataset includes 5,096 distinct speaker timbres, 2,482 non-verbal sound events, and 640 environmental backgrounds across six categories of contextual instructions.

rgb, audio, language

101K downloads, Apr 2026, cc-by-nc-4.0

OmniAction

A large-scale multimodal dataset for proactive robot manipulation with 141,162 episodes covering contextual instruction following through spoken dialogue, environmental sounds, and visual cues.

rgb, audio, language

100K downloads, Apr 2026, cc-by-nc-4.0

OmniAction

A large-scale multimodal dataset for proactive robot manipulation comprising 141,162 episodes across 112 skills and 748 objects, enriched with audio, visual, and contextual instruction data for cross-modal intention recognition.

rgb, audio, language

80K downloads, Mar 2026, cc-by-nc-4.0

Open-H-Embodiment

A community-driven, multi-embodiment dataset of paired kinematics and video for training and evaluating AI autonomy models in surgical robotics and ultrasound applications, including tabletop exercises, clinical procedures, and healthcare robotics simulations.

rgb, proprioception

75K downloads, Jun 2026, cc-by-4.0