VLA-Arena · 2025 · apache-2.0
VLA-Arena Dataset (L1 - Large Variant)
An open-source benchmark for systematic evaluation of Vision-Language-Action (VLA) models featuring 55 tasks at difficulty level 1 with 2,750 human demonstrations across safety, distractor, extrapolation, and long-horizon domains.
Downloads: 126 · Episodes: 2,750 · Likes: 1
Why This Matters for Physical AI
VLA-Arena provides a systematic benchmark for evaluating vision-language-action models across hierarchical difficulty levels and multiple evaluation dimensions (safety, generalization, long-horizon reasoning), enabling rigorous assessment of embodied AI agents for real-world deployment.
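Benchmarks structured along evaluation dimensions like these are typically scored as per-domain success rates rather than a single pooled number. A minimal sketch of that aggregation, using the four domain names from this card (the episode results themselves are hypothetical placeholders):

```python
# Aggregate per-domain success rates for a benchmark run.
# Domain names come from the dataset card; the episode outcomes
# below are hypothetical placeholders, not real results.
from collections import defaultdict

def success_rates(results):
    """results: list of (domain, success_bool), one per evaluated episode."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for domain, success in results:
        totals[domain] += 1
        wins[domain] += int(success)
    return {d: wins[d] / totals[d] for d in totals}

results = [
    ("safety", True), ("safety", False),
    ("distractor", True), ("distractor", True),
    ("extrapolation", False), ("long_horizon", True),
]
rates = success_rates(results)
```

Reporting per-domain rates keeps a model's safety and long-horizon performance visible instead of letting easy domains dominate an overall average.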
Technical Profile
- Modalities: rgb, proprioception, language
- Action Space: end_effector_delta
- Environment: simulation
- Task Types: manipulation, grasping, pick_and_place
- Episodes: 2,750
- Data Format: HDF5
- Annotation Types: language_instructions, action_labels
- License: apache-2.0
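An end_effector_delta action space means each action is a relative offset applied to the robot's current end-effector pose rather than an absolute target. A minimal sketch of that integration step, assuming a 7-D action layout of (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper) — the exact layout stored in the HDF5 files is not specified on this card:

```python
# Apply one end_effector_delta action to a pose.
# Pose: (x, y, z, roll, pitch, yaw); the action adds positional and
# rotational deltas and carries a gripper command in its last element.
# The 7-D layout here is an assumption, not documented on the card.

def apply_delta(pose, action):
    """Return the new pose and gripper command after one delta action."""
    deltas, gripper = action[:6], action[6]
    new_pose = tuple(p + d for p, d in zip(pose, deltas))
    return new_pose, gripper

pose = (0.4, 0.0, 0.2, 0.0, 0.0, 0.0)
action = (0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0)  # small move + close gripper
new_pose, grip = apply_delta(pose, action)
```

Delta actions are common in imitation-learning datasets because they are invariant to the absolute workspace position, which helps policies transfer across starting configurations.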