BeingBeyond · 2025 · MIT

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

The first dexterous Vision-Language-Action model pretrained from large-scale human videos via explicit hand motion modeling. Includes pretrained VLA models (1B, 8B, 14B parameters) and a post-training dataset for robot alignment.

Downloads: 188
Likes: 1

Why This Matters for Physical AI

Being-H0 advances dexterous vision-language-action modeling by pretraining on large-scale human videos with explicit hand motion supervision, enabling robots to understand and replicate complex human hand manipulation.
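Since the release includes pretrained checkpoints at 1B, 8B, and 14B parameters, a natural first step is loading one for inference. The sketch below assumes the checkpoints follow the standard Hugging Face transformers interface; the repo id "BeingBeyond/Being-H0-1B" and the AutoModelForCausalLM class are placeholders for illustration, so consult the model card for the actual identifiers.

```python
# Minimal sketch of loading a Being-H0 checkpoint via transformers.
# The repo id and model class below are assumptions, not confirmed
# by the listing; check the official model card before use.
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "BeingBeyond/Being-H0-1B"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```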

Technical Profile

Modalities: rgb, language
Action Space: motion_tokens
Task Types: manipulation, dexterous_manipulation
Data Format: zarr (a loading sketch follows this list)
Annotation Types: language_instructions, hand_motion
License: MIT
Part of the Being-H0 family
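Because the post-training data ships as zarr, it can be inspected with the standard zarr-python library. This is a minimal sketch only: the store path and the array names ("rgb", "hand_motion", "language_instruction") are assumptions chosen to match the listed modalities and annotation types, so inspect the actual store layout before relying on them.

```python
# Minimal sketch of reading a zarr-formatted episode store.
# Store path and array names are hypothetical; discover the real
# layout by listing the group's keys first.
import zarr

root = zarr.open("being-h0-posttrain.zarr", mode="r")  # hypothetical local path
print(list(root.keys()))  # discover the actual group/array names

# Assumed layout: one array per modality, indexed by frame.
rgb = root["rgb"][:8]                 # e.g. (8, H, W, 3) uint8 video frames
motion = root["hand_motion"][:8]      # e.g. (8, D) hand motion tokens/parameters
instruction = root["language_instruction"][0]
print(rgb.shape, motion.shape, instruction)
```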

Community Signals

Top 50% by downloads
HuggingFace Discussions: 1

Access

Need custom RGB data?

Claru builds purpose-built datasets for any environment or application, with dense human annotations and quality assurance.

Request a Sample Pack

Related Datasets