Embodied-R1-3B-v1

Name: Embodied-R1-3B-v1
Creator: IffYuan
License: other
Keywords: rgb, language, manipulation, visual_target_grounding, referring_region_grounding, open_form_grounding, lab

A 3B-parameter vision-language model for general robotic manipulation. It introduces a Pointing mechanism and uses Reinforced Fine-tuning to bridge perception and action, with strong zero-shot generalization on embodied tasks.

Downloads: 224

Technical Profile

Modalities: rgb, language
Environment: lab
Task Types: manipulation, visual_target_grounding, referring_region_grounding, open_form_grounding
License: other

Part of the Embodied-R1-3B-v1 family

Access

View on HuggingFace

Related Datasets

ABC-130k

The largest open-source robot teleoperation dataset containing bimanual manipulation trajectories collected on two-arm YAM stations with 130,822 episodes across 3,555 hours of data.

rgb, proprioception

663K downloads, Jul 2026, apache-2.0

Hy-Embodied-0.5-VLA-Data

A large-scale bimanual manipulation dataset with 2,163 hours of high-fidelity demonstrations collected via a custom fingertip UMI device with optical motion capture, spanning 70+ manipulation tasks for training Vision-Language-Action foundation models.

rgb, proprioception, language

235K downloads, Jul 2026, cc-by-4.0

T-Rex Dataset

A large-scale, tactile-reactive bimanual manipulation dataset collected via teleoperation on a Dexmate Vega-1 robot with two Sharpa Wave dexterous hands, featuring 5,464 episodes with tactile, RGB, and proprioceptive observations.

rgb, tactile, force_torque, proprioception

170K downloads, Jun 2026, MIT

OmniAction

A large-scale multimodal dataset for proactive robot manipulation comprising 141,162 episodes with cross-modal contextual instructions derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands.

rgb, audio, language

95K downloads, Mar 2026, cc-by-nc-4.0

OmniAction

A large-scale multimodal dataset for proactive robot manipulation with 141,162 episodes covering contextual instruction following through spoken dialogue, environmental sounds, and visual cues. The dataset includes 5,096 distinct speaker timbres, 2,482 non-verbal sound events, and 640 environmental backgrounds across six categories of contextual instructions.

rgb, audio, language

85K downloads, Apr 2026, cc-by-nc-4.0

OmniAction

A large-scale multimodal dataset for proactive robot manipulation with 141,162 episodes covering contextual instruction following through spoken dialogue, environmental sounds, and visual cues.

rgb, audio, language

85K downloads, Apr 2026, cc-by-nc-4.0