memmelma · 2025
PEEK VQA
A dataset of 2M image-QA pairs for fine-tuning PEEK VLM, a vision-language model for robotics that predicts trajectory paths and task-relevant masking points for robot manipulation. Answers are generated from the Open X-Embodiment datasets as point-based representations normalized to [0,1]² (see the sketch below the stats).
Downloads: 103
Episodes: 2M
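The [0,1]² convention means each trajectory or masking point is stored as a fraction of image width and height, so the same annotation applies at any resolution. Below is a minimal sketch of mapping such normalized points back to pixel coordinates, assuming (x, y) ordering; the helper name and ordering are assumptions, not part of the dataset's tooling.

```python
# Hypothetical sketch: converting PEEK-style normalized points to pixels.
# The [0,1]^2 range comes from the dataset description; the (x, y)
# ordering and this helper are illustrative assumptions.
from typing import List, Tuple

def denormalize_points(
    points: List[Tuple[float, float]],  # points in [0,1]^2, assumed (x, y)
    width: int,
    height: int,
) -> List[Tuple[int, int]]:
    """Scale normalized points to pixel coordinates for a width x height image."""
    return [(round(x * width), round(y * height)) for x, y in points]

# Example: a 3-point trajectory on a 224x224 frame.
trajectory = [(0.10, 0.80), (0.45, 0.50), (0.72, 0.31)]
print(denormalize_points(trajectory, 224, 224))
# [(22, 179), (101, 112), (161, 69)]
```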
Why This Matters for Physical AI
This dataset enables training vision-language models to predict interpretable robot trajectories and task-relevant visual attention masks, advancing zero-shot generalization of manipulation policies across diverse embodiments.
Technical Profile
- Modalities: rgb, language
- Robot Embodiments: Franka Panda, UR5, Jaco, Hydra, EDAN, Fanuc, Stretch
- Action Space: end_effector_delta
- Environment: lab, kitchen
- Task Types: manipulation, pick_and_place
- Episodes: 2M
- Data Format: JSON (example record after this list)
- Annotation Types: language_instructions, action_labels
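Since records are plain JSON carrying a language instruction plus point-based action labels, a single example is easy to sketch. The field names below are illustrative assumptions, not the dataset's actual schema; consult the dataset card for the real field layout.

```python
# Hypothetical sketch of one PEEK VQA JSON record, inferred only from the
# annotation types listed above. All field names are assumptions.
import json

record = {
    "image": "episode_000123/frame_0042.jpg",        # RGB observation
    "language_instruction": "pick up the red mug",   # task instruction
    "answer": {
        "trajectory": [[0.10, 0.80], [0.45, 0.50], [0.72, 0.31]],  # path points in [0,1]^2
        "mask_points": [[0.42, 0.48], [0.50, 0.55]],               # task-relevant regions
    },
    "source": "open_x_embodiment",
}

print(json.dumps(record, indent=2))
```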
Community Signals
HuggingFace Discussions: 1