PEEK VQA

A dataset of 2M image-QA pairs for fine-tuning PEEK VLM, a vision-language model for robotics that predicts trajectories and task-relevant masking points for robot manipulation. Answers are generated from Open X-Embodiment datasets using point-based representations normalized to [0,1]².
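
To make the point-based format concrete, here is a minimal Python sketch of a single QA pair and of mapping its normalized [0,1]² points back to pixel coordinates. The field names (image, question, points) are assumptions for illustration, not the dataset's confirmed schema.

import json

# Hypothetical record; the actual JSON field names may differ.
sample = json.loads("""
{
  "image": "episode_000123/frame_0005.jpg",
  "question": "What path should the gripper follow to reach the mug?",
  "points": [[0.42, 0.61], [0.45, 0.58], [0.51, 0.50]]
}
""")

def denormalize(points, width, height):
    # Points live in [0,1]^2, so scaling by the image size recovers pixels.
    return [(x * width, y * height) for x, y in points]

print(denormalize(sample["points"], width=640, height=480))

Because the points are resolution-independent, the same annotation applies regardless of how the source frame is resized.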

Downloads: 103
Episodes: 2M

Why This Matters for Physical AI

This dataset enables training vision-language models to predict interpretable robot trajectories and task-relevant visual attention masks, advancing zero-shot generalization of manipulation policies across diverse embodiments.

Technical Profile

Modalities: rgb, language
Robot Embodiments: Franka Panda, UR5, Jaco, Hydra, EDAN, Fanuc, Stretch
Action Space: end_effector_delta
Environments: lab, kitchen
Task Types: manipulation, pick_and_place
Episodes: 2M
Data Format: JSON
Annotation Types: language_instructions, action_labels
Part of the Open X-Embodiment family
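
As a loading sketch, the pairs can be streamed with the HuggingFace datasets library; the repo id memmelma/PEEK-VQA and the question/points fields below are assumptions, so check the dataset page for the actual names.

from datasets import load_dataset

# Hypothetical repo id and field names; consult the dataset page for the real schema.
ds = load_dataset("memmelma/PEEK-VQA", split="train", streaming=True)

# With 2M image-QA pairs, streaming avoids downloading everything up front.
for example in ds.take(3):
    print(example["question"], example["points"])

Streaming returns an IterableDataset, so nothing is materialized locally beyond the examples actually read.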

Community Signals

HuggingFace Discussions: 1

Access

Hosted on HuggingFace; see the dataset page for download details.

Need custom rgb data?

Claru builds purpose-built datasets for lab applications with dense human annotations and quality assurance.

Request a Sample Pack

Related Datasets