paulpacaud · 2026 · apache-2.0
Guardian FailCoT OOD Benchmarks
Three real-world failure-detection benchmarks (UR5-Fail, RoboFail, RoboVQA) for evaluating vision-language models on cross-environment robotic manipulation failure reasoning and out-of-distribution generalization.
Downloads: 0
Episodes: 650
Why This Matters for Physical AI
The dataset provides 650 episodes of out-of-distribution, real-robot data for evaluating how well vision-language models detect and reason about failures across diverse embodiments and environments, supporting research on robust cross-environment manipulation.
Technical Profile
- Modalities: rgb, language
- Robot Embodiments: UR5, mobile_manipulator, humanoid
- Environment: lab, home
- Task Types: manipulation, failure_detection, visual_question_answering
- Episodes: 650
- Data Format: JSONL
- Annotation Types: language_instructions, reward_labels, failure_modes, failure_reasons
- License: apache-2.0
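Since the data format is JSONL and the annotations include failure modes, a first sanity check might be to tally records per failure mode. The sketch below is illustrative only: the field names (`instruction`, `success`, `failure_mode`) and the sample records are assumptions, not the dataset's documented schema, so check the actual keys before reusing it.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def failure_mode_counts(records: Iterable[dict], key: str = "failure_mode") -> Counter:
    """Count records per failure mode; records lacking the key count as 'unlabeled'."""
    return Counter(rec.get(key, "unlabeled") for rec in records)


# Hypothetical sample lines; the real files may use different field names.
sample_lines = [
    '{"instruction": "pick up the cup", "success": false, "failure_mode": "grasp_slip"}',
    '{"instruction": "open the drawer", "success": true}',
    '',
    '{"instruction": "place the block", "success": false, "failure_mode": "grasp_slip"}',
]

counts = failure_mode_counts(iter_records(sample_lines))
print(counts)
```

For a file on disk, the same functions accept an open file handle, for example `failure_mode_counts(iter_records(open(path, encoding="utf-8")))`, where `path` points to one of the benchmark's JSONL files.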