paulpacaud · 2026 · apache-2.0

Guardian FailCoT OOD Benchmarks

Three real-world failure-detection benchmarks (UR5-Fail, RoboFail, RoboVQA) for evaluating vision-language models on cross-environment robotic manipulation failure reasoning and out-of-distribution generalization.

Downloads: 0
Episodes: 650

Why This Matters for Physical AI

These benchmarks provide 650 out-of-distribution real-robot episodes for evaluating failure detection and reasoning in vision-language models across multiple embodiments (UR5, mobile manipulator, humanoid) and environments (lab, home), supporting research on robust cross-environment manipulation.

Technical Profile

Modalities
rgb, language
Robot Embodiments
UR5, mobile_manipulator, humanoid
Environment
lab, home
Task Types
manipulation, failure_detection, visual_question_answering
Episodes
650
Data Format
JSONL
Annotation Types
language_instructions, reward_labels, failure_modes, failure_reasons
License
apache-2.0
Part of the Guardian family
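
Since the data format is listed as JSONL, a minimal Python sketch of reading records and tallying failure modes might look like the following. The field names (`instruction`, `success`, `failure_mode`) and the sample records are hypothetical placeholders chosen for illustration; the actual schema is not documented on this page and should be checked against the released files.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line (JSONL: one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def count_by_field(records: Iterable[dict], field: str) -> Counter:
    """Count records by the value of `field`; records without it count as 'unknown'."""
    return Counter(record.get(field, "unknown") for record in records)


# Hypothetical sample records; the real field names may differ.
sample = [
    '{"instruction": "pick up the cup", "success": false, "failure_mode": "grasp_missed"}',
    "",
    '{"instruction": "open the drawer", "success": true}',
]

counts = count_by_field(iter_jsonl(sample), "failure_mode")
print(counts)
```

With a real file, an open text-mode file handle can be passed in place of `sample`, since iterating over a file yields its lines one at a time.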
