How to Evaluate Robot Policy Performance

A practitioner's guide to evaluating robot manipulation policies — defining task-specific success criteria, designing statistically rigorous evaluation protocols, running reproducible real-world trials, analyzing failure modes, and connecting evaluation results to actionable data collection improvements.

Difficulty: intermediate
Time: 1-2 weeks

Prerequisites

  • Trained robot policy ready for evaluation
  • Robot platform matching training environment
  • Evaluation task objects and fixtures
  • Video recording setup for failure analysis
  • Statistical analysis tools (Python scipy.stats)
Step 1: Define Task-Specific Success Criteria

For each task you want to evaluate, write a precise, unambiguous success criterion that a non-expert evaluator can apply consistently. Avoid vague criteria like 'the robot successfully picks up the object' — does it count if the robot grasps the object but drops it during transport? Does it count if the gripper contacts the object but fails to lift it?

Write binary criteria using measurable conditions: 'Success = the object is lifted above the table surface by at least 5cm AND transported to within 3cm of the target position AND released without bouncing off the target area.' For insertion tasks: 'Success = the peg is inserted into the hole with the top surface flush with the fixture (within 2mm) AND no contact force exceeded 30N during insertion.' For multi-step tasks, define success for each step independently and for the full sequence.
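
A measurable criterion can be expressed directly as a machine-checkable predicate. The sketch below encodes the pick-and-place example from this step (at least 5cm lift, within 3cm of target, no bounce); the function and argument names are illustrative, not from any particular framework.

```python
# Sketch: the binary success criterion from the text as a checkable predicate.
# Thresholds come from the example criterion: >= 5 cm lift AND <= 3 cm
# placement error AND released without bouncing.

def pick_place_success(lift_height_m: float,
                       placement_error_m: float,
                       bounced: bool) -> bool:
    """Binary success per the pre-defined criterion."""
    return (lift_height_m >= 0.05
            and placement_error_m <= 0.03
            and not bounced)

print(pick_place_success(0.12, 0.01, False))  # clean trial
print(pick_place_success(0.12, 0.05, False))  # placed too far from target
```

Writing the criterion as code forces every threshold to be explicit, which is exactly the property a non-expert evaluator needs.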

Create an evaluation score sheet that the evaluator fills in for each trial: trial number, initial condition (object position, orientation), binary success/failure, completion time, failure mode (if failed), and any anomalies. Pre-print these score sheets so evaluators are not making ad-hoc decisions about what to record.

Calibrate evaluators: have two evaluators independently score 20 trials from video recordings. Compute inter-evaluator agreement (Cohen's kappa). If kappa < 0.90, the success criteria are ambiguous and need clarification. This calibration step prevents the common problem of different evaluators applying different standards, which introduces uncontrolled variance into the results.
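
Cohen's kappa is simple enough to compute without dependencies. A minimal sketch, with made-up scores for the 20 calibration trials (sklearn's `cohen_kappa_score` gives the same value if you prefer a library):

```python
# Minimal Cohen's kappa for two evaluators' binary trial scores (pure Python).
# Scores below are illustrative calibration data, not real results.

def cohens_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    # chance agreement from each evaluator's marginal label frequencies
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# 20 calibration trials scored independently from video (1 = success)
eval_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
eval_b = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1]
kappa = cohens_kappa(eval_a, eval_b)
print(f"kappa = {kappa:.3f}")  # one disagreement in 20 already lands below 0.90
```

Note that even 19/20 raw agreement can produce kappa below the 0.90 threshold, because kappa discounts agreement expected by chance.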

Tools: evaluation score sheet template, video recording for calibration

Tip: Include a 'timeout' criterion — if the policy has not completed the task within a time limit (typically 2-3x the median demonstration duration), the trial is a failure regardless of the final state

Step 2: Design the Evaluation Protocol with Controlled Variation

The evaluation protocol specifies exactly which conditions to test and how many trials per condition. Design the protocol to answer: (1) what is the overall success rate? (2) how does performance vary across conditions? (3) where are the failure modes?

Identify variation axes that matter for deployment: object position (sample 5-10 positions spanning the workspace), object orientation (3-5 orientations including edge cases), object identity (if the policy should generalize across objects, test 5+ objects per category), lighting (daylight, artificial light, dim), background clutter (clean table, moderate clutter, heavy clutter), and distractor objects (none, related objects, random objects).

Design a factorial or Latin-square experiment that covers these axes efficiently. Full factorial with 5 positions x 3 orientations x 5 objects x 3 lighting = 225 conditions is impractical at 50 trials each. Instead, use a fractional factorial design: fix most axes at their nominal values and vary one or two axes at a time. This requires fewer total trials while still isolating the effect of each axis.

Minimum protocol: 50 trials at nominal conditions (the exact setup used during data collection) to establish the baseline success rate, plus 20 trials at each of 5-10 perturbation conditions (shifted positions, changed lighting, novel objects) to identify robustness boundaries. Total: 150-250 trials, achievable in 1-2 days.

Randomize trial order to prevent systematic effects: do not run all 50 nominal trials first, then all perturbation trials. Instead, interleave conditions randomly. Use a pre-generated randomized schedule that the evaluator follows.
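
The one-axis-at-a-time design and the randomized schedule can be generated together. A sketch under assumed axis names and values (replace with your own conditions); the fixed seed makes the schedule reproducible so the evaluator can follow a printed copy:

```python
# Sketch: build a fractional (one-axis-at-a-time) condition set around the
# nominal setup, then shuffle with a fixed seed into a pre-generated schedule.
# Axis names and values are illustrative placeholders.
import random

nominal = {"position": "P1", "lighting": "daylight", "object": "mug_A"}
perturbations = {
    "position": ["P2", "P3", "P4", "P5"],
    "lighting": ["artificial", "dim"],
    "object": ["mug_B", "novel_cup"],
}

# 50 nominal trials, plus 20 trials per single-axis perturbation
trials = [dict(nominal, trial_type="nominal") for _ in range(50)]
for axis, values in perturbations.items():
    for value in values:
        cond = dict(nominal, trial_type=f"perturb_{axis}")
        cond[axis] = value
        trials.extend(dict(cond) for _ in range(20))

rng = random.Random(42)   # fixed seed -> the same schedule every time
rng.shuffle(trials)
for i, t in enumerate(trials):
    t["trial_id"] = i + 1

print(len(trials))  # 50 nominal + 8 perturbation conditions x 20 = 210 trials
```

The total (210 here) lands inside the 150-250 trial budget from the minimum protocol above.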

Tools: experimental design tools (DoE), randomized trial schedule generator, condition tracking spreadsheet

Tip: Include 10 'sanity check' trials at the exact conditions used for the best demonstrations in the training set — if the policy fails these, there is likely a deployment issue (wrong camera calibration, different robot controller settings) rather than a generalization problem

Step 3: Execute Evaluation Trials with Rigorous Logging

Run the evaluation trials following the pre-designed protocol exactly. Do not deviate from the protocol based on mid-evaluation observations (e.g., 'it keeps failing at position 3, let me skip that'). Every deviation introduces bias.

For each trial, log: trial ID, condition code (which position/orientation/lighting), the initial scene image (photograph before the trial starts), binary success/failure per the pre-defined criteria, completion time (from policy start to task completion or timeout), the failure mode if the trial failed (coded from a predefined failure taxonomy: grasp_miss, grasp_slip, collision, timeout, wrong_target, other), and video recordings from all camera angles.

The failure mode taxonomy should cover: approach failures (robot does not move toward the target), grasp failures (misses the object, contacts but slips, crushes the object), transport failures (drops during transport, collides with obstacles), placement failures (places in wrong location, wrong orientation), and system failures (joint limit, communication error, perception error). Every failed trial must be assigned exactly one primary failure mode.
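
One way to keep the taxonomy closed (so evaluators cannot invent ad-hoc labels) is to define it as an enum and log trials as CSV rows. A sketch with illustrative field names, using an in-memory buffer in place of an on-disk log file:

```python
# Sketch: structured trial logging against a closed failure-mode taxonomy.
# Field names are illustrative; io.StringIO stands in for a real log file.
import csv
import io
from enum import Enum

class FailureMode(Enum):
    NONE = "none"                  # successful trial
    APPROACH_MISS = "approach_miss"
    GRASP_MISS = "grasp_miss"
    GRASP_SLIP = "grasp_slip"
    COLLISION = "collision"
    PLACEMENT_ERROR = "placement_error"
    TIMEOUT = "timeout"
    SYSTEM_ERROR = "system_error"  # joint limit, comms, perception error

FIELDS = ["trial_id", "condition", "success", "time_s", "failure_mode", "notes"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({"trial_id": 1, "condition": "nominal", "success": 1,
                 "time_s": 14.2, "failure_mode": FailureMode.NONE.value,
                 "notes": ""})
writer.writerow({"trial_id": 2, "condition": "pos_shift", "success": 0,
                 "time_s": 30.0, "failure_mode": FailureMode.GRASP_SLIP.value,
                 "notes": "slipped during lift"})

rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(rows[1]["failure_mode"])
```

Because every failed trial gets exactly one primary mode from the enum, the later breakdown analysis requires no label cleanup.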

Use a dedicated evaluation operator who is not the person who trained the policy — this prevents unconscious bias in scene setup (placing objects in positions the policy handles well), trial timing (giving the policy more time when it is struggling), or success determination (being lenient on borderline cases).

Record everything on video. Video recordings are essential for post-hoc failure analysis and for resolving disputes about success criteria. Use at least two camera angles: the overhead view for spatial accuracy and the wrist camera view for grasp quality.

Tools: video recording system, trial logging application, failure mode taxonomy, independent evaluation operator

Tip: Run a 'dress rehearsal' of 5 trials before the official evaluation to check that the logging system works, the video recording captures the full workspace, and the evaluator understands the success criteria — discovering logging failures mid-evaluation wastes hours of trial time

Step 4: Analyze Results with Statistical Rigor

Compute success rates and confidence intervals for each condition. For binary success/failure, use the Wilson score interval (not the normal approximation, which is inaccurate for small samples and extreme probabilities). For 50 trials with 40 successes: success rate = 80%, Wilson 95% CI = [67%, 89%]. Report this as '80% (95% CI: 67-89%)' not just '80%.'
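
The Wilson interval is a short closed-form formula; the sketch below reproduces the worked example (40 successes in 50 trials) in pure Python. Library equivalents exist (e.g. statsmodels' `proportion_confint` with `method="wilson"`) if you prefer not to hand-roll it.

```python
# Wilson score interval for a binomial success rate (pure Python).
import math

def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple:
    """95% Wilson score interval by default (z for alpha = 0.05)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

lo, hi = wilson_ci(40, 50)  # the worked example: 80% success over 50 trials
print(f"80% (95% CI: {lo:.0%}-{hi:.0%})")
```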

Compare success rates across conditions using Fisher's exact test (for small samples) or the chi-squared test (for large samples). If the baseline success rate is 85% and the 'shifted position' condition achieves 60%, is this difference statistically significant? With 50 trials each, Fisher's exact test gives p = 0.007, which is significant at alpha = 0.05. If p > 0.05, you cannot conclude the conditions differ — you may need more trials.
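
Assuming scipy is available, the comparison above is one call to `scipy.stats.fisher_exact`. The counts approximate the worked example (42/50 at baseline vs. 30/50 at the shifted position):

```python
# Comparing success rates across two conditions with Fisher's exact test.
# Requires scipy; counts approximate the worked example in the text.
from scipy.stats import fisher_exact

table = [[42, 8],    # baseline: successes, failures
         [30, 20]]   # shifted position: successes, failures
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"p = {p_value:.4f}")  # below 0.05: the conditions genuinely differ
```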

Generate a failure mode breakdown: for each condition, report the percentage of failures attributable to each failure mode. This directly informs data collection: if 60% of failures are grasp_slip, you need more demonstrations of diverse grasping strategies. If 40% of failures are approach_miss, the policy's visual perception is not generalizing to the test conditions and you need more visual diversity in the training data.
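
A per-condition breakdown is a straightforward tally over the trial log. A sketch with illustrative failure records (the `(condition, failure_mode)` pairs would come from your logged CSV):

```python
# Sketch: share of failures attributable to each mode, per condition.
# The failure records below are illustrative, not real evaluation data.
from collections import Counter

failed_trials = [  # (condition, primary failure mode), failed trials only
    ("pos_shift", "grasp_slip"), ("pos_shift", "grasp_slip"),
    ("pos_shift", "grasp_miss"), ("pos_shift", "grasp_slip"),
    ("dim_light", "approach_miss"), ("dim_light", "grasp_miss"),
]

by_condition = {}
for cond, mode in failed_trials:
    by_condition.setdefault(cond, Counter())[mode] += 1

for cond, counts in by_condition.items():
    total = sum(counts.values())
    for mode, k in counts.most_common():
        print(f"{cond}: {mode} {k / total:.0%}")
```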

Create visualizations: (1) a heatmap of success rates across conditions, (2) a bar chart of failure mode frequencies, (3) a scatter plot of completion time vs. trial number (to check for temporal effects like robot drift or policy degradation), and (4) representative video clips for each failure mode (5-second clips showing the failure from the best camera angle). These visualizations are the primary output of the evaluation — they communicate results to the team and guide the next data collection cycle.

Tools: Python scipy.stats (Fisher's exact, Wilson CI), Matplotlib or Seaborn (visualization), video clip extraction (ffmpeg), LaTeX or Google Slides (evaluation report)

Tip: Always plot success rate over trial number (a running average) to check for temporal trends — if success rate declines over the evaluation session, the robot may be experiencing calibration drift, thermal effects, or mechanical wear that invalidates later trials

Step 5: Connect Failure Analysis to Data Collection Recommendations

The most valuable output of policy evaluation is a prioritized list of data collection actions that would most improve performance. Map each failure mode to a specific data gap and a concrete collection recommendation.

Grasp_miss failures (the robot reaches for the object but misses): the policy's visual perception does not accurately localize the object in the test conditions. Collect more demonstrations with the object at the positions where misses occur, with the lighting conditions present during evaluation, and with the background clutter levels encountered. If misses are systematic (always 2cm to the left), check camera calibration.

Grasp_slip failures (the robot contacts the object but it slips from the gripper): the policy does not adapt grasp force to the object's friction. Collect demonstrations with a wider variety of gripper approach angles and explicitly include demonstrations where the operator adjusts grip position after initial contact. If using force feedback, include F/T data so the policy can learn force-aware grasping.

Collision failures: the policy does not avoid obstacles in the workspace. Collect demonstrations with the clutter configurations that caused collisions, emphasizing obstacle avoidance motions (reaching around objects, pulling back and re-approaching). If collisions occur in specific workspace regions, collect demonstrations specifically in those regions.

Timeout failures: the policy is slow or hesitant. Collect demonstrations that are faster and more confident. If the policy exhibits repeated back-and-forth motions (oscillation), the training data may contain too much variance in approach direction — collect demonstrations with consistent approach strategies.

Prioritize by impact: if grasp_miss accounts for 50% of failures and collision accounts for 10%, focus the next collection cycle on the grasp_miss root cause. Estimate the expected improvement with explicit arithmetic: at a 75% success rate, grasp_miss contributes 12.5 of the 25 failure percentage points, so 'collecting 200 additional demonstrations at problematic positions is expected to halve grasp_miss failures, improving overall success rate from 75% to roughly 81%.'
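
The expected-impact estimate is a one-line calculation worth making explicit. A sketch, with the numbers from the example above (75% success, grasp_miss accounting for half of all failures, halved by new data):

```python
# Back-of-the-envelope impact estimate: how much does reducing one failure
# mode raise the overall success rate? Numbers mirror the text's example.
success_rate = 0.75
failure_share = {"grasp_miss": 0.50, "collision": 0.10, "other": 0.40}

def projected_success(success_rate, share_of_failures, reduction):
    """Success rate after `reduction` (0-1) of one mode's failures recover."""
    failure_rate = 1 - success_rate
    recovered = failure_rate * share_of_failures * reduction
    return success_rate + recovered

new_rate = projected_success(success_rate, failure_share["grasp_miss"], 0.5)
print(f"{success_rate:.0%} -> {new_rate:.2%}")  # halving grasp_miss: 81.25%
```

This assumes failure modes are independent and the new data does not affect the other modes; treat the result as a planning estimate, not a guarantee.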

Document the full evaluation-to-collection feedback loop in an evaluation report that includes: current success rates, failure mode breakdown, root cause analysis, collection recommendations, expected impact, and timeline for the next collection cycle.

Tools: failure analysis framework, collection recommendation template, evaluation report template

Tip: Track evaluation results over time across collection cycles — plot success rate vs. total dataset size to measure the marginal value of additional data and identify when you are hitting diminishing returns for a given task

Tools & Technologies

  • Python scipy.stats (statistical tests)
  • ROS2 (robot control)
  • Video recording system (for failure analysis)
  • Spreadsheet or database for trial logging
  • Matplotlib (visualization)
  • MuJoCo or Isaac Gym (simulation evaluation)

References

  1. Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv:2307.15818, 2023.
  2. Mandlekar et al. “What Matters in Learning from Offline Human Demonstrations for Robot Manipulation.” CoRL 2021.
  3. Dasari et al. “RoboNet: Large-Scale Multi-Robot Learning.” CoRL 2019.

How Claru Can Help

Claru provides end-to-end evaluation services: protocol design with controlled variation axes tailored to your deployment environment, independent trial execution by trained evaluators, statistical analysis with confidence intervals and failure mode breakdowns, and actionable data collection recommendations. Our evaluation-to-collection feedback loop has helped teams improve policy success rates by 20-35% per collection cycle.

Why Rigorous Evaluation Separates Research from Deployment

A robot policy that achieves 95% success rate in the lab can drop to 60% in deployment if the evaluation protocol did not account for real-world variation. The gap between reported performance and deployed performance is the most persistent problem in robot learning. Google's RT-2 paper reported 73% success rate across 200+ evaluation tasks, but individual task success rates ranged from 30% to 100% — the aggregate number hides enormous variation. Rigorous evaluation requires task-specific success criteria, controlled variation along every axis that matters in deployment (lighting, object position, object identity, operator), and enough trials per condition to be statistically meaningful.

Evaluation serves two purposes: measuring absolute performance (is this policy good enough to deploy?) and diagnosing data gaps (where does the policy fail, and what additional data would help?). The second purpose is often more valuable than the first. A well-designed evaluation protocol that systematically varies object position, lighting, and distractor objects will reveal that the policy fails on objects placed in the left 20% of the workspace — directly informing the next round of data collection to oversample left-workspace positions.

Evaluation Best Practice Benchmarks

  • 50+: minimum trials per evaluation condition
  • 95% CI: always report confidence intervals, not point estimates
  • 3+ axes: variation axes to test (position, lighting, objects)
  • 10%: typical lab-to-deployment success rate drop

Frequently Asked Questions

How many evaluation trials do I need?

For a binomial success/failure metric, you need at least 50 trials per condition to estimate the success rate within +/-14 percentage points (95% CI) and 200 trials for +/-7 percentage points. If you are comparing two policies, use a two-proportion z-test or Fisher's exact test and target 100+ trials per policy per condition to detect a 15-percentage-point difference with 80% power. Always report confidence intervals, not just point estimates — a reported '90% success rate' from 10 trials has a 95% CI of [55%, 100%], which is not meaningful.

Should I evaluate in simulation or the real world?

Both, but real-world evaluation is the ground truth. Simulation evaluation is useful for rapid iteration (you can run thousands of trials overnight) but systematically overestimates real-world performance due to the sim-to-real gap in contact physics, visual rendering, and sensor noise. Use simulation for development-cycle evaluation (is this policy better than the last one?) and real-world evaluation for deployment decisions (is this policy good enough to ship?). Report sim and real results separately — never average them.

How should I handle tasks with partial success?

Many tasks have partial success: the robot grasped the object but placed it 2cm from the target, or completed the task but took 3x longer than the demonstration. Define a graded success metric: full success (task completed within tolerance), partial success (task completed but outside tolerance), and failure (task not completed). Report all three rates separately. Additionally, report continuous metrics: task completion time, path efficiency (ratio of executed path length to optimal path length), and final position error. Continuous metrics are more sensitive to policy improvements than binary success rates.

Need Help Evaluating Your Robot Policies?

Claru provides evaluation services with standardized protocols, statistically rigorous trial execution, and diagnostic reports connecting failure modes to data collection recommendations.