Evaluation

Human Evaluation of Video Generation Model Configurations

39K pairwise human evaluations completed

Challenge: Automated video-generation metrics (FVD, FID, CLIP score) correlate poorly with human judgments of visual quality, temporal coherence, and prompt fidelity.

Solution: A two-phase study — a pilot validating evaluation metrics via inter-annotator agreement, then a Swiss-style ELO tournament of pairwise human comparisons.

Result: Governed, evidence-based model selection with per-modality ELO rankings and quantified confidence.

// THE CHALLENGE

Automated quality metrics for video generation (FVD, FID, CLIP score) correlate poorly with human perception of visual quality, temporal coherence, and prompt fidelity — the metrics that actually determine which model configuration a customer should use. The client had 51 model configurations across three input modalities and no reliable way to rank them. Internal evaluation was ad hoc: engineers compared cherry-picked samples, producing contradictory conclusions depending on which prompts were tested. They needed a statistically defensible ranking system that could identify the best configuration per modality with quantified confidence, not subjective impressions.

// OUR APPROACH

Phase 1 was a pilot study designed to validate the evaluation methodology before committing to scale. We tested 6 candidate evaluation metrics — overall quality, temporal consistency, prompt adherence, visual fidelity, motion naturalness, and aesthetic appeal — with 5 trained annotators. Inter-annotator agreement was measured using Krippendorff's alpha. Two metrics (overall quality and aesthetic appeal) achieved alpha > 0.70 and were retained for the main evaluation. The remaining four were dropped due to insufficient reliability, preventing the main study from producing noisy rankings on poorly calibrated dimensions.
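The agreement check above can be sketched in a few lines. This is a minimal implementation of Krippendorff's alpha for nominal ratings with no missing data (function name and input shape are illustrative, not from the actual pipeline):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.
    `units`: one list of ratings per evaluated item
    (ratings from different annotators; no missing-data handling)."""
    coincidence = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating cannot be paired
        for a, b in permutations(ratings, 2):
            coincidence[(a, b)] += 1 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (a, _), w in coincidence.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in coincidence.items() if a != b)  # observed disagreement
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)  # expected
    return 1 - d_o / d_e
```

Perfect agreement yields alpha = 1.0; ratings at chance level pull it toward 0, which is why a threshold such as 0.70 is a reasonable retention bar for a metric.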

Phase 2 scaled to a full ELO-based evaluation using Swiss-style tournament pairing. For each of 120 prompts, all relevant model configurations generated outputs, producing approximately 6,000 videos. Annotators viewed side-by-side pairs and selected the better output on the retained metrics. Swiss-style pairing meant that after each round, similarly ranked models were paired against each other — maximizing the information gained per comparison and converging on stable rankings faster than random pairing. A fixed baseline model was included in every round to anchor the ELO scale and prevent rating drift across the six tournament rounds.

Separate ELO rankings were maintained per input modality (text-to-video, image-to-video, video-to-video) to ensure the results were actionable at the workflow level. Final rankings included 95% confidence intervals computed via bootstrap resampling of the pairwise comparison data.
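The bootstrap step can be sketched as resampling the comparison log with replacement and replaying the rating updates, then reading the 2.5th and 97.5th percentiles per model. All constants (K = 32, 1500 start, resample count) are illustrative assumptions:

```python
import random

def bootstrap_elo_ci(comparisons, models, n_boot=1000, seed=0):
    """95% confidence intervals on final ELO ratings, obtained by
    resampling the pairwise log. `comparisons`: (winner, loser) tuples."""
    rng = random.Random(seed)
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        ratings = {m: 1500.0 for m in models}
        resample = rng.choices(comparisons, k=len(comparisons))
        for winner, loser in resample:  # replay ELO on the resampled log
            exp_w = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
            delta = 32 * (1 - exp_w)
            ratings[winner] += delta
            ratings[loser] -= delta
        for m in models:
            samples[m].append(ratings[m])
    ci = {}
    for m, vals in samples.items():
        vals.sort()
        ci[m] = (vals[int(0.025 * n_boot)], vals[int(0.975 * n_boot) - 1])
    return ci
```

Non-overlapping intervals between the top two configurations in a modality are what justify deploying the leader with confidence rather than on a point estimate alone.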

01 Pilot: Test 6 metrics, validate inter-annotator agreement
02 Refine: Focus on Overall Quality + Aesthetic Quality
03 Rank: ELO-based pairwise evaluation, Swiss-style pairing
04 Deliver: Modality-specific model rankings
// RESULTS
39,000 Pairwise human evaluations completed
51 Model configurations compared
3 Input modalities ranked (T2V, I2V, V2V)
>0.70 Krippendorff's alpha on retained metrics
// IMPACT

The ELO rankings replaced subjective model selection with a governed, evidence-based process. The client deployed the top-ranked configuration per modality to their respective customer workflows, reducing the risk of shipping a suboptimal model and enabling per-workflow optimization that aggregate benchmarks could not support. The evaluation framework was retained as a reusable asset — each new model checkpoint is now evaluated against the existing ELO ladder with a fraction of the comparisons needed for a from-scratch evaluation.
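Incremental evaluation of a new checkpoint can be sketched as rating it provisionally against the frozen ladder: anchor ratings stay fixed, so only the newcomer's rating moves, and far fewer comparisons are needed than in a full tournament. Function name, K-factor, and starting rating are illustrative assumptions:

```python
def place_new_model(ladder, results, k=32, start=1500.0):
    """Provisional ELO for a new checkpoint against a frozen ladder.
    `ladder`: {model: fixed_rating}; `results`: [(opponent, won)] in order."""
    r = start
    for opponent, won in results:
        expected = 1 / (1 + 10 ** ((ladder[opponent] - r) / 400))
        r += k * ((1.0 if won else 0.0) - expected)  # only the newcomer updates
    return r
```

Because the ladder is already calibrated, a handful of comparisons against well-spaced anchors localizes the new model on the existing scale.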

// SAMPLE DATA

Representative record from the annotation pipeline.

aesthetic_eval_v2v.json
// INPUT SOURCE (ORIGINAL)

A stylish retro-futuristic dancer performs at the center of an opulent grand ballroom, captured by a smooth, low-angle orbiting camera. Her teal silk kimono billows outward, catching molten gold light from a colossal chandelier above.

Reference Video: pose.mp4
// PAIRWISE COMPARISON
Video A (ID: 13)
Video B (ID: 34)
Criterion: Aesthetically pleasing video
WINNER: Video A
// JSON_RESPONSE
{
  "project_title": "Aesthetic Quality Assessment V2V POSE",
  "classification_id": "02562c2f-aaef-47f2-a535-2d3bafbe1999",
  "created_at": "September 4, 2025, 8:51 PM",
  "status": "completed",
  "winner": "13",
  "video_1_key": "13",
  "video_2_key": "34"
}
Processing Time: 42ms
Status: 200 OK

Ready to build your next dataset?

Tell us about your project and we will scope a plan within 48 hours.