Human Evaluation of Video Generation Model Configurations
Challenge: Automated video-generation metrics (FVD, FID, CLIP score) correlate poorly with human judgments of visual quality, temporal coherence, and prompt fidelity, leaving the client's 51 model configurations without a reliable ranking.
Solution: A two-phase human evaluation: a pilot study to validate metrics and annotator agreement, followed by a Swiss-style ELO tournament over roughly 6,000 generated videos.
Result: Per-modality ELO rankings with confidence intervals replaced subjective model selection with a governed, evidence-based process.
Automated quality metrics for video generation (FVD, FID, CLIP score) correlate poorly with human perception of visual quality, temporal coherence, and prompt fidelity — the qualities that actually determine which model configuration a customer should use. The client had 51 model configurations across three input modalities and no reliable way to rank them. Internal evaluation was ad hoc: engineers compared cherry-picked samples and reached contradictory conclusions depending on which prompts were tested. They needed a statistically defensible ranking system that could identify the best configuration per modality with quantified confidence, not subjective impressions.
Phase 1 was a pilot study designed to validate the evaluation methodology before committing to scale. We tested 6 candidate evaluation metrics — overall quality, temporal consistency, prompt adherence, visual fidelity, motion naturalness, and aesthetic appeal — with 5 trained annotators. Inter-annotator agreement was measured using Krippendorff's alpha. Two metrics (overall quality and prompt adherence) achieved alpha > 0.70 and were retained for the main evaluation. The remaining four were dropped due to insufficient reliability, preventing the main study from producing noisy rankings on poorly calibrated dimensions.
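As a minimal sketch of the agreement check, the snippet below uses the open-source krippendorff package with the same layout the pilot implies (one row per annotator, one column per rated item) and the 0.70 retention threshold. The ratings shown are toy values for two of the six candidate metrics, not the pilot data.

```python
import numpy as np
import krippendorff

# Pilot ratings: one row per annotator, one column per rated video.
# Values are illustrative; np.nan would mark items an annotator skipped.
ratings_by_metric = {
    "overall_quality": [[4, 3, 5, 2, 4], [4, 3, 4, 2, 4], [5, 3, 5, 2, 3],
                        [4, 4, 5, 2, 4], [4, 3, 5, 1, 4]],
    "motion_naturalness": [[3, 2, 4, 5, 1], [1, 4, 2, 3, 5], [5, 1, 3, 2, 4],
                           [2, 5, 1, 4, 3], [4, 3, 5, 1, 2]],
}

RETENTION_THRESHOLD = 0.70  # metrics below this were dropped after the pilot

for metric, table in ratings_by_metric.items():
    alpha = krippendorff.alpha(
        reliability_data=np.array(table, dtype=float),
        level_of_measurement="ordinal",  # ratings sit on an ordered scale
    )
    decision = "retain" if alpha >= RETENTION_THRESHOLD else "drop"
    print(f"{metric}: alpha={alpha:.2f} -> {decision}")
```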
Phase 2 scaled to a full ELO-based evaluation using Swiss-style tournament pairing. For each of 120 prompts, all relevant model configurations generated outputs, producing approximately 6,000 videos. Annotators viewed side-by-side pairs and selected the better output on the retained metrics. Swiss-style pairing meant that after each round, similarly ranked models were paired against each other — maximizing the information gained per comparison and converging on stable rankings faster than random pairing. A fixed baseline model was included in every round to anchor the ELO scale and prevent rating drift across the six tournament rounds.
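The sketch below condenses the tournament mechanics under a standard logistic ELO update. The K-factor, starting rating, and names such as swiss_pairs and BASELINE are illustrative assumptions rather than the production pipeline; one simple way to anchor the scale, shown here, is to hold the baseline's rating fixed.

```python
from typing import Dict, List, Tuple

K = 32                    # illustrative K-factor
START = 1000.0            # illustrative starting rating
BASELINE = "baseline_v1"  # fixed anchor model, included in every round

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the logistic ELO model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: Dict[str, float], winner: str, loser: str) -> None:
    """Apply one pairwise human judgment in place.
    The baseline's rating is held fixed to anchor the scale
    (an illustrative choice, not necessarily the production rule)."""
    e_w = expected(ratings[winner], ratings[loser])
    if winner != BASELINE:
        ratings[winner] += K * (1.0 - e_w)
    if loser != BASELINE:
        ratings[loser] -= K * (1.0 - e_w)

def swiss_pairs(ratings: Dict[str, float]) -> List[Tuple[str, str]]:
    """Swiss-style round: sort by current rating and pair adjacent models,
    so similarly ranked configurations meet and each comparison is informative."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]

# Example round; the winners here stand in for annotator votes.
ratings = {m: START for m in ("cfg_01", "cfg_02", "cfg_03", BASELINE)}
for a, b in swiss_pairs(ratings):
    update(ratings, winner=a, loser=b)
```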
Separate ELO rankings were maintained per input modality (text-to-video, image-to-video, video-to-video) to ensure the results were actionable at the workflow level. Final rankings included 95% confidence intervals computed via bootstrap resampling of the pairwise comparison data.
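The bootstrap can be sketched as follows, assuming the comparison log is a list of (winner, loser) pairs for one modality; the constants match the tournament sketch above and the resample count is illustrative.

```python
import random
from typing import Dict, List, Tuple

K, START = 32, 1000.0  # same illustrative constants as the tournament sketch

def replay(pairs: List[Tuple[str, str]], models: List[str]) -> Dict[str, float]:
    """Replay a sequence of (winner, loser) judgments into final ELO ratings."""
    r = {m: START for m in models}
    for w, l in pairs:
        e_w = 1.0 / (1.0 + 10 ** ((r[l] - r[w]) / 400.0))
        r[w] += K * (1.0 - e_w)
        r[l] -= K * (1.0 - e_w)
    return r

def bootstrap_ci(pairs: List[Tuple[str, str]], models: List[str],
                 n_resamples: int = 2000,
                 level: float = 0.95) -> Dict[str, Tuple[float, float]]:
    """Percentile bootstrap: resample the pairwise comparisons with replacement,
    recompute ratings, and read off the interval for each model."""
    samples = {m: [] for m in models}
    for _ in range(n_resamples):
        ratings = replay(random.choices(pairs, k=len(pairs)), models)
        for m in models:
            samples[m].append(ratings[m])
    lo_q, hi_q = (1 - level) / 2, 1 - (1 - level) / 2
    ci = {}
    for m, vals in samples.items():
        vals.sort()
        ci[m] = (vals[int(lo_q * len(vals))], vals[int(hi_q * len(vals)) - 1])
    return ci
```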
The ELO rankings replaced subjective model selection with a governed, evidence-based process. The client deployed the top-ranked configuration for each modality to the corresponding customer workflow, reducing the risk of shipping a suboptimal model and enabling per-workflow optimization that aggregate benchmarks could not support. The evaluation framework was retained as a reusable asset: each new model checkpoint is now evaluated against the existing ELO ladder with a fraction of the comparisons needed for a from-scratch evaluation.
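One way such an incremental evaluation could look is sketched below: incumbent ratings stay fixed and only the newcomer's rating moves, so far fewer comparisons are needed than a full tournament. The pairing rule and constants are illustrative assumptions, not the client's exact procedure.

```python
from typing import Dict, List, Tuple

K, START = 32, 1000.0  # illustrative constants, matching the earlier sketches

def evaluate_checkpoint(new_model: str,
                        ladder: Dict[str, float],
                        judgments: List[Tuple[str, str]]) -> float:
    """Slot a new checkpoint into an existing ELO ladder.
    Each judgment is a (winner, loser) pair involving the new model;
    incumbent ratings in `ladder` are treated as fixed anchors."""
    rating = START
    for winner, loser in judgments:
        opponent = loser if winner == new_model else winner
        expected_new = 1.0 / (1.0 + 10 ** ((ladder[opponent] - rating) / 400.0))
        score = 1.0 if winner == new_model else 0.0
        rating += K * (score - expected_new)
    return rating
```

In the same Swiss spirit as the main tournament, comparisons for a new checkpoint would typically be scheduled against the incumbents closest to its current provisional rating.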
Representative record from the annotation pipeline.
A stylish retro-futuristic dancer performs at the center of an opulent grand ballroom, captured by a smooth, low-angle orbiting camera. Her teal silk kimono billows outward, catching molten gold light from a colossal chandelier above.
{ "project_title": "Aesthetic Quality Assessment V2V POSE", "classification_id": "02562c2f-aaef-47f2-a535-2d3bafbe1999", "created_at": "September 4, 2025, 8:51 PM", "status": "completed", "winner": "13", "video_1_key": "13", "video_2_key": "34" }
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.