Human Evaluation of Video Generation Model Configurations
Challenge: Automated video-generation metrics (FVD, FID, CLIP score) correlate poorly with human judgments of visual quality, temporal coherence, and prompt fidelity, leaving the client's 51 model configurations without a reliable ranking.
Solution: A two-phase human evaluation: a pilot study to validate metrics and annotator agreement, followed by a Swiss-style ELO tournament over roughly 6,000 generated videos.
Result: Per-modality ELO rankings with confidence intervals replaced subjective model selection with a governed, evidence-based process.
Automated quality metrics for video generation (FVD, FID, CLIP score) correlate poorly with human perception of visual quality, temporal coherence, and prompt fidelity — the qualities that actually determine which model configuration a customer should use. The client had 51 model configurations across three input modalities and no reliable way to rank them. Internal evaluation was ad hoc: engineers compared cherry-picked samples and reached contradictory conclusions depending on which prompts were tested. They needed a statistically defensible ranking system that could identify the best configuration per modality with quantified confidence, not subjective impressions.
Phase 1 was a pilot study designed to validate the evaluation methodology before committing to scale. We tested 6 candidate evaluation metrics — overall quality, temporal consistency, prompt adherence, visual fidelity, motion naturalness, and aesthetic appeal — with 5 trained annotators. Inter-annotator agreement was measured using Krippendorff's alpha. Two metrics (overall quality and prompt adherence) achieved alpha > 0.70 and were retained for the main evaluation. The remaining four were dropped due to insufficient reliability, preventing the main study from producing noisy rankings on poorly calibrated dimensions.
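As a minimal sketch of the agreement check, the snippet below uses the open-source krippendorff package with the same layout the pilot implies (one row per annotator, one column per rated item) and the 0.70 retention threshold. The ratings shown are toy values for two of the six candidate metrics, not the pilot data.

```python
import numpy as np
import krippendorff

# Pilot ratings: one row per annotator, one column per rated video.
# Values are illustrative; np.nan would mark items an annotator skipped.
ratings_by_metric = {
    "overall_quality": [[4, 3, 5, 2, 4], [4, 3, 4, 2, 4], [5, 3, 5, 2, 3],
                        [4, 4, 5, 2, 4], [4, 3, 5, 1, 4]],
    "motion_naturalness": [[3, 2, 4, 5, 1], [1, 4, 2, 3, 5], [5, 1, 3, 2, 4],
                           [2, 5, 1, 4, 3], [4, 3, 5, 1, 2]],
}

RETENTION_THRESHOLD = 0.70  # metrics below this were dropped after the pilot

for metric, table in ratings_by_metric.items():
    alpha = krippendorff.alpha(
        reliability_data=np.array(table, dtype=float),
        level_of_measurement="ordinal",  # ratings sit on an ordered scale
    )
    decision = "retain" if alpha >= RETENTION_THRESHOLD else "drop"
    print(f"{metric}: alpha={alpha:.2f} -> {decision}")
```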
Phase 2 scaled to a full ELO-based evaluation using Swiss-style tournament pairing. For each of 120 prompts, all relevant model configurations generated outputs, producing approximately 6,000 videos. Annotators viewed side-by-side pairs and selected the better output on the retained metrics. Swiss-style pairing meant that after each round, similarly ranked models were paired against each other — maximizing the information gained per comparison and converging on stable rankings faster than random pairing. A fixed baseline model was included in every round to anchor the ELO scale and prevent rating drift across the six tournament rounds.
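The sketch below condenses the tournament mechanics under a standard logistic ELO update. The K-factor, starting rating, and names such as swiss_pairs and BASELINE are illustrative assumptions rather than the production pipeline; one simple way to anchor the scale, shown here, is to hold the baseline's rating fixed.

```python
from typing import Dict, List, Tuple

K = 32                    # illustrative K-factor
START = 1000.0            # illustrative starting rating
BASELINE = "baseline_v1"  # fixed anchor model, included in every round

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the logistic ELO model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: Dict[str, float], winner: str, loser: str) -> None:
    """Apply one pairwise human judgment in place.
    The baseline's rating is held fixed to anchor the scale
    (an illustrative choice, not necessarily the production rule)."""
    e_w = expected(ratings[winner], ratings[loser])
    if winner != BASELINE:
        ratings[winner] += K * (1.0 - e_w)
    if loser != BASELINE:
        ratings[loser] -= K * (1.0 - e_w)

def swiss_pairs(ratings: Dict[str, float]) -> List[Tuple[str, str]]:
    """Swiss-style round: sort by current rating and pair adjacent models,
    so similarly ranked configurations meet and each comparison is informative."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]

# Example round; the winners here stand in for annotator votes.
ratings = {m: START for m in ("cfg_01", "cfg_02", "cfg_03", BASELINE)}
for a, b in swiss_pairs(ratings):
    update(ratings, winner=a, loser=b)
```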
Separate ELO rankings were maintained per input modality (text-to-video, image-to-video, video-to-video) to ensure the results were actionable at the workflow level. Final rankings included 95% confidence intervals computed via bootstrap resampling of the pairwise comparison data.
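The bootstrap can be sketched as follows, assuming the comparison log is a list of (winner, loser) pairs for one modality; the constants match the tournament sketch above and the resample count is illustrative.

```python
import random
from typing import Dict, List, Tuple

K, START = 32, 1000.0  # same illustrative constants as the tournament sketch

def replay(pairs: List[Tuple[str, str]], models: List[str]) -> Dict[str, float]:
    """Replay a sequence of (winner, loser) judgments into final ELO ratings."""
    r = {m: START for m in models}
    for w, l in pairs:
        e_w = 1.0 / (1.0 + 10 ** ((r[l] - r[w]) / 400.0))
        r[w] += K * (1.0 - e_w)
        r[l] -= K * (1.0 - e_w)
    return r

def bootstrap_ci(pairs: List[Tuple[str, str]], models: List[str],
                 n_resamples: int = 2000,
                 level: float = 0.95) -> Dict[str, Tuple[float, float]]:
    """Percentile bootstrap: resample the pairwise comparisons with replacement,
    recompute ratings, and read off the interval for each model."""
    samples = {m: [] for m in models}
    for _ in range(n_resamples):
        ratings = replay(random.choices(pairs, k=len(pairs)), models)
        for m in models:
            samples[m].append(ratings[m])
    lo_q, hi_q = (1 - level) / 2, 1 - (1 - level) / 2
    ci = {}
    for m, vals in samples.items():
        vals.sort()
        ci[m] = (vals[int(lo_q * len(vals))], vals[int(hi_q * len(vals)) - 1])
    return ci
```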
The ELO rankings replaced subjective model selection with a governed, evidence-based process. The client deployed the top-ranked configuration for each modality to the corresponding customer workflow, reducing the risk of shipping a suboptimal model and enabling per-workflow optimization that aggregate benchmarks could not support. The evaluation framework was retained as a reusable asset: each new model checkpoint is now evaluated against the existing ELO ladder with a fraction of the comparisons needed for a from-scratch evaluation.
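One way such an incremental evaluation could look is sketched below: incumbent ratings stay fixed and only the newcomer's rating moves, so far fewer comparisons are needed than a full tournament. The pairing rule and constants are illustrative assumptions, not the client's exact procedure.

```python
from typing import Dict, List, Tuple

K, START = 32, 1000.0  # illustrative constants, matching the earlier sketches

def evaluate_checkpoint(new_model: str,
                        ladder: Dict[str, float],
                        judgments: List[Tuple[str, str]]) -> float:
    """Slot a new checkpoint into an existing ELO ladder.
    Each judgment is a (winner, loser) pair involving the new model;
    incumbent ratings in `ladder` are treated as fixed anchors."""
    rating = START
    for winner, loser in judgments:
        opponent = loser if winner == new_model else winner
        expected_new = 1.0 / (1.0 + 10 ** ((ladder[opponent] - rating) / 400.0))
        score = 1.0 if winner == new_model else 0.0
        rating += K * (score - expected_new)
    return rating
```

In the same Swiss spirit as the main tournament, comparisons for a new checkpoint would typically be scheduled against the incumbents closest to its current provisional rating.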
Representative record from the annotation pipeline.
A stylish retro-futuristic dancer performs at the center of an opulent grand ballroom, captured by a smooth, low-angle orbiting camera. Her teal silk kimono billows outward, catching molten gold light from a colossal chandelier above.
{ "project_title": "Aesthetic Quality Assessment V2V POSE", "classification_id": "02562c2f-aaef-47f2-a535-2d3bafbe1999", "created_at": "September 4, 2025, 8:51 PM", "status": "completed", "winner": "13", "video_1_key": "13", "video_2_key": "34" }
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.