Annotation

Video Quality Annotation at Scale for RLHF and Model Selection

976K+ Human quality assessments delivered

Challenge: RLHF for video generation requires massive volumes of human quality judgments — not just binary good/bad labels, but multi-dimensional assessments that capture the specific axes along which video quality varies.

Solution: We structured the annotation pipeline around four evaluation dimensions, each with calibrated rubrics and anchored scoring scales.

Result: The 976K+ annotations became the primary preference signal for the lab's RLHF training pipeline, directly shaping how the model learned to distinguish high-quality video from low-quality output.

// THE CHALLENGE

RLHF for video generation requires massive volumes of human quality judgments — not just binary good/bad labels, but multi-dimensional assessments that capture the specific axes along which video quality varies. Motion quality, visual fidelity, viewer engagement, and prompt adherence are distinct failure modes that require distinct training signal. The lab's internal evaluation team could produce hundreds of annotations per week; their training pipeline consumed thousands per day. They needed a partner who could deliver six-figure annotation volumes while maintaining the calibration and consistency that RLHF demands — because noisy preference data doesn't just slow convergence, it teaches the model the wrong preferences.

// OUR APPROACH

We structured the annotation pipeline around four evaluation dimensions, each with calibrated rubrics and anchored scoring scales.

Motion quality assessed temporal coherence, physics plausibility, and artifact severity — distinguishing between natural motion, subtle jitter, and catastrophic deformation. Visual fidelity evaluated resolution consistency, lighting accuracy, texture detail, and the absence of generation artifacts (blurring, tiling, color banding). Viewer interest captured whether the video would hold a viewer's attention — a subjective but critical signal for user-facing products. Text-to-video alignment measured how faithfully the generated output matched the input prompt across subject, action, setting, and style.
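
Concretely, each clip's assessment can be pictured as a single record carrying one anchored 1-to-5 score per dimension. The sketch below is illustrative only; field names such as `prompt_alignment` and the scale annotations are assumptions, not the production schema.

```python
# Illustrative per-clip assessment record; field names and comments are
# assumptions, not the schema actually delivered to the lab.
from dataclasses import dataclass

@dataclass
class ClipAssessment:
    clip_id: str
    source_category: str   # "cinematic" | "street_level" | "curated"
    motion_quality: int    # 1-5: temporal coherence, physics plausibility, artifact severity
    visual_fidelity: int   # 1-5: resolution consistency, lighting, texture, generation artifacts
    viewer_interest: int   # 1-5: would a viewer watch this through?
    prompt_alignment: int  # 1-5: faithfulness to subject, action, setting, style
    annotator_id: str
    confidence: str        # e.g. "high" | "medium" | "low"
```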

Annotators were calibrated through a qualification pipeline that tested agreement with expert gold-standard annotations across all four dimensions. Only annotators exceeding 85% agreement on calibration sets entered the production pool. Inter-rater agreement was monitored continuously using Krippendorff's alpha, with automatic flagging when agreement dropped below threshold on any dimension. Source material spanned three categories — licensed cinematic footage, street-level captures, and curated video libraries — ensuring the annotations covered the full distribution of content types the model would encounter in production.
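
A minimal sketch of the two quality gates described above, assuming the open-source krippendorff Python package; the agreement floor and function names are illustrative placeholders, not the production implementation.

```python
# Illustrative quality gates: gold-standard calibration and ongoing agreement
# monitoring. Assumes the open-source `krippendorff` package; the alpha floor
# and helper names are placeholders, not the production values.
import numpy as np
import krippendorff

CALIBRATION_THRESHOLD = 0.85   # 85%+ agreement with expert gold labels
ALPHA_FLOOR = 0.67             # assumed flagging threshold for a dimension

def passes_calibration(candidate_scores, gold_scores, tolerance=0):
    """Fraction of calibration clips where the candidate matches the gold score
    (within an optional tolerance) must meet or exceed the threshold."""
    candidate = np.asarray(candidate_scores)
    gold = np.asarray(gold_scores)
    agreement = np.mean(np.abs(candidate - gold) <= tolerance)
    return agreement >= CALIBRATION_THRESHOLD

def flag_low_agreement(scores_by_annotator):
    """scores_by_annotator: 2-D array (annotators x clips), np.nan where unscored.
    Returns True if ordinal Krippendorff's alpha falls below the floor."""
    alpha = krippendorff.alpha(reliability_data=scores_by_annotator,
                               level_of_measurement="ordinal")
    return alpha < ALPHA_FLOOR
```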

Weekly delivery batches were formatted for direct ingestion into the lab's RLHF training pipeline, with per-dimension scores, annotator confidence indicators, and source metadata attached to each assessment.
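
The delivery format can be sketched as one JSON object per assessment; the JSON-lines layout and key names below are assumptions based on the fields described above, not the exact batch specification.

```python
# Illustrative weekly-batch serialization: one JSON object per assessment with
# per-dimension scores, annotator confidence, and source metadata.
import json

def write_batch(assessments, path):
    """assessments: iterable of dicts shaped like the ClipAssessment sketch above."""
    with open(path, "w", encoding="utf-8") as f:
        for a in assessments:
            record = {
                "clip_id": a["clip_id"],
                "source": a["source_category"],
                "scores": {
                    "motion_quality": a["motion_quality"],
                    "visual_fidelity": a["visual_fidelity"],
                    "viewer_interest": a["viewer_interest"],
                    "prompt_alignment": a["prompt_alignment"],
                },
                "confidence": a["confidence"],
            }
            f.write(json.dumps(record) + "\n")
```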

01
Calibrate: Qualify annotators against gold-standard benchmarks (85%+ agreement)
02
Assess: Score each clip on the four dimensions (motion, fidelity, interest, alignment)
03
Monitor: Continuous inter-rater agreement tracking via Krippendorff's alpha
04
Deliver: Weekly RLHF-ready batches with per-dimension scores and metadata
// RESULTS
976K+ Human quality assessments delivered
4 Evaluation dimensions per annotation
85%+ Annotator calibration threshold (gold-standard agreement)
3 Source categories (cinematic, street-level, curated)
// IMPACT

The 976K+ annotations became the primary preference signal for the lab's RLHF training pipeline, directly shaping how the model learned to distinguish high-quality video from low-quality output. The four-dimensional scoring enabled the lab to train separate reward models per quality axis — meaning the system could optimize for motion quality without degrading visual fidelity, or improve prompt adherence without sacrificing viewer interest. The calibrated annotator pool and continuous agreement monitoring meant the lab could trust the preference data at scale without manual review of individual annotations.
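
One way to picture how four-dimensional scores become per-axis preference signal: two generations of the same prompt yield a separate (preferred, rejected) pair on each dimension, so a reward model for motion quality can be trained without touching the fidelity signal. The pairing rule below is a hypothetical illustration, not the lab's training code.

```python
# Hypothetical illustration: turn two scored generations of the same prompt
# into per-dimension preference pairs for axis-specific reward models.
DIMENSIONS = ("motion_quality", "visual_fidelity", "viewer_interest", "prompt_alignment")

def preference_pairs(clip_a, clip_b):
    """clip_a / clip_b: assessment dicts with a per-dimension `scores` mapping.
    Yields (dimension, preferred_clip_id, rejected_clip_id); ties yield nothing."""
    for dim in DIMENSIONS:
        a, b = clip_a["scores"][dim], clip_b["scores"][dim]
        if a == b:
            continue
        winner, loser = (clip_a, clip_b) if a > b else (clip_b, clip_a)
        yield dim, winner["clip_id"], loser["clip_id"]
```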

// SAMPLE DATA

Representative record from the annotation pipeline.

quality_assessment_batch.json
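
The record is sketched below in Python rather than reproduced verbatim; key names are assumptions, while the values mirror the sample annotation shown further down this page (clip vqa-847291, per-dimension scores 4/3/5/4, α = 0.78).

```python
# Sketch of the representative record; key names are assumptions, values are
# taken from the sample annotation fields shown on this page.
import json

sample_record = {
    "clip_id": "vqa-847291",
    "source": "cinematic_licensed",
    "duration_s": 4.2,
    "scores": {
        "motion_quality": 4,
        "visual_fidelity": 3,
        "viewer_interest": 5,
        "prompt_alignment": 4,
    },
    "inter_rater_alpha": 0.78,
    "confidence_tier": "high",
}

print(json.dumps(sample_record, indent=2))
```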
// ANNOTATION VOLUME
976,355 human quality assessments
3 source categories
Weekly RLHF-ready batches
// EVALUATION DIMENSIONS (4)
🎬 Motion Quality

Temporal coherence, physics plausibility, artifact severity

4/5
👁️ Visual Fidelity

Resolution consistency, lighting accuracy, texture detail

3/5
Viewer Interest

Would a viewer watch this through? Engagement signal

5/5
🎯 Text-to-Video Alignment

Prompt faithfulness across subject, action, setting, style

4/5
// SAMPLE ANNOTATION
Clip ID: vqa-847291
Source: Cinematic (licensed)
Duration: 4.2s
Annotator Pool: Calibrated (85%+)
Agreement: α = 0.78
Confidence Tier: High
// SOURCE CATEGORIES
Source: Licensed Cinematic
Source: Street-Level Capture
Source: Curated Libraries
Output: RLHF-Ready

Ready to build your next dataset?

Tell us about your project and we will scope a plan within 48 hours.