Video Quality Annotation at Scale for RLHF and Model Selection
Challenge: RLHF for video generation requires massive volumes of human quality judgments — not just binary good/bad labels, but multi-dimensional assessments that capture the specific axes along which video quality varies.
Solution: We structured the annotation pipeline around four evaluation dimensions, each with calibrated rubrics and anchored scoring scales.
Result: The 976K+ annotations became the primary preference signal for the lab's RLHF training pipeline, directly shaping how the model learned to distinguish high-quality video from low-quality output.
RLHF for video generation requires massive volumes of human quality judgments — not just binary good/bad labels, but multi-dimensional assessments that capture the specific axes along which video quality varies. Motion quality, visual fidelity, viewer engagement, and prompt adherence are distinct failure modes that require distinct training signal. The lab's internal evaluation team could produce hundreds of annotations per week; their training pipeline consumed thousands per day. They needed a partner who could deliver six-figure annotation volumes while maintaining the calibration and consistency that RLHF demands — because noisy preference data doesn't just slow convergence, it teaches the model the wrong preferences.
We structured the annotation pipeline around four evaluation dimensions, each with calibrated rubrics and anchored scoring scales.
Motion quality assessed temporal coherence, physics plausibility, and artifact severity — distinguishing between natural motion, subtle jitter, and catastrophic deformation. Visual fidelity evaluated resolution consistency, lighting accuracy, texture detail, and the absence of generation artifacts (blurring, tiling, color banding). Viewer interest captured whether the video would hold a viewer's attention — a subjective but critical signal for user-facing products. Text-to-video alignment measured how faithfully the generated output matched the input prompt across subject, action, setting, and style.
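The four dimensions and their anchored scales can be sketched as a rubric data structure. This is a hypothetical schema for illustration only — the dimension keys, the 1–5 scale, and the anchor wording are assumptions, not the lab's actual rubric:

```python
# Hypothetical rubric: four dimensions, each on a 1-5 anchored scale.
# Anchor text is illustrative, loosely paraphrasing the prose above.
RUBRIC = {
    "motion_quality": {
        1: "catastrophic deformation",
        3: "subtle jitter or physics violations",
        5: "natural, temporally coherent motion",
    },
    "visual_fidelity": {
        1: "severe blurring, tiling, or color banding",
        3: "minor artifacts, inconsistent lighting",
        5: "consistent resolution, accurate lighting, fine texture",
    },
    "viewer_interest": {
        1: "viewer would abandon immediately",
        3: "watchable but unengaging",
        5: "holds attention through the full clip",
    },
    "text_video_alignment": {
        1: "output ignores the prompt",
        3: "subject matches; action or setting drifts",
        5: "faithful across subject, action, setting, and style",
    },
}

def validate_annotation(scores: dict) -> dict:
    """Check that an annotation covers every dimension with an in-range score."""
    for dim in RUBRIC:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if scores[dim] not in range(1, 6):
            raise ValueError(f"{dim} score out of range: {scores[dim]}")
    return scores
```

Keeping the anchors in data rather than prose makes the rubric easy to version and to surface inside the annotation tool.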
Annotators were calibrated through a qualification pipeline that tested agreement with expert gold-standard annotations across all four dimensions. Only annotators exceeding 85% agreement on calibration sets entered the production pool. Inter-rater agreement was monitored continuously using Krippendorff's alpha, with automatic flagging when agreement dropped below threshold on any dimension. Source material spanned three categories — licensed cinematic footage, street-level captures, and curated video libraries — ensuring the annotations covered the full distribution of content types the model would encounter in production.
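Continuous agreement monitoring of this kind can be sketched with a minimal Krippendorff's alpha for interval-scaled scores. This is an illustrative implementation, not the lab's monitoring code; the 0.667 flag threshold is a common rule of thumb and an assumption here:

```python
from itertools import permutations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.

    units: list of lists; units[u] holds the scores given to unit u
    by whichever annotators rated it (missing ratings simply absent).
    """
    units = [u for u in units if len(u) >= 2]  # only pairable units count
    pairable = [v for u in units for v in u]
    n = len(pairable)
    # Observed disagreement: squared differences within each unit.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pairable values.
    d_e = sum((a - b) ** 2 for a, b in permutations(pairable, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("no variation in the data; alpha is undefined")
    return 1.0 - d_o / d_e

def flag_low_agreement(per_dimension_units, threshold=0.667):
    """Return the dimensions whose alpha has dropped below threshold."""
    return [
        dim for dim, units in per_dimension_units.items()
        if krippendorff_alpha_interval(units) < threshold
    ]
```

Running the check per dimension, as `flag_low_agreement` does, matches the monitoring described above: a single drifting axis is flagged without waiting for overall agreement to degrade.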
Weekly delivery batches were formatted for direct ingestion into the lab's RLHF training pipeline, with per-dimension scores, annotator confidence indicators, and source metadata attached to each assessment.
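A delivery record of this shape might serialize as JSON Lines for ingestion. The field names and values below are hypothetical — the source describes only that per-dimension scores, confidence indicators, and source metadata are attached:

```python
import json

# Hypothetical record layout; all field names and values are illustrative.
record = {
    "video_id": "clip_000123",
    "source": "licensed_cinematic",   # source metadata (one of three categories)
    "scores": {                        # per-dimension scores
        "motion_quality": 4,
        "visual_fidelity": 5,
        "viewer_interest": 3,
        "text_video_alignment": 4,
    },
    "annotator_confidence": 0.9,       # annotator confidence indicator
}

# One JSON object per line; a weekly batch is then just a file of such lines.
line = json.dumps(record, sort_keys=True)
```

A flat, self-describing record like this lets the training pipeline filter by source or confidence without a side channel of metadata.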
The 976K+ annotations became the primary preference signal for the lab's RLHF training pipeline, directly shaping how the model learned to distinguish high-quality video from low-quality output. The four-dimensional scoring enabled the lab to train separate reward models per quality axis — meaning the system could optimize for motion quality without degrading visual fidelity, or improve prompt adherence without sacrificing viewer interest. The calibrated annotator pool and continuous agreement monitoring meant the lab could trust the preference data at scale without manual review of individual annotations.
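One way per-axis reward models can feed a single RLHF objective is a weighted combination, where raising one axis's weight pushes optimization toward that axis while the others still contribute and so are not silently degraded. This is a pure-Python sketch under assumed names; the source does not describe how the lab actually combined its reward models:

```python
def combined_reward(axis_rewards: dict, weights: dict) -> float:
    """Weighted mean of per-axis reward model outputs.

    axis_rewards: scalar output of each axis's reward model for one sample.
    weights: relative emphasis per axis; a larger weight optimizes that
    axis harder without dropping the others from the objective.
    """
    if set(axis_rewards) != set(weights):
        raise ValueError("rewards and weights must cover the same axes")
    total = sum(weights.values())
    return sum(weights[a] * axis_rewards[a] for a in axis_rewards) / total

# Example: emphasize motion quality while keeping the other axes in play.
rewards = {"motion": 0.7, "fidelity": 0.9, "interest": 0.5, "alignment": 0.8}
weights = {"motion": 2.0, "fidelity": 1.0, "interest": 1.0, "alignment": 1.0}
```

Separate reward heads with tunable weights are what make the trade-offs in the paragraph above explicit rather than baked into a single opaque score.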
Representative record from the annotation pipeline:

Motion quality: temporal coherence, physics plausibility, artifact severity
Visual fidelity: resolution consistency, lighting accuracy, texture detail
Viewer interest: would a viewer watch this through? (engagement signal)
Text-to-video alignment: prompt faithfulness across subject, action, setting, style
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.