Video Quality Annotation at Scale for RLHF and Model Selection
Challenge: RLHF for video generation requires massive volumes of human quality judgments — not just binary good/bad labels, but multi-dimensional assessments that capture the specific axes along which video quality varies.
Solution: We structured the annotation pipeline around four evaluation dimensions, each with calibrated rubrics and anchored scoring scales.
Result: The 976K+ annotations became the primary preference signal for the lab's RLHF training pipeline, directly shaping how the model learned to distinguish high-quality video from low-quality output.
RLHF for video generation requires massive volumes of human quality judgments — not just binary good/bad labels, but multi-dimensional assessments that capture the specific axes along which video quality varies. Motion quality, visual fidelity, viewer engagement, and prompt adherence are distinct failure modes that require distinct training signal. The lab's internal evaluation team could produce hundreds of annotations per week; their training pipeline consumed thousands per day. They needed a partner who could deliver six-figure annotation volumes while maintaining the calibration and consistency that RLHF demands — because noisy preference data doesn't just slow convergence, it teaches the model the wrong preferences.
We structured the annotation pipeline around four evaluation dimensions, each with calibrated rubrics and anchored scoring scales.
Motion quality assessed temporal coherence, physics plausibility, and artifact severity — distinguishing between natural motion, subtle jitter, and catastrophic deformation. Visual fidelity evaluated resolution consistency, lighting accuracy, texture detail, and the absence of generation artifacts (blurring, tiling, color banding). Viewer interest captured whether the video would hold a viewer's attention — a subjective but critical signal for user-facing products. Text-to-video alignment measured how faithfully the generated output matched the input prompt across subject, action, setting, and style.
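The four dimensions and their anchored scales can be sketched as a rubric data structure. This is a hypothetical schema for illustration only — the dimension keys, the 1–5 scale, and the anchor wording are assumptions, not the lab's actual rubric:

```python
# Hypothetical rubric: four dimensions, each on a 1-5 anchored scale.
# Anchor text is illustrative, loosely paraphrasing the prose above.
RUBRIC = {
    "motion_quality": {
        1: "catastrophic deformation",
        3: "subtle jitter or physics violations",
        5: "natural, temporally coherent motion",
    },
    "visual_fidelity": {
        1: "severe blurring, tiling, or color banding",
        3: "minor artifacts, inconsistent lighting",
        5: "consistent resolution, accurate lighting, fine texture",
    },
    "viewer_interest": {
        1: "viewer would abandon immediately",
        3: "watchable but unengaging",
        5: "holds attention through the full clip",
    },
    "text_video_alignment": {
        1: "output ignores the prompt",
        3: "subject matches; action or setting drifts",
        5: "faithful across subject, action, setting, and style",
    },
}

def validate_annotation(scores: dict) -> dict:
    """Check that an annotation covers every dimension with an in-range score."""
    for dim in RUBRIC:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if scores[dim] not in range(1, 6):
            raise ValueError(f"{dim} score out of range: {scores[dim]}")
    return scores
```

Keeping the anchors in data rather than prose makes the rubric easy to version and to surface inside the annotation tool.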
Annotators were calibrated through a qualification pipeline that tested agreement with expert gold-standard annotations across all four dimensions. Only annotators exceeding 85% agreement on calibration sets entered the production pool. Inter-rater agreement was monitored continuously using Krippendorff's alpha, with automatic flagging when agreement dropped below threshold on any dimension. Source material spanned three categories — licensed cinematic footage, street-level captures, and curated video libraries — ensuring the annotations covered the full distribution of content types the model would encounter in production.
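Continuous agreement monitoring of this kind can be sketched with a minimal Krippendorff's alpha for interval-scaled scores. This is an illustrative implementation, not the lab's monitoring code; the 0.667 flag threshold is a common rule of thumb and an assumption here:

```python
from itertools import permutations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.

    units: list of lists; units[u] holds the scores given to unit u
    by whichever annotators rated it (missing ratings simply absent).
    """
    units = [u for u in units if len(u) >= 2]  # only pairable units count
    pairable = [v for u in units for v in u]
    n = len(pairable)
    # Observed disagreement: squared differences within each unit.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pairable values.
    d_e = sum((a - b) ** 2 for a, b in permutations(pairable, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("no variation in the data; alpha is undefined")
    return 1.0 - d_o / d_e

def flag_low_agreement(per_dimension_units, threshold=0.667):
    """Return the dimensions whose alpha has dropped below threshold."""
    return [
        dim for dim, units in per_dimension_units.items()
        if krippendorff_alpha_interval(units) < threshold
    ]
```

Running the check per dimension, as `flag_low_agreement` does, matches the monitoring described above: a single drifting axis is flagged without waiting for overall agreement to degrade.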
Weekly delivery batches were formatted for direct ingestion into the lab's RLHF training pipeline, with per-dimension scores, annotator confidence indicators, and source metadata attached to each assessment.
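A delivery record of this shape might serialize as JSON Lines for ingestion. The field names and values below are hypothetical — the source describes only that per-dimension scores, confidence indicators, and source metadata are attached:

```python
import json

# Hypothetical record layout; all field names and values are illustrative.
record = {
    "video_id": "clip_000123",
    "source": "licensed_cinematic",   # source metadata (one of three categories)
    "scores": {                        # per-dimension scores
        "motion_quality": 4,
        "visual_fidelity": 5,
        "viewer_interest": 3,
        "text_video_alignment": 4,
    },
    "annotator_confidence": 0.9,       # annotator confidence indicator
}

# One JSON object per line; a weekly batch is then just a file of such lines.
line = json.dumps(record, sort_keys=True)
```

A flat, self-describing record like this lets the training pipeline filter by source or confidence without a side channel of metadata.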
The 976K+ annotations became the primary preference signal for the lab's RLHF training pipeline, directly shaping how the model learned to distinguish high-quality video from low-quality output. The four-dimensional scoring enabled the lab to train separate reward models per quality axis — meaning the system could optimize for motion quality without degrading visual fidelity, or improve prompt adherence without sacrificing viewer interest. The calibrated annotator pool and continuous agreement monitoring meant the lab could trust the preference data at scale without manual review of individual annotations.
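One way per-axis reward models can feed a single RLHF objective is a weighted combination, where raising one axis's weight pushes optimization toward that axis while the others still contribute and so are not silently degraded. This is a pure-Python sketch under assumed names; the source does not describe how the lab actually combined its reward models:

```python
def combined_reward(axis_rewards: dict, weights: dict) -> float:
    """Weighted mean of per-axis reward model outputs.

    axis_rewards: scalar output of each axis's reward model for one sample.
    weights: relative emphasis per axis; a larger weight optimizes that
    axis harder without dropping the others from the objective.
    """
    if set(axis_rewards) != set(weights):
        raise ValueError("rewards and weights must cover the same axes")
    total = sum(weights.values())
    return sum(weights[a] * axis_rewards[a] for a in axis_rewards) / total

# Example: emphasize motion quality while keeping the other axes in play.
rewards = {"motion": 0.7, "fidelity": 0.9, "interest": 0.5, "alignment": 0.8}
weights = {"motion": 2.0, "fidelity": 1.0, "interest": 1.0, "alignment": 1.0}
```

Separate reward heads with tunable weights are what make the trade-offs in the paragraph above explicit rather than baked into a single opaque score.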
Representative record from the annotation pipeline:

Motion quality: temporal coherence, physics plausibility, artifact severity
Visual fidelity: resolution consistency, lighting accuracy, texture detail
Viewer interest: would a viewer watch this through? (engagement signal)
Text-to-video alignment: prompt faithfulness across subject, action, setting, style
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.