Expert Preference Data for Video Generation Models
Video generation models improve through human preference data — not just more training compute. But existing preference datasets cap at 182,000 annotations with narrow evaluation criteria, while frontier labs need millions of multi-dimensional quality assessments to train reward models that generalize across content types and failure modes. Claru delivers calibrated human preference data at the scale and specificity that RLHF for video generation demands.
Why Do Video Generation Models Need Human Preference Data?
Video generation models need human preference data because automated quality metrics — FVD, FID, CLIP score — correlate poorly with how humans actually perceive video quality. A video can score well on FID while containing physically impossible motion, or match a CLIP embedding while boring the viewer. The VideoReward framework demonstrated this gap explicitly: by collecting 182,000 human preference annotations across 12 video generation models, researchers built a reward model that outperformed automated metrics on predicting human quality judgments. The key finding was that preference data needed to span multiple evaluation dimensions — visual quality, motion quality, temporal consistency, and text-video alignment — because collapsing these into a single score produces reward models that optimize for the wrong features.

The practical implication is that RLHF for video generation is data-hungry in a way that text RLHF is not. Text preference pairs can be evaluated in seconds; video preference pairs require watching two clips, comparing them on multiple axes, and making a judgment that accounts for temporal dynamics that do not exist in static content. VideoReward's Flow-DPO training approach showed that preference-aligned fine-tuning measurably improved generation quality, but the 182,000-annotation dataset was collected across just 12 models — a fraction of the configuration space that production labs explore.
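To make the shape of such data concrete, here is a minimal sketch of a multi-dimensional preference record in Python. The field names and values are illustrative, not VideoReward's actual schema:

```python
from dataclasses import dataclass
from typing import Literal

# The four VideoReward evaluation axes; keeping a per-axis verdict
# avoids collapsing quality into a single misleading scalar.
Dimension = Literal[
    "visual_quality", "motion_quality",
    "temporal_consistency", "text_video_alignment",
]
Verdict = Literal["a_better", "b_better", "tie"]

@dataclass
class VideoPreferencePair:
    """One annotated comparison between two generations for the same prompt."""
    prompt: str
    video_a: str                          # path or URI of the first clip
    video_b: str                          # path or URI of the second clip
    judgments: dict[Dimension, Verdict]   # one verdict per evaluation axis
    annotator_id: str

pair = VideoPreferencePair(
    prompt="a red fox running through deep snow",
    video_a="outputs/model_a/0042.mp4",
    video_b="outputs/model_b/0042.mp4",
    judgments={
        "visual_quality": "a_better",
        "motion_quality": "b_better",     # axes can disagree; that is the point
        "temporal_consistency": "tie",
        "text_video_alignment": "a_better",
    },
    annotator_id="ann_017",
)
```

A record like this preserves the disagreement between axes, which is exactly the signal a single overall score would destroy.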
What Are the Limitations of Existing Video Datasets?
Existing video-text datasets were built for training, not evaluation — and the difference matters for RLHF. VidGen-1M compiled 1 million video-text pairs, but the paper itself noted that captions in existing datasets average fewer than 15 words, insufficient to capture the spatial, temporal, and stylistic detail that preference evaluation requires. The dataset addressed this with longer, more detailed captions, but it was designed for generation training (input data), not preference annotation (output evaluation). GenAI-Bench provides 1,600 compositional text-to-visual prompts with human ratings, but its scale is too small for reward model training and it covers only text-to-image and text-to-video without addressing image-to-video or video-to-video workflows. VidProM offers 6.69 million unique text-to-video prompts scraped from public usage, but prompts are not preferences — knowing what users asked for does not tell you which outputs they preferred. The gap is structural: labs building video generation models need preference annotations on their own model outputs across their specific configuration space, evaluated on dimensions calibrated to their quality priorities. Off-the-shelf datasets cannot provide this because the preference signal must match the model's actual output distribution to be useful for RLHF training.
How Do Existing Video Evaluation Datasets Compare?
Public video evaluation datasets range from small benchmark sets to million-scale collections, but none provide the combination of scale, multi-dimensional preference annotations, and model-specific evaluation that RLHF for video generation requires. The comparison below illustrates why labs building frontier video models need custom preference data rather than off-the-shelf benchmarks.
- GenAI-Bench: 1,600 compositional text-to-visual prompts with human ratings. Useful as a benchmark, but too small for reward model training and limited to text-to-image and text-to-video.
- VidProM: 6.69 million unique text-to-video prompts scraped from public usage. Prompts without preference annotations, so it carries no signal about which outputs users preferred.
- VidGen-1M: 1 million video-text pairs with detailed captions. Built as generation training input, not preference annotation on model outputs.
- Claru Expert Preference Annotation: custom multi-dimensional preference annotations on a lab's own model outputs, delivered at scale and calibrated to the lab's quality priorities.
Human Evaluation of Video Generation Model Configurations
Phase 1 was a pilot study designed to validate the evaluation methodology before committing to scale. We tested 6 candidate evaluation metrics — overall quality, temporal consistency, prompt adherence, visual fidelity, motion naturalness, and aesthetic appeal — with 5 trained annotators. Inter-annotator agreement was measured using Krippendorff's alpha.
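As an illustration of that measurement, inter-annotator agreement on a batch of anchored-scale ratings can be checked with the open-source krippendorff package. This is a sketch with made-up ratings, not the project's actual tooling:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are annotators, columns are rated items; np.nan marks items an
# annotator did not see. Scores are on a 1-5 anchored scale.
ratings = np.array([
    [4,      3, 5, np.nan, 2],
    [4,      3, 4, 1,      2],
    [5,      2, 4, 1,      np.nan],
    [4,      3, 5, 2,      2],
    [np.nan, 3, 4, 1,      3],
])

# "ordinal" fits anchored rating scales, where the gap between 1 and 2
# need not equal the gap between 4 and 5.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

Running a check like this per candidate metric is what lets a pilot retain only the dimensions that annotators can score reliably before committing to scale.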
Video Quality Annotation at Scale for RLHF and Model Selection
We structured the annotation pipeline around four evaluation dimensions, each with calibrated rubrics and anchored scoring scales. Motion quality assessed temporal coherence, physics plausibility, and artifact severity — distinguishing between natural motion, subtle jitter, and catastrophic deformation. Visual fidelity evaluated resolution consistency, lighting accuracy, texture detail, and the absence of generation artifacts (blurring, tiling, color banding).
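A sketch of how anchored scales like these might be encoded so every annotator scores against the same written anchors. The structure is illustrative and the anchor wording paraphrases the descriptions above:

```python
# Illustrative rubric encoding: each dimension gets a written anchor at
# key points on a 1-5 scale so annotators score against descriptions,
# not gut feeling.
RUBRICS: dict[str, dict[int, str]] = {
    "motion_quality": {
        1: "catastrophic deformation or physically impossible motion",
        3: "subtle jitter; motion mostly coherent with minor artifacts",
        5: "natural motion with plausible physics and no visible artifacts",
    },
    "visual_fidelity": {
        1: "severe blurring, tiling, or color banding throughout",
        3: "stable resolution with occasional lighting or texture errors",
        5: "consistent resolution, accurate lighting, detailed textures",
    },
}

def nearest_anchor(dimension: str, score: int) -> str:
    """Return the written anchor at or below the given score."""
    anchors = RUBRICS[dimension]
    return anchors[max(s for s in anchors if s <= score)]

print(nearest_anchor("motion_quality", 4))  # falls back to the 3-anchor
```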
High-Confidence Video Content Classification at Scale
We identified the quality problem within the first 2,000 annotations by monitoring inter-annotator agreement in real time. The root cause was clear: the original annotation guidelines defined "organic" using abstract criteria that annotators interpreted differently depending on their background and the specific content of each clip. The framework was redesigned mid-project in under 24 hours.
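A minimal sketch of the kind of real-time monitor that surfaces this sort of drift, using majority-vote agreement as a cheap online stand-in for a full chance-corrected statistic like Krippendorff's alpha. The window size and threshold are illustrative:

```python
from collections import deque

class AgreementMonitor:
    """Rolling check of raw annotator agreement on multiply-labeled items."""

    def __init__(self, window: int = 500, threshold: float = 0.75):
        self.rates = deque(maxlen=window)
        self.threshold = threshold

    def record(self, labels: list[str]) -> bool:
        """Record all labels given to one item; True means agreement
        over the full window has dropped below threshold."""
        majority = max(labels.count(label) for label in set(labels))
        self.rates.append(majority / len(labels))
        full = len(self.rates) == self.rates.maxlen
        return full and sum(self.rates) / len(self.rates) < self.threshold

monitor = AgreementMonitor(window=3, threshold=0.75)
batches = [["organic", "organic", "synthetic"],
           ["organic", "synthetic", "synthetic"],
           ["synthetic", "organic", "organic"]]
for labels in batches:
    if monitor.record(labels):
        print("agreement drift: review the annotation guidelines")
```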
Frequently Asked Questions
How many preference annotations does a video reward model need?

Reward models for video generation typically need six-figure preference annotation volumes to generalize across content types and failure modes. The VideoReward benchmark used 182,000 annotations across 12 models as a research baseline. In production, Claru has delivered 976,000+ multi-dimensional quality assessments for a single frontier lab — enough to train separate reward models per quality axis (motion, fidelity, viewer interest, text alignment) without distribution gaps that cause reward hacking.
Which evaluation dimensions matter for video quality?

Four dimensions capture the primary axes along which video quality varies: motion quality (temporal coherence, physics plausibility, artifact severity), visual fidelity (resolution consistency, lighting accuracy, texture detail), viewer interest (whether the video holds attention), and text-to-video alignment (how faithfully output matches the input prompt). Collapsing these into a single score produces reward models that optimize for the wrong features. Claru's pilot studies validate which dimensions achieve reliable inter-annotator agreement before scaling.
Why can't automated metrics replace human preference data?

Automated metrics like FVD, FID, and CLIP score correlate poorly with human perception of video quality. A video can score well on FID while containing physically impossible motion, or match a CLIP embedding while being visually boring. VideoReward demonstrated that reward models trained on 182,000 human preference annotations outperformed automated metrics on predicting human quality judgments. RLHF requires human signal because the optimization target is human preference, not metric performance.
How does Claru maintain annotation quality at scale?

Claru uses a multi-layer calibration system: annotators must exceed 85% agreement with expert gold-standard annotations in a qualification pipeline before entering production. In production, calibration sets — pre-labeled items with known ground truth — are seeded at a 5% rate, and inter-rater agreement is tracked continuously via Krippendorff's alpha. Annotators falling below threshold on any dimension are flagged for immediate retraining. This system maintained 92%+ calibration agreement across 241,000+ safety annotations and 976,000+ quality assessments.
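As a rough sketch of how gold-item seeding and the agreement gate fit together (the 5% rate and 85% bar come from the answer above; the function names and batch structure are hypothetical):

```python
import random

GOLD_RATE = 0.05          # calibration items seeded at 5%
QUALIFICATION_BAR = 0.85  # required agreement with expert gold labels

def build_batch(tasks: list[dict], gold_pool: list[dict]) -> list[dict]:
    """Interleave pre-labeled calibration items into a production batch."""
    n_gold = max(1, round(len(tasks) * GOLD_RATE))
    batch = tasks + random.sample(gold_pool, n_gold)
    random.shuffle(batch)  # gold items must not be identifiable by position
    return batch

def passes_calibration(responses: list[tuple[str, str]]) -> bool:
    """responses holds (annotator_label, gold_label) pairs; a False
    result flags the annotator for retraining."""
    agreement = sum(a == g for a, g in responses) / len(responses)
    return agreement >= QUALIFICATION_BAR

print(passes_calibration([("pass", "pass")] * 9 + [("fail", "pass")]))  # 0.90 -> True
```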
Does Claru support image-to-video and video-to-video evaluation?

Yes. Claru maintains separate evaluation frameworks for text-to-video, image-to-video, and video-to-video workflows. In one engagement, Claru ranked 51 model configurations across all 3 modalities using 39,000 pairwise evaluations with ELO-based scoring and Swiss-style tournament pairing. Separate per-modality rankings enabled the client to deploy the best configuration for each workflow rather than selecting a single compromise model.
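For readers who want the mechanics, here is a minimal sketch of the Elo update that such pairwise rankings rest on. The K-factor, scale, and configuration names are conventional placeholders, not the engagement's actual parameters:

```python
K = 32          # conventional Elo K-factor
SCALE = 400.0   # conventional logistic rating scale

def expected(r_a: float, r_b: float) -> float:
    """Modeled probability that configuration A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / SCALE))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a: 1.0 if A preferred, 0.0 if B preferred, 0.5 for a tie."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))

# Each human pairwise judgment nudges both ratings; Swiss-style pairing
# then matches configurations with similar ratings so later comparisons
# stay informative.
ratings = {"config_a": 1000.0, "config_b": 1000.0}
ratings["config_a"], ratings["config_b"] = update(
    ratings["config_a"], ratings["config_b"], score_a=1.0)
print(ratings)  # {'config_a': 1016.0, 'config_b': 984.0}
```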
Your next hire isn't a vendor.
It's a data team.
Tell us what you're training. We'll scope the dataset.
References
- [1] Liu et al. “Improving Video Generation with Human Feedback.” arXiv, 2025. Introduced VideoReward, a multi-dimensional video reward model trained on large-scale human preference annotations; Flow-DPO preference-aligned fine-tuning measurably improved generation quality over automated metrics. Link
- [2] Tan et al. “VidGen-1M: A Large-Scale Dataset for Text-to-Video Generation.” arXiv, 2024. Compiled 1 million video-text pairs with detailed captions; found that captions in existing datasets average fewer than 15 words, insufficient for the temporal and compositional understanding required by video generation models. Link
- [3] Liu et al. “Improving Video Generation with Human Feedback (Flow-DPO).” arXiv, 2025. Flow-DPO, introduced in the same paper as VideoReward, adapts Direct Preference Optimization to flow-matching video models and demonstrates superior alignment performance compared to supervised fine-tuning and Flow-RWR. Link
- [4] Lin et al. “GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation.” arXiv, 2024. Provides 1,600 compositional text-to-visual prompts with human ratings, establishing a benchmark for compositional generation evaluation across text-to-image and text-to-video. Link
- [5] Wang et al. “VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models.” arXiv, 2024. Collected 6.69 million unique text-to-video prompts from real user interactions, revealing the distribution gap between research prompts and production usage patterns. Link