Expert Preference Data for Video Generation Models

Video generation models improve through human preference data — not just more training compute. But existing preference datasets cap at 182,000 annotations with narrow evaluation criteria, while frontier labs need millions of multi-dimensional quality assessments to train reward models that generalize across content types and failure modes. Claru delivers calibrated human preference data at the scale and specificity that RLHF for video generation demands.

Why Do Video Generation Models Need Human Preference Data?

Video generation models need human preference data because automated quality metrics — FVD, FID, CLIP score — correlate poorly with how humans actually perceive video quality. A video can score well on FID while containing physically impossible motion, or match a CLIP embedding while boring the viewer. The VideoReward framework demonstrated this gap explicitly: by collecting 182,000 human preference annotations across 12 video generation models, researchers built a reward model that outperformed automated metrics on predicting human quality judgments. The key finding was that preference data needed to span multiple evaluation dimensions — visual quality, motion quality, temporal consistency, and text-video alignment — because collapsing these into a single score produces reward models that optimize for the wrong features.

The practical implication is that RLHF for video generation is data-hungry in a way that text RLHF is not. Text preference pairs can be evaluated in seconds; video preference pairs require watching two clips, comparing them on multiple axes, and making a judgment that accounts for temporal dynamics that do not exist in static content. VideoReward's Flow-DPO training approach showed that preference-aligned fine-tuning measurably improved generation quality, but the 182,000-annotation dataset was collected across just 12 models — a fraction of the configuration space that production labs explore.

[1]
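The pairwise training signal described above can be sketched with a Bradley-Terry objective, the standard formulation for learning a reward model from preference pairs. This is a minimal illustrative version, not VideoReward's actual implementation; in practice the two scores would come from a learned reward model for one evaluation dimension, and the pair fields shown are hypothetical.

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen clip beats the rejected one
    under a Bradley-Terry preference model: p(chosen) = sigmoid(s_c - s_r)."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

# One preference pair per evaluation dimension: training a separate reward
# model per axis (motion, fidelity, alignment, interest) avoids collapsing
# distinct quality judgments into a single scalar.
pair = {
    "prompt": "a dog running on a beach",
    "dimension": "motion_quality",
    "chosen_score": 1.8,    # reward model output for the preferred clip
    "rejected_score": 0.4,  # reward model output for the other clip
}

loss = bradley_terry_loss(pair["chosen_score"], pair["rejected_score"])
# A larger score margin yields a smaller loss; a tie yields -log(0.5).
```

Minimizing this loss over many annotated pairs pushes the reward model to rank clips the way annotators did, which is why annotation volume and coverage of the model's output distribution matter.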

What Are the Limitations of Existing Video Datasets?

Existing video-text datasets were built for training, not evaluation — and the difference matters for RLHF. VidGen-1M compiled 1 million video-text pairs, but the paper itself noted that captions in existing datasets average fewer than 15 words, insufficient to capture the spatial, temporal, and stylistic detail that preference evaluation requires. The dataset addressed this with longer, more detailed captions, but it was designed for generation training (input data), not preference annotation (output evaluation). GenAI-Bench provides 1,600 compositional text-to-visual prompts with human ratings, but its scale is too small for reward model training and it covers only text-to-image and text-to-video without addressing image-to-video or video-to-video workflows. VidProM offers 6.69 million unique text-to-video prompts scraped from public usage, but prompts are not preferences — knowing what users asked for does not tell you which outputs they preferred. The gap is structural: labs building video generation models need preference annotations on their own model outputs across their specific configuration space, evaluated on dimensions calibrated to their quality priorities. Off-the-shelf datasets cannot provide this because the preference signal must match the model's actual output distribution to be useful for RLHF training.

[2][1]

How Do Existing Video Evaluation Datasets Compare?

Public video evaluation datasets range from small benchmark sets to million-scale collections, but none provide the combination of scale, multi-dimensional preference annotations, and model-specific evaluation that RLHF for video generation requires. The comparison below illustrates why labs building frontier video models need custom preference data rather than off-the-shelf benchmarks.

GenAI-Bench

Scale: 1,600 prompts
Tasks: Compositional text-to-visual generation evaluation
Environments: Text-to-image, text-to-video
Limitations: Too small for reward model training; no per-dimension preference scoring; covers only T2I and T2V modalities

VidProM

Scale: 6.69M prompts
Tasks: Prompt distribution analysis for video generation
Environments: Text-to-video prompt collection
Limitations: Prompts only, no preference annotations; no quality assessments; reflects user demand, not model quality

VidGen-1M

Scale: 1M video-text pairs
Tasks: Video generation training with detailed captions
Environments: Diverse video categories with temporal descriptions
Limitations: Designed for generation training, not preference evaluation; no pairwise comparisons or quality scores; captions describe content, not quality

Claru Expert Preference Annotation

Scale: 976K+ assessments delivered in a single program
Tasks: Multi-dimensional quality assessment, ELO-based model ranking, RLHF preference pairs, content classification
Environments: T2V, I2V, V2V across cinematic, street-level, and curated video
Limitations: Requires 2-3 week annotator calibration ramp-up per new evaluation framework; cost scales with annotation dimensions

Human Evaluation of Video Generation Model Configurations

39,000 pairwise human evaluations completed
51 model configurations compared
3 input modalities ranked (T2V, I2V, V2V)
>0.70 Krippendorff's alpha on retained metrics

Phase 1 was a pilot study designed to validate the evaluation methodology before committing to scale. We tested 6 candidate evaluation metrics — overall quality, temporal consistency, prompt adherence, visual fidelity, motion naturalness, and aesthetic appeal — with 5 trained annotators. Inter-annotator agreement was measured using Krippendorff's alpha.
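Krippendorff's alpha — the statistic used to decide which pilot metrics to retain — can be computed directly from raw annotations. Below is a minimal sketch for nominal labels; the pilot's anchored rating scales may well be ordinal, which uses a different distance function, so treat this as illustrative rather than the study's actual computation.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings_by_unit):
    """Krippendorff's alpha for nominal data.
    ratings_by_unit: list of lists; each inner list holds the labels all
    annotators assigned to one item (missing ratings simply omitted)."""
    coincidence = Counter()
    n = 0  # total pairable ratings
    for unit in ratings_by_unit:
        m = len(unit)
        if m < 2:
            continue  # a unit with a single rating contributes no pairs
        n += m
        # Each ordered pair within a unit adds 1/(m-1) to the coincidence matrix.
        for a, b in permutations(unit, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)
    label_totals = Counter()
    for (a, _), w in coincidence.items():
        label_totals[a] += w
    d_observed = sum(w for (a, b), w in coincidence.items() if a != b) / n
    d_expected = sum(
        label_totals[a] * label_totals[b]
        for a in label_totals for b in label_totals if a != b
    ) / (n * (n - 1))
    if d_expected == 0:
        return 1.0  # only one label ever used: no disagreement possible
    return 1.0 - d_observed / d_expected

# Perfect agreement on every item yields alpha = 1.0; a >0.70 threshold
# would retain a metric, below it the metric is dropped or redefined.
assert krippendorff_alpha_nominal([["good", "good"], ["bad", "bad"]]) == 1.0
```

Unlike raw percent agreement, alpha corrects for chance agreement and handles missing ratings, which is why it is the standard gate for deciding whether an evaluation dimension is reliable enough to scale.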

Read Full Case Study

Video Quality Annotation at Scale for RLHF and Model Selection

976K+ human quality assessments delivered
4 evaluation dimensions per annotation
85%+ annotator calibration threshold (gold-standard agreement)
3 source categories (cinematic, street-level, curated)

We structured the annotation pipeline around four evaluation dimensions, each with calibrated rubrics and anchored scoring scales. Motion quality assessed temporal coherence, physics plausibility, and artifact severity — distinguishing between natural motion, subtle jitter, and catastrophic deformation. Visual fidelity evaluated resolution consistency, lighting accuracy, texture detail, and the absence of generation artifacts (blurring, tiling, color banding).
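One way to make a multi-dimension rubric concrete is a typed annotation record that rejects scores off the anchored scale. The field names, the 1-5 scale, and the anchor wording below are illustrative assumptions, not Claru's actual schema.

```python
from dataclasses import dataclass

# Example anchor descriptions for a 1-5 anchored scale (hypothetical wording).
ANCHORS = {
    1: "catastrophic failure (e.g. severe deformation)",
    3: "noticeable but tolerable artifacts (e.g. subtle jitter)",
    5: "no visible defects; natural motion and lighting",
}

@dataclass
class QualityAssessment:
    clip_id: str
    annotator_id: str
    motion_quality: int   # temporal coherence, physics plausibility, artifacts
    visual_fidelity: int  # resolution, lighting, texture, generation artifacts
    viewer_interest: int  # whether the clip holds attention
    text_alignment: int   # faithfulness of output to the input prompt

    def __post_init__(self):
        # Reject scores outside the anchored scale at ingestion time,
        # so malformed annotations never reach reward-model training.
        for dim in ("motion_quality", "visual_fidelity",
                    "viewer_interest", "text_alignment"):
            score = getattr(self, dim)
            if not 1 <= score <= 5:
                raise ValueError(f"{dim} must be on the 1-5 anchored scale")
```

Keeping the four dimensions as separate fields, rather than one averaged score, preserves the per-axis signal needed to train separate reward models.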

Read Full Case Study

High-Confidence Video Content Classification at Scale

105,000 video clips classified in 7 days
4 automated confidence tiers delivered
0 downstream rework required
<24h framework redesign turnaround time

We identified the quality problem within the first 2,000 annotations by monitoring inter-annotator agreement in real time. The root cause was clear: the original annotation guidelines defined "organic" using abstract criteria that annotators interpreted differently depending on their background and the specific content of each clip. The framework was redesigned mid-project in under 24 hours.
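Real-time agreement monitoring of the kind described can be sketched as a rolling window over double-annotated clips. The window size, alert threshold, and minimum sample count below are illustrative parameters, not the project's actual configuration.

```python
from collections import deque

class AgreementMonitor:
    """Rolling inter-annotator agreement over double-annotated items.
    Flags the pipeline when agreement drops below a threshold, so a
    guideline problem surfaces within the first few thousand annotations
    rather than after delivery."""

    def __init__(self, window: int = 500, threshold: float = 0.85):
        self.matches = deque(maxlen=window)  # True where two annotators agreed
        self.threshold = threshold

    def record(self, label_a: str, label_b: str) -> None:
        self.matches.append(label_a == label_b)

    @property
    def agreement(self) -> float:
        return sum(self.matches) / len(self.matches) if self.matches else 1.0

    def needs_review(self) -> bool:
        # Only alert once enough overlap items have accumulated,
        # to avoid false alarms from a small early sample.
        return len(self.matches) >= 100 and self.agreement < self.threshold
```

When the monitor fires, the fix is usually in the guidelines, not the annotators — as with the abstract "organic" criteria above, which the redesigned framework replaced with concrete, observable criteria.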

Read Full Case Study

Same-day QA turnaround

Frequently Asked Questions

How much preference data does a video reward model need?

Reward models for video generation typically need six-figure preference annotation volumes to generalize across content types and failure modes. The VideoReward benchmark used 182,000 annotations across 12 models as a research baseline. In production, Claru has delivered 976,000+ multi-dimensional quality assessments for a single frontier lab — enough to train separate reward models per quality axis (motion, fidelity, viewer interest, text alignment) without distribution gaps that cause reward hacking.

Why evaluate on four dimensions instead of a single quality score?

Four dimensions capture the primary axes along which video quality varies: motion quality (temporal coherence, physics plausibility, artifact severity), visual fidelity (resolution consistency, lighting accuracy, texture detail), viewer interest (whether the video holds attention), and text-to-video alignment (how faithfully output matches the input prompt). Collapsing these into a single score produces reward models that optimize for the wrong features. Claru's pilot studies validate which dimensions achieve reliable inter-annotator agreement before scaling.

Why can't automated metrics replace human preference data?

Automated metrics like FVD, FID, and CLIP score correlate poorly with human perception of video quality. A video can score well on FID while containing physically impossible motion, or match a CLIP embedding while being visually boring. VideoReward demonstrated that reward models trained on 182,000 human preference annotations outperformed automated metrics on predicting human quality judgments. RLHF requires human signal because the optimization target is human preference, not metric performance.

How does Claru keep annotators calibrated at scale?

Claru uses a multi-layer calibration system: annotators must exceed 85% agreement with expert gold-standard annotations on a qualification pipeline before entering production. In production, calibration sets — pre-labeled items with known ground truth — are seeded at a 5% rate, and inter-rater agreement is tracked continuously via Krippendorff's alpha. Annotators falling below threshold on any dimension are flagged for immediate retraining. This system maintained 92%+ calibration agreement across 241,000+ safety annotations and 976,000+ quality assessments.
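The seeding-and-gating mechanics can be sketched as follows. Function names and queue-construction details are illustrative assumptions; only the 5% seed rate and 85% threshold come from the description above.

```python
import random

GOLD_SEED_RATE = 0.05          # share of queue items with known ground truth
CALIBRATION_THRESHOLD = 0.85   # minimum gold-standard agreement to stay live

def build_queue(items, gold_items, seed_rate=GOLD_SEED_RATE, rng=None):
    """Seed pre-labeled calibration items into an annotation queue at a
    fixed rate. `items` are unlabeled tasks; `gold_items` carry a hidden
    expert label used to score the annotator without their knowledge."""
    rng = rng or random.Random(0)
    # Choose enough gold items that they make up ~seed_rate of the queue.
    n_gold = max(1, round(len(items) * seed_rate / (1 - seed_rate)))
    queue = list(items) + rng.sample(gold_items, k=min(n_gold, len(gold_items)))
    rng.shuffle(queue)  # gold items must be indistinguishable in position
    return queue

def gold_agreement(annotations, gold_labels):
    """Fraction of seeded calibration items the annotator labeled correctly.
    A score below CALIBRATION_THRESHOLD triggers retraining."""
    scored = [(i, lab) for i, lab in annotations.items() if i in gold_labels]
    if not scored:
        return None  # no calibration items seen yet
    return sum(gold_labels[i] == lab for i, lab in scored) / len(scored)
```

Because gold items are shuffled in rather than batched, an annotator cannot distinguish scored items from real work, which keeps the calibration measurement honest.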

Does Claru support text-to-video, image-to-video, and video-to-video evaluation?

Yes. Claru maintains separate evaluation frameworks for text-to-video, image-to-video, and video-to-video workflows. In one engagement, Claru ranked 51 model configurations across all 3 modalities using 39,000 pairwise evaluations with ELO-based scoring and Swiss-style tournament pairing. Separate per-modality rankings enabled the client to deploy the best configuration for each workflow rather than selecting a single compromise model.
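The per-comparison rating update behind an ELO-based ranking can be sketched as below. The K-factor is illustrative, and the Swiss-style pairing policy (each round pairs configurations with similar current ratings) is summarized in a comment rather than implemented.

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update from a single pairwise human evaluation.
    Each model configuration carries a rating; an upset win by the
    lower-rated configuration moves both ratings more than an expected
    win does. Swiss-style pairing then matches configurations with
    similar ratings in the next round, concentrating human evaluations
    on the comparisons that are most informative."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Equal ratings: a win moves each rating by k/2 = 16 points.
a, b = elo_update(1500.0, 1500.0, a_wins=True)
```

Run over tens of thousands of pairwise judgments, this converges to a stable per-modality ranking without requiring every configuration to be compared against every other.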


Your next hire isn't a vendor. It's a data team.

Tell us what you're training. We'll scope the dataset.


Or email us directly at [email protected]


References

  1. [1] Liu et al. “Improving Video Generation with Human Feedback.” arXiv, 2025. Introduced VideoReward, a multi-dimensional video reward model trained on large-scale human preference annotations; Flow-DPO preference-aligned fine-tuning measurably improved generation quality over automated metrics. Link
  2. [2] Tan et al. “VidGen-1M: A Large-Scale Dataset for Text-to-Video Generation.” arXiv, 2024. Compiled 1 million video-text pairs with detailed captions; found that captions in existing datasets average fewer than 15 words, insufficient for temporal and compositional understanding required by video generation models. Link
  3. [3] Liu et al. “Improving Video Generation with Human Feedback (Flow-DPO).” arXiv, 2025. Flow-DPO, introduced in the same paper as VideoReward, adapts Direct Preference Optimization for flow-matching video models and demonstrates superior alignment performance compared to supervised fine-tuning and Flow-RWR. Link
  4. [4] Lin et al. “GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation.” arXiv, 2024. Provides 1,600 compositional text-to-visual prompts with human ratings, establishing a benchmark for compositional generation evaluation across text-to-image and text-to-video. Link
  5. [5] Wang et al. “VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models.” arXiv, 2024. Collected 6.69 million unique text-to-video prompts from real user interactions, revealing the distribution gap between research prompts and production usage patterns. Link