Why Crowdsourced RLHF Fails for Code and Math Models
Reinforcement learning from human feedback depends on the quality of human judgments. For code and math models, crowdsourced annotators introduce incorrect preference pairs that actively degrade reward model performance. This analysis compares specific failure modes, cost structures, and the empirical evidence for why expert annotation produces measurably better RLHF outcomes in technical domains.
Noisy labels do not average out; they compound
The conventional assumption that crowdsourced annotation errors cancel out through aggregation does not hold for RLHF preference pairs. The "Secrets of RLHF in Large Language Models Part II" study found that incorrect preference pairs hinder reward model training by introducing systematic biases that the reward model learns as signal rather than noise [1]. In code generation tasks, a crowdsourced annotator who cannot evaluate algorithmic correctness will consistently prefer responses that appear well-formatted over responses that produce correct output. This trains a reward model that optimizes for surface-level code style rather than functional correctness, a failure mode that is invisible in standard evaluation metrics until the model is deployed.
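A minimal sketch of this dynamic, assuming a linear reward over two toy features (correctness, style) trained with the Bradley-Terry preference objective; `make_pairs`, `fit_reward`, and the flip rates are illustrative constructions, not taken from the cited study:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_reward(pairs, steps=500, lr=0.5):
    # Linear reward r(x) = w[0]*correctness + w[1]*style, fit with the
    # Bradley-Terry objective: maximize log sigmoid(r(winner) - r(loser)).
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        for win, lose in pairs:
            d = [win[0] - lose[0], win[1] - lose[1]]
            p = sigmoid(w[0] * d[0] + w[1] * d[1])
            g[0] += (1 - p) * d[0]
            g[1] += (1 - p) * d[1]
        w[0] += lr * g[0] / len(pairs)
        w[1] += lr * g[1] / len(pairs)
    return w

def make_pairs(n, flip_rate, rng):
    # Each pair: a correct-but-plain response vs. a wrong-but-polished one.
    # A style-driven annotator flips the label with probability flip_rate.
    pairs = []
    for _ in range(n):
        better, worse = (1.0, 0.0), (0.0, 1.0)  # (correctness, style)
        if rng.random() < flip_rate:
            better, worse = worse, better
        pairs.append((better, worse))
    return pairs

rng = random.Random(0)
w_expert = fit_reward(make_pairs(200, 0.00, rng))  # expert labels
w_crowd = fit_reward(make_pairs(200, 0.45, rng))   # style-biased labels
```

With clean labels the learned weight on correctness dominates; at a 45% flip rate the fitted margin between correctness and style collapses toward a small constant rather than averaging out, which is the "noise learned as signal" failure described above.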
Domain expertise is non-negotiable for code and mathematics
MM-RLHF assembled 50+ domain experts to produce 120,000 preference samples across multimodal tasks, establishing that expert annotation is both feasible at scale and measurably superior for technical domains [2]. The gap is starkest in code and mathematics: evaluating whether a code response correctly implements a recursive algorithm or whether a mathematical proof contains a subtle logical error requires domain knowledge that 3-hour training sessions cannot provide. RLTHF (Reinforcement Learning from Targeted Human Feedback) found that expert annotators reduce the annotation volume needed by 93-94%, requiring only 6-7% of the preference pairs that crowdsourced approaches need to achieve equivalent reward model quality [3]. This volume reduction directly translates to cost savings despite higher per-annotation rates.
The hidden cost of crowdsourced RLHF: debugging reward hacking
When a reward model is trained on noisy crowdsourced preferences, the policy model learns to exploit reward model errors rather than improve on the target task. This phenomenon, known as reward hacking, manifests as outputs that score highly on the flawed reward model while performing poorly on held-out human evaluations. Debugging reward hacking requires retracing preference pair quality, retraining the reward model, and re-running policy optimization. Teams at frontier labs report that a single reward hacking cycle costs 2-4 weeks of researcher time and $50,000-$200,000 in compute [1]. Expert annotation avoids this cycle by producing preference pairs that accurately reflect task performance from the outset.
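A sketch of the monitoring that this debugging loop relies on, assuming you log the policy's mean score under both the learned reward model (the proxy) and a held-out human or gold evaluation at each checkpoint; the function name, window, and threshold are illustrative:

```python
def hacking_alert(proxy_scores, gold_scores, window=3, gap_threshold=0.2):
    """Flag reward hacking: the proxy reward keeps climbing over a window
    of checkpoints while the held-out gold evaluation stalls or falls."""
    for i in range(window, len(proxy_scores)):
        proxy_gain = proxy_scores[i] - proxy_scores[i - window]
        gold_gain = gold_scores[i] - gold_scores[i - window]
        if proxy_gain > gap_threshold and gold_gain <= 0:
            return i  # first checkpoint where the divergence is visible
    return None

# Hypothetical training run: proxy reward rises while gold eval turns down.
proxy = [0.10, 0.30, 0.50, 0.70, 0.90, 1.10]
gold = [0.10, 0.20, 0.30, 0.30, 0.25, 0.20]
checkpoint = hacking_alert(proxy, gold)  # → 5
```

The signature of the failure is exactly this divergence: outputs that score ever higher on the flawed reward model while held-out human evaluation flatlines or degrades.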
What are the real costs of crowdsourced vs expert RLHF annotation?
Cost comparisons between crowdsourced and expert RLHF must account for total cost of ownership: per-label rates, annotation volume required, reward model retraining cycles, and the compute cost of debugging reward hacking when preference quality is low. Expert annotation costs 50-100x more per label but requires 93-94% fewer labels to achieve equivalent reward model quality.
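As a back-of-envelope sketch using midpoints of the figures quoted in this article ($0.05 vs. $85 per label, 6.5% expert volume, $125,000 per reward-hacking cycle); the number of hacking cycles is an assumption, not a measured value:

```python
def total_cost(base_labels, per_label, volume_frac=1.0,
               hacking_cycles=0, cycle_cost=0.0):
    """Labeling spend plus the cost of debugging reward-hacking cycles.
    volume_frac scales the label count (e.g. 0.065 for expert data,
    per the RLTHF volume-reduction finding)."""
    return base_labels * volume_frac * per_label + hacking_cycles * cycle_cost

# Hypothetical 100k-pair project, midpoint rates from the figures above.
crowd = total_cost(100_000, per_label=0.05, hacking_cycles=2, cycle_cost=125_000)
expert = total_cost(100_000, per_label=85.0, volume_frac=0.065)
```

Where the break-even lands depends almost entirely on how many reward-hacking cycles the cleaner preference data avoids; for crowdsourced work the label spend itself is a secondary term next to the debugging cost.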
Comparison table: crowdsourced RLHF vs. expert RLHF (in-house) vs. expert RLHF (outsourced) vs. Claru expert RLHF.
Benchmarking Prompt Enhancement Quality Across Leading LLMs
The evaluation was structured in two phases, one per modality, with a shared methodology for aggregating human judgments into statistically defensible recommendations. The text phase used side-by-side comparisons: for each of 180 prompts, every candidate enhancement solution produced an enhanced version, and 3 independent annotators selected the best one. Annotators evaluated holistic quality across three dimensions, beginning with intent preservation (does the enhanced prompt maintain the user's original goal?).
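The aggregation step this methodology describes — 3 annotators per prompt, a majority verdict, then a significance test over head-to-head wins — can be sketched as below; the binomial sign test is an assumption standing in for whatever statistical test the case study actually used:

```python
from collections import Counter
from math import comb

def majority_winner(votes):
    """Majority pick among one prompt's annotator votes, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None

def sign_test_p(wins_a, wins_b):
    """Two-sided binomial sign test over prompts where systems A and B
    received different majority verdicts."""
    n = wins_a + wins_b
    tail = sum(comb(n, i) for i in range(max(wins_a, wins_b), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical ballots: each inner list is one prompt's 3 annotator picks.
ballots = [["A", "A", "B"], ["B", "B", "B"], ["A", "A", "A"], ["A", "B", "C"]]
winners = [majority_winner(b) for b in ballots]  # ["A", "B", "A", None]
p = sign_test_p(15, 3)  # A beats B on 15 of 18 decisive prompts: p < 0.01
```

Prompts with no majority (three-way splits) are dropped from the head-to-head count rather than forced into a verdict.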
Frequently Asked Questions
Why does crowdsourced annotation fail specifically for code and math?
Code and math tasks have objectively correct answers that require domain expertise to evaluate. A crowdsourced annotator without programming experience cannot distinguish between code that compiles and runs correctly versus code that appears well-structured but contains logical errors. The "Secrets of RLHF Part II" research found that these incorrect preference pairs actively degrade reward model training rather than simply adding noise. For math, the problem compounds: subtle proof errors and incorrect intermediate reasoning steps are invisible to non-expert annotators.
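The correctness-vs-appearance distinction can be made mechanical: judge a code preference pair by executing both candidates against unit tests rather than by how they read. A minimal sketch; the candidates, the `solve` naming convention, and the helper names are illustrative, and a real harness would sandbox untrusted model output rather than `exec` it:

```python
def passes(code, tests):
    """Count how many (args, expected) unit tests a candidate passes.
    NOTE: exec on untrusted model output is for illustration only."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0
    score = 0
    for args, expected in tests:
        try:
            if namespace["solve"](*args) == expected:
                score += 1
        except Exception:
            pass
    return score

def prefer(code_a, code_b, tests):
    """Label the pair by functional correctness, not surface style."""
    a, b = passes(code_a, tests), passes(code_b, tests)
    return "a" if a > b else "b" if b > a else "tie"

plain_but_correct = "def solve(n):\n    return 1 if n < 2 else n * solve(n - 1)"
polished_but_wrong = '''
def solve(n):
    """Compute n! iteratively, with tidy naming and a docstring."""
    result = 1
    for i in range(1, n):  # subtle bug: loop never multiplies by n
        result *= i
    return result
'''
tests = [((1,), 1), ((3,), 6), ((5,), 120)]
verdict = prefer(plain_but_correct, polished_but_wrong, tests)  # → "a"
```

The polished candidate passes only the degenerate n=1 case; a style-driven annotator would likely prefer it anyway, which is exactly the preference pair that misleads the reward model.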
Isn't expert annotation prohibitively expensive?
Expert annotation costs $50-$120 per label compared to $0.02-$0.09 per label for crowdsourced work. However, RLTHF research shows expert annotation requires only 6-7% of the volume that crowdsourced approaches need for equivalent reward model quality. At those ratios, expert RLHF is cost-competitive or cheaper when accounting for total cost of ownership: fewer labels needed, no reward hacking debugging cycles ($50,000-$200,000 per cycle), and no retraining compute wasted on noisy preferences.
Can we start with crowdsourced data and layer expert data on top later?
This staged approach is common but carries risks. A reward model pretrained on noisy crowdsourced preferences develops biases that expert data must overcome rather than simply extend. The more effective hybrid strategy is to use crowdsourced annotation for non-technical preference categories (helpfulness, tone, formatting) where domain expertise is less critical, and expert annotation exclusively for domains requiring correctness evaluation (code, math, scientific reasoning).
How many expert annotators does an RLHF engagement need?
Panel size depends on the domain and throughput requirements. For code RLHF, typical engagements deploy 8-15 calibrated annotators with computer science backgrounds. For math, 5-10 annotators with graduate-level proof-writing competence. Each annotator passes domain-specific assessments and a calibration phase before joining the active pool. Continuous monitoring with 5% calibration seeding ensures sustained quality across the engagement.
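The 5% calibration seeding can be monitored with a simple per-annotator accuracy report. A sketch with hypothetical item IDs and a 0.9 accuracy bar; the actual threshold and flagging policy are engagement-specific:

```python
def calibration_report(labels, gold, threshold=0.9):
    """labels: {annotator: {item_id: label}}; gold: answers for the ~5%
    of items seeded with known-correct labels. Flags low performers."""
    report = {}
    for annotator, answers in labels.items():
        seeded = [item for item in answers if item in gold]
        if not seeded:
            continue  # annotator has seen no calibration items yet
        accuracy = sum(answers[i] == gold[i] for i in seeded) / len(seeded)
        report[annotator] = (round(accuracy, 2), accuracy >= threshold)
    return report

gold = {"q1": "a", "q2": "b"}  # seeded calibration items
labels = {
    "ann1": {"q1": "a", "q2": "b", "q3": "a"},
    "ann2": {"q1": "b", "q2": "b"},
}
report = calibration_report(labels, gold)
# ann1 passes the bar; ann2 falls below it and is flagged for review
```

Because annotators cannot tell seeded items from live ones, the report gives an unbiased running estimate of each annotator's accuracy.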
What evidence shows expert annotation outperforms crowdsourcing?
Three lines of evidence converge. RLTHF demonstrates 93-94% volume reduction with expert feedback while maintaining reward model quality. MM-RLHF's 50+ expert team produced 120,000 samples that outperformed crowdsourced baselines on multimodal evaluation tasks. The Secrets of RLHF Part II study provides the mechanistic explanation: incorrect preference pairs do not average out as noise but actively mislead reward model training, creating systematic biases that expert annotation avoids.
References
- [1] Liu et al. “Secrets of RLHF in Large Language Models Part II: Reward Modeling.” arXiv, 2024. Incorrect preference pairs hinder reward model training by introducing systematic biases that the model learns as signal rather than noise.
- [2] Zhang et al. “MM-RLHF: The Next Step of Multimodal LLM Alignment.” arXiv, 2025. 50+ domain experts produced 120,000 high-quality preference samples across multimodal tasks, demonstrating that expert annotation scales and outperforms crowdsourced alternatives.
- [3] Wang et al. “RLTHF: Targeted Human Feedback for LLM Alignment.” arXiv, 2025. Expert annotation requires only 6-7% of the preference pair volume that crowdsourced approaches need to achieve equivalent reward model quality.
- [4] Claru. “Benchmarking Prompt Enhancement Quality Across Leading LLMs.” Case Study, 2026. 180 prompts benchmarked across text and video modalities using structured three-dimension evaluation with 3 annotators per comparison, identifying 2 statistical leaders at p < 0.01.
- [5] Ouyang et al. “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS, 2022. Established RLHF as the standard alignment technique for LLMs, using 40 human labelers to produce preference data for reward model training on InstructGPT.