Why Crowdsourced RLHF Fails for Code and Math Models
Reinforcement learning from human feedback depends on the quality of human judgments. For code and math models, crowdsourced annotators introduce incorrect preference pairs that actively degrade reward model performance. This analysis compares specific failure modes, cost structures, and the empirical evidence for why expert annotation produces measurably better RLHF outcomes in technical domains.
Noisy labels do not average out; they compound
The conventional assumption that crowdsourced annotation errors cancel out through aggregation does not hold for RLHF preference pairs. The "Secrets of RLHF in Large Language Models Part II" study found that incorrect preference pairs hinder reward model training by introducing systematic biases that the reward model learns as signal rather than noise [1]. In code generation tasks, a crowdsourced annotator who cannot evaluate algorithmic correctness will consistently prefer responses that appear well-formatted over responses that produce correct output. This trains a reward model that optimizes for surface-level code style rather than functional correctness, a failure mode that is invisible in standard evaluation metrics until the model is deployed.
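A minimal sketch of this dynamic, assuming a linear reward over two toy features (correctness, style) trained with the Bradley-Terry preference objective; `make_pairs`, `fit_reward`, and the flip rates are illustrative constructions, not taken from the cited study:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_reward(pairs, steps=500, lr=0.5):
    # Linear reward r(x) = w[0]*correctness + w[1]*style, fit with the
    # Bradley-Terry objective: maximize log sigmoid(r(winner) - r(loser)).
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        for win, lose in pairs:
            d = [win[0] - lose[0], win[1] - lose[1]]
            p = sigmoid(w[0] * d[0] + w[1] * d[1])
            g[0] += (1 - p) * d[0]
            g[1] += (1 - p) * d[1]
        w[0] += lr * g[0] / len(pairs)
        w[1] += lr * g[1] / len(pairs)
    return w

def make_pairs(n, flip_rate, rng):
    # Each pair: a correct-but-plain response vs. a wrong-but-polished one.
    # A style-driven annotator flips the label with probability flip_rate.
    pairs = []
    for _ in range(n):
        better, worse = (1.0, 0.0), (0.0, 1.0)  # (correctness, style)
        if rng.random() < flip_rate:
            better, worse = worse, better
        pairs.append((better, worse))
    return pairs

rng = random.Random(0)
w_expert = fit_reward(make_pairs(200, 0.00, rng))  # expert labels
w_crowd = fit_reward(make_pairs(200, 0.45, rng))   # style-biased labels
```

With clean labels the learned weight on correctness dominates; at a 45% flip rate the fitted margin between correctness and style collapses toward a small constant rather than averaging out, which is the "noise learned as signal" failure described above.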
Domain expertise is non-negotiable for code and mathematics
MM-RLHF assembled 50+ domain experts to produce 120,000 preference samples across multimodal tasks, establishing that expert annotation is both feasible at scale and measurably superior for technical domains [2]. The gap is starkest in code and mathematics: evaluating whether a code response correctly implements a recursive algorithm or whether a mathematical proof contains a subtle logical error requires domain knowledge that 3-hour training sessions cannot provide. RLTHF (Reinforcement Learning from Targeted Human Feedback) found that expert annotators reduce the annotation volume needed by 93-94%, requiring only 6-7% of the preference pairs that crowdsourced approaches need to achieve equivalent reward model quality [3]. This volume reduction directly translates to cost savings despite higher per-annotation rates.
The hidden cost of crowdsourced RLHF: debugging reward hacking
When a reward model is trained on noisy crowdsourced preferences, the policy model learns to exploit reward model errors rather than improve on the target task. This phenomenon, known as reward hacking, manifests as outputs that score highly on the flawed reward model while performing poorly on held-out human evaluations. Debugging reward hacking requires retracing preference pair quality, retraining the reward model, and re-running policy optimization. Teams at frontier labs report that a single reward hacking cycle costs 2-4 weeks of researcher time and $50,000-$200,000 in compute [1]. Expert annotation avoids this cycle by producing preference pairs that accurately reflect task performance from the outset.
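A sketch of the monitoring that this debugging loop relies on, assuming you log the policy's mean score under both the learned reward model (the proxy) and a held-out human or gold evaluation at each checkpoint; the function name, window, and threshold are illustrative:

```python
def hacking_alert(proxy_scores, gold_scores, window=3, gap_threshold=0.2):
    """Flag reward hacking: the proxy reward keeps climbing over a window
    of checkpoints while the held-out gold evaluation stalls or falls."""
    for i in range(window, len(proxy_scores)):
        proxy_gain = proxy_scores[i] - proxy_scores[i - window]
        gold_gain = gold_scores[i] - gold_scores[i - window]
        if proxy_gain > gap_threshold and gold_gain <= 0:
            return i  # first checkpoint where the divergence is visible
    return None

# Hypothetical training run: proxy reward rises while gold eval turns down.
proxy = [0.10, 0.30, 0.50, 0.70, 0.90, 1.10]
gold = [0.10, 0.20, 0.30, 0.30, 0.25, 0.20]
checkpoint = hacking_alert(proxy, gold)  # → 5
```

The signature of the failure is exactly this divergence: outputs that score ever higher on the flawed reward model while held-out human evaluation flatlines or degrades.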
What are the real costs of crowdsourced vs expert RLHF annotation?
Cost comparisons between crowdsourced and expert RLHF must account for total cost of ownership: per-label rates, annotation volume required, reward model retraining cycles, and the compute cost of debugging reward hacking when preference quality is low. Expert annotation costs 50-100x more per label but requires 93-94% fewer labels to achieve equivalent reward model quality.
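As a back-of-envelope sketch using midpoints of the figures quoted in this article ($0.05 vs. $85 per label, 6.5% expert volume, $125,000 per reward-hacking cycle); the number of hacking cycles is an assumption, not a measured value:

```python
def total_cost(base_labels, per_label, volume_frac=1.0,
               hacking_cycles=0, cycle_cost=0.0):
    """Labeling spend plus the cost of debugging reward-hacking cycles.
    volume_frac scales the label count (e.g. 0.065 for expert data,
    per the RLTHF volume-reduction finding)."""
    return base_labels * volume_frac * per_label + hacking_cycles * cycle_cost

# Hypothetical 100k-pair project, midpoint rates from the figures above.
crowd = total_cost(100_000, per_label=0.05, hacking_cycles=2, cycle_cost=125_000)
expert = total_cost(100_000, per_label=85.0, volume_frac=0.065)
```

Where the break-even lands depends almost entirely on how many reward-hacking cycles the cleaner preference data avoids; for crowdsourced work the label spend itself is a secondary term next to the debugging cost.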
Comparison table: crowdsourced RLHF vs. expert RLHF (in-house) vs. expert RLHF (outsourced) vs. Claru expert RLHF.
Benchmarking Prompt Enhancement Quality Across Leading LLMs
The evaluation was structured in two phases, one per modality, with a shared methodology for aggregating human judgments into statistically defensible recommendations. The text phase used side-by-side comparisons: for each of 180 prompts, every candidate enhancement solution produced an enhanced version, and 3 independent annotators selected the best one. Annotators evaluated holistic quality across three dimensions, beginning with intent preservation (does the enhanced prompt maintain the user's original goal?).
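The aggregation step this methodology describes — 3 annotators per prompt, a majority verdict, then a significance test over head-to-head wins — can be sketched as below; the binomial sign test is an assumption standing in for whatever statistical test the case study actually used:

```python
from collections import Counter
from math import comb

def majority_winner(votes):
    """Majority pick among one prompt's annotator votes, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None

def sign_test_p(wins_a, wins_b):
    """Two-sided binomial sign test over prompts where systems A and B
    received different majority verdicts."""
    n = wins_a + wins_b
    tail = sum(comb(n, i) for i in range(max(wins_a, wins_b), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical ballots: each inner list is one prompt's 3 annotator picks.
ballots = [["A", "A", "B"], ["B", "B", "B"], ["A", "A", "A"], ["A", "B", "C"]]
winners = [majority_winner(b) for b in ballots]  # ["A", "B", "A", None]
p = sign_test_p(15, 3)  # A beats B on 15 of 18 decisive prompts: p < 0.01
```

Prompts with no majority (three-way splits) are dropped from the head-to-head count rather than forced into a verdict.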
Frequently Asked Questions
Why does crowdsourced annotation fail specifically for code and math?
Code and math tasks have objectively correct answers that require domain expertise to evaluate. A crowdsourced annotator without programming experience cannot distinguish between code that compiles and runs correctly versus code that appears well-structured but contains logical errors. The "Secrets of RLHF Part II" research found that these incorrect preference pairs actively degrade reward model training rather than simply adding noise. For math, the problem compounds: subtle proof errors and incorrect intermediate reasoning steps are invisible to non-expert annotators.
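The correctness-vs-appearance distinction can be made mechanical: judge a code preference pair by executing both candidates against unit tests rather than by how they read. A minimal sketch; the candidates, the `solve` naming convention, and the helper names are illustrative, and a real harness would sandbox untrusted model output rather than `exec` it:

```python
def passes(code, tests):
    """Count how many (args, expected) unit tests a candidate passes.
    NOTE: exec on untrusted model output is for illustration only."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0
    score = 0
    for args, expected in tests:
        try:
            if namespace["solve"](*args) == expected:
                score += 1
        except Exception:
            pass
    return score

def prefer(code_a, code_b, tests):
    """Label the pair by functional correctness, not surface style."""
    a, b = passes(code_a, tests), passes(code_b, tests)
    return "a" if a > b else "b" if b > a else "tie"

plain_but_correct = "def solve(n):\n    return 1 if n < 2 else n * solve(n - 1)"
polished_but_wrong = '''
def solve(n):
    """Compute n! iteratively, with tidy naming and a docstring."""
    result = 1
    for i in range(1, n):  # subtle bug: loop never multiplies by n
        result *= i
    return result
'''
tests = [((1,), 1), ((3,), 6), ((5,), 120)]
verdict = prefer(plain_but_correct, polished_but_wrong, tests)  # → "a"
```

The polished candidate passes only the degenerate n=1 case; a style-driven annotator would likely prefer it anyway, which is exactly the preference pair that misleads the reward model.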
Isn't expert annotation prohibitively expensive?
Expert annotation costs $50-$120 per label compared to $0.02-$0.09 per label for crowdsourced work. However, RLTHF research shows expert annotation requires only 6-7% of the volume that crowdsourced approaches need for equivalent reward model quality. At those ratios, expert RLHF is cost-competitive or cheaper when accounting for total cost of ownership: fewer labels needed, no reward hacking debugging cycles ($50,000-$200,000 per cycle), and no retraining compute wasted on noisy preferences.
Can we start with crowdsourced data and layer expert data on top later?
This staged approach is common but carries risks. A reward model pretrained on noisy crowdsourced preferences develops biases that expert data must overcome rather than simply extend. The more effective hybrid strategy is to use crowdsourced annotation for non-technical preference categories (helpfulness, tone, formatting) where domain expertise is less critical, and expert annotation exclusively for domains requiring correctness evaluation (code, math, scientific reasoning).
How many expert annotators does an RLHF engagement need?
Panel size depends on the domain and throughput requirements. For code RLHF, typical engagements deploy 8-15 calibrated annotators with computer science backgrounds. For math, 5-10 annotators with graduate-level proof-writing competence. Each annotator passes domain-specific assessments and a calibration phase before joining the active pool. Continuous monitoring with 5% calibration seeding ensures sustained quality across the engagement.
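The 5% calibration seeding can be monitored with a simple per-annotator accuracy report. A sketch with hypothetical item IDs and a 0.9 accuracy bar; the actual threshold and flagging policy are engagement-specific:

```python
def calibration_report(labels, gold, threshold=0.9):
    """labels: {annotator: {item_id: label}}; gold: answers for the ~5%
    of items seeded with known-correct labels. Flags low performers."""
    report = {}
    for annotator, answers in labels.items():
        seeded = [item for item in answers if item in gold]
        if not seeded:
            continue  # annotator has seen no calibration items yet
        accuracy = sum(answers[i] == gold[i] for i in seeded) / len(seeded)
        report[annotator] = (round(accuracy, 2), accuracy >= threshold)
    return report

gold = {"q1": "a", "q2": "b"}  # seeded calibration items
labels = {
    "ann1": {"q1": "a", "q2": "b", "q3": "a"},
    "ann2": {"q1": "b", "q2": "b"},
}
report = calibration_report(labels, gold)
# ann1 passes the bar; ann2 falls below it and is flagged for review
```

Because annotators cannot tell seeded items from live ones, the report gives an unbiased running estimate of each annotator's accuracy.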
What evidence shows expert annotation outperforms crowdsourcing?
Three lines of evidence converge. RLTHF demonstrates 93-94% volume reduction with expert feedback while maintaining reward model quality. MM-RLHF's 50+ expert team produced 120,000 samples that outperformed crowdsourced baselines on multimodal evaluation tasks. The Secrets of RLHF Part II study provides the mechanistic explanation: incorrect preference pairs do not average out as noise but actively mislead reward model training, creating systematic biases that expert annotation avoids.
References
- [1] Liu et al. “Secrets of RLHF in Large Language Models Part II: Reward Modeling.” arXiv, 2024. Incorrect preference pairs hinder reward model training by introducing systematic biases that the model learns as signal rather than noise.
- [2] Zhang et al. “MM-RLHF: The Next Step of Multimodal LLM Alignment.” arXiv, 2025. 50+ domain experts produced 120,000 high-quality preference samples across multimodal tasks, demonstrating that expert annotation scales and outperforms crowdsourced alternatives.
- [3] Wang et al. “RLTHF: Targeted Human Feedback for LLM Alignment.” arXiv, 2025. Expert annotation requires only 6-7% of the preference pair volume that crowdsourced approaches need to achieve equivalent reward model quality.
- [4] Claru. “Benchmarking Prompt Enhancement Quality Across Leading LLMs.” Case Study, 2026. 180 prompts benchmarked across text and video modalities using structured three-dimension evaluation with 3 annotators per comparison, identifying 2 statistical leaders at p < 0.01.
- [5] Ouyang et al. “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS, 2022. Established RLHF as the standard alignment technique for LLMs, using 40 human labelers to produce preference data for reward model training on InstructGPT.