Scaling Generative AI Safety Through Human-Led Data Labeling
Challenge: Automated content moderation pipelines produce binary safe/unsafe classifications but cannot quantify their own residual risk — the rate at which harmful content passes through undetected.
Solution: We built a high-throughput, quality-controlled annotation workflow focused exclusively on residual risk — only reviewing outputs that had already passed the client's automated moderation pipeline.
Result: The annotation program provided the client with defensible, quantitative evidence of safety performance — residual violation rates broken down by product, modality, and safety category — that satisfied both internal governance requirements and external audit requests from enterprise customers.
Automated content moderation pipelines produce binary safe/unsafe classifications but cannot quantify their own residual risk — the rate at which harmful content passes through undetected. The client needed ground-truth human labels on outputs that had already cleared automated filters to measure actual violation rates, identify systematic gaps in their moderation stack, and produce audit-ready documentation for enterprise customers and regulators. The challenge was operational: labeling safety-sensitive content at scale requires annotators trained on nuanced policy definitions, consistent application of multi-dimensional safety taxonomies across modalities (text and video), and continuous monitoring to prevent annotator fatigue and desensitization from degrading label quality.
We built a high-throughput, quality-controlled annotation workflow focused exclusively on residual risk — only reviewing outputs that had already passed the client's automated moderation pipeline. This design choice was deliberate: the goal was not to replicate automated filtering but to measure its failure rate and characterize the types of violations it misses.
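In practice, the intake step reduces to filtering on the automated verdict and sampling the survivors for human review. A minimal sketch, assuming dict-shaped outputs with an automated_filter_result field matching the record format shown later; the 10% sampling rate is illustrative, not the program's actual rate:

import random

def build_review_queue(outputs, sample_rate=0.10, seed=42):
    """Sample filter-passing outputs for residual-risk annotation.

    Only outputs the automated pipeline already marked PASSED are
    eligible; everything else was caught upstream and is out of scope.
    """
    passed = [o for o in outputs if o["automated_filter_result"] == "PASSED"]
    if not passed:
        return []
    rng = random.Random(seed)
    k = max(1, int(len(passed) * sample_rate))
    return rng.sample(passed, k)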
Annotators evaluated text and video outputs against a multi-dimensional safety taxonomy covering nudity/NSFW content, violence and gore, hate speech and harassment, self-harm, and illegal activity. Each output received a binary safety label (safe/violation) plus a category tag when violations were identified. Policy definitions were developed collaboratively with the client's trust and safety team and included visual exemplars, boundary cases, and explicit guidance on culturally variable norms.
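A rough sketch of how this label schema could be represented in code, with category names mirroring the taxonomy above; the class and field names are illustrative, not the client's actual schema:

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ViolationCategory(Enum):
    # Categories mirror the multi-dimensional safety taxonomy above.
    NUDITY_NSFW = "nudity_nsfw"
    VIOLENCE_GORE = "violence_gore"
    HATE_HARASSMENT = "hate_harassment"
    SELF_HARM = "self_harm"
    ILLEGAL_ACTIVITY = "illegal_activity"

@dataclass
class AnnotationLabel:
    is_violation: bool                            # binary safe/violation label
    category: Optional[ViolationCategory] = None  # set when a violation is identified

    def __post_init__(self):
        # Every violation label carries a category tag.
        if self.is_violation and self.category is None:
            raise ValueError("violation labels require a category tag")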
Live dashboards tracked violation rates by product, modality, and safety category in real time. Weekly reporting identified emerging patterns — for example, a spike in near-miss violent content from a specific model configuration — enabling the client to update their automated filters proactively rather than reactively. Annotator consistency was monitored via calibration sets: pre-labeled items seeded into the workflow at a 5% rate. Annotators who fell below 92% agreement on the calibration items were pulled for retraining.
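A minimal sketch of that calibration check, using the 5% seeding rate and 92% agreement threshold described above; the function and variable names are assumptions for illustration:

import random

CALIBRATION_SEED_RATE = 0.05   # pre-labeled items seeded into the workflow
AGREEMENT_THRESHOLD = 0.92     # minimum agreement before retraining is triggered

def should_seed_calibration(rng: random.Random) -> bool:
    # Decide whether the next queued item should be a calibration item.
    return rng.random() < CALIBRATION_SEED_RATE

def calibration_agreement(calibration_results):
    """Fraction of seeded items the annotator labeled correctly.

    `calibration_results` is a list of (human_label, gold_label) pairs.
    """
    if not calibration_results:
        return None  # no calibration items seen yet
    matches = sum(1 for human, gold in calibration_results if human == gold)
    return matches / len(calibration_results)

def needs_retraining(calibration_results) -> bool:
    score = calibration_agreement(calibration_results)
    return score is not None and score < AGREEMENT_THRESHOLD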
The annotation program provided the client with defensible, quantitative evidence of safety performance — residual violation rates broken down by product, modality, and safety category — that satisfied both internal governance requirements and external audit requests from enterprise customers. The pattern analysis surfaced three systematic gaps in their automated moderation pipeline that had not been detected through aggregate metrics alone, leading to targeted model retraining that reduced residual violation rates by an estimated 40% in subsequent measurement cycles.
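The headline rates reduce to a simple grouped computation over completed annotations. A minimal sketch, assuming dict records with the fields shown in the example below plus a hypothetical product field; a per-category split works the same way with violation_category added to the grouping key:

from collections import defaultdict

def residual_violation_rates(records):
    """Residual violation rate per (product, modality) slice.

    Residual risk is measured only on outputs the automated
    pipeline passed, so everything else is skipped.
    """
    totals = defaultdict(int)
    violations = defaultdict(int)
    for r in records:
        if r.get("automated_filter_result") != "PASSED":
            continue
        key = (r.get("product"), r["modality"])
        totals[key] += 1
        if r["human_label"] == "violation":
            violations[key] += 1
    return {key: violations[key] / totals[key] for key in totals}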
Representative record from the annotation pipeline.
{
  "annotation_id": "aff737b7-5ab8-4b1e-986c-442b545668e7",
  "modality": "video",
  "automated_filter_result": "PASSED",
  "human_label": "violation",
  "violation_category": "violence_gore",
  "confidence_tier": "high",
  "annotator_calibration": "94.2%",
  "batch_id": "batch-2025-12-06-047",
  "status": "completed"
}