Safety

Scaling Generative AI Safety Through Human-Led Data Labeling

241K+ Safety annotations completed

Challenge: Automated content moderation pipelines produce binary safe/unsafe classifications but cannot quantify their own residual risk — the rate at which harmful content passes through undetected.

Solution: We built a high-throughput, quality-controlled annotation workflow focused exclusively on residual risk — only reviewing outputs that had already passed the client's automated moderation pipeline.

Result: The annotation program provided the client with defensible, quantitative evidence of safety performance — residual violation rates broken down by product, modality, and safety category — that satisfied both internal governance requirements and external audit requests from enterprise customers.

// THE CHALLENGE

Automated content moderation pipelines produce binary safe/unsafe classifications but cannot quantify their own residual risk — the rate at which harmful content passes through undetected. The client needed ground-truth human labels on outputs that had already cleared automated filters to measure actual violation rates, identify systematic gaps in their moderation stack, and produce audit-ready documentation for enterprise customers and regulators. The challenge was operational: labeling safety-sensitive content at scale requires annotators trained on nuanced policy definitions, consistent application of multi-dimensional safety taxonomies across modalities (text and video), and continuous monitoring to prevent annotator fatigue and desensitization from degrading label quality.
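Measuring residual risk comes down to estimating a violation rate from human labels on post-filter samples, with an uncertainty bound that can stand up in an audit. As a minimal sketch (the function name and sample counts are illustrative, not the client's actual tooling), a Wilson score interval over the labeled sample gives a defensible range for the true rate:

```python
import math

def wilson_interval(violations: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a residual violation rate
    estimated from n human-labeled post-filter outputs."""
    if n == 0:
        return (0.0, 0.0)
    p = violations / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

# Illustrative numbers: 150 violations found in 10,000 post-filter samples
lo, hi = wilson_interval(150, 10_000)
```

The Wilson interval is preferable to a naive normal approximation here because residual violation rates are small, where the normal approximation is least reliable.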

// OUR APPROACH

We built a high-throughput, quality-controlled annotation workflow focused exclusively on residual risk — only reviewing outputs that had already passed the client's automated moderation pipeline. This design choice was deliberate: the goal was not to replicate automated filtering but to measure its failure rate and characterize the types of violations it misses.

Annotators evaluated text and video outputs against a multi-dimensional safety taxonomy covering nudity/NSFW content, violence and gore, hate speech and harassment, self-harm, and illegal activity. Each output received a binary safety label (safe/violation) plus a category tag when violations were identified. Policy definitions were developed collaboratively with the client's trust and safety team and included visual exemplars, boundary cases, and explicit guidance on culturally variable norms.
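The label schema above — a binary safety label plus a category tag only when a violation is found — can be sketched as a small data model. The names `SafetyAnnotation` and `ViolationCategory` are illustrative, not the client's actual schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ViolationCategory(Enum):
    """The five-category safety taxonomy described above."""
    NUDITY_NSFW = "nudity_nsfw"
    VIOLENCE_GORE = "violence_gore"
    HATE_HARASSMENT = "hate_harassment"
    SELF_HARM = "self_harm"
    ILLEGAL_ACTIVITY = "illegal_activity"

@dataclass
class SafetyAnnotation:
    annotation_id: str
    modality: str                                  # "text" or "video"
    is_violation: bool                             # binary safety label
    category: Optional[ViolationCategory] = None   # tag required iff violation

    def __post_init__(self):
        # Enforce the invariant: a category tag accompanies every
        # violation label, and only violation labels.
        if self.is_violation and self.category is None:
            raise ValueError("violation labels require a category tag")
        if not self.is_violation and self.category is not None:
            raise ValueError("safe labels must not carry a category tag")
```

Validating the label/category invariant at record-construction time catches malformed annotations before they reach aggregation.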

Live dashboards tracked violation rates by product, modality, and safety category in real time. Weekly reporting identified emerging patterns — for example, a spike in near-miss violent content from a specific model configuration — enabling the client to update their automated filters proactively rather than reactively. Annotator consistency was monitored via calibration sets (pre-labeled items seeded into the workflow at a 5% rate), with annotators falling below 92% agreement on calibration items pulled for retraining.
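The calibration mechanics described above — gold items seeded at a 5% rate, annotators pulled below 92% agreement — can be sketched roughly as follows (function names and the queue representation are illustrative):

```python
import random

CALIBRATION_RATE = 0.05   # fraction of the final stream that is gold items
AGREEMENT_FLOOR = 0.92    # annotators below this are pulled for retraining

def seed_calibration(queue, gold_items, rate=CALIBRATION_RATE, rng=None):
    """Interleave pre-labeled gold items into the annotation queue so that
    roughly `rate` of the resulting stream is calibration items."""
    rng = rng or random.Random(0)
    out = []
    for item in queue:
        out.append(item)
        # Appending with probability rate/(1-rate) after each real item
        # yields an expected gold fraction of `rate` in the final stream.
        if rng.random() < rate / (1 - rate):
            out.append(rng.choice(gold_items))
    return out

def needs_retraining(annotator_labels, gold_labels):
    """True when agreement on calibration items falls below the floor."""
    agree = sum(a == g for a, g in zip(annotator_labels, gold_labels))
    return agree / len(gold_labels) < AGREEMENT_FLOOR
```

Seeding probabilistically rather than at fixed intervals keeps calibration items unpredictable to annotators, which is what makes the agreement score a trustworthy quality signal.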

01 Sample: Select outputs that passed automated moderation
02 Annotate: Binary safety label plus category tag
03 Aggregate: Track violation rates vs 2% threshold
04 Report: Live dashboards for ongoing monitoring
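The four steps above can be sketched as a single audit loop. This is a simplified illustration under stated assumptions: `automated_filter` and `annotate` are hypothetical callables standing in for the client's moderation pipeline and the human annotation step, and category tags are omitted for brevity:

```python
def residual_risk_audit(outputs, automated_filter, annotate, threshold=0.02):
    """Four-step residual risk audit: sample post-filter outputs,
    annotate, aggregate the violation rate, report vs the threshold."""
    # 01 Sample: only outputs that PASSED automated moderation are reviewed
    sampled = [o for o in outputs if automated_filter(o) == "PASSED"]
    # 02 Annotate: human binary safety label (category tag omitted here)
    labels = [annotate(o) for o in sampled]
    # 03 Aggregate: residual violation rate among post-filter outputs
    rate = sum(label == "violation" for label in labels) / max(len(labels), 1)
    # 04 Report: flag whether the 2% threshold is holding
    return {"sampled": len(sampled),
            "violation_rate": rate,
            "within_threshold": rate < threshold}
```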
// RESULTS
241K+ Safety annotations completed
<2% Violation rate maintained below threshold
Multi-modal Coverage across text and video outputs
92%+ Annotator calibration agreement maintained
// IMPACT

The annotation program provided the client with defensible, quantitative evidence of safety performance — residual violation rates broken down by product, modality, and safety category — that satisfied both internal governance requirements and external audit requests from enterprise customers. The pattern analysis surfaced three systematic gaps in their automated moderation pipeline that had not been detected through aggregate metrics alone, leading to targeted model retraining that reduced residual violation rates by an estimated 40% in subsequent measurement cycles.

// SAMPLE DATA

Representative record from the annotation pipeline.

residual_risk_audit.json
// ANNOTATION VOLUME
241,000+ safety annotations across text & video
<2% Violation Rate
92%+ Calibration
5% Calibration Seed Rate
// VIOLATION TAXONOMY (5 CATEGORIES)
⚠️ Nudity / NSFW: 0.8%
🛡️ Violence & Gore: 0.5%
🚫 Hate Speech: 0.4%
❤️‍🩹 Self-Harm: 0.2%
🔒 Illegal Activity: 0.3%
// RESIDUAL RISK AUDIT PIPELINE
01 Sample: Post-filter outputs
02 Annotate: Binary + category tag
03 Aggregate: Track vs 2% threshold
04 Report: Live dashboards
// SAMPLE ANNOTATION
AUTOMATED: PASSED · HUMAN: VIOLATION · CATEGORY: VIOLENCE_GORE
Modality: Video
Confidence: High
Annotator Cal.: 94.2%
Batch: batch-047
Calibration Seed: No
Status: Completed
// JSON_RESPONSE
{
  "annotation_id": "aff737b7-5ab8-4b1e-986c-442b545668e7",
  "modality": "video",
  "automated_filter_result": "PASSED",
  "human_label": "violation",
  "violation_category": "violence_gore",
  "confidence_tier": "high",
  "annotator_calibration": "94.2%",
  "batch_id": "batch-2025-12-06-047",
  "status": "completed"
}
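A consumer of these records can apply a light schema check before aggregation. This is a sketch, not the client's actual validation code; the field set is taken from the sample record above, and the validator simply confirms the defining property of a residual-risk finding: the output passed the automated filter but was labeled a violation by a human:

```python
import json

# Field names taken from the sample record above
REQUIRED_FIELDS = {"annotation_id", "modality", "automated_filter_result",
                   "human_label", "violation_category", "confidence_tier",
                   "annotator_calibration", "batch_id", "status"}

def is_residual_risk_finding(raw: str) -> bool:
    """True when `raw` is a complete record describing a residual-risk
    finding: passed the automated filter, human-labeled a violation."""
    record = json.loads(raw)
    if not REQUIRED_FIELDS <= record.keys():
        return False
    return (record["automated_filter_result"] == "PASSED"
            and record["human_label"] == "violation")
```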

Ready to build your next dataset?

Tell us about your project and we will scope a plan within 48 hours.