Human Red Teaming Data for AI Safety and EU AI Act Compliance

Automated scanners find the vulnerabilities they were programmed to find. Expert human red teamers find the vulnerabilities that matter — the novel attack vectors, social engineering exploits, and multi-step jailbreaks that automated tools miss entirely. Claru delivers structured adversarial testing data that satisfies EU AI Act requirements and closes the safety gaps that automated red teaming leaves open.

Why Do Automated Red Teaming Tools Miss Critical Vulnerabilities?

Automated red teaming tools miss critical vulnerabilities because they operate within predefined attack taxonomies. They test for known jailbreak patterns and prompt injection templates, but they cannot invent novel attack strategies the way a skilled adversarial tester can. Anthropic's research on Constitutional Classifiers demonstrated this gap directly: without defenses, 86% of jailbreak attempts succeeded across 183 human participants over 3,000+ hours of testing. With Constitutional Classifiers deployed, that rate dropped to 4.4% — but the attacks that still succeeded were creative multi-step strategies that no automated scanner had flagged. The 4.4% residual rate represents exactly the kind of sophisticated vulnerability that requires human adversarial reasoning to discover and characterize.

OpenAI's approach to external red teaming reinforces this pattern. Their red teaming methodology relies on domain experts — not crowdsourced testers — because effective adversarial testing requires understanding the model's capabilities well enough to probe its boundaries in ways the developers did not anticipate. A cybersecurity researcher probing a code generation model will discover different failure modes than an automated fuzzer running template-based attacks, because the researcher understands the downstream consequences of generated code in ways that a pattern-matching system cannot.

[1][2]

What Does the EU AI Act Require for Red Teaming?

The EU AI Act creates binding obligations for adversarial testing of high-risk and general-purpose AI systems. Article 55 requires providers of general-purpose AI models with systemic risk to conduct adversarial testing, including red teaming, to identify and mitigate risks. Article 99 establishes enforcement with fines up to 35 million euros or 7% of global annual turnover — whichever is higher — for non-compliance. Full enforcement begins August 2026, giving providers a defined window to build compliant testing programs. The regulation does not prescribe specific red teaming methodologies, but it requires that testing be proportionate to the risk profile of the system and that results be documented for regulatory review.

The NIST AI Risk Management Framework provides complementary guidance, recommending structured adversarial evaluation as part of the Measure function — testing AI systems against known and emergent threat categories with documented results and remediation plans. Executive Order 14110 further defined AI red teaming as structured testing to identify flaws and vulnerabilities, though it was rescinded in January 2025. The regulatory direction is clear: providers need documented, repeatable adversarial testing — not ad hoc internal reviews — to demonstrate compliance.

[3][4][5]

How Do Different Red Teaming Approaches Compare for AI Safety?

Red teaming approaches range from fully automated vulnerability scanners to expert-led adversarial testing programs. The trade-off is between coverage breadth and vulnerability depth — automated tools test thousands of known patterns quickly, while expert testers discover novel attack vectors that no scanner has been trained on. For EU AI Act compliance, the question is whether your testing methodology produces audit-ready documentation of both known and emergent risks.

Automated Red Teaming Tools

Scale: 10,000+ attacks per hour
Tasks: Known jailbreak patterns, prompt injection templates, toxicity probes
Environments: Text-only, single-turn
Limitations: Cannot discover novel attack vectors; limited to predefined taxonomies; no multi-step reasoning; miss social engineering and context-dependent exploits

Crowdsourced Red Teaming

Scale: 100-500 testers
Tasks: Broad vulnerability discovery, creative prompt attacks
Environments: Text, some multi-modal
Limitations: Inconsistent tester quality; no domain expertise requirement; high noise-to-signal ratio; limited coverage of specialized domains (code, biomedical, legal)

Expert Human Red Teaming

Scale: 10-50 specialists
Tasks: Novel attack discovery, multi-step jailbreaks, domain-specific exploits
Environments: Multi-modal, multi-turn, cross-system
Limitations: Higher cost per test; slower throughput than automated tools; requires careful specialist recruitment

Claru Expert Red Teaming

Scale: Calibrated teams of 10-50+ specialists
Tasks: Structured adversarial testing, EU AI Act compliance documentation, residual risk quantification, threshold calibration
Environments: Multi-modal (text, image, video), multi-turn, product-context-aware
Limitations: Requires 2-3 week ramp-up for novel safety taxonomies; best suited for systematic programs rather than one-off scans

Building and Red-Teaming an AI Content Moderation System

<2% output rejection rate with full safety coverage
3 detection models calibrated (NSFW, celebrity, IP)
2 product contexts (consumer strict, enterprise permissive)
0 critical safety gaps remaining after red team cycles

We decomposed the moderation pipeline into discrete visual and text classification models — NSFW detection, celebrity likeness recognition, and IP likeness detection — each with independent confidence thresholds. Rather than applying a single binary filter, we defined category-level rulings and conjunction-based logic: a piece of content could be flagged by one model but cleared by another depending on the product context and risk profile. Confidence thresholds were calibrated per category using labeled datasets, with separate configurations for consumer-facing products (stricter) and enterprise APIs (more permissive).
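
To make the threshold calibration and conjunction logic concrete, here is a minimal Python sketch of per-category thresholds with separate consumer and enterprise profiles and a combined ruling. The threshold values, score keys, and the specific conjunction rule are illustrative assumptions, not the calibrated production configuration.

```python
from dataclasses import dataclass

# Per-category confidence thresholds. The numbers below are illustrative
# placeholders, not the calibrated production values.
@dataclass(frozen=True)
class ThresholdProfile:
    nsfw: float
    celebrity: float
    ip_likeness: float

# Stricter thresholds for consumer-facing products, more permissive for
# enterprise APIs, mirroring the two product contexts in the case study.
PROFILES = {
    "consumer": ThresholdProfile(nsfw=0.35, celebrity=0.50, ip_likeness=0.50),
    "enterprise": ThresholdProfile(nsfw=0.60, celebrity=0.75, ip_likeness=0.75),
}

def moderate(scores: dict[str, float], product_context: str) -> dict:
    """Apply category-level rulings and a conjunction-based final decision.

    `scores` maps each detector ("nsfw", "celebrity", "ip_likeness") to a
    confidence in [0, 1]. A category flags when its score meets the threshold
    for the given product context; the final ruling combines the per-category
    flags instead of applying a single binary filter.
    """
    profile = PROFILES[product_context]
    flags = {
        "nsfw": scores["nsfw"] >= profile.nsfw,
        "celebrity": scores["celebrity"] >= profile.celebrity,
        "ip_likeness": scores["ip_likeness"] >= profile.ip_likeness,
    }
    # Illustrative conjunction rule: NSFW content is rejected outright, while
    # likeness concerns alone are escalated for human review.
    if flags["nsfw"]:
        ruling = "reject"
    elif flags["celebrity"] or flags["ip_likeness"]:
        ruling = "review"
    else:
        ruling = "allow"
    return {"flags": flags, "ruling": ruling}

# The same scores can clear in one product context and be flagged in another.
scores = {"nsfw": 0.40, "celebrity": 0.10, "ip_likeness": 0.20}
print(moderate(scores, "consumer"))    # ruling: "reject" (0.40 >= 0.35)
print(moderate(scores, "enterprise"))  # ruling: "allow"  (0.40 <  0.60)
```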

Read Full Case Study

Scaling Generative AI Safety Through Human-Led Data Labeling

241K+ safety annotations completed
<2% violation rate maintained below threshold
Multi-modal coverage across text and video outputs
92%+ annotator calibration agreement maintained

We built a high-throughput, quality-controlled annotation workflow focused exclusively on residual risk — only reviewing outputs that had already passed the client's automated moderation pipeline. This design choice was deliberate: the goal was not to replicate automated filtering but to measure its failure rate and characterize the types of violations it misses. Annotators evaluated text and video outputs against a multi-dimensional safety taxonomy covering nudity/NSFW content, violence and gore, hate speech and harassment, self-harm, and illegal activity.
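
As a rough illustration of the measurement side of that workflow, the sketch below computes overall and per-category violation rates over outputs that already passed automated moderation. The record layout, the slugged category names, the Wilson interval estimator, and the 2% threshold check are assumptions made for illustration, not the program's actual tooling.

```python
import math
from collections import Counter

# Slugged identifiers for the taxonomy categories described above (illustrative names).
CATEGORIES = ["nsfw", "violence_gore", "hate_harassment", "self_harm", "illegal_activity"]

def wilson_interval(violations: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial rate (one reasonable estimator choice)."""
    if n == 0:
        return (0.0, 0.0)
    p = violations / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

def residual_risk_report(annotations: list[dict]) -> dict:
    """Summarize residual risk from annotations of post-filter outputs.

    Each record looks like {"id": ..., "violations": ["nsfw", ...]}; an empty
    list means the annotator found no violation in an output that the
    automated pipeline had already cleared.
    """
    n = len(annotations)
    violating = sum(1 for a in annotations if a["violations"])
    per_category = Counter(cat for a in annotations for cat in a["violations"])
    overall_rate = violating / n if n else 0.0
    return {
        "reviewed": n,
        "overall_violation_rate": overall_rate,
        "overall_rate_ci95": wilson_interval(violating, n),
        "per_category_rate": {c: per_category.get(c, 0) / n for c in CATEGORIES} if n else {},
        "below_2pct_threshold": overall_rate < 0.02,
    }

# Toy example with two reviewed outputs, one clean and one flagged for violence.
print(residual_risk_report([
    {"id": "out-001", "violations": []},
    {"id": "out-002", "violations": ["violence_gore"]},
]))
```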

Read Full Case Study
Annotator teams across multiple countries, annotations delivered at scale, same-day QA turnaround.

Frequently Asked Questions

What does the EU AI Act require for AI red teaming?

The EU AI Act requires providers of general-purpose AI models with systemic risk to conduct adversarial testing, including red teaming, under Article 55. Article 99 establishes fines up to 35 million euros or 7% of global annual turnover for non-compliance. Full enforcement begins August 2026. The regulation requires that testing be proportionate to the system's risk profile and that results be documented for regulatory review — meaning providers need structured, repeatable adversarial testing programs with audit-ready output.

What can human red teaming find that automated tools cannot?

Human red teaming discovers novel attack vectors that automated scanners cannot find because automated tools are limited to predefined attack taxonomies. Anthropic's Constitutional Classifiers research showed that even after reducing jailbreak success from 86% to 4.4%, the remaining successful attacks were creative multi-step strategies no automated scanner had flagged. Human red teamers combine domain expertise with adversarial reasoning to probe boundaries in ways developers did not anticipate — particularly for social engineering, context-dependent exploits, and multi-modal attacks.

How much red teaming and safety annotation volume can Claru deliver?

Claru has delivered 241,000+ safety annotations in a single program across text and video modalities, with annotator calibration maintained above 92% agreement on gold-standard sets. Volume depends on scope — a focused moderation system red team may involve thousands of structured adversarial tests across 3 detection models, while a comprehensive residual risk assessment spans hundreds of thousands of post-filter annotations. Claru scales teams from 10 to 50+ calibrated specialists depending on the engagement.
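
As a simple sketch of how gold-set calibration might be tracked against a 92% bar, the snippet below scores each annotator by exact-match agreement with gold labels. Exact-match agreement, the data shapes, and the function names are illustrative assumptions rather than Claru's internal calibration tooling.

```python
def gold_set_agreement(annotator_labels: dict, gold_labels: dict) -> float:
    """Fraction of gold-standard items where the annotator's label matches the gold label."""
    assert set(gold_labels) <= set(annotator_labels), "annotator must label every gold item"
    matches = sum(1 for item_id, gold in gold_labels.items() if annotator_labels[item_id] == gold)
    return matches / len(gold_labels)

def calibration_report(team_labels: dict, gold_labels: dict, threshold: float = 0.92) -> dict:
    """Flag annotators whose gold-set agreement falls below the calibration threshold."""
    scores = {name: gold_set_agreement(labels, gold_labels) for name, labels in team_labels.items()}
    return {
        "scores": scores,
        "needs_recalibration": [name for name, s in scores.items() if s < threshold],
    }
```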

What safety categories do Claru's taxonomies cover?

Claru's safety taxonomies cover nudity and NSFW content, violence and gore, hate speech and harassment, self-harm, illegal activity, celebrity and IP likeness, and domain-specific risk categories defined collaboratively with each client's trust and safety team. Each category includes detailed policy definitions with visual exemplars, boundary cases, and guidance on culturally variable norms. Taxonomies are versioned so that policy updates can be deployed to annotator teams within 3-5 business days.
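
As an illustration of how such a versioned taxonomy might be represented so that policy updates ship to annotator teams as discrete versions, here is a minimal Python sketch. The field names, version scheme, and diff helper are hypothetical, not Claru's internal schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SafetyCategory:
    """One taxonomy category and the policy material annotators work from."""
    name: str                                                 # e.g. "hate_harassment"
    policy_definition: str                                    # detailed written policy
    exemplars: list = field(default_factory=list)             # references to visual/text exemplars
    boundary_cases: list = field(default_factory=list)        # documented edge cases
    cultural_notes: str = ""                                  # guidance on culturally variable norms

@dataclass(frozen=True)
class SafetyTaxonomy:
    """A versioned taxonomy; a policy update becomes a new version pushed to annotator teams."""
    version: str                        # e.g. "2.3.0"
    effective_date: str                 # when the version goes live for annotation
    categories: dict                    # category name -> SafetyCategory

    def changed_since(self, previous: "SafetyTaxonomy") -> list:
        """Categories added or changed since a previous version, for targeted re-briefing."""
        return [
            name for name, cat in self.categories.items()
            if previous.categories.get(name) != cat
        ]
```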

Can Claru's testing documentation satisfy both the EU AI Act and the NIST AI RMF?

Yes. Claru's structured adversarial testing produces documentation that maps to both EU AI Act requirements (Articles 55 and 99) and the NIST AI Risk Management Framework's Measure function. The output includes quantified residual risk metrics, failure mode categorization, remediation tracking, and audit-ready reports — evidence artifacts that satisfy regulatory review regardless of which framework the auditor is applying. This dual-framework approach is particularly relevant for companies operating in both EU and US markets.

// INITIATE

Your next hire isn't a vendor. It's a data team.

Tell us what you're training. We'll scope the dataset.


Or email us directly at [email protected]


References

1. Anthropic. "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming." Anthropic Research, 2025. Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4% across 183 participants and 3,000+ hours of adversarial testing.
2. OpenAI. "OpenAI's Approach to External Red Teaming for AI Models and Systems." OpenAI Research, 2025. Effective red teaming requires domain experts rather than crowdsourced testers, because adversarial testing demands understanding model capabilities deeply enough to probe unanticipated boundaries.
3. European Parliament and Council. "Regulation (EU) 2024/1689 — Artificial Intelligence Act." Official Journal of the European Union, 2024. Article 55 requires adversarial testing (including red teaming) for general-purpose AI with systemic risk; Article 99 establishes fines up to 35 million euros or 7% of global turnover for non-compliance, with full enforcement from August 2026.
4. National Institute of Standards and Technology. "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." NIST, 2023. The Measure function recommends structured adversarial evaluation as part of AI risk management — testing systems against known and emergent threat categories with documented results and remediation plans.
5. The White House. "Executive Order 14110: Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." Federal Register, 2023. Defined AI red teaming as structured testing to identify flaws and vulnerabilities in AI systems; rescinded January 2025 but established the terminology now used across regulatory frameworks.