Explore TopicFolio posts tagged #red-teaming. 5 public posts indexed. Includes activity from AI Safety. Related folio: AI Safety Notes.
Before I trust a safety strategy at scale, I want to see documented risks, recurring eval coverage, named owners for mitigations, and a record of at least a few launch or scope decisions that changed because of the findings. That is what separates a safety practice from a safety posture deck.
Three evaluation axes to compare:
- clarity of the threat model
- repeatability of the evaluation process
- evidence that the findings change deployment choices
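As an illustration, the three axes above can be turned into a small comparison scorecard. This is a hypothetical sketch; the strategy names, axis keys, and 0-3 ratings are invented for the example, not taken from any real assessment.

```python
# Hypothetical scorecard: rate each candidate safety strategy 0-3 on the
# three axes above, then rank by total. All names and scores are illustrative.
AXES = ("threat_model_clarity", "eval_repeatability", "deployment_influence")

def score(strategy: dict) -> int:
    """Sum the per-axis ratings; a missing axis counts as zero."""
    return sum(strategy.get(axis, 0) for axis in AXES)

strategies = {
    "vendor_posture_deck": {
        "threat_model_clarity": 1, "eval_repeatability": 0, "deployment_influence": 0,
    },
    "recurring_eval_harness": {
        "threat_model_clarity": 2, "eval_repeatability": 3, "deployment_influence": 2,
    },
}

ranked = sorted(strategies, key=lambda name: score(strategies[name]), reverse=True)
print(ranked[0])  # prints "recurring_eval_harness"
```

The point of the exercise is less the totals than forcing a zero onto any axis where no evidence exists.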
Review materials:
- Inspect documentation: inspect.aisi.org.uk/
One of the best places to see evaluation design turned into runnable workflows.
- AI RMF Playbook: airc.nist.gov/AI_RMF_Knowledge_Base/Playbook
The most useful NIST material when a team needs implementation moves, not just principles.
- Inspect source: github.com/UKGovernmentBEIS/inspect_ai
Open source evaluation framework from the UK AI Security Institute.
Save the strongest examples, scorecards, and decision memos in this folio so future teammates can see what good evaluation looked like at the time.
The hard public questions are about threshold-setting: what evidence should be required before launch, how much outside scrutiny is enough, and when a voluntary practice stops being a sufficient answer. Those arguments are productive when people bring operating context rather than ideology alone.
Three questions worth debating:
- what a meaningful pre-deployment safety bar should look like
- how much model access external evaluators need
- where voluntary frameworks stop being enough
Background reading before you take a strong stance:
- NIST AI Risk Management Framework: nist.gov/itl/ai-risk-management-framework
Useful for building a shared vocabulary across engineering, policy, and operations.
- Anthropic research archive: anthropic.com/research
A strong public record of how a frontier lab discusses evaluations, misuse, and controls.
- Anthropic video archive: youtube.com/@AnthropicAI/videos
Talks and interviews that help connect research language to deployment reality.
When you respond, include the environment you are optimizing for. Advice changes a lot across stage, regulation, team size, and user expectations.
A usable safety starter pack should have one framework, one research archive, one evaluation tool, and one red-teaming toolkit. That mix gives people language, examples, executable tests, and a reminder that adversarial work needs its own craft, not just more benchmark rows.
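A minimal sketch of that composition rule: treat the starter pack as tagged items and verify that every required category is covered. The category names and pack entries are illustrative, chosen to match the resources in this post.

```python
# Check that a starter pack covers all four required categories.
# Category labels and pack entries are illustrative, not a fixed taxonomy.
REQUIRED = {"framework", "research_archive", "eval_tool", "red_team_toolkit"}

pack = [
    ("NIST AI RMF", "framework"),
    ("Anthropic research archive", "research_archive"),
    ("Inspect", "eval_tool"),
    ("PyRIT", "red_team_toolkit"),
]

def missing_categories(items):
    """Return the required categories not covered by any item."""
    covered = {category for _, category in items}
    return REQUIRED - covered

print(sorted(missing_categories(pack)))      # prints "[]": complete pack
print(sorted(missing_categories(pack[:3])))  # drop PyRIT: ['red_team_toolkit']
```

A check like this is mostly useful as a forcing function: it makes the gap visible when a team has three frameworks saved and no executable tests.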
The kinds of materials worth saving in this space:
- governance frameworks with concrete implementation guidance
- evaluation reports that describe methods and limitations
- incident retrospectives that explain organizational response
Read:
- NIST AI Risk Management Framework: nist.gov/itl/ai-risk-management-framework
Useful for building a shared vocabulary across engineering, policy, and operations.
- Anthropic research archive: anthropic.com/research
A strong public record of how a frontier lab discusses evaluations, misuse, and controls.
- Inspect documentation: inspect.aisi.org.uk/
One of the best places to see evaluation design turned into runnable workflows.
Documents and downloadable guides:
- AI RMF Playbook: airc.nist.gov/AI_RMF_Knowledge_Base/Playbook
The most useful NIST material when a team needs implementation moves, not just principles.
- NIST Generative AI Profile: airc.nist.gov/AI_RMF_Knowledge_Base/AI_RMF_Ge...
Helpful for teams mapping generative-AI-specific risks onto the broader framework.
Watch:
- Anthropic video archive: youtube.com/@AnthropicAI/videos
Talks and interviews that help connect research language to deployment reality.
Build or inspect:
- Inspect source: github.com/UKGovernmentBEIS/inspect_ai
Open source evaluation framework from the UK AI Security Institute.
- PyRIT: github.com/Azure/PyRIT
A practical red-teaming toolkit for testing risky prompt and tool behaviors.
Image references:
- AI RMF knowledge base: airc.nist.gov/AI_RMF_Knowledge_Base/
Framework visuals and navigable references that are easier to browse than a single PDF.
NIST gives teams a language for risk management, Anthropic's research archive shows how frontier labs reason about evaluations, and Inspect gives you something concrete to run. Together they make the work feel operational instead of ceremonial.
The stack categories worth comparing here:
- evaluation harnesses and benchmark management
- policy and review workflows
- incident logging and response tooling
Open materials worth opening side by side:
- Inspect source: github.com/UKGovernmentBEIS/inspect_ai
Open source evaluation framework from the UK AI Security Institute.
- PyRIT: github.com/Azure/PyRIT
A practical red-teaming toolkit for testing risky prompt and tool behaviors.
- NIST AI Risk Management Framework: nist.gov/itl/ai-risk-management-framework
Useful for building a shared vocabulary across engineering, policy, and operations.
Working documents and guides:
- AI RMF Playbook: airc.nist.gov/AI_RMF_Knowledge_Base/Playbook
The most useful NIST material when a team needs implementation moves, not just principles.
- NIST Generative AI Profile: airc.nist.gov/AI_RMF_Knowledge_Base/AI_RMF_Ge...
Helpful for teams mapping generative-AI-specific risks onto the broader framework.
Release gate checklist:
release_gate:
  model_family: frontier-assistant-v3
  reviewed_harms:
    - unsafe professional advice
    - jailbreak resilience
    - sensitive data leakage
  recurring_evals:
    cadence: weekly
    owners:
      - safety
      - applied_ml
  blocking_findings:
    severity: critical_or_high
    unresolved_count_must_equal: 0

Good safety work stops looking like a side spreadsheet as soon as it is tied to an actual release gate. The strongest public material in this area is useful because it connects threat models, evaluations, and deployment choices instead of treating them as separate essays.
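A sketch of how that gate could be enforced in code. The severity rule mirrors the `critical_or_high` / `unresolved_count_must_equal: 0` fields in the checklist above; the findings data is hypothetical.

```python
# Evaluate the release gate: block the release if any critical- or
# high-severity finding remains unresolved. Finding records are illustrative.
BLOCKING_SEVERITIES = {"critical", "high"}

def gate_open(findings):
    """Return True only when no blocking-severity finding is unresolved."""
    unresolved = [
        f for f in findings
        if f["severity"] in BLOCKING_SEVERITIES and not f["resolved"]
    ]
    return len(unresolved) == 0  # unresolved_count_must_equal: 0

findings = [
    {"id": "jailbreak-variant-12", "severity": "high", "resolved": True},
    {"id": "pii-echo-in-tool-call", "severity": "critical", "resolved": False},
    {"id": "tone-drift", "severity": "low", "resolved": False},
]

print(gate_open(findings))  # prints "False": one critical finding is still open
```

Low-severity findings stay visible in the log but do not block; the gate only closes on the severities the checklist names.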
Three signals I would keep in view:
- Safety work gets more durable when it is tied to release decisions, not a side spreadsheet.
- Red teaming matters most when findings change policy, tooling, or rollout gates.
- The highest-value evaluations usually combine misuse risk with normal product tasks.
Read first:
- NIST AI Risk Management Framework: nist.gov/itl/ai-risk-management-framework
Useful for building a shared vocabulary across engineering, policy, and operations.
- Anthropic research archive: anthropic.com/research
A strong public record of how a frontier lab discusses evaluations, misuse, and controls.
Documents worth saving:
- AI RMF Playbook: airc.nist.gov/AI_RMF_Knowledge_Base/Playbook
The most useful NIST material when a team needs implementation moves, not just principles.
- NIST Generative AI Profile: airc.nist.gov/AI_RMF_Knowledge_Base/AI_RMF_Ge...
Helpful for teams mapping generative-AI-specific risks onto the broader framework.
Watch next:
- Anthropic video archive: youtube.com/@AnthropicAI/videos
Talks and interviews that help connect research language to deployment reality.
If this post is useful, the next contribution should add a real example, a worked document, or a failure case someone else can learn from.