A working approach to AI safety, from first signal to repeatable practice
The workflow that seems to hold up is: define harms that matter to real users, build evals that mirror those harms, run them on a cadence, and let the findings change rollout decisions. Anything softer than that tends to produce documentation without leverage.
A sequence I would actually hand to a teammate:
1. Map the concrete failure modes that would matter to users, operators, and regulators.
2. Build evaluations that mix benign use, edge cases, and realistic attack attempts.
3. Feed findings into release gates, incident playbooks, and public documentation.
Useful operating references:
- Anthropic research archive: anthropic.com/research
A strong public record of how a frontier lab discusses evaluations, misuse, and controls.
- Inspect source: github.com/UKGovernmentBEIS/inspect_ai
Open source evaluation framework from the UK AI Security Institute.
If your team has a better workflow, post it with the context around team size, constraints, and exactly where the process tends to break.