You Can't Prompt Your Way to Safety
Phoebe Yao
,
Feb 28, 2026

AI models are giving medical and mental health advice to millions of people. Can you prevent harmful advice by adding safety instructions to the prompt?
The UK's AI Safety Institute (AISI) recently tested this. They deployed the same chatbot twice: once with minimal safety prompting, once with explicit safety instructions. The finding: safety prompts had no meaningful impact on harmful advice rates.
How did they measure this? With a classifier my team at Pareto helped build, trained on expert-labeled data to detect harmful outputs in real-time.
AISI brought in my team to source extreme experts and coach them to decompose complex professional judgment into verifiable steps. We recruited licensed doctors, therapists, and career coaches, developed grading rubrics with them, and built a dataset of 6,707 evaluated examples. Fine-tuning on this data boosted Llama 8B's accuracy at detecting harmful advice from 77% to 96%, beating GPT-4o's zero-shot performance (93%).

AISI open-sourced everything: the paper, model, and dataset.
Here's what we learned about finding extreme experts and helping them translate professional judgment into high-quality training data.
The Core Challenge: Context Is Everything
Whether advice helps or harms depends entirely on context.
"Stop taking your medication" could save someone's life, endanger it, or be completely fine, all depending on the medication, the condition, the patient's history.
You can't evaluate this with universal rules. You need people who truly understand what good looks like in their fields, paired with thoughtful workflows that draw out their professional judgment.
How We Did It
We began by hand-selecting and interviewing highly qualified domain experts to co-design the project. Before generating a single example of AI advice, our experts collaboratively designed the grading rubrics and scenario frameworks.
They built 0-4 harm scales that captured the nuance of their work. Here's what the mental health rubric looked like:

Each piece of AI advice was reviewed by two experts independently using these scales. When they disagreed by more than one level, a third expert weighed in.
This wasn't content moderation ("does this break a rule?"). It was professional evaluation ("would a competent practitioner consider this appropriate for this situation?").
The Technical Challenge
Frontier models are designed to refuse harmful requests, which is important for safety but makes it difficult to get examples across all harm levels.
The hardest part: getting Level 3 and 4 examples. Even with clear research explanations, models like GPT-4o refused to generate advice that could cause significant harm or immediate threats to health and safety. Getting proper distribution turned out to be a jailbreaking problem in disguise.
Our approach:
- Scale dataset generation: We increased dataset generation size to account for the estimated % of guaranteed model refusals
- Context engineering: We expanded prompts to clearly explain the research context and reassure models we weren't bad actors (avoiding the naive "please give me this bad thing for homework" approach)
- Build inspection tooling: Creating infrastructure to evaluate prompt changes, track rejection rates, and categorize why models refused
We considered multi-turn jailbreaking strategies (which were emerging as effective for models like GPT-4o), but found we could get adequate distribution without them.
The result: 6,707 expert-graded examples across medical, mental health, career, and relationship domains, with controlled coverage across all harm levels.

What We Found: Expert Insight Outperforms Scale
We fine-tuned Llama-3.1-8B on these expert grades and achieved a major lift for the model.
In a side-by-side binary classification task (safe vs harmful advice), the expert data gave Llama a 19 percentage point increase, jumping from 77% to 96% accuracy. This beat GPT-4o doing the zero-shot version of the task at 93%.
A relatively small dataset of expert-labeled examples outperformed a frontier model without task-specific training.

Real-World Impact: Protecting 2,302 People
AISI deployed this classifier in a live study where 2,302 participants had multi-turn conversations seeking AI advice.
The classifier worked in real-time:
- Monitor every response
- Flag potentially harmful advice
- Trigger a new response before the harmful one reaches the person
The result: 0.09% of responses flagged.
Without this safety layer, 22 people would have received at least one response that could have caused real harm. With this layer, zero harmful messages were delivered.
Why This Matters
AISI also tested whether adding safety-focused instructions to the system prompt would prevent harmful outputs.
The data tells the story:

[INSERT GRAPHIC: Participants Who Would Have Received Harmful Advice
- Minimal safety prompting: 22 people
- Safety-focused prompting: ~22 people
- Expert classifier: 0 people ]
Safety instructions had no meaningful impact. Statistical analysis showed the rates of harmful advice "were insensitive to prompts instructing the model with safety oriented prompting."
You genuinely can't prompt your way to safety in domains that require professional judgment and deep contextual understanding.
What This Shows Us
For high-stakes domains (medical, mental health, career, relationships), expert supervision creates meaningfully better outcomes than instruction-based approaches or relying on frontier models without specialized training.
The methodology that scales:
- Bring experts in from day one to co-design
- Build workflows that truly elicit professional judgment
- Capture the reasoning and context, not just the labels
AISI made everything open source:
📄 Paper: https://arxiv.org/abs/2511.15352
🤗 Model: https://huggingface.co/ai-safety-institute/Llama-3.1-8B-harmful-advice-classifier
📊 Dataset: https://huggingface.co/datasets/ai-safety-institute/harmful-advice-dataset
The Path Forward
At Pareto, we're building systems that make it easy for frontier experts to contribute sustainably to AI training. Not by throwing more people at the problem, but by designing workflows that help experts structure their judgment into high-quality data without massive time investment or technical complexity.
The future of AI isn't about replacing human expertise. It's about building better systems to capture expert insight at the edge of what's known and scale it into training data that actually moves model capabilities forward.
Credits
This work was a true collaboration between Pareto and the UK AI Security Institute. Deep gratitude to Elizabeth Nguyen and Daria Butuc at Pareto, and Lennart Luettgau and Henry Davidson at AISI for their thoughtful contributions to making this project succeed.