New Paper: Automated Safety Plan Scoring with Large Language Models

Our team has a new paper out in JMIR Mental Health, published on January 8, 2026.

Publication

Title: Automated Safety Plan Scoring in Outpatient Mental Health Settings Using Large Language Models: Exploratory Study

Authors:
Hayoung K Donnelly, Gregory K Brown, Kelly L Green, Ugurcan Vurgun, Sy Hwang, Emily Schriver, Michael Steinberg,
Megan E Reilly, Haitisha Mehta, Christa Labouliere, Maria A Oquendo, David Mandell, Danielle L Mowery

Journal: JMIR Mental Health (2026), Vol 13, e79010
DOI: 10.2196/79010
Link: https://mental.jmir.org/2026/1/e79010

What this paper does

Safety plans are a core part of suicide prevention, but reviewing their quality is time‑consuming and hard to scale. In this study, we built and evaluated an automated Safety Plan Fidelity Rater (SPFR) that uses large language models (LLMs) to score written safety plans.
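To give a flavor of what automated scoring of a single safety plan step can look like, here is a minimal illustrative sketch. It is not the actual SPFR pipeline from the paper: the prompt wording, the 0–2 rubric, the model choice, and the OpenAI client usage are all assumptions made for illustration.

    # Illustrative only: a hypothetical prompt for rating one safety plan step.
    # The SPFR's real prompts, rubric text, and scoring scale are not reproduced here.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    RUBRIC = (
        "Rate the 'warning signs' step of the safety plan below on a 0-2 scale:\n"
        "0 = missing or unusable, 1 = partially specific, 2 = specific and personalized.\n"
        "Respond with the number only."
    )

    def score_step(step_text: str, model: str = "gpt-4") -> int:
        """Ask an LLM to rate a single safety plan step against a simple rubric."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "You are a fidelity rater for suicide prevention safety plans."},
                {"role": "user",
                 "content": f"{RUBRIC}\n\nSafety plan step:\n{step_text}"},
            ],
            temperature=0,
        )
        return int(response.choices[0].message.content.strip())

In practice, the same plan step can be sent to several models and the resulting scores compared against expert ratings, which is the shape of the evaluation described below.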

Data and approach

  • 266 de‑identified outpatient safety plans from New York State
  • Focused on four steps:
    • warning signs
    • internal coping strategies
    • making the environment safe
    • reasons for living
  • Compared three LLMs (GPT‑4, LLaMA 3, o3‑mini)
  • Evaluated different scoring schemes using weighted F1 scores against expert human ratings (see the metric sketch after this list)
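
As a rough illustration of the evaluation setup, this is how a weighted F1 score against expert ratings can be computed with scikit-learn. The rating values below are invented placeholders, not data from the study, and the paper's exact scoring schemes are not reproduced here.

    # Minimal sketch of the evaluation metric: weighted F1 of model scores vs. expert ratings.
    from sklearn.metrics import f1_score

    expert_ratings = [2, 1, 0, 2, 1, 2, 0, 1]   # gold-standard human scores per plan step
    model_ratings  = [2, 1, 1, 2, 0, 2, 0, 1]   # scores produced by an LLM rater

    # "weighted" averages the per-class F1 scores, weighting each class by its support,
    # which matters when the score categories are imbalanced.
    weighted_f1 = f1_score(expert_ratings, model_ratings, average="weighted")
    print(f"Weighted F1: {weighted_f1:.3f}")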

Key findings

  • LLaMA 3 and o3‑mini outperformed GPT‑4 overall
  • No single model was best for every step
  • Simpler 3‑point scoring systems performed better than the original 4‑point scale
  • Results suggest step‑specific model choices may be most effective

Why it matters

This exploratory work suggests that LLMs can assess the quality of written safety plans and could support faster, more consistent feedback for clinicians, without replacing human judgment. It's an early step toward scalable quality monitoring in suicide prevention.

If you’re interested in clinical NLP, LLM evaluation, or mental health applications, feel free to reach out.
