AIM-FM Workshop @ NeurIPS 2024: AI for Cardiac Arrest Identification
Our team will be presenting our research at the AIM-FM Workshop: Advancements In Medical Foundation Models, co-located with NeurIPS’24, to be held on December 14, 2024, at the Vancouver Convention Center.
Reliability in AI-Assisted Critical Care: Assessing Large Language Model Robustness and Instruction Following for Cardiac Arrest Identification
Authors: Uğurcan Vurgun, PhD, MA; Aarthi Kaviyarasu, BS; Sy Hwang, MS; Ana Acevedo, BA; Danielle L. Mowery, PhD, MS, MS; Benjamin S. Abella, MD, MPhil; Oscar J. L. Mitchell, MD, MSCE
In this study, we systematically evaluated the ability of large language models (LLMs) to accurately detect in-hospital cardiac arrest (IHCA) events using clinical data. Given the critical nature of IHCA detection, our goal was to assess not only the accuracy of these models but also their robustness across different datasets and their ability to consistently follow clinical instructions. We evaluated a total of 51 open-source models and used GPT-4 as a benchmark for performance comparison.
Our analysis focuses on three key areas:
- Accuracy: We examined how well each model performed in detecting cardiac arrest events from unstructured clinical notes.
- Robustness: Performance stability was measured with confidence intervals derived from non-parametric bootstrapping, highlighting how consistent each model remained across resamples (a minimal illustration of this procedure follows this list).
- Instruction Following: The models were tested on their ability to adhere to structured prompts that mimicked clinical workflows, ensuring that their outputs were suitable for real-world clinical use.
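For readers curious about the robustness analysis, the snippet below is a minimal sketch of how per-model confidence intervals can be derived with non-parametric bootstrapping. It is not our evaluation code: the function name, the toy labels, the 1,000-resample count, and the use of scikit-learn's f1_score are illustrative assumptions.

```python
# Minimal sketch (not the study's actual pipeline): percentile bootstrap CI for F1.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(seed=0)

def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05):
    """Resample note-level predictions with replacement and return the
    F1 point estimate plus a percentile confidence interval."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample note indices with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    lower = np.percentile(scores, 100 * alpha / 2)
    upper = np.percentile(scores, 100 * (1 - alpha / 2))
    return f1_score(y_true, y_pred, zero_division=0), (lower, upper)

# Toy example: 1 = IHCA event present in the note, 0 = no event (hypothetical labels).
gold  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
preds = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
point, (lo, hi) = bootstrap_f1_ci(gold, preds)
print(f"F1 = {point:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

The same resampling loop can wrap any metric (precision, recall, or an instruction-adherence rate), which makes it a convenient way to compare the stability of many models on a common footing.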
The results showed that GPT-4 achieved the highest overall performance, with an F1 score of 0.90 ± 0.01. Several open-source models, such as Mistral-Nemo-Instruct-2407 and Meta-Llama-3.1-8B-Instruct, delivered competitive accuracy and precision, though with greater performance variability. Interestingly, general-purpose models often outperformed medical-specific models on instruction-following tasks, emphasizing the importance of domain-agnostic instruction tuning.
This research highlights the potential of AI-driven tools in critical care settings, with the promise of automating and enhancing the identification of life-threatening conditions like IHCA. However, our findings also underline the need for further development in terms of model consistency and interpretability before widespread clinical deployment can occur.
If you’re interested in learning more about our study or would like to discuss potential collaborations, feel free to reach out. Here is the preprint!