The Medical AI Superintelligence Test and NOHARM: A New Framework for Assessing Clinical Safety in AI Systems


December 11, 2025

Artificial intelligence has become an increasingly common tool in medical decision-making. Physicians consult large language models (LLMs) for diagnostic reasoning, documentation, and summarization; patients use them to interpret symptoms; and health systems continue to integrate them into clinical workflows. Yet a basic question remains insufficiently answered: How safe are these systems when their outputs influence real medical decisions?

A recent initiative under Arise AI, centered around the NOHARM benchmark, offers one of the most rigorous evaluations of clinical safety to date. Its findings, and the broader accountability framework behind it, have implications not only for direct patient care but also for clinical development, medical writing, pharmacovigilance, and regulatory documentation. Importantly, the study highlights patterns of AI failure that closely mirror risks encountered when using LLMs for complex scientific and regulatory work.

 

A benchmark designed around real patient harm

NOHARM evaluates LLMs using one hundred real physician-to-specialist consultation cases across ten specialties. Instead of relying on synthetic questions or knowledge tests, the benchmark measures whether AI-generated recommendations could expose patients to harm. More than 4,000 plausible medical actions were annotated by specialists for clinical appropriateness and potential harm, allowing the framework to assess both errors of commission (unsafe recommendations) and omission (failing to recommend necessary actions).
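
To make the distinction between the two error types concrete, here is a minimal sketch of how such annotations could be scored; the data schema and field names are illustrative assumptions, not the benchmark's actual scoring code.

```python
# Illustrative sketch of scoring commission vs. omission errors against
# specialist annotations. The schema and field names are assumptions,
# not the actual NOHARM scoring code.

def score_case(recommended: set, annotations: dict) -> dict:
    """Compare model-recommended actions with specialist annotations.

    `annotations` maps each plausible action to labels such as
    {"harmful": bool, "essential": bool} (assumed schema).
    """
    commission = {a for a in recommended
                  if annotations.get(a, {}).get("harmful", False)}
    omission = {a for a, labels in annotations.items()
                if labels.get("essential", False) and a not in recommended}
    return {
        "commission_errors": sorted(commission),  # unsafe actions recommended
        "omission_errors": sorted(omission),      # necessary actions missed
        "has_error": bool(commission or omission),
    }

annotations = {
    "action_a": {"harmful": False, "essential": True},
    "action_b": {"harmful": False, "essential": True},
    "action_c": {"harmful": True, "essential": False},
}
print(score_case({"action_a"}, annotations))
# {'commission_errors': [], 'omission_errors': ['action_b'], 'has_error': True}
```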

The benchmark sits within the broader MAST (Medical AI Superintelligence Test) initiative, led by Harvard and Stanford and hosted at bench.arise-ai.org, which aims to provide ongoing public evaluation of LLMs used in healthcare settings. By publishing transparent, comparative performance metrics, including safety, completeness, precision, and harm rates, MAST serves as a standardized accountability structure for medical AI systems.
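
For illustration, the sketch below shows one plausible way case-level results could be rolled up into aggregate metrics of this kind; the definitions here are assumptions made for the example and may differ from the ones MAST publishes.

```python
# Plausible aggregate metrics over per-case results; the exact definitions
# used by MAST/NOHARM are not reproduced here and may differ.

def aggregate(cases: list) -> dict:
    """Each case dict is assumed to carry counts of recommended, appropriate,
    essential, and harmful actions for one consultation."""
    n = len(cases)
    harm_rate = sum(1 for c in cases if c["harmful_recommended"] > 0) / n
    completeness = sum(c["essential_recommended"] / c["essential_total"]
                       for c in cases) / n
    precision = sum(c["appropriate_recommended"] / max(c["recommended_total"], 1)
                    for c in cases) / n
    return {"harm_rate": harm_rate,
            "completeness": completeness,
            "precision": precision}

cases = [
    {"harmful_recommended": 0, "essential_recommended": 3, "essential_total": 4,
     "appropriate_recommended": 5, "recommended_total": 6},
    {"harmful_recommended": 1, "essential_recommended": 2, "essential_total": 2,
     "appropriate_recommended": 3, "recommended_total": 4},
]
print(aggregate(cases))
# {'harm_rate': 0.5, 'completeness': 0.875, 'precision': 0.7916...}
```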

 

Key findings from the study

The results provide a nuanced view of current medical AI capabilities:

  • Harm remains a measurable risk. Some LLMs produced severely harmful recommendations in more than 20% of cases.
  • Omissions are the dominant failure mode. Over three-quarters of severe errors involved missing essential actions rather than giving incorrect ones.
  • Model “strength” does not predict safety. Size, recency, and performance on general AI benchmarks had limited correlation with clinical safety.
  • Top models can outperform physicians. In a subset of cases, the best LLMs demonstrated higher safety and completeness than generalist clinicians.
  • Hybrid systems improve outcomes. Multi-agent configurations, in which one model critiques or revises another, showed materially lower harm rates; a sketch of this pattern appears after this list.
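
The multi-agent finding maps onto a simple pattern: one model drafts, a second critiques, and the first revises. The sketch below outlines that loop; `call_llm`, the model names, and the prompts are placeholders for whatever provider and prompting an organization actually uses, not the configurations evaluated in the study.

```python
# Minimal sketch of a two-agent "draft, critique, revise" loop, one of the
# hybrid patterns found to reduce harm. `call_llm` is a placeholder for the
# LLM client of your choice; prompts and model names are illustrative only.

def call_llm(prompt: str, model: str) -> str:
    raise NotImplementedError("Wire this to your LLM provider of choice.")

def draft_critique_revise(case_summary: str,
                          drafter: str = "model-a",
                          reviewer: str = "model-b") -> str:
    # First pass: drafter proposes recommendations.
    draft = call_llm(
        f"Recommend next clinical actions for this case:\n{case_summary}", drafter)
    # Second pass: reviewer looks for unsafe actions and, especially, omissions.
    critique = call_llm(
        "Review the recommendations below for unsafe actions and omitted "
        f"essential actions.\n\nCase:\n{case_summary}\n\nDraft:\n{draft}", reviewer)
    # Third pass: drafter revises in light of the critique.
    revised = call_llm(
        f"Revise the draft to address this critique.\n\nDraft:\n{draft}\n\n"
        f"Critique:\n{critique}", drafter)
    return revised  # still subject to human review before any clinical use
```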

Collectively, these findings emphasize that clinical safety must be evaluated directly; it cannot be inferred from general intelligence or linguistic fluency.

 

Relevance beyond clinical care: Implications for clinical development

Although NOHARM focuses on medical recommendations, its insights apply directly to workflows in clinical development, where LLMs are increasingly used for drafting protocols, summarizing analyses, generating safety narratives, and producing Clinical Study Reports (CSRs). The risk profile is different — regulators, rather than patients, are the primary audience — but the core failure mode identified in NOHARM is the same: AI systems frequently omit essential information while producing text that appears complete.

These omissions can lead to incomplete evidence packages, insufficient traceability, inconsistencies with statistical outputs, and regulatory challenges. The study therefore reinforces the need for structured validation processes when using LLMs in high-stakes regulatory environments.

 

The CSR example: Completeness as a safety criterion

A clinical study report requires comprehensive reporting: methodology, protocol deviations, statistical analyses, safety findings, and linked tables, figures, and listings. While LLMs can streamline drafting and improve clarity, they do not reliably identify which elements are required for regulatory compliance. As NOHARM demonstrates, even highly capable models often omit critical actions or fail to include context necessary for safety.

This parallels the risk in clinical documentation: a well-written but incomplete CSR is not simply inconvenient — it can delay submission timelines, trigger regulatory questions, or obscure important safety signals. Ensuring completeness therefore becomes a core safety requirement.
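
One lightweight way to operationalize this is to check completeness programmatically rather than assume it. The sketch below compares a draft's headings against a required-section checklist; the checklist is an illustrative assumption, loosely modeled on ICH E3 headings, and it complements rather than replaces expert review.

```python
# Illustrative completeness check for a drafted CSR. The checklist below is an
# assumed, simplified list loosely based on ICH E3 headings; a real workflow
# would use the organization's own template and expert review.

REQUIRED_SECTIONS = [
    "Study Objectives",
    "Investigational Plan",
    "Protocol Deviations",
    "Efficacy Evaluation",
    "Safety Evaluation",
    "Tables, Figures and Listings Referenced",
]

def missing_sections(draft_headings: list) -> list:
    """Return required sections absent from a draft's headings (case-insensitive)."""
    present = {h.strip().lower() for h in draft_headings}
    return [s for s in REQUIRED_SECTIONS if s.lower() not in present]

draft = ["Study Objectives", "Investigational Plan", "Efficacy Evaluation"]
print(missing_sections(draft))
# ['Protocol Deviations', 'Safety Evaluation', 'Tables, Figures and Listings Referenced']
```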

 

The necessity of human-in-the-loop systems

One of the clearest insights from the NOHARM study is that hybrid systems outperform both standalone AI models and standalone human reviewers. Multi-agent architectures reduce harmful outputs, and expert human oversight further ensures contextual accuracy, completeness, and regulatory fidelity. In clinical development, this means that LLMs should support — but not replace — experienced medical writers, clinical scientists, statisticians, and safety physicians.

A well-designed workflow leverages AI for efficiency while relying on human expertise for judgment, quality control, and risk mitigation. This aligns with the MAST vision of AI systems operating under ongoing, benchmarked evaluation rather than unmonitored deployment.

 

A path forward: Benchmark-aligned, hybrid AI for regulated medicine

The NOHARM study and the broader Arise AI benchmarking platform represent a shift toward transparent, safety-focused evaluation of medical AI. They show that:

  • Safety and completeness require explicit measurement.
  • Omission is a primary source of AI risk in both clinical and regulatory contexts.
  • Multi-agent and human-in-the-loop systems materially reduce harm.
  • Public, standardized benchmarking supports accountability and informed adoption.

For organizations exploring or deploying AI in clinical development, the message is straightforward: LLMs can accelerate work and improve consistency, but only when embedded within systems designed to detect and mitigate the very risks NOHARM identifies. With rigorous evaluation, hybrid architectures, and expert oversight, AI can be integrated into medical and regulatory workflows in a way that advances both efficiency and safety.

 

Interested in learning more?

Consult the preprint by David Wu et al., “First, do NOHARM: Towards clinically safe large language models,” and explore the interactive NOHARM leaderboard to see model performance.


Manuel Cossio

Director, Innovation and Strategic Consulting

Manuel Cossio is Director, Innovation and Strategic Consulting at Cytel. Manuel is an AI engineer with over a decade of experience in healthcare AI research and development. He currently leads the creation of generative AI solutions aimed at optimizing clinical trials, focusing on hierarchical multi-agent systems with multistage data governance and human-in-the-loop dynamic behavior control.

Manuel has an extensive research background with publications in computer vision, natural language processing, and genetic data analysis. He is a registered Key Opinion Leader at the Digital Medicine Society, a member of the ISPOR Community of Interest in AI, a Generative AI evaluator for the EU Commission, and an AI researcher at the UB-UPC-Barcelona Supercomputing Center.

He holds an M.Sc. in Translational Medicine from Universitat de Barcelona, a Master of Engineering in AI from Universitat Politècnica de Catalunya, and an M.Sc. in Neuroscience from Universitat Autònoma de Barcelona.


