
The Medical AI Superintelligence Test and NOHARM: A New Framework for Assessing Clinical Safety in AI Systems

Artificial intelligence has become an increasingly common tool in medical decision-making. Physicians consult large language models (LLMs) for diagnostic reasoning, documentation, and summarization; patients use them to interpret symptoms; and health systems continue to integrate them into clinical workflows. Yet a basic question remains insufficiently answered: How safe are these systems when their outputs influence real medical decisions?

A recent initiative under Arise AI, centered around the NOHARM benchmark, offers one of the most rigorous evaluations of clinical safety to date. Its findings, and the broader accountability framework behind it, have implications not only for direct patient care but also for clinical development, medical writing, pharmacovigilance, and regulatory documentation. Importantly, the study highlights patterns of AI failure that closely mirror risks encountered when using LLMs for complex scientific and regulatory work.

 

A benchmark designed around real patient harm

NOHARM evaluates LLMs using one hundred real physician-to-specialist consultation cases across ten specialties. Instead of relying on synthetic questions or knowledge tests, the benchmark measures whether AI-generated recommendations could expose patients to harm. More than 4,000 plausible medical actions were annotated by specialists for clinical appropriateness and potential harm, allowing the framework to assess both errors of commission (unsafe recommendations) and omission (failing to recommend necessary actions).
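The commission/omission distinction can be made concrete with a toy scoring function. This is only a sketch: the field names, labels, and example actions are illustrative assumptions, not NOHARM's actual annotation schema.

```python
def safety_metrics(cases):
    """Count errors of commission (harmful actions recommended) and
    omission (necessary actions missed) across annotated cases."""
    commission = omission = 0
    for case in cases:
        recommended = case["recommended"]     # actions the model suggested
        annotations = case["annotations"]     # action -> specialist label
        for action in recommended:
            if annotations.get(action) == "harmful":
                commission += 1               # unsafe recommendation made
        for action, label in annotations.items():
            if label == "necessary" and action not in recommended:
                omission += 1                 # essential action left out
    return {"commission_errors": commission,
            "omission_errors": omission,
            "cases": len(cases)}

# One invented consultation case: the model suggests a harmful action
# and misses a necessary one.
cases = [{"recommended": {"order_ct", "start_anticoagulation"},
          "annotations": {"order_ct": "appropriate",
                          "start_anticoagulation": "harmful",
                          "consult_neurology": "necessary"}}]
print(safety_metrics(cases))
# {'commission_errors': 1, 'omission_errors': 1, 'cases': 1}
```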

The benchmark sits within the broader MAST (Medical AI Superintelligence Test) initiative, led by Harvard and Stanford and hosted on bench.arise-ai.org, which aims to provide ongoing public evaluation of LLMs used in healthcare settings. By publishing comparative and transparent performance metrics — including safety, completeness, precision, and harm rates — MAST serves as a standardized accountability structure for medical AI systems.

 

Key findings from the study

The results provide a nuanced view of current medical AI capabilities:

  • Harm remains a measurable risk. Some LLMs produced severely harmful recommendations in more than 20% of cases.
  • Omissions are the dominant failure mode. Over three-quarters of severe errors involved missing essential actions rather than giving incorrect ones.
  • Model “strength” does not predict safety. Size, recency, and performance on general AI benchmarks had limited correlation with clinical safety.
  • Top models can outperform physicians. In a subset of cases, the best LLMs demonstrated higher safety and completeness than generalist clinicians.
  • Hybrid systems improve outcomes. Multi-agent configurations — where one model critiques or revises another — showed materially lower harm rates.
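The multi-agent pattern behind that last finding can be sketched as a simple draft-then-critique loop. The two functions below are trivial stand-ins for real LLM calls, and the action names are invented:

```python
# Trivial stand-ins for two LLM agents; real systems would call models here.
def draft_model(case):
    """Drafting agent proposes an initial plan (invented action names)."""
    return ["order_imaging", "start_high_dose_steroids"]

def review_model(case, plan):
    """Reviewing agent vetoes actions it judges unsafe and adds
    essential actions the drafter omitted."""
    unsafe = {"start_high_dose_steroids"}
    required = ["order_imaging", "check_renal_function"]
    revised = [a for a in plan if a not in unsafe]
    revised += [a for a in required if a not in revised]
    return revised

case = "acute presentation (placeholder description)"
plan = draft_model(case)
print(review_model(case, plan))
# ['order_imaging', 'check_renal_function']
```

The critique step addresses both failure modes at once: it removes a commission error and repairs an omission.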

Collectively, these findings emphasize that clinical safety must be evaluated directly; it cannot be inferred from general intelligence or linguistic fluency.

 

Relevance beyond clinical care: Implications for clinical development

Although NOHARM focuses on medical recommendations, its insights apply directly to workflows in clinical development, where LLMs are increasingly used for drafting protocols, summarizing analyses, generating safety narratives, and producing Clinical Study Reports (CSRs). The risk profile is different — regulators, rather than patients, are the primary audience — but the core failure mode identified in NOHARM is the same: AI systems frequently omit essential information while producing text that appears complete.

These omissions can lead to incomplete evidence packages, insufficient traceability, inconsistencies with statistical outputs, and regulatory challenges. The study therefore reinforces the need for structured validation processes when using LLMs in high-stakes regulatory environments.

 

The CSR example: Completeness as a safety criterion

A clinical study report requires comprehensive reporting: methodology, protocol deviations, statistical analyses, safety findings, and linked tables, figures, and listings. While LLMs can streamline drafting and improve clarity, they do not reliably identify which elements are required for regulatory compliance. As NOHARM demonstrates, even highly capable models often omit critical actions or fail to include context necessary for safety.

This parallels the risk in clinical documentation: a well-written but incomplete CSR is not simply inconvenient — it can delay submission timelines, trigger regulatory questions, or obscure important safety signals. Ensuring completeness therefore becomes a core safety requirement.

 

The necessity of human-in-the-loop systems

One of the clearest insights from the NOHARM study is that hybrid systems outperform both standalone AI models and standalone human reviewers. Multi-agent architectures reduce harmful outputs, and expert human oversight further ensures contextual accuracy, completeness, and regulatory fidelity. In clinical development, this means that LLMs should support — but not replace — experienced medical writers, clinical scientists, statisticians, and safety physicians.

A well-designed workflow leverages AI for efficiency while relying on human expertise for judgment, quality control, and risk mitigation. This aligns with the MAST vision of AI systems operating under ongoing, benchmarked evaluation rather than unmonitored deployment.

 

A path forward: Benchmark-aligned, hybrid AI for regulated medicine

The NOHARM study and the broader Arise AI benchmarking platform represent a shift toward transparent, safety-focused evaluation of medical AI. They show that:

  • Safety and completeness require explicit measurement.
  • Omission is a primary source of AI risk in both clinical and regulatory contexts.
  • Multi-agent and human-in-the-loop systems materially reduce harm.
  • Public, standardized benchmarking supports accountability and informed adoption.

For organizations exploring or deploying AI in clinical development, the message is straightforward: LLMs can accelerate work and improve consistency, but only when embedded within systems designed to detect and mitigate the very risks NOHARM identifies. With rigorous evaluation, hybrid architectures, and expert oversight, AI can be integrated into medical and regulatory workflows in a way that advances both efficiency and safety.

 

Interested in learning more?

Consult the preprint by David Wu et al., “First, do NOHARM: Towards clinically safe large language models,” and access the interactive NOHARM leaderboard to see model performance.

Empowering Patient Engagement in HTA: Lessons from an AI-Generated Plain Language Summary Case Study

The challenge: Making HTA understandable to everyone

Health technology assessments (HTAs) play a critical role in determining which treatments and innovations are adopted within healthcare systems. However, the technical language and complexity of HTA reports often make them inaccessible to patients and caregivers — the very individuals whose lives these decisions affect the most.

Plain Language Summaries (PLS) are designed to close this gap. They can translate HTA findings into clear, patient-friendly language, empowering people to engage meaningfully in healthcare decisions. Yet, producing high-quality PLS documents is a slow and resource-intensive process. Teams must balance scientific rigor with readability, cultural sensitivity, and accuracy — a demanding task that limits scalability.

This is where artificial intelligence (AI) offers a transformative opportunity.

 

The study: Can generative AI help bridge the communication gap?

At ISPOR Europe 2025, we presented a pioneering study exploring whether generative AI can create accurate and patient-friendly summaries from complex HTA documents.

Using a NICE Highly Specialized Technologies (HST) guidance on onasemnogene abeparvovec (a gene therapy for spinal muscular atrophy), the team tested Google Gemini, a large language model, to generate a full PLS automatically.

The AI-generated summary was evaluated across 18 quality measures covering readability, accuracy, relevance, and tone. A “human-in-the-loop” reviewer ensured alignment with patient communication standards and European HTA Regulation principles — integrating transparency and patient empowerment into the assessment.

 

The results: Speed meets substance

The results were striking. The AI produced an eight-page (2,570-word) PLS in just 15 seconds, structured around all key HTA components — disease context, treatment mechanism, clinical effectiveness, safety, and patient impact.

Across 18 evaluation criteria, the PLS achieved an average score of 8.27/10, reflecting strong alignment with plain language and patient-centered communication standards.

  • Mechanism simplicity (9.2/10) and plain language explanation (8.9/10) were top-performing categories, demonstrating Gemini’s ability to simplify complex gene therapy concepts without sacrificing accuracy.
  • The document met CEFR B1 readability, ensuring accessibility for non-specialist audiences.

However, the AI struggled with target population clarity (6.8/10) and unmet need articulation (6.5/10) — areas requiring deeper contextual and emotional nuance. These findings underscore the importance of maintaining a human role in refining and validating AI outputs, especially when tailoring content for specific patient groups.

 

The implications: Toward patient-centered HTA with AI

The study demonstrates that AI can accelerate and enhance the creation of patient-friendly HTA communications, promoting inclusivity and transparency in healthcare decision-making. But it also emphasizes that AI should complement, not replace, human expertise.

Generative AI tools like Gemini can help:

  • Scale patient engagement, enabling broader and faster dissemination of accessible HTA information.
  • Support regulatory compliance, aligning with EU HTA Regulation principles of transparency and participation.
  • Enhance health literacy, fostering more equitable and informed patient involvement.

Yet, meaningful adoption requires:

  • Human-in-the-loop systems to verify accuracy, tone, and contextual relevance.
  • Prompt optimization to capture nuances like unmet needs or cultural differences.
  • Ongoing validation to ensure reliability and regulatory alignment.

 

The conclusion: AI as a partner in patient empowerment

This work highlights how AI, when thoughtfully integrated, can make HTA more human-centered, transparent, and inclusive. Rather than automating empathy, it can help scale understanding — bringing patients into the conversation, not leaving them behind.

As HTA continues to evolve under new European regulations, embedding AI into communication workflows may mark a key step toward a truly patient-centered future — where every individual can understand, question, and contribute to the health decisions that shape their lives.

 

Interested in learning more?

Read the abstract published at ISPOR EUROPE 2025: “Can Generative AI Deliver Patient-Friendly Summaries? A Case Study Using NICE Guidance for Spinal Muscular Atrophy” by Manuel Cossio and Ramiro E. Gilardino.

A Preview of Cytel’s Contributions at PHUSE EU 2025

I can’t believe it has already been a year since we wrapped up PHUSE EU Connect 2024, and in two weeks we will be gathering for another exciting PHUSE EU Connect conference, only a few kilometers from Heidelberg, where everything started twenty years ago with the very first PHUSE event. I was one of the couple hundred lucky attendees then, and now, twenty years later, I have the great honor of supporting Jennie McGuirk and Jinesh Patel as Conference Co-chair for this year’s edition.

With a promising agenda featuring about 190 presentations, 34 posters, 9 hands-on workshops, 2 panel discussions, and 3 inspiring keynote speakers, this year we are going to the city of Hamburg for the 21st PHUSE EU Connect. The agenda is full of topics looking toward the future, with about 40 talks and posters referring to AI in their titles, and once again open source will be the confirmed leitmotif.

Cytel will make a significant contribution this year, perhaps more than ever, with six presentations, one poster, active participation in both panel discussions, and co-chairing the “Scripts, Macros and Automation” and “People Leadership & Management” streams.

 

Monday topics: Collaborative spreadsheet editing, agile code writing, extracting metadata from R OOP functions, and leadership

The week kicks off on Monday with Kamil Foltynski, who will present “Overcoming Challenges in Collaborative Spreadsheet Editing with Shiny, SpreadJS and JSON-Patch” in the Application Development stream at 11:30 am. Kamil will provide a technical deep dive into enabling real-time spreadsheet editing within Shiny applications, using tools such as SpreadJS, and share key lessons learned so far. Following Kamil’s presentation, Eswara Satyanarayana Gunisetti will present “Micro-Decisions, Macro Impact: The Role of Agile Thinking in Every Line of Code” in the “Coding Tips & Tricks” stream at 12 pm. See his recent blog on the topic. Eswara will share how an agile “mindset” can positively influence the way we write code.

In the same stream, a few hours later at 2 pm, another colleague, Edward Gillian, in collaboration with Sanofi, will present “Risk.assessr: Extracting OOP Function Details,” discussing strategies for extracting metadata from R object-oriented programming functions. Prior to Eswara and Edward’s sessions, at 1:30 pm, Kath Wright will moderate the interactive People Leadership & Management session “Invisible Glue: Trust, Influence and The Architecture of Teamwork.” In this live workshop, attendees will engage in practical exercises to learn how to identify barriers to trust, evaluate influence dynamics, and apply evidence-based strategies to strengthen collaboration in both physical and virtual environments.

 

Tuesday topics: Industry trends, extracting macro usage and dependency information from SAS programs, and integrating ECA data into CDISC-compliant datasets

Tuesday also brings two presentations and one poster. Right after lunch at 1:30 pm, Cedric Marchand will join other industry leaders in the panel discussion “Reimagining Statistical Programming: AI, Standards & the Talent of Tomorrow.” The panel will explore how current industry trends, such as AI, open source, and the evolution of data standards, will influence the next generation of statistical programmers.

The afternoon continues at 4 pm with my young and talented colleague Marie Poupelin, who will present “From Zero to Programming Hero: How Internships Shape Statistical Programmers in a CRO” in the “Professional Development” stream. Marie is a great example of the success of our internship program, and she will share her journey from having “zero” statistical programming experience to becoming an industry-ready programmer. Thirty minutes later, at 4:30 pm, Guido Wendland will present “Which Macros Are Used in the Study?” in the “Scripts, Macros and Automation” stream, a stream co-led this year for the first time by my colleague Sebastià Barceló. Guido will discuss techniques to extract macro usage and dependency information from SAS programs; this is particularly useful for identifying potential issues or estimating the impact of macro updates.

Later, in the traditional Tuesday evening poster session, you can join my colleague Cyril Sombrin to discuss “Our Journey in Integrating External Control Arms (ECAs) and RWD for Rare Disease Trials.” The poster presents real-world case studies on integrating ECA data into CDISC-compliant datasets, exploring the unique challenges and solutions involved in aligning real-world data with CDISC standards.

 

Wednesday topics: From XPT to Dataset-JSON, real-time validation, and streamlined submissions

On Wednesday at 12 pm, Hugo Signol, another young talented Cytel statistical programmer and a product of our internship program, will present his talk “From XPT to Dataset-JSON: Enabling Real-Time Validation and Streamlined Submissions.” Building on Cytel’s experience from the CDISC Dataset-JSON-Viewer Hackathon, Hugo will demonstrate a Shiny application that supports interactive exploration and real-time validation through API-based checks.

 

Meet us there!

Cytel will be at Booth 9 at the conference, where you can engage in discussions with our team or meet any of us throughout the week.

I hope I didn’t miss anyone, or anything! We look forward to reuniting with colleagues and friends from around the world and making new acquaintances.

See you all in Hamburg!

Generative AI in Evidence Synthesis: Harnessing Potential with Responsibility

The integration of AI into the healthcare research landscape is accelerating, with one obvious area of application being evidence synthesis. From early scoping reviews to comprehensive systematic literature reviews (SLRs), AI promises to reduce manual burden and save time. However, it is crucial to understand both the strengths and limitations of using AI in this broad context to ensure compliance, reliability, and scientific rigor.

 

Knowing where it works: A targeted approach

Artificial intelligence, including generative AI models, shines when used for targeted literature reviews (TLRs) or when generating summaries of scientific articles to support evidence-based decision-making at an early development stage. AI can synthesize large volumes of information quickly, offering valuable insights during exploratory or early-phase research.

However, it’s critical to distinguish these from regulatory-facing systematic literature reviews, especially those intended for payer or health technology assessment (HTA) submissions. In this context, SLR extractions have traditionally been completed by two independent human reviewers. This human oversight ensures objectivity and reproducibility, key elements of regulatory compliance.

 

Expertly trained models vs. generalist giants

The current landscape is filled with large generalist language models trained on diverse internet-scale data. While impressive, these models often exhibit hallucinations — the generation of plausible but incorrect or fabricated content — particularly in domain-specific applications like evidence synthesis.

This is why domain-trained expert models are preferred. These models are fine-tuned on biomedical and scientific corpora, ensuring higher reliability and reducing the risk of misinterpretation or erroneous conclusions. They understand field-specific terminology, data structures, and compliance requirements far better than their generalist counterparts.

 

The imperative of data traceability

In evidence synthesis, transparency is non-negotiable. Any AI-generated output must allow users to:

  • Highlight the exact source (i.e., sentence or section) of the original scientific article from which a conclusion or data point was extracted.
  • Compare the model’s interpretation with the source text to identify discrepancies or nuances that could affect meaning or validity.

Using structured tags to annotate key terms, qualifiers, and relationships can not only make these comparisons clearer and more systematic but also inform advanced search and retrieval activities. By surfacing subtle differences, tagging supports expert review, preserves contextual integrity, and strengthens the reliability and defensibility of the synthesized evidence.
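As a rough illustration, a tagged extraction might carry its source span alongside each annotated element. The tag vocabulary, file name, example sentence, and checking heuristic below are all invented for this sketch:

```python
# Each extracted claim keeps the exact source sentence it came from,
# so a reviewer can compare the model's interpretation with the source.
extraction = {
    "claim": "Median PFS was 11.2 months in the treatment arm.",
    "tags": {                                     # invented tag vocabulary
        "endpoint": "progression-free survival",
        "statistic": "median",
        "value": "11.2 months",
        "qualifier": "treatment arm",
    },
    "source": {
        "document": "example_trial_report.pdf",   # hypothetical file
        "section": "Results",
        "sentence": ("Median progression-free survival was 11.2 months "
                     "(95% CI 9.8-13.1) among patients receiving the "
                     "study drug."),
    },
}

def discrepancies(extraction):
    """Flag tags whose first word is absent from the source sentence,
    surfacing mismatches for expert review (a crude heuristic)."""
    src = extraction["source"]["sentence"].lower()
    return [tag for tag, value in extraction["tags"].items()
            if value.split()[0].lower() not in src]

print(discrepancies(extraction))   # ['qualifier'] -> reviewer checks this
```

Here the check flags the "qualifier" tag because "treatment arm" is the model's paraphrase of "patients receiving the study drug", exactly the kind of nuance an expert reviewer should confirm.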

 

Measuring what matters: Precision and beyond

Traditional evaluation metrics like precision, recall, and F1 score (the harmonic mean of precision and recall) remain foundational when assessing AI model performance in literature screening and data extraction.

But in generative contexts — where the task may be summarization, paraphrasing, or abstract reasoning — additional measures become valuable:

  • Answer correctness: Does the output convey a factual, verifiable point?
  • Semantic similarity: How closely does the AI output align in meaning with the ground truth?
  • BLEU, ROUGE, and BERTScore: These Natural Language Processing metrics offer quantitative insights into the quality of generated text, especially for summarization and content generation tasks.

Selecting the right mix of these metrics provides a comprehensive view of model performance and reliability.
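For the foundational metrics, here is a minimal sketch of precision, recall, and F1 over screening decisions; the record IDs and inclusion sets are invented:

```python
def precision_recall_f1(gold, predicted):
    """Compute screening metrics from sets of included record IDs."""
    tp = len(gold & predicted)                     # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)          # harmonic mean
    return precision, recall, f1

gold = {"PMID1", "PMID2", "PMID3", "PMID4"}        # human-included records
predicted = {"PMID1", "PMID2", "PMID5"}            # model-included records
p, r, f1 = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))      # 0.667 0.5 0.571
```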

 

Where AI makes a difference: Screening and beyond

One of the most promising applications of generative AI in evidence synthesis is literature screening: assessing whether a publication (abstract or full text) meets the criteria for inclusion. Studies and pilot implementations suggest that AI can reduce screening time by up to 40%, making it a powerful ally for research teams.

AI tools have been leveraged not only to assign a probability of inclusion to a title, abstract, or full text to guide the screening process, but also to let researchers quickly understand how modifying search strategies affects yield. By automating this repetitive and time-consuming phase, organizations can reallocate expert human resources to higher-value tasks, such as:

  • Resolving ambiguous or context-dependent data extractions
  • Validating nuanced findings and offering insights into implications of these findings
  • Ensuring alignment with HTA submission standards

In this way, AI doesn’t replace human reviewers but augments them, driving efficiency without compromising accuracy.
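As a toy illustration of probability-guided screening, the sketch below (with made-up inclusion probabilities) shows how the inclusion threshold trades recall against human screening workload:

```python
# Invented inclusion probabilities from a hypothetical screening model.
records = [("PMID1", 0.95), ("PMID2", 0.80), ("PMID3", 0.40),
           ("PMID4", 0.15), ("PMID5", 0.05)]
truly_included = {"PMID1", "PMID2", "PMID3"}   # human gold standard

def screen(records, threshold):
    """Flag records at or above the threshold for human review."""
    flagged = {pid for pid, p in records if p >= threshold}
    recall = len(flagged & truly_included) / len(truly_included)
    workload = len(flagged) / len(records)     # fraction humans must read
    return recall, workload

for t in (0.5, 0.1):
    recall, workload = screen(records, t)
    print(f"threshold={t}: recall={recall:.2f}, workload={workload:.2f}")
# threshold=0.5: recall=0.67, workload=0.40
# threshold=0.1: recall=1.00, workload=0.80
```

Lowering the threshold recovers every relevant record at the cost of more human review, which is why threshold choice is itself an expert decision.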

 

AI with guardrails

Generative AI is reshaping the landscape of evidence synthesis, but its integration must be strategic, measured, and compliant. By combining domain-trained models, robust traceability, appropriate evaluation metrics, and human oversight, organizations can unlock the true value of AI — accelerating workflows without sacrificing quality or compliance.

When used thoughtfully, generative AI becomes more than just a tool — it becomes a partner in advancing scientific research.

 

Meet with us at ISPOR 2025!

Manuel Cossio and Nathalie Horowicz-Mehler will be in Glasgow for ISPOR Europe 2025! Click the link below to book a meeting, or stop by Booth #1024 to connect with our experts:

Breaking Barriers in Rare Disease Research with Generative AI and Synthetic Data

In healthcare innovation, one of the most pressing challenges lies in rare disease research. There are approximately 7,000 rare diseases affecting over 300 million people worldwide. With only a handful of patients dispersed globally, gathering sufficient data to power robust clinical studies or predictive models is a monumental hurdle. However, a solution is emerging at the intersection of generative AI and real-world data (RWD) — a novel approach with the potential to reshape possibilities and unlock insights to address unmet medical needs in rare diseases.

 

The rare disease data dilemma

In the U.S., rare diseases are defined as conditions affecting fewer than 200,000 people. Despite their low individual prevalence, rare diseases collectively impose a significant burden on both patients and healthcare systems.

Research and development in rare diseases often face a vicious cycle: low prevalence leads to data scarcity. Traditional clinical trials are often infeasible and/or statistically underpowered due to the limited pool of participants.

Meanwhile, RWD sources such as electronic health records (EHRs), insurance claims, registries, and patient-reported outcomes offer valuable, albeit messy and fragmented, glimpses into the patient journey. Yet even RWD struggles to paint a complete picture in rare diseases. This is where generative AI steps in.

 

Enter generative AI: Making data where there is none

Generative AI — especially models like Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and, more recently, large foundation models — has a transformative ability: it can learn patterns from limited datasets and generate synthetic yet realistic datasets.

How it works

  1. Learning from RWD: Even small datasets from rare disease patients can be used to train and fine-tune generative models. These models identify patterns, distributions, and time-dependent relationships present in the data.
  2. Synthesizing patients: Once trained, the model can create new, synthetic patient records that preserve the statistical properties and characteristics of the original data. These “digital patients” simulate disease progression, treatment responses, and comorbidities.
  3. Validating realism: Synthetic data must be validated to ensure it reflects the real-world data it was trained on. Techniques like distributional comparison, propensity scoring, and expert validation are used to ensure accuracy and utility.
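The three steps above can be sketched end to end. A plain Gaussian fit stands in here for the GANs, VAEs, and foundation models named earlier, and the summary-statistic check stands in for formal validation such as distributional tests or propensity scoring; the cohort values are simulated:

```python
import random
import statistics

random.seed(0)  # deterministic toy example

# 1. Learning from RWD: a small "real" cohort of biomarker values.
real = [random.gauss(50, 10) for _ in range(40)]
mu, sigma = statistics.mean(real), statistics.stdev(real)

# 2. Synthesizing patients: draw new records from the learned distribution.
synthetic = [random.gauss(mu, sigma) for _ in range(200)]

# 3. Validating realism: compare summary statistics of synthetic vs. real.
drift_mean = abs(statistics.mean(synthetic) - mu)
drift_sd = abs(statistics.stdev(synthetic) - sigma)
print(f"mean drift: {drift_mean:.2f}, sd drift: {drift_sd:.2f}")
```

Real pipelines model many correlated variables and longitudinal structure, but the train, sample, and validate loop is the same.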

 

Why synthetic data matters for rare diseases

Synthetic data can enhance rare disease clinical research in many ways, including:

 

1. Augmenting small cohorts

Synthetic data can boost sample sizes for rare disease studies, enabling:

  • Simulation of clinical trials
  • Development of more robust predictive models
  • Generation of synthetic control arms where traditional controls are ethically or logistically impractical

 

2. Enhancing privacy

In rare diseases, patient re-identification is an elevated risk due to unique phenotypes or genetic markers. Synthetic data protects patient privacy while preserving the utility of the data.

 

3. Facilitating global collaboration

Because synthetic data is de-identified, it facilitates data sharing among researchers and institutions and across borders, minimizing regulatory hurdles and fostering collaborative discovery.

 

4. Accelerating drug development

Pharma and biotech companies can use synthetic data to:

  • Test drug targeting strategies
  • Model long-term outcomes
  • Conduct in silico trials in the earliest stages of development

 

Challenges and considerations

While promising, this approach is not without its challenges:

  • Bias amplification: Synthetic data reflects the biases of its training data. If the RWD is incomplete or skewed, the synthetic outputs will be too. Strategies to handle bias are essential.
  • Regulatory acceptance: Regulatory bodies are still evaluating how to incorporate synthetic data into approval pathways.
  • Validation standards: There is a need for consistent benchmarks and best practices for validating synthetic data, in terms of both privacy and utility, and for broader generative AI applications in healthcare.

 

Looking ahead

The marriage of generative AI and RWD opens new doors for rare disease research. With the ability to synthesize patient data that preserves real-world complexity, we can begin to break free from the constraints of scarcity — generating insights, hypotheses, and interventions that were once out of reach.

As we move forward, interdisciplinary collaboration among clinicians, data scientists, regulatory bodies, and patient advocacy groups will be key to harnessing this potential ethically and effectively.

 

Interested in learning more?

Download our complimentary ebook, Rare Disease Clinical Trials: Design Strategies and Regulatory Considerations:

From Metadata to Submission: Rule-Based Robotic Process Automation for Statistical Programming Excellence

In the race to modernize data operations in clinical research and regulatory submissions, Robotic Process Automation (RPA) powered by rule-based systems has emerged as a dependable and high-impact solution. These systems offer clarity, control, and reproducibility — critical traits for industries like biopharma where regulatory compliance and data integrity are non-negotiable.

Here, we discuss rule-based RPA as the foundation for a scalable and auditable standards automation pipeline.

 

Rule-based automation: Transparent, trusted, and tunable

Unlike probabilistic models, rule-based systems operate on deterministic logic. Every output is traceable back to an explicit rule, which enhances trust and simplifies troubleshooting. This transparency is particularly valuable when processes must be easily explained to stakeholders and auditors.

Key strengths of rule-based RPA include:

Transparency

Each step in the workflow is rule-driven, making the logic easy to inspect, validate, and justify. This ensures regulatory reviewers can clearly understand how data was transformed or outputs generated — vital in submission contexts.

Consistency

Standard rules applied across studies generate consistent outputs. For example, Cytel’s ALPS system creates SDTM and ADaM code from structured specifications, producing reliable results that hold up across different projects and teams.

Customizability

Rule-based systems are modular. Teams can easily adapt existing rules to accommodate study-specific needs without overhauling the entire system. Tools like Prism allow this by applying both generic rules and study-specific layers for enriched metadata processing.

 

Cytel’s metadata-driven RPA workflow in action

Our internal automation pipeline demonstrates the power of rule-based RPA. It’s built on a modular architecture where each tool performs a specific, rules-driven task:

  • ALPS: Converts metadata specifications into ready-to-run SAS code for SDTM and ADaM datasets, reducing manual programming and minimizing error risks.
  • Lighthouse: Enables biostatisticians to build mock shells using reusable templates, ensuring consistency in table and listing structures.
  • Prism: Extracts metadata from mock shells and transforms it into XML-format ARMs (Analysis Results Metadata), enriching it through rules and generating code for up to 60% of standard safety outputs.
  • TAB Macros and CytelDocs: Automate the creation of summary tables and documentation, saving hours of effort and ensuring compliance with standardized formats.
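The generic-plus-study-specific rule layering described above can be illustrated with a toy generator. ALPS and Prism are far more sophisticated; the rules, spec fields, and generated SAS snippets below are invented for illustration:

```python
# Generic rules shared across studies (invented SAS snippets).
GENERIC_RULES = {
    "AGE": "AGE = floor((RFSTDTC - BRTHDTC) / 365.25);",
    "SEX": "SEX = upcase(SEX);",
}

def generate_code(spec, study_rules=None):
    """Resolve each variable via the study-specific layer first, then the
    generic rules; anything unmatched is flagged for manual programming."""
    rules = {**GENERIC_RULES, **(study_rules or {})}
    lines = []
    for var in spec["variables"]:
        if var not in rules:
            raise KeyError(f"No rule for {var}; flag for manual programming")
        lines.append(rules[var])   # every output line traces to one rule
    return "\n".join(lines)

spec = {"dataset": "DM", "variables": ["AGE", "SEX"]}
study_layer = {"SEX": "SEX = 'U';  /* study-specific imputation */"}
print(generate_code(spec, study_layer))
```

Because each emitted line maps to exactly one explicit rule, the output stays auditable even when a study overrides the generic layer.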

This end-to-end pipeline reduces manual touchpoints, maintains high quality, and boosts team efficiency.

 

Where generative AI complements RPA

While rule-based systems are ideal for tasks requiring consistency and auditability, generative AI can complement these systems — particularly in areas where variability is acceptable and outputs don’t require deterministic reproducibility. For example, Gen AI can assist with:

  • Drafting exploratory narratives or documentation
  • Suggesting code for non-critical outputs
  • Enhancing user interfaces with intelligent prompts
  • Enriching the set of study-specific rules to be applied

However, these AI-driven capabilities are best applied where hallucinations won’t compromise integrity, and outputs don’t demand rigid consistency.

 

Business and quality benefits of rule-based RPA

By relying on rule-based RPA for core data workflows, we’ve realized several tangible gains:

  • Time efficiency: Standard code is generated automatically, freeing time for custom analysis.
  • Reduced redundancy: Developers no longer rewrite common code across projects.
  • Improved QA: Outputs are independently validated and built on rigorously tested rule sets.
  • Collaboration at scale: Uniform rules simplify onboarding and knowledge transfer.
  • Focus on what matters: Teams can concentrate on non-standard elements that require expertise.

 

Final takeaways

Rule-based RPA systems provide the transparency, structure, and adaptability required for high-stakes data environments. At Cytel, we’ve found them indispensable in our mission to expedite regulatory submissions without compromising on quality or compliance. As AI continues to evolve, generative technologies may enrich this foundation — but rule-based automation remains the core engine that ensures accuracy, accountability, and speed.

Agentic Autonomy: How Multi-Agent Systems Could Orchestrate the Future of Clinical Development

In recent years, artificial intelligence has evolved beyond basic pattern matching to become capable of autonomous reasoning, multi-step planning, and even delegation. This transition — from passive tools to goal-driven, reasoning agents — marks the rise of agentic AI.

For the life sciences sector, and especially clinical development, this evolution arrives at a critical time. Clinical trials are increasingly complex, cross-functional, and data-intensive. Agentic AI offers not just faster tools, but the possibility of autonomous collaboration — teams of agents working in harmony to reduce burden, increase efficiency, and shorten timelines.

Here we explore the evolution of agentic AI and how higher levels of autonomy could transform clinical development from reactive execution to proactive, intelligent orchestration.

 

The evolution of agentic AI

Agentic AI evolves through five distinct levels of capability. Each stage unlocks new functionality, from static models at the lowest level to ecosystems of communicating agents at the highest.

 

 

Each level builds toward intelligent autonomy. The transition from Level 3 to Levels 4 and 5 introduces intentional behavior, goal-setting, and inter-agent collaboration — the foundations of autonomous operations in clinical development.

 

Agentic AI in clinical development: A new operating model

Clinical development is not just complex — it’s interdependent. Every milestone relies on the seamless handoff and integration of data, code, documents, and decisions. Agentic AI, particularly at Levels 4 and 5, promises to re-architect this model.

 

Level 4: Planning and reasoning agents

These agents can independently break down goals, design execution paths, and adapt to changing environments. Here’s how they can drive value:

  • Medical writing agents
    • What they do: Generate drafts for protocols, CSRs, and patient narratives.
    • How they help: Understand document structures, integrate real-time data, and adapt language for regulatory or clinical audiences.
    • Outcome: Faster document turnaround, reduced rework, and scalable writing support.

 

  • Statistical programming agents
    • What they do: Develop and validate analysis code in SAS, R, or Python.
    • How they help: Plan logical sequences, debug outputs, and dynamically update based on protocol amendments.
    • Outcome: Accelerated code generation with built-in quality assurance.

 

  • Information synthesis agents
    • What they do: Retrieve and synthesize information from multiple domains — scientific literature, regulatory guidelines, real-world data, health system policies, and reports on unmet medical needs.
    • How they help: Prioritize and contextualize inputs to support clinical design, indication selection, and risk-benefit assessments.
    • Outcome: Broader strategic alignment and better-informed cross-functional planning.

 

Level 5: Multi-agent systems

At this level, clinical development becomes an ecosystem of agents, each with a specialized role, working under the coordination of orchestrator agents that function like project managers.

  • Orchestrator agents
    • What they do: Assign tasks, monitor progress, and realign workflows in real time.
    • How they help: Adjust deliverables dynamically as inputs change or downstream agents complete their tasks.
    • Outcome: Continuously managed, self-optimizing trial execution.

 

  • Agent networks
    • Example: A data management agent processes raw datasets and hands outputs to a statistical agent, which triggers a writing agent to draft updated narratives — all autonomously.
    • Value: End-to-end automation with minimal human handoffs.
    • Outcome: Real-time trial updates and agility under pressure.
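The agent-network handoff described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the agent classes, method names, and the mean-based "analysis" are all invented stand-ins, not a real orchestration framework.

```python
# Hypothetical sketch of a Level 5 agent network: a data management agent
# hands cleaned data to a statistical agent, whose summary triggers a
# writing agent. All class and method names are illustrative.

class DataManagementAgent:
    def process(self, raw):
        # Drop records with missing values (stand-in for real cleaning rules).
        return [r for r in raw if None not in r.values()]

class StatisticalAgent:
    def analyze(self, records):
        # Compute a simple mean as a stand-in for a full analysis.
        values = [r["value"] for r in records]
        return {"n": len(values), "mean": sum(values) / len(values)}

class WritingAgent:
    def draft(self, stats):
        return (f"Across {stats['n']} evaluable subjects, "
                f"the mean response was {stats['mean']:.1f}.")

class OrchestratorAgent:
    """Coordinates the handoffs so no human touchpoint is needed."""
    def run(self, raw):
        clean = DataManagementAgent().process(raw)
        stats = StatisticalAgent().analyze(clean)
        return WritingAgent().draft(stats)

raw = [{"value": 10.0}, {"value": None}, {"value": 14.0}]
narrative = OrchestratorAgent().run(raw)
print(narrative)  # Across 2 evaluable subjects, the mean response was 12.0.
```

The orchestrator is the only component that knows the full workflow; each specialist agent can be swapped or upgraded independently, which is the design property that makes real agent networks adaptable under pressure.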

 

The benefits of the agent ecosystem

 

From automation to autonomy

Agentic AI reflects an evolution from “AI that assists” to “AI that takes initiative” — supporting actions, learning from experience, and extending expertise across domains. In clinical development, where complexity continues to rise and efficiency is critical, this shift offers a meaningful opportunity rather than just an advantage.

As we look toward Levels 4 and 5, we can imagine a future where trials increasingly manage themselves, where teams are supported by networks of intelligent agents, and where human professionals gain more space to focus on innovation, thoughtful oversight, and meaningful patient outcomes.

 

Meet with us at ISPOR 2025!

Manuel Cossio will be in Glasgow for ISPOR Europe 2025! Book a meeting, or stop by Booth #1024 to connect with our experts:

Redefining Clinical Documentation in the Age of Intelligent Collaboration: The Rise of the AI-Assisted Medical Writing Strategist

The introduction of AI into medical writing workflows marks a pivotal turning point in clinical development. As life sciences companies deploy AI agents to generate clinical documents — from clinical study protocols (CSPs) and statistical analysis plans (SAPs) to clinical study reports (CSRs) — a new role is emerging: the AI-assisted medical writing strategist.

This role represents a shift in mindset and skillset. No longer is the medical writer just a document author; they are becoming a strategic orchestrator of AI tools, data-driven narratives, and regulatory precision.

 

What is an AI-assisted medical writing strategist?

An AI-assisted medical writing strategist is a clinical and regulatory expert who partners with AI systems to accelerate and optimize the development of clinical documents. They bring together deep scientific understanding, regulatory knowledge, and technical fluency to co-create documents that are not only accurate and compliant but also delivered at unprecedented speed.

They are not just reviewing AI outputs — they are shaping the way AI generates those outputs, continuously fine-tuning the interaction between human judgment and machine efficiency.

 

Core pillars of the strategist role

The AI-assisted medical writing strategist role is defined by the following five key pillars:

 

1. AI orchestration, not just review

At the heart of the strategist’s work is the ability to guide AI systems toward producing high-quality, usable first drafts. This means:

  • Designing intelligent prompts based on document type and trial context.
  • Structuring modular content frameworks that AI can populate and iterate on.
  • Embedding company-specific style guides, preferred language, and regulatory templates into AI workflows.
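One way to picture this orchestration work is as a modular prompt builder: a template per document type, with the company style guide embedded automatically. The sketch below is purely illustrative; the template text, field names, and style-guide wording are assumptions, not a real workflow.

```python
# Hypothetical sketch of modular prompt orchestration: the strategist
# maintains templates per document type and embeds a company style guide
# into every prompt. All fields and wording are illustrative.

STYLE_GUIDE = ("Use ICH E3 section headings; write in past tense; "
               "spell out abbreviations on first use.")

TEMPLATES = {
    "CSR_safety": (
        "You are drafting the safety section of a clinical study report.\n"
        "Study: {study_id} ({indication}).\n"
        "Summarize treatment-emergent adverse events from the data below.\n"
        "Style guide: {style}\n"
        "Data:\n{data}"
    ),
}

def build_prompt(doc_type, **fields):
    """Populate the modular template for one document type."""
    return TEMPLATES[doc_type].format(style=STYLE_GUIDE, **fields)

prompt = build_prompt(
    "CSR_safety",
    study_id="ABC-123",
    indication="hypertension",
    data="TEAE table placeholder",
)
print(prompt.splitlines()[1])  # Study: ABC-123 (hypertension).
```

Because the style guide lives in one place, a single update propagates to every AI-generated draft, which is the point of structuring content frameworks rather than hand-writing prompts per document.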

 

2. Scientific and regulatory oversight

Even with AI generating drafts, clinical development demands nuanced, evidence-based interpretation. The strategist ensures:

  • Scientific rigor in efficacy and safety narratives.
  • Consistency in data interpretation across documents.
  • Adherence to ICH, FDA, EMA, and country-specific requirements.

AI might know the rules, but the strategist knows the exceptions, the subtleties, and the evolving guidance that govern every submission.

 

3. Training the AI with human expertise

AI systems improve through feedback. Strategists:

  • Curate and label high-quality training datasets (e.g., past CSRs, protocols).
  • Correct and comment on AI-generated drafts to reinforce preferred structures and content styles.
  • Continuously evaluate model performance and guide retraining cycles.

They act as domain-informed teachers, helping the AI become a better writing partner over time.

 

4. Cross-functional bridge builder

Medical writing is inherently collaborative. The strategist aligns AI output with expectations from:

  • Clinical, data management, and statistical teams.
  • Regulatory affairs and quality assurance.
  • Legal, ethical, and patient advocacy groups.

In doing so, they help organizations reimagine review cycles, moving from linear drafting to agile co-creation.

 

5. Champion of ethics and transparency

AI is powerful — but it must be used responsibly. Strategists play a leading role in:

  • Ensuring AI doesn’t fabricate data or misrepresent study outcomes.
  • Clarifying where automation was used in document creation.
  • Promoting transparency, reproducibility, and compliance in every AI-assisted process.

 

Why this role matters

The volume and complexity of clinical documentation are only increasing. At the same time, timelines are shrinking, budgets are tightening, and regulatory scrutiny is rising. AI offers a way forward — but only when guided by human intelligence.

The AI-Assisted Medical Writing Strategist ensures that automation enhances human value rather than diminishing it. They unlock:

  • Faster turnaround times for key deliverables.
  • More consistent documentation across global studies.
  • Greater focus on high-value tasks like interpretation, innovation, and communication.

 

How to prepare for this role

Transitioning into this role requires new capabilities:

  • AI literacy: Understanding how large language models (LLMs) work, how they’re trained, and where they fall short.
  • Prompt engineering: Knowing how to ask the right questions and frame the right context for AI tools.
  • Regulatory acumen: Staying current with guidance on AI use in regulated document environments.
  • Change leadership: Helping others adopt AI tools confidently and responsibly.

 

Final thoughts

The AI-assisted medical writing strategist is more than a job title — it’s a vision for the future of clinical documentation. As the life sciences industry embraces digital transformation, this role becomes essential to ensure that automation is paired with accountability, speed with accuracy, and efficiency with empathy.

By stepping into this role, medical writers don’t just adapt to the AI era — they lead it.

Streamlining Data Management and Improving Statistical Accuracy in Clinical Trials with AI

As clinical trials grow increasingly complex, smarter, faster, and more efficient data processes and analyses are in demand. Artificial intelligence (AI) has emerged as a powerful tool, especially in programming and data management. For clinical trial professionals, AI offers the promise of streamlining workflows, improving data quality, and reducing time to database lock.

 

The evolving role of AI in clinical data programming

AI is not replacing clinical programmers; it’s augmenting them. AI should be treated as a tool within clinical trials, just as EDC and SAS are commonly used tools. Automation driven by machine learning can now handle routine, rules-based programming tasks such as edit check generation, derivation logic, and data transformation, allowing programmers to focus on more strategic activities like validating statistical code or optimizing data pipelines. And AI still depends on the expertise of clinical trial professionals.

Natural Language Processing (NLP) is also making great progress. For instance, NLP can interpret free-text protocol documents to auto-generate specifications and electronic case report form (eCRF) templates, or even suggest initial SDTM mappings, significantly reducing manual effort.

 

AI in data cleaning and quality oversight

Traditionally, data cleaning has been labor-intensive, with data managers manually reviewing queries, data listings, and edit checks across multiple data sources and systems. AI tools can now proactively flag anomalies or data trends that human review might miss, such as unexpected patterns in lab values, inconsistencies across visits, or possible fraudulent data across participants and sites.
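The anomaly-flagging idea can be illustrated with a minimal sketch: a robust (median-based) z-score rule standing in for a trained model. The threshold, variable names, and lab values below are invented for illustration only.

```python
# Minimal sketch of automated anomaly flagging on lab values, using a
# robust z-score (median absolute deviation) as a stand-in for a trained
# anomaly-detection model. Threshold and data are illustrative.
from statistics import median

def flag_anomalies(values, z_threshold=3.5):
    """Return indices of values whose robust z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 scales the MAD to be comparable to a standard deviation.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > z_threshold]

# Hemoglobin-like values across visits, with one transcription error
# (139 instead of 13.9) that a manual listing review could easily miss.
hgb = [13.2, 13.5, 13.9, 14.1, 139.0, 13.8, 14.0]
print(flag_anomalies(hgb))  # [4]
```

The median-based statistic matters here: a plain mean/standard-deviation z-score is dragged toward the outlier itself and can fail to flag it, which is exactly the kind of masking a well-designed cleaning tool must avoid.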

Predictive models can help identify study participants at high risk of dropout or noncompliance, enabling earlier intervention. This not only improves data completeness but also enhances trial efficiency and participant retention. The effort and cost of replacing clinical trial participants is significant and felt across all stakeholders. Improving the patient’s experience is a meaningful way to save time and money and to accelerate progress.

 

AI in statistical programming: From code automation to advanced insights

Statistical programming is central to clinical trial analysis, from producing tables, listings, and figures (TLFs) to preparing submission-ready datasets. Traditionally reliant on manual coding in SAS or R, this work is now gaining speed, consistency, and quality through AI augmentation.

 

Where AI adds value in statistical programming

  • Automated code generation: AI models trained on historical programming logic can produce initial SAS macros or R scripts for common TLFs and dataset derivations. These drafts can accelerate development by 40–60%, freeing programmers and biostatisticians to focus on complex analyses and interpretation.
  • Code review and validation: AI-assisted tools can scan code for logic errors, inefficiencies, redundant steps, and deviations from programming standards. Acting as a “second reviewer,” they flag potential issues early and suggest optimizations.
  • Dynamic statistical modeling: AI algorithms can rapidly explore large trial datasets to detect subgroup effects, anomalies, or emerging trends. When guided by statistical oversight, these insights can refine analysis plans and support adaptive trial decisions.
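In its simplest form, automated code generation is template-driven: a machine-readable specification populates a vetted code template to produce a first draft. The sketch below assumes a hypothetical TLF spec and dataset/variable names, and emits a draft SAS snippet from Python purely for illustration.

```python
# Illustrative sketch of spec-driven code generation for a standard TLF:
# a specification dictionary fills a vetted template to emit a first-draft
# SAS PROC FREQ program. Dataset, variable, and title are hypothetical.

SAS_FREQ_TEMPLATE = """proc freq data={dataset};
    tables {variable} * trt01a / nopercent;
    title "{title}";
run;"""

def generate_freq_code(spec):
    """Render a draft SAS program from a TLF specification."""
    return SAS_FREQ_TEMPLATE.format(**spec)

spec = {
    "dataset": "adsl",
    "variable": "agegr1",
    "title": "Table 14.1.1 Age Group by Treatment",
}
draft = generate_freq_code(spec)
print(draft.splitlines()[0])  # proc freq data=adsl;
```

Real AI-assisted generators go beyond fixed templates, but the quality-assurance logic is the same: the template (or trained model) is validated once, and each generated draft inherits that validation as its starting point.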

The aim is not to replace human judgment, but to boost productivity, reproducibility, and the speed of insight generation, without compromising scientific rigor.

 

AI in biostatistics: Powering smarter, more adaptive clinical trials

Biostatistics remains the foundation of evidence generation in clinical trials, providing the methodological rigor to transform raw data into reliable conclusions. In the context of AI, biostatisticians play a dual role: safeguarding scientific validity while leveraging new computational tools to enhance insight generation. This requires a careful balance between deep domain knowledge and technical proficiency in emerging AI-driven methodologies. From applying knowledge graphs (KGs) to map complex biomedical relationships, to developing predictive models that anticipate trial outcomes, biostatistics is evolving into a more dynamic and interconnected discipline.

 

Where AI adds value in biostatistics

  • Balanced expertise: Integrating statistical theory with AI/ML techniques to ensure robust, interpretable results.
  • Knowledge graph applications: Using KGs to uncover hidden relationships between biomarkers, treatments, and outcomes, supporting hypothesis generation and trial design.
  • Early prediction tools: Building predictive models for recruitment success, dropout risk, and endpoint achievement.
  • Segmentation and personalization: Identifying patient subgroups most likely to benefit from a therapy, improving trial efficiency and precision medicine strategies.
  • Support for registrational trials: Leveraging AI to optimize trial design, stratify patient populations, and run simulations that ensure the study is powered and structured for regulatory success.
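A knowledge-graph query can be sketched very compactly: entities are nodes, relationships are directed edges, and a graph search surfaces indirect connections between a biomarker and an outcome. Every entity and edge below is invented for illustration; real biomedical KGs hold millions of curated relationships.

```python
# Toy sketch of a knowledge-graph query: directed edges link genes,
# biomarkers, treatments, and outcomes, and a breadth-first search
# surfaces an indirect path useful for hypothesis generation.
# All entities and edges are invented for illustration.
from collections import deque

EDGES = {
    "gene:PCSK9": ["biomarker:LDL-C"],
    "biomarker:LDL-C": ["drug:statin"],
    "drug:statin": ["outcome:MI-reduction", "effect:myalgia"],
}

def find_path(start, goal):
    """BFS over the directed graph; returns a node path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in EDGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path("gene:PCSK9", "outcome:MI-reduction"))
# ['gene:PCSK9', 'biomarker:LDL-C', 'drug:statin', 'outcome:MI-reduction']
```

The value for trial design is precisely these multi-hop paths: a relationship that no single data source states directly can emerge from chaining curated edges across sources.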

 

Regulatory readiness and caution

Despite its promise, AI must be implemented thoughtfully. Regulatory agencies like the FDA are increasingly open to the use of advanced technologies but expect transparency, traceability, and validation. AI-based tools must be auditable and explainable, especially when used in clinical data workflows that feed into regulatory submissions.

 

What’s next?

As AI becomes more embedded in clinical trial ecosystems, we can expect increased integration with EDC systems, CDISC standards, and statistical programming tools. The goal isn’t to eliminate human oversight but to enhance it, allowing clinical data professionals to make faster, better-informed decisions.

 

Final takeaways

AI is reshaping programming and data management in clinical trials. For clinical trial professionals, now is the time to become familiar with these tools, understand their capabilities and limitations, and engage with cross-functional teams to ensure responsible and impactful implementation. Ultimately our goal is to shorten drug development timelines and improve patient outcomes. With AI, we can be part of the solution to provide improved treatments for patients.

 

Interested in learning more?

Join Steven Thacker, Sheree King, Kunal Sanghavi, and Juan Pablo Garcia Martinez for their upcoming webinar, “How AI Enhances Biometrics Services: Streamlining Data Management and Improving Statistical Accuracy in Clinical Trials” on Thursday, August 28 at 10 am ET:

Trustworthy AI in Action: Predicting Stroke Risk Transparently with Claims-Based Machine Learning

In recent years, deep learning and large neural networks have garnered most of the attention in the machine learning (ML) community. Their ability to model complex, high-dimensional data is indeed impressive. But in healthcare — where decisions can have serious consequences and interpretability is paramount — simpler, transparent models like logistic regression still have an important role to play.

Not every problem requires a black box. When it comes to predicting disease risk using structured data, such as insurance claims, traditional models can offer accuracy and insight.

 

Claims databases: An untapped resource for disease risk prediction

Claims databases are an increasingly valuable source of real-world data (RWD). Unlike clinical trial data, which is highly controlled but limited in scale and scope, administrative claims datasets cover millions of lives over multiple years, reflecting real patient behavior and care patterns.

These databases include information on diagnoses, procedures, prescriptions, and demographics — elements that, while lacking granular clinical detail, can still reveal important patterns in disease progression and risk. The scale of these datasets allows for robust statistical modeling, even for rare outcomes.

 

The case for explainable machine learning in claims-based risk prediction

When working with claims data, models like logistic regression, Lasso, or Ridge regression are not just sufficient — they are often ideal. These models:

  • Produce coefficients that quantify the relationship between features and outcomes.
  • Allow for transparent understanding of why a prediction was made.
  • Are easier to validate and communicate to clinicians, payers, and regulators.

In contrast, deep learning models often deliver slightly higher accuracy at the cost of interpretability — a trade-off that may not be acceptable in regulated healthcare environments.
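What "coefficients that quantify the relationship" means in practice: each logistic-regression coefficient exponentiates to an odds ratio that a clinician, payer, or regulator can inspect directly. The coefficients and patient features below are invented for illustration, not results from the study discussed later.

```python
# Sketch of why logistic regression is transparent: every coefficient
# maps directly to an odds ratio, and the risk score is a simple
# closed-form function of the inputs. Coefficients are illustrative.
import math

# Hypothetical fitted model: log-odds of stroke hospitalization.
coef = {"intercept": -5.0, "age_per_10y": 0.45,
        "atrial_fibrillation": 1.10, "hypertension": 0.60}

def predict_risk(features):
    """Logistic model: risk = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = coef["intercept"] + sum(coef[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Odds ratio for atrial fibrillation is exp(coefficient): here, the
# odds of stroke hospitalization are about 3x higher with AF.
print(round(math.exp(coef["atrial_fibrillation"]), 2))  # 3.0

patient = {"age_per_10y": 7.5, "atrial_fibrillation": 1, "hypertension": 1}
print(round(predict_risk(patient), 3))  # 0.519
```

There is no hidden layer between inputs and prediction: the contribution of every risk factor can be read off the model, which is exactly the property that supports validation and communication in regulated settings.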

 

A real-world example: Predicting stroke risk with claims data

In a recent study, Cytel used data from over 2.5 million insured individuals to predict the risk of stroke hospitalization. Using only claims-based features such as age, medication use, comorbidities (e.g., diabetes, hypertension), and health service utilization, we compared the performance of several models, including:

  • Logistic Regression
  • Regularized linear models (Lasso and Ridge)
  • XGBoost (a state-of-the-art ML algorithm)

The results? All models achieved similar predictive performance, with area under the ROC curve (AUC) values around 0.81. Logistic regression — simple, explainable, and well-established — performed on par with XGBoost, demonstrating that advanced complexity wasn’t necessary to achieve meaningful predictive power.
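For readers less familiar with the metric, the AUC used to compare these models has a simple interpretation: the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. The sketch below computes it pairwise on made-up risk scores (not data from the study).

```python
# Minimal sketch of the AUC metric: the probability that a randomly
# chosen positive case outranks a randomly chosen negative one.
# Scores are invented for illustration.

def auc(pos_scores, neg_scores):
    """Pairwise (rank-based) estimate of the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Predicted stroke risks for patients who did (pos) / did not (neg)
# go on to be hospitalized:
pos = [0.82, 0.64, 0.73, 0.51]
neg = [0.30, 0.45, 0.62, 0.20, 0.55]
print(auc(pos, neg))  # 0.9
```

An AUC of 0.5 is no better than chance and 1.0 is perfect ranking, so the roughly 0.81 reported above indicates genuinely useful discrimination from claims features alone.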

 

Transparency enables trust and action

What sets models like logistic regression apart is their explainability. Stakeholders can see precisely how risk factors like atrial fibrillation, hypercholesterolemia, or age contribute to predicted stroke risk. This level of clarity is essential not only for clinicians making decisions, but also for data governance, compliance, and patient communication.

In a time when “black box” AI models are under increasing scrutiny, explainable models offer a pragmatic path forward — especially when paired with large-scale real-world datasets like claims data.

 

Keep it simple, keep it transparent

Healthcare doesn’t just need powerful algorithms — it needs trustworthy ones. As our study shows, standard machine learning models remain highly relevant, especially when applied to well-structured real-world data. Claims databases, in particular, offer a rich foundation for developing these models and making preventive healthcare smarter, earlier, and more accessible.