Building a New Evidence Base for Rare Diseases by Structuring Clinical Narratives with Generative AI


April 23, 2026

Manuel Cossio

Rare diseases present a paradox in modern healthcare. Individually, they affect small populations, yet collectively they impact millions of patients worldwide. Despite this, progress in diagnosis, treatment, and research remains slow. The fundamental challenge is not only scientific complexity but also a persistent lack of usable data.

Traditional sources of real-world data — electronic health records, claims databases, and clinical trials — struggle to capture rare disease populations at a meaningful scale. Patients are geographically dispersed, frequently misdiagnosed, and often excluded from structured datasets. As a result, generating robust evidence in rare diseases remains difficult.

At the same time, an overlooked resource has quietly accumulated over decades: clinical case reports. These narratives contain detailed descriptions of real patients, their symptoms, diagnostic journeys, and outcomes. The challenge has never been their value, but rather their accessibility and structure.

Recent advances in large language models (LLMs) suggest that this barrier may finally be overcome.

Case reports as a foundation for real-world evidence

Case reports represent one of the richest forms of clinical documentation available. Unlike structured datasets, they capture the full nuance of patient care, including symptom evolution, diagnostic uncertainty, and physician reasoning. They are inherently real-world, reflecting how diseases actually present and are managed in practice.

However, their utility has historically been limited. Case reports are written in free text, scattered across millions of publications, and lack standardization. Extracting meaningful insights at scale has required significant manual effort, making systematic use impractical.

The RareArena study demonstrates a new approach. By leveraging LLMs, researchers were able to automatically collect and process hundreds of thousands of case reports from PubMed, filter them for rare diseases, and transform them into a structured dataset comprising tens of thousands of patient cases. This process effectively converts unstructured clinical narratives into analyzable real-world data.

This shift is significant. It reframes case reports not as isolated anecdotes, but as components of a scalable data asset.

From unstructured text to scalable patient populations

One of the most important implications of this approach is the ability to expand patient populations in rare disease studies. Traditional datasets are constrained by institutional boundaries and data availability. In contrast, case reports aggregate knowledge globally, capturing patients from diverse healthcare systems and settings.

By structuring these reports, LLMs enable the creation of virtual cohorts that far exceed what any single registry or database could provide. Diagnoses can be standardized using reference ontologies, symptoms can be normalized, and cases can be grouped into clinically meaningful categories.

The RareArena dataset, for example, spans thousands of rare diseases and tens of thousands of patient cases, representing one of the broadest collections of rare disease data assembled to date. This kind of scale opens new possibilities for understanding disease heterogeneity, identifying subpopulations, and generating evidence where none previously existed.

In effect, LLMs allow researchers to move from fragmented observations to aggregated real-world populations.
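The standardization and grouping steps described above can be sketched in a few lines. This is a minimal illustration, not the RareArena pipeline: the `CaseRecord` type, the `DIAGNOSIS_SYNONYMS` table, and the `DX:` codes are all hypothetical placeholders, and a real workflow would map diagnoses to a reference ontology rather than a hand-written dictionary.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

# Hypothetical synonym table. The codes are placeholders, not real
# ontology identifiers; a production pipeline would query a reference ontology.
DIAGNOSIS_SYNONYMS = {
    "fabry disease": "DX:0001",
    "anderson-fabry disease": "DX:0001",
    "alpha-galactosidase a deficiency": "DX:0001",
    "pompe disease": "DX:0002",
    "glycogen storage disease type ii": "DX:0002",
}


@dataclass(frozen=True)
class CaseRecord:
    """One structured patient case extracted from a case report."""
    case_id: str
    diagnosis_text: str
    symptoms: tuple[str, ...]


def normalize_diagnosis(text: str) -> str | None:
    """Lowercase, collapse whitespace, and look up a standard code."""
    key = " ".join(text.lower().split())
    return DIAGNOSIS_SYNONYMS.get(key)


def build_cohorts(cases: list[CaseRecord]) -> tuple[dict[str, list[str]], list[str]]:
    """Group case IDs by standardized diagnosis; collect unmapped cases."""
    cohorts: dict[str, list[str]] = defaultdict(list)
    unmapped: list[str] = []
    for case in cases:
        code = normalize_diagnosis(case.diagnosis_text)
        if code is None:
            unmapped.append(case.case_id)  # flag for manual review
        else:
            cohorts[code].append(case.case_id)
    return dict(cohorts), unmapped
```

Keeping unmapped cases in a separate list, rather than dropping them, reflects the point that every rare disease case is valuable: anything the lookup cannot resolve is surfaced for review instead of silently lost.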

Capturing the diagnostic journey

A particularly valuable aspect of the RareArena framework is its alignment with real clinical workflows. The dataset distinguishes between two stages of diagnosis: early suspicion based on symptoms alone, and confirmation after diagnostic testing.

This distinction mirrors how rare diseases are encountered in practice. Patients often experience long diagnostic odysseys, with years passing before a correct diagnosis is reached. By separating these stages, the dataset captures both the uncertainty of early presentation and the clarity provided by confirmatory tests.

This structure enables deeper analysis of diagnostic pathways, including where delays occur and how different signals contribute to clinical decision-making. It also provides a foundation for developing tools that support earlier recognition of rare diseases, an area where unmet need remains substantial.
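As a rough sketch of the two-stage structure described above, a single structured case can be exposed as two different inputs: one containing symptoms only, and one that also includes test results. The `StructuredCase` type and its field names are assumptions made for illustration and are not taken from the RareArena dataset schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StructuredCase:
    """Hypothetical structured case; field names are illustrative only."""
    symptoms: list[str]
    test_results: list[str] = field(default_factory=list)
    diagnosis: str = ""  # held back as the label, never part of the input


def screening_view(case: StructuredCase) -> dict[str, list[str]]:
    """Early-suspicion input: presenting symptoms only."""
    return {"symptoms": list(case.symptoms)}


def confirmation_view(case: StructuredCase) -> dict[str, list[str]]:
    """Confirmation input: symptoms plus diagnostic test results."""
    return {
        "symptoms": list(case.symptoms),
        "test_results": list(case.test_results),
    }
```

Deriving both views from the same record makes it possible to compare how much the added test results change a model's or a clinician's answer for the same patient.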

Preserving clinical complexity in real-world data

A common limitation of many real-world datasets is the loss of clinical nuance. Structured data often simplifies patient information, omitting negative findings, confounding symptoms, and contextual details that are critical for diagnosis.

Case reports, by contrast, preserve this complexity. The RareArena study shows that most cases retain features such as negative findings (symptoms explicitly noted as absent) and confounding factors, reflecting the challenges physicians face in real-world settings. This makes the resulting dataset not only large, but also clinically realistic.

Maintaining this level of detail is essential for rare diseases, where subtle distinctions can significantly alter diagnosis and treatment. LLMs play a key role here by rephrasing and structuring text while preserving the underlying clinical information.

The result is a form of real-world data that is both scalable and rich in context.

Implications for research and clinical development

The ability to generate structured datasets from case reports has far-reaching implications. For researchers, it enables the study of rare diseases across larger and more diverse populations than previously possible. Patterns of presentation, progression, and response to treatment can be explored with greater statistical power.

In clinical development, this approach offers new ways to identify and characterize patient populations. It can support the design of clinical trials by highlighting underrepresented groups and informing inclusion criteria. It also provides a potential source of external evidence, complementing traditional trial data.

Beyond research, there is a clear opportunity to improve clinical decision support. The RareArena study demonstrates that LLMs already show meaningful capability in diagnosing rare diseases, particularly when provided with comprehensive clinical information. While not yet sufficient for standalone use, these models can assist clinicians by surfacing relevant diagnostic possibilities.

Limitations and considerations

Despite its promise, this approach is not without limitations. Case reports are inherently selective, often focusing on unusual or severe presentations. This introduces potential bias in the resulting datasets. Additionally, the data is retrospective and curated, rather than continuously collected.

LLMs themselves introduce another layer of complexity. While they are effective at extracting and structuring information, they can also propagate errors or introduce subtle inaccuracies. Ensuring data quality and validation remains critical.
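One simple form such validation could take is a grounding check that flags extracted items with no literal support in the source narrative. The sketch below is an assumption-laden first pass, not a method from the study: the function names are hypothetical, and because LLMs often rephrase, exact string matching will over-flag paraphrases, so flagged items are candidates for human review rather than confirmed errors.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace for a tolerant comparison."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def ungrounded_fields(source_text: str, extracted: list[str]) -> list[str]:
    """Return extracted values that do not appear verbatim in the source.

    This is a crude substring check: it catches items with no textual
    support but cannot tell a harmless paraphrase from a real error.
    """
    haystack = _normalize(source_text)
    return [value for value in extracted if _normalize(value) not in haystack]
```

For example, given a narrative mentioning acroparesthesia and angiokeratomas, an extracted "renal failure" would be returned for review because the phrase never occurs in the text.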

The RareArena study also highlights that even the most advanced models are far from perfect in diagnostic tasks, particularly in early-stage scenarios. This reinforces the need to view these tools as augmentative rather than autonomous.
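To make "far from perfect" measurable, diagnostic output is often scored as a ranked list of candidate diagnoses. The sketch below computes top-k accuracy, one common choice for this kind of task. The metric and the function name are assumptions for illustration and are not presented as the study's own evaluation protocol.

```python
def top_k_accuracy(
    ranked_predictions: list[list[str]],
    gold_labels: list[str],
    k: int,
) -> float:
    """Fraction of cases whose true diagnosis is among the first k predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and labels must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_labels)
        if gold in preds[:k]
    )
    return hits / len(gold_labels)
```

Comparing this score on symptom-only inputs against inputs that include test results is one way to quantify how much harder the early-stage scenario is.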

A shift from data scarcity to data unlocking

What emerges from this work is a broader shift in how we think about data in rare diseases. The challenge is no longer solely about collecting new data, but about unlocking the value of existing information.

Case reports represent decades of accumulated clinical knowledge. With LLMs, it becomes possible to systematically extract, structure, and scale that knowledge into usable real-world data. This approach does not replace traditional data sources, but it significantly expands the available evidence base.

For rare diseases, where every patient case is valuable, this shift is particularly impactful.

Toward a more complete picture of rare diseases

The combination of case reports and large language models offers a compelling new pathway for advancing rare disease research. By transforming unstructured narratives into structured datasets, it enables the creation of larger, more representative patient populations and more realistic models of clinical care.

While challenges remain, the potential is clear. This approach can accelerate diagnosis, inform clinical development, and ultimately contribute to better outcomes for patients who have long been underserved.

In a field defined by scarcity, the ability to unlock hidden data may prove to be one of the most important innovations yet.


Manuel Cossio

Head of AI Solutions, Real-World Evidence, Value, and Access

Manuel Cossio is Head of AI Solutions, Real-World Evidence, Value, and Access at Cytel. Manuel is an AI engineer with over a decade of experience in healthcare AI research and development. He currently leads the creation of generative AI solutions aimed at optimizing clinical trials, focusing on hierarchical multi-agent systems with multistage data governance and human-in-the-loop dynamic behavior control.

Manuel has an extensive research background with publications in computer vision, natural language processing, and genetic data analysis. He is a registered Key Opinion Leader at the Digital Medicine Society, a member of the ISPOR Community of Interest in AI, a Generative AI evaluator for the EU Commission, and an AI researcher at UB-UPC-Barcelona Supercomputing Center.

He holds an M.Sc. in Translational Medicine from Universitat de Barcelona, a Master of Engineering in AI from Universitat Politècnica de Catalunya, and an M.Sc. in Neuroscience from Universitat Autònoma de Barcelona.
