Building a New Evidence Base for Rare Diseases by Structuring Clinical Narratives with Generative AI
April 28, 2026
Rare diseases present a paradox in modern healthcare. Individually, they affect small populations, yet collectively they impact millions of patients worldwide. Despite this, progress in diagnosis, treatment, and research remains slow. The fundamental challenge is not only scientific complexity but also a persistent lack of usable data.
Traditional sources of real-world data — electronic health records, claims databases, and clinical trials — struggle to capture rare disease populations at a meaningful scale. Patients are geographically dispersed, frequently misdiagnosed, and often excluded from structured datasets. As a result, generating robust evidence in rare diseases remains difficult.
At the same time, an overlooked resource has quietly accumulated over decades: clinical case reports. These narratives contain detailed descriptions of real patients, their symptoms, diagnostic journeys, and outcomes. The challenge has never been their value, but rather their accessibility and structure.
Recent advances in large language models (LLMs) suggest that this barrier may finally be overcome.
Case reports as a foundation for real-world evidence
Case reports represent one of the richest forms of clinical documentation available. Unlike structured datasets, they capture the full nuance of patient care, including symptom evolution, diagnostic uncertainty, and physician reasoning. They are inherently real-world, reflecting how diseases actually present and are managed in practice.
However, their utility has historically been limited. Case reports are written in free text, scattered across millions of publications, and lack standardization. Extracting meaningful insights at scale has required significant manual effort, making systematic use impractical.
The RareArena study demonstrates a new approach. By leveraging LLMs, researchers were able to automatically collect and process hundreds of thousands of case reports from PubMed, filter them for rare diseases, and transform them into a structured dataset comprising tens of thousands of patient cases. This process effectively converts unstructured clinical narratives into analyzable real-world data.
This shift is significant. It reframes case reports not as isolated anecdotes, but as components of a scalable data asset.
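To make the collect, filter, and structure steps concrete, here is a minimal Python sketch of that kind of pipeline. It is not the RareArena implementation: the `StructuredCase` fields, the `build_dataset` function, and the keyword filter are all hypothetical, and the two LLM-backed steps are replaced by simple stand-ins so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class StructuredCase:
    """One patient case in structured form (field names are illustrative)."""
    source_id: str
    diagnosis: str
    symptoms: list = field(default_factory=list)
    test_results: list = field(default_factory=list)


def build_dataset(
    reports: Iterable[dict],
    is_rare_disease: Callable[[str], bool],
    structure: Callable[[dict], StructuredCase],
) -> list:
    """Keep only reports that pass the filter, then structure each one."""
    return [structure(r) for r in reports if is_rare_disease(r["text"])]


# Stand-ins for the LLM-backed steps (assumptions, not the study's method).
RARE_TERMS = {"fabry disease", "pompe disease"}


def keyword_filter(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in RARE_TERMS)


def toy_structurer(report: dict) -> StructuredCase:
    # A real pipeline would prompt an LLM here; this only copies fields.
    return StructuredCase(
        source_id=report["id"],
        diagnosis=report["diagnosis"],
        symptoms=list(report.get("symptoms", [])),
        test_results=list(report.get("tests", [])),
    )


reports = [
    {
        "id": "r1",
        "text": "A 34-year-old man was diagnosed with Fabry disease.",
        "diagnosis": "Fabry disease",
        "symptoms": ["acroparesthesia"],
        "tests": ["low alpha-galactosidase A activity"],
    },
    {
        "id": "r2",
        "text": "A 60-year-old woman presented with pneumonia.",
        "diagnosis": "pneumonia",
    },
]

dataset = build_dataset(reports, keyword_filter, toy_structurer)
print([case.source_id for case in dataset])  # ['r1']
```

Passing the filter and the structurer in as callables keeps the pipeline shape independent of whichever model performs each step, which also makes the steps easy to test in isolation.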
From unstructured text to scalable patient populations
One of the most important implications of this approach is the ability to expand patient populations in rare disease studies. Traditional datasets are constrained by institutional boundaries and data availability. In contrast, case reports aggregate knowledge globally, capturing patients from diverse healthcare systems and settings.
By structuring these reports, LLMs enable the creation of virtual cohorts that far exceed what any single registry or database could provide. Diagnoses can be standardized using reference ontologies, symptoms can be normalized, and cases can be grouped into clinically meaningful categories.
The RareArena dataset, for example, spans thousands of rare diseases and tens of thousands of patient cases, representing one of the broadest collections of rare disease data assembled to date. This kind of scale opens new possibilities for understanding disease heterogeneity, identifying subpopulations, and generating evidence where none previously existed.
In effect, LLMs allow researchers to move from fragmented observations to aggregated real-world populations.
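The standardization and grouping described above can be sketched as follows. This is an illustrative assumption rather than the study's method: the synonym table is hand-written, and the `ONT:` identifiers are placeholders, not codes from any real ontology.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical synonym table: surface forms -> placeholder concept IDs.
SYNONYM_TO_ID = {
    "fabry disease": "ONT:0001",
    "anderson-fabry disease": "ONT:0001",
    "pompe disease": "ONT:0002",
    "glycogen storage disease type ii": "ONT:0002",
}


def normalize_diagnosis(label: str) -> Optional[str]:
    """Map a free-text diagnosis to a concept ID, or None if unknown."""
    key = " ".join(label.lower().split())  # lowercase, collapse whitespace
    return SYNONYM_TO_ID.get(key)


def build_cohorts(cases):
    """Group case IDs by concept ID; collect unmapped cases separately."""
    cohorts = defaultdict(list)
    unmapped = []
    for case_id, label in cases:
        concept = normalize_diagnosis(label)
        if concept is None:
            unmapped.append(case_id)
        else:
            cohorts[concept].append(case_id)
    return dict(cohorts), unmapped


cases = [
    ("c1", "Fabry disease"),
    ("c2", "Anderson-Fabry  Disease"),
    ("c3", "Pompe disease"),
    ("c4", "unclassified myopathy"),
]
cohorts, unmapped = build_cohorts(cases)
print(cohorts)   # {'ONT:0001': ['c1', 'c2'], 'ONT:0002': ['c3']}
print(unmapped)  # ['c4']
```

Keeping unmapped cases in a separate list, rather than dropping them, leaves a visible queue for manual review or for extending the synonym table.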
Capturing the diagnostic journey
A particularly valuable aspect of the RareArena framework is its alignment with real clinical workflows. The dataset distinguishes between two stages of diagnosis: early suspicion based on symptoms alone, and confirmation after diagnostic testing.
This distinction mirrors how rare diseases are encountered in practice. Patients often experience long diagnostic odysseys, with years passing before a correct diagnosis is reached. By separating these stages, the dataset captures both the uncertainty of early presentation and the clarity provided by confirmatory tests.
This structure enables deeper analysis of diagnostic pathways, including where delays occur and how different signals contribute to clinical decision-making. It also provides a foundation for developing tools that support earlier recognition of rare diseases, an area where unmet need remains substantial.
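One way to represent the two stages is to derive two views from the same structured case: one exposing symptoms only, the other adding test results. The sketch below is a hypothetical illustration; the `CaseRecord` fields and the view function names are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class CaseRecord:
    """A structured case; field names are illustrative assumptions."""
    symptoms: list
    test_results: list = field(default_factory=list)
    final_diagnosis: str = ""


def early_suspicion_view(case: CaseRecord) -> dict:
    """Stage 1: what is known before any diagnostic testing."""
    return {"symptoms": list(case.symptoms)}


def confirmation_view(case: CaseRecord) -> dict:
    """Stage 2: symptoms plus the results of diagnostic tests."""
    return {
        "symptoms": list(case.symptoms),
        "test_results": list(case.test_results),
    }


case = CaseRecord(
    symptoms=["progressive muscle weakness", "exertional dyspnea"],
    test_results=["reduced acid alpha-glucosidase activity"],
    final_diagnosis="Pompe disease",
)

stage_one = early_suspicion_view(case)
stage_two = confirmation_view(case)
print(sorted(stage_one))  # ['symptoms']
print(sorted(stage_two))  # ['symptoms', 'test_results']
```

Neither view includes `final_diagnosis`, so the label stays held out and the same case can be used to assess a clinician-support tool at both stages.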
Preserving clinical complexity in real-world data
A common limitation of many real-world datasets is the loss of clinical nuance. Structured data often simplifies patient information, omitting negative findings, confounding symptoms, and contextual details that are critical for diagnosis.
Case reports, by contrast, preserve this complexity. The RareArena study shows that most cases retain features such as negative findings (symptoms explicitly reported as absent) and confounding factors, reflecting the challenges physicians face in real-world settings. This makes the resulting dataset not only large, but also clinically realistic.
Maintaining this level of detail is essential for rare diseases, where subtle distinctions can significantly alter diagnosis and treatment. LLMs play a key role here by rephrasing and structuring text while preserving the underlying clinical information.
The result is a form of real-world data that is both scalable and rich in context.
Implications for research and clinical development
The ability to generate structured datasets from case reports has far-reaching implications. For researchers, it enables the study of rare diseases across larger and more diverse populations than previously possible. Patterns of presentation, progression, and response to treatment can be explored with greater statistical power.
In clinical development, this approach offers new ways to identify and characterize patient populations. It can support the design of clinical trials by highlighting underrepresented groups and informing inclusion criteria. It also provides a potential source of external evidence, complementing traditional trial data.
Beyond research, there is a clear opportunity to improve clinical decision support. The RareArena study demonstrates that LLMs already show meaningful capability in diagnosing rare diseases, particularly when provided with comprehensive clinical information. While not yet sufficient for standalone use, these models can assist clinicians by surfacing relevant diagnostic possibilities.
Limitations and considerations
Despite its promise, this approach is not without limitations. Case reports are inherently selective, often focusing on unusual or severe presentations. This introduces potential bias in the resulting datasets. Additionally, the data is retrospective and curated, rather than continuously collected.
LLMs themselves introduce another layer of complexity. While they are effective at extracting and structuring information, they can also propagate errors or introduce subtle inaccuracies. Ensuring data quality and validation remains critical.
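As one example of a lightweight validation step, extracted terms can be checked against the source narrative and anything without textual support flagged for review. The sketch below is an assumption about how such a check might look, not a method described in the study, and the function name `unsupported_terms` is hypothetical.

```python
def unsupported_terms(narrative: str, extracted: list) -> list:
    """Return extracted terms that do not appear verbatim in the narrative.

    Matching is case-insensitive and ignores differences in whitespace.
    """
    haystack = " ".join(narrative.lower().split())
    return [
        term
        for term in extracted
        if " ".join(term.lower().split()) not in haystack
    ]


narrative = (
    "The patient reported burning pain in the hands and feet. "
    "Renal function was normal."
)
extracted = ["burning pain", "Renal function was normal", "hearing loss"]

flagged = unsupported_terms(narrative, extracted)
print(flagged)  # ['hearing loss']
```

A verbatim check like this is deliberately crude. Because LLMs may rephrase the text they structure, a flagged term is a prompt for human review, not proof of an error.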
The RareArena study also highlights that even the most advanced models are far from perfect in diagnostic tasks, particularly in early-stage scenarios. This reinforces the need to view these tools as augmentative rather than autonomous.
A shift from data scarcity to data unlocking
What emerges from this work is a broader shift in how we think about data in rare diseases. The challenge is no longer solely about collecting new data, but about unlocking the value of existing information.
Case reports represent decades of accumulated clinical knowledge. With LLMs, it becomes possible to systematically extract, structure, and scale that knowledge into usable real-world data. This approach does not replace traditional data sources, but it significantly expands the available evidence base.
For rare diseases, where every patient case is valuable, this shift is particularly impactful.
Toward a more complete picture of rare diseases
The combination of case reports and large language models offers a compelling new pathway for advancing rare disease research. By transforming unstructured narratives into structured datasets, it enables the creation of larger, more representative patient populations and more realistic models of clinical care.
While challenges remain, the potential is clear. This approach can accelerate diagnosis, inform clinical development, and ultimately contribute to better outcomes for patients who have long been underserved.
In a field defined by scarcity, the ability to unlock hidden data may prove to be one of the most important innovations yet.
Manuel Cossio
Head of AI Solutions, Real-World Evidence, Value, and Access
Manuel Cossio is Head of AI Solutions, Real-World Evidence, Value, and Access at Cytel. Manuel is an AI engineer with over a decade of experience in healthcare AI research and development. He currently leads the creation of generative AI solutions aimed at optimizing clinical trials, focusing on hierarchical multi-agent systems with multistage data governance and human-in-the-loop dynamic behavior control.
Manuel has an extensive research background with publications in computer vision, natural language processing, and genetic data analysis. He is a registered Key Opinion Leader at the Digital Medicine Society, a member of the ISPOR Community of Interest in AI, a Generative AI evaluator for the EU Commission, and an AI researcher at UB-UPC-Barcelona Supercomputing Center.
He holds an M.Sc. in Translational Medicine from Universitat de Barcelona, a Master of Engineering in AI from Universitat Politècnica de Catalunya, and an M.Sc. in Neuroscience from Universitat Autònoma de Barcelona.