Breaking Barriers in Rare Disease Research with Generative AI and Synthetic Data
October 14, 2025
In healthcare innovation, one of the most pressing challenges lies in rare disease research. There are approximately 7,000 rare diseases affecting over 300 million people worldwide. Because any single disease may affect only a handful of patients dispersed globally, gathering sufficient data to power robust clinical studies or predictive models is a monumental hurdle. However, a solution is emerging at the intersection of generative AI and real-world data (RWD): a novel approach with the potential to reshape possibilities and unlock insights to address unmet medical needs in rare diseases.
The rare disease data dilemma
In the U.S., rare diseases are defined as conditions affecting fewer than 200,000 people. Despite their low individual prevalence, rare diseases collectively impose a significant burden on both patients and healthcare systems.
Research and development in rare diseases often face a vicious cycle: low prevalence leads to data scarcity, and data scarcity in turn makes studies harder to design and run. Traditional clinical trials are often infeasible or statistically underpowered because the pool of participants is so limited.
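To make "statistically underpowered" concrete, here is a minimal sketch of the approximate power of a two-arm trial that compares means with a two-sided z-test. The function name, effect size, standard deviation, and sample sizes are illustrative assumptions, not figures from any study.

```python
from math import sqrt
from statistics import NormalDist


def two_arm_power(n_per_arm, effect, sd, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected shift of the test statistic when the true difference is `effect`.
    shift = effect / (sd * sqrt(2 / n_per_arm))
    # Probability of landing in either rejection tail.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Assumed scenario: true difference of 0.5 standard deviations.
for n in (10, 25, 50, 100):
    print(n, round(two_arm_power(n, effect=0.5, sd=1.0), 2))
```

Under these assumptions, power is roughly 20% with 10 participants per arm and above 90% with 100 per arm, which is why a small recruitable population so often leaves a trial unable to detect a real effect.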
Meanwhile, RWD sources such as electronic health records (EHRs), insurance claims, registries, and patient-reported outcomes offer valuable, albeit messy and fragmented, glimpses into the patient journey. Yet even RWD struggles to paint a complete picture in rare diseases. This is where generative AI steps in.
Enter generative AI: Making data where there is none
Generative AI — especially models like Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and, more recently, large foundation models — has a transformative ability: it can learn patterns from limited datasets and generate synthetic yet realistic datasets.
How it works
- Learning from RWD: Even small datasets from rare disease patients can be used to train and fine-tune generative models. These models identify patterns, distributions, and time-dependent relationships present in the data.
- Synthesizing patients: Once trained, the model can create new, synthetic patient records that preserve the statistical properties and characteristics of the original data. These “digital patients” simulate disease progression, treatment responses, and comorbidities.
- Validating realism: Synthetic data must be validated to ensure it reflects the real-world data it was trained on. Techniques like distributional comparison, propensity scoring, and expert validation are used to ensure accuracy and utility.
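The three steps above can be sketched end to end on toy data. This is a deliberately simple stand-in: it fits a bivariate Gaussian rather than a GAN, VAE, or foundation model, and the "real" cohort is itself simulated. The variable meanings (age at onset, a biomarker), function names, and sample sizes are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def fit(records):
    """Learn means, standard deviations, and correlation of two variables."""
    xs, ys = zip(*records)
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in records) / (len(records) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample(model, n, rng):
    """Draw n synthetic records from the fitted bivariate Gaussian."""
    rho = model["rho"]
    resid = sqrt(max(0.0, 1 - rho ** 2))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + resid * z2)
        out.append((x, y))
    return out


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


rng = random.Random(0)

# Step 1: a small toy "real" cohort of 30 patients (simulated, not actual RWD).
real = []
for _ in range(30):
    age = rng.gauss(12, 4)
    real.append((age, 2.0 + 0.3 * age + rng.gauss(0, 1)))
model = fit(real)

# Step 2: synthesize 500 "digital patients" from the learned parameters.
synthetic = sample(model, 500, rng)

# Step 3: distributional comparison of each variable and of the correlation.
for idx, name in enumerate(("age_at_onset", "biomarker")):
    d = ks_statistic([r[idx] for r in real], [s[idx] for s in synthetic])
    print(name, "KS statistic:", round(d, 3))
print("correlation real vs synthetic:",
      round(model["rho"], 3), round(fit(synthetic)["rho"], 3))
```

A small KS statistic and a similar correlation suggest the synthetic records track the training data on these two checks only. They say nothing about clinical plausibility, which is where expert validation comes in.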
Why synthetic data matters for rare diseases
Synthetic data can enhance rare disease clinical research in many ways, including:
1. Augmenting small cohorts
Synthetic data can boost sample sizes for rare disease studies, enabling:
- Simulation of clinical trials
- Development of more robust predictive models
- Generation of synthetic control arms where traditional controls are ethically or logistically impractical
2. Enhancing privacy
In rare diseases, the risk of patient re-identification is elevated because of unique phenotypes or genetic markers. Synthetic data helps protect patient privacy while preserving the utility of the data.
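One way to see why re-identification risk is elevated in small cohorts is to count how many records are unique on a set of quasi-identifiers. The sketch below is a simplified uniqueness check, not a full privacy assessment. The cohort, field names, and values are hypothetical.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears only once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


# Hypothetical toy cohort; field names and values are illustrative only.
cohort = [
    {"age_band": "0-9", "region": "North", "variant": "A"},
    {"age_band": "0-9", "region": "North", "variant": "A"},
    {"age_band": "0-9", "region": "South", "variant": "B"},
    {"age_band": "10-19", "region": "South", "variant": "B"},
    {"age_band": "10-19", "region": "East", "variant": "C"},
]

print(unique_fraction(cohort, ["age_band", "region", "variant"]))
print(unique_fraction(cohort, ["age_band"]))
```

In this toy cohort, three of five records are unique once region and genetic variant are combined with age band, while none are unique on age band alone. The more distinguishing attributes a dataset carries, the easier it becomes to single out an individual.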
3. Facilitating global collaboration
Because synthetic data is de-identified, it facilitates data sharing across researchers, institutions, and borders, minimizing regulatory hurdles and fostering collaborative discovery.
4. Accelerating drug development
Pharma and biotech companies can use synthetic data to:
- Test drug targeting strategies
- Model long-term outcomes
- Conduct in silico trials in the earliest stages of development
Challenges and considerations
While promising, this approach is not without its challenges:
- Bias amplification: Synthetic data reflects the biases of its training data. If the RWD is incomplete or skewed, the synthetic outputs will be too. Strategies to handle bias are essential.
- Regulatory acceptance: Regulatory bodies are still evaluating how to incorporate synthetic data into approval pathways.
- Validation standards: There is a need for consistent benchmarks and best practices for validating synthetic data, in terms of both privacy and utility, and for validating generative AI applications in healthcare more broadly.
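As one small example of what a bias check might look like, the sketch below compares subgroup shares between a real cohort and its synthetic counterpart. The function names, the "sex" field, and the counts are hypothetical and chosen only to illustrate the idea.

```python
from collections import Counter


def subgroup_shares(records, key):
    """Share of records falling in each subgroup of `key`."""
    counts = Counter(r[key] for r in records)
    return {group: n / len(records) for group, n in counts.items()}


def representation_gaps(real, synthetic, key):
    """Synthetic share minus real share for every subgroup seen in either dataset."""
    real_shares = subgroup_shares(real, key)
    syn_shares = subgroup_shares(synthetic, key)
    groups = sorted(set(real_shares) | set(syn_shares))
    return {g: syn_shares.get(g, 0.0) - real_shares.get(g, 0.0) for g in groups}


# Hypothetical cohorts: the synthetic data over-represents one subgroup.
real = [{"sex": "F"}] * 6 + [{"sex": "M"}] * 4
synthetic = [{"sex": "F"}] * 8 + [{"sex": "M"}] * 2

for group, gap in representation_gaps(real, synthetic, "sex").items():
    print(group, round(gap, 2))
```

A check like this only detects drift introduced by the generative model. It cannot reveal bias that was already present in the RWD relative to the true patient population, which requires external reference data or clinical judgment.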
Looking ahead
The marriage of generative AI and RWD opens new doors for rare disease research. With the ability to synthesize patient data that preserves real-world complexity, we can begin to break free from the constraints of scarcity — generating insights, hypotheses, and interventions that were once out of reach.
As we move forward, interdisciplinary collaboration among clinicians, data scientists, regulatory bodies, and patient advocacy groups will be key to harnessing this potential ethically and effectively.
Manuel Cossio
Director, Innovation and Strategic Consulting
Manuel Cossio is Director, Innovation and Strategic Consulting at Cytel. Manuel is an AI engineer with over a decade of experience in healthcare AI research and development. He currently leads the creation of generative AI solutions aimed at optimizing clinical trials, focusing on hierarchical multi-agent systems with multistage data governance and human-in-the-loop dynamic behavior control.
Manuel has an extensive research background with publications in computer vision, natural language processing, and genetic data analysis. He is a registered Key Opinion Leader at the Digital Medicine Society, a member of the ISPOR Community of Interest in AI, a Generative AI evaluator for the EU Commission, and an AI researcher at UB-UPC-Barcelona Supercomputing Center.
He holds an M.Sc. in Translational Medicine from Universitat de Barcelona, a Master of Engineering in AI from Universitat Politècnica de Catalunya, and an M.Sc. in Neuroscience from Universitat Autònoma de Barcelona.
Jonas Häggström
Vice President, Real World Evidence
Jonas Häggström is the Vice President of Real World Evidence and the Innovation Hub at Cytel, and has 25 years of experience working across all phases in drug development and real-world evidence applications, with emphasis on neurosciences, oncology, rare diseases, and immunology. Jonas frequently serves on Data Monitoring Committees and gives input to regulatory authority interactions, drug development strategies, statistical methodology, and quantitative decision-making. In addition to his role at Cytel, Jonas also serves as an expert statistical advisor to the Gates Foundation.
Prior to joining Cytel, Jonas was the Chief Scientific Officer at MTEK Sciences, a boutique analytics firm specializing in innovative clinical trial designs and real-world advanced analytics, and Neuroscience Head of Biostatistics at AstraZeneca.