A New Frontier in Real-World Evidence: Can AI Create Reliable Synthetic Trial Data?
May 14, 2026
Synthetic data is a promising innovation for clinical studies that incorporate an external control arm. Relying solely on traditional control groups can be costly, time-consuming, or even unethical. Instead, researchers are exploring ways to generate “synthetic” patient cohorts that behave like real ones.
At ISPOR US 2026 in Philadelphia, we will be presenting a pilot study that takes an important step in this direction. Our work explores how large language models (LLMs), the same technology behind modern AI assistants, can be used to generate synthetic clinical trial datasets suitable for external control arms (ECAs).
Why external control arms matter
ECAs are increasingly important in clinical trials, especially in areas where recruiting patients into placebo groups is difficult or undesirable. By using existing data to simulate a control group, researchers can accelerate trials and reduce patient burden. This challenge is especially pronounced in rare diseases, oncology, gene and cell therapies, and severe or life‑threatening conditions, where patients and clinicians are understandably reluctant to accept randomization to non‑active treatment arms.
However, for ECAs to be useful, they must meet two critical requirements:
- They need to closely resemble real patient populations in terms of demographics and clinical characteristics.
- They must protect patient privacy and be reproducible for regulatory scrutiny.
This is where AI — and specifically LLMs — enters the picture.
Two approaches to generating synthetic data
In our study, we evaluated two different ways of using LLMs to generate synthetic clinical trial datasets.
The first approach was direct generation. Here, the LLM was given access to the original dataset along with a variable dictionary and asked to generate a new synthetic dataset in a single step. This method is fast and intuitive.
The second approach was more structured and code-driven. Instead of generating the dataset directly, the LLM created a Python-based pipeline that performed bootstrapping and anonymization. This pipeline included a noise injection mechanism, where small amounts of statistical “noise” were added to numerical variables. The noise was carefully calibrated — set to 5% of each variable’s standard deviation — and values were constrained within realistic ranges to maintain clinical plausibility.
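The post does not publish the pipeline itself, so the following is only a minimal sketch of what a bootstrapping-plus-calibrated-noise step could look like; the function name `synthesize`, the toy cohort, and the bounds are illustrative assumptions, with the 5%-of-standard-deviation noise scale and range constraints taken from the description above.

```python
import numpy as np
import pandas as pd

def synthesize(df, numeric_bounds, noise_frac=0.05, n=100, seed=42):
    """Illustrative sketch: bootstrap patients, then add Gaussian noise
    scaled to a fraction of each numeric variable's standard deviation."""
    rng = np.random.default_rng(seed)
    # Bootstrap: resample patient rows with replacement
    synth = df.sample(n=n, replace=True, random_state=seed).reset_index(drop=True)
    for col, (lo, hi) in numeric_bounds.items():
        # Noise calibrated to noise_frac (here 5%) of the column's SD
        sigma = noise_frac * df[col].std()
        synth[col] = synth[col] + rng.normal(0.0, sigma, size=n)
        # Constrain values to a clinically plausible range
        synth[col] = synth[col].clip(lo, hi)
    return synth

# Hypothetical toy cohort, not real trial data
rng0 = np.random.default_rng(0)
real = pd.DataFrame({
    "age": rng0.normal(62, 10, 500).round(),
    "bmi": rng0.normal(27, 4, 500).round(1),
})
synthetic = synthesize(real, {"age": (18, 90), "bmi": (15, 60)})
```

The key design point is that the noise scale adapts to each variable's own variability, so a tightly distributed variable is perturbed less in absolute terms than a widely distributed one.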
What we found
Both methods were able to generate synthetic cohorts of 100 patients, demonstrating that LLMs can indeed produce usable clinical datasets.
However, the differences between the two approaches were striking. The direct generation method was extremely fast, completing the task in just 23 seconds with a single prompt. In contrast, the code-based approach took longer — around 40 seconds — and required multiple iterations to refine the pipeline.
Despite the extra effort, the code-driven method delivered better results. It more accurately preserved the statistical properties of the original trial data, including key variables like age, body mass index, sex, and race. The distributions in the synthetic dataset closely matched those of the real population, suggesting that the combination of bootstrapping and calibrated noise was effective.
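The study's exact validation metrics are not detailed in this post, but one simple way to check whether a synthetic cohort preserves population-level properties is to compare summary statistics side by side. The sketch below uses a hypothetical toy cohort; `fidelity_report` is an assumed helper, not part of the study's pipeline.

```python
import numpy as np
import pandas as pd

def fidelity_report(real, synth, numeric_cols):
    """Compare means and standard deviations of real vs. synthetic columns."""
    rows = []
    for col in numeric_cols:
        rows.append({
            "variable": col,
            "real_mean": real[col].mean(),
            "synth_mean": synth[col].mean(),
            "real_sd": real[col].std(),
            "synth_sd": synth[col].std(),
        })
    return pd.DataFrame(rows)

# Toy stand-ins for a real and a synthetic cohort
rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(62, 10, 500)})
synth = pd.DataFrame({"age": rng.normal(62, 10, 100)})
report = fidelity_report(real, synth, ["age"])
```

In practice one would extend this to categorical variables (e.g., sex and race proportions) and to formal distributional tests rather than means alone.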
Speed vs. scientific rigor
These findings highlight an important trade-off. Direct LLM generation is excellent for rapid prototyping and exploratory analysis. It allows researchers to quickly create synthetic datasets with minimal effort.
But when it comes to regulatory-grade applications — such as external control arms used in decision-making — transparency and control become essential. The code-augmented approach provides a clear, reproducible process that can be audited and validated. This level of rigor is crucial for building trust with regulators and stakeholders.
Balancing privacy and realism
A key challenge in synthetic data generation is protecting patient privacy without losing the statistical integrity of the dataset. Our study shows that adding carefully calibrated Gaussian noise can strike this balance.
By scaling noise to the variability of each variable and enforcing realistic bounds, we were able to anonymize the data while preserving meaningful population-level characteristics. This approach helps ensure that synthetic datasets remain useful for analysis while reducing the risk of re-identification.
What comes next?
While this pilot study demonstrates the potential of LLM-generated synthetic cohorts, it is only the beginning. Future research needs to explore whether these methods are robust under more challenging conditions.
One critical next step is to evaluate re-identification risk, particularly under adversarial scenarios where attackers actively attempt to reverse-engineer the data. It will also be important to compare noise-based approaches with other privacy-preserving techniques, such as differential privacy. That comparison would also help establish how much noise a model should introduce to meet a given privacy target.
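For readers unfamiliar with the differential-privacy comparator mentioned above, a minimal sketch of the Laplace mechanism for releasing a single bounded mean looks like the following; the function, bounds, and epsilon value are all illustrative assumptions, not part of the study.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Release a differentially private mean via the Laplace mechanism.
    For n values clipped to [lower, upper], the sensitivity of the mean
    is (upper - lower) / n, and the noise scale is sensitivity / epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return clipped.mean() + noise

# Hypothetical ages for a toy cohort
ages = np.random.default_rng(0).normal(62, 10, 500)
private_mean = dp_mean(ages, 18, 90, epsilon=1.0, rng=np.random.default_rng(1))
```

Unlike the SD-scaled noise above, this gives a formal, tunable privacy guarantee (epsilon), which is one reason a head-to-head comparison of the two approaches would be informative.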
Closing thoughts
Synthetic data has the potential to transform clinical research by making trials faster, more efficient, and more ethical. Our findings suggest that LLMs can play a meaningful role in this transformation — but how they are used matters.
Fast, direct generation offers convenience, but structured, code-based approaches provide the reliability and transparency needed for real-world adoption. As the field moves forward, combining the strengths of both may unlock the full potential of AI-driven synthetic data in healthcare.
Interested in learning more?
Join Manuel Cossio and Deepa Jahagirdar, along with Anupama Vasudevan, at ISPOR US for their upcoming presentation, “A Pilot Assessment of LLM-Generated Synthetic Cohorts: A First Step Toward Robust Synthetic Control Arms” on May 18 at 4:00 PM.
Manuel Cossio
Head of AI Solutions, Real-World Evidence, Value, and Access
Manuel Cossio is Head of AI Solutions, Real-World Evidence, Value, and Access at Cytel. Manuel is an AI engineer with over a decade of experience in healthcare AI research and development. He currently leads the creation of generative AI solutions aimed at optimizing clinical trials, focusing on hierarchical multi-agent systems with multistage data governance and human-in-the-loop dynamic behavior control.
Manuel has an extensive research background with publications in computer vision, natural language processing, and genetic data analysis. He is a registered Key Opinion Leader at the Digital Medicine Society, a member of the ISPOR Community of Interest in AI, a Generative AI evaluator for the EU Commission, and an AI researcher at the UB-UPC-Barcelona Supercomputing Center.
He holds an M.Sc. in Translational Medicine from Universitat de Barcelona, a Master of Engineering in AI from Universitat Politècnica de Catalunya, and an M.Sc. in Neuroscience from Universitat Autònoma de Barcelona.
Deepa Jahagirdar
Associate Research Principal, Real World Evidence
Deepa Jahagirdar is Associate Research Principal, Real World Evidence at Cytel. Deepa is the technical lead for study design, methods, and statistics for a variety of projects, including target trials and ECAs. Prior to this position, she completed her Ph.D. in epidemiology at McGill University and her M.Sc. in Health, Community and Development at the London School of Economics. She has ten years of experience developing methodological solutions to complex data and statistical problems in epidemiology, enabling robust findings across various substantive areas. She also has extensive experience working with stakeholders and clients ranging from international funding agencies and corporations to academia and government, and excels at conveying highly technical concepts in meaningful ways to foster effective collaborations.