
Building External Control Arms in Rare Disease Clinical Trials: A Programmer’s Perspective

External Control Arms (ECAs) are attracting growing attention in clinical research, particularly in rare diseases, where traditional randomized trials are often difficult to execute. Much of the discussion focuses on the statistical methodology and study design required to identify appropriate populations and data sources. In practice, however, one of the biggest challenges lies in the programming effort, which is equally critical and often more complex than anticipated.

Given that ECAs are still an evolving area, formal regulatory and industry guidance remains relatively limited. However, available publications are beginning to address key considerations. For example, the FDA’s Data Standards for Drug and Biological Product Submissions Containing Real-World Data (2024) provides recommendations on preparing and submitting RWD-derived datasets, while highlighting challenges in standardization and traceability. In parallel, industry initiatives such as the PHUSE white paper on Data Standards for Non-Interventional Studies outline common data standardization challenges and practical approaches to address them. In addition, dedicated working groups within PHUSE are actively contributing to the development of best practices for ECAs.

This article focuses on the practical challenges from a programming perspective, drawing on recent case study experience.

 

Working with real-world and heterogeneous data

From a programming perspective, ECAs differ significantly from traditional clinical trials. Instead of working with well-structured datasets collected under controlled protocols, programmers are required to integrate data from multiple sources, including Real-World Data (RWD), historical trials, observational studies, and natural history cohorts. Each source brings its own structure, conventions, and limitations, often with poor documentation.

In one case study, external control data was derived from two independent natural history cohorts across different regions. While both sources represented similar patient populations, differences in baseline definitions, visit schedules, and outcome assessments required careful reconciliation.

The programming team aligned key covariates, including baseline age, genetic subtype, and functional scores, to support comparability with the treated trial population. This went far beyond standard data mapping and required informed decisions to standardize variables that were not originally designed for cross-study integration.

 

Harmonization and data standardization

Once data sources are understood, harmonization becomes a critical step. The validity of an ECA depends on ensuring consistent definitions across baseline variables, endpoints, covariates, and visit timing.

In practice, this involves standardizing baseline windows, assessment schedules, coding dictionaries (such as MedDRA, often across multiple versions), laboratory units, endpoint derivations, and covariates used for matching. Across the case studies, this proved to be one of the most time-intensive phases.

Even small differences required careful reconciliation. For example, the same functional score was recorded on different scales across studies, requiring re-derivation into a common format.
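As an illustration of re-deriving a score into a common format, here is a minimal Python sketch. It assumes the two studies used the same instrument recorded on different numeric ranges, so that a linear rescaling is defensible; the study names, the scale ranges, and the 0-100 target scale are hypothetical and not taken from the case study. Where instruments differ in content rather than range, a simple rescaling like this would not be appropriate.

```python
# Hypothetical source scales per study: (minimum, maximum) possible score.
SOURCE_SCALES = {"STUDY_A": (0, 50), "STUDY_B": (0, 100)}


def rescale_score(value, src_min, src_max, dst_min=0.0, dst_max=100.0):
    """Linearly map a score from [src_min, src_max] onto [dst_min, dst_max]."""
    if value is None:
        return None  # keep missing values missing; never default them to 0
    if src_max <= src_min:
        raise ValueError("source scale must satisfy src_max > src_min")
    if not src_min <= value <= src_max:
        raise ValueError(f"value {value} is outside [{src_min}, {src_max}]")
    fraction = (value - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)


def harmonize(records):
    """Add a common-scale score to each record, keeping the original value."""
    harmonized = []
    for rec in records:
        src_min, src_max = SOURCE_SCALES[rec["study"]]
        std = rescale_score(rec["score"], src_min, src_max)
        harmonized.append({**rec, "score_std": std})
    return harmonized


records = [
    {"id": "A-001", "study": "STUDY_A", "score": 25},
    {"id": "B-001", "study": "STUDY_B", "score": 50},
    {"id": "B-002", "study": "STUDY_B", "score": None},
]
harmonized = harmonize(records)
```

Keeping the original score alongside the derived one preserves traceability, and raising an error on out-of-range values surfaces data problems early instead of passing them silently into the analysis datasets.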

If not addressed early, these inconsistencies can significantly impact downstream analyses, including propensity score modelling and bias estimation. Early and systematic harmonization is therefore essential to ensure consistency and minimize rework.

 

CDISC alignment, missing data, and analytical complexity

For studies intended for regulatory submission, alignment with CDISC standards (SDTM and ADaM) is essential. However, external datasets are rarely structured with these standards in mind, requiring substantial programming effort during transformation.

In another case study, SDTM datasets pooled from multiple studies were used as the source. However, inconsistencies in specifications and differences in SDTM Implementation Guide versions across studies created challenges in standardization and traceability during development of the ADaM specifications. Key variables, including demographics and baseline characteristics such as age, sex, education, genotype, and clinical scores, had to be consistently derived and validated across studies. Maintaining traceability was critical, with define.xml playing a key role in documenting transformations and assumptions.

At the same time, missing and inconsistent data remain inherent challenges. In the natural history cohort example, gaps in timepoints and patient coverage limited direct comparability with the treated trial arm. Programmers addressed this by defining analysis windows and deriving aligned time variables, enabling more meaningful longitudinal comparisons. However, such adjustments introduce assumptions that must be clearly justified and documented in the specifications and the Reviewer’s Guide.
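A rough sketch of what analysis windowing can look like is given here in Python, purely for illustration. The window labels and day ranges are hypothetical, and the relative day is a plain date difference rather than the ADaM study-day convention (which has no day 0). When several assessments fall in one window, this sketch keeps the one closest to the target day; that rule is an assumption, and whichever rule is chosen would need to be justified in the specifications.

```python
from datetime import date

# Hypothetical windows: label -> (target day, first day, last day),
# all expressed in days relative to the aligned baseline date.
WINDOWS = {
    "BASELINE": (0, -30, 14),
    "MONTH 6": (182, 122, 242),
    "MONTH 12": (365, 305, 425),
}


def relative_day(assessment_date, baseline_date):
    """Plain difference in days (not the ADaM ADY convention)."""
    return (assessment_date - baseline_date).days


def assign_window(day):
    """Return the window label containing this day, or None if unscheduled."""
    for label, (_target, first, last) in WINDOWS.items():
        if first <= day <= last:
            return label
    return None


def select_per_window(assessments, baseline_date):
    """Keep one assessment per window: the one closest to the target day.

    Ties are resolved in favour of the record encountered first.
    """
    best = {}
    for rec in assessments:
        day = relative_day(rec["date"], baseline_date)
        label = assign_window(day)
        if label is None:
            continue  # outside every window; excluded from windowed analysis
        distance = abs(day - WINDOWS[label][0])
        if label not in best or distance < best[label][0]:
            best[label] = (distance, {**rec, "rel_day": day, "window": label})
    return {label: rec for label, (_dist, rec) in best.items()}


baseline = date(2020, 1, 1)
assessments = [
    {"date": date(2020, 1, 1), "score": 10},
    {"date": date(2020, 6, 20), "score": 14},
    {"date": date(2020, 7, 10), "score": 15},
    {"date": date(2021, 3, 15), "score": 18},
]
windowed = select_per_window(assessments, baseline)
```

In this toy example the two mid-year assessments compete for the same window and the later one wins because it lies nearer the target day, while the final assessment falls outside every window and is dropped from the windowed view.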

ECA analyses also rely heavily on advanced statistical techniques, including propensity score matching, weighting, and longitudinal modelling. These methods can be computationally intensive, particularly when working with multiple heterogeneous datasets. In one case study, certain models required several hours to run for a single output, directly impacting timelines for quality control and iterative revisions.

As a result, programmers must optimize code for long-running processes, manage runtime constraints, and ensure reproducibility across environments. For example, when generating figures based on many simulations (e.g., 500,000 iterations), a single output could require several hours of execution time. To improve efficiency, figure generation was separated into independent programs rather than being combined within a single workflow, which significantly reduced total runtime. Similarly, validation procedures for computationally intensive simulations were performed in a staged manner, starting with smaller sample sizes and progressively increasing to the full scale, allowing for earlier detection of discrepancies, while minimizing unnecessary computational cost. In addition, parallel execution strategies were employed, with multiple programmers running processes concurrently, further reducing overall turnaround time.
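The staged validation idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the workflow used in the case study: the "production" and "QC" functions stand in for two independently written programs, the simulated quantity (a mean of normal draws) is a placeholder, and the stage sizes, seed, and tolerance are hypothetical.

```python
import random
import statistics


def production_estimate(n_iter, seed):
    """Stand-in for the production program: mean of simulated draws."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_iter)]
    return statistics.fmean(draws)


def qc_estimate(n_iter, seed):
    """Stand-in for an independent QC program: same quantity, coded differently."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iter):
        total += rng.gauss(0.0, 1.0)
    return total / n_iter


def staged_validation(stages, seed=2024, tol=1e-9):
    """Compare production and QC results at increasing simulation sizes.

    Stops at the first stage where the two results disagree, so that
    discrepancies are caught before the expensive full-scale run.
    """
    results = []
    for n_iter in stages:
        prod = production_estimate(n_iter, seed)
        qc = qc_estimate(n_iter, seed)
        agrees = abs(prod - qc) <= tol
        results.append((n_iter, agrees))
        if not agrees:
            break
    return results


# Small stages first; the full-scale run would be the final entry.
report = staged_validation([100, 1_000, 10_000])
```

Fixing the seed in both programs is what makes the comparison meaningful at every stage, and it also supports reproducibility across environments.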

Furthermore, the inherent uncertainty in external data typically necessitates multiple sensitivity analyses, requiring flexible and efficient programming workflows.

 

Operational constraints and regulatory expectations

Beyond technical challenges, ECAs introduce operational complexities. External datasets are often subject to strict privacy and governance requirements, with analyses conducted in secure or third-party environments. These constraints can limit direct data access, slow iteration cycles, and introduce additional layers of review and approval.

Programmers must therefore adapt to restricted computing environments, limited data visibility, and evolving access rules, all of which require careful planning to maintain timelines.

At the same time, regulatory expectations remain high. While agencies are increasingly open to ECAs, they require strong evidence of data quality, bias mitigation, and endpoint consistency. From a programming perspective, this places significant emphasis on transparency and documentation.

All transformations and analytical decisions must be fully traceable and clearly justified, including mapping approaches, imputation methods, endpoint derivations, harmonization decisions, and sensitivity analyses. Well-structured documentation is therefore as critical as the datasets themselves in supporting reproducibility and regulatory review.

 

Final takeaways

The development of ECAs extends far beyond data integration. It requires a structured and methodical programming approach to ensure consistency, traceability, and regulatory readiness.

The case studies highlight that successful ECA implementation depends not only on methodological rigor but also on the quality of data preparation and standardization. Early harmonization, robust documentation, and flexible programming frameworks are essential to delivering reliable and submission-ready results.

As ECAs continue to gain traction, programming plays a central role in bridging diverse data sources and generating credible evidence for regulatory decision-making. Despite the availability of industry white papers and broader guidance on observational data standardization, dedicated standards and detailed guidance specific to ECAs remain limited, highlighting the need for continued collaboration and development in this area.

 

Interested in learning more?

Join Gautham Selvaraj, Ralf Koelbach, and Steven Ting for their upcoming webinar, “Implementing External Control Arms in a Rare Disease Case Study,” on April 30 at 10 am ET, where they will offer practical insights and experience-based strategies for implementing ECAs with real-world data.

Leveraging RWE Innovations to Inform Clinical Strategy and Strengthen Healthcare Decision-Making

Real-world evidence (RWE) is no longer a supporting actor, but rather a strategic asset that should be embedded across the product lifecycle.

We now have tools that were unimaginable a decade ago: synthetic data that preserves privacy while enabling scenario modeling and early go/no-go decisions, external control arms (ECAs) to strengthen single-arm trials and accelerate access in high unmet need settings, and decentralized long-term extensions via tokenization that reduce burden while capturing 10+ years of safety and effectiveness across the patients’ real-world journey.

These innovations aren’t just “nice to have.” They are how we accelerate access to needed therapies, demonstrate value with confidence, and build submissions that stand up to today’s scrutiny.

Here, I discuss how these capabilities are reshaping clinical strategy and unlocking smarter, faster, more equitable evidence generation.

 

Generating synthetic data with agentic AI

Synthetic data is artificially generated data that mimics the statistical properties of real data without containing identifiable patient information. Starting with appropriate real-world data (RWD) (patient-level) or randomized controlled trial (RCT) data source(s), sponsors can use an AI-supported pipeline to generate a synthetic dataset, then assess similarities to the original data to gauge success.

Synthetic data can:

  • Inform early go/no-go decisions: A cost-effective approach to optimizing asset strategy before large investments by simulating expected outcomes under various scenarios in Phase I–II.
  • Inform clinical trial design: Model alternative controls and sample sizes and stress-test treatment effects in a cost-effective manner.
  • Build privacy-preserving, cost-effective ECAs: Build an ECA partly (in combination with RWD) or entirely from a fully de-identified synthetic cohort. This is not yet suitable for regulatory purposes, but it can inform provider and payer decisions.

RWD has its limitations: to be useful it must closely resemble the target patient population while protecting patient privacy, and collecting it can be costly, time-consuming, and in some cases ethically problematic. Synthetic data can help overcome these challenges.

 

Strengthen regulatory submission with an external control arm

External control arms use data from historical RCTs or RWD when randomization is not feasible or ethical, or to power or accelerate a study where there is high unmet need.

ECAs can:

  • Strengthen single-arm trials (SAT): Provide contextual information for SAT regulatory submissions, increasing probability of success.
  • Accelerate access to needed therapies: For RCTs in areas of high unmet need (e.g., under the accelerated approval pathway) and/or with slow recruitment, RWD can augment the control arm.
  • Support a lifecycle management approach: Supports label expansions to new populations (e.g., to male breast cancer) or new lines of therapy for decisions by regulators, payers, and providers.

While RCTs are considered the “gold standard,” the FDA in 2023 wrote that “externally controlled studies may be considered” (with strong justification), while in 2025, the EMA guidance stated “in some situations, causal conclusions may be derived from a setting where the investigational medicinal product data was collected under a clinical trial protocol while the control arm was not a randomized arm in that same protocol.”

 

Assess long-term outcomes with long-term extension studies

Decentralized long-term extensions of RCTs assess long-term outcomes (safety and effectiveness) with or without drug provision. The extension enables follow-up of tokenized trial patients via real-world databases or direct-to-patient data collection.

Long‑term extension studies can:

  • Allow for long-term follow-up: Cost-effective data collection by reducing site and patient burden while collecting key safety and effectiveness endpoints over 10+ years.
  • Enable earlier launch: For breakthrough therapies and high unmet need, launch can occur as soon as clinical efficacy is proven if the sponsor commits to a Phase IV study to collect long-term data.
  • Improve representativeness: Loss to follow-up in long-term studies can bias results, and RCTs often under-represent certain populations. The shift to real-world endpoints makes the insights more relevant to decision-makers.

 

Key takeaways

Consider RWE as a strategic asset: Integrate RWE early, anticipate post-marketing collection of long-term data, and adopt causal inference methods to support credible conclusions on safety and effectiveness.

Invest in robust RWD: Invest in RWD quality and governance to ensure credibility with regulators and payers.

Adopt a comprehensive strategy: Adopt flexible, hybrid evidence strategies that combine synthetic data, ECAs, and long-term real-world data collection approaches.

Ensure cross-functional readiness: Medical, regulatory, biostats, and data science must operate as one evidence engine.

The Delta Dossier: Why Germany Needs More Than a Reference-Based Approach

With the first Joint Clinical Assessments (JCAs) at the European level, pharmaceutical companies are by no means entering a phase of reduced national HTA requirements. Germany, in particular, is already showing that the so-called delta dossier — an informal term used in the German market access environment for the national content required in addition to the European JCA dossier — is not simply a shorter AMNOG dossier containing references to the European JCA dossier.

Instead, it is becoming the test of whether clinical trial evidence, European and German HTA requirements, and tight procedural timelines can be brought together at an early stage.

There is still only limited practical experience with real delta dossiers. All the more important, then, are the signals coming from Germany’s Federal Joint Committee (Gemeinsamer Bundesausschuss, G-BA), the country’s highest decision-making body in joint self-government and a central institution in the national HTA framework. Its spring 2025 events already made clear where the key requirements are likely to emerge and which questions pharmaceutical companies should be addressing now. The G-BA itself views the planned adjustments in the national setting, including the adaptation of the AMNOG dossier module templates, as a first step and intends to assess further developments on the basis of the first practical experience.

Here, we share five theses on the delta dossier.

 

Thesis 1: The EU JCA will not replace the German benefit assessment

A central point is often underestimated in the current debate: the JCA does not replace Germany’s early benefit assessment. The G-BA makes it clear that alignment with the European assessment does not change the assessment standards applied in the German benefit assessment. Decisions will continue to be taken at the national level. The JCA dossier is intended to inform national decision-making, but it does not itself provide a conclusion on additional clinical benefit compared with the national appropriate comparator therapy (zVT) — the foundation for the subsequent price negotiation.

This also clarifies the role of the delta dossier: the objective is not simply to pass through European content in a formal way, but to prepare it in a manner that is robust and usable for the German procedure.

 

Thesis 2: The delta dossier is about translation, not cross-referencing

The G-BA describes very specifically how the JCA dossier is to be used. References are possible, but only to clearly identified sections. General or dynamic references are not sufficient. At the same time, it remains the responsibility of the pharmaceutical company to determine whether the contents of the JCA dossier are sufficient for the German benefit assessment or whether updated or supplementary evidence is required. There will be no separate dossier template. The structure of the AMNOG modules will remain in place.

This is precisely where the quality of a good delta dossier becomes visible: it is the national translation of the European assessment process and brings the JCA dossier and the AMNOG dossier together. This is achieved not through references alone, but above all through the targeted selection of content that is truly robust and the addition of missing data needed for an evidence-based national assessment.

One point is particularly important here: the G-BA makes it clear that a full national AMNOG dossier may still be submitted. There is therefore no obligation to use the delta dossier as a lean referencing solution. What remains decisive is not the format, but the quality of the national dossier preparation.

 

Thesis 3: The real work starts well before the delta dossier

The determination of the relevant PICOs (PICO scoping) for the JCA already begins when the marketing authorization application is submitted to the EMA, and therefore well before the start of the national AMNOG procedure. The PICOs fed back by Germany are intended to reflect the relevant research questions for the later AMNOG procedure, but — just like, for example, the outcome of an early G-BA consultation on the appropriate comparator therapy — they are not legally binding. This creates a risk scenario, particularly for the national procedure, that must be anticipated and taken into account in strategic planning. Any company that only starts to structure populations, comparator therapies, endpoints, and potential subgroups when preparing the national dossier is already too late.

European scientific consultation on PICO scoping also takes place at a point when studies are still being planned. National consultations remain possible, but parallel duplicate structures are to be avoided. For manufacturers, this means that the real strategic work does not begin with the delta dossier, but with PICO scoping, study design, and early evidence planning.

 

Thesis 4: The biggest risks sit in comparator selection and endpoints

Translation into the German setting already becomes particularly demanding at the scoping stage. The first key question is which PICO, or which set of PICOs, actually reflects the requirements of the German benefit assessment. This determines which comparator therapy is relevant for Germany and whether the evidence addressed in the JCA will in fact support the national assessment. This is precisely where preparation for a strong delta dossier begins: with the early identification of the PICOs relevant for Germany, the selection of robust content, and the supplementation of evidence wherever European materials are not sufficient for the national assessment.

In addition, European JCA scoping may include endpoints that are not necessarily recognized as patient-relevant in the national procedure. The G-BA explicitly distinguishes between endpoints included at the European level and the criteria for patient relevance that apply in the German AMNOG procedure. The same applies to analytical methods: national requirements — such as the 15% relevance threshold for responder analyses — remain in place.

For this reason, the delta dossier is particularly demanding from a scientific and methodological perspective wherever European evidence must be made robust for German comparator therapies and nationally relevant endpoints.

 

Thesis 5: Timing and evidence updates will be decisive

In addition to scientific issues, procedural management is becoming more important. The G-BA continues to require that the underlying systematic literature review on relevant clinical evidence must not be more than three months old at the start of the procedure. Additional data cuts and newly completed studies may therefore become relevant in the AMNOG procedure even if they were not addressed in the JCA dossier. This means that the dataset underlying the national AMNOG dossier may differ from the dataset underlying the JCA dossier.

The timing of the publication of the JCA report is also particularly important. If it is available in time, it will be taken into account in the benefit assessment. If it becomes available later, it may still be considered during the written comments procedure or, at the latest, in the final resolution. However, if it is published only after the start of the written comments procedure, it can no longer formally be taken into account. At the same time, the G-BA points out that there is as yet no reliable practical experience in this regard — another source of uncertainty for pharmaceutical companies.

 

From JCA to delta dossier: Cytel combines global perspective with local execution

Cytel occupies the critical interface between European clinical assessment and national benefit assessment in Germany. Together with the German team at co.value, a Cytel brand, Cytel combines experience in PICO scoping, JCA dossier development, and statistical evidence generation with in-depth local AMNOG expertise. This means support does not begin only at the point of translating into the delta dossier, but much earlier: in evidence planning, the selection of robust comparator therapies, and the targeted shaping of European evidence for reliable use in the German AMNOG procedure.

 

The delta dossier as the true test

The first delta dossiers are only now beginning to emerge. But the substantive guardrails are already clearly visible, and they point in a clear direction: within the framework of European clinical assessment, the German AMNOG procedure will not become a process that can be handled through references alone.

What will matter instead is how early clinical trial evidence, European and German HTA requirements, and tight procedural timelines are brought together. The delta dossier is therefore not merely a new format. It is the clearest expression of whether this translation work has been accomplished in time.

Rethinking Evidence in Rare Disease Research: A Case Study Using Propensity Score Methods

Rare diseases pose unique challenges for researchers and clinicians. Due to small patient populations, conducting randomized controlled trials (RCTs) is often impractical or ethically difficult. As a result, observational data becomes a key source of evidence.

In the landscape of rare disease, data is both our most precious resource and our greatest challenge. For conditions like Infantile-Onset Pompe Disease (IOPD), the journey from the first life-saving Enzyme Replacement Therapy (ERT) to the next generation of optimized treatments is rarely a path free of challenges. It is a path marked by small patient populations, high clinical variability, and the heavy weight of every data point.

The difficulty in rare disease research often lies in the “how”: How do we prove a new therapy is truly superior when baseline functional levels vary so wildly? How do we ensure that a single data entry error doesn’t mask a breakthrough or suggest a false decline?

In this blog, we explore how propensity score methods can be used to estimate treatment effectiveness in a rare disease setting through a real world–inspired case study.

In this case study, we pull back the curtain on the analytical rigor required to compare motor function trajectories in IOPD. From Propensity Score Matching to “red-flag” data auditing, we explore how sophisticated analysis turns fragmented data into a clear roadmap for the future of neuromuscular treatment.

 

Case study: Advancing motor function outcomes in IOPD

The evolution from first-generation drug to next-generation drug

Infantile-Onset Pompe Disease (IOPD) is a rare, progressive neuromuscular disorder. While the first generation of ERT revolutionized survival, the quest for superior motor function remains the “North Star” for researchers. This study compares longitudinal motor outcomes between the First-Generation Drug and Next-Generation Drug cohorts using the Gross Motor Function Measure (GMFM-88).

 

The challenge: Comparing across clinical trials

Comparing results from different studies requires more than just looking at averages; it requires accounting for the inherent variability in how patients present at baseline. To test the hypothesis that the Next-Generation Drug offers a superior motor trajectory, we implemented a rigorous three-tier analytical approach.

 

A three-tier analytical approach

1. The power of precise matching

To ensure an “apples-to-apples” comparison, we restricted the analysis to patient pairs matched by both age and baseline functional level.

  • The criteria: Matches were strictly filtered to those within a +/- 13-point window of the GMFM-88 raw score (rather than a percentage).
  • The goal: By tightening these parameters, we eliminated “baseline noise,” allowing the true pharmacological impact of the treatment to surface in the longitudinal graphs.
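The matching rule described above can be sketched in Python as follows. Only the 13-point GMFM-88 window comes from the text; the age tolerance, the field names, the greedy 1:1 strategy without replacement, and the toy patients are all assumptions made for illustration.

```python
GMFM_CALIPER = 13        # from the text: within +/- 13 raw-score points
AGE_CALIPER_MONTHS = 6   # assumption: the text does not state an age tolerance


def match_pairs(treated, controls):
    """Greedy 1:1 matching without replacement on GMFM-88 raw score and age.

    Each treated patient is paired with the eligible control whose baseline
    score is closest (age difference breaks ties). Patients with no eligible
    control stay unmatched.
    """
    available = list(controls)
    pairs = []
    for t in treated:
        candidates = [
            c for c in available
            if abs(c["gmfm"] - t["gmfm"]) <= GMFM_CALIPER
            and abs(c["age_months"] - t["age_months"]) <= AGE_CALIPER_MONTHS
        ]
        if not candidates:
            continue
        best = min(
            candidates,
            key=lambda c: (abs(c["gmfm"] - t["gmfm"]),
                           abs(c["age_months"] - t["age_months"])),
        )
        available.remove(best)
        pairs.append((t["id"], best["id"]))
    return pairs


# Toy patients, not study data.
treated = [
    {"id": "T1", "gmfm": 40, "age_months": 12},
    {"id": "T2", "gmfm": 100, "age_months": 30},
]
controls = [
    {"id": "C1", "gmfm": 50, "age_months": 14},
    {"id": "C2", "gmfm": 45, "age_months": 30},
    {"id": "C3", "gmfm": 120, "age_months": 31},
]
pairs = match_pairs(treated, controls)
```

Greedy matching depends on the order in which treated patients are processed, so in practice the order (or an optimal matching algorithm) would be fixed and documented.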

 

2. Data integrity: Investigating the “jumps and drops”

In rare disease registries, a single data point can skew an entire trajectory. Our team conducted a “deep dive” into five specific patient profiles that exhibited extreme volatility — marked by sharp drops or vertical jumps in scores.

Expert insight: A drop to zero isn’t always a clinical decline; often, it’s a data entry artifact where a missing value was defaulted to ‘0.’ By identifying and correcting these anomalies, we ensure the motor trajectory reflects biology, not a spreadsheet error.
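A simple screening rule for this kind of artifact might look like the Python sketch below. It is an assumption-laden illustration: the 30-point jump threshold and the visit labels are hypothetical, and the function only flags records for manual review. Whether a flagged zero is a data entry artifact or a real clinical decline remains a clinical and data management decision.

```python
JUMP_THRESHOLD = 30  # hypothetical: change in raw score that triggers review


def flag_anomalies(series, jump_threshold=JUMP_THRESHOLD):
    """Flag suspicious points in one patient's chronological score series.

    series: list of (visit, score) tuples; score may be None if missing.
    Returns a list of (visit, reason) tuples for manual review.
    """
    flags = []
    last_valid = None  # most recent score not flagged as a suspicious zero
    for i, (visit, score) in enumerate(series):
        if score is None:
            continue
        prev = series[i - 1][1] if i > 0 else None
        nxt = series[i + 1][1] if i + 1 < len(series) else None
        has_nonzero_neighbour = any(
            s is not None and s > 0 for s in (prev, nxt)
        )
        if score == 0 and has_nonzero_neighbour:
            flags.append((visit, "zero next to non-zero score: possible missing value"))
            continue  # do not use this zero as the reference for later jumps
        if last_valid is not None and abs(score - last_valid) >= jump_threshold:
            flags.append((visit, "large change from previous valid score"))
        last_valid = score
    return flags


# Toy series, not study data.
series = [("V1", 60), ("V2", 0), ("V3", 62), ("V4", 20)]
flags = flag_anomalies(series)
```

Comparing each score with the last unflagged value, rather than with the immediately preceding record, avoids flagging the recovery after a spurious zero as a second anomaly.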

 

3. Sophisticated balancing: Propensity Score Matching (PSM)

Propensity score methods help simulate a randomized experiment by balancing observed characteristics between treated and untreated groups.

To further validate our findings, we moved beyond simple matching to Propensity Score Matching. This statistical technique estimates each patient’s probability of being in a given treatment group based on their baseline characteristics, and then uses that probability to “balance” the two groups.

 

Key covariates included:

  • Baseline status: Age and GMFM-88 total raw score.
  • Clinical history: Age at diagnosis and age at start of ERT.
  • Biological markers: CRIM status (Cross-Reactive Immunologic Material) and LVMI (Left Ventricular Mass Index) z-scores.
  • Treatment variables: Specific enzyme dosage levels.
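To make the mechanics concrete, here is a minimal, dependency-free Python sketch of propensity score matching. It is not the model used in the study: it uses only two of the covariates listed above (age and GMFM-88 raw score), fits the logistic regression with plain gradient descent instead of a statistical package, and applies a hypothetical caliper of 0.2 on the probability scale. The toy data are invented, and "treated" is assumed to mean membership in the next-generation cohort.

```python
import math


def standardize(rows):
    """Scale each covariate to mean 0 and SD 1 so gradient descent is stable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Logistic regression by batch gradient descent: [intercept, coefs...]."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
            err = sigmoid(z) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    """Estimated probability of being in the treated group for each patient."""
    return [sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))
            for xi in X]


def match_on_ps(ps, treated, caliper=0.2):
    """Greedy 1:1 nearest-neighbour matching without replacement."""
    controls = [i for i, t in enumerate(treated) if t == 0]
    pairs = []
    for i, t in enumerate(treated):
        if t != 1 or not controls:
            continue
        j = min(controls, key=lambda c: abs(ps[c] - ps[i]))
        if abs(ps[j] - ps[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs


# Toy data: [age in months, GMFM-88 raw score]; 1 = next-generation cohort.
covariates = [[6, 20], [8, 35], [10, 30], [12, 50],
              [14, 45], [16, 60], [18, 55], [20, 70]]
treated = [0, 1, 0, 0, 1, 1, 0, 1]

X = standardize(covariates)
beta = fit_logistic(X, treated)
ps = propensity_scores(X, beta)
pairs = match_on_ps(ps, treated)
```

In a real analysis the model would include the full covariate set, and the caliper is often defined on the logit of the propensity score instead. With cohorts as small as those typical in rare disease, the choice of caliper can decide how many patients remain in the matched set.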

 

Why this matters for the rare disease community

This case study demonstrates that in the world of rare diseases, how we analyze data is as important as the data itself. By correcting for entry errors and using high-fidelity matching, we can more clearly see if the next-generation drug truly provides the “superior trajectory” hypothesized.

 

Precision analytics as a catalyst for care

By applying high-fidelity matching and propensity score modelling, we move beyond “average” results to understand the true potential of new interventions. Furthermore, our dedication to data integrity — manually investigating anomalies and “red-arrow” outliers — ensures that our conclusions are built on a foundation of clinical reality rather than administrative error.

Ultimately, this study reinforces that in the fight against rare diseases, data is our most powerful ally. When we refine our lens through rigorous matching and clean data, the path toward better motor function and brighter futures for IOPD patients becomes clearer than ever.

Real-World Data Strategies and Challenges: Making Data Work for Your External Control Arm Study

External control arms (ECAs) are gaining popularity in comparative effectiveness studies, driven by a growing emphasis on robust evidence across disease areas and regulatory body acceptance. ECAs can provide a control group for single-arm studies, complement a larger portfolio of evidence, and enable research for rare or genetic conditions for which randomized controlled trials may be unethical or infeasible.

At the same time, real-world data (RWD) is becoming an essential foundation for building credible ECAs. RWD offers unique advantages: it reflects real clinical practice, captures diverse patient populations, and can support robust estimates of treatment effects.

However, integrating data from multiple sources, such as historical trials, concurrent trials, patient registries, and cross-population datasets, requires careful methodological planning to ensure validity and regulatory acceptance.

To fully harness the value of external control arms, sponsors must ensure selected data is fit-for-purpose, index dates are aligned with trial eligibility, and rigorous statistical methods are applied to ensure comparable patient profiles. Here, we outline these three essential elements.

 

Choosing the right data source for your external control arm

When building ECAs, different types of external data sources have different strengths.

 

Historical or concurrent randomized trials

Historical or concurrent randomized trials contain systematically collected data and well-defined endpoints, following a detailed protocol. However, they often have small sample sizes, and evolving standards of care or diagnostic criteria can limit comparability over time.

 

Electronic health records and insurance claims

Electronic health records and insurance claims contain large, diverse cohorts and broad population coverage. But they frequently lack clinical details such as out-of-hospital care and non-prescription medications.

 

Patient registries

Patient registries provide systematic, detailed data collection, the potential for linkage, and long-term follow-up. Yet they can have high missingness and over-represent healthier patients, which could reduce the overlap in characteristics with trial populations.

 

Selecting the best data sources should be guided by fit-for-purpose assessments. These assessments include exploring the availability of key prognostic characteristics and missingness, along with practical considerations such as access and timelines.

 

Defining appropriate eligibility criteria and index dates

Carefully establishing index dates is critical yet challenging when incorporating an ECA. In a trial population, the index date is clearly defined as the date on which the patient meets eligibility criteria or is randomized. The same eligibility criteria need to be applied to ECA patients using variables in the external data source, and the index date should reflect the point at which those criteria are met. Misalignment of the index date leads to specific types of selection bias, including immortal time bias. This bias occurs when periods during which an outcome could not have occurred are misclassified, potentially creating a false treatment benefit.
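A small Python sketch can make the index date logic and the size of the immortal time concrete. The criterion names and dates are hypothetical, and the sketch assumes that a criterion stays met once it has been met; criteria that can lapse (for example, a laboratory value within range) would need additional handling.

```python
from datetime import date


def index_date(criteria_dates):
    """Index date = first date on which ALL eligibility criteria are met.

    criteria_dates maps each criterion to the date it was first met,
    or None if it was never met. Assumes criteria stay met once met.
    """
    if any(d is None for d in criteria_dates.values()):
        return None  # patient never becomes eligible
    return max(criteria_dates.values())


def follow_up_days(start, end):
    """Days of follow-up from start to the event or censoring date."""
    if start is None or end < start:
        return None
    return (end - start).days


# Hypothetical external-control patient.
criteria = {
    "diagnosis_confirmed": date(2019, 3, 1),
    "genotype_available": date(2019, 6, 15),
    "baseline_function_assessed": date(2019, 9, 1),
}
event_date = date(2020, 9, 1)

idx = index_date(criteria)
aligned = follow_up_days(idx, event_date)
naive = follow_up_days(criteria["diagnosis_confirmed"], event_date)
immortal_days = naive - aligned
```

Starting the clock at diagnosis instead of at the aligned index date would credit this patient with 184 extra days during which, by construction, the outcome could not have been observed for a patient later included in the cohort. That extra time is exactly the immortal time that the aligned index date removes.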

 

Ensuring treatment and control patients are similar

In RCTs, randomization naturally balances prognostic factors between treatment arms. ECAs, by contrast, require explicit identification and adjustment of these variables. Clinical expertise is essential for determining which characteristics matter most. Comparing the distributions of these variables between the treated and control arms helps to assess similarity. Statistical techniques including propensity score matching and inverse probability of treatment weighting can improve comparability and approximate the balance achieved through randomization. Assessing the pre- and post-adjustment distributions of baseline characteristics quantifies the success of the method.
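The balance check described here is commonly summarized with a standardized mean difference (SMD) before and after weighting. The Python sketch below is a simplified illustration: the propensity scores are assumed to have been estimated already, the weights target the average treatment effect, variances use the simple weighted (population) formula, and the toy data are constructed so that weighting balances the covariate exactly.

```python
import math


def iptw_weights(treated, ps):
    """ATE weights: 1/ps for treated patients, 1/(1 - ps) for controls."""
    return [1.0 / p if t == 1 else 1.0 / (1.0 - p)
            for t, p in zip(treated, ps)]


def _weighted_mean_var(values, weights):
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def smd(x, treated, weights=None):
    """Standardized mean difference of covariate x between the two arms."""
    if weights is None:
        weights = [1.0] * len(x)
    stats = {}
    for arm in (1, 0):
        vals = [xi for xi, t in zip(x, treated) if t == arm]
        wts = [wi for wi, t in zip(weights, treated) if t == arm]
        stats[arm] = _weighted_mean_var(vals, wts)
    pooled_sd = math.sqrt((stats[1][1] + stats[0][1]) / 2.0)
    return (stats[1][0] - stats[0][0]) / pooled_sd


# Toy data: a binary prognostic factor that is more common in treated patients.
x = [1, 1, 1, 0, 1, 0, 0, 0]
treated = [1, 1, 1, 1, 0, 0, 0, 0]
# Assumed propensity scores: 3 of 4 patients with x=1 are treated, 1 of 4 with x=0.
ps = [0.75 if xi == 1 else 0.25 for xi in x]

smd_before = smd(x, treated)
smd_after = smd(x, treated, iptw_weights(treated, ps))
```

Here the unweighted SMD is large, and the weighted SMD is essentially zero because the weights were built from the true treatment probabilities. With estimated scores the post-weighting SMD will rarely be exactly zero; a threshold such as 0.1 is a common rule of thumb for acceptable balance, though the appropriate cut-off is a study-level decision.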

 

Final takeaways

Overall, to fully harness the value of external control arms, three elements are essential:

  1. Selecting fit-for-purpose data
  2. Defining index dates that align with trial eligibility
  3. Applying rigorous statistical methods to ensure comparable patient profiles

When executed thoughtfully, ECAs can meaningfully strengthen evidence generation and expand the possibilities for clinical research.

 

Interested in learning more?

Watch our on-demand webinar featuring Deepa Jahagirdar and Vartika Savarna, “Driving Credibility in External Control Arms with Real-World Data,” available now.

Insights from WEPA Amsterdam: When Policy Pressure Meets AI Maturity

The World EPA Congress in Amsterdam did not feel like a conference about isolated trends. It felt like a conference about structural transition.

Across sessions and conversations, one consistent narrative emerged: market access is being reshaped simultaneously by tightening policy frameworks and by the operational maturation of artificial intelligence. These are not parallel stories unfolding independently. They are interacting forces that together are redefining how evidence is generated, how value is assessed, and how global pricing strategies are constructed.

The underlying question throughout WEPA was not whether change is coming. It was whether organizations are structurally prepared to manage both forces at once.

 

1. A policy environment under structural redesign

Joint Clinical Assessment: Harmonization meets operational reality

The first year of Joint Clinical Assessment (JCA) implementation under the EU HTA Regulation represents a historic step toward harmonization of clinical evaluations across Europe. In principle, a single European-level clinical assessment promises efficiency, reduced duplication, and greater consistency in evaluating comparative effectiveness.

Yet the operational reality is more complex. Harmonization does not automatically mean simplification.

Early experience indicates that alignment between EU-level assessments and national reimbursement processes remains incomplete. Questions persist around how Member States will operationalize JCA outputs, how quickly EU HTAR assessors can deliver assessments, and whether national HTA bodies are fully prepared to transition to reliance on joint evaluations.

Methodological challenges are also emerging. PICO multiplicity, expanded evidence requirements, and the risk of unexpected analytical requests are increasing the burden on evidence generation teams, especially for products targeting rare diseases. While duplication of assessments may decrease, the sophistication and coordination required to navigate the system are increasing.

JCA is a milestone in European collaboration. But its success will depend on tighter synchronization between EU-level clinical conclusions and national pricing and reimbursement realities.

 

Real-world evidence: From complementary input to strategic pillar

Alongside JCA, the role of real-world evidence (RWE) is evolving rapidly. Regulators, payers, and clinicians increasingly seek insight into how therapies perform in routine clinical practice across diverse populations. The European Medicines Agency has clearly signaled its ambition to place patient voice and real-world data at the center of regulatory evaluation.

RWE is no longer supplementary. It is becoming central.

However, tension remains within the EU HTAR context. JCA assessments emphasize statistical precision and internal validity, while real-world evidence reflects the inherent heterogeneity of clinical practice. Methodological expectations between regulatory and HTA frameworks are not yet fully synchronized.

Europe now faces a strategic choice: either build robust, interoperable infrastructures for high-quality real-world data sharing across Member States, or risk creating friction between regulatory innovation and HTA conservatism. The credibility of future evidence strategies will depend on resolving this gap.

 

MFN pricing: Global interdependence redefines strategy

At the global level, Most-Favored-Nation (MFN) pricing dynamics are reshaping launch and market access strategies beyond the United States. Pricing has become an interconnected global system rather than a sequence of independent national decisions.

Launch sequencing is being reassessed as companies evaluate exposure to international reference pricing and MFN-linked rules. Markets are increasingly categorized by strategic risk, and cross-market interdependence is intensifying. Decisions taken in one jurisdiction reverberate across others.

Europe, despite its strong regulatory institutions, faces pressure due to fragmented access pathways, evolving JCA processes, and uncertainty in national budget negotiations. The traditional logic of “where to launch first” has become a far more complex strategic equation.

Taken together, JCA implementation, the rise of RWE, and MFN pricing pressures are increasing analytical complexity, accelerating timelines, and demanding greater coordination across functions and geographies. This rising structural pressure forms the backdrop to the second defining theme of WEPA.

 

2. AI moves from experimentation to operating model

From hype to governance

If policy discussions reflected systemic pressure, AI discussions reflected systemic adaptation.

The tone around artificial intelligence at WEPA 2026 was notably mature. The conversation quickly moved beyond questioning whether AI is hype. The focus shifted toward responsible operationalization, governance, and measurable value creation within regulated environments.

The key issue is no longer adoption. It is integration.

Organizations are developing governance frameworks, embedding AI into regulated workflows, and ensuring traceability and auditability of outputs. The emphasis is on scale and accountability rather than isolated experimentation.

 

AI as infrastructure in market access

Across sessions, AI was framed not as a productivity enhancement tool but as part of the operating model of modern market access organizations.

Companies are redesigning processes around AI-enabled capabilities. Evidence synthesis, systematic literature reviews, indirect treatment comparisons, dossier drafting, pricing simulations, and tender strategy development are increasingly supported by automated or semi-automated systems.

This represents a structural shift. AI is moving from peripheral pilot projects to enterprise-level infrastructure embedded within core functions.

In an environment where JCA increases analytical burden and MFN pricing demands multi-country scenario modeling, such capabilities are becoming operationally essential rather than optional.

 

From assistant to strategic copilot

One of the most forward-looking discussions centered on the evolution of AI from drafting assistant to strategic copilot.

The emergence of agentic AI and orchestration systems is enabling decision support in areas such as pricing negotiation, tender simulations, and contracting strategy optimization. Rather than merely accelerating document preparation, AI is beginning to inform strategic decision-making.

However, in highly regulated settings such as HTA and pricing negotiations, transparency and explainability remain non-negotiable. The credibility of AI-driven insights depends on robust governance and clear traceability.

The opportunity is substantial — speed, standardization, and efficiency. The responsibility is equally significant.

 

3. The convergence: Complexity requires capability

The most important insight from WEPA Amsterdam lies not in policy alone, nor in AI alone, but in their convergence.

Policy reforms are increasing complexity. JCA raises expectations for comparative evidence coordination across Europe. Real-world evidence demands stronger data ecosystems. MFN pricing intensifies global interdependence and strategic sensitivity.

At the same time, AI provides the analytical and operational capabilities necessary to manage this complexity. It enables faster synthesis of comparative data, structured analysis of heterogeneous real-world evidence, and dynamic cross-market pricing simulations.

In this sense, policy pressure and AI capability are two sides of the same transformation. The former raises the bar; the latter provides the tools to reach it.

The defining question for market access organizations is whether they can redesign their operating models quickly enough to integrate policy intelligence, evidence generation, pricing foresight, and AI-enabled execution into a coherent system.

WEPA 2026 signaled that the era of treating these dynamics as separate conversations is over. Market access is entering a phase where structural policy reform and technological capability must be managed together.

Those who integrate both dimensions — responsibly, transparently, and strategically — will shape the future of evidence-based access in Europe and beyond.

Central Statistical Monitoring: Transforming Clinical Trial Oversight Through Data Intelligence

As clinical trials grow in complexity — spanning more geographies, more data streams, and more endpoints — the traditional model of on-site monitoring alone is no longer sufficient to ensure data quality and patient safety. Regulatory expectations have evolved, trial budgets are under pressure, and sponsors need earlier, more objective insights into emerging risks.

Central Statistical Monitoring (CSM) sits at the intersection of these demands.

At Cytel, we see first-hand how sponsors are rethinking monitoring strategies to be more risk-based, data-driven, and efficient. Here, we introduce the foundations of CSM, how it supports Risk-Based Quality Management (RBQM), and why it has become a critical component of modern trial oversight.

 

What is Central Statistical Monitoring?

Central Statistical Monitoring can be defined as the statistical detection of anomalies in accumulating clinical trial data to identify sites, patients, or countries that are performing differently from the rest. These differences may signal issues related to data quality, site conduct, or even patient safety.

The origins of CSM can be traced to early work on fraud detection in clinical trials. However, fraud is rare and represents only a small part of the picture. In practice, most CSM findings relate to more common and impactful issues such as errors, sloppiness, or data-handling inconsistencies.

The key principle is straightforward: when most sites are performing consistently, statistically unusual patterns may indicate that something warrants a closer look.

Rather than relying solely on Source Data Verification (SDV) or manual review, CSM uses statistical techniques to evaluate patterns within and across sites — often detecting issues that traditional monitoring approaches would miss.

 

Beyond KRIs and QTLs: What makes CSM different?

Central Monitoring typically includes three types of analyses:

• Key Risk Indicators (KRIs): site-level metrics such as adverse event rates or protocol deviations
• Quality Tolerance Limits (QTLs): study-level thresholds for critical KRIs
• Central Statistical Monitoring (CSM): advanced anomaly detection across high-volume data

While KRIs and QTLs focus on predefined metrics, CSM goes further by applying broad statistical tests across many variables — often using unsupervised approaches that are now considered the industry gold standard.

These methods may involve single-variable comparisons (such as means, variability, proportions, rates, digit distributions) as well as multivariate techniques that evaluate patterns across multiple variables simultaneously. The result is a structured framework for identifying outliers in a reproducible, objective way.
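A single-variable comparison of means can be sketched in a few lines. The example below compares each site's mean against the pooled data from all other sites and flags large standardized differences. The site names, values, threshold, and function names are hypothetical, and production CSM systems combine many such tests with more robust models than this simple z-score.

```python
import math
import statistics
from typing import Dict, List


def site_z_scores(values_by_site: Dict[str, List[float]]) -> Dict[str, float]:
    """Leave-one-out z-score: site mean versus the mean of all other sites,
    scaled by the other sites' standard deviation and the site's sample size."""
    scores = {}
    for site, values in values_by_site.items():
        others = [v for s, vals in values_by_site.items() if s != site for v in vals]
        standard_error = statistics.stdev(others) / math.sqrt(len(values))
        scores[site] = (statistics.mean(values) - statistics.mean(others)) / standard_error
    return scores


def flag_sites(values_by_site: Dict[str, List[float]], threshold: float = 3.0) -> List[str]:
    """Sites whose absolute z-score exceeds the (assumed) review threshold."""
    return sorted(s for s, z in site_z_scores(values_by_site).items() if abs(z) > threshold)


# Hypothetical systolic blood pressure readings (mmHg) by site.
readings = {
    "site_01": [118, 125, 131, 122, 127],
    "site_02": [121, 129, 117, 133, 124],
    "site_03": [126, 119, 130, 123, 128],
    "site_04": [150, 152, 149, 151, 153],
}

print(flag_sites(readings))
```

A flagged site is a candidate for review, not a finding in itself; the same pattern could reflect a different patient population or measurement device rather than a data quality problem.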

 

Why does CSM matter now?

Over the past two decades, regulatory authorities have progressively endorsed risk-based and centralized monitoring approaches. Guidance from the FDA, EMA, and MHRA has emphasized the importance of risk-based monitoring, culminating in ICH E6(R2) and, most recently, ICH E6(R3), which reinforce the role of centralized monitoring in identifying systemic and site-specific issues.

This regulatory evolution reflects a broader shift toward:

• Quality by Design (QbD)
• Identification of critical-to-quality factors
• Ongoing risk assessment
• Adaptive monitoring strategies

Within a Risk-Based Monitoring (RBM) framework, CSM complements KRIs and QTLs to provide a comprehensive view of trial risk. Insights from CSM can guide targeted on-site or remote monitoring, ensuring that resources are focused where they will have the greatest impact.

This approach aligns closely with the Clinical Trials Transformation Initiative’s definition of quality in clinical trials as the “absence of errors that matter to decision making — that is, errors which have a meaningful impact on the safety of trial participants or the credibility of the results.” By identifying anomalies early — before they escalate into systemic issues — CSM helps safeguard critical-to-quality factors.

For sponsors, the benefits are multifaceted:

• More efficient allocation of monitoring resources
• Potential reduction in unnecessary SDV
• Earlier detection of emerging risks
• Increased confidence in data integrity prior to regulatory submission

In short, CSM transforms monitoring from a predominantly reactive activity into a proactive, data-driven strategy.

 

Putting CSM into practice: Operational considerations for successful implementation

Understanding the statistical foundations of CSM is important — but translating that understanding into a well-functioning program requires deliberate operational planning. The following considerations provide a practical framework for teams preparing to implement CSM within a clinical trial.

 

Upfront preparation and governance

A formal CSM kickoff meeting — convened before any analyses begin — is one of the most valuable investments a team can make. This meeting should bring together representatives from biostatistics, data management, clinical operations, medical monitoring, and quality. The goal is to establish shared alignment on the objectives and scope of the CSM program, agree on which critical-to-quality (CtQ) factors will anchor the monitoring strategy, define escalation pathways for signals requiring action, and confirm documentation standards. Equally important is reaching consensus on how CSM integrates within the broader RBQM framework — clarifying how statistical signals will interact with KRI outputs, SDV decisions, and site risk classifications. Without this governance foundation, even technically sound CSM outputs can struggle to gain traction in day-to-day operations.

 

Determining frequency of analyses

The frequency with which CSM analyses are generated should be proportionate to study risk and dynamics. Key factors to consider include the rate of enrollment, total subject count, number of active sites, and overall study duration. Trials with rapid, multi-site enrollment may benefit from more frequent reviews (bi-monthly) to catch emerging patterns before they compound. Slower-enrolling or smaller studies may reasonably support longer intervals between analyses without compromising oversight. Critically, frequency should not be treated as fixed. As study conditions evolve (sites activate or go on hold, enrollment accelerates, or a new safety signal emerges), the CSM schedule should be revisited. Building in flexibility from the outset ensures the program remains responsive rather than formulaic.

 

Communication and cross-functional review

CSM outputs are most actionable when they are presented in a structured, interpretable format — combining risk scores or site rankings with narrative interpretation that contextualizes what the statistics show and why it may matter. Findings should be reviewed collaboratively with the wider cross-functional team including Clinical Operations and Clinical Science, whose site-level and medical knowledge is indispensable for determining whether a statistical outlier reflects a genuine quality concern or a legitimate difference. A statistical signal is a prompt for investigation, not a conclusion. The review process should follow a clear feedback loop: identify the signal, evaluate it in context, decide on a response (monitor, query, or escalate), and document the rationale. This structured approach ensures accountability and creates an audit trail that supports both ongoing oversight and regulatory inspection readiness.

Ultimately, CSM delivers the greatest value when it is embedded operationally — treated not as a standalone statistical exercise, but as a living input to risk-based decision-making by the clinical team. When governance, data prioritization, analysis cadence, and cross-functional communication are aligned from the outset, CSM becomes what it is designed to be: an early warning system that enables smarter, more targeted oversight in service of patient safety and data integrity.

 

Interested in learning more?

Join Charles Warne and William Baker for their upcoming webinar, “Advancing Trial Oversight with Central Statistical Monitoring” on April 8 at 9AM ET / 3PM CET.

Central Statistical Monitoring is a practical, regulatory-aligned tool that can materially strengthen trial oversight and quality management.

In our upcoming webinar, we will explore:

• What CSM entails
• When and how CSM adds value to clinical trials
• Operational considerations for implementing CSM services
• Case study examples of CSM in action

Whether you work in biometrics, clinical operations, quality, or regulatory affairs, this session will provide actionable insights into building a smarter, more adaptive monitoring strategy.

The Invisibility Machine of the Women’s Health Gap

A 300-year warning

The global timeline for gender equality is not merely stalling; it is a sobering indictment of our collective priorities as a society. Current estimates from the United Nations reveal a staggering distance to parity: at our current trajectory, it will take 300 years to end child marriage, 286 years to eliminate discriminatory laws and legal protection gaps, 140 years to achieve equal representation in workplace leadership, and 47 years to reach an equal footing in national parliaments.

These are not just social milestones; they are structural barriers that define the “Gender Health Gap.” This gap represents the inequitable, systematic differences in health outcomes between women and men — differences rooted in under-researched medical needs, chronic underfunding, and a “medical model” that has historically treated male biology as the universal baseline. To close this divide, we must recognize that health equity is a strategic imperative for global stability, health capital, and economic prosperity.

 

A ledger of health inequality: The data and the reasons behind the gender gap

Sex is a fundamental genetic modifier of biology, influencing everything from disease susceptibility to treatment response. Yet we remain trapped in a “health-survival paradox”: while women generally live longer than men, they endure higher burdens of morbidity and disability throughout their lives. Some examples are:

  • Diagnostic Delays: On average, women are diagnosed nearly four years later than men for the same diseases.
  • Misdiagnosis: Women are twice as likely as men to die following a heart attack, partly because they have a 50% higher chance of receiving an incorrect initial diagnosis.
  • AI Bias: Modern digital tools often entrench these disparities; AI-powered symptom checkers have been found to flag women experiencing heart attacks as needing psychological care rather than emergency medical intervention.
  • Invisible Conditions: Many women-specific conditions are severely underdiagnosed. For example, 8 in 10 women with menopause and 6 in 10 women with endometriosis remain undiagnosed. Adenomyosis affects up to 35% of women but is often invisible in medical records due to misdiagnosis as fibroids.

 

Some of the key reasons for the gender health gap are systematic underinvestment in research and innovation, and the intersection of biology with social factors that have historically undermined women’s equal position in society.

A primary driver of the health gap is the systemic neglect of female biology in scientific research:

  • Underfunding: Only 5% of global research and development funding is allocated to female-related research. Of this, a mere 1% goes toward women-specific conditions like menopause and fertility.
  • Clinical Trial Underrepresentation: The inclusion of women in clinical research only became a requirement in the 1990s. Today, women make up only about 41.2% of participants in key disease clinical trials. In cardiovascular drug trials, female participation averages only 34%, often failing to match the actual disease prevalence in the population.
  • Adverse Drug Reactions: Because many drugs are tested primarily on men, women have a 34% increased risk of severe adverse events. A notable example is the sleep aid Zolpidem, which stays in women’s systems longer than men’s; it took until 2013 for the FDA to require reduced dosing for women after decades of increased emergency room visits.

 

The gap is shaped not only by biological sex (genetics, hormones) but also by social gender (norms, roles) and by societal roadblocks, such as the lack of female representation in leadership positions, which directly shapes inequalities in health policy development, not only for women but for all marginalized communities.

 

Fact vs. fiction: Debunking women’s health misconceptions

Effective strategy requires dismantling the myths that have long perpetuated gender health inequality.

  • Women’s health is not synonymous with OB/GYN: Progress has been hindered by the misconception that women’s health is limited to reproductive and sexual needs. In reality, the gap spans every disease area, including neurology, immunology, and cardiovascular health, where women present with unique symptoms and risk profiles.
  • Longevity does not equal better health: The “morbidity burden” is a critical indicator of inequity. Women spend more years in poor health, facing higher disability-adjusted life year rates for musculoskeletal, neurological, and mental health disorders.
  • Inequality is not solely about race, but intersectionality is critical: While gender is a standalone driver of health outcomes, it does not exist in a vacuum. For example, Black and Native American women face the highest rates of pregnancy-related mortality, and Black women are three times more likely to die from heart failure than White women. These data points illustrate why an intersectional lens is non-negotiable for any health equity strategist.

Progress has remained largely stagnant over the last decade because women remain “invisible” in methodological and decision-making frameworks. Guidance from the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) still refers to women as a “special subgroup” to be considered “when appropriate.” This classification is mathematically and medically absurd: women represent half of the global population. This invisibility fuels a self-perpetuating cycle of Data Poverty. The recent FDA guidance on addressing sex differences in clinical trials is, however, a positive step towards recognizing this impact in clinical development.

The roadblocks to reforming health technologies and decision-making frameworks to address women’s health needs are not just scientific; they are structural. They include a lack of political will, the absence of gender indicators for evaluation, and entrenched gender norms and laws that leave women unprotected on health matters and beyond.

 

Conclusion

Health equity does not need to take 300 years, though some of those glacially slow aspects must be addressed for true success to be achieved.

Big data, digital technologies, and advanced analytics provide the means to overcome the challenges to achieving women’s health equity in the coming years. Gender health equity is not an act of morality — it is the foundation of a sustainable, healthy, and economically stable future for all.

What a New Study on AI Adoption in US Hospitals May Tell Us About the Future of Real-World Data

Artificial intelligence is becoming increasingly common in US hospitals. Nearly half of hospitals surveyed in 2023–2024 reported using AI-based predictive models — but adoption is not evenly distributed across the country. Some regions and health systems are moving quickly, while others — particularly those in healthcare shortage areas — are adopting more slowly.

These findings come from “The Landscape of AI Implementation in US Hospitals,” led by Yeon-Mi Hwang and colleagues and published in Nature Health in 2026.1 The study analyzes data from more than 3,500 hospitals nationwide and maps where predictive AI tools are being implemented — and where they are not.

At first glance, this may seem like a technology adoption story. In reality, it is also a data story.

As healthcare increasingly relies on real-world data (RWD) for research, regulatory decisions, safety monitoring, and value-based payment models, the way hospitals adopt AI could directly influence the quality and coverage of the data being produced across the United States.

 

AI adoption signals digital maturity

Hwang and colleagues found that interoperability — the ability of hospital systems to exchange and integrate data — was the strongest predictor of AI adoption. Hospitals with better health information exchange capabilities and fewer data-sharing barriers were much more likely to implement predictive AI tools.

This matters because AI systems require structured, standardized, and well-integrated data to function effectively. When hospitals invest in AI, they often strengthen their documentation practices, data governance, and system integration in the process. Those same improvements elevate the overall quality of clinical data.

In other words, hospitals that are ready for AI are often also ready to produce higher-quality RWD.

 

Why high-adoption regions may produce richer RWD

Predictive AI systems frequently generate structured outputs such as risk scores, alerts, and time-stamped predictions. These outputs are recorded in electronic health records and become part of the clinical data landscape.

As a result, regions with higher AI adoption may generate data that is more complete, more standardized, and better linked across care settings. Their records may contain clearer severity markers, earlier detection signals, and more consistent documentation of clinical decision points.

This is why high-adoption regions may produce richer RWD. The data is not only documented — it is more granular and more measurable.

Because the study shows that AI adoption clusters geographically, these differences in data richness may also cluster by region.

 

The geography gap

One of the more striking findings in the study is that hospitals in healthcare shortage areas and medically underserved regions were less likely to adopt predictive AI. These areas often include rural and resource-constrained institutions.

If these hospitals have less advanced digital infrastructure, the data they generate may be more fragmented and less standardized. Over time, this could create meaningful differences in data coverage across the country. Regions with strong AI adoption may produce deeper, more analyzable datasets, while underserved areas may remain underrepresented in national RWD pipelines.

That imbalance could influence which populations are most visible in research and regulatory evidence.

 

AI changes the shape of the data

AI adoption does not simply improve data capture — it can also shape how care is delivered and recorded. Predictive systems may trigger alerts, influence documentation patterns, and alter clinical workflows. These changes become embedded in patient records.

As a result, RWD from high-adoption environments may reflect AI-influenced care pathways, while RWD from lower-adoption settings reflects more traditional workflows. Differences in adoption may therefore create differences not only in data volume, but also in data structure and interpretation.

 

Why this matters for real-world evidence

Real-world data increasingly underpins post-market surveillance, comparative effectiveness research, regulatory decision-making, and value-based care arrangements. If richer, more granular data clusters in digitally advanced regions, then the evidence generated from national datasets may disproportionately reflect those environments.

This is not necessarily intentional. It is a structural consequence of uneven infrastructure development. But without attention to digital equity, disparities in AI adoption could gradually translate into disparities in evidence generation.

 

The bottom line

The nationwide analysis by Yeon-Mi Hwang and colleagues offers one of the clearest early views of how AI is spreading across US hospitals. Because AI adoption is closely tied to interoperability, digital maturity, and institutional capacity, it likely influences how real-world data is captured, structured, and represented.

High-adoption regions may produce richer RWD — data that is more complete, more granular, and better connected across care settings. At the same time, uneven adoption raises important questions about representativeness and equity in national datasets.

Understanding how AI adoption is expanding — and where it remains limited — may become a key factor in strengthening the US data ecosystem. If increasing AI adoption leads to more complete and structured RWD, it could significantly enhance the power and reliability of real-world evidence. But ensuring that this digital maturity is broadly distributed will be essential. Otherwise, the strength of future RWE may reflect infrastructure patterns as much as clinical reality.

As AI becomes more embedded in healthcare, how and where it is implemented may quietly shape not only care delivery — but the evidence base that guides it.

Simulating Multiple Endpoints While Including External Historical Data in Adaptive Oncology Trial Designs

Multiple endpoints are now the rule, not the exception

In many contemporary Phase III oncology programs, a single primary endpoint is no longer sufficient. While Overall Survival (OS) remains the gold standard and regulators still view it as the most direct measure of clinical benefit, in practice OS takes time to mature, leading to very long and expensive clinical trials. In metastatic settings with multiple subsequent lines of therapy, the signal can dilute over time. As a result, sponsors frequently structure confirmatory trials with OS alongside an endpoint that is faster to measure, such as Progression-Free Survival (PFS), and sometimes Overall Response Rate (ORR), incorporated either as dual primary endpoints or within a gatekeeping framework.

For example, consider a Phase III trial in non-small cell lung cancer (NSCLC) in which PFS is expected to read out at ~18 months, while OS may require 36 months of follow-up. The sponsor hopes PFS will support regulatory interaction earlier, potentially even forming the basis of accelerated approval, while OS continues to mature for full approval. Accelerated approval may save the sponsor resources, or bring in additional resources, while OS data continue to accrue, since regulatory agencies still require the OS evidence for the final claim of success.

Although this seems straightforward, the approach does not account for all the complexities that may affect that final claim. These endpoints are correlated, mature at different rates, and are influenced by post-progression therapy, imaging frequency, and dropout patterns. Designing such a study requires more than separate power computations for each endpoint; it requires understanding how they behave together. This is where simulation becomes essential.

 

The statistical reality of correlated endpoints

Endpoints such as ORR, PFS, and OS are not independent random variables. They arise from the same underlying disease process. Patients who achieve early tumor shrinkage (i.e., an objective response) often experience delayed progression, but that does not guarantee improved OS. Subsequent therapy, crossover, and differential dropout can attenuate survival differences. Many programs begin by assuming independence when calculating sample size or multiplicity adjustments. Unfortunately, that assumption rarely holds once joint behavior is modeled explicitly.

For example:

  • If ORR and PFS have moderate positive correlation (e.g., driven by response durability), the probability of dual success may be higher than naïve calculations suggest.
  • If OS is weakly correlated with PFS due to heavy post-progression treatment, hierarchical strategies may protect alpha but substantially reduce the probability of demonstrating statistical significance on OS.

Note that statisticians usually evaluate a range of correlation coefficients between endpoints to assess their impact on the overall operating characteristics of the trial.
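To make the first bullet concrete, the sketch below estimates the probability of dual success over a small range of correlations. It is a simplified illustration, not a full trial simulation: it assumes the two test statistics are bivariate normal, that each endpoint is tested one-sided at 0.0125, and that the marginal powers are 90% and 80%. The function name and all of these values are illustrative assumptions rather than recommendations.

```python
import random
from statistics import NormalDist


def dual_success_probability(power_a, power_b, rho, alpha_each=0.0125,
                             n_sims=100_000, seed=1):
    """Monte Carlo estimate of P(both one-sided z-tests reject).

    Assumes (for illustration) bivariate normal test statistics with
    correlation rho and unit variances.
    """
    std = NormalDist()
    crit = std.inv_cdf(1 - alpha_each)
    # Mean shifts that reproduce the requested marginal powers.
    mu_a = crit + std.inv_cdf(power_a)
    mu_b = crit + std.inv_cdf(power_b)
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z_a = mu_a + e1
        # Mix e1 into the second statistic so that corr(z_a, z_b) = rho.
        z_b = mu_b + rho * e1 + (1 - rho ** 2) ** 0.5 * e2
        if z_a > crit and z_b > crit:
            wins += 1
    return wins / n_sims


# Scan a few correlation assumptions, as a sensitivity analysis would.
for rho in (0.0, 0.3, 0.6):
    print(rho, dual_success_probability(0.90, 0.80, rho))
```

At zero correlation the estimate should sit close to the naive product of the marginal powers (0.90 × 0.80 = 0.72), and it rises as the correlation becomes more positive. A real design exercise would replace the normal approximation with patient-level event simulation.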

The FDA will typically focus first on control of familywise type I error across endpoints. But during review, questions often shift toward interpretability:

  • How was correlation justified?
  • Were joint distributions modelled based on empirical data?
  • How sensitive are conclusions to deviations in event timing?

Those questions are difficult to answer with closed-form approximations alone.

 

Why closed-form calculations fall short

Closed testing procedures, alpha recycling, and parallel gatekeeping frameworks are well-established tools for multiplicity control. From a theoretical standpoint, they provide strong familywise error control under specified assumptions, but operating characteristics become non-intuitive once endpoints are correlated and events accrue at different rates.

For example, consider a hierarchical testing strategy in which OS is tested first. If OS fails narrowly due to immature data, PFS may never be formally tested, even if the PFS hazard ratio is clinically meaningful.

Alternatively, reversing the order (i.e., PFS tested first followed by OS) may increase the probability of declaring success on PFS, but now OS significance depends on passing through earlier gates. Power becomes conditional in ways that clinical teams often underestimate.
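The effect of testing order can be sketched with a small simulation of a fixed-sequence procedure. This is a simplified illustration under assumed inputs: the test statistics are taken to be bivariate normal, each endpoint is tested at a one-sided alpha of 0.025, OS has 60% marginal power at the analysis (immature data), PFS has 90%, and their correlation is 0.5. The function name and every numeric value are hypothetical.

```python
import random
from statistics import NormalDist


def fixed_sequence_claims(order, power, rho, alpha=0.025,
                          n_sims=50_000, seed=7):
    """Estimate how often each endpoint earns a formal claim.

    order: endpoint names in testing order, e.g. ("OS", "PFS")
    power: marginal power per endpoint at the full one-sided alpha
    rho:   assumed correlation between the OS and PFS test statistics
    """
    std = NormalDist()
    crit = std.inv_cdf(1 - alpha)
    mu = {name: crit + std.inv_cdf(p) for name, p in power.items()}
    rng = random.Random(seed)
    claims = {name: 0 for name in order}
    for _ in range(n_sims):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z = {
            "OS": mu["OS"] + e1,
            "PFS": mu["PFS"] + rho * e1 + (1 - rho ** 2) ** 0.5 * e2,
        }
        for name in order:
            if z[name] <= crit:
                break  # gate closed: later endpoints are not formally tested
            claims[name] += 1
    return {name: count / n_sims for name, count in claims.items()}


assumed_power = {"OS": 0.60, "PFS": 0.90}  # illustrative values only
os_first = fixed_sequence_claims(("OS", "PFS"), assumed_power, rho=0.5)
pfs_first = fixed_sequence_claims(("PFS", "OS"), assumed_power, rho=0.5)
print("OS first :", os_first)
print("PFS first:", pfs_first)
```

In this simplified setting the probability of winning on both endpoints is the same in either order, but the probability of a formal PFS claim is capped by OS power when OS is tested first. That conditional dependence is the one described above.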

Simulating such designs allows evaluation of:

  • Probability of joint success (OS and PFS both significant)
  • Probability of partial success (e.g., showing significant PFS while OS is not yet mature)
  • Impact of varying correlation assumptions
  • Sensitivity to delayed event accrual
  • Effect of interim analyses on overall power

This helps clinical teams focus on actual operating characteristics under realistic assumptions instead of theoretical power under ideal ones. For example, in some settings, the probability of winning on both endpoints may drop from 75% to around 50% once correlation structures are introduced.

 

Modeling multiple endpoint outcomes

Traditional simulations often generate each endpoint independently from parametric survival distributions (e.g., using Exponential or Weibull curves). This is convenient, but not always clinically realistic. The FDA will often ask how simulation assumptions were calibrated. “We assumed independence” is not persuasive.

Therefore, generating patient outcome data from a multistate model may produce more credible data that align better with what will be observed in practice. This is certainly not the only approach, but it is one we encourage using in addition to the copula approach, in which correlation coefficients between the endpoints must be specified.
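As a minimal sketch of the multistate idea, the example below uses a three-state illness-death structure (stable, progressed, dead) with constant transition hazards. The hazards, the absence of censoring and treatment effect, and the function names are all simplifying assumptions made for illustration. The point is that PFS and OS come from one patient-level process, so their correlation is a consequence of the model rather than an input.

```python
import random


def simulate_illness_death(n, h_prog, h_death_pre, h_death_post, seed=11):
    """Return (pfs, os) times in months for n patients, without censoring.

    Hazards are constant monthly rates (an illustrative assumption).
    """
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        t_prog = rng.expovariate(h_prog)            # stable -> progressed
        t_death_pre = rng.expovariate(h_death_pre)  # stable -> dead
        if t_death_pre < t_prog:
            pfs = os_time = t_death_pre             # death before progression
        else:
            pfs = t_prog
            # progressed -> dead, counted from the progression date
            os_time = t_prog + rng.expovariate(h_death_post)
        patients.append((pfs, os_time))
    return patients


def pearson(xs, ys):
    """Plain Pearson correlation, written out to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


cohort = simulate_illness_death(
    20_000, h_prog=0.10, h_death_pre=0.01, h_death_post=0.05
)
pfs_times, os_times = zip(*cohort)
print("PFS/OS correlation:", round(pearson(pfs_times, os_times), 2))
```

Raising the post-progression hazard shortens the OS tail and strengthens the PFS/OS relationship, while effective subsequent therapy (a lower post-progression hazard) weakens it. A production simulator would add censoring, enrollment timing, and arm-specific hazards.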

Leveraging prior internal data, particularly standard-of-care arms from earlier studies, can anchor assumptions about:

  • Correlation between endpoints
  • Event-time distributions
  • Dropout rates
  • Missing data mechanisms

Alternatively, external historical data can be used for this purpose. However, clinical teams must evaluate whether these data are exchangeable with the planned trial for each assumption they inform, especially if disease management has shifted since the data were collected.
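Whichever source is used, the simplest way to anchor an event-time assumption is to estimate a constant hazard from a prior control arm as observed events divided by total follow-up time. The sketch below assumes exponential event times, and the four records are made-up toy values used only to show the arithmetic. They are not data from any study.

```python
import math


def exponential_hazard(records):
    """MLE of a constant hazard: observed events / total follow-up time.

    records: iterable of (follow_up_time, event_observed) pairs, where
    event_observed is False for censored patients.
    """
    records = list(records)
    events = sum(1 for _, observed in records if observed)
    exposure = sum(time for time, _ in records)
    if exposure <= 0:
        raise ValueError("total follow-up time must be positive")
    return events / exposure


# Toy records (months, event flag), purely illustrative.
toy_control_arm = [(6.0, True), (12.0, False), (3.0, True), (9.0, True)]
hazard = exponential_hazard(toy_control_arm)   # 3 events / 30 months
median_months = math.log(2) / hazard           # implied exponential median
print(hazard, round(median_months, 2))
```

The same events-over-exposure calculation can be applied to dropout by treating withdrawal as the event. If the historical Kaplan-Meier curve is clearly not exponential, a Weibull or piecewise model would be a more appropriate anchor.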

 

Multiplicity control considerations

As previously mentioned, testing multiple primary endpoints requires strict familywise type I error control. Common approaches include:

  • Hierarchical gatekeeping
  • Alpha recycling
  • Closed testing procedures
  • Pre-specified adaptive decision rules

Under strong positive correlation, alpha allocation may be conservative relative to realized joint behavior. Under weak correlation, nominal power calculations may overstate the chance of dual success.
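The conservatism under positive correlation can be checked numerically. The sketch below assumes a simple Bonferroni split of a one-sided 0.025 across two endpoints (0.0125 each), bivariate normal test statistics, and no true effect on either endpoint. The split, the function name, and the correlation values are illustrative assumptions.

```python
import random
from statistics import NormalDist


def familywise_error(rho, alpha_each=0.0125, n_sims=200_000, seed=3):
    """Estimate P(at least one false rejection) under the global null."""
    crit = NormalDist().inv_cdf(1 - alpha_each)
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z_a = e1                                       # no treatment effect
        z_b = rho * e1 + (1 - rho ** 2) ** 0.5 * e2    # corr(z_a, z_b) = rho
        if z_a > crit or z_b > crit:
            false_positives += 1
    return false_positives / n_sims


for rho in (0.0, 0.5, 0.9):
    print(rho, familywise_error(rho))
```

With independent statistics the estimate should be close to 1 − (1 − 0.0125)² ≈ 0.0248, and it falls further below 0.025 as the correlation increases. This is the unused alpha that correlation-aware procedures try to recover.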

One area that is often overlooked is how interim analyses interact with multiplicity. Early looks based on PFS may alter the distribution of OS information at final analysis, particularly if enrollment slows after interim data are reviewed. That secondary impact is unfortunately rarely captured.

Simulations that account for decisions across multiple endpoints may help characterize type I error control and power trade-offs under more realistic execution scenarios.

 

Integrating external and historical data

In oncology, prior data are often available, particularly for standard-of-care arms. Including empirically derived components, such as correlation and dropout rate assumptions, in simulation makes projections more defensible.

Regulatory agencies may still require conservative assumptions, but a simulation framework grounded in observed data allows transparent discussion of where assumptions are aggressive, where they are conservative, and why.

 

A practical perspective

Multiple primary endpoints introduce scientific opportunity and statistical complexity at the same time. The trade-offs that must be accounted for include, but are not limited to, overcommitting on sample size, conditional power dependencies across endpoints, sensitivity to correlation structures, event timing uncertainty, and interim decision impacts.

Simulation, when built on joint patient-level modelling and calibrated to empirical data, allows these trade-offs to be evaluated prospectively rather than discovered after a database lock.

In our experience, teams that invest early in this level of simulation and endpoint modelling encounter fewer redesign discussions, particularly once regulatory feedback begins. More importantly, cross-functional stakeholders gain a clearer understanding of what “success” actually means across endpoints.

That clarity is often worth as much as the statistical precision itself.

 

Interested in learning more?

Join J. Kyle Wathen, Valeria Mazzanti, and Julija Saltane for their upcoming webinar “Simulating Multiple Endpoints to Drive Late-Stage Oncology Trials” on Thursday, April 2 at 10 AM ET: