
Simulating Multiple Endpoints While Including External Historical Data in Adaptive Oncology Trial Designs

Multiple endpoints are now the rule, not the exception

In many contemporary Phase III oncology programs, a single primary endpoint is no longer sufficient. While Overall Survival (OS) remains the gold standard and regulators still view it as the most direct measure of clinical benefit, in practice OS takes time to mature, leading to very long and expensive clinical trials. In metastatic settings with multiple subsequent lines of therapy, the signal can also dilute over time. As a result, sponsors frequently structure confirmatory trials with OS alongside an endpoint that is faster to measure, such as Progression-Free Survival (PFS) and sometimes Overall Response Rate (ORR), incorporated either as dual primary endpoints or within a gatekeeping framework.

Consider, for example, a Phase III trial in non-small cell lung cancer (NSCLC) in which PFS is expected to read out at ~18 months, while OS may require 36 months of follow-up. The sponsor hopes PFS will support regulatory interaction earlier, potentially even forming the basis of accelerated approval, while OS continues to mature for full approval. Accelerated approval may save the sponsor resources, or bring in additional resources, while OS data continue to accrue, since OS evidence is still required by regulatory agencies for the final claim of success.

Although this approach seems straightforward, it fails to take into account all the complexities that may affect that final claim. These endpoints are correlated, mature at different rates, and are influenced by post-progression therapy, imaging frequency, and dropout patterns. Designing such a study requires more than separate power computations for each endpoint; it requires understanding how the endpoints behave together. This is where simulation becomes essential.

 

The statistical reality of correlated endpoints

Endpoints such as ORR, PFS, and OS are not independent random variables. They arise from the same underlying disease process. Patients who achieve early tumor shrinkage (i.e., ORR) often experience delayed progression. But that does not guarantee improved OS. Subsequent therapy, crossover, and differential dropout can attenuate survival differences. Many programs begin by assuming independence when calculating sample size or multiplicity adjustments. Unfortunately, that assumption rarely holds once joint behavior is modeled explicitly.

For example:

  • If ORR and PFS have moderate positive correlation (e.g., driven by response durability), the probability of dual success may be higher than naïve calculations suggest.
  • If OS is weakly correlated with PFS due to heavy post-progression treatment, hierarchical strategies may protect alpha but substantially reduce the probability of demonstrating statistical significance on OS.

Note that statisticians usually include a range of correlation coefficients between endpoints to evaluate their impact on overall operating characteristics of the trial.
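To make this concrete, correlated endpoint times can be drawn by tying exponential marginals together with a Gaussian copula and then sweeping the correlation, as statisticians typically do. The sketch below is purely illustrative: the medians, the correlation values, and the exponential marginals are all assumptions that a real design team would calibrate to prior data.

```python
import numpy as np

def correlated_event_times(n, median_pfs, median_os, rho, seed=42):
    """Draw correlated (PFS, OS) event times via a Gaussian copula.

    Marginals are exponential with the given medians. rho is the latent
    normal correlation: a design assumption to vary in sensitivity
    analyses, not a known quantity.
    """
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    # rank-based probability-integral transform (numpy-only stand-in for the normal CDF)
    u = (np.argsort(np.argsort(z, axis=0), axis=0) + 0.5) / n
    pfs = -np.log(1.0 - u[:, 0]) * median_pfs / np.log(2.0)  # inverse exponential CDF
    os_ = -np.log(1.0 - u[:, 1]) * median_os / np.log(2.0)
    return pfs, os_

# sweep the assumed correlation and inspect the dependence it induces
for rho in (0.0, 0.3, 0.7):
    pfs, os_ = correlated_event_times(20_000, median_pfs=9.0, median_os=24.0, rho=rho)
    print(f"rho={rho:.1f}  corr(PFS, OS) = {np.corrcoef(pfs, os_)[0, 1]:.2f}")
```

Note that the Pearson correlation of the simulated times is attenuated relative to the latent rho, which is exactly the kind of behavior a naïve independence calculation never surfaces.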

The FDA will typically focus first on control of familywise type I error across endpoints. But during review, questions often shift toward interpretability:

  • How was correlation justified?
  • Were joint distributions modelled based on empirical data?
  • How sensitive are conclusions to deviations in event timing?

Those questions are difficult to answer with closed-form approximations alone.

 

Why closed-form calculations are not enough

Closed testing procedures, alpha recycling, and parallel gatekeeping frameworks are well-established tools for multiplicity control. From a theoretical standpoint, they provide strong familywise error control under specified assumptions, but operating characteristics become non-intuitive once endpoints are correlated and events accrue at different rates.

For example, consider a hierarchical testing strategy in which OS is tested first. If OS fails narrowly due to immature data, PFS may never formally be tested, even if the PFS hazard ratio is clinically meaningful.

Alternatively, reversing the order (i.e., PFS tested first followed by OS) may increase the probability of declaring success on PFS, but now OS significance depends on passing through earlier gates. Power becomes conditional in ways that clinical teams often underestimate.

Simulating such designs allows evaluation of:

  • Probability of joint success (OS and PFS both significant)
  • Probability of partial success (e.g., showing significant PFS while OS is not yet mature)
  • Impact of varying correlation assumptions
  • Sensitivity to delayed event accrual
  • Effect of interim analyses on overall power

This helps clinical teams focus on actual operating characteristics under realistic assumptions instead of theoretical power under ideal ones. For example, in some settings, probability of winning on both endpoints may drop from 75% to around 50% when introducing correlation structures.
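The order effects described above can be explored with a small Monte Carlo sketch. The expected z-statistics and their correlation below are invented inputs, and drawing test statistics directly from a bivariate normal glosses over event accrual and interim looks, but the conditional nature of power under a hierarchy is visible.

```python
import numpy as np

def gatekeeping_probs(mu_pfs, mu_os, rho, order, n_sim=200_000, crit=1.96, seed=1):
    """Monte Carlo operating characteristics for a two-endpoint hierarchy.

    mu_pfs / mu_os are assumed expected z-statistics at analysis; rho is
    the assumed correlation between the test statistics. Illustrative
    sketch only: a real design would simulate event accrual explicitly.
    """
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal([mu_pfs, mu_os], [[1.0, rho], [rho, 1.0]], size=n_sim)
    z_pfs, z_os = z[:, 0], z[:, 1]
    if order == "os_first":
        win_os = z_os > crit
        win_pfs = win_os & (z_pfs > crit)   # PFS tested only if OS passes
    else:                                   # "pfs_first"
        win_pfs = z_pfs > crit
        win_os = win_pfs & (z_os > crit)    # OS tested only if PFS passes
    return win_pfs.mean(), win_os.mean(), (win_pfs & win_os).mean()

for order in ("os_first", "pfs_first"):
    p_pfs, p_os, p_both = gatekeeping_probs(2.8, 1.5, rho=0.4, order=order)
    print(f"{order}: P(PFS win)={p_pfs:.2f}  P(OS win)={p_os:.2f}  P(both)={p_both:.2f}")
```

Under either order the probability of winning both endpoints is the same; what the testing order changes is the chance of at least a partial win, which is exactly the trade-off clinical teams tend to underestimate.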

 

Modeling multiple endpoint outcomes

Traditional simulations often generate each endpoint independently from parametric survival distributions (e.g., using Exponential or Weibull curves). This is convenient, but not always clinically realistic. The FDA will often ask how simulation assumptions were calibrated. “We assumed independence” is not persuasive.

Therefore, generating patient outcomes from a multistate model may produce more credible data that aligns better with what will be observed in practice. This is certainly not the only approach, but one we encourage in addition to the copula approach, in which correlation coefficients between the endpoints must be specified.
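As a sketch of the multistate idea, the fragment below moves each simulated patient through stable, progressed, and dead states using illustrative transition hazards (invented per-month rates, not estimates from any study). The correlation between PFS and OS then emerges from the shared trajectory rather than from a coefficient supplied up front.

```python
import numpy as np

def simulate_multistate(n, h_prog, h_death_pre, h_death_post, seed=7):
    """Simulate stable -> progression -> death, with a direct stable -> death path.

    All hazards are illustrative assumptions (per month). PFS and OS are
    derived from the same trajectory, so their dependence is induced by
    the model structure.
    """
    rng = np.random.default_rng(seed)
    t_prog = rng.exponential(1.0 / h_prog, n)            # latent time to progression
    t_death_pre = rng.exponential(1.0 / h_death_pre, n)  # latent time to death without progression
    progressed = t_prog < t_death_pre
    pfs = np.where(progressed, t_prog, t_death_pre)      # first of progression or death
    post = rng.exponential(1.0 / h_death_post, n)        # survival after progression
    os_ = np.where(progressed, t_prog + post, t_death_pre)
    return pfs, os_

pfs, os_ = simulate_multistate(20_000, h_prog=0.08, h_death_pre=0.02, h_death_post=0.06)
print("median PFS:", round(np.median(pfs), 1), " median OS:", round(np.median(os_), 1))
print("induced corr(PFS, OS):", round(np.corrcoef(pfs, os_)[0, 1], 2))
```

A useful by-product of this construction is that OS can never precede PFS, a consistency constraint that independently generated endpoint times do not respect.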

Leveraging prior internal data, particularly standard-of-care arms from earlier studies, can anchor assumptions about:

  • Correlation between endpoints
  • Event-time distributions
  • Dropout rates
  • Missing data mechanisms

Alternatively, external historical data can also be used for this purpose. However, clinical teams must evaluate whether these data are exchangeable with the setting in which they will be applied, especially if disease management has shifted since the data were collected.

 

Multiplicity control considerations

As previously mentioned, testing multiple primary endpoints requires strict familywise type I error control. Common approaches include:

  • Hierarchical gatekeeping
  • Alpha recycling
  • Closed testing procedures
  • Pre-specified adaptive decision rules

Under strong positive correlation, alpha allocation may be conservative relative to realized joint behavior. Under weak correlation, nominal power calculations may overstate the chance of dual success.

One area that is often overlooked is how interim analyses interact with multiplicity. Early looks based on PFS may alter the distribution of OS information at final analysis, particularly if enrollment slows after interim data are reviewed. That secondary impact is unfortunately rarely captured.

Simulations that account for the multiple-endpoint decision rules may help characterize type I error control and power trade-offs under more realistic execution scenarios.

 

Integrating external and historical data

In oncology, prior data are often available, particularly for standard-of-care arms. Including empirically derived components, such as correlation and dropout rate assumptions, in simulation makes projections more defensible.

Regulatory agencies may still require conservative assumptions, but a simulation framework grounded in observed data allows transparent discussion of where assumptions are aggressive, where they are conservative, and why.

 

A practical perspective

Multiple primary endpoints introduce scientific opportunity and statistical complexity at the same time. There is a list of trade-offs that must be accounted for, including but not limited to, overcommitting on sample size, conditional power dependencies across endpoints, sensitivity to correlation structures, event timing uncertainty, and interim decision impacts.

Simulation, when built on joint patient-level modelling and calibrated to empirical data, allows these trade-offs to be evaluated prospectively rather than discovered after a database lock.

In our experience, teams that invest early in this level of simulations and endpoints modelling encounter fewer redesign discussions, particularly once regulatory feedback begins. More importantly, cross-functional stakeholders gain a clearer understanding of what “success” actually means across endpoints.

That clarity is often worth as much as the statistical precision itself.

 

Interested in learning more?

Join J. Kyle Wathen, Valeria Mazzanti, and Julija Saltane for their upcoming webinar “Simulating Multiple Endpoints to Drive Late-Stage Oncology Trials” on Thursday, April 2 at 10 AM ET:

External Control Arms in Drug Development: Methodological and Regulatory Considerations

Drug development is growing more complex, with compressed timelines and increasingly high expectations from regulators, payers, and health systems. In this setting, external control arms (ECAs) leveraging real‑world data (RWD) are emerging as a pragmatic approach to support clinical development and downstream commercial decision‑making.

Randomized controlled trials (RCTs) remain the gold standard for evidence generation. However, in many modern development programs, traditional randomized designs are not feasible or may raise ethical concerns. Sponsors increasingly encounter situations in which:

  • Patient recruitment is slow, limited, or not achievable
  • Randomization is ethically challenging
  • Development costs escalate rapidly
  • Competitive dynamics demand accelerated evidence generation
  • Patient populations are small or rapidly progressing
  • There is a high unmet medical need

 

These challenges are particularly acute in oncology, rare diseases, post‑approval expansion studies, and advanced or cell‑based therapies.

 

What is an external control arm?

An external control arm replaces or supplements a traditional control group by leveraging data from patients treated outside the clinical trial. These patients are drawn from routine clinical practice and reflect outcomes under standard‑of‑care treatment in real‑world settings.

External controls are typically constructed using real‑world data sources such as:

  • Electronic health records (EHRs)
  • Administrative and insurance claims
  • Disease and treatment registries

Unlike trial data, real‑world data reflect patterns of diagnosis, treatment, and follow‑up in everyday clinical care. The foundation of a well‑designed external control study is the use of fit‑for‑purpose data that are sufficiently complete, clinically relevant, and reliable to support robust and defensible analyses.

 

Strategic value of external control arms

When thoughtfully designed and appropriately governed, ECAs can provide meaningful strategic benefits, including:

  • Shortened development timelines
  • Improved feasibility of clinical studies
  • Evidence generation in small or rare populations
  • Stronger value narratives for payers and health technology assessment bodies
  • Support for lifecycle management and label expansion strategies

 

Methodological considerations and risks to manage

The credibility and acceptability of an external control arm depend heavily on methodological rigor.

Key considerations include the following:

1. Study design

External control studies should be designed to closely mirror the clinical trial, including:

  • Alignment of inclusion and exclusion criteria
  • Clear definition of index date and baseline
  • Comparable follow‑up periods and outcome assessment windows
  • Consistent treatment context and line of therapy

Pre-specification of the estimand and statistical analysis plan is critical to avoid post‑hoc decision‑making.

 

2. Patient selection and alignment

Ensuring comparability between trial participants and real‑world patients is one of the most critical aspects of ECA design. Sponsors should:

  • Use transparent, reproducible cohort selection algorithms
  • Apply consistent definitions for key demographic and clinical variables
  • Assess overlap and positivity between trial and external populations
  • Explicitly evaluate differences in baseline characteristics

Sensitivity analyses should be conducted to quantify the impact of residual differences where appropriate.

 

3. Handling confounding and bias

Because external control arms lack randomization, confounding must be actively addressed. Common analytical approaches include:

  • Propensity score methods (matching, weighting, stratification)
  • Multivariable outcome regression
  • Doubly robust methods that combine weighting and modeling

Method selection should be driven by study objectives, data characteristics, sample size, and variable completeness, not by analytical convenience.
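As a minimal illustration of the weighting idea, the sketch below generates a synthetic confounded cohort and applies inverse-probability-of-treatment weights. The confounder, the propensity model, and the sample size are all invented for the sketch; in a real ECA analysis the propensity score would be estimated from baseline covariates in the pooled trial and external cohort.

```python
import numpy as np

def iptw_balance_demo(n=50_000, seed=3):
    """Show inverse-probability-of-treatment weighting restoring covariate balance.

    Synthetic data: one baseline confounder drives assignment to the
    'trial' arm vs. the external control arm. The propensity model here
    is known by construction, purely for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                       # baseline confounder
    ps = 1.0 / (1.0 + np.exp(-1.2 * x))          # propensity of being a trial patient
    treated = rng.random(n) < ps
    # ATE-style weights: 1/ps for trial patients, 1/(1-ps) for external controls
    w = np.where(treated, 1.0 / ps, 1.0 / (1.0 - ps))
    raw_diff = x[treated].mean() - x[~treated].mean()
    weighted_diff = (np.average(x[treated], weights=w[treated])
                     - np.average(x[~treated], weights=w[~treated]))
    return raw_diff, weighted_diff

raw, weighted = iptw_balance_demo()
print(f"confounder imbalance: raw={raw:.3f}, weighted={weighted:.3f}")
```

The same balance check, typically reported as standardized mean differences across all baseline covariates, is exactly what reviewers look for when judging whether weighting has done its job.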

 

4. Data quality and missingness

Real‑world data are inherently heterogeneous and incomplete. Methodological plans should address:

  • Data provenance, completeness, and validation
  • Handling of missing or partially observed variables
  • Measurement variability across providers, systems, or data sources
  • Differences in assessment timing and frequency

Imputation strategies and key assumptions should be explicitly documented and tested through sensitivity analyses.

 

5. Outcome definition and assessment

Endpoints derived from RWD must be clinically meaningful and aligned as closely as possible with trial definitions. Considerations include:

  • Use of validated real‑world endpoint definitions
  • Clear attribution and timing of outcomes
  • Consistency with regulatory‑recognized measures of clinical benefit
  • Avoidance of surrogate endpoints unless scientifically justified

Outcome misclassification remains a key risk and should be explicitly evaluated.

 

6. Sensitivity and robustness analyses

Regulators expect evidence that findings are robust under alternative assumptions. Analyses may include:

  • Variation in matching or weighting specifications
  • Alternative cohort definitions or look‑back periods
  • Use of negative control outcomes or exposures
  • Quantitative bias analyses where feasible

The objective is to demonstrate that conclusions are not driven by a single design or modeling decision.

 

7. Transparency and documentation

Methodological transparency is essential for regulatory and payer review. Best practices include:

  • Prespecifying analysis plans and decision rules
  • Fully documenting data sources, algorithms, and assumptions
  • Providing traceability from raw data to final outcomes
  • Enabling reproducibility of key analyses

 

Regulatory outlook and expectations

Regulatory agencies and health technology assessment bodies, including the U.S. Food and Drug Administration (FDA), the European Medicines Agency (EMA), and the Canadian Agency for Drugs and Technologies in Health (CADTH), have recognized the potential role of external control arms under conditions of methodological rigor and transparency.

Regulatory agencies have not lowered evidentiary standards. Rather, they have:

  • Provided greater clarity on scenarios in which external control arms may be acceptable
  • More explicitly articulated methodological expectations
  • Encouraged early and proactive dialogue with sponsors

 

Successful regulatory submissions that incorporate ECAs typically:

  • Provide a clear scientific and ethical rationale for why randomization is not feasible or appropriate
  • Use high‑quality, fit‑for‑purpose real‑world data sources
  • Transparently define patient selection criteria and demonstrate alignment with the trial population
  • Show that findings are robust, reproducible, and minimally biased

Early engagement with regulators remains critical to aligning expectations and maximizing the likelihood of success.

 

Join Anupama Vasudevan and James Matcham on February 3 at 10 a.m. ET for open office hours on “Evidence Generation with External Control Arms”:

FDA Guidance on Assessing Overall Survival in Oncology Trials: A DMC Perspective

Overall survival (OS) is the ultimate endpoint — it is easily measured and of the utmost clinical relevance. However, it also takes the longest time to mature and may not be as sensitive to treatment effect as other endpoints. In many oncology studies, for example, alternative endpoints are used, such as progression-free survival (PFS). The hope is that these alternative endpoints are indeed clinically relevant and allow for quicker results with fewer events and/or subjects than a study powered to detect a difference in OS. However, there has been growing concern over the use of endpoints such as PFS.

As a result, the FDA has considered guidance on how to prioritize endpoints: whether OS or an alternative endpoint should be primary, or both should be co-primary. The FDA has also weighed whether accelerated approval might be permitted using a quicker-maturing alternative while waiting for the OS data to mature and (hopefully) substantiate the efficacious results from the co-primary endpoint.

To address this, the FDA has produced the draft guidance “Approaches to Assessment of Overall Survival in Oncology Clinical Trials.”

Here, we will review this document as it pertains to Data Monitoring Committees (DMCs), and provide our own thoughts based on our 30 years of work and thousands of studies.

 

FDA draft guidance: “Approaches to Assessment of Overall Survival in Oncology Clinical Trials”

A key section of the FDA guidance related to DMC activity is the following:

This text reinforces what Cytel has long advocated — more interim analyses, both for futility (and possibly harm) and benefit — and reinforces the independence of the DMC and the group directly facilitating the DMC’s work. Approaches to constructing and implementing interim analyses for OS can be flexible — for example, assessing futility at 50% information fraction of OS, and then assessing both futility and benefit at 75% information fraction.

In addition to the section above, there are other references within the draft guidance for how a DMC or other parties would use and interpret OS data. Some particular references are below:

 

Using and interpreting overall survival data

It is important that the DMC understands the operating characteristics of these interim analyses. The analyses might be described at a high level in the DMC charter, or more specifically in the protocol, study SAP, or interim analysis plan (IAP). Translating statistical methodology informally into actual counts helps ensure audiences, including the DMC, understand what the “tipping points” might be. For example, a stopping rule for futility at 50 deaths (with the rule perhaps defined by an O’Brien-Fleming boundary or by Bayesian analysis) is helpfully presented to the DMC translated into clearly understandable event counts as a reality check: perhaps the study design and stopping rules imply that 28 deaths on active vs. 22 deaths on placebo (using reasonable assumptions for the timing of events and other aspects of the data) would be the tipping point that triggers futility. This could be informally presented to the DMC at its organizational meeting to ensure the DMC understands the high-level approach and how it might play out in real numbers.
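That translation from a z-scale rule into raw counts can be sketched in a few lines. The sketch assumes 1:1 randomization and the common score-statistic approximation z ~ (d_control - d_active)/sqrt(D); the threshold of z < -0.8 is invented for illustration and happens to reproduce a 28 vs. 22 split of 50 deaths.

```python
import math

def futility_tipping_point(total_deaths, z_threshold):
    """Translate a z-scale futility rule into raw death counts.

    Uses the approximation z ~ (d_control - d_active)/sqrt(D) for a 1:1
    randomized trial with D total deaths. The threshold is an invented
    illustration, not a recommended boundary.
    """
    for d_active in range(total_deaths // 2, total_deaths + 1):
        d_control = total_deaths - d_active
        z = (d_control - d_active) / math.sqrt(total_deaths)
        if z < z_threshold:
            return d_active, d_control  # first split that crosses the rule
    return None

print(futility_tipping_point(50, z_threshold=-0.8))
```

Presenting the rule this way (“28 vs. 22 would trigger futility”) is far easier for a DMC to internalize than the boundary on the z scale.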

It is also important for the group creating the analyses to communicate to the DMC the uncertainty of early results so that the DMC does not overinterpret. This is especially true in the (theoretic) presence of non-proportional hazards.

 

Overall survival data: A key safety parameter

OS is a critical variable to help the DMC assess the study even outside of formal interim analyses — both as the DMC interprets safety (more deaths on the active arm) and efficacy (fewer deaths on the active arm). Together, this provides a more complete way to assess risk-benefit in the context of other safety concerns. While some sponsors prefer a “safety only” DMC and are hesitant to provide endpoint data, it is typically well understood that, as a key safety parameter, OS data should be made available to the DMC.

Therefore, we typically present a death table to the DMC, and likely two: one summary of deaths for the safety population and one for the randomized population. This is particularly relevant for open-label studies where there might be an imbalance in subjects not treated and therefore not included in the safety population. Note that, traditionally, not all deaths are included in the adverse event (AE) dataset. In many studies, AEs occurring after treatment has finished and/or due to underlying disease might not be captured in the AE dataset.

Cytel also may tabulate investigator-assessed cause of death. (We do note to DMCs the hazards of interpreting investigator-reported causality though. That is part of the reason OS is appealing, as it is immune to the potential biases of ascertaining relationship.) Two treatment arms might have the same number of deaths, but the interpretation of the DMC could be different if there is an excess of deaths due to disease progression on one arm, relative to an excess of deaths due to serious adverse events (SAEs)/toxicity on another arm. It is traditional to also show timing in a table (perhaps deaths <30 days after last dose vs. deaths ≥ 30 days after last dose). The DMC may drill down into the fatal SAEs/toxicity to look for patterns and perhaps make recommendations to mitigate that risk. We have seen examples of excess fatal interstitial lung disease (ILD), fatal COVID-19, and fatal infections on active arms compared to control arms. DMCs in these situations have thought hard about mitigation strategies, as they would for any concerning imbalance in a major safety domain.

A Kaplan-Meier figure of time to death (possibly without any inferential statistics) is also typically presented to the DMC — typically based on the randomized population. It can be important for the DMC to understand the timing of the deaths to help answer questions such as whether most deaths occur early in the study or if there is a differential pattern over time (i.e., crossing curves implying non-proportional hazards) between the treatment arms. The definitions used for programming the Kaplan-Meier figure might not match the exact definitions at end of study if used in a non-inferential way — the censoring rules might be flexible based on the data available (e.g., the choice of whether to censor subjects still alive and in study follow-up at their last known contact, or whether to censor at the data cut-off date). In some situations, especially where OS is the primary endpoint, the sponsor might be hesitant to even show a non-inferential Kaplan-Meier figure of time to death to the DMC outside of formal analyses (if those exist). Nonetheless, the DMC could argue these are needed and, at minimum, are available upon request by the group facilitating the DMC’s discussion.
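For illustration, the product-limit (Kaplan-Meier) estimate underlying such a figure can be computed in a few lines. The toy data below are invented, and production outputs for a DMC would of course come from validated software with the censoring rules agreed in advance.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit survival estimates (event=1, censored=0).

    Minimal numpy-only sketch of the non-inferential time-to-death
    summary a DMC might review; returns (time, S(time)) at each death time.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    s, at_risk, out = 1.0, len(times), []
    for t in np.unique(times):
        d = int(np.sum((times == t) & (events == 1)))   # deaths at t
        c = int(np.sum((times == t) & (events == 0)))   # censored at t
        if d > 0:
            s *= 1.0 - d / at_risk
            out.append((float(t), s))
        at_risk -= d + c
    return out

# four subjects: deaths at months 1, 2, and 4; one censored at month 3
print(kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1]))
```

Even this toy example shows why the censoring convention matters: moving the censored subject changes the risk set, and hence every subsequent step of the curve.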

For formal interim analyses of OS that show inferential statistics (e.g., p-values, hazard ratio, confidence intervals), there would be appropriate effort in advance to ensure that the precise censoring algorithm and inferential statistics specified in the study SAP are put in place for the DMC review.

 

Oversight on study integrity and interpretability

The DMC has oversight on study integrity and interpretability in conjunction with the sponsor, but it is important to have the independent thoughts of the DMC. In this domain, the DMC may well voice concern if an excessive rate of subjects is being lost to long-term survival surveillance. That would be particularly important if an imbalance in arms is developing for subjects lost to follow-up for long-term survival (as could happen in open-label studies). The DMC might not explicitly state that observation externally, but an imbalanced rate of lost to follow-up would be particularly concerning to the interpretation of OS at the end of the study.

The DMC may also voice concern if OS is the primary endpoint, or a certain number of deaths is critical for a secondary analysis, but the overall rate of deaths is appreciably lower than expected, which seems likely to extend the duration of the study by years, bringing into question whether the study will be relevant and funded for the additional years needed. The DMC may request current projections (with computed confidence intervals) for when in calendar time the expected number of deaths will occur. This can be complicated if a minimum number of deaths is required in a subgroup, and particularly so if that subgroup is blinded (a blinded biomarker population, or a subgroup of treatments in studies with more than two treatment groups).

The DMC should understand that OS data early in a study are statistically unreliable. There might even be some expectation of excess deaths early on (if late benefit is expected in the presence of early toxicity), or of no OS benefit at all; a new treatment may still have overall value without an OS difference if it is less toxic, more easily administered, or cheaper.

 

Increasing the confidence of subgroup analysis results

The DMC may rightfully be concerned upon seeing an excess of deaths on the active arm. There are options for the DMC, as for any potential safety concern. One of the first steps traditionally undertaken is to investigate which subgroup of patients is most at risk for the excess death — but with the full understanding of the hazards of subgroup analysis. It is easy for an unwary reviewer to overinterpret the results of subgroup analyses. The DMC should consider factors that increase the confidence of subgroup results when looking for consistency of signal or trying to identify a subgroup that could specifically have some risk mitigation plan implemented:

  • Biologic rationale
  • Larger sample size
  • Consistent findings in other trials
  • Included as a stratification factor

 

Ad hoc interim analyses for OS

If OS is the primary endpoint and no statistical futility rule is in place, but the DMC sees worrisome or neutral trends in OS, the DMC might consider an ad hoc analysis of OS for futility. This is usually in the context of excess Grade 3+ AEs or SAEs alongside the OS result. The DMC might check whether the lower limit of a confidence interval for the hazard ratio (suitably adjusted for the interim nature by using the information fraction and perhaps an O’Brien-Fleming boundary) excludes 0.9 or lower. Or perhaps the DMC will request that the supporting CRO compute conditional power. We have seen DMCs recommend termination or major changes based on the totality of the data, largely influenced by neutral or negative OS results. Obviously, it is preferable to simply have pre-planned looks for OS futility.
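When a DMC requests conditional power, one common quick computation uses the Brownian-motion formulation under the current-trend assumption. The sketch below is generic rather than any specific trial's rule, and the inputs (an interim z of 1.0 or 2.5 at 50% information) are invented for illustration.

```python
import math

def conditional_power(z_t, t, z_alpha=1.96):
    """Conditional power under the current-trend assumption.

    z_t is the observed z-statistic at information fraction t; the drift
    is set to its current estimate z_t/sqrt(t). Standard Brownian-motion
    formula, shown as a sketch of the computation a DMC might request.
    """
    b_t = z_t * math.sqrt(t)            # Brownian-motion value at information t
    theta = z_t / math.sqrt(t)          # current-trend drift estimate
    z = (z_alpha - b_t - theta * (1.0 - t)) / math.sqrt(1.0 - t)
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # 1 - Phi(z)

print(f"CP at z=1.0, 50% information: {conditional_power(1.0, 0.5):.2f}")
print(f"CP at z=2.5, 50% information: {conditional_power(2.5, 0.5):.2f}")
```

The contrast between the two cases is the point: a flat interim trend leaves little chance of final success, which is what grounds a futility discussion in numbers rather than impressions.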

The DMC might also see a remarkable OS benefit without any formal interim analysis planned to assess benefit. The DMC could decide that the scientific question has been answered, that it is unethical to continue, and therefore alert the sponsor. We have seen this occur; after intense, focused discussion among the DMC, the sponsor, and regulatory agencies, the decision was made to move forward toward regulatory approval, which accelerated the regulatory process by years. Clearly, it is preferable to simply have pre-planned looks for OS benefit.

A recent situation involved a relatively high toxicity observed, but the intervention was expected to provide a long-term OS benefit. The DMC decided to recommend halting enrollment in the population with the lowest baseline disease severity. These subjects were expected to have the same rate of toxicity as all groups, but with minimal expected absolute OS benefit.

 

Difficult recommendations

DMCs must make difficult recommendations as PFS and OS data emerge, if PFS is a primary or co-primary endpoint. The DMC should ask for and be provided with (perhaps non-inferential) summaries of both PFS and OS. DMC recommendations are challenging if results are discrepant. Most common would be an interim analysis of PFS for benefit that crosses a boundary for benefit, but OS results are extremely immature or neutral or perhaps even in the negative direction. The DMC hopefully has the flexibility to explain the context of the situation to a senior liaison at the sponsor, to discuss whether the study should continue and obtain additional valuable OS data to help answer questions about OS more precisely. This is particularly problematic if there is a discrepancy between investigator-provided PFS and blinded independent review committee (BICR) PFS, or a delay or unavailability of BICR data to the DMC.

Controversies exist when the co-primary endpoint achieves full statistical information before OS does, and in deciding what the DMC’s role is at that point and afterward. In some situations, the DMC’s obligations for oversight of the study conclude once the final PFS analysis is complete. The sponsor (perhaps the full study team, or a subset of the study team, or a separate team) becomes unblinded to final PFS (and likely interim OS), and oversight for safety of ongoing patients during continued OS surveillance is undertaken by the sponsor. The assumption is that unblinding and knowledge of PFS results will not bias the ongoing OS collection. However, if results of PFS are impressive and that knowledge becomes public, that might impact behavior that affects future collection of OS — especially in an open-label study.

Therefore, many DMCs have argued that the DMC should still be involved in oversight of OS even after final analysis of PFS — perhaps in conjunction with making sure that the final PFS results are handled very securely and led by DMC and not sponsor personnel. The value of this approach is that the continued OS is not impacted by patients, sites, or general sponsor team knowledge of the PFS results or interim OS results. The DMC can take the lead on communicating final PFS results and perhaps interim OS results to a small group within the sponsor or perhaps (with agreement from all parties) directly to regulatory agencies.

 

Final takeaways

We believe, and the FDA agrees, that DMC access to overall survival results can be critical to the DMC’s remit. Hopefully, DMC members will insist on receiving the outputs that meet their needs, which could include either non-inferential or inferential analyses of overall survival; CROs will be able to create the outputs needed in an accurate and timely way; and sponsors will trust DMCs to act responsibly with the outputs provided to them.

 

Interested in learning more?

Download our white paper, “DMCs for Oncology Studies”:

Finding the Optimal Biological Dose with New PKBOIN-12 Method

With the rise of targeted and immunotherapies, we have recently seen a shift away from finding a drug’s maximum tolerated dose (MTD) in early-phase dose-finding studies and toward identifying the optimal biological dose (OBD): the dose that optimally balances safety, tolerability, and early efficacy. A new method, PKBOIN-12, extends the BOIN12 framework to integrate pharmacokinetic (PK) parameters to refine dose finding and final OBD selection.

Here, we discuss PKBOIN-12, recent regulatory shifts regarding dose finding, including the FDA’s Project Optimus, and Cytel’s East Horizon™ dose-finding module.

 

What is PKBOIN-12?

PKBOIN-12, developed by Dr. Hao Sun of Bristol Myers Squibb and Tu Jieqi of the University of Illinois Chicago, is an innovative dose-finding method that enhances the established BOIN12 algorithm by incorporating Pharmacokinetic (PK) information into the Optimal Biological Dose (OBD) determination process. In recent years, particularly with the rise of targeted and immunotherapies, the focus in early-phase dose-finding studies has shifted away from finding the Maximum Tolerated Dose (MTD) and toward identifying the OBD, the dose that optimally balances safety, tolerability, and early efficacy.

BOIN12 is one such method that assesses both safety and efficacy, but, like many dose-finding designs, it does not formally use auxiliary data. Researchers routinely collect PK measurements to characterize drug exposure at the various tested dose levels, but these data are not usually incorporated into the benefit-risk analysis when designing clinical trials. PKBOIN-12 addresses this by extending the BOIN12 framework to integrate the collected PK data, refining dose finding and final OBD selection.

Indeed, simulation results comparing PKBOIN-12 and BOIN12 demonstrate that the former identifies the OBD more effectively and allocates a greater proportion of patients to that optimal dose.
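To make the dose-ranking idea concrete, here is a minimal sketch of how a utility-based design in the BOIN12 family scores doses. It simplifies the published methods considerably: efficacy and toxicity are treated as independent with flat Beta priors, the utility weights and patient counts are purely illustrative, and no PK term is included.

```python
# Utility weights for the four (efficacy, toxicity) outcomes. The best and
# worst outcomes are pinned at 100 and 0, as in BOIN12; the two intermediate
# values would be elicited from clinicians (the numbers here are illustrative).
U = {("eff", "no_tox"): 100, ("eff", "tox"): 60,
     ("no_eff", "no_tox"): 40, ("no_eff", "tox"): 0}

def posterior_mean(successes, n, a=1, b=1):
    """Posterior mean of a rate under a Beta(a, b) prior."""
    return (a + successes) / (a + b + n)

def mean_utility(n_eff, n_tox, n, utilities=U):
    """Mean utility of a dose, assuming efficacy and toxicity independent."""
    p_eff = posterior_mean(n_eff, n)
    p_tox = posterior_mean(n_tox, n)
    return (utilities[("eff", "no_tox")] * p_eff * (1 - p_tox)
            + utilities[("eff", "tox")] * p_eff * p_tox
            + utilities[("no_eff", "no_tox")] * (1 - p_eff) * (1 - p_tox)
            + utilities[("no_eff", "tox")] * (1 - p_eff) * p_tox)

# Toy data: (responses, toxicities, patients) at three ascending dose levels.
doses = [(2, 1, 9), (5, 2, 9), (6, 5, 9)]
scores = [mean_utility(*d) for d in doses]
obd = scores.index(max(scores))  # middle dose wins: more efficacy than dose 1,
                                 # less toxicity than dose 3
```

PKBOIN-12 layers PK information on top of a ranking like this; the sketch shows only the safety-efficacy trade-off it shares with BOIN12.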

 

Project Optimus: A regulatory shift toward the OBD

In addition to the general industry trend in collecting and considering a broader set of data in early-phase dose-finding oncology studies, we have seen a real shift in regulatory interest in this area, encapsulated in the FDA’s Project Optimus.

In a previous blog post, James Matcham and Michael Fossler highlight how a recognition of the changing nature of oncology therapies — away from chemotherapies and towards more advanced biologics — necessitated a change in how these products are developed and assessed for efficacy and safety.

Project Optimus posits that the dose-finding paradigm must shift away from safety and tolerability alone and toward incorporating efficacy considerations at this stage. An ideal dose-finding study under the Project Optimus lens emphasizes determining not the MTD but the OBD: a dose, or dose range, selected with efficacy, tolerability, safety, and pharmacokinetics all in view.

PKBOIN-12 is therefore well suited to meet the challenges presented by Project Optimus and sits at the forefront of both industry trends and regulatory expectations.

 

Dose finding with the East Horizon™ platform

Cytel’s software development teams will soon launch the dose-finding module, the sixth installment of the East Horizon platform. This module completes an almost two-year journey of migrating Cytel’s flagship software heritage, East, into a cloud-native, modern, and updated East Horizon platform. Over these months, our teams worked to select, from our wide repertoire of software solutions, the features, methods, and tests most relevant to our user base, and thoughtfully curated additional frequentist and Bayesian methods that are completely new to Cytel software. One such method is the new PKBOIN-12 dose-finding method.

 

Interested in learning more?

On November 18, 2025, Cytel will host Dr. Hao Sun for a webinar to discuss this new method in depth, and to highlight the technical as well as tactical aspects of implementing this method. Register today and join us for a fascinating conversation:

External Control Arms: A Powerful Tool for Oncology and Rare Disease Research

In clinical research, the randomized controlled trial (RCT) has been considered the gold standard. Yet in many areas — especially in oncology and rare diseases — running an RCT with a balanced control arm is not always possible. Patients, physicians, and regulators often face a difficult reality: how do we evaluate promising new therapies when traditional designs aren’t feasible?

This is where external control arms (ECAs) come into play. By carefully drawing on existing data sources and applying rigorous methodology, ECAs can help provide the context and comparative evidence needed to make better decisions.

Here, we will explore why ECAs are particularly valuable in oncology and rare diseases, how they support decision-making and study design, what data sources they can rely on, and which statistical methods are essential to reduce bias. We will also introduce the concept of quantitative bias analysis and conclude with why experienced statisticians are key to the success of this methodology.

 

Why external control arms matter in oncology and rare diseases

Oncology and rare disease research share several challenges that make traditional RCTs difficult:

  • Small patient populations: In rare diseases, the number of eligible patients is often extremely limited. Asking half of them to enroll in a control arm may make recruitment impossible.
  • High unmet need: In oncology, patients and families are eager for new options. Many consider it unacceptable to randomize patients to placebo or outdated standards of care.
  • Ethical constraints: For life-threatening conditions, denying patients access to an experimental therapy can be ethically challenging.
  • Rapidly changing standards of care: In oncology, new treatments are approved frequently. A control arm that was relevant when a trial began may become outdated by the time results are available.

In such contexts, single-arm studies (where all patients receive the experimental therapy) are common. But single-arm results alone are not sufficient. Without a comparator, how do we know if the observed survival or response rate truly reflects an advance? ECAs provide the missing context.

Even when a trial includes a control arm, unbalanced designs — such as smaller control groups or cross-over to experimental treatment — can limit the ability to make clean comparisons. External controls can augment these designs, helping to stabilize estimates and provide reassurance that results are robust.

 

Supporting internal and regulatory decision-making

ECAs serve multiple purposes:

  1. Internal decision-making:
    • Companies developing new therapies must decide whether to advance to the next trial phase, expand into new indications, or pursue partnerships.
    • ECAs help answer questions like: Is the observed benefit large enough compared to historical data? Do safety signals look acceptable in context?
  2. Regulatory decision-making:
    • Regulatory agencies such as FDA and EMA increasingly accept ECAs as part of submissions, especially in rare diseases and oncology.
    • While not a replacement for RCTs, ECAs can strengthen the evidence package and demonstrate comparative effectiveness in situations where randomization is not feasible.
  3. Helping the medical community:
    • Physicians, payers, and patients need to interpret trial results. A median overall survival of 18 months in a single-arm study may sound promising, but how does it compare to similar patients receiving standard of care?
    • ECAs help put numbers into perspective, allowing the community to better understand the true value of a new therapy.

 

Designing better studies with ECAs

External controls are not only a tool for analyzing results — they can also improve study design.

  • Feasibility assessments: By examining real-world data or prior trial results, sponsors can estimate expected event rates, patient characteristics, and recruitment timelines. This reduces the risk of under- or over-powered studies.
  • Endpoint selection: Understanding how endpoints behave in historical or real-world settings helps refine choices for the trial, ensuring relevance to both regulators and clinicians.
  • Eligibility criteria: RWD and earlier trial data can reveal which inclusion/exclusion criteria are overly restrictive. Adjusting them can broaden access while maintaining scientific rigor.
  • Sample size planning: By leveraging ECAs, trialists may reduce the number of patients required for an internal control arm, easing recruitment in small populations.

In other words, ECAs can shape trials from the start, rather than being seen only as a “rescue” option after the fact.

 

Sources of external control data

An ECA is only as good as the data it relies on. Broadly, there are three main sources:

  1. Other clinical trials:
    • Prior trials of standard of care treatments can serve as external comparators.
    • Individual patient-level data (IPD) is preferred, but often only summary data is available.
    • These data are typically high quality but may not perfectly match the new study population.
  2. Published studies:
    • Systematic reviews and meta-analyses of the literature can provide comparator data.
    • Useful when IPD is unavailable but limited by reporting standards and heterogeneity across studies.
  3. Real-world data (RWD):
    • Sources include electronic health records, registries, and insurance claims databases.
    • These capture routine clinical practice, reflecting the diversity of real patients.
    • However, RWD often suffers from missing data, variable quality, and lack of standardized endpoints.

Each source has strengths and weaknesses. Often, the best approach is to triangulate across multiple sources, ensuring that conclusions do not rest on a single dataset.

 

The value of earlier clinical trials

Earlier-phase trials (Phase I and II) can be particularly valuable in constructing ECAs. These studies often include control arms, detailed eligibility criteria, and well-captured endpoints.

For rare diseases and oncology, earlier trials may be the only available benchmark. By carefully aligning populations and endpoints, statisticians can extract maximum value from these datasets.

The challenge, of course, is ensuring comparability. Patient populations may differ in prognostic factors, supportive care practices may evolve, and definitions of endpoints may shift over time.

This is where advanced statistical methods become essential.

 

Reducing bias with propensity scoring

One of the key criticisms of ECAs is the risk of bias. Without randomization, patients receiving the experimental therapy may differ systematically from those in the external control.

Propensity score methods are a powerful way to reduce this bias. The idea is simple:

  • For each patient, estimate the probability (the “propensity”) of receiving the experimental treatment based on baseline characteristics.
  • Match or weight patients in the external control group so that their distribution of covariates mirrors that of the trial patients.

This approach creates a “pseudo-randomized” comparison, balancing measured variables. While it cannot eliminate unmeasured confounding, it greatly improves fairness in comparisons.
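The two steps above can be sketched in a few lines. This toy example (hypothetical counts, a single binary covariate, ATT-style odds weights) estimates the propensity score within covariate strata and reweights the external controls so their covariate distribution matches the trial patients:

```python
from collections import defaultdict

# Toy cohort: (treated_flag, covariate). Treated patients come from the trial,
# untreated patients from the external control; the covariate is a binary
# prognostic factor distributed differently in the two groups.
cohort = [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 20 + [(0, 0)] * 40

# Step 1: estimate the propensity score within each covariate stratum
# (with a single binary covariate, this stratified estimate is exact).
counts = defaultdict(lambda: [0, 0])  # covariate -> [n_treated, n_total]
for t, x in cohort:
    counts[x][0] += t
    counts[x][1] += 1
propensity = {x: n_treated / n for x, (n_treated, n) in counts.items()}

# Step 2: weight each external control by the odds e(x) / (1 - e(x)) so the
# weighted control covariate distribution mirrors the treated patients (ATT).
controls = [x for t, x in cohort if t == 0]
weights = [propensity[x] / (1 - propensity[x]) for x in controls]

treated = [x for t, x in cohort if t == 1]
treated_mean = sum(treated) / len(treated)
weighted_control_mean = (sum(w * x for w, x in zip(weights, controls))
                         / sum(weights))
# After weighting, the control covariate mean equals the treated mean.
```

In practice the propensity model is a logistic regression over many covariates, and balance is checked with standardized mean differences rather than holding by construction as it does in this one-covariate toy.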

 

Quantitative bias analysis: Addressing the unmeasured

Even with careful propensity scoring, unmeasured confounding remains a concern. Clinical researchers often ask: What if there are factors we didn’t account for?

This is where quantitative bias analysis (QBA) enters. QBA does not eliminate bias but helps us understand its potential impact.

For example:

  • Analysts can model how strong an unmeasured confounder would need to be to explain away the observed treatment effect.
  • Sensitivity analyses can simulate scenarios with different assumptions about unmeasured variables.

By explicitly quantifying uncertainty, QBA provides transparency. Regulators and clinicians gain confidence that conclusions are robust — or at least, that limitations are clearly understood.
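One widely used QBA summary of this kind is the E-value (VanderWeele and Ding): the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need to have with both treatment and outcome to fully explain away an observed risk ratio. A minimal sketch:

```python
from math import sqrt

def e_value(rr):
    """E-value for an observed risk ratio: the minimum risk-ratio strength of
    association an unmeasured confounder would need with both treatment and
    outcome to fully explain away the observed effect."""
    rr = max(rr, 1 / rr)  # treat protective effects (rr < 1) symmetrically
    return rr + sqrt(rr * (rr - 1))

# An observed RR of 2.0 could only be explained away by a confounder
# associated with both treatment and outcome by a risk ratio of ~3.41.
print(round(e_value(2.0), 2))  # prints 3.41
```

A large E-value tells regulators that only an implausibly strong unmeasured confounder could account for the whole effect; a small one flags fragility.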

 

The need for experienced statisticians

Constructing an ECA is not a “plug-and-play” exercise. It requires expertise across multiple domains:

  • Data curation: Selecting fit-for-purpose datasets, cleaning and harmonizing variables, and aligning endpoints.
  • Study design: Defining eligibility, follow-up time, and analysis plans that minimize bias.
  • Statistical methodology: Applying techniques like propensity scoring, inverse probability weighting, Bayesian borrowing, and QBA.
  • Regulatory communication: Explaining assumptions, limitations, and sensitivity analyses in language that regulators and clinicians can understand.

In short, ECAs demand both technical skill and strategic judgment. Partnering with experienced statisticians ensures that external controls provide credible, decision-grade evidence rather than misleading comparisons.
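As one concrete example of the Bayesian borrowing mentioned above, a fixed-weight power prior for a binary response endpoint can be sketched in a few lines (all counts and the discount weight a0 are illustrative):

```python
def power_prior_posterior(y, n, y0, n0, a0, a=1.0, b=1.0):
    """Beta posterior parameters for a response rate under a Beta(a, b) prior,
    borrowing external data (y0 responses out of n0) at discount weight
    a0 in [0, 1]: each external patient counts as a0 of an internal one."""
    return a + y + a0 * y0, b + (n - y) + a0 * (n0 - y0)

def posterior_mean(alpha, beta):
    return alpha / (alpha + beta)

# Internal control arm: 6/20 responders; external control: 30/100 responders.
no_borrow = posterior_mean(*power_prior_posterior(6, 20, 30, 100, a0=0.0))
half_borrow = posterior_mean(*power_prior_posterior(6, 20, 30, 100, a0=0.5))
# Borrowing pulls the estimate toward the external 30% rate and, by adding
# a0 * n0 effective patients, tightens the posterior.
```

Choosing a0 is the crux: a0 = 0 ignores the external data, a0 = 1 pools it fully, and adaptive variants tie the weight to the observed conflict between the two data sources.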

 

Final takeaways

External control arms are rapidly becoming an indispensable tool in modern clinical research — especially in oncology and rare diseases, where traditional RCTs often fall short.

They offer:

  • Context for single-arm studies and unbalanced designs.
  • Support for both internal and regulatory decisions.
  • Guidance in study design and feasibility planning.

By leveraging diverse data sources — from earlier trials to real-world evidence — and applying rigorous methods such as propensity scoring and quantitative bias analysis, ECAs can bring clarity and credibility to difficult development programs.

But the value of ECAs depends on how well they are planned and implemented. Done poorly, they risk misleading decisions. Done well, they empower researchers, regulators, and clinicians to make better choices for patients.

As the field evolves, one thing is clear: the expertise of skilled statisticians is the cornerstone of successful ECAs.

 

Interested in learning more?

Join Alexander Schacht, Steven Ting, and Vahe Asvatourian for their upcoming webinar, “Beyond the Standard Clinical Trial in Early Development: When and Why to Consider External Controls” on Thursday, October 16 at 10 a.m. ET:

Master Protocols in Oncology Trials

A master protocol is defined as a protocol designed with multiple sub-studies, which may have different objectives and involve coordinated efforts to evaluate one or more investigational drugs in one or more disease subtypes within the overall trial structure. Master protocol trials include three trial designs: basket trials, umbrella trials, and platform trials.

FDA guidance released in March 2022 provides recommendations for master protocol trials.

In this blog, we discuss master protocol trial designs, challenges and best practices, and the benefit of these innovative designs in oncology trials.

 

Types of master protocol trials

Basket trials

Basket trials are designed to test a single investigational drug or drug combination in different populations defined by different cancers, disease stages for a specific cancer, histologies, number of prior therapies, genetic or other biomarkers, or demographic characteristics.

 

Umbrella trials

Umbrella trials are designed to evaluate multiple investigational drugs administered as single drugs or as drug combinations in a single disease population.

 

Platform trials

Platform trials are master protocols in which arm(s) can be dropped or added based on knowledge gained from previously evaluated parts of the trial.

 

Figure 1: Basket Trials, Umbrella Trials, and Platform Trials

Image credit: Park, J. J. H., Siden, E., Zoratti, M. J., Dron, L., Harari, O., Singer, J., Lester, R. T., Thorlund, K., & Mills, E. J. (2019). Systematic review of basket trials, umbrella trials, and platform trials: A landscape analysis of master protocols. Trials, 20.

 

Key challenges with master protocol trials

Master protocol trials are inherently complex due to their expansive scope and varied components. Let’s refine these challenges further:

 

Data management and analysis

  • Large amounts of data need efficient integration and processing.
  • Basket trials involve multiple indications, and endpoint definitions and/or response criteria may vary across the indications.
  • Umbrella trials have multiple drugs, leading to complex exposure and safety summaries.
  • Platform trials continuously add new treatment arms, generating a dynamic dataset that requires real-time integration and analysis. This necessitates robust data management systems capable of handling evolving data structures and ensuring consistency across various cohorts.

 

Safety profile considerations

  • Variability in drug effects requires tailored safety monitoring strategies.
  • Adverse events of special interest might need to be defined for each drug separately.

 

Biomarker data complexity

  • Data can be relatively large and complex.
  • Having the data transfer specifications at an early stage is important to ensure that the correct data will be received and in the expected format.
  • Intensive discussion might be needed with biomarker data specialists to define the rules for deriving biomarker/genomic profile of interest.
  • Mapping those data from raw data to SDTM can also be challenging.

 

Statistical Analysis Plan (SAP) and shell development

  • Potential additional complexity for statistical inference (e.g., adaptive features, multiplicity, and Bayesian methods).
  • Requires the team to focus on the main objectives of the study; otherwise, the SAP and shells can become very extensive.
  • The number of tables, figures, and listings can grow significantly, making prioritization essential.
  • Layout complexities arise when numerous columns must be displayed across multiple cohorts.

 

Operational and reporting challenges

  • Each cohort may follow different timelines, complicating interim and final analyses.
  • Frequent reporting requires good planning.
  • CSR(s) strategy (e.g., separate CSR for each cohort versus single CSR) should be defined sufficiently early.

Staying focused on the key study objectives is crucial to prevent data overload and inefficiencies in reporting. Exploratory analyses can be planned in a second step.

 

Comparative Overview: Basket vs. Umbrella vs. Platform Trials


 

Final takeaways

Master protocol trials represent a transformative shift in clinical research — enabling the simultaneous evaluation of multiple therapies or disease subtypes under a unified framework. While designs like basket, umbrella, and platform trials offer flexibility and efficiency, they also introduce significant operational, statistical, and data management complexities.

Success is built on early planning, early discussion with safety and biomarker teams, and a focus on core study objectives to ensure meaningful insights and readiness.

Implementing RECIST/iRECIST in Oncology Clinical Trials

The majority of clinical trials evaluating cancer treatments for objective response in solid tumors are using RECIST, or Response Evaluation Criteria in Solid Tumors. RECIST is crucial for evaluating the effectiveness of cancer therapies, but it’s not without its challenges.

In this blog, we detail RECIST, how it’s used in statistical analysis, the development of iRECIST for immunotherapy trials, statistical and clinical challenges with RECIST/iRECIST, and best practices for implementing RECIST/iRECIST in oncology trials.

 

What is RECIST 1.1 and why is it important in oncology?

RECIST (Response Evaluation Criteria in Solid Tumors) 1.1 is a standardized set of rules used to measure tumor response to treatment using imaging. It helps determine whether a tumor is shrinking, stable, or growing, which is crucial for evaluating the effectiveness of cancer therapies. As of today, the majority of clinical trials evaluating cancer treatments for objective response in solid tumors are using RECIST.

 

What are the key response assessments in RECIST 1.1?

The overall response for a given timepoint is the combination of target lesion response relying on unidimensional measurements, non-target lesion response, and presence/absence of new lesions.

  • Complete Response (CR): Disappearance of all target lesions and non-target lesions and no new lesions. Any pathological lymph nodes must have a reduction in short axis to <10mm.
  • Partial Response (PR): At least a 30% decrease in the sum of diameters of target lesions, taking as reference the baseline sum diameters, and no progression of non-target lesions and no new lesions.
  • Stable Disease (SD): Neither sufficient shrinkage to qualify for PR nor sufficient increase to qualify for PD, taking as reference the smallest sum diameters while on study, and no new lesions.
  • Progressive Disease (PD): At least a 20% increase in the sum of diameters and an absolute increase of ≥ 5mm, taking as reference the smallest sum of diameters on-study, or progression of non-target lesions or appearance of new lesions.
  • Not Evaluable (NE)
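The target-lesion portion of these rules is mechanical enough to express directly. The sketch below covers only the target-lesion sums of diameters; a full RECIST 1.1 timepoint response also combines non-target lesion status, new lesions, and the lymph-node short-axis rule for CR:

```python
def target_lesion_response(baseline_sum, nadir_sum, current_sum):
    """Classify the target-lesion response from sums of diameters (mm).

    Simplified sketch of the target-lesion rules only; it omits the
    non-target, new-lesion, and pathological lymph-node components."""
    if current_sum == 0:
        return "CR"  # disappearance of all target lesions
    # PD: >= 20% increase over the nadir AND an absolute increase of >= 5 mm.
    if (current_sum - nadir_sum >= 0.2 * nadir_sum
            and current_sum - nadir_sum >= 5):
        return "PD"
    # PR: >= 30% decrease from the baseline sum of diameters.
    if baseline_sum - current_sum >= 0.3 * baseline_sum:
        return "PR"
    return "SD"

# 100 mm at baseline shrinking to 65 mm is a 35% decrease -> PR.
print(target_lesion_response(100, 65, 65))  # PR
# A 40 mm nadir growing to 50 mm is +25% and +10 mm -> PD.
print(target_lesion_response(100, 40, 50))  # PD
```

Note that PD is referenced to the nadir (smallest on-study sum) while PR is referenced to baseline, which is exactly why small measurement errors around those thresholds can flip a classification.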

 

How is RECIST used in statistical analysis?

RECIST is used to derive key endpoints like:

  • Objective Response Rate (ORR)
  • Disease Control Rate (DCR)
  • Progression-Free Survival (PFS)
  • Duration of Response (DOR)
  • Time to Response (TTR)

 

RECIST 1.1 criteria state that confirmation of response (CR or PR) is required for non-randomized trials with response as the primary endpoint, to ensure that the responses identified are not the result of measurement error. In all other circumstances, i.e., in randomized trials (Phase II or III) or studies where stable disease or progression is the primary endpoint, confirmation of response is not required, since it will not add value to the interpretation of trial results.

The FDA generally expects a confirmed response for ORR in single-arm trials where it is the primary endpoint, especially for accelerated approval. The FDA/EMA may also request confirmation if ORR is a primary or key secondary endpoint or if imaging intervals are long. Nevertheless, this point should be discussed with Health Authorities, as the additional confirmatory scan is usually requested 4 weeks later and the protocol might not plan for it; such analyses cannot be conducted ad hoc if the confirmatory assessment is not planned in the protocol from the start.
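As a simplified sketch of that confirmation requirement (the data structure is hypothetical, and real RECIST confirmation also specifies how intervening SD or NE assessments are handled), a CR or PR can be checked for a repeat response at least 4 weeks later:

```python
def has_confirmed_response(assessments, min_gap_days=28):
    """Return True if a CR/PR is followed by another CR/PR at least
    min_gap_days (~4 weeks) later.

    assessments: chronological list of (day, overall_response) tuples.
    Simplified: any later repeat response counts, whereas the full rules
    also constrain what may happen between the two assessments."""
    response_days = [day for day, resp in assessments if resp in ("CR", "PR")]
    return any(later - earlier >= min_gap_days
               for i, earlier in enumerate(response_days)
               for later in response_days[i + 1:])

# PR at day 56 confirmed by PR at day 84 (28 days later).
print(has_confirmed_response([(56, "PR"), (84, "PR")]))  # True
# A single unconfirmed PR does not count.
print(has_confirmed_response([(56, "PR"), (84, "SD")]))  # False
```

The second example is exactly the scenario the guidance warns about: without a planned confirmatory scan in the protocol, that day-56 response can never be upgraded to a confirmed response afterward.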

 

What are the statistical and clinical challenges with RECIST 1.1?

Inter-reader variability

Despite the use of standardized RECIST 1.1 criteria for response, different radiologists may interpret imaging results differently, especially when measuring borderline lesions. This can introduce measurement bias, affect response classification (e.g., PR vs. SD), and therefore impact trial outcomes. As an example, the average discrepancy rate at the patient level was found to be 59.2% in lung cancer trials using RECIST 1.1.¹

Lesion selection and measurement errors

  • RECIST 1.1 limits the number of target lesions (up to 5 total, max 2 per organ).
  • The same target and non-target lesions must be followed across all timepoints; otherwise, the patient-level response will not be valid.
  • Small errors in measuring lesion diameters can significantly impact response categorization.

Non-measurable disease

When the patient has only non-measurable disease, the increase must be substantial to lead to an overall response PD, which is relatively subjective.

Handling non-target and new lesions

Non-target lesions are assessed qualitatively, which introduces subjectivity.

The appearance of new lesions automatically triggers PD, even if the overall tumor burden is decreasing. Therefore, the finding of a new lesion should be unequivocal, i.e., not attributable to differences in scanning technique, change in imaging modality, or findings thought to represent something other than a tumor.

RECIST criteria are based on anatomical size, not functional or viable tumor volume

RECIST focuses on unidimensional measurements, regardless of internal characteristics like necrosis or cavitation (common in lung or liver metastases).

Other criteria (e.g., Choi criteria for GISTs) may be more appropriate when necrosis is a key feature of response.

In some tumor types or trials, modified criteria (e.g., mRECIST for hepatocellular carcinoma) are used, which do consider viable tumor (e.g., arterial enhancement) rather than total size.

RECIST does not capture atypical responses

Especially in immunotherapy, tumors respond differently compared with chemotherapy, raising questions about the assessment of changes in tumor burden. In particular, for immunotherapy, RECIST 1.1 may misclassify pseudoprogression as PD.

This has led to the development of iRECIST, but many trials still rely on RECIST 1.1.

Time-to-event endpoint challenges

PFS and DOR depend on accurate and timely assessments.

Delays in imaging or inconsistent scan intervals can lead to informative censoring or biased survival estimates.

Missing or incomplete data

Patients may miss scans or drop out, leading to missing data that complicates statistical modeling. Interval censoring can be used as a sensitivity analysis in that case.

Imputation is difficult due to the non-linear and categorical nature of RECIST outcomes.

Impact on interpretation

Low concordance between Independent Central Review and the Investigator would question the reliability of results.

 

Why was iRECIST developed and how does it differ from RECIST 1.1?

Traditional RECIST criteria may misclassify immune-related responses as progression. iRECIST was developed to:

  • Reflect atypical response patterns in immunotherapy
  • Allow continued treatment beyond initial progression
  • Improve consistency in trial design and data interpretation

 

iRECIST is an adaptation of RECIST 1.1 designed for immunotherapy trials. It accounts for pseudoprogression, where tumors may initially appear to grow before shrinking due to immune cell infiltration. iRECIST introduces:

  • Unconfirmed Progressive Disease (iUPD)
  • Confirmed Progressive Disease (iCPD)

This two-step confirmation helps avoid prematurely stopping effective immunotherapy.

 

What are the statistical challenges with iRECIST?

Delayed treatment effects

Immunotherapies may show delayed clinical benefits, which violate the proportional hazards assumption used in standard survival analysis (e.g., Cox models). This can complicate sample size estimation, primary analysis, and, in particular, hazard ratio interpretation.

Pseudoprogression and confirmation requirements

iRECIST introduces iUPD and requires a follow-up scan to confirm progression as iCPD, which delays the determination of progression and requires more complex modelling of iPFS. This also introduces interval censoring and time-dependent bias.

The exact time of progression is not precisely known; it lies between the iUPD and iCPD scans. Uncertainty around the exact date of progression, already present with RECIST, is larger with iRECIST, given that a second scan is needed to confirm the PD. Methods for interval-censored data may therefore be more appropriate than standard Kaplan-Meier and Cox analyses.
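Concretely, the interval-censored representation described above amounts to mapping each patient to a (left, right) pair (the field names here are hypothetical): confirmed progressors contribute the interval between their iUPD and iCPD scans, and patients without a confirmed progression are right-censored at their last assessment.

```python
from math import inf

def progression_interval(patient):
    """Return the (left, right) interval containing the progression time.

    Confirmed progression: the event occurred between the iUPD and iCPD
    scan days. No confirmation: right-censored at the last scan."""
    if patient.get("icpd_day") is not None:
        return (patient["iupd_day"], patient["icpd_day"])
    return (patient["last_scan_day"], inf)

patients = [
    {"iupd_day": 112, "icpd_day": 140, "last_scan_day": 140},  # confirmed PD
    {"iupd_day": None, "icpd_day": None, "last_scan_day": 168},  # censored
]
intervals = [progression_interval(p) for p in patients]
```

These pairs can then be fed to an interval-censored estimator (e.g., a Turnbull-type nonparametric MLE) instead of assigning a single event date for Kaplan-Meier.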

Patients who survive long enough and/or remain in the study to receive a confirmation scan are not randomly selected; they may be a little healthier. This may introduce selection bias and time-dependent confounding.

Endpoint ambiguity

Common endpoints like PFS and ORR are harder to define, which can lead to inconsistent endpoint definitions across trials:

  • Should PFS be based on iUPD or iCPD?
  • How should iDOR be calculated?
  • What if patients drop out before confirmation?
  • The SAP should clearly define these derivations

Data interpretation and trial comparability

Trials using iRECIST are not directly comparable to those using RECIST 1.1.

Meta-analyses and pooled analyses become more difficult.

The protocol/SAP may plan for both RECIST and iRECIST analyses, increasing complexity.

Increased risk of missing data

Patients may discontinue before confirmation scans for progression.

Imaging schedules may not align with iRECIST requirements: iRECIST requires a follow-up scan (typically within 4–8 weeks) after an initial iUPD to determine whether the progression is real or pseudoprogression. However, in many clinical trials or treatment protocols, imaging is scheduled every 8–12 weeks, which may not fit the expected confirmation window, increasing the risk of missing data.

This leads to informative censoring and missing not at random (MNAR) data, which are hard to handle statistically.

Limited validation and standardization

iRECIST is still considered exploratory, especially for phase III trials (as per the guidelines).

There is no consensus on how to incorporate iRECIST endpoints into pivotal trials.

Validation requires large-scale data sharing, which is still limited.

 

Best practices for implementing RECIST/iRECIST in trials

  • Follow published guidelines.
  • Ensure the CRF appropriately collects the data (e.g., date of new lesions). Examples are available on the RECIST website.
  • Ensure standardized imaging schedules and methods.
  • Train radiologists and clinicians on RECIST/iRECIST criteria.
  • Consider blinded independent central review to reduce variability, when relevant.
  • Plan for additional scans to confirm progression with iRECIST.
  • Ensure the response criteria used are clearly identified in the SAP, outputs, CSR, and manuscripts.

 

Where can I learn more or access the guidelines?

Full RECIST guideline

Full iRECIST guideline

RECIST Questions and Clarifications

iRECIST

 

Final takeaways

RECIST 1.1 is the standard tool for evaluating tumor response in oncology trials, offering a consistent framework based on anatomical measurements. While it has brought uniformity to clinical research, it comes with some limitations — such as subjectivity in lesion selection and inability to capture atypical responses — especially with immunotherapies. To address these challenges, iRECIST was introduced as an adaptation that accounts for immune-related phenomena like pseudoprogression. However, it also brings statistical complexity and remains exploratory and is not yet fully reliable, with limited validation for pivotal trials.

This is precisely where Cytel can bring value to sponsors. By combining deep statistical expertise with operational insight, Cytel helps design and implement robust RECIST and iRECIST strategies — from endpoint definition to handling complex censoring and missing data. Cytel supports sponsors in navigating regulatory expectations, ensuring that trial results are both scientifically sound and submission-ready.

The Estimand Framework in Oncology Trials

Oncology clinical trials are complex due to the nature of cancer progression, long follow-up times, start of further therapies, and ethical considerations. The estimand framework introduced in ICH E9(R1) provides a structured approach to align the clinical question with endpoints, intercurrent events, and analysis strategies.

 

Understanding the estimand framework in oncology

The estimand framework helps define what exactly a trial aims to measure, especially in the presence of intercurrent events (ICEs) that occur after treatment initiation and affect either the interpretation or existence of the outcome (like treatment discontinuation or new therapies).

Estimands need to be clearly defined in both the protocol and the Statistical Analysis Plan (SAP) using the five attributes outlined in the ICH E9(R1) addendum: population, variable (endpoint), treatment, intercurrent events and handling strategies, and population-level summary.

ICEs can complicate the estimation of treatment effects in oncology trials. Among these, the start of further anticancer therapy is particularly complex, especially when evaluating endpoints like Progression-Free Survival (PFS) and Overall Survival (OS).

Among all ICE handling strategies, two strategies are often used to handle the start of further anticancer therapy:

 

Hypothetical strategy

Estimate treatment effect in a world where further anticancer therapy would not exist.

  • Implementation: Typically involves censoring patients at the time they start further anticancer therapy or using advanced statistical methods.
  • Could be more meaningful from the patient’s and prescriber’s perspectives if subsequent therapies are not yet approved drugs and thus do not reflect clinical practice.
  • May require additional data on baseline and/or time-dependent covariates to support modeling.

 

Treatment policy

Estimate treatment effect regardless of any further anticancer therapy, aiming to reflect real-world clinical practice.

  • Implementation: Includes all events regardless of further anticancer therapy.
  • Often considered most relevant by regulatory authorities and other stakeholders if subsequent therapies are already approved and reflect clinical practice.
  • Tends to dilute the treatment effect.
  • Assessments must continue beyond the start of subsequent therapy.

 

Regulatory landscape

Historically, the FDA’s 2007 guidance leaned toward censoring at the start of new anticancer therapy — aligning with the hypothetical strategy. However, more recent guidance (2015, 2018) acknowledges both strategies, and the EMA’s 2012 guidance implicitly supports the treatment policy approach by recommending that progression should be considered even when observed after new anticancer treatment.

In Acute Myeloid Leukemia (AML), the FDA’s 2022 guidance is particularly clear: subsequent treatments like HSCT or anti-AML drugs should be considered part of the overall treatment regimen and not censored in the primary analysis.

 

When the hypothetical strategy may be preferable

In trials where conditions diverge significantly from routine clinical practice — such as early crossover or use of unapproved therapies — a hypothetical strategy may better capture the true clinical question.

Advanced methods like Rank Preserving Structural Failure Time (RPSFT) and Inverse Probability Censoring Weighting (IPCW) can help estimate what would have happened without treatment switching — but they come with assumptions and complexity.

 

Handling missing data in oncology

Effectively addressing missing data is essential for ensuring the reliability and integrity of statistical analyses in oncology trials. With regulatory agencies embracing the estimand framework, it’s essential to distinguish between ICEs and missing values, and to navigate their implications for primary and sensitivity analyses.

There is no consensus yet regarding how to handle missing tumor assessments in the primary analysis of PFS. Here’s a snapshot of key regulatory viewpoints:

According to the FDA’s 2018 guidance, “We recommend assigning the progression date to the earliest time when any progression is observed without prior missing assessments and censoring at the date when the last radiological assessment determined a lack of progression.”

The 2015 FDA NSCLC guidance offers case-based examples where progression events after two or more missed assessments are either censored or considered as events depending on the context, illustrating a cautious approach to ensure data robustness.

The 2012 EMA oncology guidelines advise against censoring for missed assessments: “The time of the progression or recurrence event is determined using the first date when there is documented evidence that the criteria have been met, even in situations where progression is observed after one or more missed visits, treatment discontinuation, or new anti-cancer treatment.”

These different censoring rules can substantially affect PFS estimates, especially when early dropout rates are imbalanced between treatment arms.

Depending on the approach retained for the primary analysis, sensitivity analyses should be considered to assess the impact of missing tumor assessments. These may include a different set of censoring rules from the FDA guidance, as well as interval-censoring methods.
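As an illustration of how such rules diverge, the sketch below derives a PFS time and event indicator from a sequence of scheduled visits. The visit encoding is hypothetical and the rule is only in the spirit of the FDA approach (censor when progression follows a run of missed assessments); making the missed-visit limit effectively infinite recovers an EMA-style rule (event at the documented progression date regardless of missed visits).

```python
def pfs_with_missed_visits(visits, max_consecutive_missed=2):
    """
    visits: chronological list of (day, status), status in {"no_pd", "missed", "pd"}.
    Returns (time, event). Progression observed after >= max_consecutive_missed
    missed assessments is censored at the last progression-free scan (FDA-style);
    a very large limit counts every documented progression (EMA-style).
    """
    last_no_pd_day = 0
    consecutive_missed = 0
    for day, status in visits:
        if status == "missed":
            consecutive_missed += 1
        elif status == "no_pd":
            last_no_pd_day, consecutive_missed = day, 0
        elif status == "pd":
            if consecutive_missed >= max_consecutive_missed:
                return last_no_pd_day, False   # censor at last PD-free scan
            return day, True                   # progression event
    return last_no_pd_day, False               # no progression observed

# Progression documented after two missed assessments:
visits = [(56, "no_pd"), (112, "missed"), (168, "missed"), (224, "pd")]
print(pfs_with_missed_visits(visits))                               # (56, False)
print(pfs_with_missed_visits(visits, max_consecutive_missed=10**9)) # (224, True)
```

The same visit history yields a censored time of 56 days under one rule and an event at 224 days under the other, which is why the choice of rule must be pre-specified and probed in sensitivity analyses.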

 

Sensitivity and supplementary analyses

Understanding how different analyses relate to the primary estimand is critical for drawing robust and credible conclusions from clinical trial data. Two important analysis categories — sensitivity and supplementary analyses — serve distinct purposes and must be thoughtfully pre-specified in the SAP.

 

Sensitivity analyses: Testing the estimand’s foundations

According to ICH E9(R1), a sensitivity analysis is “a series of analyses conducted with the intent to explore the robustness of inferences from the main estimator to deviations from its underlying modeling assumptions and limitations in the data.”

Purpose: To verify that conclusions drawn from the primary analysis remain valid under alternative assumptions or data limitations. These analyses also probe key risks to inference, such as missing data or model specification.

Examples:

  • Using an unstratified Cox model instead of a stratified one.
  • Comparing investigator-assessed PFS with blinded independent central review (BICR)-assessed PFS.
  • Applying alternative censoring rules (e.g., censoring after ≥2 missed tumor assessments), or interval-censored models for PFS.
  • Using Restricted Mean Survival Time (RMST) to explore robustness under non-proportional hazards.
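As a minimal illustration of the RMST example above, the sketch below computes a Kaplan-Meier curve in pure Python and takes the RMST as the area under that curve up to a horizon tau. The data are toy values; a real analysis would use a validated survival package rather than this hand-rolled estimator.

```python
def kaplan_meier(times, events):
    """Return the Kaplan-Meier step function as a list of (time, S(t)) pairs."""
    data = sorted(zip(times, events))
    n = len(data)
    s, at_risk, curve, i = 1.0, n, [], 0
    while i < n:
        t = data[i][0]
        d = sum(1 for tt, e in data[i:] if tt == t and e)   # events at t
        c = sum(1 for tt, e in data[i:] if tt == t)         # all records at t
        if d > 0:
            s *= 1 - d / at_risk
            curve.append((t, s))
        at_risk -= c
        i += c
    return curve

def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM curve up to tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Toy arm: events at months 2 and 4, one censoring at month 6.
print(rmst([2, 4, 6], [1, 1, 0], tau=6))   # 4.0
```

Because RMST compares areas rather than hazards, the difference in RMST between arms remains interpretable even when the proportional hazards assumption fails.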

 

Supplementary analyses: Exploring beyond the estimand

While less explicitly defined, ICH E9(R1) describes supplementary analyses as: “Other analyses that are conducted in order to more fully investigate and understand the trial data.”

Purpose: To explore different strategies or assumptions that may be clinically or scientifically relevant.

Example: Using a different intercurrent event (ICE) strategy than the primary estimand.

 

Final takeaways

There’s no one-size-fits-all approach. Regulatory expectations continue to evolve, and sponsor decisions should balance regulatory guidelines, clinical practice norms, relevance to prescribers and patients, and feasibility of continued assessments.

Engaging in early discussions to align estimands with trial objectives and regulatory requirements is critical to ensuring efficient drug development and timely delivery to patients.

Blinded Independent Central Review in Oncology Trials: Key Challenges

Blinded independent central review (BICR) is a process used in clinical trials in which a group of independent experts reviews trial data, such as radiographic images, and provides assessments without access to information on patients’ treatment assignments. BICR of radiographic images is frequently conducted in oncology trials to address the potential bias of local evaluation by investigators (INV) of endpoints such as progression-free survival (PFS) and objective response rate (ORR).

 

What is the aim of BICR?

The BICR process serves several purposes. These include:

  • Reducing bias: An investigator’s assessment can be influenced by prior knowledge of the treatment assignment (in an open-label study) or by observed patient toxicities. Blinded review enhances objectivity.
  • Reducing measurement variability across sites and readers: Tasking a small number of central reviewers with expertise in a specific area with reviewing imaging may lead to more accurate and reliable assessments compared to local site reads. This is particularly important in multi-center trials.
  • Ensuring standardization: Centralized review ensures the standardized application of response criteria (e.g., RECIST 1.1, iRECIST).
  • Improving data quality: Centralized monitoring allows for regular quality control, helping to identify issues such as inconsistent imaging techniques or poor-quality scans.
  • Enhancing regulatory confidence: Regulatory agencies like the FDA and EMA often prefer or require BICR for pivotal oncology trials, in particular for open-label studies or those with higher bias risk. This strengthens the credibility of primary endpoints derived from tumor assessments.

 

What are the limitations of BICR?

While the BICR process can help achieve the above aims, there are limitations. For example:

  • Operational complexity, time, and cost: BICR requires extensive coordination between multiple stakeholders (sponsors, imaging CROs, radiologists, adjudicators), as well as resources and logistics (e.g., reader training and data blinding, transfer, storage, and tracking).
  • Informative censoring: BICR may introduce bias due to informative censoring, which results from having to censor unconfirmed, locally determined progressions. Once a patient has progressed according to the local assessment, he or she may discontinue the study, and further imaging is unlikely to occur. As a result, determining the BICR progression time may be impossible. This censoring is informative: patients who progress according to local review (but not according to central review) are more likely to progress by the next scheduled scan than patients not determined to have progressed by local review. One mitigation is to require at least one additional scan beyond investigator-assessed progression; even if this is required in the protocol, it may be difficult to implement.
  • Discordance with local review: Regulatory agencies often expect BICR in pivotal oncology trials, in particular for open-label studies. However, inconsistencies between local and central results can complicate data interpretation and submission.

 

How is BICR data used?

As the primary endpoint (e.g., PFS-BICR): All patients need to be reviewed by BICR, either in real time or in regular batches (frequency to be agreed upon).

For sensitivity analyses: All patients need to be reviewed by BICR, either in real time, in regular batches (to be agreed upon), or as a retrospective BICR. Retrospective BICR can be implemented only if the trial is positive based on local assessments (INV). Ideally, images should be collected and archived even after progression, to allow for retrospective BICR if needed.

As an audit tool: In this context, only a random subset of patients is reviewed, and concordance between BICR and INV is assessed on this subset.

 

Practical aspects in BICR implementation

Imaging Charter

The Imaging Charter needs to be written in accordance with the protocol to ensure consistent methodology between investigators and independent review assessments.

Readers involved need to be specified in the charter; in general, two primary readers and one adjudicator are involved.

Adjudication paradigms should be detailed, in particular:

  • Are the primary readers compared at the patient level or visit level?
  • Which criteria are used to consider whether assessments from primary readers are different? For example:
    • At the patient level: the date of progression and/or best response are different
    • At the visit level: the sum of diameters on target lesions > xx mm

The choice of criteria impacts the number of cases that go to adjudication. Indeed, studies have shown that discrepancies between two readers can be substantial. For example, in lung cancer trials using RECIST 1.1, the average discrepancy rate at the patient level was around 59.2%,1 with adjudications often required to resolve differences. These discrepancies may stem from medically justifiable differences in interpretation or from errors, both of which can affect trial outcomes. Training and monitoring by a Central Imaging vendor can mitigate a large portion of the commonly encountered reading errors and therefore reduce variability.

  • If adjudication occurs, which reader is the accepted one? (e.g., one of the primary readers (forced adjudication) and/or a new reader by adjudication (open adjudication, less common)).
  • If adjudication does not occur, which reader is the accepted one? (e.g., reader 1 as default).
  • The timing of readers and adjudication should be defined in the Charter (e.g., real-time review or review only once the patient discontinues the treatment, adjudication only once per patient or once per study, etc.).
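A visit-level adjudication trigger of the kind described above can be sketched as follows. The 5 mm threshold and the function names are purely illustrative; the actual criteria, threshold, and adjudication paradigm must come from the Imaging Charter.

```python
def needs_adjudication(reader1_sld, reader2_sld, threshold_mm=5.0):
    """
    Visit-level rule: flag the visit for adjudication when the two primary
    readers' sums of diameters (SLD, in mm) on target lesions differ by more
    than a pre-specified threshold (5 mm here is an illustrative value).
    """
    return abs(reader1_sld - reader2_sld) > threshold_mm

def adjudication_rate(paired_visits, threshold_mm=5.0):
    """Fraction of visits sent to adjudication under the chosen threshold."""
    flagged = sum(needs_adjudication(a, b, threshold_mm) for a, b in paired_visits)
    return flagged / len(paired_visits)

# Four visits with the two readers' SLD measurements (mm):
visits = [(42.0, 44.0), (30.0, 41.0), (55.0, 55.0), (60.0, 48.0)]
print(adjudication_rate(visits))                      # 0.5
print(adjudication_rate(visits, threshold_mm=15.0))   # 0.0
```

As the example shows, the threshold directly drives the adjudication workload, which is why simulating candidate criteria before finalizing the Charter can be worthwhile.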

 

Data transfers

The independent review must remain independent; imaging results should not be shared from the site to the BICR or vice versa.

The timing and frequency of transfers must be defined:

  • If only adjudication data are included in the transfers, there may be a greater backlog in the tracking of BICR events.

Data reconciliation: review of general consistency between the INV and BICR results:

  • Check whether patient populations, sets of scans, visits, dates, and methods of assessment (if needed) are consistent between the BICR and INV datasets.
  • Identify visits with an investigator tumor assessment but a missing BICR assessment.

Concordance between the INV and BICR results can be assessed during data review and at time of analysis.

  • For PFS: concordance of the occurrence and timing of disease progression.
  • For ORR: concordance of the occurrence of complete response/partial response.

 

Impact of tracking events and prediction of analysis timing

When the primary endpoint is PFS-BICR, interim or primary analyses are triggered by the number of BICR-assessed events. In such context:

  • Monitoring BICR events may be more challenging than monitoring INV events: review is hardly ever real-time, there is some backlog, and adjudication time must be considered.
  • For event projections:
    • PFS-INV can be used as a first surrogate; if observed concordance is relatively high, it should provide a reasonably good estimate. The study team could consider estimating the number of BICR-assessed events from an estimated ratio of investigator-assessed events (e.g., xx% of INV events).
    • Adjudication must be initiated early, to ensure that the number of BICR-assessed events is accurate. However, in some instances, the study may have pending adjudications at the time of data transfer for event projection.
    • BICR readers should read the full set of scans done up to a certain cutoff date and then transfer the data. Predictions can also be made for each reader separately if the adjudication is not yet completed.
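The ratio-based projection described above can be sketched as follows. The concordance ratio and accrual rate here are hypothetical inputs that would be re-estimated at each data cut rather than fixed constants.

```python
def project_bicr_events(inv_events, concordance_ratio):
    """
    Rough projection of BICR-assessed events from investigator-assessed events,
    assuming a stable INV-to-BICR ratio observed in earlier data cuts (the
    ratio is a hypothetical input, not a universal constant).
    """
    return round(inv_events * concordance_ratio)

def days_to_target(current_bicr, target, events_per_month):
    """Crude linear projection of calendar time to reach the target event count."""
    if current_bicr >= target:
        return 0.0
    return (target - current_bicr) / events_per_month * 30.4  # avg days/month

# 120 INV events observed, with ~85% historically confirmed by BICR:
print(project_bicr_events(120, 0.85))   # 102
```

Such projections are only as good as the assumed ratio, which is why early adjudication and regular re-estimation at each transfer are emphasized above.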

 

Final takeaways

BICR enhances the objectivity and regulatory credibility of oncology trial endpoints by minimizing bias and standardizing radiographic assessments.

However, its implementation introduces operational challenges and may add complexity to data analysis (risk of informative censoring), to data interpretation (in case of poor concordance between INV and BICR), and to the prediction of analysis timing.

Improving Efficiency in Oncology Dose-Escalation Trials: A Cautious Bayesian Approach

In the dynamic world of oncology drug development, the complexity of dose-finding studies increases substantially when multiple disease types are evaluated within a single trial. The heterogeneity between cancer types poses a critical challenge: how can we design efficient dose-escalation procedures that account for patient differences across indications, particularly when one indication recruits more quickly than the other?

A new approach, cautious iBOIN (ciBOIN), offers a compelling answer. Built on the foundation of the Bayesian Optimal Interval (BOIN) design and its variant with informative priors (iBOIN), ciBOIN introduces a prudent method for borrowing strength from common cancer types that recruit faster to rarer types with slower recruitment while maintaining separate maximum tolerated dose (MTD) estimation for each cancer type.

 

The dose-escalation dilemma in multi-cohort trials

Traditional dose-escalation designs often face a trade-off between safety and efficiency. When trials pool data across disease types, they risk obscuring differences in toxicity profiles.

On the other hand, treating each type entirely independently can lead to missed opportunities to leverage valuable information.

 

Enter ciBOIN: A pragmatic compromise

The ciBOIN method was developed as a compromise between pooling disease types and separate dose-escalation. It allows dose-escalation decisions in the slower-recruiting disease type to be cautiously influenced by data from the faster-recruiting one. The design is particularly appealing in trials where each disease type may require a distinct MTD estimation due to differing patient profiles.
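While ciBOIN itself is the authors' extension, the underlying BOIN decision boundaries are published (Liu & Yuan, 2015) and straightforward to compute. The sketch below derives the escalation and de-escalation boundaries for a given target DLT probability, assuming the common defaults phi1 = 0.6*target and phi2 = 1.4*target; it illustrates the BOIN machinery only, not the ciBOIN borrowing mechanism.

```python
import math

def boin_boundaries(target, phi1=None, phi2=None):
    """
    Escalation/de-escalation boundaries of the BOIN design (Liu & Yuan, 2015).
    target: target DLT probability; phi1/phi2 default to 0.6 and 1.4 times it.
    Decision rule: escalate if the observed DLT rate at the current dose is
    <= lambda_e; de-escalate if it is >= lambda_d; otherwise stay.
    """
    phi1 = 0.6 * target if phi1 is None else phi1
    phi2 = 1.4 * target if phi2 is None else phi2
    lambda_e = math.log((1 - phi1) / (1 - target)) / math.log(
        target * (1 - phi1) / (phi1 * (1 - target)))
    lambda_d = math.log((1 - target) / (1 - phi2)) / math.log(
        phi2 * (1 - target) / (target * (1 - phi2)))
    return lambda_e, lambda_d

# For a target DLT rate of 0.30 the boundaries are roughly 0.236 and 0.358,
# matching the values tabulated for the standard BOIN design.
lambda_e, lambda_d = boin_boundaries(0.30)
print(f"escalate if DLT rate <= {lambda_e:.3f}, de-escalate if >= {lambda_d:.3f}")
```

Because the decision at each dose reduces to comparing an observed DLT rate against two fixed cut-offs, BOIN (and by extension ciBOIN) is simple to operate at the trial site, which is part of its practical appeal.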

Through extensive simulations, ciBOIN was compared against separate dose-escalation using BOIN over a range of scenarios. The assessed scenarios and results can be classified in three categories:

  • Same toxicity in both disease types: ciBOIN leads to similar or slightly better MTD detection rates, with fewer patients overdosed and a lower DLT rate compared to separate dose-escalation.
  • Higher toxicity in the common disease type: ciBOIN underestimates the MTD for the rare type but achieves improved safety, reducing the number of patients exposed to overly toxic doses and lowering the overall dose-limiting toxicity (DLT) rate compared to a separate dose-escalation.
  • Higher toxicity in the rare disease type: Here, ciBOIN again slightly underestimates the MTD, this time in the common disease type, but again with reduced overdosing rates.

Overall, ciBOIN results in smaller trial sizes. The highest reduction (~3 patients) with ciBOIN compared to separate dose escalation was observed in the highest dose-toxicity profile.

 

A balanced path forward

The findings support ciBOIN as a viable compromise between full pooling and strict separation. It ensures that dose recommendations are never too aggressive, thereby safeguarding patient safety while still achieving gains in operational efficiency.

Notably, ciBOIN enables a nuanced strategy: one that adapts to the heterogeneity of real-world oncology trials without overcomplicating implementation. For sponsors and statisticians navigating increasingly complex pipelines, this approach may offer a timely and practical innovation.

 

Looking ahead

As oncology trials continue to evolve toward platform and umbrella designs, methods like ciBOIN will be instrumental in ensuring both flexibility and rigor. Future work may explore extending the framework to accommodate more than two cohorts or using other approaches than BOIN and iBOIN.

Ultimately, ciBOIN exemplifies how thoughtful design choices, informed by Bayesian thinking and tempered by clinical caution, can help meet the dual mandate of safety and speed in early-phase drug development.

 

Interested in learning more?

Martin Kappler, along with Yuan Ji from the University of Chicago, will present “ciBOIN — A Bayesian-Informed Dose-Escalation Design for Multi-Cohort Oncology Trials with Potentially Varying Maximum Tolerated Doses” at the 46th Annual Conference of the International Society for Clinical Biostatistics (ISCB) on August 24–28, 2025, in Basel, Switzerland.