Living Evidence and the Rise of AI-Enabled HEOR Infrastructure: Insights from ISPOR Philadelphia

The ISPOR US 2026 conference in Philadelphia drew together colleagues and industry partners across evidence, value, and access. Across the presentations and sessions, a major theme emerged: the industry is moving rapidly from AI experimentation toward AI-enabled infrastructure. Here we share some of the key takeaways.

 

AI becomes core infrastructure

The strongest signal from ISPOR Philadelphia was that AI is no longer viewed as a side tool for productivity gains. Across HEOR, HTA, and RWE, organizations are beginning to embed AI directly into evidence generation and submission workflows. Discussions focused less on experimentation and more on operationalization, governance, and scalability.

AI is now being explored across the full evidence lifecycle, including systematic literature reviews, economic modeling, patient-reported outcomes, HTA submissions, payer communication, and regulatory documentation. The industry appears to be shifting toward continuously learning evidence systems rather than static, project-based workflows.

 

Agentic AI moves beyond simple automation

One of the biggest themes was the emergence of agentic AI systems. Instead of using isolated prompts, organizations are experimenting with coordinated AI agents that can generate models, review outputs, create documentation, and prepare evidence packages.

Several workshops demonstrated how AI can move from model concept to full implementation in both R and Excel while maintaining human oversight. The emphasis throughout was not full autonomy, but “human-at-the-helm” governance where AI accelerates and supports execution while experts retain accountability.

This reflects a broader transition from AI-assisted work toward AI-orchestrated workflows.

 

AI-supported SLRs reach a turning point

AI-assisted systematic literature reviews (SLRs) dominated the conference agenda. However, the conversation has evolved significantly from earlier discussions focused mainly on efficiency gains.

The field is now grappling with questions around reproducibility, transparency, benchmarking, and governance. Multiple sessions highlighted the lack of shared standards for evaluating AI-SLR performance and proposed industry-wide benchmarking frameworks and validation challenges.

ISPOR itself is increasingly positioning itself as a central body for developing good-practice guidance and methodological standards for AI-enabled evidence synthesis, with the anticipated publication of the GenAI in SLR taskforce report.

 

Regulatory readiness becomes critical

Another major theme was regulatory credibility. Panels focused heavily on FDA, EMA, NICE, and Health Canada guidance regarding AI-assisted evidence generation and real-world data curation.

The industry discussion has shifted from asking whether regulators will engage with AI-generated evidence to determining what documentation, validation, and governance standards will be required for acceptance.

Speakers repeatedly emphasized auditability, traceability, reproducibility, and version control as foundational requirements for regulatory-grade AI workflows.

 

Real-world data and AI converge

Many sessions positioned AI as the enabling layer needed to unlock the value of modern real-world data. Much of healthcare’s most clinically meaningful information remains trapped in unstructured formats such as clinician notes, pathology reports, and medical charts.

AI methods including NLP and machine learning are increasingly being used to transform this information into structured, research-ready evidence. This was especially prominent in sessions involving medical devices, exploratory evidence planning, and dynamic evidence generation strategies.

AI is increasingly being viewed not simply as an analytics tool, but as foundational infrastructure for modern RWE generation.

 

Patient voice gains new attention

Several workshops explored how large language models and conversational AI can support patient-centered research. These applications included free-text analysis, conversational patient interviews, social media analysis, and narrative symptom capture.

The interest in AI application in qualitative research represents an important expansion beyond traditional structured analytics. Researchers are now exploring whether AI can preserve the nuance of lived patient experience while enabling scalability.

At the same time, concerns around hallucination risk, construct validity, and bias remain central to these discussions.

 

HEOR leadership roles are evolving

As AI automates more technical tasks, the role of HEOR and RWE leaders appears to be changing. Multiple sessions suggested that future leadership value will increasingly center on governance, strategic interpretation, stakeholder trust, and organizational coordination.

Rather than replacing experts, AI may elevate the importance of human judgment and scientific oversight. Organizations will need leaders who can balance innovation with credibility in payer and regulatory environments.

This suggests AI adoption is not simply a technology challenge, but an organizational transformation challenge.

 

Responsible AI emerges as the central principle

Across nearly every session, the same themes repeatedly appeared: transparency, reproducibility, validation, governance, and human oversight.

The HEOR community appears to be converging around a shared understanding that AI adoption will only succeed if scientific credibility and integrity remain intact. The conversation is no longer about replacing traditional rigor, but about scaling evidence generation responsibly.

 

ISPOR Philadelphia ultimately showed an industry moving rapidly from AI experimentation toward AI-enabled infrastructure. The next phase of HEOR will likely be defined by organizations that can operationalize AI while maintaining trust, methodological rigor, and decision relevance.

From Automation to Audit-Readiness: AI’s Growing Role in Statistical Programming

In the fast-evolving world of pharmaceutical clinical development, the demand for faster, more accurate, and scalable solutions around patient data is increasing rapidly. Traditional statistical programming, although reliable, faces a growing challenge to keep up with the enormous volume of data, complex protocols, and regulatory requirements.

Here, I discuss how AI and automation are reshaping statistical programming, and how to adopt these tools responsibly.

 

The increasingly labor-intensive role of statistical programming

As clinical trials grow in complexity, the work of statistical programmers has become increasingly labor-intensive and time-consuming, particularly in tasks such as data cleaning, validation, and the creation of SDTM (Study Data Tabulation Model) and ADaM (Analysis Data Model) datasets, the Tables, Figures, and Listings (TFLs) required for study reports, and regulatory submission packages.

In response, modern statistical programming is reshaping clinical data reporting by automating routine tasks and improving efficiency while preserving data integrity and keeping outputs inspection-ready.

 

Reducing manual effort and errors with automation

Automation in clinical trial statistical programming is evolving to reduce manual effort and errors, especially in TFL generation and SDTM/ADaM processes.

Key advancements include metadata-driven approaches and open-source ecosystems like Pharmaverse, which is aligned with CDISC standards and provides a complete, production-ready toolchain for clinical trial reporting. It supports dataset generation, TFL creation, submission and validation, and workflow orchestration using various tools.

Metadata-driven automation advances TFL generation by using CDISC Analysis Results Standard (ARS) metadata defined in TFL shells to create ready-to-run SAS programs. This approach reduces manual effort, improves consistency and traceability, and allows quick adaptation to changes, shifting TFL development from code-centric to metadata-centric. Overall, it enables fully automated, metadata-driven clinical reporting with faster delivery, improved quality, and scalable workflows.

 

Improving efficiency with risk-based validation

Risk-based validation, aligned with ICH Q9, improves efficiency by scaling quality control according to output criticality:

  • High-risk analyses require full double programming
  • Medium-risk outputs use peer review supported by automated testing
  • Low-risk outputs rely primarily on automation
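
The three tiers above can be written down as a simple lookup. The following is a minimal sketch only: the output names and their risk assignments are hypothetical, and in practice the criticality of each output would come from a documented risk assessment rather than a hard-coded dictionary.

```python
from enum import Enum


class Risk(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# QC strategy per tier, taken from the three bullets above.
QC_STRATEGY = {
    Risk.HIGH: "full double programming",
    Risk.MEDIUM: "peer review supported by automated testing",
    Risk.LOW: "primarily automated checks",
}


def qc_plan(outputs: dict[str, Risk]) -> dict[str, str]:
    """Return the QC strategy for each output, keyed by output name."""
    return {name: QC_STRATEGY[risk] for name, risk in outputs.items()}


# Hypothetical outputs and risk assignments, for illustration only.
example = {
    "primary_efficacy_table": Risk.HIGH,
    "demographics_table": Risk.MEDIUM,
    "data_listing": Risk.LOW,
}
plan = qc_plan(example)
for name, strategy in plan.items():
    print(f"{name}: {strategy}")
```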

 

Growing role of artificial intelligence

Along with the many automation efforts, AI has evolved in the realm of statistical programming. We are already seeing the following advancements:

  1. Large Language Models (LLMs): AI-driven systems like GPT-4, Claude, GatorTron, and ClinicalBERT are being used to assist with code drafting, interpreting analysis specifications, and even reviewing code for errors.
  2. Natural Language Processing (NLP): This technology helps translate unstructured text, like clinical trial protocols, into structured inputs for statistical programming workflows.
  3. Deep Learning for Predictive Insights: Beyond data mapping, deep learning can help with pattern detection and predictive tasks, providing insights that were previously out of reach.

 

LLMs offer fast code generation and explanations but carry hallucination risks, while NLP helps extract meaning from text but requires domain-specific tuning. Deep learning is strong at identifying complex patterns but lacks interpretability.

AI-generated outputs should not be treated as final deliverables. Under good programming practices, traceability, human review, and explainability are essential, with clinical trial team review and statistician oversight required throughout the process.

Automation must maintain ALCOA+ data integrity principles: Attributable (clear authorship), Legible (readable outputs), Contemporaneous (timestamped), Original (source preservation), Accurate (verified correctness), plus Complete, Consistent, Enduring, and Available. Automated systems can enhance ALCOA+ compliance through immutable audit trails via version control, automated provenance tracking, reproducible execution environments, and systematic documentation generation.

 

Final takeaways

As AI continues to integrate into clinical development workflows, the role of the statistical programmer is changing. Rather than replacing programmers, AI is enhancing their capabilities, allowing them to focus on higher-order tasks such as analysis standardization, high-quality regulatory submissions, and technological innovation.

Programmers who embrace these technologies will be better positioned to thrive in this rapidly evolving landscape. The future of clinical development is promising, but it will require a shift in how we think about programming. Upskilling in areas like R, Python, and machine learning, combined with strong communication and collaboration skills, will be key to staying ahead in this new era. As AI and automation continue to reshape statistical programming, it’s essential that we adopt these tools responsibly, keeping human oversight at the forefront while leveraging AI to enhance efficiency, accuracy, and compliance.

Accelerating Database Lock Timelines Without Sacrificing Data Quality

Database Lock (DBL) is a critical milestone in the clinical trial lifecycle. A final step of clinical data management, database lock indicates the completion of data collection, cleaning, and validation, readying the data for statistical analysis. This milestone typically occurs 4–6 weeks after Last Patient Last Visit (LPLV). However, if a more challenging DBL timeline (such as 1–3 weeks) is required, perhaps because of expedited regulatory submissions or pressing business or scientific requirements, it creates a high-pressure scenario for all stakeholders.

As the gatekeeper of data integrity and quality in clinical trials, the Clinical Data Manager (CDM) plays an important role in ensuring DBL is achieved on time without sacrificing data quality.

Here, I share best practices for achieving accelerated database lock timelines.

 

Successful database lock depends on early planning

To ensure the database lock is successful, meticulous planning and key stakeholder involvement are vital from the start. Key stakeholders may include Clinical Data Managers (CDM), Clinical Research Associates (CRA), Medical Monitors, Site Staff (Investigators, Coordinators), and Biostatisticians.

 

Stakeholder involvement

Different domains have different perspectives when looking at data, although we share the same goal. Since the biostatisticians are the ones who process and analyze the data, it is important to involve them early on so that our perspectives are aligned.

 

Stakeholder expectations

It’s important to align expectations, responsibilities, and timelines with all key stakeholders early in the planning process, ensuring all parties are on the same page. This will help to identify potential risks, evaluate the likelihood and impact of risks to determine their severity, and allow for contingency planning.

 

Accelerate database lock with continuous data cleaning

Adopting a strategy of continuous data cleaning throughout the trial significantly accelerates DBL. This involves performing regular, structured data review and correction of accumulated trial data.

 

Locking data in groups

Locking data periodically throughout the trial reduces the volume of data that needs to be finalized at the end. When data are grouped for locking, verification and cleaning of each group must be completed before that group can be locked.

 

Timelines for each group of data locking

Collaborate with stakeholders on how data could be grouped together for locking and agree on realistic timelines for each group, keeping each stakeholder's specific needs in mind, as some tasks depend on one another. These timelines could include milestones for last data entry, last query sent, last query resolved, investigator sign-off, and lock date.

For example, a grouping strategy could include:

  1. Looking at the participant recruitment plan
  2. Identifying the number of participants expected to complete the last visit or specific visit in a certain number of months
  3. Grouping these data together to form a batch
  4. Defining timelines for data cleaning activities before performing a lock on these data
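
The four steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than a recommended plan: the participant IDs, expected last-visit dates, three-month batch window, and milestone offsets are all hypothetical, and real timelines would be agreed with stakeholders.

```python
from collections import defaultdict
from datetime import date, timedelta

# Steps 1-2: hypothetical expected last-visit dates from a recruitment plan.
expected_last_visit = {
    "P001": date(2026, 1, 15),
    "P002": date(2026, 2, 3),
    "P003": date(2026, 4, 20),
    "P004": date(2026, 5, 9),
}

# Assumed offsets (days after the batch cut-off) for each cleaning milestone.
MILESTONE_OFFSETS = {
    "last data entry": 7,
    "last query sent": 14,
    "last query resolved": 21,
    "investigator sign-off": 28,
    "lock date": 35,
}


def batch_key(d: date, months_per_batch: int = 3) -> tuple[int, int]:
    """Map a date to (year, batch index); 3 months per batch gives quarters."""
    return d.year, (d.month - 1) // months_per_batch


def build_batches(visits: dict[str, date], months_per_batch: int = 3) -> dict:
    """Step 3: group participants whose last visit falls in the same window."""
    batches = defaultdict(list)
    for pid, d in visits.items():
        batches[batch_key(d, months_per_batch)].append(pid)
    return dict(batches)


def milestones(cutoff: date) -> dict[str, date]:
    """Step 4: derive milestone dates from the batch cut-off date."""
    return {name: cutoff + timedelta(days=n) for name, n in MILESTONE_OFFSETS.items()}


batches = build_batches(expected_last_visit)
for key, participants in sorted(batches.items()):
    # The cut-off is taken as the latest expected last visit in the batch.
    cutoff = max(expected_last_visit[p] for p in participants)
    print(key, participants, milestones(cutoff)["lock date"])
```

Shrinking `months_per_batch` as the trial approaches LPLV would produce smaller batches with shorter cleaning turnaround.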

When pre-defining timelines, it is important to consider source data verification (SDV) intervals and what is realistically feasible, so that data unlocking after the lock is minimized. For example, if the locking group contains a substantial volume of data, then the timeline for each activity typically needs to be longer. Aiming for smaller volumes of data as LPLV approaches is essential to shorten data cleaning turnaround time.

 

Identify issues early with clinical data manager oversight

The CDM should closely monitor data and perform trend analyses to detect common data entry discrepancies, lagging query resolutions, unexpectedly high numbers of open queries, or pending SDVs, and alert stakeholders to address the issues promptly. This helps to identify issues early and optimize data quality, which minimizes costly delays. The CDM should also monitor the unlocking rates of previously locked data. If the unlocking rate is high, consider revising the data locking plan with more realistic timelines.

 

Delay in Investigator sign-off

Investigator sign-off of Electronic Case Report Forms (eCRFs) is a foundational regulatory requirement serving as documented evidence for the accuracy, completeness, and integrity of the data submitted. It is frequently delayed due to a combination of high investigator workloads, technical complexities, or cumbersome processes. Early discussion on this critical milestone and including the timeline in the Data Locking Plan contribute significantly to expedited DBL.

 

Avoid bottlenecks caused by data outside electronic data capture

External data that are not part of electronic data capture (EDC) often become the bottleneck in DBL due to complexity and time-intensive processes. Early proactive discussions with vendors regarding timelines for data delivery are critical to avoid jeopardizing an accelerated DBL timeline.

 

Final takeaways

Adopting a continuous data cleaning approach is essential for organizations aiming to shorten the timeline between LPLV and DBL. With strong attention to planning, timelines, and ongoing stakeholder engagement, DBL can be achieved on an accelerated timeline.

A New Frontier in Real-World Evidence: Can AI Create Reliable Synthetic Trial Data?

Synthetic data is a promising innovation for clinical studies that incorporate an external control arm. Relying solely on traditional control groups can be costly, time-consuming, or even unethical. Instead, researchers are exploring ways to generate “synthetic” patient cohorts that behave like real ones.

At ISPOR US 2026 in Philadelphia, we will be presenting a pilot study that takes an important step in this direction. Our work explores how large language models (LLMs), the same technology behind modern AI assistants, can be used to generate synthetic clinical trial datasets suitable for external control arms (ECAs).

 

Why external control arms matter

ECAs are increasingly important in clinical trials, especially in areas where recruiting patients into placebo groups is difficult or undesirable. By using existing data to simulate a control group, researchers can accelerate trials and reduce patient burden. This challenge is especially pronounced in rare diseases, oncology, gene and cell therapies, and severe or life‑threatening conditions, where patients and clinicians are understandably reluctant to accept randomization to non‑active treatment arms.

However, for ECAs to be useful, they must meet two critical requirements:

  1. They need to closely resemble real patient populations in terms of demographics and clinical characteristics.
  2. They must protect patient privacy and be reproducible for regulatory scrutiny.

This is where AI — and specifically LLMs — enters the picture.

 

Two approaches to generating synthetic data

In our study, we evaluated two different ways of using LLMs to generate synthetic clinical trial datasets.

The first approach was direct generation. Here, the LLM was given access to the original dataset along with a variable dictionary and asked to generate a new synthetic dataset in a single step. This method is fast and intuitive.

The second approach was more structured and code-driven. Instead of generating the dataset directly, the LLM created a Python-based pipeline that performed bootstrapping and anonymization. This pipeline included a noise injection mechanism, where small amounts of statistical “noise” were added to numerical variables. The noise was carefully calibrated — set to 5% of each variable’s standard deviation — and values were constrained within realistic ranges to maintain clinical plausibility.
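
To make the bootstrapping and noise-injection idea concrete, here is a minimal sketch using only the Python standard library. It is not the pipeline produced in the study: the function name `synthesize`, the example records, and the plausible ranges are hypothetical. It keeps the noise level described above (5% of each variable's standard deviation) and assumes that categorical variables are copied unchanged.

```python
import random
import statistics


def synthesize(rows, bounds, n, noise_fraction=0.05, seed=0):
    """Bootstrap n rows, then add Gaussian noise to each bounded numeric variable.

    rows: list of dicts holding the source records
    bounds: {variable: (low, high)} plausible range for each numeric variable
    noise_fraction: noise SD as a fraction of each variable's sample SD
    """
    rng = random.Random(seed)
    sds = {var: statistics.stdev([r[var] for r in rows]) for var in bounds}
    synthetic = []
    for _ in range(n):
        source = rng.choice(rows)  # bootstrap: sample with replacement
        new_row = dict(source)     # variables without bounds are copied as-is
        for var, (low, high) in bounds.items():
            noisy = source[var] + rng.gauss(0.0, noise_fraction * sds[var])
            new_row[var] = min(max(noisy, low), high)  # clip to plausible range
        synthetic.append(new_row)
    return synthetic


# Hypothetical source records, for illustration only (not trial data).
source_rows = [
    {"age": 54, "bmi": 27.1, "sex": "F"},
    {"age": 61, "bmi": 31.4, "sex": "M"},
    {"age": 47, "bmi": 24.8, "sex": "F"},
    {"age": 70, "bmi": 29.0, "sex": "M"},
]
plausible = {"age": (18, 90), "bmi": (15.0, 50.0)}
cohort = synthesize(source_rows, plausible, n=100)
print(len(cohort), cohort[0])
```

Fixing the random seed makes the output reproducible, which matters when a synthetic cohort has to be regenerated for audit.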

 

What we found

Both methods were able to generate synthetic cohorts of 100 patients, demonstrating that LLMs can indeed produce usable clinical datasets.

However, the differences between the two approaches were striking. The direct generation method was extremely fast, completing the task in just 23 seconds with a single prompt. In contrast, the code-based approach took longer — around 40 seconds — and required multiple iterations to refine the pipeline.

Despite the extra effort, the code-driven method delivered better results. It more accurately preserved the statistical properties of the original trial data, including key variables like age, body mass index, sex, and race. The distributions in the synthetic dataset closely matched those of the real population, suggesting that the combination of bootstrapping and calibrated noise was effective.

 

Speed vs. scientific rigor

These findings highlight an important trade-off. Direct LLM generation is excellent for rapid prototyping and exploratory analysis. It allows researchers to quickly create synthetic datasets with minimal effort.

But when it comes to regulatory-grade applications — such as external control arms used in decision-making — transparency and control become essential. The code-augmented approach provides a clear, reproducible process that can be audited and validated. This level of rigor is crucial for building trust with regulators and stakeholders.

 

Balancing privacy and realism

A key challenge in synthetic data generation is protecting patient privacy without losing the statistical integrity of the dataset. Our study shows that adding carefully calibrated Gaussian noise can strike this balance.

By scaling noise to the variability of each variable and enforcing realistic bounds, we were able to anonymize the data while preserving meaningful population-level characteristics. This approach helps ensure that synthetic datasets remain useful for analysis while reducing the risk of re-identification.

 

What comes next?

While this pilot study demonstrates the potential of LLM-generated synthetic cohorts, it is only the beginning. Future research needs to explore whether these methods are robust under more challenging conditions.

One critical next step is to evaluate re-identification risk, particularly under adversarial scenarios where attackers actively attempt to reverse-engineer the data. It will also be important to compare noise-based approaches with other privacy-preserving techniques, such as differential privacy. This step would include understanding the amount of noise the model should introduce.

 

Closing thoughts

Synthetic data has the potential to transform clinical research by making trials faster, more efficient, and more ethical. Our findings suggest that LLMs can play a meaningful role in this transformation — but how they are used matters.

Fast, direct generation offers convenience, but structured, code-based approaches provide the reliability and transparency needed for real-world adoption. As the field moves forward, combining the strengths of both may unlock the full potential of AI-driven synthetic data in healthcare.

 

Interested in learning more?

Join Manuel Cossio and Deepa Jahagirdar, along with Anupama Vasudevan, at ISPOR US for their upcoming presentation, “A Pilot Assessment of LLM-Generated Synthetic Cohorts: A First Step Toward Robust Synthetic Control Arms” on May 18 at 4:00 PM.

Insights From Our Work with CDISC Standards: A Preview of Cytel’s Contributions to the 2026 CDISC + TMF EU Interchange

A year ago, I stepped down from the CDISC EU Committee and, guess what? Just a few weeks later, CDISC chose Milan, my hometown, as the next destination for the CDISC + TMF EU Interchange.

May 20–21 is fast approaching, so be sure to check the agenda if you haven’t already registered. You may immediately notice a different style compared to most conferences, and that’s the influence of a city known for beauty and design (and I promise I’m not biased). But have you ever seen conference tracks with such artistic titles? “AI Espresso Shot,” “TMF Standards and Governance — Allegoria del Buon Governo,” “Uno Standard Da Mangiare (USDM),” “Protocolli alla Moda: The M11-USDM Collection,” just to name a few.

Together with my Cytel colleagues, I will contribute two presentations and one poster, sharing insights from our work with CDISC standards, including Dataset-JSON and CORE.

 

JSON and CORE Unlocking Adoption

Silvia Faini (co-authors: Hugo Signol, Sebastia Barcelo, Angelo Tinazzi)

Wednesday, May 20, 12:30-13:30 – Poster Session

In this poster, we share our experience working with both CDISC Dataset-JSON and CORE. Using several anonymized studies, we assessed available tools, in both SAS and R, for creating and importing Dataset-JSON files. We highlight key challenges and risks, and compare CDISC CORE outputs with those of tools such as Pinnacle21 (for SDTM only).

 

Authenticity Matters: Preserving Standards Integrity from Clinical Data Models to Tiramisù

Angelo Tinazzi

Thursday, May 21, 12:00-12:30 – Session 6C: L’Architettura degli Standard

What started as a joke, trying to “cheat” my Belgian friends who always complain about the original alcohol-free Tiramisu recipe, evolved into a serious internal project. We analyzed metadata from more than 300 anonymized SDTM packages (with ADaM to follow), spanning multiple versions, therapeutic areas, and trial phases. Using these metadata, we explored how SDTM implementation and adherence to regulatory expectations, particularly FDA requirements, have evolved over time, assessing quality and consistency through quantitative metrics.


It Got Worse Than Expected: Three Years of Retrospective CBER Requests on SDTM, ADaM, and TFLs

Mark Malayas (co-author: Angelo Tinazzi)

Thursday, May 21, 14:30-15:00 – Session 7: Regulatory Eccellenza

In 2023, we presented at PHUSE-EU Connect our initial experience with an FDA CBER vaccine submission, following some initial interactions with the FDA. We shared our concerns about requests that, in many cases, required retrospective changes to already concluded studies. But that was not all! Three years later, the situation has evolved further, with increasing and often unexpected requests from the agency. Curious to learn more? Join Mark on Thursday.

 

Silvia Faini, CDISC E3C Vice-Chair, will also be moderating Session 6C: L’Architettura degli Standard, Thursday, May 21 from 11:00–12:30.

 

Meet us there

Cytel will also have a booth at the conference! Stop by to meet our presenters and our Business Development colleagues.

We look forward to reconnecting with colleagues from around the world, meeting new peers, and exchanging ideas at CDISC + TMF EU Interchange 2026.

We hope to see you in Milan!

risk.assessr: R Package Validation for Regulatory Submission in Pharmaceutical Development

In pharmaceutical development, the reliability of statistical software is not a luxury; it is a regulatory requirement. For organizations leveraging R in regulated environments, this mandate means a rigorous approach to validation is needed. Tools like risk.assessr allow users to create a practical, data-driven process to meet regulatory requirements.

 

Validation in R

In pharmaceutical development, validation typically refers to systems validation. The system validation should incorporate all of the following elements:

  • Accuracy
  • Reproducibility
  • Traceability

When assessing the accuracy of R packages, the R Validation Hub differentiates R packages by the following types:

  • Base and recommended (core) packages: developed by the R Foundation and shipped with the basic installation, these represent the highest tier of reliability.
  • Contributed open-source packages: developed by anyone in the community and may vary significantly in their accuracy and robustness.

 

Validation using risk.assessr

Recognizing the need for a structured, risk-based approach to R package validation, we developed the open-source tool, risk.assessr. The risk.assessr package takes a risk-based approach to evaluate the potential risks linked to each R package.

The assessment considers:

  • A package’s complexity and structure
  • Unit test coverage
  • Traceability
  • Documentation quality
  • License
  • Popularity
  • Package activity and maintenance

 

By extracting these risk-based metrics, risk.assessr allows users to make informed decisions about whether a package is suitable for use in regulated environments or in exploratory analysis.

 

Key metrics for package validation

Validation metrics are gathered by risk.assessr through specific functions that retrieve desired data. Table 1 lays out some of the key metrics and risk.assessr functions:

 

Table 1: Key Metrics

 

Risk analysis using risk.assessr

The power of risk.assessr lies in its risk analysis capabilities, which employ rule-based criteria. These risk criteria can be used to enforce stricter standards, accommodate internal tooling priorities, or meet compliance requirements. Users define threshold values for high, medium, and low risk across the metrics mentioned above or for metrics that they define themselves. These thresholds are stored in inst/config/risk-definition.json, allowing for centralized, version-controlled governance of validation standards.

The get_risk_analysis() function applies these rules to calculate risk ratings, transforming raw metrics into actionable, easy-to-understand intelligence. This approach recognizes that validation requirements vary by organization and use case — what constitutes acceptable risk for an exploratory analysis differs from risk tolerance for a regulatory submission.
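
The rule-based idea can be illustrated with a short sketch. It is written in Python purely for illustration and is not the risk.assessr API or the actual schema of risk-definition.json: the metric names, threshold values, `direction` field, and the `rate` and `analyse` helpers are all assumptions.

```python
import json

# Hypothetical threshold definitions; the real risk-definition.json may differ.
RISK_DEFINITION = json.loads("""
{
  "test_coverage": {"direction": "higher_is_better", "low": 0.80, "medium": 0.60},
  "open_issues":   {"direction": "lower_is_better",  "low": 10,   "medium": 50}
}
""")


def rate(metric: str, value: float, rules: dict = RISK_DEFINITION) -> str:
    """Classify one metric value as 'low', 'medium', or 'high' risk."""
    rule = rules[metric]
    if rule["direction"] == "higher_is_better":
        if value >= rule["low"]:
            return "low"
        if value >= rule["medium"]:
            return "medium"
        return "high"
    # Otherwise lower values are better.
    if value <= rule["low"]:
        return "low"
    if value <= rule["medium"]:
        return "medium"
    return "high"


def analyse(metrics: dict) -> dict:
    """Turn raw metric values into risk ratings, one per metric."""
    return {name: rate(name, value) for name, value in metrics.items()}


ratings = analyse({"test_coverage": 0.72, "open_issues": 4})
print(ratings)
```

Keeping thresholds in a data file rather than in code is what allows an organization to tighten or relax them under version control without touching the rating logic.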

 

Risk analysis: Reporting

risk.assessr generates two complementary reports that serve different audiences and purposes.

The generate_html_report() function produces a detailed report for developers and validation teams that translates the three threshold levels into a three-color visualization: red for high risk, yellow for medium risk, and green for low risk. This visual approach makes risk assessment immediately apparent and facilitates technical discussions about package suitability.

For validation team sign-off of R packages, write_summary_report() generates a concise one-page summary that gives one of three actionable recommendations: Approved, Rejected, or Remediation Needed. This report, typically generated by a Validation GitHub Action, provides a structured framework for validation teams to apply critical thinking and make final decisions about package inclusion in submission or other environments.

 

Final takeaways

risk.assessr can be a critical component of an easy-to-use, reliable, and detailed validation workflow. These workflows allow organizations to confidently create validated R environments for submission purposes and/or exploratory purposes. They also help maintain audit trails and compliance documentation.

How Agentic AI Can Transform HTA Landscaping for EU JCA

Health Technology Assessment (HTA) in the European Union (EU) is entering a new phase with the introduction of the EU Joint Clinical Assessment (JCA). The goal of the new HTA regulation is to improve the availability of innovative health technologies in the EU by ensuring efficient resource use and strengthening the scientific quality of HTA across Member States (MS).

At the heart of this process is the JCA scope, which consolidates diverse evidence requests from all MS into the PICO (Population, Intervention, Comparator, Outcome) framework. Anticipating these policy-driven PICO requests is critical for a successful JCA submission and can turn into a complex, time- and labor-intensive exercise. In addition to understanding the potentially diverse clinical practices across the MS, it demands an in-depth assessment of the different national HTA evidence requirements. Teams working on PICO predictions need a clear mapping of what evidence has been accepted, questioned, or rejected across the different HTA systems. Building that mapping is multifaceted.

 

Why HTA landscaping is challenging

HTA landscaping requires careful review of past HTA decisions to understand what evidence leads to positive HTA outcomes. This involves identifying relevant patient populations, accepted comparators, and meaningful outcomes. It also requires going deeper in the HTA documentation, uncovering why certain choices were criticized or dismissed.

Much of this information is hidden in long reports, potentially including appendices. These HTA documents are written in different languages, follow different formats, and often include subtle but important contextual details that reveal the HTA bodies' critiques and their reasoning for specific evidence requests. As a result, landscaping is still largely manual, time-consuming, and difficult to scale.

 

What makes agentic AI different

Agentic AI offers a new way to approach this problem. Instead of simply summarizing documents or answering one-off questions, agentic systems are designed to carry out structured tasks. They can follow a defined set of instructions, extract specific types of information, and organize results in a consistent way.

This makes them particularly suited for HTA landscaping, where the goal is not just to read documents, but to systematically extract comparable insights across multiple sources.

 

Our research: Using AI agents for HTA extraction

In our recent research, which will be presented at ISPOR US this May, we explored how autonomous AI agents can support HTA landscaping for EU JCA.

We developed two large language model–based agents designed to extract structured information from HTA reports using a set of 21 expert-defined questions. These questions covered both standard PICO elements, such as population, comparators, and outcomes, and more context-specific insights. The latter included methodological requirements, reasons for rejecting certain outcomes or comparators, and other critique points raised by HTA bodies.

The two agents differed in how they were guided. The first used a general prompt, while the second incorporated additional clarification within selected questions to improve contextual understanding.
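A minimal sketch of how the two guidance styles could differ in practice. The question texts, clarification note, and instruction wording below are hypothetical placeholders, not the 21 questions or prompts used in the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Question:
    """One expert-defined extraction question (texts here are illustrative only)."""
    qid: str
    text: str
    clarification: Optional[str] = None  # extra guidance, used only by the second agent


# Hypothetical examples; not the questions used in the study.
QUESTIONS = [
    Question("Q1", "Which patient population was assessed?"),
    Question(
        "Q2",
        "Which comparators were rejected, and why?",
        clarification="Report only reasons stated explicitly by the HTA body.",
    ),
]

# Hypothetical general instruction shared by both agents.
GENERAL_INSTRUCTION = (
    "Answer using only the HTA report provided. "
    "If the report does not contain the answer, reply 'Not reported'."
)


def build_prompt(question: Question, with_clarification: bool) -> str:
    """Assemble the prompt for one question under either guidance style."""
    parts = [GENERAL_INSTRUCTION, f"Question {question.qid}: {question.text}"]
    if with_clarification and question.clarification:
        parts.append(f"Clarification: {question.clarification}")
    return "\n".join(parts)


# Agent 1: general prompt only. Agent 2: clarification added to selected questions.
agent_1_prompts = [build_prompt(q, with_clarification=False) for q in QUESTIONS]
agent_2_prompts = [build_prompt(q, with_clarification=True) for q in QUESTIONS]
```

Because the clarification lives on the question rather than in the general instruction, the two agents receive identical prompts for every question that has no clarification attached.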

 

How we evaluated performance

To test the agents, we used publicly available HTA reports for osimertinib (in locally advanced or metastatic NSCLC with EGFR T790M mutation) from Spain, the Netherlands, and France. These reports varied in length, structure, and language, providing a realistic test of performance.

Local HTA experts applied a strict scoring framework that assessed both accuracy and completeness. Importantly, any answer containing hallucinated content was automatically scored as zero. This ensured that reliability remained central to the evaluation.
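The scoring rule can be illustrated with a short sketch. Only the zero-for-hallucination rule comes from the description above; the 1.0 / 0.5 / 0.0 scale and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExpertRating:
    """An expert's judgment of one agent answer (hypothetical structure)."""
    accurate: bool      # the extracted statement matches the report
    complete: bool      # nothing relevant from the report is missing
    hallucinated: bool  # the answer contains content not found in the report


def score(rating: ExpertRating) -> float:
    """Score one answer; the numeric scale is an illustrative assumption."""
    if rating.hallucinated:
        return 0.0  # any hallucinated content overrides everything else
    if not rating.accurate:
        return 0.0
    return 1.0 if rating.complete else 0.5  # fully vs. partially correct


ratings: List[ExpertRating] = [
    ExpertRating(accurate=True, complete=True, hallucinated=False),
    ExpertRating(accurate=True, complete=False, hallucinated=False),
    ExpertRating(accurate=True, complete=True, hallucinated=True),
]
scores = [score(r) for r in ratings]
hallucination_free_share = sum(not r.hallucinated for r in ratings) / len(ratings)
```

Checking the hallucination flag first means an otherwise accurate and complete answer still scores zero, which keeps reliability ahead of coverage in the evaluation.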

 

What we found

Both agents were able to complete the full extraction across all HTA reports, and around 90% of responses were generated without hallucinations. The second agent performed better overall, achieving a higher number of fully correct answers and fewer partially correct responses.

The first agent, while still effective, produced some hallucinated content, particularly in the Spanish report. The second agent avoided hallucinations entirely in this evaluation. Both agents performed best on the French HTA report, suggesting that clearer structure and language can improve AI performance.

One of the most important findings was the impact of prompt design. Adding targeted clarification significantly improved the agent’s ability to interpret and extract complex HTA information.

 

What this means for EU JCA landscaping

These results suggest that agentic AI can meaningfully improve how HTA landscaping is performed. By automating structured extraction, it becomes possible to review multiple reports more quickly and consistently. This allows teams to build a more comprehensive understanding of the landscape in less time.

Importantly, this approach goes beyond standard PICO elements. It captures the context-specific insights that often drive HTA decisions, such as methodological concerns or other reasons for rejecting evidence. This is critical for developing realistic PICO scenarios in the context of JCA.

Another key advantage is the ability to work across languages. Since EU HTA involves multiple jurisdictions, multilingual capability removes a major barrier and enables a more unified analysis.

 

The role of human expertise

Despite these advances, AI alone is not enough. Some limitations remain, including occasional hallucinations and variability depending on the source material. For this reason, human oversight continues to be essential.

The most effective approach is to combine agentic AI with human HTA expertise. AI can handle large-scale extraction and structuring of information, while experts validate the outputs and ensure that interpretations are accurate and relevant.

 

Looking ahead

Agentic AI is unlikely to replace HTA professionals, but it will fundamentally reshape how they work. By reducing the burden of manual review, it frees experts to focus on higher-value activities such as interpretation, strategic planning, and decision-making.

In the context of EU JCA, this shift brings clear advantages. It enables faster, more scalable landscaping and PICO predictions, helping to identify potential evidence gaps earlier in the process. As the methodology evolves, further testing will extend the agent-driven workflows to HTA reports from additional MS. At the same time, engineering adaptations may be needed to accommodate ongoing changes in local HTA documents as they continue to evolve together with the JCA reports.

 

Interested in learning more?

Manuel Cossio and Lilia Leisle will be presenting their poster “Accelerating Dynamic HTA Landscaping in Oncology Through Autonomous Generative AI-Driven Multilingual Data Extraction” at ISPOR US on May 18 at 4 PM. We hope to see you there!

Embedding R into GxP-Compliant Statistical Computing Environments

Biotech and mid-sized pharmaceutical companies are increasingly modernizing their statistical computing environments (SCEs) to keep pace with growing data complexity, advanced analytics, and evolving regulatory expectations. Open-source languages such as R offer clear advantages in flexibility and innovation. However, in GxP-compliant settings, adoption introduces challenges that go far beyond technology itself.

Much of the discussion around R focuses on its capabilities. In practice, the real challenge lies in operationalizing it within a compliant ecosystem — where validation, governance, and reproducibility become critical.

This article explores these challenges from a practical perspective and outlines how organizations are addressing them.

 

The real barrier: GxP complexity

Adopting R is not the primary hurdle; embedding it into a GxP-compliant environment is. This requires:

  • Validation of open-source packages
  • Governance and auditability
  • Reproducibility and traceability
  • Ongoing lifecycle management

For organizations without established frameworks, these requirements can introduce significant overhead, often slowing innovation rather than accelerating it.
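As one illustration of what package validation and traceability can mean in practice, the sketch below compares the packages found in an environment against a registry of validated versions. The registry contents, version strings, and function name are hypothetical; a real framework would also record evidence such as test results and approvals.

```python
from typing import Dict, List

# Hypothetical registry of packages that passed validation, with approved versions.
VALIDATED: Dict[str, str] = {"dplyr": "1.1.4", "survival": "3.5-7"}


def audit_environment(installed: Dict[str, str], validated: Dict[str, str]) -> List[str]:
    """Return readable findings for packages that deviate from the validated set."""
    findings = []
    for name, version in sorted(installed.items()):
        approved = validated.get(name)
        if approved is None:
            findings.append(f"{name} {version}: not in validated registry")
        elif approved != version:
            findings.append(f"{name} {version}: validated version is {approved}")
    return findings


# Illustrative snapshot of an environment (for example, parsed from a lockfile).
installed = {"dplyr": "1.1.4", "survival": "3.5-8", "ggplot2": "3.5.0"}
findings = audit_environment(installed, VALIDATED)
```

An empty findings list would indicate that the environment matches the validated set; any entry is a deviation that needs review before the environment is used for regulated work.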

 

Why mid-sized organizations are disproportionately impacted

Mid-sized biotech and pharmaceutical companies face a structural challenge. While regulatory expectations are the same as for large pharma, available resources are not.

Smaller teams must manage validation, infrastructure, and delivery simultaneously, often without dedicated support functions. As a result, system complexity scales faster than internal capacity, directly impacting timelines and limiting the ability to innovate.

 

Different starting points, different challenges

In practice, organizations face different realities depending on their level of SCE maturity:

  • Some lack the infrastructure to support GxP-compliant open-source environments
  • Others have established systems but face integration challenges with external partners
  • A third group is transitioning toward R and multi-language workflows but lacks maturity in governance and tooling

These scenarios require flexible approaches tailored to each organization’s context.

 

Moving toward integrated, multi-language environments

To address fragmentation, many organizations are adopting polyglot SCEs, where SAS and R coexist within unified workflows.

This approach enables greater flexibility while maintaining compliance, ensuring traceability, reproducibility, and smoother collaboration across internal teams and external partners.

 

A practical path forward

Rather than building and maintaining complex infrastructure internally, many organizations are exploring CRO-based service models.

By leveraging GxP-validated environments, sponsors can access production-ready R ecosystems without the burden of developing validation frameworks or managing platform engineering. This approach supports both full outsourcing and hybrid collaboration models, while ensuring alignment with client-specific systems.

 

Final takeaways

The challenge is not adopting R — it is managing the complexity of making it compliant.

Organizations that successfully unlock its value do so by:

  • Addressing GxP requirements early and systematically
  • Adapting approaches to their level of SCE maturity
  • Leveraging integrated, multi-language workflows
  • Exploring service-based models to accelerate adoption

With the right strategy, R becomes not a source of complexity, but a powerful enabler of innovation in clinical development.

 

Interested in learning more?

Join our upcoming webinar, “Navigating GxP Complexity: Unlocking the Value of R,” where we will share practical experience from Cytel’s polyglot SCE, including validation approaches, governance models, and operational best practices.

Register now to learn how to modernize your statistical computing environment — without adding unnecessary complexity.

Why “More Data” Isn’t Helping You Run Better Trials

Clinical Operations teams are being asked to let go of traditional approaches and do more than ever before:

Deliver more complex trials, faster — with fewer resources — and higher confidence in outcomes.

And how has the industry responded?

With a proliferation of data access, tools, and dashboards. But does a dashboard really help navigate complexity with speed and well-managed risk? No.

Let’s discuss the methods and tools that help turn this complexity into clarity.

 

The problem isn’t just complexity — It’s information overload

Clinical trials have changed dramatically:

  • 7x increase in data points
  • 4x increase in data sources
  • Increasing reliance on external data, RWE, and predictive modeling

Yet you are often still expected to manage a study across multiple systems, spreadsheets, and trackers: CTMS, EDC, RBQM dashboards, query reports, enrollment trackers, deviation logs, and monitoring reports.

None of these disparate sources of information tells the whole story, and every critical study execution decision you make is plagued by data gaps, inconsistencies or discrepancies, and latency issues.

How, then, can we consolidate and automate our use of the data to make timely decisions that we trust? There are certainly technology stacks that large organizations license and deploy. But what happens when you can’t afford them? One option is to partner with a data management and biometrics specialty provider that understands what you are up against, knows the data and what it takes to deliver a study successfully, and offers solutions that help heads of clinical operations gain control at a price they can afford.

Tools that actually make a difference offer:

  • Actionable insights, not static reports
  • Continuous visibility, not retrospective analysis
  • Aligned teams, not handoffs

 

Central statistical monitoring: Detecting emerging risks early

Early intervention is key to managing trial risks and ensuring reliable results. As clinical trials grow in complexity, data quality and patient safety can no longer be assured through system reports alone. And with evolving regulatory expectations, trial budget pressures, and the need for earlier, more objective insights into emerging risks, central statistical monitoring (CSM) has become a critical component of modern trial oversight.

Tools such as Cytel’s Cytelytics can use statistical methods to identify trends, detect risks, and optimize source data verification efforts.
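To make the idea of statistical risk detection concrete, here is a minimal sketch of one generic technique: flagging sites whose value for a metric sits far from the cross-site median, using a robust z-score based on the median absolute deviation. It is an illustration under stated assumptions (site IDs, metric, and threshold are hypothetical) and does not describe how any particular product works.

```python
from statistics import median
from typing import Dict, List


def flag_outlying_sites(site_values: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Flag sites whose metric is far from the cross-site median (robust z-score)."""
    values = list(site_values.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread across sites; this method cannot rank them
    flagged = []
    for site, value in site_values.items():
        # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
        robust_z = 0.6745 * (value - center) / mad
        if abs(robust_z) > threshold:
            flagged.append(site)
    return sorted(flagged)


# Hypothetical adverse events reported per patient-visit, by site.
ae_rates = {"S01": 0.21, "S02": 0.19, "S03": 0.22, "S04": 0.20, "S05": 0.02}
flagged = flag_outlying_sites(ae_rates)
```

In this made-up example, site S05 reports far fewer adverse events than its peers, a pattern that could indicate under-reporting and would justify targeted follow-up rather than uniform source data verification.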

Regulatory agencies now treat audit trail data with the same level of scrutiny as clinical data and expect proactive, ongoing audit trail reviews. Manual approaches are time sinks and can introduce unnecessary risk. Tools like Cytel’s Audit Detective enhance compliance and data integrity by identifying inconsistencies, unauthorized access, and unusual activity patterns in audit trails.
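A rule-based audit trail check can be sketched in a few lines. The entry structure, user names, and the expected-hours window below are hypothetical, and real reviews would apply many more rules; the point is only to show how entries can be screened automatically and routed to a human reviewer.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set, Tuple


@dataclass(frozen=True)
class AuditEntry:
    """One audit trail record (hypothetical, simplified structure)."""
    user: str
    action: str
    timestamp: datetime


def review_audit_trail(
    entries: List[AuditEntry],
    authorized_users: Set[str],
    earliest_hour: int = 6,
    latest_hour: int = 22,
) -> List[Tuple[AuditEntry, str]]:
    """Return (entry, reason) pairs for entries that merit human review."""
    findings = []
    for entry in entries:
        if entry.user not in authorized_users:
            findings.append((entry, "user not on authorized list"))
        elif not earliest_hour <= entry.timestamp.hour < latest_hour:
            findings.append((entry, "activity outside expected hours"))
    return findings


entries = [
    AuditEntry("monitor_a", "update value", datetime(2025, 3, 4, 10, 15)),
    AuditEntry("unknown_x", "delete record", datetime(2025, 3, 4, 11, 0)),
    AuditEntry("monitor_a", "update value", datetime(2025, 3, 5, 2, 30)),
]
findings = review_audit_trail(entries, authorized_users={"monitor_a"})
```

Each finding carries its reason, so the reviewer sees why an entry was raised and can dismiss legitimate exceptions quickly.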

 

Better data visualization: Driving decisions, not just reporting

Traditional reports tell you what happened. Modern visualization:

  • Links operational metrics to clinical outcomes
  • Allows drill-down from summary to patient level
  • Highlights where intervention changes the outcome

Tools like Cytel’s ClinCytesDV provide interactive graphs, tables, and listings, layering data together to tell a richer story.

 

Data management: Operating environments that drive speed and quality

Data ingestion, cleaning, reconciliation, and reporting should not operate in lockstep. A modern approach:

  • Automates data ingestion across sources (EDC, RWD, wearables)
  • Standardizes data structures (CDISC, OMOP, FHIR)
  • Enables real-time cleaning and review

The result is better data processing, reduced site burden, faster lock — and less firefighting. This is the difference between oversight and control.
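The three steps in the list above can be sketched together: records from different sources are mapped onto one common layout and checked as they arrive. The field names, mappings, and plausibility limits are hypothetical and deliberately simple; they are not an actual CDISC, OMOP, or FHIR mapping.

```python
from typing import Any, Dict, List

# Hypothetical per-source field mappings onto one common record layout.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "edc": {"subj": "subject_id", "hr_bpm": "heart_rate"},
    "wearable": {"participant": "subject_id", "pulse": "heart_rate"},
}


def standardize(source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename source-specific fields to the common layout and tag the origin."""
    mapping = FIELD_MAPS[source]
    standardized = {target: record[raw] for raw, target in mapping.items() if raw in record}
    standardized["source"] = source
    return standardized


def find_issues(record: Dict[str, Any]) -> List[str]:
    """Run simple checks at ingestion time instead of waiting for a batch review."""
    issues = []
    if "subject_id" not in record:
        issues.append("missing subject_id")
    heart_rate = record.get("heart_rate")
    if heart_rate is None:
        issues.append("missing heart_rate")
    elif not 20 <= heart_rate <= 250:  # illustrative plausibility limits
        issues.append(f"heart_rate out of range: {heart_rate}")
    return issues


edc_row = standardize("edc", {"subj": "001", "hr_bpm": 72})
wearable_row = standardize("wearable", {"participant": "001", "pulse": 310})
```

Once both sources share one layout, the same checks apply to every record, which is what makes cleaning during ingestion feasible.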

 

Final takeaways

The answer isn’t more dashboards, systems, or data. It is the methods and tools that result in fewer reconciliations across systems, earlier visibility into risks, and faster decisions with higher confidence, so that you spend less time managing the process and more time managing the study.

Building a New Evidence Base for Rare Diseases by Structuring Clinical Narratives with Generative AI

Rare diseases present a paradox in modern healthcare. Individually, they affect small populations, yet collectively they impact millions of patients worldwide. Despite this, progress in diagnosis, treatment, and research remains slow. The fundamental challenge is not only scientific complexity but also a persistent lack of usable data.

Traditional sources of real-world data — electronic health records, claims databases, and clinical trials — struggle to capture rare disease populations at a meaningful scale. Patients are geographically dispersed, frequently misdiagnosed, and often excluded from structured datasets. As a result, generating robust evidence in rare diseases remains difficult.

At the same time, an overlooked resource has quietly accumulated over decades: clinical case reports. These narratives contain detailed descriptions of real patients, their symptoms, diagnostic journeys, and outcomes. The challenge has never been their value, but rather their accessibility and structure.

Recent advances in large language models (LLMs) suggest that this barrier may finally be overcome.

 

Case reports as a foundation for real-world evidence

Case reports represent one of the richest forms of clinical documentation available. Unlike structured datasets, they capture the full nuance of patient care, including symptom evolution, diagnostic uncertainty, and physician reasoning. They are inherently real-world, reflecting how diseases actually present and are managed in practice.

However, their utility has historically been limited. Case reports are written in free text, scattered across millions of publications, and lack standardization. Extracting meaningful insights at scale has required significant manual effort, making systematic use impractical.

The RareArena study demonstrates a new approach. By leveraging LLMs, researchers were able to automatically collect and process hundreds of thousands of case reports from PubMed, filter them for rare diseases, and transform them into a structured dataset comprising tens of thousands of patient cases. This process effectively converts unstructured clinical narratives into analyzable real-world data.

This shift is significant. It reframes case reports not as isolated anecdotes, but as components of a scalable data asset.

 

From unstructured text to scalable patient populations

One of the most important implications of this approach is the ability to expand patient populations in rare disease studies. Traditional datasets are constrained by institutional boundaries and data availability. In contrast, case reports aggregate knowledge globally, capturing patients from diverse healthcare systems and settings.

By structuring these reports, LLMs enable the creation of virtual cohorts that far exceed what any single registry or database could provide. Diagnoses can be standardized using reference ontologies, symptoms can be normalized, and cases can be grouped into clinically meaningful categories.
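The standardization and grouping step can be sketched as follows. The synonym table and the `ONT:` identifiers are placeholders invented for illustration; a real pipeline would map to an established rare disease ontology and would not rely on exact string matching alone.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical synonym table mapping free-text diagnoses to placeholder concept IDs.
SYNONYMS: Dict[str, str] = {
    "fabry disease": "ONT:0001",
    "anderson-fabry disease": "ONT:0001",
    "pompe disease": "ONT:0002",
}


def normalize_diagnosis(text: str) -> Optional[str]:
    """Map a free-text diagnosis to a concept ID, ignoring case and extra spaces."""
    key = " ".join(text.lower().split())
    return SYNONYMS.get(key)


def build_cohorts(cases: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group case IDs by standardized diagnosis; unmapped cases are kept visible."""
    cohorts: Dict[str, List[str]] = defaultdict(list)
    for case in cases:
        concept = normalize_diagnosis(case["diagnosis"]) or "UNMAPPED"
        cohorts[concept].append(case["case_id"])
    return dict(cohorts)


cases = [
    {"case_id": "c1", "diagnosis": "Fabry disease"},
    {"case_id": "c2", "diagnosis": "Anderson-Fabry  disease"},
    {"case_id": "c3", "diagnosis": "Pompe disease"},
    {"case_id": "c4", "diagnosis": "unclear myopathy"},
]
cohorts = build_cohorts(cases)
```

Keeping an explicit `UNMAPPED` group, rather than silently dropping cases, makes gaps in the synonym table easy to spot and fix.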

The RareArena dataset, for example, spans thousands of rare diseases and tens of thousands of patient cases, representing one of the broadest collections of rare disease data assembled to date. This kind of scale opens new possibilities for understanding disease heterogeneity, identifying subpopulations, and generating evidence where none previously existed.

In effect, LLMs allow researchers to move from fragmented observations to aggregated real-world populations.

 

Capturing the diagnostic journey

A particularly valuable aspect of the RareArena framework is its alignment with real clinical workflows. The dataset distinguishes between two stages of diagnosis: early suspicion based on symptoms alone, and confirmation after diagnostic testing.

This distinction mirrors how rare diseases are encountered in practice. Patients often experience long diagnostic odysseys, with years passing before a correct diagnosis is reached. By separating these stages, the dataset captures both the uncertainty of early presentation and the clarity provided by confirmatory tests.

This structure enables deeper analysis of diagnostic pathways, including where delays occur and how different signals contribute to clinical decision-making. It also provides a foundation for developing tools that support earlier recognition of rare diseases, an area where unmet need remains substantial.
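One way to represent the two diagnostic stages is to store each structured case once and derive two views from it. The class, field names, and example case below are hypothetical and are not taken from the RareArena dataset itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StructuredCase:
    """A case report after structuring (hypothetical, simplified layout)."""
    case_id: str
    symptoms: List[str]
    test_results: List[str]
    final_diagnosis: str  # the reference answer, withheld from both views


def screening_view(case: StructuredCase) -> Dict[str, object]:
    """Early suspicion: only what is known from symptoms alone."""
    return {"case_id": case.case_id, "findings": list(case.symptoms)}


def confirmation_view(case: StructuredCase) -> Dict[str, object]:
    """Confirmation: symptoms plus diagnostic test results."""
    return {
        "case_id": case.case_id,
        "findings": list(case.symptoms) + list(case.test_results),
    }


# Illustrative case, written for this example.
case = StructuredCase(
    case_id="c1",
    symptoms=["acroparesthesia", "angiokeratoma"],
    test_results=["reduced alpha-galactosidase A activity"],
    final_diagnosis="Fabry disease",
)
early = screening_view(case)
confirmed = confirmation_view(case)
```

Deriving both views from the same record makes it straightforward to compare how much diagnostic testing adds beyond the initial presentation.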

 

Preserving clinical complexity in real-world data

A common limitation of many real-world datasets is the loss of clinical nuance. Structured data often simplifies patient information, omitting negative findings, confounding symptoms, and contextual details that are critical for diagnosis.

Case reports, by contrast, preserve this complexity. The RareArena study shows that most cases retain features such as negative symptoms and confounding factors, reflecting the challenges physicians face in real-world settings. This makes the resulting dataset not only large, but also clinically realistic.

Maintaining this level of detail is essential for rare diseases, where subtle distinctions can significantly alter diagnosis and treatment. LLMs play a key role here by rephrasing and structuring text while preserving the underlying clinical information.

The result is a form of real-world data that is both scalable and rich in context.

 

Implications for research and clinical development

The ability to generate structured datasets from case reports has far-reaching implications. For researchers, it enables the study of rare diseases across larger and more diverse populations than previously possible. Patterns of presentation, progression, and response to treatment can be explored with greater statistical power.

In clinical development, this approach offers new ways to identify and characterize patient populations. It can support the design of clinical trials by highlighting underrepresented groups and informing inclusion criteria. It also provides a potential source of external evidence, complementing traditional trial data.

Beyond research, there is a clear opportunity to improve clinical decision support. The RareArena study demonstrates that LLMs already show meaningful capability in diagnosing rare diseases, particularly when provided with comprehensive clinical information. While not yet sufficient for standalone use, these models can assist clinicians by surfacing relevant diagnostic possibilities.

 

Limitations and considerations

Despite its promise, this approach is not without limitations. Case reports are inherently selective, often focusing on unusual or severe presentations. This introduces potential bias in the resulting datasets. Additionally, the data is retrospective and curated, rather than continuously collected.

LLMs themselves introduce another layer of complexity. While they are effective at extracting and structuring information, they can also propagate errors or introduce subtle inaccuracies. Ensuring data quality and validation remains critical.

The RareArena study also highlights that even the most advanced models are far from perfect in diagnostic tasks, particularly in early-stage scenarios. This reinforces the need to view these tools as augmentative rather than autonomous.

 

A shift from data scarcity to data unlocking

What emerges from this work is a broader shift in how we think about data in rare diseases. The challenge is no longer solely about collecting new data, but about unlocking the value of existing information.

Case reports represent decades of accumulated clinical knowledge. With LLMs, it becomes possible to systematically extract, structure, and scale that knowledge into usable real-world data. This approach does not replace traditional data sources, but it significantly expands the available evidence base.

For rare diseases, where every patient case is valuable, this shift is particularly impactful.

 

Toward a more complete picture of rare diseases

The combination of case reports and large language models offers a compelling new pathway for advancing rare disease research. By transforming unstructured narratives into structured datasets, it enables the creation of larger, more representative patient populations and more realistic models of clinical care.

While challenges remain, the potential is clear. This approach can accelerate diagnosis, inform clinical development, and ultimately contribute to better outcomes for patients who have long been underserved.

In a field defined by scarcity, the ability to unlock hidden data may prove to be one of the most important innovations yet.