
Rethinking Evidence in Rare Disease Research: A Case Study Using Propensity Score Methods

Rare diseases pose unique challenges for researchers and clinicians. Due to small patient populations, conducting randomized controlled trials (RCTs) is often impractical or ethically difficult. As a result, observational data becomes a key source of evidence.

In the landscape of rare disease, data is both our most precious resource and our greatest challenge. For conditions like Infantile-Onset Pompe Disease (IOPD), the journey from the first life-saving Enzyme Replacement Therapy (ERT) to the next generation of optimized treatments is rarely a path free of challenges. It is a path marked by small patient populations, high clinical variability, and the heavy weight of every data point.

The difficulty in rare disease research often lies in the “how”: How do we prove a new therapy is truly superior when baseline functional levels vary so wildly? How do we ensure that a single data entry error doesn’t mask a breakthrough or suggest a false decline?

In this blog, we explore how propensity score methods can be used to estimate treatment effectiveness in a rare disease setting through a real world–inspired case study.

In this case study, we pull back the curtain on the analytical rigor required to compare motor function trajectories in IOPD. From Propensity Score Matching to “red-flag” data auditing, we explore how sophisticated analysis turns fragmented data into a clear roadmap for the future of neuromuscular treatment.

 

Case study: Advancing motor function outcomes in IOPD

The evolution from first-generation drug to next-generation drug

Infantile-Onset Pompe Disease (IOPD) is a rare, progressive neuromuscular disorder. While the first generation of ERT revolutionized survival, the quest for superior motor function remains the “North Star” for researchers. This study compares longitudinal motor outcomes between the First-Generation Drug and Next-Generation Drug cohorts using the Gross Motor Function Measure (GMFM-88).

 

The challenge: Comparing across clinical trials

Comparing results from different studies requires more than just looking at averages; it requires accounting for the inherent variability in how patients present at baseline. To test the hypothesis that the Next-Generation Drug offers a superior motor trajectory, we implemented a rigorous three-tier analytical approach.

 

A three-tier analytical approach

1. The power of precise matching

To ensure an “apples-to-apples” comparison, we restricted the analysis to patient pairs matched by both age and baseline functional level.

  • The criteria: Matches were strictly filtered to those within a ±13-point window of the GMFM-88 raw score (rather than a percentage).
  • The goal: By tightening these parameters, we eliminated “baseline noise,” allowing the true pharmacological impact of the treatment to surface in the longitudinal graphs.
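The pairing rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the study's actual code; the ±13-point score window comes from the criteria above, while the 0.5-year age tolerance and the field names are assumptions.

```python
from itertools import product

def match_pairs(first_gen, next_gen, score_window=13, age_window=0.5):
    """Pair patients across cohorts when both baseline age and GMFM-88
    raw score fall within the allowed windows. The 13-point score window
    comes from the study criteria; the 0.5-year age tolerance and the
    dict layout are illustrative assumptions."""
    pairs = []
    for a, b in product(first_gen, next_gen):
        if (abs(a["age"] - b["age"]) <= age_window
                and abs(a["gmfm88"] - b["gmfm88"]) <= score_window):
            pairs.append((a["id"], b["id"]))
    return pairs

first_gen = [{"id": "F1", "age": 1.0, "gmfm88": 40},
             {"id": "F2", "age": 2.5, "gmfm88": 90}]
next_gen = [{"id": "N1", "age": 1.2, "gmfm88": 50},
            {"id": "N2", "age": 2.4, "gmfm88": 70}]
print(match_pairs(first_gen, next_gen))  # [('F1', 'N1')]
```

Only the F1–N1 pair survives both filters: the other candidate pairs differ by more than half a year in age or more than 13 raw-score points at baseline.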

 

2. Data integrity: Investigating the “jumps and drops”

In rare disease registries, a single data point can skew an entire trajectory. Our team conducted a “deep dive” into five specific patient profiles that exhibited extreme volatility — marked by sharp drops or vertical jumps in scores.

Expert insight: A drop to zero isn’t always a clinical decline; often, it’s a data entry artifact where a missing value was defaulted to ‘0.’ By identifying and correcting these anomalies, we ensure the motor trajectory reflects biology, not a spreadsheet error.
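A simple programmatic screen for this kind of artifact might look like the following sketch. It flags any visit where the score falls to zero after a substantial prior score; the 30-point floor and the data layout are illustrative assumptions, not the study's actual audit rule.

```python
def flag_zero_artifacts(trajectory, prior_floor=30):
    """Flag visits where the score falls to 0 despite a substantial prior
    score: a pattern more consistent with a missing value defaulted to '0'
    than with genuine clinical decline. The 30-point floor is an
    illustrative assumption. trajectory: chronological (visit, score) pairs."""
    return [v_cur for (_, s_prev), (v_cur, s_cur)
            in zip(trajectory, trajectory[1:])
            if s_cur == 0 and s_prev >= prior_floor]

scores = [("V1", 55), ("V2", 60), ("V3", 0), ("V4", 62)]
print(flag_zero_artifacts(scores))  # ['V3']: likely a data entry artifact
```

A flag like this does not correct anything on its own; it simply routes the suspect visit to a human reviewer, which is exactly the "deep dive" described above.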

 

3. Sophisticated balancing: Propensity Score Matching (PSM)

Propensity score methods help simulate a randomized experiment by balancing observed characteristics between treated and untreated groups.

To further validate our findings, we moved beyond simple matching to Propensity Score Matching. This statistical technique allows us to predict a patient’s likelihood of being in a specific treatment group based on their baseline characteristics, effectively “balancing” the two groups.

 

Key covariates included:

  • Baseline status: Age and GMFM-88 total raw score.
  • Clinical history: Age at diagnosis and age at start of ERT.
  • Biological markers: CRIM status (Cross-Reactive Immunologic Material) and LVMI (Left Ventricular Mass Index) z-scores.
  • Treatment variables: Specific enzyme dosage levels.
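To make the mechanics concrete, here is a minimal, self-contained sketch of propensity score matching: a logistic regression estimates each patient's probability of being in the treatment group from baseline covariates, then treated and control patients are greedily paired on that score within a caliper. In practice this would be done with an established package (e.g. MatchIt in R); the hand-rolled gradient descent, the caliper width, and the toy data are all assumptions for illustration.

```python
import math

def fit_propensity(X, treated, lr=0.1, steps=2000):
    """Fit a plain logistic regression P(treated | covariates) by batch
    gradient descent and return a scoring function.
    X: list of covariate vectors (assumed pre-standardized); treated: 0/1.
    Illustrative only; a production analysis would use a vetted package."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)                            # [intercept, coefficients...]
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, treated):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus actual
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    def score(x):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
        return 1.0 / (1.0 + math.exp(-z))
    return score

def nearest_neighbor_match(treated_ps, control_ps, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching on the propensity score,
    within an (assumed) caliper. Returns (treated_idx, control_idx) pairs."""
    used, pairs = set(), []
    for i, st in enumerate(treated_ps):
        candidates = [j for j in range(len(control_ps)) if j not in used]
        if not candidates:
            break
        best = min(candidates, key=lambda j: abs(control_ps[j] - st))
        if abs(control_ps[best] - st) <= caliper:
            used.add(best)
            pairs.append((i, best))
    return pairs

# Toy usage: covariates = [standardized age, standardized baseline GMFM-88]
X = [[-1.0, -0.8], [-0.4, 0.1], [0.3, -0.2], [1.1, 0.9]]
treated = [0, 0, 1, 1]
score = fit_propensity(X, treated)
ps = [score(x) for x in X]                         # propensity scores in (0, 1)
pairs = nearest_neighbor_match(ps[2:], ps[:2], caliper=0.5)
```

After matching, one would check covariate balance (e.g. standardized mean differences) before comparing outcomes; balance diagnostics, not the matching itself, are what justify the "simulated randomization" claim.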

 

Why this matters for the rare disease community

This case study demonstrates that in the world of rare diseases, how we analyze data is as important as the data itself. By correcting for entry errors and using high-fidelity matching, we can more clearly see if the next-generation drug truly provides the “superior trajectory” hypothesized.

 

Precision analytics as a catalyst for care

By applying high-fidelity matching and propensity score modeling, we move beyond “average” results to understand the true potential of new interventions. Furthermore, our dedication to data integrity — manually investigating anomalies and “red-arrow” outliers — ensures that our conclusions are built on a foundation of clinical reality rather than administrative error.

Ultimately, this study reinforces that in the fight against rare diseases, data is our most powerful ally. When we refine our lens through rigorous matching and clean data, the path toward better motor function and brighter futures for IOPD patients becomes clearer than ever.

A Preview of Cytel’s Contributions at PHUSE EU 2025

I can’t believe it has already been a year since we wrapped up PHUSE EU Connect 2024, and in two weeks we will be gathering for another exciting PHUSE EU Connect conference, only a few kilometers from Heidelberg, where everything started twenty years ago with the very first PHUSE event. I was one of the couple of hundred lucky attendees and now, twenty years later, I have the great honor of supporting Jennie McGuirk and Jinesh Patel as Conference Co-chair for this year’s edition.

With a promising agenda featuring about 190 presentations, 34 posters, 9 hands-on workshops, 2 panel discussions, and 3 inspiring keynote speakers, this year we are going to the city of Hamburg for the 21st PHUSE EU Connect. The agenda is full of topics looking toward the future, with about 40 talks and posters referring to AI in their titles, and once again open source will be the confirmed leitmotif.

Cytel will make a significant contribution this year, perhaps more than ever, with six presentations, one poster, active participation in both panel discussions, and co-chairing the “Scripts, Macros and Automation” and “People Leadership & Management” streams.

 

Monday topics: Agile code writing, extracting metadata from R OOP functions, and leadership

The week kicks off on Monday with Kamil Foltynski, who will present “Overcoming Challenges in Collaborative Spreadsheet Editing with Shiny, SpreadJS and JSON-Patch” in the Application Development stream at 11:30 am. Kamil will provide a technical deep dive into enabling real-time spreadsheet editing within Shiny applications, using tools such as SpreadJS, and sharing key lessons learned so far. Following Kamil’s presentation, Eswara Satyanarayana Gunisetti will present “Micro-Decisions, Macro Impact: The Role of Agile Thinking in Every Line of Code” in the “Coding Tips & Tricks” stream at 12 pm. See his recent blog on the topic. Eswara will share how an agile “mindset” can positively influence the way we write code.

In the same stream, a few hours later at 2 pm, another colleague, Edward Gillian, in collaboration with Sanofi, will present “Risk.assessr: Extracting OOP Function Details,” discussing strategies for extracting metadata from R Object-Oriented Programming functions. Prior to Eswara and Edward’s sessions, at 1:30 pm, Kath Wright will moderate the interactive People Leadership & Management session “Invisible Glue: Trust, Influence and The Architecture of Teamwork.” In this live workshop, attendees will engage in practical exercises to learn how to identify barriers to trust, evaluate influence dynamics, and apply evidence-based strategies to strengthen collaboration in both physical and virtual environments.

 

Tuesday topics: Industry trends, extracting macro usage and dependency information from SAS programs, and integrating ECA data into CDISC-compliant datasets

Tuesday also brings two presentations and one poster. Right after lunch at 1:30 pm, Cedric Marchand will join other industry leaders in the panel discussion “Reimagining Statistical Programming: AI, Standards & the Talent of Tomorrow.” The panel will explore how current industry trends, such as AI, open source, and the evolution of data standards, will influence the next generation of statistical programmers.

The afternoon continues at 4 pm with my young and talented colleague Marie Poupelin, who will present “From Zero to Programming Hero: How Internships Shape Statistical Programmers in a CRO” in the “Professional Development” stream. Marie is a great example of the success of our internship program, and she will share her journey from having “zero” statistical programming experience to becoming an industry-ready programmer. Thirty minutes later, at 4:30 pm, Guido Wendland will present “Which Macros Are Used in the Study?” in the “Scripts, Macros and Automation” stream, a stream co-led this year for the first time by my colleague Sebastià Barceló. Guido will discuss techniques to extract macro usage and dependency information from SAS programs; this is particularly useful for identifying potential issues or estimating the impact of macro updates.

Later, in the traditional Tuesday evening poster session, you can join my colleague Cyril Sombrin in discussing “Our Journey in Integrating External Control Arms (ECAs) and RWD for Rare Disease Trials.” There you can discuss real-world case studies on integrating ECA data into CDISC-compliant datasets, exploring the unique challenges and solutions when aligning real-world data with CDISC standards.

 

Wednesday topics: Real-time spreadsheet editing within Shiny applications and real-time validation and streamlined submissions

On Wednesday at 12 pm, Hugo Signol, another young talented Cytel statistical programmer and a product of our internship program, will present his talk “From XPT to Dataset-JSON: Enabling Real-Time Validation and Streamlined Submissions.” Building on Cytel’s experience from the CDISC Dataset-JSON-Viewer Hackathon, Hugo will demonstrate a Shiny application that supports interactive exploration and real-time validation through API-based checks.

 

Meet us there!

Cytel will be at Booth 9 at the conference, where you can engage in discussions with our team or meet any of us throughout the week.

I hope I didn’t miss anyone, or anything! We look forward again to reuniting with colleagues and friends from around the world and making new acquaintances.

See you all in Hamburg!

Micro-Decisions, Macro Impact: Cultivating an Agile Mindset in Every Line of Statistical Code

Statistical programming is a cornerstone of clinical research, converting raw data into the standard datasets, tables, listings, and figures (TLFs) that support decision-making, regulatory submissions, and publications.

Traditional workflows often limit collaboration, adaptability, and early input from programmers. As timelines shrink and expectations grow, it’s clear that a new way of thinking is needed: one that goes beyond efficiency into adaptability, collaboration, and value creation.

In clinical statistical programming, agility isn’t only about sprints or ceremonies; it starts with the smallest choices we make at the keyboard.

 

Every day, statistical programmers make hundreds of tiny decisions, such as:

  • How to name a variable
  • How to design a macro
  • How to structure a dataset

Most of these choices happen quietly, almost on autopilot. Yet together, they define:

  • How flexible our studies are
  • How easily we can adapt to change
  • How smoothly teams can collaborate

 

These small choices (micro-decisions), multiplied across teams and studies, drive what I call a macro impact.

 

Agility at the code level

Agile thinking refers to building programs with change in mind, favoring adaptability over perfection, and prioritizing clarity and consistency over clever shortcuts. These ideas might sound subtle, but together, they create the difference between rigid code and resilient code.

Programmers can apply agile thinking directly at the code level, through clarity, simplicity, adaptability, and value orientation.

Habits like intentional naming, smart commenting, modular macros, and built-in quality checks make code more resilient and teams more responsive to change.

 

Agility at the code level shows up in many subtle but powerful ways:

  • Intentional naming makes programs self-explanatory and audit ready.
  • Smart commenting tells the why, not just the how.
  • Scalable macros turn adaptability into a default setting.
  • Readable structures make collaboration effortless.
  • Built-in quality checks turn QC from a final gate into a shared rhythm.
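As a small illustration (not actual Cytel code, and in Python rather than SAS for brevity), the habits above might look like this in practice: an intention-revealing name, a comment that records the why, and checks that fail loudly instead of letting bad data propagate.

```python
def derive_change_from_baseline(records):
    """Compute change from baseline for each post-baseline visit.

    WHY: the derivation lives in one place so every table pulls the same
    logic; adapting it later touches one function, not a dozen programs.
    (Illustrative sketch, not a production macro.)
    records: list of dicts with 'visit' and 'value'; first record is baseline.
    """
    if not records:
        raise ValueError("no records supplied")        # built-in quality check
    baseline = records[0]["value"]
    if baseline is None:
        raise ValueError("baseline value is missing")  # fail loudly, not silently
    return [{"visit": r["visit"], "chg": r["value"] - baseline}
            for r in records[1:]]

visits = [{"visit": "BL", "value": 10.0},
          {"visit": "W4", "value": 12.5}]
print(derive_change_from_baseline(visits))  # [{'visit': 'W4', 'chg': 2.5}]
```

The same habits translate directly to SAS macros or R functions: a parameterized, well-named unit with its rationale documented and its preconditions asserted.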

 

When practiced consistently, these habits turn teams into systems that learn, adapt, and deliver faster with accuracy and compliance.

 

Thinking differently

This isn’t about doing more; it’s about thinking differently while doing what we already do.

Proven models like Kaizen and the Toyota Lean philosophy demonstrate that continuous improvement, a culture of cooperation, and the elimination of waste allow us to deliver maximum value to customers while building on existing processes and retaining what we have already learned. Through the lens of these philosophies, we see how small enhancements in daily programming can scale into major gains in collaboration, reuse, and efficiency.

 

Final takeaways

Code is communication. Every variable, macro, and comment is a message to a collaborator, a regulator, or your future self.

Let every line you write carry clarity. Let every structure you build invite change. Let every decision reflect agility. That’s the path from micro-decisions to macro impact.

 

Interested in learning more?

Eswara Gunisetti will be at PHUSE EU Connect 2025 to present “Micro-Decisions, Macro Impact: The Role of Agile Thinking in Every Line of Code.” Discover how every line of code can contribute to a more adaptive, transparent, and rewarding way of working, where agility lives not just in our processes, but in our programming decisions themselves.

Register below to book a meeting or visit Booth 9 to connect with our experts:

Streamlining Data Management and Improving Statistical Accuracy in Clinical Trials with AI

As clinical trials grow increasingly complex, smarter, faster, and more efficient data processes and analyses are in demand. Artificial intelligence (AI) has emerged as a powerful tool, especially in programming and data management. For clinical trial professionals, AI offers the promise of streamlining workflows, improving data quality, and reducing time to database lock.

 

The evolving role of AI in clinical data programming

AI is not replacing clinical programmers; it’s augmenting them. AI should be considered a tool to use within clinical trials, just as EDC and SAS are commonly used tools. Automation tools driven by machine learning can now handle routine, rules-based programming tasks such as edit check generation, derivation logic, and data transformation. This allows programmers to focus on more strategic activities like validating statistical code or optimizing data pipelines. AI needs the expertise of our clinical trial professionals.

Natural Language Processing (NLP) is also making great progress. For instance, NLP can interpret free-text protocol documents to auto-generate specifications, electronic case report form (eCRF) templates, or even suggest initial SDTM mappings, significantly reducing manual effort.

 

AI in data cleaning and quality oversight

Traditionally, data cleaning has been labor-intensive, with data managers manually reviewing queries, data listings, and edit checks across multiple data sources and systems. AI tools can now proactively flag anomalies or data trends that human review might miss, such as unexpected patterns in lab values, inconsistencies across visits, or possible fraudulent data across participants and sites.
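As a deliberately simple stand-in for such tooling, even a rule-based screen can flag grossly unexpected lab values; real AI tools go well beyond this, but the sketch below shows the basic shape of automated anomaly flagging. The 3-standard-deviation cutoff is an illustrative assumption.

```python
import statistics

def flag_lab_anomalies(values, z_cut=3.0):
    """Return indices of lab values whose z-score exceeds the cutoff.
    A deliberately simple, rule-based stand-in for ML-driven anomaly
    detection; the 3-SD cutoff is an illustrative assumption."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > z_cut]

baseline = [100.0] * 19                        # stable lab values across visits
print(flag_lab_anomalies(baseline + [500.0]))  # [19]: the 500 stands out
```

The value of ML-based approaches over a fixed rule like this is that they can learn visit-to-visit and site-to-site patterns, catching inconsistencies a single-variable threshold would miss.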

Predictive models can help identify study participants at high risk of dropout or noncompliance, enabling earlier intervention. This not only improves data completeness but also enhances trial efficiency and participant retention. The effort and cost of replacing clinical trial participants is significant and felt across all stakeholders. Improving the patient experience is a significant way to save time and money and to accelerate progress.

 

AI in statistical programming: From code automation to advanced insights

Statistical programming is central to clinical trial analysis, from producing tables, listings, and figures (TLFs) to preparing submission-ready datasets. Traditionally reliant on manual coding in SAS or R, this work is now gaining speed, consistency, and quality through AI augmentation.

 

Where AI adds value in statistical programming

  • Automated code generation: AI models trained on historical programming logic can produce initial SAS macros or R scripts for common TLFs and dataset derivations. These drafts can accelerate development by an estimated 40–60%, freeing programmers and biostatisticians to focus on complex analyses and interpretation.
  • Code review and validation: AI-assisted tools can scan code for logic errors, inefficiencies, redundant steps, and deviations from programming standards. Acting as a “second reviewer,” they flag potential issues early and suggest optimizations.
  • Dynamic statistical modeling: AI algorithms can rapidly explore large trial datasets to detect subgroup effects, anomalies, or emerging trends. When guided by statistical oversight, these insights can refine analysis plans and support adaptive trial decisions.

The aim is not to replace human judgment, but to boost productivity, reproducibility, and the speed of insight generation, without compromising scientific rigor.

 

AI in biostatistics: Powering smarter, more adaptive clinical trials

Biostatistics remains the foundation of evidence generation in clinical trials, providing the methodological rigor to transform raw data into reliable conclusions. In the context of AI, biostatisticians play a dual role: safeguarding scientific validity while leveraging new computational tools to enhance insight generation. This requires a careful balance between deep domain knowledge and technical proficiency in emerging AI-driven methodologies. From applying knowledge graphs (KGs) to map complex biomedical relationships, to developing predictive models that anticipate trial outcomes, biostatistics is evolving into a more dynamic and interconnected discipline.

 

Where AI adds value in biostatistics

  • Balanced expertise: Integrating statistical theory with AI/ML techniques to ensure robust, interpretable results.
  • Knowledge graph applications: Using KGs to uncover hidden relationships between biomarkers, treatments, and outcomes, supporting hypothesis generation and trial design.
  • Early prediction tools: Building predictive models for recruitment success, dropout risk, and endpoint achievement.
  • Segmentation and personalization: Identifying patient subgroups most likely to benefit from a therapy, improving trial efficiency and precision medicine strategies.
  • Support for registrational trials: Leveraging AI to optimize trial design, stratify patient populations, and run simulations that ensure the study is powered and structured for regulatory success.

 

Regulatory readiness and caution

Despite its promise, AI must be implemented thoughtfully. Regulatory agencies like the FDA are increasingly open to the use of advanced technologies but expect transparency, traceability, and validation. AI-based tools must be auditable and explainable, especially when used in clinical data workflows that feed into regulatory submissions.

 

What’s next?

As AI becomes more embedded in clinical trial ecosystems, we can expect increased integration with EDC systems, CDISC standards, and statistical programming tools. The goal isn’t to eliminate human oversight but to enhance it, allowing clinical data professionals to make faster, better-informed decisions.

 

Final takeaways

AI is reshaping programming and data management in clinical trials. For clinical trial professionals, now is the time to become familiar with these tools, understand their capabilities and limitations, and engage with cross-functional teams to ensure responsible and impactful implementation. Ultimately, our goal is to shorten drug development timelines and improve patient outcomes. With AI, we can be part of the solution to provide improved treatments for patients.

 

Interested in learning more?

Join Steven Thacker, Sheree King, Kunal Sanghavi, and Juan Pablo Garcia Martinez for their upcoming webinar, “How AI Enhances Biometrics Services: Streamlining Data Management and Improving Statistical Accuracy in Clinical Trials” on Thursday, August 28 at 10 am ET:

Moving to Agile: A New Approach to Statistical Programming

Traditional software development has long been characterized by rigid methods that required teams to follow pre-defined processes. The advent of Agile programming revolutionized this by shifting the focus to flexibility, collaboration, and continuous improvement. Unlike traditional methods, Agile embraces change and enables teams to respond quickly to new requirements.

Now, the Agile approach has moved from software development into statistical programming, allowing teams to work in small increments rather than following a linear, pre-planned process. Instead of extensive upfront planning, Agile encourages adaptability and frequent reassessment of project goals.

Here, I discuss Agile methodologies, the benefits and challenges, and invite readers to learn more with our new case study on implementing Agile and Scrum for SAS programming in clinical development.

 

What is Agile programming?

Agile is an iterative project management and development approach that prioritizes flexibility, collaboration, and responsiveness to change. Though originally developed for software engineering, Agile has since gained widespread adoption across various industries, including healthcare and clinical research.

At the heart of Agile is the concept of breaking down complex projects into smaller, manageable units of work, called “sprints,” typically lasting one to four weeks. At the end of each sprint, the team delivers a functional product increment, ensuring continuous feedback and the ability to adjust course as needed.

Key tenets of Agile in statistical programming include:

  • Prioritizing individuals and interactions over processes and tools to foster teamwork and effective communication.
  • Prioritizing customer collaboration over contract negotiation to involve stakeholders throughout the process.
  • Prioritizing responding to change over following a plan to support remaining flexible to evolving needs.

These tenets support incremental delivery of outputs, frequent feedback loops to all programmers, and overall team collaboration.

 

Benefits of Agile programming

Agile methodologies offer numerous additional advantages, making them a preferred choice for modern development teams:

 

Faster delivery times

Agile focuses on small, manageable iterations (sprints), allowing teams to release interim deliverables frequently rather than waiting for the entire product to be complete.

 

Higher customer satisfaction

Continuous delivery and ongoing stakeholder involvement ensure products align with user needs, leading to better adoption and positive feedback.

 

Reduced risk of project failure

By regularly assessing project goals, teams can detect potential issues early and make adjustments before they become costly problems.

 

Agile methodologies

Agile methodologies come in different flavors, each tailored to unique team dynamics and project needs.

 

Scrum

Scrum is one of the most widely used Agile frameworks. It divides development into short cycles called sprints (typically 2 weeks), during which teams work on prioritized tasks. Scrum incorporates daily stand-up meetings and reviews to track progress and remove obstacles.

 

Kanban

Kanban is a visual workflow management system that emphasizes continuous delivery. Teams use a Kanban board to track tasks in various stages (To-Do, In Progress, Completed), ensuring transparency and limiting work in progress to prevent bottlenecks.

 

Extreme Programming (XP)

XP focuses on high-quality development practices like test-driven development (TDD) and continuous integration (CI). It encourages pair programming and frequent code reviews to enhance software quality.

 

Challenges to adopting Agile

While Agile offers many benefits, teams may face challenges when adopting Agile practices. Rapid development cycles can lead to frequent scope changes, making it hard to maintain focus. This can be avoided by clearly defining priorities and using backlog refinement sessions to keep scope manageable.

Additionally, Agile relies heavily on collaboration, but without proper communication, misunderstandings can arise. Strategies for preventing this include encouraging daily stand-ups, using standard project management tools, and fostering a culture in which open commentary is encouraged.

Finally, transitioning to Agile can be difficult, especially in organizations accustomed to traditional methods. But a gradual approach to this new methodology is warranted: provide Agile training, start with pilot projects, and celebrate early wins to build confidence.

 

Final takeaways

Agile programming is more than just a methodology — it’s a mindset that promotes adaptability, efficiency, and collaboration. By embracing Agile, teams can deliver high-quality software faster while continuously improving their processes. Whether you’re a startup or an enterprise, adopting Agile can lead to better productivity and customer satisfaction.

 

Interested in learning more?

Download our new white paper that provides a detailed case study on implementing Agile and Scrum for SAS programming in clinical development.

Expediting the Regulatory Submission Process with Automated Tools

In the biopharmaceutical industry, expediting regulatory submissions is crucial for timely access to life-saving medications. As a statistical programming team, our role involves accelerating the drug approval process by meticulously preparing Electronic Common Technical Document (eCTD) packages, including the statistical review and programming process of mapping SDTM, deriving ADaM, and TLF generation.

Here we discuss the process and benefits of the metadata-driven approach. From mapping to report, this approach enhances the efficiency in attaining results and generating submission packages promptly by reducing manual interventions.

 

What are eCTD packages and how are they prepared?

The eCTD is the “standard format for submitting applications, amendments, supplements, and reports to FDA’s Center for Drug Evaluation and Research (CDER) and Center for Biologics Evaluation and Research (CBER).”1 It facilitates the electronic submission of dossiers for market approval requests, such as for a new drug (NDA).

Among files stored in the eCTD, there are some key components related to Biometrics deliverables:

  • SDTM Dataset: The Study Data Tabulation Model (SDTM) is one of the most important CDISC data standards. It’s a framework used for organizing source data collected in human clinical trials.
  • ADaM Datasets: Analysis datasets are created to enable statistical and scientific analysis of the study results. CDISC Analysis Data Model (ADaM) specifies the fundamental principles and standards to ensure that there is clear lineage from data collection to analysis.
  • TLF: Analytical outputs, in the form of tables or figures, are used to summarize the analyses required for submission to the regulatory agencies. These outputs are supported by listings that display the underlying data points.

 

The need for automation

When working on any project or analysis, certain elements remain unchanged regardless of the study design. Standardizing and automating their production can therefore improve efficiency, ensure consistency, and reduce the overall time required for submission. Automating these items also reduces manual intervention, minimizing the chance of human error.

This approach has several benefits, including:

  • Efficiency: Since the team can focus more on the non-standard parts of the outputs, the overall efficiency of the team is increased.
  • Consistency: Since automated tools generate standard code based on a set of rules, the resulting code remains highly consistent across various projects. This makes it easier to understand and debug (in case of any updates).
  • Quality: Since the tools have been rigorously tested, they produce extremely high-quality and reliable outputs.
  • Reduced manual intervention: Since manual intervention is limited, the possibility of human error is minimized. As long as the specifications are correctly drafted, the output generated by the standard code should be error-free.

A metadata-driven approach

Many companies, including Cytel, have adopted a metadata-driven approach to accelerate tasks such as SDTM, ADaM, and TLF code generation. The goal of this approach is not to automate 100% of the final code but rather to generate as much standardized and structured code as possible. This approach enhances efficiency while simplifying modifications when needed.

While a Metadata Repository (MDR) can maximize automation in the long run, currently available MDR tools remain cumbersome.2 For this reason, while still assessing the benefit of MDR solutions, Cytel has taken a different approach — extracting metadata from existing documents that statistical programmers already use in their daily work. Without adding extra workload, this metadata is stored in a structured format, allowing us to apply automated rules to enrich it. From there, we can generate SDTM, ADaM, and TLF code efficiently.

For example, metadata can be extracted from ODM.xml files or raw datasets to streamline SDTM specification mapping. These specifications can then be leveraged to generate SAS or R code automatically. Similarly, metadata from study mock shells — such as titles, footnotes, table headers, and table body structure and content — can drive the creation of TLFs with minimal manual intervention.
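A minimal sketch of the idea, assuming a hypothetical metadata dictionary already extracted from a mock shell: the structured titles and footnotes drive generation of boilerplate SAS statements. The keys and the generated template are illustrative, not Cytel's actual tooling.

```python
def generate_tlf_header(meta):
    """Emit SAS TITLE/FOOTNOTE statements from mock-shell metadata.
    Illustrative sketch of the metadata-driven approach: the 'titles' and
    'footnotes' keys are assumed, not a real specification format."""
    lines = []
    for i, title in enumerate(meta.get("titles", []), start=1):
        lines.append(f'title{i} "{title}";')
    for i, foot in enumerate(meta.get("footnotes", []), start=1):
        lines.append(f'footnote{i} "{foot}";')
    return "\n".join(lines)

meta = {"titles": ["Table 14.1.1", "Summary of Demographics"],
        "footnotes": ["Source: ADSL"]}
print(generate_tlf_header(meta))
# title1 "Table 14.1.1";
# title2 "Summary of Demographics";
# footnote1 "Source: ADSL";
```

Because the metadata is plain data, the same dictionary could just as easily drive an R template; that is the language agnosticism discussed next.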

Another key advantage of this metadata-driven approach is its language agnosticism. By structuring metadata independently of the programming language, the same metadata can be used to generate both SAS and R code. This ensures consistency, facilitates the transition for SAS programmers moving to R, and maintains quality without impacting project timelines.

 

Final takeaways

In line with the premise that “one solution does not fit all,” CROs can maximize the value of metadata within clinical trial delivery by leveraging the metadata already inherent in study artifacts. By defining a way to extract as much metadata as possible from the documents you already use, and then transforming that metadata into real deliverables, you can unlock substantial value.

This metadata-driven approach is sensitive to the fact that CROs must accommodate a multitude of sponsor standards and delivery requirements, without sacrificing the benefits of automation in an ecosystem rich in interdependencies between regulatory authorities, industry consortia, sponsors, CROs, and other third-party technology vendors.

 

1 US FDA. (4 October, 2024). Electronic Common Technical Document (eCTD).

2 PHUSE White Paper (2 October, 2024). Best Practices in Data Standards Implementation Governance.

 

Interested in learning more?

Watch Manish Deole and Sebastià Barceló’s on-demand webinar, “Expediting Regulatory Submissions through Automation”:

AI’s Influence on SAS Programming

The advent of Artificial Intelligence (AI) has transformed numerous fields, and the domain of SAS (Statistical Analysis System) programming is no exception. From automating tedious tasks to enhancing decision-making processes, AI has made significant inroads into how SAS programmers work.

However, AI is not a substitute for programmers but a companion. While AI frees us to focus our critical thinking, creativity, and problem-solving skills, it still needs our expertise. Domain expertise remains essential.

To understand this transformation better, here we explore key ways AI has impacted SAS programming: comparing traditional and AI-assisted programming, examining the days before and after AI, and discussing the new responsibilities and skills required in the modern programming landscape.

 

Traditional SAS programming vs. AI-assisted SAS programming

Traditional SAS programming has long been a manual, code-intensive practice requiring a high level of expertise in statistical analysis and programming. In the earlier days, SAS programmers worked with well-defined, often repetitive tasks. The process of developing code required a deep understanding of the data and statistical methodologies, all while meticulously debugging and quality-checking code.

AI-assisted SAS programming introduces a new level of efficiency, allowing programmers to focus more on value-added tasks rather than repetitive work. Traditional SAS programming workflows are now supported by AI-driven automation tools that can generate code, optimize algorithms, and even offer suggestions for complex statistical analyses. For example, where traditional methods would require a programmer to sift through data to find patterns, AI can now analyze large datasets in seconds and offer insights that help in decision-making. This allows the SAS programmers to focus on more strategic and high-level interpretations.

In essence, the role of the SAS programmer is evolving from “code generator” to “code curator”: one who maintains control over every step, providing deep customization and understanding of the entire process.

 

AI as a companion, not a substitute

The fear of AI replacing jobs has become a common narrative, but in the case of SAS programming, AI should be viewed as a companion rather than a replacement. While AI can optimize code, automate reporting, or even suggest corrections, it is still far from replacing the creative and analytical skills of programmers. AI systems can generate insights based on patterns within datasets, but understanding the nuances of those patterns and making informed decisions based on them remains a skill unique to the programmer.

SAS programmers have a deep understanding of the data they work with, including the context, limitations, and real-world implications of their findings. While AI can handle the heavy lifting in terms of data processing and analytics, the role of the programmer is to interpret these findings, cross-check their accuracy, and ensure the outputs are aligned with business goals or research questions.

Additionally, AI’s suggestions aren’t always perfect, especially when dealing with edge cases or complex datasets with nuanced relationships. In such scenarios, a programmer’s oversight is crucial to prevent AI-driven errors from propagating throughout the analysis.

 

Before and after AI

The landscape of SAS programming before the integration of AI was characterized by manual coding, exhaustive debugging processes, and labor-intensive quality control procedures. Let’s break down the key changes AI has brought to these areas:

 

Code development

Before AI, coding was manual and depended heavily on a programmer’s syntax knowledge and experience to ensure that the code adhered to best practices for efficiency and performance. This could be a time-consuming process, especially when dealing with large, complex datasets.

In the post-AI era, code development is becoming more efficient through AI-assisted coding tools. These tools can automatically suggest code snippets based on previous coding patterns or even generate entire blocks of code tailored to the dataset. AI-driven auto-complete features and advanced libraries that recommend the best statistical models or data manipulation techniques have significantly sped up the development process.

 

Debugging

Debugging used to be a meticulous and painstaking part of the SAS programmer’s job. Identifying errors in code or incorrect outputs often required going through large blocks of code line by line, manually reviewing logic and syntax.

AI has revolutionized debugging by identifying errors in real time, suggesting fixes, and even automatically correcting syntax errors. AI tools can also track changes in code and predict where potential issues might arise based on past errors, significantly reducing debugging time and enhancing code accuracy.

 

Quality control (QC)

Before AI, the QC process was often manual or semi-automated and prone to missed errors. It involved peer reviews, statistical validations, and rigorous testing to ensure that the code met the necessary standards. This was particularly important in industries such as healthcare or finance, where data accuracy is critical.

Today, AI-driven QC tools can automatically verify the integrity of datasets, flag inconsistencies, and ensure that statistical models meet predefined accuracy thresholds. These tools can run tests much faster than human reviewers, allowing for quicker validation cycles and better compliance with industry standards.

AI can multiply productivity without replacing the need for a programmer’s intuition and expertise, freeing us for other developmental activities such as enhancing client outcomes, learning new skills, and mentoring to strengthen the overall team.

 

New responsibilities and skills for SAS programmers in the AI age

New responsibilities and skills required for AI platforms

  • Understand how to work alongside AI tools
  • Adopt AI-driven workflows for faster development cycles
  • Learn to guide and review AI-generated code
  • Build additional skills such as data literacy, critical thinking, and ethical AI awareness

 

Industry AI tools

  • Tabnine: AI-powered code predictions
  • Snyk: AI-driven security checks
  • DeepCode: Real-time AI code review
  • SAS Viya: Integrate existing code with AI tools

 

Final takeaways

AI tools are transforming the role of SAS programmers, making them faster and more effective, but human expertise remains crucial in directing AI and ensuring high-quality outcomes. The future of programming likely lies in a hybrid approach that leverages both human expertise and AI-driven efficiencies.

 

Interested in learning more about AI in clinical development? Watch our recent webinar:

The Journey into Open Source … So Far!

Written by Sebastià Barceló, Malte Stein, and Angelo Tinazzi

Open source has been a leitmotif in our industry for many years, but its adoption poses a number of challenges. At Cytel, our journey into open source began a couple of years ago. Since then, we have focused on building a dedicated Statistical Computing Environment (SCE), defining new processes, and developing new tools to support them. We have also contributed to industry initiatives such as the R {admiral} package.

This year, PHUSE-EU will feature a dedicated stream, Open-Source Technology, where presenters will share their experience with open-source technology adoption. In this spirit of collaboration, we will contribute two presentations, each addressing a critical aspect:

  • The co-existence of R and SAS in the same SCE
  • The risk assessment of R packages

 

Integrating RStudio POSIT and SAS in the same environment

Our new SCE integrates RStudio POSIT and SAS Grid across both Windows and Linux servers. The integration was designed to create a unified and efficient environment for data analytics, leveraging both SAS and POSIT’s capabilities.

The integration was complex and presented several obstacles and surprises along the way. For instance, we encountered compatibility issues, particularly around data access and permissions. To address these, we implemented a dual-protocol drive, enabling real-time data sharing across platforms, and adopted Git as a version control system, which allows us to maintain and publish content in Connect in a more robust and secure way.

Additional challenges in managing this SCE include balancing security with usability for internal and external access to POSIT Connect and optimizing R package management.

Figure 1 illustrates the final infrastructure.

 

 

R packages risk assessment

Installing and using R packages in the SCE requires assessing the risks associated with using these packages. Packages are typically accessed through CRAN, the primary repository for R packages developed by various organizations and individuals. Risk assessment is especially critical in industries like pharmaceuticals, where strong compliance requirements (e.g., GxP) necessitate that packages are well maintained, well documented, and, above all, reliable.

A key aspect of the risk assessment is the collection of package metadata, enabling us to classify and assess the reliability of every package we may want to make available in our SCE.

At Cytel, we applied a comprehensive assessment approach by extracting metadata from R packages. We began by evaluating various techniques, such as APIs and web scraping, and compared our approach with the R riskmetric package. This comparison highlighted limitations in conventional methods, which often only address the latest package version. As a result, we enhanced our metadata extraction process.
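To make the metadata-collection step concrete, the sketch below parses the risk-relevant fields of an R package DESCRIPTION file. The field names follow the standard DESCRIPTION format, but the choice of fields and the parsing approach are illustrative assumptions, not Cytel's actual assessment pipeline:

```python
# Hypothetical sketch: pull risk-relevant metadata out of an R package
# DESCRIPTION file. The selected fields are an illustrative assumption.
RISK_FIELDS = ("Package", "Version", "Maintainer", "License", "Date/Publication")

def parse_description(text: str) -> dict:
    """Parse 'Field: value' lines, joining folded continuation lines."""
    meta, field = {}, None
    for line in text.splitlines():
        if line[:1].isspace() and field:        # continuation of previous field
            meta[field] += " " + line.strip()
        elif ":" in line:
            field, value = line.split(":", 1)
            meta[field.strip()] = value.strip()
    return {k: v for k, v in meta.items() if k in RISK_FIELDS}

desc = """Package: examplepkg
Version: 1.2.0
Maintainer: Jane Doe <jane@example.org>
License: MIT + file LICENSE
Date/Publication: 2024-06-01
Description: A package used only to
    illustrate metadata extraction.
"""
print(parse_description(desc))
```

In practice, fields like these (maintainer, license, publication date) feed into a scoring scheme; the limitation noted above, that conventional methods only see the latest version, is why extraction across a package's full version history matters.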

 

Interested in learning more?

If you are attending the PHUSE-EU in Strasbourg from November 10–13, do not miss Sebastià and Malte’s poster and presentation, where the co-existence of R and SAS and our approach to extracting metadata from R packages will be discussed in more detail:

 

“Bridging Platforms: Integrating RStudio POSIT and SAS Grid in the Same Environment”

Cytel presenters: Sebastià Barceló and Malte Stein

Tuesday, November 12, at 5:30 p.m. (Poster Session – PP28)

 

“Unveiling R Package Risk Assessment: A Comparative Analysis of Metadata Extraction”

Cytel presenters: Malte Stein and Sebastià Barceló

Wednesday, November 13, at 1:30 p.m. (Open-Source Technology Stream – OS14)

 

Angelo Tinazzi will moderate the Scripts, Macros and Automation stream, which will also cover some open-source experiences from other organizations.

 

Cytel will be at Booth #6! We hope to see you there!

P_MACRO: Parameters Extraction from Macros to SAS Dataset

In clinical development, SAS programmers manage, analyze, and interpret clinical data, helping to ensure accuracy, which is essential for regulatory submission and approval. SAS programmers may also create new programs to conduct this work more efficiently and effectively.

SAS has a powerful programming feature called Macros, which allows programmers to define repetitive sections of code once and reuse them whenever needed. It also helps create dynamic variables within the code that can take different values for different run instances of the same code.

Parameters are local to the macro that defines them. A parameter list can contain any number of macro parameters separated by commas. These macro parameters are variables whose values are initialized when we invoke the macro, providing the flexibility to supply different values at each invocation. However, there is currently no automated facility within SAS where the complete list of macro parameters defined for a group of macros can be easily checked.

Our solution is P_MACRO, which has been programmed to help programmers refer to the complete list of parameters defined within a macro program within the SAS environment itself. Here, I discuss what P_MACRO is capable of, why it’s needed, how it’s programmed, and its limitations.

 

What is P_MACRO and what does it do?

P_MACRO is a SAS program that extracts parameter-level information from a group of macros and saves it to a SAS dataset. Once set up, P_MACRO accomplishes several tasks, including:

  1. Extracting parameters from the group of macro programs to the SAS dataset.
  2. Extracting default values along with parameters, if already defined.
  3. Retaining the order of parameters.
  4. Prioritizing main macro information over nested macros.
  5. Providing a common text for macro programs without parameters.
  6. Generating an automated macro call.

 

Why is P_MACRO needed?

In SAS macros, we have the flexibility to omit some parameters at invocation, and the macro will still execute correctly if there is no dependency. But when a call does not include the complete list of parameters, it is difficult for the programmer to decide how to add parameters to the existing call when an update or modification is needed. If the programmer is not aware of the complete list of parameters, they may need to either use debugging options or manually open the macro code and check.

However, SAS offers no automated facility for checking the complete list of macro parameters defined for a group of macros. Thus P_MACRO, when released for wider use, will help programmers gather and refer to the complete list of parameters defined within a macro program in the SAS environment itself. With it, it is easy to obtain the complete list of parameters along with their default values and position/order. An automatic macro call for each macro is generated using the information stored in the resultant dataset, saving valuable time for the programmer.

 

Steps involved in P_MACRO programming

  1. Read macro program files to SAS
  2. Extract macro name
  3. Determine macro start and end points
  4. Handle nested macros
  5. Handle macro programs with no parameters
  6. Retain macro parameters position
  7. Bring out default values
  8. Generate macro call
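The core of these steps can be sketched in miniature. The snippet below is written in Python purely to illustrate the parsing logic; P_MACRO itself is a SAS program, and the regex-based approach, column names, and sample macro are illustrative assumptions (nested-macro handling is also omitted):

```python
import re

# Matches '%macro name(param1, param2=default, ...);' or '%macro name;'
MACRO_DEF = re.compile(r"%macro\s+(\w+)\s*(?:\(([^)]*)\))?\s*;", re.IGNORECASE)

def extract_parameters(source: str) -> list[dict]:
    """Return macro name, parameters in defined order, and default values."""
    rows = []
    for name, params in MACRO_DEF.findall(source):
        if not params.strip():
            rows.append({"macro": name, "position": 0,
                         "parameter": "(no parameters)", "default": ""})
            continue
        for pos, param in enumerate(params.split(","), start=1):
            pname, _, default = param.partition("=")
            rows.append({"macro": name, "position": pos,
                         "parameter": pname.strip(), "default": default.strip()})
    return rows

def generate_call(rows: list[dict]) -> str:
    """Build an automated macro call from the extracted parameter rows."""
    args = ", ".join(f"{r['parameter']}={r['default']}"
                     for r in rows if r["parameter"] != "(no parameters)")
    return f"%{rows[0]['macro']}({args});"

# A sample SAS macro definition, held here as a string to parse.
sas_source = "%macro summarize(dsin, dsout=final, debug=N);\n  ...\n%mend;"
rows = extract_parameters(sas_source)
print(generate_call(rows))   # -> %summarize(dsin=, dsout=final, debug=N);
```

Each row carries the macro name, the parameter's position, and any default value, mirroring the dataset P_MACRO produces; the generated call template gives the programmer the full parameter list ready to fill in.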

 

Limitations

Despite its benefits, the P_MACRO program has a few limitations:

  1. A %LET statement is sometimes used within a macro to conditionally check and assign a default value for a parameter. Such default values are not extracted by P_MACRO.
  2. SAS macros can use the PARMBUFF option and the SYSPBUFF automatic variable to accept a varying number of parameters at each invocation. In such cases, P_MACRO is not able to extract any parameters.

 

Final takeaways

In the SASHELP library, there is a dataset named VCOLUMN that holds detailed metadata about the libraries, datasets, and variables present in that SAS session. This helps programmers identify and query important information about the datasets and variables in the active session. Like VCOLUMN, the dataset generated by P_MACRO helps programmers find the list of macros within a folder, with their full parameter lists in defined order and default values, all in one place within SAS. Generating an automated macro call from the resultant dataset gives programmers the entire parameter list handy and ready to use as needed.

 

Interested in learning more?  

Eswara Gunisetti will be at PHUSE EU 2024 to present “P_MACRO, Parameters Extraction from Macros to SAS Dataset” on Wednesday, November 13 at 12:00 p.m. We hope to see you there!

Career Perspectives: A Conversation with Guillaume Hervé

In this latest edition of our Career Perspectives series, we had the privilege of interviewing Guillaume Hervé, Director Statistical Programming in PBS. Guillaume shares his journey in statistical programming, highlighting his extensive experience and pivotal roles. He discusses Cytel’s collaborative culture, innovative project management approaches, and the importance of mentorship. Additionally, Guillaume offers insights into the skills essential for success in the field and advice for aspiring statistical programmers.

Can you give us a little background on your career and your professional journey so far?

After completing my master’s degree in biostatistics and multiple internships as a biostatistician, I started my career as a statistical programmer 14 years ago at Novartis in Rueil-Malmaison (near Paris). I was quickly promoted to lead programmer, a position that allowed me to express my full potential as both a programmer and a team lead. During those 8 years, I gained a solid foundation of knowledge and experience in the pharmaceutical industry, especially within biometrics and clinical trial management.

In 2018, Cytel opened their new office in Basel, which is where my journey with Cytel began. I now had the opportunity to evolve in a new environment — the world of CROs. Cytel was expanding, which opened the door for me to consolidate and strengthen my experience as a team leader and provided me with the opportunity to take on the role of operational manager, and later line manager. I currently supervise a team of 20+ programmers across various regions, including Europe and APAC.

What is your role at Cytel?

I’m Director of Statistical Programming for Cytel’s Project-Based Analytical Solutions in Europe. My current role involves line management responsibilities, oversight of projects’ scope management, and development/expansion of the programming group.

Scope management mainly involves ensuring optimal utilization of our programmers across projects, controlling the quality of deliverables, overseeing the financial health of projects, and monitoring the correct implementation of programming processes. I am also actively involved in recruiting and onboarding new team members, establishing company processes, developing standard tools, and supporting department initiatives.

An illustration of such an initiative is the internship program in the programming department I developed in 2021. During the past 3 years, sustainable partnerships with 3 different universities have been built, and each year, for 6 months, we welcome students aiming to discover the role of statistical programmer in the pharmaceutical industry. This program often concludes by the conversion of the internship into a permanent contract, which shows how successful it really is.

What motivated your transition from biostatistics to statistical programming? How has your background in biostatistics influenced your approach to statistical programming?

While I have a background as a biostatistician, I have always enjoyed programming. When I first started working as a statistical programmer, I realized my expertise in biostatistics was an incredible asset, especially for programming complex statistical models. I could fully understand these models and their results, detect potential issues, and easily discuss biostatistics topics such as the management of missing data with biostatisticians. Sometimes, I could even challenge them. To me, being a statistical programmer is the perfect combination of everything I like, and it allows me to play a central role in the analysis of clinical trials.

How have your managers or colleagues at Cytel supported your professional growth since you joined the company? From your perspective, what specific aspects of Cytel’s culture or environment contribute to making it an exceptional place to work?

I have been fortunate to receive close mentorship from my managers since I began my journey at Cytel. It empowered my continuous professional growth. My current manager Nicolas Rouillé (Senior Director Statistical Programming) always looks for opportunities to get me more involved in my role at Cytel. His trust and willingness to share his experience across various fields gave me the confidence to succeed in any challenge I might face. In turn, I strive to apply the same principles with my direct reports, to strengthen the team and the organization as a whole.

At Cytel, we foster a strong team spirit and have numerous experts across all functions. I’m always grateful to work in an environment where, every day, people demonstrate enthusiasm, courage, collaboration, and commitment to achieving a common goal — delivering high-quality results to clients and actively contributing to the improvement of patient care.

Could you discuss Cytel’s integrated project management approach, which aims to synchronize delivery among data managers, biostatisticians, and programmers? How has this approach benefited our clients?

Cytel provides end-to-end biometric solutions, including data management, programming, and biostatistics services. One example of the automation of cross-functional delivery is the implementation of the standard data library and CDASH during the eCRF design/development, and the generation of SDTM template programs. When eCRFs comply with CDASH standards, the corresponding SDTM mapping in CDISC standards can be automated. The main benefit is that it enables us to increase our compliance with industry standards and improve the efficiency from data collection to reporting. CDISC compliance for analysis datasets is a key requirement from health authorities at the submission stage, which is why this automation benefits our clients directly.
Another cross-functional automation we developed at Cytel involves a tool that generates template output programs from standard mock shells and metadata. This collaboration between the biostatistics and programming teams has resulted in the production of high-quality deliverables.

Could you provide an example or project that illustrates how we deliver added value for our clients?

Recently, a client asked us to handle health authority questions for one of their Phase III oncology studies. We were contracted for biostatistics and programming services on very short timelines — what we call a rescue study. The scope wasn’t straightforward either, as we had to produce six complex efficacy ADaMs including multiple imputation rules and around 70 unique efficacy outputs presenting different statistical models.

We were able to successfully deliver a high-quality package to the client, on time, and received only minimal comments. Following this, the client informed us that they received a positive CHMP opinion for this submission. They expressed their gratitude for our collaboration and support during the submission process.

What strategies do you employ to ensure the quality and accuracy of deliverables, particularly when working on projects with tight timelines or complex data sets?

My team is composed of individuals with different seniority and experience levels, from junior programmer to associate director. When a complex project with tight timelines arises, my priority is an optimal resource assignment based on the availabilities as well as individual experience and knowledge. Sometimes a switch of resources across projects will lead to the best team setup.
When working on the project, we pay a lot of attention to writing specifications and performing programming and biostatistical review of ADaM datasets, with a focus on the computational methods of complex derivations. We perform advanced quality controls or cross-checks against other outputs to ensure the accuracy of the results. Any findings related to data, such as missing data, data issues, or specific study data scenarios that can impact study results, are shared with the client before proceeding with the delivery. It’s crucial to be proactive in these cases.
Lastly, the strong collaboration across biometric line functions is essential to delivering quality to clients, especially when timelines are short.

What combination of knowledge, skills, and technical competencies is essential for individuals to succeed as statistical programmers at Cytel? What qualities do you look for when hiring new members for your team?

Obviously, technical skills are incredibly important. We pay a lot of attention to the candidate’s proficiency in statistical programming languages and their experience in clinical data and industry standards. For senior roles, we also dive into their experience as team lead, which can include several topics of interest like resource assignments, quality controls, budget awareness and management, and communication with internal or external stakeholders.

In addition, we also assess the motivation of the candidate and their appetite to learn. This can easily counterbalance a potential lack of technical skills or experience. As hiring manager, I’m also very focused on interpersonal skills and the mindset of the candidate. Skills such as self-organization, proactivity, multi-tasking, and/or strong adaptability are ones I look for.

What advice would you give to aspiring statistical programmers or individuals aiming for roles within the field?

I would advise first familiarizing yourself with clinical trial fundamentals such as the different phases of clinical trials, study designs (e.g., randomized controlled trials, observational studies), and endpoint definitions. Understanding the clinical trial process is crucial for effective programming. Additionally, studying the regulatory framework surrounding clinical trials, including Good Clinical Practice (GCP) and ICH guidelines, is essential. This knowledge is key for compliance and data integrity.

Then, it’s important to learn a relevant programming language such as SAS or R and gain a solid understanding of biostatistics and the statistical methods commonly used in clinical trials, such as survival analysis, mixed models, and meta-analysis. Acquiring in-depth knowledge of programming standards used in pharmaceutical industry such as CDISC standards would also be a plus.
However, do not forget to develop your soft skills. Good communication skills, team spirit, collaboration, and problem-solving skills are vital in programming roles.

My last piece of advice to candidates is to look for internships or entry-level positions that provide exposure to clinical data analysis or programming. Real-world experience is invaluable.

Lastly, what are your main interests outside of work?

I like spending time with my family. I have two young kids, a nine-year-old and a six-year-old. My wife and I like to visit new places with them, especially European cities. We also like to hike, and since the Basel area is at the intersection of three countries — France, Germany, and Switzerland — we have plenty of good spots to enjoy the fresh air.

I also like spending time in my garden, I play football with my former Novartis colleagues, and regularly go to the gym. I’m turning 40, so staying in shape is becoming a serious objective!

Thank you, Guillaume, for sharing your experience with us!