Trustworthy AI in Action: Predicting Stroke Risk Transparently with Claims-Based Machine Learning


August 7, 2025

In recent years, deep learning and large neural networks have garnered most of the attention in the machine learning (ML) community. Their ability to model complex, high-dimensional data is indeed impressive. But in healthcare — where decisions can have serious consequences and interpretability is paramount — simpler, transparent models like logistic regression still have an important role to play.

Not every problem requires a black box. When it comes to predicting disease risk using structured data, such as insurance claims, traditional models can offer both accuracy and insight.

 

Claims databases: An untapped resource for disease risk prediction

Claims databases are an increasingly valuable source of real-world data (RWD). Unlike clinical trial data, which is highly controlled but limited in scale and scope, administrative claims datasets cover millions of lives over multiple years, reflecting real patient behavior and care patterns.

These databases include information on diagnoses, procedures, prescriptions, and demographics — elements that, while lacking granular clinical detail, can still reveal important patterns in disease progression and risk. The scale of these datasets allows for robust statistical modeling, even for rare outcomes.

 

The case for explainable machine learning in claims-based risk prediction

When working with claims data, models like logistic regression, Lasso, or Ridge regression are not just sufficient — they are often ideal. These models:

  • Produce coefficients that quantify the relationship between features and outcomes.
  • Allow for transparent understanding of why a prediction was made.
  • Are easier to validate and communicate to clinicians, payers, and regulators.
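
As a minimal sketch of the first point, the snippet below converts logistic regression coefficients into odds ratios, which is how such coefficients are usually read. The feature names and coefficient values are invented for illustration; they are not estimates from any study.

```python
import math

# Hypothetical log-odds coefficients; illustrative values only,
# not estimates from any real model.
coefficients = {
    "age_per_10_years": 0.50,
    "hypertension": 0.40,
    "diabetes": 0.30,
}


def odds_ratios(coefs):
    """Exponentiate each log-odds coefficient to obtain an odds ratio."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


for name, ratio in odds_ratios(coefficients).items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 means the feature is associated with higher odds of the outcome when the other features are held fixed; a coefficient of zero gives an odds ratio of exactly 1.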

In contrast, deep learning models may deliver slightly higher accuracy on some tasks, but at the cost of interpretability, a trade-off that may not be acceptable in regulated healthcare environments.

 

A real-world example: Predicting stroke risk with claims data

In a recent study, Cytel used data from over 2.5 million insured individuals to predict the risk of stroke hospitalization. Using only claims-based features such as age, medication use, comorbidities (e.g., diabetes, hypertension), and health service utilization, we compared the performance of several models, including:

  • Logistic regression
  • Regularized linear models (Lasso and Ridge)
  • XGBoost (a gradient-boosted decision tree algorithm)

The results? All models achieved similar predictive performance, with area under the ROC curve (AUC) values around 0.81. Logistic regression, which is simple, explainable, and well-established, performed on par with XGBoost, demonstrating that added model complexity wasn’t necessary to achieve meaningful predictive power.
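
For readers less familiar with the metric, the sketch below shows one way to compute AUC: the share of (event, non-event) pairs in which the person who had the event received the higher predicted risk. The function name `roc_auc` and the toy labels and scores are assumptions made up for illustration; this is not the study's evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels (1 = event) and predicted risks; made up for illustration.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of records, so for datasets with millions of rows a rank-based implementation such as scikit-learn's `roc_auc_score` would be the practical choice.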

 

Transparency enables trust and action

What sets models like logistic regression apart is their explainability. Stakeholders can see precisely how risk factors like atrial fibrillation, hypercholesterolemia, or age contribute to predicted stroke risk. This level of clarity is essential not only for clinicians making decisions, but also for data governance, compliance, and patient communication.
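
To illustrate what this kind of transparency can look like for a single prediction, the sketch below splits one patient's predicted log-odds into per-feature contributions. The intercept, coefficients, and patient record are hypothetical values chosen for illustration, not outputs of the study's model.

```python
import math

# Hypothetical model parameters on the log-odds scale (illustrative only).
intercept = -5.0
coefficients = {
    "atrial_fibrillation": 0.9,
    "hypercholesterolemia": 0.3,
    "age_per_10_years": 0.5,
}


def explain(patient):
    """Return each feature's log-odds contribution and the predicted risk."""
    contributions = {
        name: beta * patient[name] for name, beta in coefficients.items()
    }
    log_odds = intercept + sum(contributions.values())
    risk = 1.0 / (1.0 + math.exp(-log_odds))  # logistic (sigmoid) function
    return contributions, risk


# Hypothetical 70-year-old patient with atrial fibrillation.
patient = {
    "atrial_fibrillation": 1,
    "hypercholesterolemia": 0,
    "age_per_10_years": 7.0,
}
contributions, risk = explain(patient)
for name, value in contributions.items():
    print(f"{name}: {value:+.2f} log-odds")
print(f"predicted risk: {risk:.3f}")
```

Because the model is additive on the log-odds scale, the contributions plus the intercept sum exactly to the prediction, so a reviewer can trace every predicted risk back to the inputs that produced it.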

In a time when “black box” AI models are under increasing scrutiny, explainable models offer a pragmatic path forward — especially when paired with large-scale real-world datasets like claims data.

 

Keep it simple, keep it transparent

Healthcare doesn’t just need powerful algorithms — it needs trustworthy ones. As our study shows, standard machine learning models remain highly relevant, especially when applied to well-structured real-world data. Claims databases, in particular, offer a rich foundation for developing these models and making preventive healthcare smarter, earlier, and more accessible.


Manuel Cossio

Director, Innovation and Strategic Consulting

Manuel Cossio is Director, Innovation and Strategic Consulting at Cytel. Manuel is an AI engineer with over a decade of experience in healthcare AI research and development. He currently leads the creation of generative AI solutions aimed at optimizing clinical trials, focusing on hierarchical multi-agent systems with multistage data governance and human-in-the-loop dynamic behavior control.

Manuel has an extensive research background with publications in computer vision, natural language processing, and genetic data analysis. He is a registered Key Opinion Leader at the Digital Medicine Society, a member of the ISPOR Community of Interest in AI, a Generative AI evaluator for the EU Commission, and an AI researcher at UB-UPC-Barcelona Supercomputing Center.

He holds an M.Sc. in Translational Medicine from Universitat de Barcelona, a Master of Engineering in AI from Universitat Politècnica de Catalunya, and an M.Sc. in Neuroscience from Universitat Autònoma de Barcelona.


Marco Ghiani

Senior Director, Real World Evidence

Marco Ghiani is Senior Director, Real World Evidence at Cytel. Marco leads the strategic design and execution of retrospective observational studies across Europe, guiding cross-functional teams and driving real-world evidence generation. He has led numerous RWE initiatives leveraging statistical and econometric methods, with experience spanning multiple therapeutic areas. Marco holds a PhD in Economics from Boston College with a dissertation titled “Essays in Applied Health Economics.”

 

