06/11/2025

A Probabilistic Approach for Building Disease Phenotypes Across Electronic Health Records

BioData Mining MANUSCRIPT
Authors: David Vidmar, Jessica De Freitas, Will Thompson, John M. Pfeifer, Brandon K. Fornwalt, Noah Zimmerman, Riccardo Miotto & Ruijun Chen

Background – Identifying the set of patients with a particular disease diagnosis across electronic health records (EHRs), referred to as a phenotype, is an important step in clinical research and applications. The task is often challenging because incomplete data can render definitive classifications impossible. We propose a probabilistic approach to phenotyping that is based on Bayesian inference and does not require gold-standard labels. In this paper, we develop multiple heuristic “labeling functions” (LFs) for four diseases across de-identified EHR data and aggregate their votes using three methods: majority vote (MV), a popular open-source approach (Snorkel OSS), and our proposed probabilistic approach (LEVI). We compare the resulting phenotypes to those built using expert-curated logic from the literature and to those from an off-the-shelf natural language processing pipeline (Medspacy), using a curated sample of physician-reviewed labels for evaluation.
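
To make the idea of labeling functions and majority-vote aggregation concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not name the four diseases or the LF rules, so the diabetes-style rules, the field names (`codes`, `medications`, `hba1c`), and the thresholds are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: rules, field names, and thresholds are assumptions,
# not the labeling functions used in the paper.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_diagnosis_code(patient):
    # Hypothetical rule: a matching diagnosis code votes positive, else abstain.
    return POSITIVE if "E11" in patient.get("codes", []) else ABSTAIN


def lf_medication(patient):
    # Hypothetical rule: a disease-specific medication votes positive, else abstain.
    return POSITIVE if "metformin" in patient.get("medications", []) else ABSTAIN


def lf_lab_value(patient):
    # Hypothetical rule: a lab value votes either way; a missing value abstains.
    hba1c = patient.get("hba1c")
    if hba1c is None:
        return ABSTAIN
    return POSITIVE if hba1c >= 6.5 else NEGATIVE


LABELING_FUNCTIONS = [lf_diagnosis_code, lf_medication, lf_lab_value]


def majority_vote(patient, lfs=LABELING_FUNCTIONS):
    """Aggregate LF votes by simple majority, ignoring abstentions."""
    votes = [lf(patient) for lf in lfs]
    cast = [v for v in votes if v != ABSTAIN]
    if not cast:
        return ABSTAIN  # no LF had an opinion
    positives = sum(1 for v in cast if v == POSITIVE)
    negatives = len(cast) - positives
    if positives == negatives:
        return ABSTAIN  # tie: no definitive label
    return POSITIVE if positives > negatives else NEGATIVE
```

Majority vote yields a hard label (or an abstention) and treats every LF as equally reliable, which is the limitation that probabilistic aggregation aims to address.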

Results – Phenotypes built using LFs achieve better classification performance than off-the-shelf alternatives (F1 scores of 0.79–0.82 vs. expert-logic: 0.68, Medspacy: 0.55). Compared to output scores from Snorkel OSS, LEVI provides better-calibrated probabilities (expected calibration error of 0.04 vs. 0.12), better ROC AUC estimates (interval score [loss] of 0.03 vs. 0.10), and better operating point selection (equal-cost net benefit of 0.18 vs. 0.15).
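
For readers unfamiliar with expected calibration error (ECE), the sketch below shows one common way to compute it for binary labels: bin the predicted probabilities into equal-width bins, then take the sample-weighted average gap between the mean predicted probability and the observed positive rate in each bin. The ten-bin, equal-width scheme is an assumption; the abstract does not state which binning the authors used.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE for binary labels (0/1) and predicted P(label = 1).

    The binning scheme is an assumption for illustration, not taken from the paper.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - positive_rate)
    return ece
```

A perfectly calibrated model scores 0; a model that assigns 0.9 to patients who all turn out negative scores 0.9.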

Conclusions – For challenging disease states, phenotyping using probabilities rather than binary classification can lead to improved and more personalized downstream decision-making. Probabilistic phenotypes built using LEVI exhibit low calibration error without the need for labels, allowing for better risk-benefit tradeoffs.
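
The risk-benefit tradeoff can be illustrated with the standard decision-curve definition of net benefit, which credits true positives and penalizes false positives by the odds of the chosen probability threshold. This is a sketch under assumptions: the abstract does not define "equal-cost net benefit", and reading it as net benefit at a threshold of 0.5 (where a false positive costs as much as a true positive gains) is an interpretation, not a statement of the authors' method.

```python
def net_benefit(probs, labels, threshold=0.5):
    """Decision-curve net benefit: TP/n - FP/n * (t / (1 - t)).

    With threshold = 0.5 the weight is 1, so false positives and true
    positives count equally (assumed meaning of "equal cost").
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")

    n = len(labels)
    flagged = [(p >= threshold, y) for p, y in zip(probs, labels)]
    true_pos = sum(1 for hit, y in flagged if hit and y == 1)
    false_pos = sum(1 for hit, y in flagged if hit and y == 0)
    weight = threshold / (1 - threshold)
    return true_pos / n - weight * false_pos / n
```

Because the threshold encodes the relative cost of errors, well-calibrated probabilities let a downstream user pick the operating point that matches their own tradeoff rather than accepting a fixed binary label.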
