Geoffrey Schau, PhD, Kunal Nagpal, MS, Rohan Joshi, MD, PhD, Rachel Baits, BS, Sebastian Pretzer, BS, Irvin Ho, BS, Adam Cole, MD, Roberto Nussenzveig, PhD, Abigail Gordhamer, Martin C Stumpe, PhD, Nike Beaubier, MD,
Matt Leavitt, MD
Microsatellite instability (MSI) is associated with patient response to immunotherapy
across several cancer types. Previous studies have shown that AI-based imaging assays can
infer MSI status from H&E whole-slide images, but external-site generalizability remains a key
challenge for successful deployment of deep learning models in digital pathology. In this study,
we develop and evaluate a model trained to predict MSI status from whole-slide H&E images of
prostate cancer and directly evaluate stain and scanner generalizability by assessing our model
on an internal test set and an external test set that contains a serial section of each slide in the
internal test set but stained at a different site and scanned using a different scanner model.
This study assessed a real-world data cohort composed of 2,252 patient samples of primary
prostate cancer with digitized H&E images and NGS-confirmed MSI status (69
MSI-High [MSI-H], 2,183 microsatellite stable [MSS]) from the development site. Of these, 114
cases (MSI-H=29, MSS=85) were assigned to the internal test set and held out from training.
All internal test set slides were scanned on a Leica GT450, and a serial section of each of these
samples was stained at an external site and scanned on a Leica AT2 (Table 1). All remaining data were
assigned to a development dataset for 4-fold cross-validation (MSI-H=40, MSS=2,098), which was
used to train an ensemble of attention-based multiple instance learning models.
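As a rough illustration of what an ensemble of attention-based multiple instance learning models computes, the sketch below pools tile embeddings from one slide with learned attention weights and averages the slide-level probabilities of several models. It is a minimal sketch, not the authors' architecture: the embedding size, the tanh attention form, the linear head, and every name (`attention_pool`, `predict_slide`, `ensemble_predict`, `random_model`) are assumptions, and the weights are random rather than trained.

```python
import math
import random


def attention_pool(tiles, V, w):
    """Return (slide_embedding, attention_weights) for one bag of tile embeddings."""
    scores = []
    for h in tiles:
        # Score each tile: w . tanh(V h)
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    dim = len(tiles[0])
    # Slide embedding is the attention-weighted sum of tile embeddings.
    pooled = [sum(a * h[j] for a, h in zip(attention, tiles)) for j in range(dim)]
    return pooled, attention


def predict_slide(tiles, model):
    """Slide-level MSI probability from one model (V, w, head, bias)."""
    V, w, head, bias = model
    pooled, _ = attention_pool(tiles, V, w)
    logit = sum(c * x for c, x in zip(head, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def ensemble_predict(tiles, models):
    """Average the probabilities of all models (e.g. one per cross-validation fold)."""
    return sum(predict_slide(tiles, m) for m in models) / len(models)


def random_model(rng, dim=8, hidden=4):
    """Hypothetical untrained model with random weights, for illustration only."""
    V = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w = [rng.gauss(0, 0.5) for _ in range(hidden)]
    head = [rng.gauss(0, 0.5) for _ in range(dim)]
    return V, w, head, 0.0


rng = random.Random(0)
tiles = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]  # 50 synthetic tiles
models = [random_model(rng) for _ in range(4)]  # four models, mirroring 4 folds
print(round(ensemble_predict(tiles, models), 3))
```

The attention weights sum to one per slide, so they can also be inspected to see which tiles contributed most to a prediction.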
On the internal test set, the predictor achieved a mean area under the receiver operating
characteristic curve (AUROC) of 0.75 (0.64-0.86, 95% CI), with a mean external-site AUROC of
0.73 (0.62-0.84, 95% CI) (Figure 1). At the external site, the model tends to underestimate MSI
probability, a drift that is not explained by the distance between sections within the sample block (Figure
2B). Further, while Gleason score was positively associated with the model’s MSI
predictions (Figure 2C), procedure type was generally not strongly associated with positive MSI
predictions in the deployment setting (Figure 2D).
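The abstract does not state how the confidence intervals were obtained. One common option, sketched here under that assumption, is a percentile bootstrap over cases. Because each internal slide has a paired external serial section, the same resampled cases can be scored at both sites to get an interval for the AUROC difference. The scores below are synthetic and the helper names (`auroc`, `percentile`, `bootstrap_ci`, `auroc_at`) are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list, with q in [0, 1]."""
    idx = int(round(q * (len(sorted_vals) - 1)))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]


def bootstrap_ci(labels, stat, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat(indices), resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    draws = []
    while len(draws) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        n_pos = sum(labels[i] for i in idx)
        if 0 < n_pos < n:  # skip resamples that lack one of the classes
            draws.append(stat(idx))
    draws.sort()
    return percentile(draws, alpha / 2), percentile(draws, 1 - alpha / 2)


# Synthetic paired scores: 29 positives and 85 negatives, external scores shifted down.
rng = random.Random(42)
labels = [1] * 29 + [0] * 85
internal = [rng.gauss(0.45 if y else 0.30, 0.15) for y in labels]
external = [s - 0.10 + rng.gauss(0, 0.05) for s in internal]


def auroc_at(idx, scores):
    return auroc([labels[i] for i in idx], [scores[i] for i in idx])


print("internal AUROC:", round(auroc(labels, internal), 3))
print("internal 95% CI:", bootstrap_ci(labels, lambda idx: auroc_at(idx, internal)))
print("paired difference 95% CI:",
      bootstrap_ci(labels, lambda idx: auroc_at(idx, internal) - auroc_at(idx, external)))
```

A uniform downward shift in scores leaves AUROC nearly unchanged because AUROC depends only on ranking, which is why a site can show similar discrimination while its probabilities drift.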
We developed a paired dataset of adjacent slides stained and scanned at different clinical sites
and showed that a model trained to predict MSI status from H&E images maintains its
predictive performance across institutions, as measured by AUROC, but may need
calibration before prediction thresholds are set.