Towards Cancer Mega-Cohorts: A Novel Homogenization Algorithm Applied to Diverse Breast Cancer RNA-Seq Datasets

05/13/2020

American Society of Clinical Oncology Annual Meeting 2020

Authors: Talal Ahmed, Mark Carty, Stephane Wenric, and Raphael Pelossof

Background: Recent advances in transcriptomics have produced several publicly available breast cancer RNA-Seq datasets, such as TCGA, SCAN-B, and METABRIC. However, molecular predictors cannot be applied across datasets without correcting for batch differences. In this study, we demonstrate a homogenization algorithm that allows molecular subtype predictors to be transferred from one RNA-Seq cohort to another. The algorithm uses only cohort-level RNA-Seq summary statistics and therefore requires neither joint normalization of the two datasets nor the transfer of patient information. Using this approach, we transferred a breast cancer subtype (Luminal A, Luminal B, HER2+, Basal) predictor trained on SCAN-B data to accurately predict subtypes in TCGA.
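
To illustrate how a correction can rely on cohort-level summary statistics alone, the minimal sketch below matches the per-gene mean and standard deviation of a source cohort to those of a target cohort. This is an assumption made purely for illustration, not the algorithm presented in this abstract, whose details are not given here. The function names `cohort_summary` and `homogenize` are hypothetical.

```python
from statistics import mean, pstdev


def cohort_summary(samples):
    """Return per-gene (means, standard deviations) for a cohort.

    `samples` is a list of rows, one row of expression values per sample.
    """
    genes = list(zip(*samples))  # transpose: one tuple of values per gene
    return [mean(g) for g in genes], [pstdev(g) for g in genes]


def homogenize(samples, source_stats, target_stats, eps=1e-8):
    """Rescale source samples so each gene matches the target mean and SD.

    ASSUMPTION: simple per-gene moment matching, used here only as an
    example of a transform that needs nothing but summary statistics.
    """
    src_mean, src_sd = source_stats
    tgt_mean, tgt_sd = target_stats
    return [
        [
            (x - sm) / (ss + eps) * ts + tm
            for x, sm, ss, tm, ts in zip(row, src_mean, src_sd, tgt_mean, tgt_sd)
        ]
        for row in samples
    ]


# Toy data: 3 samples x 2 genes per cohort (values are made up).
source = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
target = [[0.0, 100.0], [2.0, 104.0], [4.0, 102.0]]

target_stats = cohort_summary(target)  # only these summaries are shared
adjusted = homogenize(source, cohort_summary(source), target_stats)
```

In this sketch only `target_stats` crosses between cohorts, so no individual sample from the target cohort is needed to adjust the source cohort.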

Methods: First, we randomly split the TCGA cohort (n = 481 Luminal A, n = 189 Luminal B, n = 73 HER2+, n = 168 Basal) into two sets: TCGA-train and held-out TCGA-test (n = 455 and n = 456, respectively). Second, the SCAN-B cohort (n = 837) was homogenized with the TCGA-train set. Third, a molecular subtype predictor, based on a logistic regression model, was trained on the homogenized SCAN-B RNA-Seq samples and used to predict the subtypes of the TCGA-test RNA-Seq samples. As a baseline, a similar predictor trained on the non-homogenized SCAN-B cohort was tested on the same TCGA-test set. This experimental framework was repeated 250 times. Reported P-values are from a paired one-sided t-test.
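
The split and the paired comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `split_in_half` and `paired_t_statistic` are hypothetical, and the F1 values are made-up numbers, not results from the study. The sketch returns only the t statistic and degrees of freedom; a one-sided P-value would then be read from the upper tail of the t distribution (for example with SciPy's `scipy.stats.ttest_rel` and `alternative="greater"`).

```python
import random
from math import sqrt
from statistics import mean, stdev


def split_in_half(items, seed):
    """Randomly split items into two disjoint, near-equal halves."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def paired_t_statistic(with_h, without_h):
    """t statistic and degrees of freedom for paired differences.

    Each pair holds the scores from one iteration, with and without
    homogenization, evaluated on the same held-out split.
    """
    diffs = [a - b for a, b in zip(with_h, without_h)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


# 911 TCGA samples split into 455 (train) and 456 (test).
train_ids, test_ids = split_in_half(range(911), seed=0)

# Made-up per-iteration F1 scores, for illustration only.
f1_with = [0.74, 0.75, 0.73, 0.76]
f1_without = [0.51, 0.50, 0.53, 0.52]
t_stat, dof = paired_t_statistic(f1_with, f1_without)
```

Pairing by iteration matters here because both predictors are scored on the same TCGA-test split, so split-to-split variation cancels in the differences.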

Results: To quantify model performance, we measured the average F1 score for each tumor subtype prediction on the held-out TCGA-test set with and without cohort homogenization. The average F1 scores with vs. without homogenization were: Luminal A, 0.88 vs. 0.85 (P < 1e-69); Luminal B, 0.74 vs. 0.51 (P < 1e-183); HER2+, 0.73 vs. 0.53 (P < 1e-99); Basal, 0.98 vs. 0.97 (P < 1e-53). Homogenization significantly improved the F1 score for every subtype, with the largest gains for Luminal B and HER2+.
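
For reference, a per-subtype F1 score is the harmonic mean of precision and recall for that subtype, which simplifies to 2TP / (2TP + FP + FN). The sketch below computes it from true and predicted labels. The function name `f1_per_class` and the toy labels are illustrative assumptions, not data from the study.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Convention used here: F1 is 0.0 when a label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


subtypes = ["Luminal A", "Luminal B", "HER2+", "Basal"]

# Toy labels, for illustration only.
y_true = ["Luminal A", "Luminal A", "Luminal B", "Basal"]
y_pred = ["Luminal A", "Luminal B", "Luminal B", "Basal"]

scores = f1_per_class(y_true, y_pred, subtypes)
```

Averaging each subtype's score over the repeated splits would give per-subtype averages of the kind reported above.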

Conclusions: We developed a novel homogenization algorithm that accurately transfers subtype predictors across diverse, independent breast cancer cohorts.
