Multimodal Prediction of Diagnosis for Cancers of Unknown Primary

American Association for Cancer Research Annual Meeting 2020 Presentation
Authors: Jackson Michuda, Benjamin Leibowitz, Shlomit Amar-Farkash, Crystal Bevis, Alessandra Breschi, Joshuah Kapilivsky, Catherine Igartua, Joshua S. K. Bell, Kyle A. Beauchamp, Kevin White, Martin Stumpe, Nike Beaubier, and Timothy Taxter.

Background: Tumors of unknown origin account for up to 5% of newly diagnosed cancers, and average survival is 9 to 12 months from diagnosis. Establishing tumor type and subtype guides standard-of-care treatment under several NCCN targeted therapy guidelines.

Methods: Targeted DNA sequencing of more than 500 cancer-associated genes and exome-capture RNA sequencing were carried out on more than 25,000 fresh-frozen or paraffin-embedded tumor samples, including both primary and metastatic tumors. Mutations, copy number variants, and viral sequences were detected from DNA sequencing, while gene expression and fusion events were determined from RNA sequencing. We aimed to predict cancer type by training multiple machine learning models on individual data types and harmonizing their predictions across data types.
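
The abstract does not say how predictions were harmonized across data types. As an illustration only, the sketch below shows one common option, late fusion by weighted averaging of per-class probabilities, which also falls back to whichever data types are present (for example, DNA only). The function names, modality keys, class labels, and probabilities are hypothetical and are not taken from the study.

```python
from typing import Dict, Optional

# Hypothetical per-modality output: class label -> predicted probability.
Probs = Dict[str, float]


def harmonize(predictions: Dict[str, Optional[Probs]],
              weights: Optional[Dict[str, float]] = None) -> Probs:
    """Weighted average of class probabilities over the modalities present.

    A modality whose value is None (e.g. no RNA data for the sample) is
    skipped. This is an assumed fusion rule, not the authors' method.
    """
    available = {m: p for m, p in predictions.items() if p is not None}
    if not available:
        raise ValueError("no modality produced a prediction")
    weights = weights or {}
    # Modalities without an explicit weight default to 1.0.
    total = sum(weights.get(m, 1.0) for m in available)
    labels = sorted({label for p in available.values() for label in p})
    return {
        label: sum(weights.get(m, 1.0) * p.get(label, 0.0)
                   for m, p in available.items()) / total
        for label in labels
    }


def top_prediction(probs: Probs) -> str:
    """Return the label with the highest harmonized probability."""
    return max(probs, key=probs.get)


if __name__ == "__main__":
    # Made-up probabilities for two made-up classes.
    rna = {"sarcoma": 0.7, "melanoma": 0.3}
    dna = {"sarcoma": 0.4, "melanoma": 0.6}
    print(top_prediction(harmonize({"rna": rna, "dna": dna})))
    print(top_prediction(harmonize({"rna": None, "dna": dna})))
```

Averaging probabilities keeps each per-data-type model independent, so a sample missing one assay can still be scored by the remaining models. Other fusion schemes, such as a stacked meta-classifier, would fit the same description.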

Results: The transcriptome model predicts more than 60 unique diagnoses covering both solid and hematological cancers with >90% overall accuracy on a held-out test set. Of note, the model can accurately predict 10 subtypes of sarcoma and 6 subtypes of neuroendocrine tumors. Gene expression and splicing were the most informative data types, but a performant DNA-only model was also evaluated for use when only DNA data are available. Finally, we evaluated the model on an unlabeled cohort of poorly differentiated samples with inconclusive diagnoses.

Conclusions: The incorporation of multiple modes of omics data can improve the interpretability and robustness of machine learning models to predict cancer diagnosis.