Transcriptome profiling of tumors by standard RNA-seq measures the average effect of the tumor microenvironment. This expression profile is largely shaped by tumor architecture: tumor purity can directly influence genomic interpretation and the resulting associations with clinical outcomes. This challenge is most pronounced in metastatic cancers, where the primary tumor and the non-cancerous background tissue have distinct expression profiles.
We developed a novel machine learning method, the Correlated Composition AdMixture model (CoCoAMix), that learns to distinguish tumor and background transcriptomes and to deconvolve metastatic mixtures. Our method first performs differential expression analysis to automatically select the genes that best discriminate between cancer and adjacent-normal transcriptomes. We then apply an admixture model that uses the selected genes to estimate tumor purities and build prototypical cancer and adjacent-normal transcriptome-wide profiles. Our model uses pure primary cancer samples and pure background tissue samples, but does not require further information about the metastatic cancer samples. Finally, CoCoAMix produces sample-specific tumor and adjacent-normal deconvoluted transcriptomes that approximately reconstitute the observed bulk RNA.
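The admixture idea can be illustrated with a minimal two-component sketch. This is not the CoCoAMix implementation: it assumes the bulk expression of each selected gene is a linear mix, bulk ≈ p × tumor + (1 − p) × normal, estimates the purity p by least squares, and then subtracts the expected normal contribution. The function names `estimate_purity` and `deconvolve` and the toy profiles are hypothetical.

```python
def estimate_purity(bulk, tumor_profile, normal_profile):
    """Least-squares purity p under the assumed model bulk = p*tumor + (1-p)*normal."""
    # Rearranged: (bulk - normal) = p * (tumor - normal), a one-parameter regression.
    num = 0.0
    den = 0.0
    for b, t, n in zip(bulk, tumor_profile, normal_profile):
        diff = t - n
        num += (b - n) * diff
        den += diff * diff
    if den == 0.0:
        raise ValueError("tumor and normal profiles are identical on these genes")
    # Purity is a proportion, so clip the estimate to [0, 1].
    return min(1.0, max(0.0, num / den))


def deconvolve(bulk, normal_profile, purity):
    """Remove the expected normal contribution and rescale to tumor-only expression."""
    if purity <= 0.0:
        raise ValueError("purity must be positive to recover a tumor profile")
    return [
        max(0.0, (b - (1.0 - purity) * n) / purity)
        for b, n in zip(bulk, normal_profile)
    ]


# Toy prototype profiles over three discriminating genes (illustrative values only).
tumor = [100.0, 10.0, 50.0]
normal = [10.0, 200.0, 50.0]
true_purity = 0.3
bulk = [true_purity * t + (1.0 - true_purity) * n for t, n in zip(tumor, normal)]

purity_hat = estimate_purity(bulk, tumor, normal)
tumor_hat = deconvolve(bulk, normal, purity_hat)
```

In this noise-free toy case the purity and the tumor profile are recovered up to floating-point error. Real data would need a noise model and per-sample profiles, which this sketch deliberately leaves out.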
We optimized and validated this approach on 1,166 liver metastases across six cancer types. The most frequent liver metastases were from colorectal (28.6%), pancreatic (26.6%), and breast (23.4%) cancers. Our RNA-based tumor purity estimates were correlated with pathology-based and DNA-based estimates in all six cancer types. Pure primary cancer samples supplied to CoCoAMix as negative controls were correctly identified as high-purity, with very small errors introduced by deconvolution. We also validated our approach by deconvoluting synthetic metastatic samples produced by mixing read counts.
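One way to build a synthetic mixture from read counts is sketched below. The source says only that read counts were mixed, so the scheme here is an assumption: each tumor read is kept with probability equal to the target purity, and each normal read with the complementary probability (binomial thinning). The function names are hypothetical, and the sketch assumes the two input samples have comparable sequencing depth.

```python
import random


def thin_counts(counts, keep_fraction, rng):
    """Keep each read independently with probability keep_fraction."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    # Simple per-read loop; adequate for small illustrative counts.
    return [sum(1 for _ in range(c) if rng.random() < keep_fraction) for c in counts]


def mix_counts(tumor_counts, normal_counts, purity, seed=0):
    """Synthetic bulk sample: thinned tumor reads plus thinned normal reads."""
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("count vectors must cover the same genes")
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    tumor_part = thin_counts(tumor_counts, purity, rng)
    normal_part = thin_counts(normal_counts, 1.0 - purity, rng)
    return [t + n for t, n in zip(tumor_part, normal_part)]


# Toy per-gene read counts (illustrative values only).
tumor_counts = [120, 5, 60, 0]
normal_counts = [8, 240, 55, 30]
synthetic = mix_counts(tumor_counts, normal_counts, purity=0.4, seed=42)
```

Because the true purity of such a mixture is known by construction, the estimate returned by a deconvolution method can be compared against it directly.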
We found that deconvolution of metastatic cancers resulted in substantial changes to their expression profiles. Low-dimensional embeddings of metastases revealed that low-purity metastases cluster near normal liver transcriptomes, but deconvolution shifted these samples to cluster among primary samples of the same cancer type while maintaining metastatic signal. Differential expression analyses between pre- and post-deconvolution transcriptomes also uncovered a large number of differentially deconvoluted genes. Furthermore, we discovered that many gene-gene correlations arise from tumor purity, and that this confounding is reduced in deconvoluted transcriptomes. Integration of deconvolution into the Tempus Origin™ tumor of unknown primary prediction pipeline resulted in a classification accuracy of 87% in liver metastases. These findings demonstrate a robust and versatile method that controls for tumor-adjacent tissue contributions, an absolute necessity when using RNA-seq data to inform clinical decisions.