05/15/2020

Robust Detection of Sequencing Batch Effects in RNA Through Low Dimensional Embedding With Subtype-Matched Reference Samples

American Association for Cancer Research Annual Meeting 2020 Presentation
Authors: Joshua Drews, Joshua Bell, Wesley Munson, Saksham Saini, Benjamin Leibowitz, Jackson Michuda, Calvin McCarter, Lee Langer, Catherine Igartua, and Kevin White

Laboratories conducting high volumes of RNA sequencing must be wary of technical batch effects if samples are to be compared across extended time periods, which is imperative for the most well-powered analyses of cancer transcriptomes. Changes in the reagents, protocols, or technologies used in nucleic acid extraction, library preparation, and sequencing can alter transcriptomes in ways that invalidate or complicate comparisons of samples from different batches, so continuous monitoring is necessary. This monitoring is particularly difficult when samples come from distinct tissue sites, because tumor type is the major biological determinant of transcriptome variance in cancer. Brain and liver cancer transcriptomes, for example, are expected to differ so drastically that comparing them is not informative for batch effect detection. Detection methods must also be robust to disparate batch effects, which can manifest as minor changes in expression among many genes or as major changes in a subset of genes, making ad hoc detection infeasible.

To overcome these challenges, we developed MaCoBED (matched cohort batch effect detection), a novel method that evaluates technical batch effects in a set of transcriptome samples (e.g., a flow cell) by pooling them with a set of validated reference samples matched by cancer type and tissue site. This pooled set of transcriptomes is then subjected to low-dimensional embedding using Uniform Manifold Approximation and Projection (UMAP), and each component of the embedding is tested for deviation from the reference set using a Wilcoxon test. Matching new and legacy samples by cancer type and tissue site ensures that any differences in UMAP clustering are not driven by known biological contributions. We found UMAP preferable to principal component analysis (PCA) because it can capture variability in just two dimensions. This accentuates modest but consistent transcriptome differences among batches that would otherwise be spread across multiple minor principal components, making batch effects more obvious and readily detectable.
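The per-component testing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the pooled samples have already been embedded (for example with the umap-learn package's `umap.UMAP(n_components=2).fit_transform`), and it takes that embedding as a plain list of coordinate rows. The function names `rank_sum_pvalue` and `flag_batch`, the normal approximation used for the rank-sum test, the significance level, and the Bonferroni correction across components are all assumptions made for illustration; the abstract does not specify them.

```python
from statistics import NormalDist


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation.

    Tied values receive average ranks. For brevity this sketch omits the
    tie and continuity corrections (a simplification, not the source's method).
    """
    # Pool both groups, tagging each value with its group (0 = x, 1 = y).
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))


def flag_batch(embedding, is_new, alpha=0.05):
    """Test each embedding component of the new batch against the reference set.

    embedding: list of coordinate rows for the pooled samples.
    is_new:    list of booleans, True for samples from the batch under test.
    Returns (flagged, pvalues), with one p-value per component.
    """
    n_components = len(embedding[0])
    pvalues = []
    for c in range(n_components):
        new = [row[c] for row, flag in zip(embedding, is_new) if flag]
        ref = [row[c] for row, flag in zip(embedding, is_new) if not flag]
        pvalues.append(rank_sum_pvalue(new, ref))
    # Bonferroni correction across components (an assumed choice).
    flagged = any(p < alpha / n_components for p in pvalues)
    return flagged, pvalues


# Synthetic 2-D "embedding": 40 reference samples followed by 12 new samples
# displaced along the first component only. These coordinates are made up.
embedding = [[float(i), float(i % 5)] for i in range(40)]
embedding += [[100.0 + i, float(i % 5)] for i in range(12)]
is_new = [False] * 40 + [True] * 12

flagged, pvalues = flag_batch(embedding, is_new)
print("batch flagged:", flagged)
print("per-component p-values:", pvalues)
```

Testing each component separately keeps the result easy to read alongside a two-dimensional scatter plot of the embedding: a flagged component corresponds to a visible displacement of the new batch along that axis.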

This approach was able to detect a number of simulated batch effects with high specificity and sensitivity relative to randomly sampled validated legacy samples. Thus, we propose MaCoBED as a simple and rapid approach for batch effect monitoring of high-throughput RNA sequencing datasets that is versatile in detecting distinct kinds of batch effects, easily automatable, readily interpretable upon visualization, and extensible to small or large batch sizes.
