Researchers should be mindful of potential technical artifacts in next-generation sequencing data when calling variants, particularly insertion and deletion variants, according to a team of researchers from the J. Craig Venter Institute and the University of California, San Diego.
The researchers analyzed datasets from The Cancer Genome Atlas (TCGA) project, focusing on germline variants from more than 9,000 cases representing 31 different cancer types. As they described in a publication posted on the bioRxiv preprint server earlier this month, the team found that technical artifacts particularly affected loss-of-function (LOF) indel calls.
The study highlights the importance of standards, which can be “extremely helpful to try and control everything and minimize the effects of bias,” Keegan Korthauer, a postdoctoral research fellow at Dana-Farber Cancer Institute, who was not involved with the study, said in an interview.
In addition, it serves as a warning for researchers to “be careful when using these large datasets,” senior author Nicholas Schork, director of human biology at JCVI, said in an interview. There was “no one uniform workflow used to call variants” in the TCGA data, added Schork, who also has an appointment at the Translational Genomics Research Institute.
Researchers from the JCVI and several UCSD departments, all interested in how germline variants may influence tumor properties and phenotypes, joined forces to analyze the TCGA data, which includes phenotypic data along with sequence data.
Alexandra Buckley, lead author of the paper, said that when she began analyzing the germline TCGA data, she first did some basic quality control of the dataset, since she knew that different institutions had used different methods to process the samples. Nonetheless, when she started doing association studies, she found “all sorts of significant results that seemed suspicious.” Buckley and the other JCVI and UCSD researchers therefore began investigating the TCGA data further to identify biases and technical artifacts.
The TCGA data includes more than 11,000 patient samples collected by researchers from 20 different institutions. There was no single specified method for collecting and sequencing the samples, so there are differences in sample collection, processing, and sequencing that could lead to variation in the data. Much of the analysis to date has focused on somatic variants, but the JCVI and UCSD researchers were interested in doing a pan-cancer analysis of the germline data. As such, it was first necessary to figure out how the various processing and analysis methods affected the data.
Of 31 different cancer types evaluated, the same workflow was used for only two types. The researchers identified seven sources of variation: the tissue source of normal DNA, the exome capture kit that was used, whole-genome amplification prior to sequencing, sequencing center, sequencing technology, the BWA version that was used, and capture efficiency.
Because the team was most interested in looking at the potential impact of germline variants on cancer-relevant pathways, they focused on loss-of-function variants, since those would most likely disrupt gene function.
The researchers used the statistical method known as analysis of variance (ANOVA) to assess how each of the seven sources of variation affected LOF variants. An initial analysis found that sequencing technology and the source of normal DNA had minimal impact on LOF variants, so they eliminated those two sources from subsequent analysis. Looking at the different types of LOF variants, they found that SNVs were largely unaffected by technical variation.
All sources of technical variation combined explained less than 1 percent of the variance in LOF SNVs. However, technical variation had a much larger impact on LOF indels, accounting for 59 percent of the variation in LOF indels. When they dug deeper into the individual sources of technical variation, the team found that whole-genome amplification alone explained more than 50 percent of the variation in LOF indels.
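To make the "percent of variance explained" figures concrete, here is a minimal sketch of how such a share can be computed for a single factor. It is a simplified one-way version (eta squared), not the team's actual multi-factor analysis, and the function name and all counts are made-up illustrations rather than TCGA data.

```python
from collections import defaultdict
from statistics import fmean


def variance_explained(counts, labels):
    """One-way ANOVA effect size (eta squared).

    Returns the between-group sum of squares divided by the total sum of
    squares, i.e. the share of variation in `counts` that is attributable
    to the grouping in `labels`.
    """
    grand_mean = fmean(counts)
    groups = defaultdict(list)
    for count, label in zip(counts, labels):
        groups[label].append(count)
    ss_total = sum((c - grand_mean) ** 2 for c in counts)
    ss_between = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical per-sample LOF call counts (illustrative only).
amplified = [False] * 4 + [True] * 4
indel_counts = [10, 12, 11, 13, 18, 20, 19, 21]  # shifted upward when amplified
snv_counts = [30, 32, 31, 33, 31, 33, 30, 32]    # same mean in both groups

eta_indel = variance_explained(indel_counts, amplified)
eta_snv = variance_explained(snv_counts, amplified)
print(f"indels: {eta_indel:.1%} of variance explained by amplification")
print(f"SNVs:   {eta_snv:.1%} of variance explained by amplification")
```

In this toy setup the grouping accounts for most of the indel variation and none of the SNV variation, which mirrors the qualitative pattern the researchers reported.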
Whole-genome amplification was used on samples from four different cancer types: colon adenocarcinoma, rectum adenocarcinoma, ovarian cancer, and acute myeloid leukemia. For the colon and rectum adenocarcinoma samples, whole-genome amplification was used on 26 percent and 33 percent of the samples, respectively. By comparing amplified and non-amplified samples within the same cancer type, the researchers further confirmed that amplification created biases in the data that could lead to false-positive calls, particularly for LOF indels. There was a 1.5-fold increase in LOF indels among the amplified samples compared to the non-amplified ones.
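The within-cancer-type comparison described above can be sketched as follows. This is only an illustration of the idea, assuming a simple ratio of mean LOF indel counts; the function name, the sample tuples, and the counts are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import fmean


def fold_increase_by_type(samples):
    """Ratio of mean LOF indel counts, amplified vs. non-amplified.

    `samples` is an iterable of (cancer_type, was_amplified, n_lof_indels).
    Cancer types lacking either group, or with a zero non-amplified mean,
    are skipped because no within-type comparison is possible.
    """
    by_type = defaultdict(lambda: {True: [], False: []})
    for cancer_type, was_amplified, n_lof_indels in samples:
        by_type[cancer_type][was_amplified].append(n_lof_indels)

    folds = {}
    for cancer_type, split in by_type.items():
        if split[True] and split[False]:
            baseline = fmean(split[False])
            if baseline > 0:
                folds[cancer_type] = fmean(split[True]) / baseline
    return folds


# Hypothetical samples (illustrative counts only).
toy_samples = [
    ("colon", True, 15), ("colon", True, 18), ("colon", True, 12),
    ("colon", False, 10), ("colon", False, 12), ("colon", False, 8),
    ("breast", False, 9), ("breast", False, 11),  # no amplified samples
]

folds = fold_increase_by_type(toy_samples)
print(folds)
```

Restricting the comparison to a single cancer type keeps other differences between cohorts, such as capture kit or sequencing center, from being mistaken for an amplification effect.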
Next, the researchers tested various methods to separate the artifacts from the true variant calls. The best filtering method removed about half of the expected LOF indel calls, and so likely also removed some true signal. As such, the researchers excluded all samples that had undergone whole-genome amplification from further analyses so as not to bias the data.
Olivier Harismendy, assistant professor at UCSD, said that although the study did not analyze why amplification primarily affected LOF indels, one possible explanation could be slippage of the polymerase during amplification. That is a known error, he said, and when it happens, it can cause a small insertion or deletion.
Going forward, the researchers plan to continue pan-cancer analyses of how germline variants affect tumor properties, and in particular, to address the question of whether certain germline variants coupled with certain somatic variants lead to cancer, Schork said.
He added that he hopes this study will shed light on the importance of quality control when doing association studies across data generated in different ways. Harismendy added that the issue has been highlighted among researchers studying somatic variants, but has not been as widely discussed among those studying germline variants, even though it is just as important.
“Batch effects exist, and researchers need to be sensitive to that,” Schork said.