Evaluating the Performance of Copy Number Variant Detection Tools in Clinical Applications: A Comparative Benchmark for Short-Read Whole Genome Sequencing

Authors: Francisco M. De La Vega, Sean Irvine, Pavana Anur, Kelly Potts, Lewis Kraft, Raul Torres, Sean Truong, Yeonghun Lee, Shunhua Han, Vitor Onuchic, James Han, Peter Kang

As whole genome sequencing (WGS) becomes cheaper, it is set to become a standard approach in clinical settings due to its superior copy number variant (CNV) detection. Current tools for short-read WGS CNV calling need to be evaluated for clinical use, where orthogonal confirmation of CNVs may be required and sensitivity is prioritized over specificity to a greater degree than in research uses. We evaluated several CNV detection tools designed for short-read WGS data, including Delly, DRAGEN 4.0, CNVkit, CNVpytor, Lumpy, Manta, and Parliament2, as well as two newer tools: Cue, a machine learning-based method, and DRAGEN 4.2’s integrated CNV caller, which combines breakpoint and depth-based calls. We used data from independent PCR-free libraries of the HG002 reference cell line, sequenced to a mean depth of 50X using paired-end 2x150 bp reads on Illumina NovaSeq 6000 and X-Plus instruments. CNV benchmarks often aim to evaluate event-level similarity, but in clinical contexts the primary concern is whether a variant disrupts protein structure. Thus, to calculate accuracy, we evaluated CNV overlaps with coding exons, defining a match as a called event that intersects an exon with the same dosage direction as the truth set. To account for events overlapping multiple exons, each event’s contribution is adjusted by the number of exons it spans. Using GRCh37-defined exon boundaries, we confined our analysis to exons intersecting the HG002 GIAB v0.6 SV truth set for events of 1-100 kb, which includes 13 deletions and 4 duplications that overlap 45 and 8 exons, respectively. Given the limited number of examples in the truth set, we placed simulated gene models across the truth set to increase the deletion and duplication exon overlaps to 125 and 20, respectively. We show that all callers struggle to detect single-exon events (typically <5 kb) and duplications.
However, DRAGEN 4.2 and Parliament2 demonstrated the highest sensitivity (91% and 93%, respectively), with the former showing superior specificity (61% vs 22%). Remarkably, despite a reduction in specificity (to 42%), the high-sensitivity mode of DRAGEN 4.2 achieved 100% sensitivity for both deletions and duplications >5 kb. This could be advantageous in clinical settings where sensitivity is paramount and confirmatory tests are performed.
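
The exon-level matching described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Interval`, `Cnv`, `exon_hits`, and `exon_level_metrics` are hypothetical, coordinates are assumed to be 0-based and half-open, and the toy intervals are invented for illustration only. The sketch assumes that counting each (exon, dosage direction) pair once is an acceptable reading of "adjusted by the number of exons spanned", so an event covering n exons contributes n units. The second returned value is simply the fraction of called exon hits confirmed by the truth set, which may differ from how the specificity figures reported here were computed.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)


class Cnv(NamedTuple):
    region: Interval
    direction: str  # "DEL" (copy loss) or "DUP" (copy gain)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def exon_hits(events: list[Cnv], exons: list[Interval]) -> set[tuple[int, str]]:
    """Set of (exon index, dosage direction) pairs touched by any event."""
    return {
        (i, ev.direction)
        for ev in events
        for i, exon in enumerate(exons)
        if overlaps(ev.region, exon)
    }


def exon_level_metrics(truth, calls, exons):
    """Return (sensitivity, fraction of called exon hits confirmed by truth)."""
    truth_hits = exon_hits(truth, exons)
    call_hits = exon_hits(calls, exons)
    matched = len(truth_hits & call_hits)  # same exon, same direction
    sensitivity = matched / len(truth_hits) if truth_hits else float("nan")
    confirmed = matched / len(call_hits) if call_hits else float("nan")
    return sensitivity, confirmed


# Toy data (invented): three exons, one truth deletion spanning two of them.
exons = [
    Interval("chr1", 100, 200),
    Interval("chr1", 300, 400),
    Interval("chr1", 1000, 1100),
]
truth = [Cnv(Interval("chr1", 50, 450), "DEL")]
calls = [
    Cnv(Interval("chr1", 150, 180), "DEL"),    # matches exon 0
    Cnv(Interval("chr1", 250, 350), "DUP"),    # exon 1, wrong direction
    Cnv(Interval("chr1", 1000, 1050), "DEL"),  # exon 2, not in truth
]
sensitivity, confirmed = exon_level_metrics(truth, calls, exons)
```

Because a match requires only that a call intersect the exon with the correct dosage direction, a call that misplaces its breakpoints still counts as long as it reaches the exon; this is the property that distinguishes exon-level scoring from event-level scoring.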