Large structural variation discovery using long reads with several degrees of error
Embargo Lift Date: 2021-07-28
Genomic structural variations (SVs) are briefly defined as large-scale alterations of DNA content, copy, and organization. Although significant progress has been made in characterizing SVs since the introduction of high-throughput sequencing (HTS), accurate detection of complex SVs and balanced rearrangements remains elusive because of the sequence complexity at the breakpoints. Until very recently, the problem stayed challenging for two reasons: short reads are difficult to map in such regions, and long-read platforms had high error rates. However, the Pacific Biosciences High Fidelity (HiFi) sequencing methodology produces highly accurate (> 99%) long reads (10–20 kbp), which makes powerful SV detection and breakpoint resolution possible. Here, we introduce DALEK, a novel algorithm that aims to use long-read technologies to discover large structural variations with high breakpoint resolution. DALEK uses split-read and read-depth signatures from long-read data to discover large (≥ 10 kbp) deletions, inversions, and segmental duplications. We also develop methods to detect large SVs in existing high-error Oxford Nanopore Technologies data.
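To illustrate how split-read and read-depth signatures can point to the three SV classes named above, here is a minimal Python sketch. It is not DALEK's implementation: the `Segment` record, the function names, the depth-ratio cutoffs (0.75 and 1.25), and the coarse inversion span are all assumptions made for illustration. Only the 10 kbp minimum size comes from the abstract.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence, Tuple

MIN_SV_LEN = 10_000  # the abstract targets SVs of at least 10 kbp


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a split read (hypothetical record layout)."""
    read_start: int  # offset of this piece within the read
    ref_start: int   # 0-based reference start
    ref_end: int     # reference end, exclusive
    strand: str      # "+" or "-"


def classify_split(a: Segment, b: Segment) -> Optional[Tuple[str, int, int]]:
    """Guess an SV type from two pieces of one read on the same chromosome."""
    first, second = sorted((a, b), key=lambda s: s.read_start)
    if first.strand != second.strand:
        # Strand flip between pieces suggests an inversion. The span below is
        # coarse (it covers both pieces); real breakpoint refinement is harder.
        lo = min(first.ref_start, second.ref_start)
        hi = max(first.ref_end, second.ref_end)
        return ("INV", lo, hi) if hi - lo >= MIN_SV_LEN else None
    if first.strand == "-":
        # On the minus strand, read order runs opposite to reference order.
        first, second = second, first
    if second.ref_start - first.ref_end >= MIN_SV_LEN:
        # The read skips a stretch of reference: deletion signature.
        return ("DEL", first.ref_end, second.ref_start)
    if (second.ref_start < first.ref_start
            and first.ref_end - second.ref_start >= MIN_SV_LEN):
        # The read jumps backwards on the reference: tandem duplication signature.
        return ("DUP", second.ref_start, first.ref_end)
    return None


def depth_ratio(depth: Sequence[float], lo_bin: int, hi_bin: int) -> float:
    """Mean binned depth inside [lo_bin, hi_bin) over mean depth outside it."""
    inside = list(depth[lo_bin:hi_bin])
    outside = list(depth[:lo_bin]) + list(depth[hi_bin:])
    return mean(inside) / mean(outside)


def depth_supports(kind: str, ratio: float) -> bool:
    """Check the depth ratio against illustrative (assumed) cutoffs."""
    if kind == "DEL":
        return ratio < 0.75
    if kind == "DUP":
        return ratio > 1.25
    return 0.75 <= ratio <= 1.25  # a balanced inversion should not change depth
```

In this sketch a candidate is kept only when both signals agree: for example, a split-read deletion call whose interval also shows roughly half the flanking depth, as expected for a heterozygous deletion.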