Characterization of structural variation through assembly-to-assembly comparison
Abstract
Structural variations (SVs) are genomic variations affecting more than 50 nucleotides of DNA. SVs play a crucial role in evolution and can have critical phenotypic effects, including genetic diseases in humans such as autism, schizophrenia, epilepsy, and cancer. SV characterization is therefore of great significance. In the past, read-based methods were used because constructing genome assemblies was infeasible. With technological advances, however, assembling genomes has become significantly more feasible, and complete assemblies of human and other primate genomes have been constructed. Despite these high-quality assemblies, SV discovery in human genomes remains challenging because of the genome's repetitive nature and the complex rearrangements caused by combinations of SVs. Most existing SV discovery tools that operate on genome assemblies require whole-genome alignments, leading to long preprocessing times and high memory usage. New algorithms are therefore still needed to discover SVs efficiently. Here, we propose Strive, a linear-time algorithm that operates on genome assembly sketches instead of whole-genome alignments to characterize insertions, deletions, and inversions. We evaluated the performance of Strive in two experiments: simulated data from the human reference genome (GRCh38.p14 / hg38) and real data using a full genome assembly from the Telomere-to-Telomere Consortium (CHM13). Strive accurately detects insertions, deletions, and inversions in 11 to 12 seconds, in addition to preprocessing times ranging from 50 to 55 seconds. In the simulations without duplications, Strive achieved over 95% precision and recall. In the simulations that included segmental duplications and SNPs, and in the experiment with the CHM13 assembly, recall for inversion discovery remained above 95%, but precision and recall for insertions and deletions were lower, suggesting a need for increased robustness to duplications.
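For readers unfamiliar with the reported metrics, the following is a minimal sketch of how precision and recall could be computed for a set of SV calls against a simulated truth set. It is illustrative only and is not taken from the thesis: the `SV` record, the `matches` rule, and the 50-nucleotide breakpoint `tolerance` are assumptions made for this example, and the evaluation actually used for Strive may match calls differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical SV record used only for this illustration."""
    chrom: str
    start: int
    end: int
    kind: str  # "INS", "DEL", or "INV"


def matches(call: SV, truth: SV, tolerance: int = 50) -> bool:
    # Assumed matching rule: same type, same chromosome, and both
    # breakpoints within `tolerance` nucleotides of the truth record.
    return (
        call.kind == truth.kind
        and call.chrom == truth.chrom
        and abs(call.start - truth.start) <= tolerance
        and abs(call.end - truth.end) <= tolerance
    )


def precision_recall(calls, truths, tolerance: int = 50):
    # Each truth record may be matched by at most one call.
    unmatched = list(truths)
    true_positives = 0
    for call in calls:
        for truth in unmatched:
            if matches(call, truth, tolerance):
                unmatched.remove(truth)
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


# Made-up example data: three simulated SVs and four calls.
truths = [
    SV("chr1", 1000, 1500, "DEL"),
    SV("chr1", 5000, 5000, "INS"),
    SV("chr2", 200, 900, "INV"),
]
calls = [
    SV("chr1", 1010, 1495, "DEL"),  # matches the simulated deletion
    SV("chr2", 210, 905, "INV"),    # matches the simulated inversion
    SV("chr3", 100, 400, "DEL"),    # false positive
    SV("chr1", 9000, 9000, "INS"),  # false positive
]

p, r = precision_recall(calls, truths)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four calls are correct (precision 0.50) and two of the three simulated SVs are recovered (recall 0.67). Computing the metrics separately for each SV type, as the abstract reports them, would only require filtering `calls` and `truths` by `kind` before calling `precision_recall`.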