Using Bloom filters to quickly and efficiently characterize genomic repeats and segmental duplications
Abstract
Advances in sequencing technologies are expected to further reduce the occurrence of sequencing-related misassemblies. Nevertheless, errors caused by repetitive sequences and duplications remain a persistent challenge and are likely to continue to affect genome assemblies. This highlights the need for fast and efficient algorithms specifically designed to address repeat-induced errors. In this study, we present KonuSeg, a versatile k-mer counting tool that leverages Bloom filters and assigns copy numbers to genomic regions segment by segment across the genome. KonuSeg employs a non-mapping-based approach that is computationally efficient and readily integrable into assembly graph frameworks, providing improved scalability and memory performance. We demonstrate its effectiveness through comprehensive analyses of data from multiple species under various configurations, and we evaluate its performance in combination with a widely used scaffolding algorithm to showcase its potential for enhancing assembly quality.
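The general idea of Bloom-filter-based k-mer counting with segment-level copy numbers can be illustrated with a minimal sketch. This is not KonuSeg's implementation: the class `CountingBloomFilter`, the function `segment_copy_numbers`, and the parameters `segment_len` and `per_copy_depth` are hypothetical names chosen for illustration, and the sketch assumes a counting Bloom filter plus a median-of-counts rule, neither of which is confirmed by the abstract.

```python
import hashlib
from statistics import median


class CountingBloomFilter:
    """Approximate k-mer counter (illustrative, not KonuSeg's data structure).

    Hash collisions can inflate a count, but a count is never underestimated.
    """

    def __init__(self, size=1 << 16, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, kmer):
        # Derive several hash positions by salting BLAKE2b with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.counters[pos] += 1

    def count(self, kmer):
        # The smallest counter is the tightest upper bound on the true count.
        return min(self.counters[pos] for pos in self._positions(kmer))


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def segment_copy_numbers(sequence, reads, k=5, segment_len=10, per_copy_depth=1.0):
    """Assign an estimated copy number to each fixed-length segment.

    per_copy_depth is the assumed k-mer depth of a single-copy region
    (a hypothetical parameter; a real tool would estimate it from the data).
    """
    cbf = CountingBloomFilter()
    for read in reads:
        for km in kmers(read, k):
            cbf.add(km)

    result = []
    for start in range(0, len(sequence) - k + 1, segment_len):
        # Extend by k - 1 so every k-mer starting in the segment is included.
        segment = sequence[start:start + segment_len + k - 1]
        counts = [cbf.count(km) for km in kmers(segment, k)]
        result.append((start, round(median(counts) / per_copy_depth)))
    return result


if __name__ == "__main__":
    contig = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
    # Toy "reads": the contig once, plus two extra copies of its middle segment.
    reads = [contig, contig[10:24], contig[10:24]]
    for start, copies in segment_copy_numbers(contig, reads):
        print(start, copies)
```

No read is aligned to the sequence here: k-mers are counted directly from the reads and then looked up per segment, which is the sense in which such an approach is non-mapping-based. Taking the median within a segment is one simple way to dampen the effect of individual inflated counters; other summary statistics would work as well.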