Pairwise whole genome alignment using locally consistent parsing
Abstract
Pairwise whole-genome alignment is a fundamental problem in computational biology, with applications in evolutionary analysis, variant discovery, and comparative genomics. This work addresses the scaling challenges of pangenome analysis by using a hierarchical sketching method based on Locally Consistent Parsing (LCP). At the scale of billions of base pairs, efficient alignment typically relies on the seed-chain-extend heuristic: find exact-matching sketches (seeds), chain them co-linearly, and extend into the gaps. Established tools use minimizers or maximal unique matches (MUMs); we instead use LCP cores, which offer complete coverage, consistent spacing, and fewer seeds at higher levels. Distributed and parallelized multiple genome alignment relies on efficiently partitioning the input genomes into smaller segments that can be processed independently. Existing partitioning methods often rely on maximal exact matches (MEMs), MUMs, or minimizers for sketching. However, for MEMs and MUMs, the alignment process is complicated by the O(m log n) time required to find MEMs of size m in a string of size n. Minimizers, in turn, have drawbacks in their distribution patterns and frequencies because of their short length, which leads to suboptimal partitioning in terms of computational and communication overhead. Compared to minimizers, LCP can offer a more thorough and condensed representation of the input data by identifying "cores": short genomic sequences that are consistently present across genomes. We develop a fast, parallelizable pairwise genome alignment framework that uses a hierarchical seed-chain-extend strategy: seed at one LCP level, chain and merge matches, find unaligned regions, and, for each region, recurse by seeding only that region at the next lower level until a minimum level is reached. LCP cores can be computed hierarchically in linear time, leading to more balanced computational loads.
We integrated LCPtools with the ChainX-LCP chaining algorithm and evaluated the framework on E. coli (K-12 vs Sakai) and human (GRCh38 vs CHM13) genomes. On the human genome, our seeding completed in 68 h while Mumemto was still running after 540 h, demonstrating scalability for reference-grade assemblies.
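The hierarchical seed-chain-extend recursion described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation from this work: real LCP cores are replaced here by unique exact k-mer matches whose length shrinks at lower levels, chaining is a plain longest-increasing-subsequence pass rather than ChainX-LCP, and the names `SEED_LEN`, `seeds`, `chain`, `merge`, and `align` are hypothetical. Only the control flow (seed at one level, chain and merge, recurse into unaligned regions at the next lower level) follows the description above.

```python
from bisect import bisect_left

# ASSUMPTION: fixed-length k-mers stand in for LCP cores; higher level = longer seed.
SEED_LEN = {3: 16, 2: 8, 1: 4}

def seeds(a, b, k):
    """Exact k-mer matches (i, j) whose k-mer occurs exactly once in a and once in b."""
    def unique_positions(s):
        pos = {}
        for i in range(len(s) - k + 1):
            pos.setdefault(s[i:i + k], []).append(i)
        return {kmer: p[0] for kmer, p in pos.items() if len(p) == 1}
    ua, ub = unique_positions(a), unique_positions(b)
    return sorted((i, ub[kmer]) for kmer, i in ua.items() if kmer in ub)

def chain(matches):
    """Longest co-linear chain: strictly increasing j over matches sorted by i."""
    tails, tail_idx, prev = [], [], [None] * len(matches)
    for n, (_, j) in enumerate(matches):
        p = bisect_left(tails, j)
        if p == len(tails):
            tails.append(j)
            tail_idx.append(n)
        else:
            tails[p] = j
            tail_idx[p] = n
        prev[n] = tail_idx[p - 1] if p > 0 else None
    out = []
    n = tail_idx[-1] if tail_idx else None
    while n is not None:          # walk back-pointers from the chain's last anchor
        out.append(matches[n])
        n = prev[n]
    return out[::-1]

def merge(anchors, k):
    """Merge touching/overlapping anchors on one diagonal into blocks (i, j, length)."""
    blocks = []
    for i, j in anchors:
        if blocks:
            bi, bj, n = blocks[-1]
            if i - j == bi - bj and i <= bi + n:
                blocks[-1] = (bi, bj, max(n, i + k - bi))
                continue
            if i < bi + n or j < bj + n:
                continue          # overlaps the previous block on another diagonal: drop
        blocks.append((i, j, k))
    return blocks

def align(a, b, level, min_level=1, off_a=0, off_b=0):
    """Seed at `level`, chain and merge, then recurse into gaps at `level - 1`."""
    k = SEED_LEN[level]
    blocks = merge(chain(seeds(a, b, k)), k)
    result, pa, pb = [], 0, 0
    for i, j, n in blocks + [(len(a), len(b), 0)]:   # sentinel closes the final gap
        if level > min_level and i > pa and j > pb:
            result += align(a[pa:i], b[pb:j], level - 1, min_level,
                            off_a + pa, off_b + pb)
        if n:
            result.append((off_a + i, off_b + j, n))
        pa, pb = i + n, j + n
    return result
```

Calling `align(a, b, level=3)` returns a list of exactly matching blocks `(start_in_a, start_in_b, length)` in co-linear order. Because each gap is seeded independently of the others, the recursive calls are natural units for parallel execution, which is the property the abstract attributes to the hierarchical strategy.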