Pairwise whole genome alignment using locally consistent parsing

Limited Access
This item is unavailable until:
2026-07-30

Date

2026-01

Advisor

Alkan, Can

Abstract

Pairwise whole-genome alignment is a fundamental problem in computational biology, with applications in evolutionary analysis, variant discovery, and comparative genomics. This work addresses the scaling challenges of pangenome analysis with a hierarchical sketching method based on Locally Consistent Parsing (LCP). At the scale of billions of base pairs, efficient alignment typically relies on the seed-chain-extend heuristic: find exact-matching sketches (seeds), chain them co-linearly, and extend into the gaps. Established tools use minimizers or maximal unique matches (MUMs); we instead use LCP cores, which offer complete coverage, consistent spacing, and fewer seeds at higher levels. Distributed and parallelized multiple genome alignment relies on efficiently partitioning the input genomes into smaller segments that can be processed independently. Existing partitioning methods often rely on maximal exact matches (MEMs), MUMs, or minimizers for sketching. However, for MEMs and MUMs, the alignment process is complicated by the O(m log n) time required to find MEMs of size m in a string of size n. Similarly, minimizers exhibit drawbacks in their distribution patterns and frequencies due to their short length, leading to suboptimal partitioning in terms of computational and communication overhead. Compared to minimizers, LCP can offer a more thorough and condensed representation of the input data by identifying “cores,” short genomic sequences that are consistently present across genomes. We develop a fast, parallelizable pairwise genome alignment framework that uses a hierarchical seed-chain-extend strategy: seed at one LCP level, chain and merge matches, find unaligned regions, and, for each region, recurse by seeding only that region at the next lower level until a minimum level is reached. LCP cores can be computed hierarchically in linear time, leading to more balanced computational loads.
We integrated LCPtools with the ChainX-LCP chaining algorithm and evaluated it on E. coli (K-12 vs Sakai) and human (GRCh38 vs CHM13). On the human genome, our seeding completed in 68 h while Mumemto was still running after 540 h, demonstrating scalability for reference-grade assemblies.
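
The hierarchical seed-chain-extend loop described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: as simplifying assumptions, fixed-length k-mers that are unique in both sequences stand in for LCP cores (a larger k plays the role of a higher LCP level), and a longest-increasing-subsequence chainer stands in for ChainX-LCP. All function names (`exact_seeds`, `chain`, `merge`, `align`) are hypothetical.

```python
from bisect import bisect_left

def exact_seeds(a, b, k, a_off=0, b_off=0):
    """Seed: (i, j) pairs with a[i:i+k] == b[j:j+k], using k-mers unique in both.
    ASSUMPTION: unique k-mers are a stand-in for LCP cores."""
    def unique_positions(s):
        pos = {}
        for i in range(len(s) - k + 1):
            pos.setdefault(s[i:i + k], []).append(i)
        return {kmer: p[0] for kmer, p in pos.items() if len(p) == 1}
    ua, ub = unique_positions(a), unique_positions(b)
    return sorted((ua[m] + a_off, ub[m] + b_off) for m in ua if m in ub)

def chain(seeds):
    """Chain: longest co-linear subset (strictly increasing j over seeds sorted by i).
    ASSUMPTION: plain LIS, a stand-in for a real chaining algorithm."""
    tails, tails_idx, prev = [], [], [None] * len(seeds)
    for idx, (_, j) in enumerate(seeds):
        p = bisect_left(tails, j)
        if p == len(tails):
            tails.append(j)
            tails_idx.append(idx)
        else:
            tails[p] = j
            tails_idx[p] = idx
        prev[idx] = tails_idx[p - 1] if p > 0 else None
    out, cur = [], (tails_idx[-1] if tails_idx else None)
    while cur is not None:
        out.append(seeds[cur])
        cur = prev[cur]
    return out[::-1]

def merge(chained, k):
    """Merge overlapping or adjacent seeds on the same diagonal into (i, j, length) blocks."""
    blocks = []
    for i, j in chained:
        if blocks:
            bi, bj, bl = blocks[-1]
            if i - bi == j - bj and i <= bi + bl:
                blocks[-1] = (bi, bj, i + k - bi)  # extend the current block
                continue
            if i < bi + bl or j < bj + bl:
                continue  # off-diagonal seed overlapping the block: drop it
        blocks.append((i, j, k))
    return blocks

def align(a, b, ks, a_lo=0, a_hi=None, b_lo=0, b_hi=None):
    """Seed at level ks[0], chain and merge, then recurse into each
    unaligned gap with the remaining (smaller) levels ks[1:]."""
    a_hi = len(a) if a_hi is None else a_hi
    b_hi = len(b) if b_hi is None else b_hi
    if not ks or a_lo >= a_hi or b_lo >= b_hi:
        return []
    k = ks[0]
    seeds = exact_seeds(a[a_lo:a_hi], b[b_lo:b_hi], k, a_lo, b_lo)
    blocks = merge(chain(seeds), k)
    result, pa, pb = [], a_lo, b_lo
    # A zero-length sentinel block closes the final gap of the region.
    for bi, bj, bl in blocks + [(a_hi, b_hi, 0)]:
        result += align(a, b, ks[1:], pa, bi, pb, bj)
        if bl:
            result.append((bi, bj, bl))
        pa, pb = bi + bl, bj + bl
    return result

# Two toy sequences differing by one substitution at position 7.
blocks = align("ACGTTGCAGGATCC", "ACGTTGCTGGATCC", ks=[6, 3])
```

Each returned block `(i, j, length)` is an exact match `a[i:i+length] == b[j:j+length]`, and blocks are reported in co-linear order. Because each recursive call seeds only its own gap, independent gaps could be handed to separate workers, which is the property the abstract relies on for parallelization.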

Degree Discipline

Computer Engineering

Degree Level

Master's

Degree Name

MS (Master of Science)

Language

English

Type