BLEND: A fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis

Firtina, C.; Park, J.; Alser, M.; Kim, J. S.; Cali, D. S.; Shahroodi, T.; Ghiasi, N. M.; Singh, G.; Kanellopoulos, K.; Alkan, Can; Mutlu, O.

buir.contributor.author	Alkan, Can
buir.contributor.orcid	Alkan, Can|0000-0002-5443-0706
dc.citation.epage	lqad004-18	en_US
dc.citation.issueNumber	1
dc.citation.spage	lqad004-1
dc.citation.volumeNumber	5
dc.contributor.author	Firtina, C.
dc.contributor.author	Park, J.
dc.contributor.author	Alser, M.
dc.contributor.author	Kim, J. S.
dc.contributor.author	Cali, D. S.
dc.contributor.author	Shahroodi, T.
dc.contributor.author	Ghiasi, N. M.
dc.contributor.author	Singh, G.
dc.contributor.author	Kanellopoulos, K.
dc.contributor.author	Alkan, Can
dc.contributor.author	Mutlu, O.
dc.date.accessioned	2024-03-21T14:56:16Z
dc.date.available	2024-03-21T14:56:16Z
dc.date.issued	2023-01-10
dc.department	Department of Computer Engineering
dc.description.abstract	Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds, as conventional hashing methods assign distinct hash values to different seeds, including highly similar seeds. Finding only exact-matching seeds either (i) increases the use of costly sequence alignment or (ii) limits sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds, called fuzzy seed matches, with a single lookup of their hash values. BLEND (i) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to more accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND. © 2023 The Author(s). Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
dc.identifier.doi	10.1093/nargab/lqad004	en_US
dc.identifier.eissn	2631-9268	en_US
dc.identifier.uri	https://hdl.handle.net/11693/115049	en_US
dc.language.iso	English	en_US
dc.publisher	Oxford University Press	en_US
dc.relation.isversionof	https://dx.doi.org/10.1093/nargab/lqad004
dc.rights	CC BY 4.0 DEED (Attribution 4.0 International)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.source.title	NAR Genomics and Bioinformatics
dc.title	BLEND: A fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis
dc.type	Article
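
The abstract describes treating a seed as a set and hashing it with SimHash so that similar sets can receive the same hash value. The following is a minimal, generic SimHash sketch in Python, not BLEND's actual implementation. The choice of overlapping k-mers as set items, the 32-bit hash width, the BLAKE2b item hash, and the function names kmers, item_hash, simhash and hamming are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable


def kmers(seed: str, k: int) -> set:
    """Split a seed into its set of overlapping k-mers (assumed set items)."""
    return {seed[i:i + k] for i in range(len(seed) - k + 1)}


def item_hash(item: str, bits: int = 32) -> int:
    """Hash one item to a `bits`-wide integer (BLAKE2b chosen for illustration)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(items: Iterable[str], bits: int = 32) -> int:
    """Generic SimHash: per-bit majority vote over the item hashes."""
    counters = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for b in range(bits):
            # Vote +1 if bit b is set in this item's hash, otherwise -1.
            counters[b] += 1 if (h >> b) & 1 else -1
    result = 0
    for b in range(bits):
        if counters[b] > 0:
            result |= 1 << b
    return result


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Two hypothetical seeds that differ by a single substitution.
    s1 = "ACGTACGTTGCATGCAAGCT"
    s2 = "ACGTACGTTGCATGCAAGGT"
    h1 = simhash(kmers(s1, 4))
    h2 = simhash(kmers(s2, 4))
    print(f"{h1:032b}\n{h2:032b}\nHamming distance: {hamming(h1, h2)}")
```

Because each output bit is a majority vote, sets that share most of their items tend to agree on most bits and can collide on the full hash value, which is the property a single-lookup fuzzy match relies on. How often that happens depends on the item definition and the hash width, neither of which is specified in this record.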

Files

Original bundle

Name: BLEND_A_fast,_memory-efficient_and_accurate_mechanism_to_find_fuzzy_seed_matches_in_genome_analysis.pdf
Size: 1.94 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.01 KB
Format: Item-specific license agreed upon to submission

Collections

Scholarly Publications - Computer Engineering