Hercules: a profile HMM-based hybrid error correction algorithm for long reads
Date
2018
Source Title
Nucleic Acids Research
Print ISSN
0305-1048
Electronic ISSN
1362-4962
Publisher
Oxford University Press
Volume
46
Issue
21
Language
English
Type
Article
Abstract
Choosing whether to use second or third generation
sequencing platforms can lead to trade-offs between
accuracy and read length. Several types of
studies require long and accurate reads. In such
cases researchers often combine both technologies
and the erroneous long reads are corrected using
the short reads. Current approaches rely on various
graph- or alignment-based techniques and do not
take the error profile of the underlying technology
into account. Efficient machine learning algorithms
that address these shortcomings have the potential
to achieve more accurate integration of these two
technologies. We propose Hercules, the first machine
learning-based long read error correction algorithm.
Hercules models every long read as a profile Hidden
Markov Model with respect to the underlying platform’s
error profile. The algorithm learns a posterior
transition/emission probability distribution for each
long read to correct errors in these reads. We show
on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2)
that Hercules-corrected reads have the highest
mapping rate among all competing algorithms
and have the highest accuracy when the breadth of
coverage is high. On a large human CHM1 cell line
WGS data set, Hercules is one of the few scalable algorithms,
and among those it achieves the highest
accuracy.
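
The idea of seeding a per-read model from the platform's error profile and then
updating it with short-read evidence can be illustrated with a toy sketch. This
is not the Hercules implementation: it keeps only one match state per long-read
position (no insertion or deletion states, no transitions), and it replaces
the paper's training procedure with a simple count-based update. The function
names, the `mismatch_rate` and `prior_weight` parameters, and the example
pileup are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"

def initial_emissions(long_read, mismatch_rate=0.1):
    """One emission distribution per long-read position.

    The observed base gets 1 - mismatch_rate; the remaining mass is split
    evenly over the other three bases. mismatch_rate is an assumed stand-in
    for a platform error profile.
    """
    other = mismatch_rate / 3
    return [
        {b: (1 - mismatch_rate if b == base else other) for b in BASES}
        for base in long_read
    ]

def update_emissions(emissions, pileup, prior_weight=1.0):
    """Combine each prior distribution with short-read base counts.

    pileup[i] holds the counts of short-read bases aligned to position i.
    This count-based update is a simplification, not the paper's method.
    """
    updated = []
    for prior, counts in zip(emissions, pileup):
        scores = {b: prior_weight * prior[b] + counts.get(b, 0) for b in BASES}
        total = sum(scores.values())
        updated.append({b: s / total for b, s in scores.items()})
    return updated

def consensus(emissions):
    """Pick the most probable base at every position."""
    return "".join(max(BASES, key=lambda b: e[b]) for e in emissions)

if __name__ == "__main__":
    long_read = "ACGTTA"  # hypothetical read; position 3 is assumed erroneous
    pileup = [
        Counter(A=5), Counter(C=4), Counter(G=6),
        Counter(A=5, T=1), Counter(T=4), Counter(),  # last position uncovered
    ]
    corrected = consensus(update_emissions(initial_emissions(long_read), pileup))
    print(corrected)
```

In this sketch, a position with no short-read coverage simply keeps the base
from the long read, because only the prior contributes there.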