Assessment and correction of errors in DNA sequencing technologies

Fırtına, Can

buir.advisor	Alkan, Can
dc.contributor.author	Fırtına, Can
dc.date.accessioned	2018-01-10T13:45:51Z
dc.date.available	2018-01-10T13:45:51Z
dc.date.copyright	2017-12
dc.date.issued	2017-12
dc.date.submitted	2018-01-10
dc.description	Cataloged from PDF version of article.	en_US
dc.description	Includes bibliographical references (leaves 35-41).	en_US
dc.description.abstract	Next-generation sequencing technologies differ in several parameters, and the choice between short- and long-read sequencing platforms often involves a trade-off between accuracy and read length. In this thesis, I first demonstrate reproducibility problems in analyses that use short reads. Our comprehensive analysis of the reproducibility of computational characterization of genomic variants using high-throughput sequencing data shows that repeats might be prone to ambiguous mapping. Short reads are more vulnerable to repeats and, thus, may cause reproducibility problems. Next, I introduce Hercules, a novel algorithm and the first machine learning-based long-read error correction algorithm. Several applications, including de novo assembly and fusion and structural variation detection, require long and accurate reads. In such cases, researchers often combine both technologies and correct the more erroneous long reads using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies. Our algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses it to correct errors in that read. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when most of the base pairs of a long read are covered with short reads.	en_US
dc.description.statementofresponsibility	by Can Fırtına.	en_US
dc.embargo.release	2020-01-09
dc.format.extent	xiii, 68 leaves : charts (some color) ; 30 cm	en_US
dc.identifier.itemid	B157364
dc.identifier.uri	http://hdl.handle.net/11693/35728
dc.language.iso	English	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	DNA	en_US
dc.subject	Sequencing	en_US
dc.subject	Repeats	en_US
dc.subject	Error correction	en_US
dc.subject	Long reads	en_US
dc.subject	Machine learning	en_US
dc.title	Assessment and correction of errors in DNA sequencing technologies	en_US
dc.title.alternative	DNA dizilim teknolojilerindeki hatalar üzerine değerlendirme ve hataların düzeltilmesi	en_US
dc.type	Thesis	en_US
thesis.degree.discipline	Computer Engineering
thesis.degree.grantor	Bilkent University
thesis.degree.level	Master's
thesis.degree.name	MS (Master of Science)

Files

Original bundle

Name: thesis.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format
Description: Full printable version

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

Graduate School of Engineering and Science