Assessment and correction of errors in DNA sequencing technologies

buir.advisor: Alkan, Can
dc.contributor.author: Fırtına, Can
dc.date.accessioned: 2018-01-10T13:45:51Z
dc.date.available: 2018-01-10T13:45:51Z
dc.date.copyright: 2017-12
dc.date.issued: 2017-12
dc.date.submitted: 2018-01-10
dc.description: Cataloged from PDF version of article. [en_US]
dc.description: Thesis (M.S.): Bilkent University, Department of Computer Engineering, İhsan Doğramacı Bilkent University, 2017. [en_US]
dc.description: Includes bibliographical references (leaves 35-41). [en_US]
dc.description.abstract: Next-generation sequencing technologies differ in several parameters, and the choice between short- and long-read sequencing platforms often involves a trade-off between accuracy and read length. In this thesis, I first demonstrate reproducibility problems in analyses that use short reads. Our comprehensive analysis of the reproducibility of computational characterization of genomic variants using high-throughput sequencing data shows that repeats might be prone to ambiguous mapping. Short reads are more vulnerable to repeats and, thus, may cause reproducibility problems. Next, I introduce a novel algorithm, Hercules, the first machine learning-based long read error correction algorithm. Several kinds of studies require long and accurate reads, including de novo assembly and fusion and structural variation detection. In such cases, researchers often combine both technologies, and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies. Our algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this distribution to correct errors in that read. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when most of the base pairs of a long read are covered with short reads. [en_US]
dc.description.provenance: Submitted by Betül Özen (ozen@bilkent.edu.tr) on 2018-01-10T13:45:51Z No. of bitstreams: 1 thesis.pdf: 1991968 bytes, checksum: bd9f59c9b24604746de246e7fc160b00 (MD5) [en]
dc.description.provenance: Made available in DSpace on 2018-01-10T13:45:51Z (GMT). No. of bitstreams: 1 thesis.pdf: 1991968 bytes, checksum: bd9f59c9b24604746de246e7fc160b00 (MD5) Previous issue date: 2018-01 [en]
dc.description.statementofresponsibility: by Can Fırtına. [en_US]
dc.embargo.release: 2020-01-09
dc.format.extent: xiii, 68 leaves : charts (some color) ; 30 cm [en_US]
dc.identifier.itemid: B157364
dc.identifier.uri: http://hdl.handle.net/11693/35728
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: DNA [en_US]
dc.subject: Sequencing [en_US]
dc.subject: Repeats [en_US]
dc.subject: Error correction [en_US]
dc.subject: Long reads [en_US]
dc.subject: Machine learning [en_US]
dc.title: Assessment and correction of errors in DNA sequencing technologies [en_US]
dc.title.alternative: DNA dizilim teknolojilerindeki hatalar üzerine değerlendirme ve hataların düzeltilmesi [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Master's
thesis.degree.name: MS (Master of Science)

Files

Original bundle

Name: thesis.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format
Description: Full printable version

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission