Algorithms for structural variation discovery using multiple sequence signatures

buir.advisor: Alkan, Can
dc.contributor.author: Söylev, Arda
dc.date.accessioned: 2018-09-14T08:44:20Z
dc.date.available: 2018-09-14T08:44:20Z
dc.date.copyright: 2018-09
dc.date.issued: 2018-09
dc.date.submitted: 2018-09-13
dc.description: Cataloged from PDF version of article. [en_US]
dc.description: Thesis (Ph.D.): Bilkent University, Department of Computer Engineering, İhsan Doğramacı Bilkent University, 2018. [en_US]
dc.description: Includes bibliographical references (leaves 87-119). [en_US]
dc.description.abstract: Genomic variations, including single nucleotide polymorphisms (SNPs), small INDELs and structural variations (SVs), are known to have significant phenotypic effects on individuals. Among them, SVs, which alter more than 50 nucleotides of DNA, are the major source of complex genetic diseases such as Crohn's disease, schizophrenia and autism. Additionally, the total number of nucleotides affected by SVs is substantially higher than the number affected by SNPs (3.5 Mbp for SNPs versus 15-20 Mbp for SVs). Today, whole genome sequencing (WGS) with high throughput sequencing (HTS) technology lets us discover these variations far faster, more cheaply and more accurately than before. However, as demonstrated in the 1000 Genomes Project, HTS technology still has significant limitations. The major problem is that the short read lengths (<250 bp) produced by current sequencing platforms, combined with the large amounts of repeats in most genomes, make it very challenging to map reads unambiguously and to characterize genomic variants accurately. Thus, most existing SV discovery tools focus on detecting relatively simple types of SVs such as insertions, deletions and short inversions. Yet other types of SVs, including complex ones, are of crucial importance, and several have been associated with genomic disorders. To better understand the contribution of these SVs to the human genome, we need new approaches to accurately discover and genotype such variants. There is therefore still a need for accurate algorithms that fully characterize a broader spectrum of SVs and thereby also improve the calling accuracy of simpler variants.
Here we introduce TARDIS, which implements novel algorithms to accurately characterize various types of SVs, including deletions, novel sequence insertions, inversions, transposon insertions, nuclear mitochondrial insertions, tandem duplications and interspersed segmental duplications in direct or inverted orientation, using short-read whole genome sequencing datasets. Within our framework, we combine multiple sequence signatures, including read pair, read depth and split read, to increase our SV prediction accuracy. Additionally, we can analyze more than one possible mapping location for each read to overcome the problems associated with the repetitive nature of genomes. Recently, the limitations of short-read sequencing technology have prompted newer library preparation techniques, and 10x Genomics is one of these initiatives. This technique can obtain long-range contiguity information and is regarded as a cost-effective alternative to long-read sequencing. We extended TARDIS to use the Linked-Read information of 10x Genomics to overcome some of the constraints of short-read sequencing technology. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real data sets. In the simulation experiments, TARDIS achieved 97.67% sensitivity with only a 1.12% false discovery rate. For the experiments on real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than for state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate for our approach in predicting tandem, direct and inverted interspersed segmental duplications on CHM1 (less than 5% for the top 50 predictions).
The algorithms we describe here are the first to predict insertion location and the various types of new segmental duplications using HTS data. [en_US]
dc.description.provenance: Submitted by Betül Özen (ozen@bilkent.edu.tr) on 2018-09-14T08:44:20Z. No. of bitstreams: 1. 10211386.pdf: 3553369 bytes, checksum: f7a27515f89597e53f8171829c733298 (MD5) [en]
dc.description.provenance: Made available in DSpace on 2018-09-14T08:44:20Z (GMT). No. of bitstreams: 1. 10211386.pdf: 3553369 bytes, checksum: f7a27515f89597e53f8171829c733298 (MD5). Previous issue date: 2018-09 [en]
dc.description.statementofresponsibility: by Arda Söylev. [en_US]
dc.format.extent: xv, 120 leaves : charts (some color) ; 30 cm. [en_US]
dc.identifier.itemid: B158951
dc.identifier.uri: http://hdl.handle.net/11693/47879
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Structural Variation [en_US]
dc.subject: High Throughput Sequencing [en_US]
dc.subject: Combinatorial Algorithms [en_US]
dc.title: Algorithms for structural variation discovery using multiple sequence signatures [en_US]
dc.title.alternative: Çoklu dizi sinyalleri kullanarak yapısal varyasyon keşfi için algoritmalar [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D. (Doctor of Philosophy)

Files

Original bundle

Name: 10211386.pdf
Size: 3.39 MB
Format: Adobe Portable Document Format
Description: Full printable version

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: