Browsing by Subject "Structural Variation"
Now showing 1 - 2 of 2
- Results Per Page
- Sort Options
Item Open Access Algorithms for structural variation discovery using multiple sequence signatures(2018-09) Söylev, ArdaGenomic variations including single nucleotide polymorphisms (SNPs), small INDELs and structural variations (SVs) are known to have significant phenotypic effects on individuals. Among them, SVs, that alter more than 50 nucleotides of DNA, are the major source of complex genetic diseases such as Crohn's, schizophrenia and autism. Additionally, the total number of nucleotides affected by SVs are substantially higher than SNPs (3.5 Mbp SNP, 15-20 Mbp SV). Today, we are able to perform whole genome sequencing (WGS) by utilizing high throughput sequencing technology (HTS) to discover these modifications unimaginably faster, cheaper and more accurate than before. However, as demonstrated in the 1000 Genomes Project, HTS technology still has significant limitations. The major problem lies in the short read lengths (<250 bp) produced by the current sequencing platforms and the fact that most genomes include large amounts of repeats make it very challenging to unambiguously map and accurately characterize genomic variants. Thus, most of the existing SV discovery tools focus on detecting relatively simple types of SVs such as insertions, deletions, and short inversions. In fact, other types of SVs including the complex ones are of crucial importance and several have been associated with genomic disorders. To better understand the contribution of these SVs to human genome, we need new approaches to accurately discover and genotype such variants. Therefore, there is still a need for accurate algorithms to fully characterize a broader spectrum of SVs and thus improve calling accuracy of more simple variants. Here we introduce TARDIS that harbors novel algorithms to accurately characterize various types of SVs including deletions, novel sequence insertions, inversions, transposon insertions, nuclear mitochondria insertions, tandem duplications and interspersed segmental duplications in direct or inverted orientations using short read whole genome sequencing datasets. Within our framework, we make use of multiple sequence signatures including read pair, read depth and split read in order to capture different sequence signatures and increase our SV prediction accuracy. Additionally, we are able to analyze more than one possible mapping location of each read to overcome the problems associated with repeated nature of genomes. Recently, due to the limitations of short-read sequencing technology, newer library preparation techniques emerged and 10x Genomics is one of these initiatives. This technique is regarded as a cost-effective alternative to long read sequencing, which can obtain long range contiguity information. We extended TARDIS to be able to utilize Linked-Read information of 10x Genomics to overcome some of the constraints of short-read sequencing technology. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real data sets. In the simulation experiments, TARDIS achieved 97.67% sensitivity with only 1.12% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state of the art methods. Furthermore, we showed a surprisingly low false discovery rate of our approach for discovery of tandem, direct and inverted interspersed segmental duplications prediction on CHM1 (less than 5% for the top 50 predictions). The algorithms we describe here are the first to predict insertion location and the various types of new segmental duplications using HTS data.Item Open Access Characterization of large structural variation using linked-reads(2018-08) Karaoğlanoğlu, FatihMany algorithms aimed at characterizing genomic structural variation (SV) have been developed since the inception of high-throughput sequencing. However, the full spectrum of SVs in the human genome is not yet assessed. Most of the existing methods focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced SVs with no gain or loss of genomic segments (e.g. inversions) is particularly a challenging task. Long read sequencing has been leveraged to find short inversions but there is still a need to develop methods to detect large genomic inversions. Furthermore, currently there are no algorithms to predict the insertion locus of large interspersed segmental duplications. Here we propose novel algorithms to characterize large (>40Kbp) interspersed segmental duplications and (>80Kbp) inversions using Linked-Read sequencing data. Linked-Read sequencing provides long range information, where Illumina reads are tagged with barcodes that can be used to assign short reads to pools of larger (30-50 Kbp) molecules. Our methods rely on split molecule sequence signature that we have previously described. Similar to the split read, split molecules refer to large segments of DNA that span an SV breakpoint. Therefore, when mapped to the reference genome, the mapping of these segments would be discontinuous. We redesign our earlier algorithm, VALOR, to specifically leverage Linked-Read sequencing data to discover large inversions and characterize interspersed segmental duplications. We implement our new algorithms in a new software package, called VALOR2.