Browsing by Author "Hormozdiari, F."

Open Access
Accelerating read mapping with FastHASH
(BioMed Central Ltd., 2013) Xin, H.; Lee, D.; Hormozdiari, F.; Yedkar, S.; Mutlu, O.; Alkan, C.
With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of all medical and genetic applications of next-generation sequencing critically depends on the existence of computational techniques that can process and analyze the enormous amount of sequence data quickly and accurately. Unfortunately, the current read mapping algorithms have difficulties in coping with the massive amounts of data generated by NGS. We propose a new algorithm, FastHASH, which drastically improves the performance of the seed-and-extend type hash table based read mapping algorithms, while maintaining the high sensitivity and comprehensiveness of such methods. FastHASH is a generic algorithm compatible with all seed-and-extend class read mapping algorithms. It introduces two main techniques, namely Adjacency Filtering, and Cheap K-mer Selection. We implemented FastHASH and merged it into the codebase of the popular read mapping program, mrFAST. Depending on the edit distance cutoffs, we observed up to 19-fold speedup while still maintaining 100% sensitivity and high comprehensiveness. © 2013 Xin et al.
Open Access
Demographically-based evaluation of genomic regions under selection in domestic dogs
(Public Library of Science, 2016) Freedman, A. H.; Schweizer, R. M.; Ortega-Del Vecchyo, D.; Han, E.; Davis, B. W.; Gronau, I.; Silva, P. M.; Galaverni, M.; Fan, Z.; Marx, P.; Lorente-Galdos, B.; Ramirez, O.; Hormozdiari, F.; Alkan, C.; Vilà, C.; Squire, K.; Geffen, E.; Kusak, J.; Boyko, A. R.; Parker, H. G.; Lee, C.; Tadigotla, V.; Siepel, A.; Bustamante, C. D.; Harkins, T. T.; Nelson, S. F.; Marques Bonet, T.; Ostrander, E. A.; Wayne, R. K.; Novembre, J.
Controlling for background demographic effects is important for accurately identifying loci that have recently undergone positive selection. To date, the effects of demography have not yet been explicitly considered when identifying loci under selection during dog domestication. To investigate positive selection on the dog lineage early in domestication, we examined patterns of polymorphism in six canid genomes that were previously used to infer a demographic model of dog domestication. Using an inferred demographic model, we computed false discovery rates (FDR) and identified 349 outlier regions consistent with positive selection at a low FDR. The signals in the top 100 regions were frequently centered on candidate genes related to brain function and behavior, including LHFPL3, CADM2, GRIK3, SH3GL2, MBP, PDE7B, NTAN1, and GLRA1. These regions contained significant enrichments in behavioral ontology categories. The third top hit, CCRN4L, plays a major role in lipid metabolism, which is supported by additional metabolism-related candidates revealed in our scan, including SCP2D1 and PDXC1. Comparing our method to an empirical outlier approach that does not directly account for demography, we found only modest overlaps between the two methods, with 60% of empirical outliers having no overlap with our demography-based outlier detection approach. Demography-aware approaches have lower rates of false discovery. Our top candidates for selection, in addition to expanding the set of neurobehavioral candidate genes, include genes related to lipid metabolism, suggesting a dietary target of selection that was important during the period when proto-dogs hunted and fed alongside hunter-gatherers. © 2016, Public Library of Science. All Rights Reserved.
Open Access
Discovery of tandem and interspersed segmental duplications using high-throughput sequencing
(Oxford University Press, 2019-04) Söylev, Arda; Le, T. M.; Amini, H.; Alkan, Can; Hormozdiari, F.
Motivation: Several algorithms have been developed that use high-throughput sequencing technology to characterize structural variations (SVs). Most of the existing approaches focus on detecting relatively simple types of SVs such as insertions, deletions and short inversions. In fact, complex SVs are of crucial importance and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, due to similar sequencing signatures, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions; likewise, duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms to fully characterize complex SVs and thus improve the calling accuracy of simpler variants. Results: We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short read whole genome sequencing datasets. We integrated these methods into our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real datasets. In the simulation experiments, using 30× coverage, TARDIS achieved 96% sensitivity with only a 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state-of-the-art methods. 
Furthermore, we showed a surprisingly low false discovery rate of our approach for predicting tandem, direct and inverted interspersed segmental duplications on CHM1 (<5% for the top 50 predictions).
Open Access
Fast and accurate mapping of complete genomics reads
(Academic Press, 2015) Lee, D.; Hormozdiari, F.; Xin, H.; Hach, F.; Mutlu, O.; Alkan, C.
Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to the power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform has different strengths and biases. This diversity makes it possible to use different technologies to correct for shortcomings, but it also requires developing different algorithms for each platform due to the differences in data types and error models. The first problem to tackle in analyzing HTS data for resequencing applications is the read mapping stage, where many tools have been developed for the most popular HTS methods, but publicly available and open source aligners are still lacking for the Complete Genomics (CG) platform. Unfortunately, Burrows-Wheeler based methods are not practical for CG data due to the gapped nature of the reads generated by this method. Here we provide a sensitive read mapper (sirFAST) for the CG technology based on the seed-and-extend paradigm that can quickly map CG reads to a reference genome. We evaluate the performance and accuracy of sirFAST using both simulated and publicly available real data sets, showing high precision and recall rates.
Open Access
Genome sequencing highlights the dynamic early history of dogs
(Public Library of Science, 2014) Freedman, A. H.; Gronau, I.; Schweizer, R. M.; Ortega-Del Vecchyo, D.; Han, E.; Silva, P. M.; Galaverni, M.; Fan, Z.; Marx, P.; Lorente-Galdos, B.; Beale, H.; Ramirez, O.; Hormozdiari, F.; Alkan, C.; Vilà, C.; Squire, K.; Geffen, E.; Kusak, J.; Boyko, A. R.; Parker, H. G.; Lee, C.; Tadigotla, V.; Siepel, A.; Bustamante, C. D.; Harkins, T. T.; Nelson, S. F.; Ostrander, E. A.; Marques Bonet, T.; Wayne, R. K.; Novembre, J.
To identify genetic changes underlying dog domestication and reconstruct their early evolutionary history, we generated high-quality genome sequences from three gray wolves, one from each of the three putative centers of dog domestication, two basal dog lineages (Basenji and Dingo) and a golden jackal as an outgroup. Analysis of these sequences supports a demographic model in which dogs and wolves diverged through a dynamic process involving population bottlenecks in both lineages and post-divergence gene flow. In dogs, the domestication bottleneck involved at least a 16-fold reduction in population size, a much more severe bottleneck than estimated previously. A sharp bottleneck in wolves occurred soon after their divergence from dogs, implying that the pool of diversity from which dogs arose was substantially larger than represented by modern wolf populations. We narrow the plausible range for the date of initial dog domestication to an interval spanning 11-16 thousand years ago, predating the rise of agriculture. In light of this finding, we expand upon previous work regarding the increase in copy number of the amylase gene (AMY2B) in dogs, which is believed to have aided digestion of starch in agricultural refuse. We find standing variation for amylase copy number variation in wolves and little or no copy number increase in the Dingo and Husky lineages. In conjunction with the estimated timing of dog origins, these results provide additional support to archaeological finds, suggesting the earliest dogs arose alongside hunter-gatherers rather than agriculturists. Regarding the geographic origin of dogs, we find that, surprisingly, none of the extant wolf lineages from putative domestication centers is more closely related to dogs, and, instead, the sampled wolves form a sister monophyletic clade. 
This result, in combination with dog-wolf admixture during the process of domestication, suggests that a re-evaluation of past hypotheses regarding dog origins is necessary. © 2014.
Open Access
The genome sequencing of an albino Western lowland gorilla reveals inbreeding in the wild
(BioMed Central Ltd., 2013-05-31) Prado-Martinez, J.; Hernando-Herraez, I.; Lorente-Galdos, B.; Dabad, M.; Ramirez, O.; Baeza-Delgado, C.; Morcillo-Suarez, C.; Alkan, C.; Hormozdiari, F.; Raineri, E.; Estellé, J.; Fernandez-Callejo, M.; Valles, M.; Ritscher, L.; Schöneberg, T.; Calle-Mustienes, Elisa de la; Casillas, S.; Rubio-Acero, R.; Melé, M.; Engelken, J.; Caceres, M.; Gomez-Skarmeta, L. L.; Gut, M.; Bertranpetit, J.; Gut, I. G.; Abello, T.; Eichler, E. E.; Mingarro, I.; Lalueza-Fox, C.; Navarro, A.; Marques Bonet, T.
Background: The only known albino gorilla, named Snowflake, was a male wild-born individual from Equatorial Guinea who lived at the Barcelona Zoo for almost 40 years. He was diagnosed with non-syndromic oculocutaneous albinism, i.e. white hair, light eyes, pink skin, photophobia and reduced visual acuity. Despite previous efforts to explain the genetic cause, this is still unknown. Here, we study the genetic cause of his albinism and, making use of whole genome sequencing data, find a higher inbreeding coefficient compared to other gorillas. Results: We successfully identified the causal genetic variant for Snowflake’s albinism, a non-synonymous single nucleotide variant located in a transmembrane region of SLC45A2. This transporter is known to be involved in oculocutaneous albinism type 4 (OCA4) in humans. We provide experimental evidence that shows that this amino acid replacement alters the membrane spanning capability of this transmembrane region. Finally, we provide a comprehensive study of genome-wide patterns of autozygosity revealing that Snowflake’s parents were related, this being the first report of inbreeding in a wild-born Western lowland gorilla. Conclusions: In this study we demonstrate how the use of whole genome sequencing can be extended to link genotype and phenotype in non-model organisms, and that, with the expected decrease in sequencing cost, it can be a powerful tool in conservation genetics (e.g., inbreeding and genetic diversity).
Open Access
A global reference for human genetic variation
(Nature Publishing Group, 2015) Auton, A.; Abecasis, G. R.; Altshuler, D. M.; Durbin, R. M.; Bentley, D. R.; Chakravarti, A.; Clark, A. G.; Donnelly, P.; Eichler, E. E.; Flicek, P.; Gabriel, S. B.; Gibbs, R. A.; Green, E. D.; Hurles, M. E.; Knoppers, B. M.; Korbel, J. O.; Lander, E. S.; Lee, C.; Lehrach, H.; Mardis, E. R.; Marth, G. T.; McVean, G. A.; Nickerson, D. A.; Schmidt, J. P.; Sherry, S. T.; Wang, J.; Wilson, R. K.; Boerwinkle, E.; Doddapaneni, H.; Han, Y.; Korchina, V.; Kovar, C.; Lee, S.; Muzny, D.; Reid, J. G.; Zhu, Y.; Chang, Y.; Feng, Q.; Fang, X.; Guo, X.; Jian, M.; Jiang, H.; Jin, X.; Lan, T.; Li, G.; Li, J.; Li, Y.; Liu, S.; Liu, X.; Lu, Y.; Ma, X.; Tang, M.; Wang, B.; Wang, G.; Wu, H.; Wu, R.; Xu, X.; Yin, Y.; Zhang, D.; Zhang, W.; Zhao, J.; Zhao, M.; Zheng, X.; Gupta, N.; Gharani, N.; Toji, L. H.; Gerry, N. P.; Resch, A. M.; Barker, J.; Clarke, L.; Gil, L.; Hunt, S. E.; Kelman, G.; Kulesha, E.; Leinonen, R.; McLaren, W. M.; Radhakrishnan, R.; Roa, A.; Smirnov, D.; Smith, R. E.; Streeter, I.; Thormann, A.; Toneva, I.; Vaughan, B.; Zheng-Bradley, X.; Grocock, R.; Humphray, S.; James, T.; Kingsbury, Z.; Sudbrak, R.; Albrecht, M. W.; Amstislavskiy, V. S.; Borodina, T. A.; Lienhard, M.; Mertes, F.; Sultan, M.; Timmermann, B.; Yaspo, Marie-Laure; Fulton, L.; Ananiev, V.; Belaia, Z.; Beloslyudtsev, D.; Bouk, N.; Chen, C.; Church, D.; Cohen, R.; Cook, C.; Garner, J.; Hefferon, T.; Kimelman, M.; Liu, C.; Lopez, J.; Meric, P.; O'Sullivan, C.; Ostapchuk, Y.; Phan, L.; Ponomarov, S.; Schneider, V.; Shekhtman, E.; Sirotkin, K.; Slotta, D.; Zhang, H.; Balasubramaniam, S.; Burton, J.; Danecek, P.; Keane, T. M.; Kolb-Kokocinski, A.; McCarthy, S.; Stalker, J.; Quail, M.; Davies, C. J.; Gollub, J.; Webster, T.; Wong, B.; Zhan, Y.; Campbell, C. L.; Kong, Y.; Marcketta, A.; Yu, F.; Antunes, L.; Bainbridge, M.; Sabo, A.; Huang, Z.; Coin, L. J. 
M.; Fang, L.; Li, Q.; Li, Z.; Lin, H.; Liu, B.; Luo, R.; Shao, H.; Xie, Y.; Ye, C.; Yu, C.; Zhang, F.; Zheng, H.; Zhu, H.; Alkan, C.; Dal, E.; Kahveci, F.; Garrison, E. P.; Kural, D.; Lee, W. P.; Leong, W. F.; Stromberg, M.; Ward, A. N.; Wu, J.; Zhang, M.; Daly, M. J.; DePristo, M. A.; Handsaker, R. E.; Banks, E.; Bhatia, G.; Del Angel, G.; Genovese, G.; Li, H.; Kashin, S.; McCarroll, S. A.; Nemesh, J. C.; Poplin, R. E.; Yoon, S. C.; Lihm, J.; Makarov, V.; Gottipati, S.; Keinan, A.; Rodriguez-Flores, J. L.; Rausch, T.; Fritz, M. H.; Stütz, A. M.; Beal, K.; Datta, A.; Herrero, J.; Ritchie, G. R. S.; Zerbino, D.; Sabeti, P. C.; Shlyakhter, I.; Schaffner, S. F.; Vitti, J.; Cooper, D. N.; Ball, E. V.; Stenson, P. D.; Barnes, B.; Bauer, M.; Cheetham, R. K.; Cox, A.; Eberle, M.; Kahn, S.; Murray, L.; Peden, J.; Shaw, R.; Kenny, E. E.; Batzer, M. A.; Konkel, M. K.; Walker, J. A.; MacArthur, D. G.; Lek, M.; Herwig, R.; Ding, L.; Koboldt, D. C.; Larson, D.; Ye, K.; Gravel, S.; Swaroop, A.; Chew, E.; Lappalainen, T.; Erlich, Y.; Gymrek, M.; Willems, T. F.; Simpson, J. T.; Shriver, M. D.; Rosenfeld, J. A.; Bustamante, C. D.; Montgomery, S. B.; De La Vega, F. M.; Byrnes, J. K.; Carroll, A. W.; DeGorter, M. K.; Lacroute, P.; Maples, B. K.; Martin, A. R.; Moreno-Estrada, A.; Shringarpure, S. S.; Zakharia, F.; Halperin, E.; Baran, Y.; Cerveira, E.; Hwang, J.; Malhotra, A.; Plewczynski, D.; Radew, K.; Romanovitch, M.; Zhang, C.; Hyland, F. C. L.; Craig, D. W.; Christoforides, A.; Homer, N.; Izatt, T.; Kurdoglu, A. A.; Sinari, S. A.; Squire, K.; Xiao, C.; Sebat, J.; Antaki, D.; Gujral, M.; Noor, A.; Ye, K.; Burchard, E. G.; Hernandez, R. D.; Gignoux, C. R.; Haussler, D.; Katzman, S. J.; Kent, W. J.; Howie, B.; Ruiz-Linares, A.; Dermitzakis, E. T.; Devine, S. E.; Kang, H. M.; Kidd, J. M.; Blackwell, T.; Caron, S.; Chen, W.; Emery, S.; Fritsche, L.; Fuchsberger, C.; Jun, G.; Li, B.; Lyons, R.; Scheller, C.; Sidore, C.; Song, S.; Sliwerska, E.; Taliun, D.; Tan, A.; Welch, R.; Wing, M. 
K.; Zhan, X.; Awadalla, P.; Hodgkinson, A.; Li, Y.; Shi, X.; Quitadamo, A.; Lunter, G.; Marchini, J. L.; Myers, S.; Churchhouse, C.; Delaneau, O.; Gupta-Hinch, A.; Kretzschmar, W.; Iqbal, Z.; Mathieson, I.; Menelaou, A.; Rimmer, A.; Xifara, D. K.; Oleksyk, T. K.; Fu, Y.; Liu, X.; Xiong, M.; Jorde, L.; Witherspoon, D.; Xing, J.; Browning, B. L.; Browning, S. R.; Hormozdiari, F.; Sudmant, P. H.; Khurana, E.; Tyler-Smith, C.; Albers, C. A.; Ayub, Q.; Chen, Y.; Colonna, V.; Jostins, L.; Walter, K.; Xue, Y.; Gerstein, M. B.; Abyzov, A.; Balasubramanian, S.; Chen, J.; Clarke, D.; Fu, Y.; Harmanci, A. O.; Jin, M.; Lee, D.; Liu, J.; Mu, X. J.; Zhang, J.; Zhang, Y.; Hartl, C.; Shakir, K.; Degenhardt, J.; Meiers, S.; Raeder, B.; Casale, F. P.; Stegle, O.; Lameijer, E. W.; Hall, I.; Bafna, V.; Michaelson, J.; Gardner, E. J.; Mills, R. E.; Dayama, G.; Chen, K.; Fan, X.; Chong, Z.; Chen, T.; Chaisson, M. J.; Huddleston, J.; Malig, M.; Nelson, B. J.; Parrish, N. F.; Blackburne, B.; Lindsay, S. J.; Ning, Z.; Zhang, Y.; Lam, H.; Sisu, C.; Challis, D.; Evani, U. S.; Lu, J.; Nagaswamy, U.; Yu, J.; Li, W.; Habegger, L.; Yu, H.; Cunningham, F.; Dunham, I.; Lage, K.; Jespersen, J. B.; Horn, H.; Kim, D.; Desalle, R.; Narechania, A.; Sayres, M. A. W.; Mendez, F. L.; Poznik, G. D.; Underhill, P. A.; Mittelman, D.; Banerjee, R.; Cerezo, M.; Fitzgerald, T. W.; Louzada, S.; Massaia, A.; Yang, F.; Kalra, D.; Hale, W.; Dan, X.; Barnes, K. C.; Beiswanger, C.; Cai, H.; Cao, H.; Henn, B.; Jones, D.; Kaye, J. S.; Kent, A.; Kerasidou, A.; Mathias, R.; Ossorio, P. N.; Parker, M.; Rotimi, C. N.; Royal, C. D.; Sandoval, K.; Su, Y.; Tian, Z.; Tishkoff, S.; Via, M.; Wang, Y.; Yang, H.; Yang, L.; Zhu, J.; Bodmer, W.; Bedoya, G.; Cai, Z.; Gao, Y.; Chu, J.; Peltonen, L.; Garcia-Montero, A.; Orfao, A.; Dutil, J.; Martinez-Cruzado, J. C.; Mathias, R. 
A.; Hennis, A.; Watson, H.; McKenzie, C.; Qadri, F.; LaRocque, R.; Deng, X.; Asogun, D.; Folarin, O.; Happi, C.; Omoniwa, O.; Stremlau, M.; Tariyal, R.; Jallow, M.; Joof, F. S.; Corrah, T.; Rockett, K.; Kwiatkowski, D.; Kooner, J.; Hien, T. T.; Dunstan, S. J.; ThuyHang, N.; Fonnie, R.; Garry, R.; Kanneh, L.; Moses, L.; Schieffelin, J.; Grant, D. S.; Gallo, C.; Poletti, G.; Saleheen, D.; Rasheed, A.; Brooks, L. D.; Felsenfeld, A. L.; McEwen, J. E.; Vaydylevich, Y.; Duncanson, A.; Dunn, M.; Schloss, J. A.
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies. © 2015 Macmillan Publishers Limited. All rights reserved.
Open Access
Great ape genetic diversity and population history
(Nature Publishing Group, 2013) Prado-Martinez, J.; Eichler, E. E.; Marques-Bonet, T.; Sudmant, P. H.; Kidd, J. M.; Li, H.; Kelley, J. L.; Lorente-Galdos, B.; Veeramah, K. R.; Woerner, A. E.; O’Connor, T. D.; Santpere, G.; Cagan, A.; Theunert, C.; Casals, F.; Laayouni, H.; Munch, K.; Hobolth, A.; Halager, A. E.; Malig, M.; Hernandez-Rodriguez, J.; Hernando-Herraez, I.; Prüfer, K.; Pybus, M.; Johnstone, L.; Lachmann, M.; Alkan, C.; Twig, D.; Petit, N.; Baker, C.; Hormozdiari, F.; Fernandez-Callejo, M.; Dabad, M.; Wilson, M. L.; Stevison, L.; Camprubí, C.; Carvalho, T.; Ruiz-Herrera, A.; Vives, L.; Mele, M.; Abello, T.; Kondova, I.; Bontrop, R. E.; Pusey, A.; Lankester, F.; Kiyang, J. A.; Bergl, R. A.; Lonsdorf, E.; Myers, S.; Ventura, M.; Gagneux, P.; Comas, D.; Siegismund, H.; Blanc, J.; Agueda-Calpena, L.; Gut, M.; Fulton, L.; Tishkoff, S. A.; Mullikin, J. C.; Wilson, R. K.; Gut, I. G.; Gonder, M. K.; Ryder, O. A.; Hahn, B. H.; Navarro, A.; Akey, J. M.; Bertranpetit, J.; Reich, D.; Mailund, T.; Schierup, M. H.; Hvilsom, C.; Andrés, A. M.; Wall, J. D.; Bustamante, C. D.; Hammer, M. F.
Most great ape genetic variation remains uncharacterized(1,2); however, its study is critical for understanding population history(3-6), recombination(7), selection(8) and susceptibility to disease(9,10). Here we sequence to high coverage a total of 79 wild- and captive-born individuals representing all six great ape species and seven subspecies and report 88.8 million single nucleotide polymorphisms. Our analysis provides support for genetically distinct populations within each species, signals of gene flow, and the split of common chimpanzees into two distinct groups: Nigeria-Cameroon/western and central/eastern populations. We find extensive inbreeding in almost all wild populations, with eastern gorillas being the most extreme. Inferred effective population sizes have varied radically over time in different lineages and this appears to have a profound effect on the genetic diversity at, or close to, genes in almost all species. We discover and assign 1,982 loss-of-function variants throughout the human and great ape lineages, determining that the rate of gene loss has not been different in the human branch compared to other internal branches in the great ape phylogeny. This comprehensive catalogue of great ape genome diversity provides a framework for understanding evolution and a resource for more effective management of wild and captive great ape populations.
Open Access
An integrated map of structural variation in 2,504 human genomes
(Nature Publishing Group, 2015) Sudmant, P. H.; Rausch, T.; Gardner, E. J.; Handsaker, R. E.; Abyzov, A.; Huddleston, J.; Zhang, Y.; Ye, K.; Jun, G.; Fritz, M. Hsi-Yang; Konkel, M. K.; Malhotra, A.; Stütz, A. M.; Shi, X.; Casale, F. P.; Chen, J.; Hormozdiari, F.; Dayama, G.; Chen, K.; Malig, M.; Chaisson, M. J. P.; Walter, K.; Meiers, S.; Kashin, S.; Garrison, E.; Auton, A.; Lam, H. Y. K.; Mu, X. J.; Alkan, C.; Antaki, D.; Bae, T.; Cerveira, E.; Chines, P.; Chong, Z.; Clarke, L.; Dal, E.; Ding, L.; Emery, S.; Fan, X.; Gujral, M.; Kahveci, F.; Kidd, J. M.; Kong, Y.; Lameijer, Eric-Wubbo; McCarthy, S.; Flicek, P.; Gibbs, R. A.; Marth, G.; Mason, C. E.; Menelaou, A.; Muzny, D. M.; Nelson, B. J.; Noor, A.; Parrish, N. F.; Pendleton, M.; Quitadamo, A.; Raeder, B.; Schadt, E. E.; Romanovitch, M.; Schlattl, A.; Sebra, R.; Shabalin, A. A.; Untergasser, A.; Walker, J. A.; Wang, M.; Yu, F.; Zhang, C.; Zhang, J.; Zheng-Bradley, X.; Zhou, W.; Zichner, T.; Sebat, J.; Batzer, M. A.; McCarroll, S. A.; Mills, R. E.; Gerstein, M. B.; Bashir, A.; Stegle, O.; Devine, S. E.; Lee, C.; Eichler, E. E.; Korbel, J. O.
Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.
Open Access
mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications
(Oxford University Press, 2014) Hach, F.; Sarrafi, I.; Hormozdiari, F.; Alkan, C.; Eichler, E. E.; Sahinalp, S. C.
High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to a reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Note that all of the above features are implemented within the index structure rather than as simple post-processing steps, and are thus performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. 
In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net. © 2014 The Author(s).
Open Access
Rates and patterns of great ape retrotransposition
(National Academy of Sciences, 2013) Hormozdiari, F.; Konkel, M. K.; Prado-Martinez, J.; Chiatante, G.; Herraez, I. H.; Walker, J. A.; Nelson, B.; Alkan, C.; Sudmant, P. H.; Huddleston, J.; Catacchio, C. R.; Ko, A.; Malig, M.; Baker, C.; Marques-Bonet, T.; Ventura, M.; Batzer, M. A.; Eichler, E. E.
We analyzed 83 fully sequenced great ape genomes for mobile element insertions, predicting a total of 49,452 fixed and polymorphic Alu and long interspersed element 1 (L1) insertions not present in the human reference assembly and assigning each retrotransposition event to a different time point during great ape evolution. We used these homoplasy-free markers to construct a mobile element insertions-based phylogeny of humans and great apes and demonstrate their differential power to discern ape subspecies and populations. Within this context, we find a good correlation between L1 diversity and single-nucleotide polymorphism heterozygosity (r² = 0.65) in contrast to Alu repeats, which show little correlation (r² = 0.07). We estimate that the rate of Alu retrotransposition has differed by a factor of 15 in these lineages. Humans, chimpanzees, and bonobos show the highest rates of Alu accumulation (the latter two since divergence 1.5 Mya). The L1 insertion rate, in contrast, has remained relatively constant, with rates differing by less than a factor of three. We conclude that Alu retrotransposition has been the most variable form of genetic variation during recent human-great ape evolution, with increases and decreases occurring over very short periods of evolutionary time.
Open Access
Toolkit for automated and rapid discovery of structural variants
(Academic Press, 2017) Soylev, A.; Kockan, C.; Hormozdiari, F.; Alkan, C.
Structural variations (SV) are broadly defined as genomic alterations that affect >50 bp of DNA, and they have been shown to have a significant effect on evolution and disease. The advent of high throughput sequencing (HTS) technologies and the ability to perform whole genome sequencing (WGS) make it feasible to study these variants in depth. However, discovery of all forms of SV using WGS has proven to be challenging, as the short reads produced by the predominant HTS platforms (<200 bp for current technologies) and the fact that most genomes include large amounts of repeats make it very difficult to unambiguously map and accurately characterize such variants. Furthermore, existing tools for SV discovery are primarily developed for only a few of the SV types, which may have conflicting sequence signatures (i.e. read pairs, read depth, split reads) with other, untargeted SV classes. Here we introduce a new framework, TARDIS, which combines multiple read signatures into a single package to characterize most SV types simultaneously, while preventing such conflicts. TARDIS also has a modular structure that makes it easy to extend for the discovery of additional forms of SV. © 2017 Elsevier Inc.