Algorithms for the discovery of large genomic inversions using pooled clone sequencing
Rasekh, Marzieh Eslami
An inversion is a chromosomal rearrangement in which an internal segment of a chromosome has been broken twice, flipped 180 degrees, and rejoined. Most known examples of large inversions were found indirectly through studies of human disease, in which inversions have no detectable effect in parents but increase the risk of a disease-associated rearrangement in the offspring. The development of a map of inversion polymorphisms will provide valuable information regarding their distribution and frequency in the human genome, and will help unravel how inversions and the segmental duplication architecture associated with inverted haplotypes contribute to genomic susceptibility to disease rearrangements. The 1000 Genomes Project spearheaded the development of several methods to identify inversions; however, these are limited to relatively short inversions, and there are currently no available algorithms to discover large inversions using high-throughput sequencing (HTS) technologies. This is mainly because the breakpoints of such events typically lie within segmental duplications and common repeats, which reduces the mappability of short reads. We propose using pooled clone sequencing (PCS), a method originally developed to improve haplotype phasing, to characterize large genomic inversions. PCS merges the advantages of clone-based sequencing approaches with the speed and cost efficiency of HTS technologies. Using this sequencing data, we developed a novel algorithm, dipSeq, for discovering large inversions (>500 Kbp), following the observation that clones spanning an inversion breakpoint are split into two sections (split clones) when mapped to the reference genome. We evaluate the performance of dipSeq on three sets of simulated data, demonstrating its correctness and robustness to structural duplications and other types of structural variations. We further applied dipSeq to the genome of a HapMap individual (NA12878).
dipSeq was able to accurately discover all previously known and experimentally validated large inversions. We also identified a new inversion and confirmed it using fluorescence in situ hybridization. Although dipSeq displays a relatively high false positive rate on real data, it performed better on simulated data, suggesting that its performance on the NA12878 genome may be improved with higher depth of coverage.
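The split-clone observation described in the abstract can be illustrated with a minimal sketch. This is not the dipSeq implementation: the `Segment` type, the `candidate_split_clones` function, and the clone-size thresholds are hypothetical and chosen only for illustration. It assumes that each pool's reads have already been mapped and merged into covered intervals, and it flags pairs of shorter-than-expected intervals from the same pool that lie more than 500 Kbp apart and whose combined length is consistent with a single clone.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """A covered interval inferred from one pool's mapped reads (hypothetical type)."""
    pool: int
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


def candidate_split_clones(segments, min_clone=100_000, max_clone=200_000,
                           min_gap=500_000):
    """Return pairs of segments that could be two halves of one split clone.

    Thresholds are illustrative assumptions, not values taken from dipSeq.
    """
    # Group segments by pool and chromosome: both halves of a split clone
    # come from the same pool and map to the same chromosome.
    groups = {}
    for seg in segments:
        groups.setdefault((seg.pool, seg.chrom), []).append(seg)

    pairs = []
    for segs in groups.values():
        segs.sort(key=lambda s: s.start)
        for a, b in combinations(segs, 2):
            # A segment that already looks like a full clone is not a half.
            if a.length >= min_clone or b.length >= min_clone:
                continue
            gap = b.start - a.end
            total = a.length + b.length
            # The halves must be far apart and sum to a plausible clone size.
            if gap >= min_gap and min_clone <= total <= max_clone:
                pairs.append((a, b))
    return pairs
```

Each returned pair only marks a candidate breakpoint region. Turning such candidates into inversion calls would require combining evidence across many clones and pools, which this sketch does not attempt.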