Journal article details
BMC Genomics
Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing
Mark R Chance [4], Xiaofeng Zhu [2], Zhenghe Wang [4], Thomas LaFramboise [4], Li Li [1], Bamidele O Tayo [5], Richard S Cooper [5], Martina Veigl [6], Min Xiang [3], Sean Maxwell [8], Mehmet Koyutürk [7], Yu Liu [8]
[1] Department of Family Medicine and Community Health, Case Western Reserve University, Cleveland, OH, USA
[2] Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
[3] Department of Pharmacy, Suzhou Health College, Suzhou, Jiangsu 215009, P. R. China
[4] Department of Genetics and Genome Science, Case Western Reserve University, Cleveland, OH, USA
[5] Department of Public Health Sciences, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, USA
[6] Department of General Medical Sciences, Case Western Reserve University, Cleveland, OH, USA
[7] Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, USA
[8] Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, USA
Keywords: Genome evolution; Transcription factor binding; Expression in brain; Next generation sequencing; De novo assembling; Missing common sequence
DOI: 10.1186/1471-2164-15-685
Received 2014-02-25; accepted 2014-08-04; published 2014
【 Abstract 】

Background

Sequences up to several megabases in length have been found to be present in individual genomes but absent from the human reference genome. These sequences may be common in populations, and their absence from the reference genome may indicate rare variants in the genomes of the individuals who served as donors for the Human Genome Project. Because the reference genome is used for probe design in microarray technology and for mapping short reads in next generation sequencing (NGS), these missing sequences could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent from the reference genome. However, no study has yet investigated the distribution, evolution and functionality of those sequences in human populations.

Results

To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population but absent from the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence by PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting that they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution, culminating in most micSeqs being present in Africans, which suggests that some micSeqs may be important sources of human diversity.
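
The read-pooling step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it only assumes that each read carries a standard SAM FLAG value, treats a pair with exactly one mapped end as OEA and a pair with neither end mapped as orphan, and collects the unmapped ends of OEA pairs across samples. The function names (`classify_read`, `pool_oea_reads`) and the input layout are hypothetical, and the abstract's filtering and assembly steps are not shown.

```python
# Bit values of the SAM FLAG field (defined by the SAM specification).
PAIRED = 0x1          # read comes from a paired-end template
READ_UNMAPPED = 0x4   # this read did not align to the reference
MATE_UNMAPPED = 0x8   # the other end of the pair did not align


def classify_read(flag: int) -> str:
    """Classify one read by the mapping status of itself and its mate."""
    if not flag & PAIRED:
        return "unpaired"
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "orphan"          # neither end aligned
    if read_unmapped:
        return "oea_unmapped"    # unaligned end; its mate is the anchor
    if mate_unmapped:
        return "oea_anchor"      # aligned end of a One End Anchor pair
    return "mapped_pair"


def pool_oea_reads(samples):
    """Pool unmapped OEA ends across samples.

    `samples` is a hypothetical layout: a dict mapping a sample ID to an
    iterable of (read_name, flag, sequence) tuples.
    """
    pooled = []
    for sample_id, records in samples.items():
        for name, flag, seq in records:
            if classify_read(flag) == "oea_unmapped":
                pooled.append((sample_id, name, seq))
    return pooled


if __name__ == "__main__":
    demo = {
        "sampleA": [("r1", 69, "ACGT"), ("r1", 137, "TTGA"), ("r2", 99, "GGCC")],
        "sampleB": [("r3", 77, "AAAA"), ("r4", 69, "CCGT")],
    }
    for row in pool_oea_reads(demo):
        print(row)
```

Keeping the sample ID with each pooled read is a design choice for this sketch: it would allow a later step to count how many individuals support an assembled sequence, which is the kind of information a population-frequency threshold needs.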

Conclusions

A comparative genomics approach confirmed 76% of the micSeqs. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific and conserved, and may play a role in the evolution of primates.

【 License 】

   
2014 Liu et al.; licensee BioMed Central Ltd.
