Journal article details
BioData Mining
Mycoplasma contamination in the 1000 Genomes Project
William B Langdon [1]
[1] Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK
Keywords: SOLiD; 454; Solexa; High throughput; Data cleansing; Next-generation DNA sequencing; Data mining; Metagenomic; Genetics; Microbiology; Molecular biology
Others  :  795079
DOI  :  10.1186/1756-0381-7-3
Received 2013-05-23, accepted 2014-02-19, published 2014
【 Abstract 】

Background

In silico biology is increasingly important and is often based on public data. While the problem of contamination is well recognised in microbiology labs, the corresponding problem of database corruption has received less attention.

Results

Mapping 50 billion next-generation DNA sequences from the 1000 Genomes Project against published genomes reveals many that match one or more Mycoplasma genomes but are not included in the reference human genome GRCh37.p5. Many of these are of low quality, but NCBI BLAST searches confirm that some high-quality, high-entropy sequences match Mycoplasma but no human sequences.
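
As an illustration of what "high entropy" can mean for a sequencing read, the following is a minimal sketch that computes the Shannon entropy of the base composition of a DNA string, in bits per base. The function name `shannon_entropy` and the example reads are hypothetical, and this record does not state the exact entropy measure or thresholds used in the paper, so the sketch should not be read as the author's method.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per base) of the base composition of seq.

    Illustrative only: the paper's own entropy measure is not given
    in this record and may differ.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    # Sum p * log2(1/p) over the observed bases.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# A homopolymer run carries no compositional information.
low = shannon_entropy("AAAAAAAAAAAAAAAA")
# A read using all four bases equally reaches the 2-bit maximum.
high = shannon_entropy("ACGTACGTACGTACGT")
```

Single-base composition is a coarse measure: a short tandem repeat such as ACGTACGT scores as highly as a genuinely complex read, so a practical low-complexity filter would likely also consider longer k-mers.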

Conclusions

It appears that at least 7% of 1000 Genomes Project (1000G) samples are contaminated.

【 License 】

   
2014 Langdon; licensee BioMed Central Ltd.

【 Figures 】

Figures 1 to 10 (images not reproduced here).
