Journal article details
BMC Biotechnology
Pattern analysis approach reveals restriction enzyme cutting abnormalities and other cDNA library construction artifacts using raw EST data
Sun Zhou [1], Guoli Ji [1], Xiaolin Liu [1], Pei Li [4], James Moler [2], John E Karro [3], Chun Liang [2]
[1] Department of Automation, Xiamen University, Xiamen, Fujian, 361005, China
[2] Department of Computer Science and Systems Analysis, Oxford, OH, 45056, USA
[3] Department of Statistics, Miami University, Oxford, OH, 45056, USA
[4] Department of Botany, Oxford, OH, 45056, USA
Keywords: Chimeric EST sequences; Restriction enzyme cutting abnormality; Pattern analysis; cDNA library construction; cDNA terminus
DOI  :  10.1186/1472-6750-12-16
Received 2011-12-18; accepted 2012-03-15; published 2012
【 Abstract 】

Background

Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. However, some GenBank dbEST sequences have proven to be “unclean”. Identifying cDNA termini/ends and their structures in raw ESTs not only facilitates data quality control and accurate delineation of transcription ends, but also furthers our understanding of the potential sources of data abnormalities/errors in the wet-lab procedures for cDNA library construction.

Results

After analyzing a total of 309,976 raw Pinus taeda ESTs, we uncovered many distinct variations of cDNA termini, some of which prove to be good indicators of wet-lab artifacts, and characterized each raw EST by its cDNA terminus structure patterns. In contrast to the expected patterns, many ESTs displayed complex and/or abnormal patterns that represent potential wet-lab errors, such as: a failure of one or both of the restriction enzymes to cut the plasmid vector; a failure of the restriction enzymes to cut the vector at the correct positions; the insertion of two cDNA inserts into a single vector; the insertion of multiple and/or concatenated adapters/linkers; and the presence of 3′-end terminal structures in designated 5′-end sequences or vice versa. A close examination of these artifacts shows that many problematic ESTs deposited into public databases by conventional bioinformatics pipelines or tools could be cleaned or filtered by our methodology. We developed a software tool for Abnormality Filtering and Sequence Trimming for ESTs (AFST, http://code.google.com/p/afst/) using a pattern analysis approach. To compare AFST with other pipelines that submitted ESTs into dbEST, we reprocessed 230,783 Pinus taeda and 38,709 Arachis hypogaea GenBank ESTs. We found that 7.4% of Pinus taeda and 29.2% of Arachis hypogaea GenBank ESTs are “unclean” or abnormal, all of which could be cleaned or filtered by AFST.
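
The idea of characterizing a raw EST by the order of its terminal features can be illustrated with a minimal Python sketch. This is not the AFST implementation. The adapter sequences, the poly(A) length threshold, the feature names (A5, A3, polyA) and the classification rules below are all hypothetical, chosen only to show how a read might be reduced to a pattern and how some of the abnormal patterns listed above (concatenated adapters, two inserts in one vector, unexpected terminus order) could then be flagged.

```python
import re

# Hypothetical feature sequences, for illustration only. A real library has
# its own vector, adapter/linker and restriction-site sequences.
FEATURES = {
    "A5": "GGCACGAGG",   # assumed 5' adapter
    "A3": "CTCGAGGGG",   # assumed 3' adapter
    "polyA": "A{12,}",   # assumed poly(A) tail: 12 or more consecutive A's
}


def terminus_pattern(read):
    """Return the feature names found in `read`, ordered by start position."""
    hits = []
    for name, regex in FEATURES.items():
        for match in re.finditer(regex, read):
            hits.append((match.start(), name))
    return [name for _, name in sorted(hits)]


def classify(pattern):
    """Assign a coarse label to a pattern of a read designated as 5'-end."""
    if not pattern:
        return "no terminus detected"
    adapters = ("A5", "A3")
    # The same adapter twice in a row suggests concatenated adapters/linkers.
    for left, right in zip(pattern, pattern[1:]):
        if left == right and left in adapters:
            return "concatenated adapters"
    # A second, non-adjacent copy of an adapter suggests a second insert.
    if pattern.count("A5") > 1 or pattern.count("A3") > 1:
        return "possible chimera"
    # A designated 5' read is assumed here to begin with the 5' adapter.
    if pattern[0] != "A5":
        return "unexpected terminus order"
    return "expected"


if __name__ == "__main__":
    a5, a3, tail = FEATURES["A5"], FEATURES["A3"], "A" * 15
    reads = {
        "clean": a5 + "ATGCGTACCTGA" + tail + a3,
        "double_adapter": a5 + a5 + "ATGCGTACCTGA" + tail + a3,
        "two_inserts": (a5 + "ATGCGTACCTGA" + tail + a3
                        + a5 + "TTGACCGGTACA" + tail + a3),
    }
    for label, read in reads.items():
        pattern = terminus_pattern(read)
        print(label, pattern, classify(pattern))
```

Exact string matching keeps the sketch short. Real Sanger reads contain sequencing errors, so a practical tool would likely need approximate matching and quality trimming before any pattern is assigned.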

Conclusions

cDNA terminal pattern analysis, as implemented in the AFST software tool, can be used to reveal wet-lab errors such as restriction enzyme cutting abnormalities and chimeric EST sequences, detect various data abnormalities embedded in existing Sanger EST datasets, and improve the accuracy of identifying and extracting bona fide cDNA inserts from raw ESTs, thereby greatly benefiting downstream EST-based applications.

【 License 】

   
2012 Zhou et al.; licensee BioMed Central Ltd.
