期刊论文详细信息
BMC Bioinformatics
SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models
Ian Reid2  Nicholas O’Toole2  Omar Zabaneh1  Reza Nourzadeh1  Mahmoud Dahdouli1  Mostafa Abdellateef1  Paul MK Gordon1  Jung Soh1  Gregory Butler2  Christoph W Sensen1  Adrian Tsang2 
[1] Faculty of Medicine, Visual Genomics Centre, University of Calgary, 3330 Hospital Drive NW, Calgary, AB T2N 4N1, Canada
[2] Centre for Structural and Functional Genomics, Concordia University, 7141 Sherbrooke St. W, Montreal, QC H4B 1R6, Canada
关键词: Neurospora crassa;    Thermomyces lanuginosus;    Phanerochaete chrysosporium;    Aspergillus niger;    Fungi;    Gene prediction;    RNA-Seq;   
Others  :  818304
DOI  :  10.1186/1471-2105-15-229
 received in 2013-08-15, accepted in 2014-06-17,  发布年份 2014
PDF
【 摘 要 】

Background

Locating the protein-coding genes in novel genomes is essential to understanding and exploiting the genomic information but it is still difficult to accurately predict all the genes. The recent availability of detailed information about transcript structure from high-throughput sequencing of messenger RNA (RNA-Seq) delineates many expressed genes and promises increased accuracy in gene prediction. Computational gene predictors have been intensively developed for and tested in well-studied animal genomes. Hundreds of fungal genomes are now or will soon be sequenced. The differences of fungal genomes from animal genomes and the phylogenetic sparsity of well-studied fungi call for gene-prediction tools tailored to them.

Results

SnowyOwl is a new gene prediction pipeline that uses RNA-Seq data to train and provide hints for the generation of Hidden Markov Model (HMM)-based gene predictions and to evaluate the resulting models. The pipeline has been developed and streamlined by comparing its predictions to manually curated gene models in three fungal genomes and validated against the high-quality gene annotation of Neurospora crassa; SnowyOwl predicted N. crassa genes with 83% sensitivity and 65% specificity. SnowyOwl gains sensitivity by repeatedly running the HMM gene predictor Augustus with varied input parameters and selectivity by choosing the models with best homology to known proteins and best agreement with the RNA-Seq data.

Conclusions

SnowyOwl efficiently uses RNA-Seq data to produce accurate gene models in both well-studied and novel fungal genomes. The source code for the SnowyOwl pipeline (in Python) and a web interface (in PHP) is freely available from http://sourceforge.net/projects/snowyowl/ webcite.

【 授权许可】

   
2014 Reid et al.; licensee BioMed Central Ltd.

【 预 览 】
附件列表
Files Size Format View
20140711093359372.pdf 974KB PDF download
Figure 7. 104KB Image download
Figure 6. 19KB Image download
Figure 5. 42KB Image download
Figure 4. 43KB Image download
Figure 3. 76KB Image download
Figure 2. 62KB Image download
Figure 1. 59KB Image download
【 图 表 】

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Figure 7.

【 参考文献 】
  • [1]Majoros WH: Methods for Computational Gene Prediction. New York: Cambridge University Press; 2007.
  • [2]Stanke M, Waack S: Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 2003, 19(Suppl 2):ii215-ii225.
  • [3]Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome Res 2000, 10:516-522.
  • [4]Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M: Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008, 18:1979-1990.
  • [5]Korf I: Gene finding in novel genomes. BMC Bioinformatics 2004, 5:59.
  • [6]Majoros WH, Pertea M, Salzberg SL: TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 2004, 20:2878-2879.
  • [7]DeCaprio D, Vinson JP, Pearson MD, Montgomery P, Doherty M, Galagan JE: Conrad: gene prediction using conditional random fields. Genome Res 2007, 17:1389-1398.
  • [8]Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, Kruger N, Sonnenburg S, Ratsch G: mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res 2009, 19:2133-2143.
  • [9]Blanco E, Parra G, Guigó R: Using geneid to identify genes. Curr Protoc Bioinformatics 2007, 18:4.3.1-4.3.28.
  • [10]Birney E, Durbin R: Using GeneWise in the Drosophila annotation experiment. Genome Res 2000, 10:547-548.
  • [11]Keller O, Kollmar M, Stanke M, Waack S: A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics 2011, 27:757-763.
  • [12]Stanke M, Diekhans M, Baertsch R, Haussler D: Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 2008, 24:637-644.
  • [13]Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 2003, 31:5654-5666.
  • [14]Allen JE, Salzberg SL: JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 2005, 21:3596-3603.
  • [15]Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008, 18:1509-1517.
  • [16]Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012, 28:1086-1092.
  • [17]Grabherr M, Haas B, Yassour M, Levin J, Thompson D, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren B, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 2011, 29:644-652.
  • [18]Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010, 28:511-515.
  • [19]Steijger T, Abril JF, Engstrom PG, Kokocinski F, Akerman M, Alioto T, Ambrosini G, Antonarakis SE, Behr J, Bohnert R, Bucher P, Cloonan N, Derrien T, Djebali S, Du J, Dudoit S, Gerstein M, Gingeras TR, Gonzalez D, Grimmond SM, Habegger L, Hubbard TJ, Iseli C, Jean G, Kahles A, Lagarde J, Leng J, Lefebvre G, Lewis S, Mortazavi A, et al.: Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 2013, 10:1177-1184.
  • [20]Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, Lewis SE: Genome annotation assessment in Drosophila melanogaster. Genome Res 2000, 10:483-501.
  • [21]Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG: EGASP: the human ENCODE genome annotation assessment project. Genome Biol 2006, 7(Suppl 1):S2.1-S31.
  • [22]Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, Stein LD: nGASP–the nematode genome annotation assessment project. BMC Bioinformatics 2008, 9:549.
  • [23]1000 Fungal Genomes Project http://1000.fungalgenomes.org webcite
  • [24]Fungal Genome Initiative http://www.broadinstitute.org/scientific-community/science/projects/fungal-genome-initiative/fungal-genome-initiative webcite
  • [25]Galagan JE, Henn MR, Ma L, Cuomo CA, Birren B: Genomics of the fungal kingdom: insights into eukaryotic biology. Genome Res 2005, 15:1620-1631.
  • [26]Nakagawa S, Niimura Y, Gojobori T, Tanaka H, Miura K: Diversity of preferred nucleotide sequences around the translation initiation codon in eukaryote genomes. Nucleic Acids Res 2008, 36:861-871.
  • [27]van der Burgt A, Severing E, Collemare J, de Wit PJ: Automated alignment-based curation of gene models in filamentous fungi. BMC Bioinformatics 2014, 15:19.
  • [28]Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R, Chitsaz H, Chou WC, Corbeil J, Del Fabbro C, Docking TR, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca NA, Ganapathy G, Gibbs RA, Gnerre S, Godzaridis E, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt JB, Ho IY, et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2013, 2:10.
  • [29]Genozymes http://genome.fungalgenomics.ca webcite
  • [30]Andersen MR, Salazar MP, Schaap PJ, van de Vondervoort PJ, Culley D, Thykaer J, Frisvad JC, Nielsen KF, Albang R, Albermann K, Berka RM, Braus GH, Braus-Stromeyer SA, Corrochano LM, Dai Z, van Dijck PW, Hofmann G, Lasure LL, Magnuson JK, Menke H, Meijer M, Meijer SL, Nielsen JB, Nielsen ML, van Ooyen AJ, Pel HJ, Poulsen L, Samson RA, Stam H, Tsang A, et al.: Comparative genomics of citric-acid-producing Aspergillus niger ATCC 1015 versus enzyme-producing CBS 513.88. Genome Res 2011, 21:885-897.
  • [31]Martinez D, Larrondo LF, Putnam N, Gelpke MD, Huang K, Chapman J, Helfenbein KG, Ramaiya P, Detter JC, Larimer F, Coutinho PM, Henrissat B, Berka R, Cullen D, Rokhsar D: Genome sequence of the lignocellulose degrading fungus Phanerochaete chrysosporium strain RP78. Nat Biotechnol 2004, 22:695-700.
  • [32]Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Karra K, Krieger CJ, Miyasato SR, Nash RS, Park J, Skrzypek MS, Simison M, Weng S, Wong ED: Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res 2012, 40:D700-D705.
  • [33]Galagan J, Calvo S, Borkovich K, Selker E, Read N, Jaffe D, FitzHugh W, Ma L, Smirnov S, Purcell S, Rehman B, Elkins T, Engels R, Wang S, Nielsen CB, Butler J, Endrizzi M, Qui D, Ianakiev P, Bell-Pedersen D, Nelson MA, Werner-Washburne M, Selitrennikoff CP, Kinsey JA, Braun EL, Zelter A, Schulte U, Kothe GO, Jedd G, Mewes W, et al.: The genome sequence of the filamentous fungus Neurospora crassa. Nature 2003, 422:859-868.
  • [34]Neurospora Crassa Sequencing Project, Broad Institute of Harvard and MIT http://www.broadinstitute.org/ webcite
  • [35]Neurospora Crassa Gene Finding Methods http://www.broadinstitute.org/annotation/genome/neurospora/GeneFinding.html webcite
  • [36]BLAST+ http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download webcite
  • [37]Fungal Refseq Proteins ftp://ftp.ncbi.nih.gov/refseq/release/fungi
  • [38]Uniprot-Swissprot Database ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
  • [39]Tuque: Tools for Mapping RNA-Seq Reads to Eukaryotic Genomes http://sourceforge.net/projects/tuque/ webcite
  • [40]GBrowse http://gmod.org/wiki/GBrowse webcite
  • [41]Transcript Reconstruction Evaluation Software https://github.com/RGASP-consortium/reconstruction webcite
  • [42]Short Read Archive http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi webcite
  • [43]Benz J, Chau B, Zheng D, Bauer S, Glass N, Somerville C: A comparative systems analysis of polysaccharide-elicited responses in Neurospora crassa reveals carbon source-specific cellular adaptations. Mol Microbiol 2014, 91:275-299.
  • [44]Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, Macmanes MD, Ott M, Orvis J, Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, Henschel R, Leduc RD, Friedman N, Regev A: De novo transcript sequence reconstruction from RNA-seq using the trinity platform for reference generation and analysis. Nat Protoc 2013, 8:1494-1512.
  • [45]Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29:15-21.
  • [46]Joint Genome Institute http://genome.jgi-psf.org webcite
  • [47]Berka RM, Grigoriev IV, Otillar R, Salamov A, Grimwood J, Reid I, Ishmael N, John T, Darmond C, Moisan M, Henrissat B, Coutinho PM, Lombard V, Natvig DO, Lindquist E, Schmutz J, Lucas S, Harris P, Powlowski J, Bellemare A, Taylor D, Butler G, de Vries RP, Allijn IE, van den Brink J, Ushinsky S, Storms R, Powell AJ, Paulsen IT, Elbourne LDH, et al.: Comparative genomic analysis of the thermophilic biomass-degrading fungi Myceliophthora thermophila and Thielavia terrestris. Nat Biotechnol 2011, 29:922-927.
  • [48]Xu Y, Uberbacher EC: Automated gene identification in large-scale genomic sequences. J Comput Biol 1997, 4:325-338.
  • [49]Hinnebusch AG: Molecular mechanism of scanning and start codon selection in eukaryotes. Microbiol Mol Biol Rev 2011, 75:434-467.
  • [50]Petersen TN, Brunak S, von Heijne G, Nielsen H: SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 2011, 8:785-786.
  • [51]Hood HM, Neafsey DE, Galagan J, Sachs MS: Evolutionary roles of upstream open reading frames in mediating gene regulation in fungi. Annu Rev Microbiol 2009, 63:385-409.
  • [52]Wethmar K, Barbosa-Silva A, Andrade-Navarro MA, Leutz A: uORFdb–a comprehensive literature database on eukaryotic uORF biology. Nucleic Acids Res 2014, 42:D60-D67.
  • [53]Haas BJ, Zeng Q, Pearson MD, Cuomo CA, Wortman JR: Approaches to fungal genome annotation. Mycology 2011, 2:118-141.
  • [54]Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, Gnirke A, Regev A: Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods 2010, 7:709-715.
  • [55]Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB: Characterizing and measuring bias in sequence data. Genome Biol 2013, 14:R51.
  文献评价指标  
  下载次数:101次 浏览次数:24次