Journal article details
BMC Bioinformatics
IdentiCS – Identification of coding sequence and in silico reconstruction of the metabolic network directly from unannotated low-coverage bacterial genome sequence
An-Ping Zeng [1], Jibin Sun [1]
[1] Department of Genome Analysis, GBF-German Research Center for Biotechnology, Mascheroder Weg 1, Braunschweig, 38124, Germany
Keywords: Klebsiella pneumoniae; Salmonella typhimurium; metabolic network; comparison; visualization; in silico reconstruction; coding sequence; annotation; genome sequence; unfinished; low-coverage
Others  :  1171648
DOI  :  10.1186/1471-2105-5-112
Received 2004-05-17, accepted 2004-08-16, published 2004
【 Abstract 】

Background

A necessary step for a genome level analysis of the cellular metabolism is the in silico reconstruction of the metabolic network from genome sequences. The available methods are mainly based on the annotation of genome sequences including two successive steps, the prediction of coding sequences (CDS) and their function assignment. The annotation process takes time. The available methods often encounter difficulties when dealing with unfinished error-containing genomic sequence.

Results

In this work a fast method is proposed to use unannotated genome sequence for predicting CDSs and for an in silico reconstruction of metabolic networks. Instead of using predicted genes or CDSs to query public databases, entries from public DNA or protein databases are used as queries to search a local database of the unannotated genome sequence to predict CDSs. Functions are assigned to the predicted CDSs simultaneously. The well-annotated genome of Salmonella typhimurium LT2 is used as an example to demonstrate the applicability of the method. 97.7% of the CDSs in the original annotation are correctly identified. The use of SWISS-PROT-TrEMBL databases resulted in an identification of 98.9% of CDSs that have EC-numbers in the published annotation. Furthermore, two versions of sequences of the bacterium Klebsiella pneumoniae with different genome coverage (3.9 and 7.9 fold, respectively) are examined. The results suggest that a 3.9-fold coverage of the bacterial genome may be sufficient for the in silico reconstruction of the metabolic network. Compared to other gene finding methods such as CRITICA, our method is more suitable for exploiting sequences of low genome coverage. Based on the new method, a program called IdentiCS (Identification of Coding Sequences from Unfinished Genome Sequences) is delivered that combines the identification of CDSs with the reconstruction, comparison and visualization of metabolic networks (free to download at http://genome.gbf.de/bioinformatics/index.html).
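
The reversed querying idea can be illustrated with a minimal sketch. It is not the IdentiCS implementation: it assumes that database proteins have already been searched against the unannotated genome by some similarity search tool, and that each hit arrives with genome coordinates and a score. The `Hit` fields, the sequence identifiers, the function labels, the score threshold and the allowed overlap are all hypothetical values chosen for illustration. The sketch keeps the best-scoring hit at each locus as a predicted CDS, and the function annotation of the query comes along with it, so prediction and function assignment happen in one pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One database-entry-versus-genome hit (all fields are illustrative)."""
    query_id: str    # database protein used as the query (hypothetical ID)
    function: str    # annotation carried by the query, e.g. an EC number
    contig: str      # contig of the unannotated genome that was hit
    start: int       # 1-based inclusive start, start <= end
    end: int         # 1-based inclusive end
    strand: str      # "+" or "-"
    bitscore: float  # similarity score reported by the search tool


def overlaps(a: Hit, b: Hit, max_overlap: int) -> bool:
    """True if a and b share more than max_overlap bases on one strand."""
    if a.contig != b.contig or a.strand != b.strand:
        return False
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    return shared > max_overlap


def predict_cds(hits, min_bitscore=50.0, max_overlap=30):
    """Greedy selection: the best-scoring hit claims its genomic locus.

    The thresholds are assumptions for this sketch, not values from the paper.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h.bitscore, reverse=True):
        if hit.bitscore < min_bitscore:
            continue  # too weak to support a CDS prediction
        if all(not overlaps(hit, k, max_overlap) for k in kept):
            kept.append(hit)  # locus not yet claimed by a better hit
    return sorted(kept, key=lambda h: (h.contig, h.start))


# Toy input: two database proteins hit the same locus, one hits another
# locus on the opposite strand, and one hit is too weak to keep.
example_hits = [
    Hit("db_prot_A", "EC 1.2.1.12", "contig1", 100, 1095, "+", 650.0),
    Hit("db_prot_B", "EC 1.2.1.12", "contig1", 130, 1090, "+", 400.0),
    Hit("db_prot_C", "EC 4.1.2.13", "contig1", 1200, 2279, "-", 700.0),
    Hit("db_prot_D", "unknown", "contig2", 10, 200, "+", 20.0),
]

predicted = predict_cds(example_hits)
for cds in predicted:
    print(cds.contig, cds.start, cds.end, cds.strand, cds.function)
```

On the toy input the sketch keeps `db_prot_A` and `db_prot_C`: `db_prot_B` is dropped because a better-scoring hit already covers the same locus, and `db_prot_D` falls below the score threshold. A real pipeline would also need to handle frameshifts and contig boundaries, which matter for error-containing low-coverage sequence and are not modelled here.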

Conclusions

The reversed querying process and the program IdentiCS allow a fast and adequate prediction of protein-coding sequences and reconstruction of the potential metabolic network from low-coverage genome sequences of bacteria. The new method can accelerate the use of genomic data for studying cellular metabolism.

【 License 】

   
2004 Sun and Zeng; licensee BioMed Central Ltd.
