Journal article details
BMC Bioinformatics
Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm
Theodore R. Gibbons [1], Stephen M. Mount [2], Endymion D. Cooper [1], Charles F. Delwiche [3]
[1] Department of Cell Biology and Molecular Genetics, University of Maryland, College Park 20742, Baltimore, Maryland
[2] Center for Bioinformatics and Computational Biology, University of Maryland, College Park 20742, Baltimore, Maryland
[3] Maryland Agricultural Experiment Station, University of Maryland, College Park 20742, Baltimore, Maryland
Keywords: High-throughput sequencing; Short-read sequencing; Transcriptomics; Bioinformatics; Genomics; Graph; Homology prediction; Sequence clustering; Protein clustering; MCL
Others  :  1230987
DOI  :  10.1186/s12859-015-0625-x
Received 2014-10-16, accepted 2015-05-20, published 2015
【 Abstract 】

Background

Clustering protein sequences according to inferred homology is a fundamental step in the analysis of many large data sets. Since the publication of the Markov Clustering (MCL) algorithm in 2002, it has been the centerpiece of several popular applications. Each of these approaches generates an undirected graph that represents sequences as nodes connected to each other by edges weighted with a BLAST-based metric. MCL is then used to infer clusters of homologous proteins by analyzing these graphs. The approaches differ only in how they weight the edges, yet there has been very little direct examination of the relative performance of alternative edge-weighting metrics. This study compares the performance of four BLAST-based edge-weighting metrics: the bit score, the bit score ratio (BSR), the bit score over anchored length (BAL), and the negative common log of the expectation value (NLE). Performance is tested using the Extended CEGMA KOGs (ECK) database, which we introduce here.
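As a rough illustration of the four metrics named above, the following Python sketch computes each one from values reported in BLAST output. The abstract does not give the exact formulas, so several details are assumptions: the BSR is normalized here by the smaller of the two self-hit bit scores, the anchored length for the BAL is supplied by the caller rather than derived, an E-value of zero is mapped to a hypothetical cap of 200, and reciprocal hits are merged by keeping the larger bit score. All function names are illustrative and do not come from the paper.

```python
import math


def bit_score_ratio(bit_score, self_score_a, self_score_b):
    """BSR: bit score divided by a self-hit bit score.

    Assumption: the smaller of the two self-hit scores is the denominator.
    """
    return bit_score / min(self_score_a, self_score_b)


def bit_score_over_anchored_length(bit_score, anchored_length):
    """BAL: bit score divided by an anchored alignment length.

    Assumption: the caller computes anchored_length; the abstract does
    not say how it is derived.
    """
    return bit_score / anchored_length


def negative_log_evalue(evalue, cap=200.0):
    """NLE: negative common (base-10) log of the expectation value.

    Assumption: E-values of zero are mapped to `cap`, since log10(0)
    is undefined.
    """
    if evalue <= 0.0:
        return cap
    return min(-math.log10(evalue), cap)


def build_edges(hits):
    """Collapse directed (query, subject, bit_score) hits into undirected edges.

    Self-hits are skipped. Assumption: when both directions are present,
    the larger bit score is kept.
    """
    edges = {}
    for query, subject, bit_score in hits:
        if query == subject:
            continue
        key = tuple(sorted((query, subject)))
        edges[key] = max(bit_score, edges.get(key, 0.0))
    return edges
```

Using the raw bit score as the edge weight needs no function at all, which is part of what makes it the most tractable of the four options.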

Results

All metrics performed similarly when analyzing full-length sequences, but dramatic differences emerged as progressively larger fractions of the test sequences were split into fragments. The BSR and BAL successfully rescued subsets of clusters by strengthening certain types of alignments between fragmented sequences, but also shifted the largest correct scores down near the range of scores generated from spurious alignments. This penalty outweighed the benefits in most test cases, and was greatly exacerbated by increasing the MCL inflation parameter, making these metrics less robust than the bit score or the more popular NLE. Notably, the bit score performed as well as or better than the other three metrics in all scenarios.

Conclusions

The results provide a strong case for use of the bit score, which appears to offer equivalent or superior performance to the more popular NLE. The insight that MCL-based clustering methods can be improved using a more tractable edge-weighting metric will greatly simplify future implementations. We demonstrate this with our own minimalist Python implementation, Porthos, which uses only standard libraries and can process a graph with more than 25 million edges connecting the more than 60 thousand KOG sequences in half a minute using less than half a gigabyte of memory.
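To make the clustering step concrete, here is a minimal standard-library sketch of the general MCL procedure (column normalization, then alternating expansion and inflation). It is not Porthos and makes no claim about how Porthos is written. The function names, the self-loop weighting, the convergence tolerance, and the dense-matrix representation are all assumptions chosen for readability; the dense cubic-time expansion is only practical for toy graphs, not for graphs with millions of edges.

```python
def normalize_columns(matrix):
    """Scale each column in place so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix


def mcl(nodes, edges, inflation=2.0, max_iter=100, tol=1e-6):
    """Cluster `nodes` given undirected weighted `edges` {(a, b): weight}."""
    n = len(nodes)
    if n == 0:
        return []
    index = {node: i for i, node in enumerate(nodes)}
    m = [[0.0] * n for _ in range(n)]
    for (a, b), weight in edges.items():
        i, j = index[a], index[b]
        m[i][j] = m[j][i] = weight
    for i in range(n):
        # Assumption: self-loop equals the largest incident weight,
        # or 1.0 for an isolated node.
        m[i][i] = max(max(m[i]), 1.0)
    normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (two steps of the random walk).
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: raise entries to a power, then renormalize columns.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        delta = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if delta < tol:
            break
    # Each row that retains mass marks an attractor; the columns with
    # nonzero entries in that row are the members of its cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)
```

For example, `mcl(["a", "b", "c", "d"], {("a", "b"): 100.0, ("c", "d"): 80.0})` separates the two disconnected pairs. Raising `inflation` generally yields finer clusters, which is the parameter the Results section reports as amplifying the penalty for the BSR and BAL.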

【 License 】

   
2015 Gibbons et al.

Attachment list
Files Size Format View
Fig. 8. 42KB Image download
Fig. 7. 30KB Image download
Fig. 6. 138KB Image download
Fig. 5. 131KB Image download
Fig. 4. 29KB Image download
Fig. 3. 61KB Image download
Fig. 2. 77KB Image download
Fig. 1. 253KB Image download
【 Figures 】

Fig. 1.

Fig. 2.

Fig. 3.

Fig. 4.

Fig. 5.

Fig. 6.

Fig. 7.

Fig. 8.
