Journal Article Details
BMC Research Notes
PCP-ML: Protein characterization package for machine learning
Zheng Wang [1], Jesse Eickholt [2]
[1] School of Computing, University of Southern Mississippi, Hattiesburg, MS 39406, USA
[2] Department of Computer Science, Central Michigan University, Mount Pleasant, MI 48859, USA
Keywords: Machine learning; Protein software package; Protein characterization; Protein structure prediction
Others  :  1123247
DOI  :  10.1186/1756-0500-7-810
Received 2014-01-27; accepted 2014-10-31; published 2014
【 Abstract 】

Background

Machine learning (ML) has a number of demonstrated applications in protein prediction tasks such as protein structure prediction. To speed the further development of ML-based tools and their release to the community, we have developed a package that characterizes several aspects of a protein commonly used in ML-based protein prediction tasks.

Findings

A number of software libraries and modules exist for handling protein-related data. The package we present in this work, PCP-ML, is unique in its small footprint and its emphasis on machine learning. Its primary focus is on characterizing various aspects of a protein through sets of numerical data, which can then be used with machine learning tools and techniques. PCP-ML is flexible in how the generated data are formatted and, as a result, is compatible with a variety of existing machine learning packages. Given its small size, it can be packaged and distributed directly with community-developed tools for protein prediction tasks.
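
To make the idea of characterizing a protein through numerical data concrete, the following is a minimal Python sketch. It is not the PCP-ML API: every function name here is hypothetical, and the one-hot sliding-window encoding is only one common choice of residue-level feature, assumed for illustration. The sketch shows the same feature vector written in two formats that existing machine learning packages typically accept: the sparse SVMlight/LIBSVM format and a dense comma-separated line.

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT part of PCP-ML.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Return a 20-dimensional one-hot vector; unknown residues map to all zeros."""
    vec = [0.0] * len(AMINO_ACIDS)
    idx = AA_INDEX.get(residue.upper())
    if idx is not None:
        vec[idx] = 1.0
    return vec


def window_features(sequence, center, half_width=1):
    """Concatenate one-hot vectors for the residues in a window around `center`.

    Positions that fall outside the sequence are padded with zeros.
    """
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(sequence):
            features.extend(one_hot(sequence[pos]))
        else:
            features.extend([0.0] * len(AMINO_ACIDS))
    return features


def to_svmlight(label, features):
    """Format as a sparse line: 'label index:value ...' with 1-based indices."""
    pairs = [f"{i + 1}:{v:g}" for i, v in enumerate(features) if v != 0.0]
    return " ".join([str(label)] + pairs)


def to_csv(label, features):
    """Format as a dense comma-separated line with the label in the last column."""
    return ",".join([f"{v:g}" for v in features] + [str(label)])


if __name__ == "__main__":
    feats = window_features("ACD", center=0, half_width=1)
    print(to_svmlight(1, feats))
    print(to_csv(1, feats))
```

Keeping feature generation separate from output formatting, as in this sketch, is one way a small package can stay compatible with several downstream learners without depending on any of them.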

Conclusions

Source code and example programs are available under a BSD license at http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/. The package is implemented in C++ and is accessible as a Python module.

【 License 】

   
© 2014 Eickholt and Wang; licensee BioMed Central Ltd.
