BMC Bioinformatics
On the amyloid datasets used for training PAFIG: how (not) to extend the experimental dataset of hexapeptides
Malgorzata Kotulska [2], Olgierd Unold [1]
[1] Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, 50-370 Wroclaw, Poland
[2] Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, 50-370 Wroclaw, Poland
Keywords: Hot spot; Intramolecular contact sites; Amyloid; Machine learning
DOI: 10.1186/1471-2105-14-351
Received 2013-04-24, accepted 2013-11-15, published 2013
【 Abstract 】
Background
Amyloids are proteins capable of forming aberrant intramolecular contact sites, characteristic of the beta zipper configuration. Amyloids can underlie serious health conditions, e.g. Alzheimer’s or Parkinson’s disease. It has been proposed that short segments of amino acids can be responsible for protein amyloidogenicity, but no more than two hundred such hexapeptides have been found experimentally. The authors of the computational tool Pafig published in BMC Bioinformatics a method for extending the amyloid hexapeptide dataset so that it could be used for training and testing models. They assumed that all hexapeptides belonging to an amyloid protein can be regarded as amylopositive, while those from proteins never reported as amyloid are always amylonegative. Here we show why this method of extending datasets is wrong and discuss why the incorrect data could nevertheless lead to a seemingly correct classification.
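The labelling assumption described above can be illustrated with a minimal Python sketch. It is not the Pafig code: the function names and the two toy sequences are hypothetical, chosen only so that the STVIIE hexapeptide occurs in both a protein marked as amyloid and one marked as non-amyloid.

```python
def hexapeptides(sequence, k=6):
    """Return every overlapping window of length k from a protein sequence."""
    sequence = sequence.strip().upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def label_by_parent_protein(proteins):
    """Collect, for each hexapeptide, the labels inherited from its parent proteins.

    `proteins` is a list of (sequence, is_amyloid) pairs. The labels mirror the
    criticised assumption and are NOT experimental evidence.
    """
    labelled = {}
    for sequence, is_amyloid in proteins:
        for peptide in hexapeptides(sequence):
            labelled.setdefault(peptide, set()).add(is_amyloid)
    return labelled


# Hypothetical toy sequences, not taken from any real dataset.
toy_proteins = [("STVIIEAAAK", True), ("GGSTVIIEGG", False)]
labels = label_by_parent_protein(toy_proteins)

# Hexapeptides that inherit both labels expose the inconsistency of the rule.
conflicting = sorted(p for p, seen in labels.items() if len(seen) > 1)
print(conflicting)
```

Under this rule a hexapeptide receives whatever label its parent protein carries, so the same six residues can end up in both classes, and every window of an amyloid protein is counted as positive regardless of whether that fragment drives fibril formation.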
Results
The amyloid classification of hexapeptides by Pafig was compared with the classification results of several state-of-the-art computational methods, and the outputs of all methods were studied by clustering analysis. The clustering shows that Pafig is an outlier with regard to the other approaches. Our study of the statistical patterns of its training and testing datasets showed a strong bias towards the STVIIE hexapeptide in their positive part. The different statistical patterns of the seemingly amylopositive and amylonegative hexapeptides allow for a repeatable classification, which is not related to the amyloid propensity of the hexapeptides.
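How one method can stand apart from the others can be sketched with pairwise distances between binary prediction vectors. The example below is a simplified stand-in for a full clustering analysis and is not the authors' procedure: the simple matching distance is an assumed choice of measure, and the method names and prediction vectors are made up for illustration.

```python
from itertools import combinations


def simple_matching_distance(a, b):
    """Fraction of positions where two equal-length binary vectors disagree."""
    if len(a) != len(b):
        raise ValueError("prediction vectors must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_distance_to_others(predictions):
    """Map each method name to its mean distance from every other method."""
    totals = {name: 0.0 for name in predictions}
    for m1, m2 in combinations(predictions, 2):
        d = simple_matching_distance(predictions[m1], predictions[m2])
        totals[m1] += d
        totals[m2] += d
    n_others = len(predictions) - 1
    return {name: total / n_others for name, total in totals.items()}


# Made-up predictions (1 = amylopositive) for eight hypothetical hexapeptides.
predictions = {
    "method_A": [1, 0, 0, 1, 0, 0, 1, 0],
    "method_B": [1, 0, 0, 1, 0, 1, 1, 0],
    "method_C": [1, 0, 1, 1, 0, 0, 1, 0],
    "method_D": [0, 1, 1, 0, 1, 1, 0, 1],
}
scores = mean_distance_to_others(predictions)
outlier = max(scores, key=scores.get)
print(outlier, round(scores[outlier], 3))
```

The same distance matrix could be passed to a hierarchical clustering routine. The mean distance is used here only because it makes the outlier visible without any further dependencies.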
Conclusions
Our study on the recognition of amyloid hexapeptides showed that the occurrence of incidental patterns in wrongly selected datasets can produce classification results that appear correct but are not. The assumption that all hexapeptides belonging to an amyloid protein can be regarded as amylopositive, and that those from proteins never reported as amyloid are always amylonegative, is not supported by any other computational method. This is in line with experimental observations that the amyloid propensity of a full protein can result from only one amyloidogenic fragment in this protein, while an amyloidogenic part that is well hidden inside the protein may never lead to fibril formation. This leads to the conclusion that Pafig does not provide correct classification with regard to amyloidogenicity.
【 License 】
2013 Kotulska and Unold; licensee BioMed Central Ltd.