BioData Mining
Prediction of Drosophila melanogaster gene function using Support Vector Machines
Nicholas Mitsakakis [3], Zak Razak [2], Michael Escobar [4], J Timothy Westwood [1]
[1] Department of Cell and Systems Biology, University of Toronto at Mississauga, Mississauga, Canada
[2] Canadian Drosophila Microarray Centre, University of Toronto at Mississauga, Mississauga, Canada
[3] Toronto Health Economics and Technology Assessment (THETA) Collaborative, University of Toronto, Toronto, Canada
[4] Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
Keywords: Gene function prediction; Gene expression data; Drosophila melanogaster; Support Vector Machines; Gene ontology
DOI: 10.1186/1756-0381-6-8
Received 2012-06-12; accepted 2013-02-11; published 2013
【 Abstract 】
Background
While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein-encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes that have no annotation. Large gene expression datasets from microarray experiments already exist, and many of these can be used to help assign potential functions to these genes. We applied Support Vector Machines (SVMs), a sigmoid fitting function, and a stratified cross-validation approach to a large microarray dataset from Drosophila melanogaster in order to predict possible functions for previously un-annotated genes. Approximately 5043 different genes, or about one-third of the predicted genes in the D. melanogaster genome, are represented in the dataset, and 1854 (or 37%) of these genes are un-annotated.
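The combination of a sigmoid fitting function and stratified cross-validation can be illustrated with a minimal sketch. It assumes the sigmoid is a Platt-style mapping from SVM decision values to probabilities, which the abstract does not spell out. The function names, the gradient-descent fit, and the toy scores are all hypothetical and are not taken from the paper.

```python
import math
import random


def stratified_folds(labels, k, seed=0):
    """Assign each example to one of k folds, spreading every class evenly."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[i] = pos % k  # deal class members round-robin into folds
    return folds


def fit_sigmoid(scores, labels, lr=0.1, steps=5000):
    """Fit p(s) = 1 / (1 + exp(a*s + b)) by gradient descent on the log loss."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Smoothed targets in the style of Platt, instead of hard 0/1 labels.
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            t = t_pos if y == 1 else t_neg
            p = 1.0 / (1.0 + math.exp(a * s + b))
            grad_a += (t - p) * s  # derivative of the log loss w.r.t. a
            grad_b += (t - p)      # derivative of the log loss w.r.t. b
        a -= lr * grad_a / len(scores)
        b -= lr * grad_b / len(scores)
    return a, b


def calibrated_probability(score, a, b):
    """Map a raw decision value to a probability with the fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(a * score + b))


# Toy decision values (hypothetical, not from the paper's dataset).
scores = [-2.0, -1.5, -1.0, -0.5, 0.3, -0.2, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
folds = stratified_folds(labels, k=5)
a, b = fit_sigmoid(scores, labels)
print(folds, a, b, calibrated_probability(1.0, a, b))
```

In practice the sigmoid would be fitted on decision values from held-out folds rather than on the training scores, so that the calibrated probabilities are not optimistic. The sketch fits on all scores only to stay short.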
Results
Thirty-nine Gene Ontology Biological Process (GO-BP) categories achieved a precision of at least 0.75 when recall was fixed at 0.4. For two of those categories, we provide additional support for assigning genes to the category by showing that the majority of transcripts of the genes in that category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions using a confidence score, we were able to provide a putative GO-BP term for 1422 previously un-annotated genes, or about 77% of the un-annotated genes represented on the microarray and about 19% of all un-annotated genes in the D. melanogaster genome.
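The "precision at fixed recall" criterion can be made concrete with a short sketch. It ranks genes by classifier score, walks down the ranking until the target recall is reached, and reports precision at that cutoff. The function name and the toy data are hypothetical, tied scores are not treated specially, and the authors' exact procedure may differ.

```python
def precision_at_recall(scores, labels, target_recall=0.4):
    """Precision at the highest-scoring cutoff whose recall reaches target_recall."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("no positive examples")
    # Rank examples from the most to the least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, y) in enumerate(ranked, start=1):
        true_pos += y
        if true_pos / n_pos >= target_recall:
            return true_pos / k  # fraction of the top-k predictions that are correct
    return true_pos / len(ranked)


# Toy scores for one category (hypothetical, not from the paper's dataset).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
precision = precision_at_recall(scores, labels, target_recall=0.4)
print(precision, precision >= 0.75)
```

Under the criterion stated above, a category would count as well predicted when this value is at least 0.75.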
Conclusions
Our study successfully employs a number of SVM classifiers, accompanied by detailed calibration and validation techniques, to generate predictions of new annotations for D. melanogaster genes. The probabilistic analysis applied to the SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.
【 License 】
2013 Mitsakakis et al.; licensee BioMed Central Ltd.