| Algorithms for Molecular Biology | |
| MCOIN: a novel heuristic for determining transcription factor binding site motif width | |
| Alastair M Kilpatrick [2], Bruce Ward [1], Stuart Aitken [2] | |
| [1] School of Biological Sciences, University of Edinburgh, Darwin Building, King’s Buildings, Mayfield Road, EH9 3JR Edinburgh, Scotland | |
| [2] School of Informatics, University of Edinburgh, Informatics Forum, 10 Crichton Street, EH8 9AB Edinburgh, Scotland | |
| Keywords: Motif width; Bacterial motifs; Motif discovery; Transcription factor binding sites | |
| Others: 793291; DOI: 10.1186/1748-7188-8-16 | |
| Received 2012-12-18; accepted 2013-06-24; published 2013 | |
【 Abstract 】
Background
In transcription factor binding site discovery, the true width of the motif to be discovered is generally not known a priori. The ability to compute the most likely width of a motif is therefore a highly desirable property for motif discovery algorithms. However, this is a challenging computational problem because model dimensionality changes with motif width. The problem is further complicated because the model discovered at the true motif width need not be the most statistically significant among a set of candidate motif models. Further, the core motif discovery algorithm cannot be guaranteed to return the best possible result at each candidate width.
Results
We present MCOIN, a novel heuristic for automatically determining transcription factor binding site motif width, based on motif containment and information content. Using realistic synthetic data and previously characterised prokaryotic data, we show that MCOIN outperforms the most popular current method (the E-value of the resulting multiple alignment) as a predictor of motif width, as measured by mean absolute error. ROC analysis also shows that MCOIN chooses models which better match known sites at higher levels of motif conservation.
Conclusions
We demonstrate the performance of MCOIN as part of a deterministic motif discovery algorithm and conclude that MCOIN outperforms current methods for determining motif width.
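The abstract names two quantities, the information content of a motif and the mean absolute error of predicted widths, without defining them. The sketch below shows one common way to compute each, assuming a uniform DNA background and sites given as equal-length aligned strings. It is not the MCOIN heuristic itself, and the function names (`column_information`, `motif_information`, `mean_absolute_error`) are hypothetical, chosen only for illustration.

```python
import math

BASES = "ACGT"


def column_information(column, background=None):
    """Information content (bits) of one aligned column of DNA letters.

    Computed as relative entropy against a background distribution.
    Assumption: uniform background (0.25 per base) unless one is supplied.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    n = len(column)
    ic = 0.0
    for b in BASES:
        p = column.count(b) / n
        if p > 0:  # a zero-frequency base contributes nothing
            ic += p * math.log2(p / background[b])
    return ic


def motif_information(sites):
    """Total information content of a set of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    return sum(
        column_information([s[i] for s in sites]) for i in range(width)
    )


def mean_absolute_error(predicted_widths, true_widths):
    """Mean absolute difference between predicted and true motif widths."""
    if len(predicted_widths) != len(true_widths):
        raise ValueError("inputs must have the same length")
    total = sum(abs(p - t) for p, t in zip(predicted_widths, true_widths))
    return total / len(true_widths)
```

Under these assumptions, a perfectly conserved column scores 2 bits and a column with all four bases at equal frequency scores 0 bits, so a fully conserved motif of width 4 totals 8 bits.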
【 License 】
2013 Kilpatrick et al.; licensee BioMed Central Ltd.