Algorithms for Molecular Biology
Jaccard index based similarity measure to compare transcription factor binding site models
Ilya E Vorontsov [2], Ivan V Kulakovskiy [1], Vsevolod J Makeev [3]
[1] Department of Computational Systems Biology, Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina str. 3, Moscow 119991, GSP-1, Russia
[2] Data Analysis Department, Yandex Data Analysis School, Moscow Institute of Physics and Technology, Leo Tolstoy str. 16, Moscow 119021, Russia
[3] Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny 141700, Moscow Region, Russia
Keywords: Macroape; PSFM; Position specific frequency matrix; P-value; PWM; Position weight matrix; Jaccard similarity; Binding motif; Transcription factor binding site model; TFBS; Transcription factor binding site
DOI: 10.1186/1748-7188-8-23
Received 2012-05-25, accepted 2013-09-18, published 2013
Abstract
Background
The positional weight matrix (PWM) remains the most popular model for quantifying transcription factor (TF) binding. A PWM supplied with a score threshold defines a set of putative transcription factor binding sites (TFBS), thus providing a TFBS model.
TF-binding DNA fragments obtained by different experimental methods usually give similar but not identical PWMs. The same holds for different TFs from the same structural family, so it is often necessary to measure the similarity between PWMs. Popular tools compare PWMs directly, using the matrix elements. Yet, for log-odds PWMs, negative elements do not contribute to the scores of high-scoring TFBS and thus may differ without affecting the sets of best-recognized binding sites. Moreover, the two TFBS sets recognized by a given pair of PWMs can be more or less different depending on the score thresholds.
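To make the two observations above concrete, here is a minimal sketch in Python. It assumes a PWM is stored as a list of per-position dictionaries and simply enumerates every word of the PWM's length; the matrices, thresholds, and the function names `score` and `recognized_words` are all hypothetical and chosen for illustration only. The two toy matrices share their positive elements and differ only in some negative ones.

```python
from itertools import product

ALPHABET = "ACGT"

def score(pwm, word):
    """Sum the per-position weights of a word under a PWM (list of dicts)."""
    return sum(column[letter] for column, letter in zip(pwm, word))

def recognized_words(pwm, threshold):
    """Return every word of the PWM's length scoring at or above the threshold."""
    return {
        "".join(word)
        for word in product(ALPHABET, repeat=len(pwm))
        if score(pwm, word) >= threshold
    }

# Hypothetical toy PWMs: identical positive elements, different negative ones.
pwm_a = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.5},
    {"A": -1.0, "C": 1.2, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": 0.3},
]
pwm_b = [
    {"A": 1.0, "C": -3.0, "G": -2.0, "T": 0.5},
    {"A": -2.5, "C": 1.2, "G": -0.5, "T": -4.0},
    {"A": -3.0, "C": -2.0, "G": 1.0, "T": 0.3},
]

# Strict threshold: only words built from positive elements pass,
# so both matrices recognize the same set.
strict_a = recognized_words(pwm_a, 2.2)
strict_b = recognized_words(pwm_b, 2.2)

# Relaxed threshold: words containing one negative element can pass
# under pwm_a but not under pwm_b, so the sets diverge.
relaxed_a = recognized_words(pwm_a, 1.0)
relaxed_b = recognized_words(pwm_b, 1.0)

print(sorted(strict_a), strict_a == strict_b)
print(sorted(relaxed_a - relaxed_b))
```

Exhaustive enumeration costs 4^w operations for a PWM of width w, so it is only practical for short toy matrices like these.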
Results
We propose a practical approach for comparing two TFBS models, each consisting of a PWM and the respective scoring threshold. The proposed measure is a variant of the Jaccard index between two TFBS sets. The measure defines a metric space for TFBS models of all finite lengths. The algorithm can compare TFBS models constructed using substantially different approaches, such as PWMs built from raw positional counts and log-odds PWMs. We present an efficient software implementation: MACRO-APE (MAtrix CompaRisOn by Approximate P-value Estimation).
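As a rough illustration of the idea, the sketch below computes the plain Jaccard index between the word sets recognized by two TFBS models. It is not the MACRO-APE algorithm: it assumes both PWMs have the same length and are compared without shifts, it enumerates all words instead of estimating P-values, and the matrices, thresholds, and names (`recognized_words`, `jaccard_similarity`) are hypothetical. Returning 1.0 for two empty sets is likewise an assumed convention.

```python
from itertools import product

def recognized_words(pwm, threshold, alphabet="ACGT"):
    """Return every word of the PWM's length scoring at or above the threshold."""
    words = set()
    for letters in product(alphabet, repeat=len(pwm)):
        total = sum(column[letter] for column, letter in zip(pwm, letters))
        if total >= threshold:
            words.add("".join(letters))
    return words

def jaccard_similarity(model_a, model_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of the word sets of two (pwm, threshold) models."""
    set_a = recognized_words(*model_a)
    set_b = recognized_words(*model_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(union)

# Hypothetical model built from raw positional counts.
count_model = (
    [{"A": 8, "C": 1, "G": 1, "T": 0}, {"A": 0, "C": 7, "G": 2, "T": 1}],
    10,
)
# Hypothetical log-odds model for a similar motif.
logodds_model = (
    [
        {"A": 1.5, "C": -1.0, "G": -1.0, "T": -2.0},
        {"A": -2.0, "C": 1.0, "G": 0.25, "T": -1.0},
    ],
    0.4,
)

similarity = jaccard_similarity(count_model, logodds_model)
distance = 1.0 - similarity  # Jaccard (Tanimoto) distance
print(similarity, distance)
```

Because the comparison is made on the recognized word sets rather than on matrix elements, the two models can use entirely different score scales, which is why a count matrix and a log-odds matrix can be compared directly here.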
Conclusions
MACRO-APE can be used effectively to compute the Jaccard index based similarity for two TFBS models. We also present a two-pass scanning algorithm that searches a given collection of PWMs for those similar to a given query.
Availability and implementation
MACRO-APE is implemented in Ruby 1.9; the software, including source code and a manual, is freely available at http://autosome.ru/macroape/ and in the supplementary materials.
License
2013 Vorontsov et al.; licensee BioMed Central Ltd.