Algorithms for Molecular Biology
Mutual enrichment in ranked lists and the statistical assessment of position weight matrix motifs
Limor Leibovich [1], Zohar Yakhini [2]
[1] Department of Computer Science, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel
[2] Agilent Laboratories Israel, 94 Em Hamoshavot Road, 49527 Petach-Tikva, Israel
Keywords: lncRNA; Tissue specific methylation patterns; High-throughput sequencing data analysis; Position weight matrices; Statistical enrichment
DOI: 10.1186/1748-7188-9-11
Received 2013-12-01, accepted 2014-03-30, published 2014
【 Abstract 】
Background
Statistics on ranked lists are useful for analysing molecular biology measurement data, such as differential expression, which yields ranked lists of genes, or ChIP-Seq, which yields ranked lists of genomic sequences. State-of-the-art methods study fixed motifs in ranked lists of sequences. More flexible models, such as position weight matrix (PWM) motifs, are more challenging in this context, partly because it is not clear how to avoid the use of arbitrary thresholds.
Results
To assess the enrichment of a PWM motif in a ranked list, we use a second ranking on the same set of elements, induced by the PWM. The possible orders of one ranked list relative to another can be modelled as permutations. Because of the complexity of the sample space, it is difficult to accurately characterize tail distributions in the group of permutations. In this paper we develop tight upper bounds on tail distributions of the size of the intersection of the top parts of two uniformly and independently drawn permutations. We further demonstrate the advantages of this approach by using our publicly available software implementation, mmHG-Finder, to study PWM motifs in several datasets. In addition to validating known motifs, we found GC-rich strings to be enriched amongst the promoter sequences of long non-coding RNAs that are specifically expressed in thyroid and prostate tissue samples, and we observed a statistical association with tissue-specific CpG hypo-methylation.
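The mutual-enrichment idea described above can be illustrated with a small sketch. The abstract does not define the statistic, so the code below makes two labeled assumptions: the PWM-induced ranking scores each sequence by its best-scoring window, and the enrichment score is the minimum hypergeometric tail probability over all pairs of top-list cutoffs. The function names (`rank_by_pwm`, `hypergeom_tail`, `mmhg_score`) are hypothetical and are not taken from mmHG-Finder.

```python
from math import comb

def rank_by_pwm(seqs, pwm):
    """Return sequence indices ordered best-first by PWM score.

    Assumption: a sequence's score is the maximum, over all windows,
    of the summed per-position weights. `pwm` is a list of dicts
    mapping each base to a weight; every sequence is assumed to be
    at least as long as the PWM.
    """
    k = len(pwm)

    def best_window(seq):
        return max(sum(pwm[j][seq[i + j]] for j in range(k))
                   for i in range(len(seq) - k + 1))

    return sorted(range(len(seqs)), key=lambda i: -best_window(seqs[i]))

def hypergeom_tail(N, n1, n2, b):
    """P(X >= b) when n2 of N elements are drawn and n1 are 'marked'."""
    hits = sum(comb(n1, i) * comb(N - n1, n2 - i)
               for i in range(b, min(n1, n2) + 1))
    return hits / comb(N, n2)

def mmhg_score(rank_a, rank_b):
    """Minimum hypergeometric tail over all cutoff pairs (n1, n2).

    rank_a and rank_b list the same elements, best first.
    Returns (score, n1, n2) for the best pair of cutoffs.
    """
    N = len(rank_a)
    pos_b = {x: j for j, x in enumerate(rank_b)}
    # inter[n1][n2] = size of (top n1 of rank_a) ∩ (top n2 of rank_b)
    inter = [[0] * (N + 1) for _ in range(N + 1)]
    for n1 in range(1, N + 1):
        p = pos_b[rank_a[n1 - 1]]
        for n2 in range(1, N + 1):
            inter[n1][n2] = inter[n1 - 1][n2] + (1 if p < n2 else 0)
    best = (1.0, 0, 0)
    for n1 in range(1, N + 1):
        for n2 in range(1, N + 1):
            p = hypergeom_tail(N, n1, n2, inter[n1][n2])
            if p < best[0]:
                best = (p, n1, n2)
    return best
```

Because the score is minimized over many cutoff pairs, it is not itself a p-value. The upper bounds developed in the paper address this correction and are not reproduced in this sketch.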
Conclusions
We develop tight bounds that can be calculated in polynomial time. We demonstrate the utility of mutual enrichment in motif search and assess its performance on synthetic and biological datasets. We suggest that thyroid- and prostate-specific long non-coding RNAs are regulated by transcription factors that bind GC-rich sequences, such as EGR1, SP1 and E2F3. We further suggest that this regulation is associated with DNA hypo-methylation.
【 License 】
2014 Leibovich and Yakhini; licensee BioMed Central Ltd.