Journal article details
BMC Bioinformatics
Structure-revealing data fusion
Evrim Acar [3], Evangelos E Papalexakis [4], Gözde Gürdeniz [1], Morten A Rasmussen [3], Anders J Lawaetz [3], Mathias Nilsson [2], Rasmus Bro [3]
[1] Department of Nutrition, Exercise and Sports, Faculty of Science, University of Copenhagen, Frederiksberg C, Denmark
[2] School of Chemistry, University of Manchester, Oxford Road, Manchester M13 9PL, UK
[3] Department of Food Science, Faculty of Science, University of Copenhagen, Frederiksberg C, Denmark
[4] School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Keywords: MS; DOSY; NMR; Sparsity; Optimization; Coupled matrix and tensor factorizations; Data fusion
DOI: 10.1186/1471-2105-15-239
Received 2013-12-31, accepted 2014-06-26, published 2014
【 Abstract 】

Background

Analysis of data from multiple sources has the potential to enhance knowledge discovery by capturing underlying structures that are otherwise difficult to extract. Fusing data from multiple sources has already proved useful in many applications in social network analysis, signal processing and bioinformatics. However, data fusion is challenging because data from multiple sources are often (i) heterogeneous (i.e., in the form of higher-order tensors and matrices), (ii) incomplete, and (iii) composed of both shared and unshared components. To address these challenges, we introduce a novel unsupervised data fusion model based on joint factorization of matrices and higher-order tensors.

Results

The traditional formulation of coupled matrix and tensor factorizations models only shared factors and therefore fails to capture the underlying structures when both shared and unshared factors are present. The proposed data fusion model, in contrast, has the potential to automatically reveal shared and unshared components through modeling constraints. Using numerical experiments, we demonstrate the effectiveness of the proposed approach in identifying shared and unshared components. Furthermore, we measure a set of mixtures with known chemical composition using both LC-MS (Liquid Chromatography-Mass Spectrometry) and NMR (Nuclear Magnetic Resonance) and demonstrate that the structure-revealing data fusion model can (i) successfully capture the chemicals in the mixtures and accurately extract their relative concentrations, (ii) provide promising results in identifying shared and unshared chemicals, and (iii) reveal the relevant patterns in LC-MS by coupling with the diffusion NMR data.
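
As a rough illustration of what modeling shared and unshared components can look like, the following Python sketch couples a third-order tensor and a matrix through a common factor matrix and gives each data set its own component weights; a component whose weight is zero in one data set is unshared. The function names, the synthetic sizes, the L1 penalty on the weights, and the value of `beta` are assumptions made for this sketch, not details stated in the abstract.

```python
import numpy as np

def unit_columns(M):
    """Scale every column to unit Euclidean norm."""
    return M / np.linalg.norm(M, axis=0)

def cp_tensor(weights, A, B, C):
    """Sum of weighted rank-one terms: sum_r weights[r] * a_r o b_r o c_r."""
    return np.einsum("r,ir,jr,kr->ijk", weights, A, B, C)

def coupled_matrix(weights, A, V):
    """Matrix model A * diag(weights) * V^T, sharing A with the tensor."""
    return (A * weights) @ V.T

def objective(X, Y, lam, sig, A, B, C, V, beta=1e-3):
    """Squared fit error of both data sets plus an L1 penalty on the weights.

    The L1 penalty (an assumption of this sketch) pushes weights of
    components that a data set does not need towards zero.
    """
    fit_x = np.sum((X - cp_tensor(lam, A, B, C)) ** 2)
    fit_y = np.sum((Y - coupled_matrix(sig, A, V)) ** 2)
    return fit_x + fit_y + beta * (np.abs(lam).sum() + np.abs(sig).sum())

# Synthetic example: three components, one shared and two unshared.
rng = np.random.default_rng(0)
R = 3
A = unit_columns(rng.standard_normal((20, R)))  # mode shared by X and Y
B = unit_columns(rng.standard_normal((15, R)))
C = unit_columns(rng.standard_normal((10, R)))
V = unit_columns(rng.standard_normal((12, R)))

lam_true = np.array([1.0, 1.0, 0.0])  # component 3 is absent from the tensor
sig_true = np.array([1.0, 0.0, 1.0])  # component 2 is absent from the matrix
X = cp_tensor(lam_true, A, B, C)
Y = coupled_matrix(sig_true, A, V)

# A component counts as shared only if both data sets give it nonzero weight.
shared = (np.abs(lam_true) > 1e-8) & (np.abs(sig_true) > 1e-8)

f_true = objective(X, Y, lam_true, sig_true, A, B, C, V)
f_all_shared = objective(X, Y, np.ones(R), np.ones(R), A, B, C, V)
print("shared components:", shared)
print("objective with true weights:", f_true)
print("objective when every component is forced to be shared:", f_all_shared)
```

The sketch only evaluates the objective at two candidate weight settings; fitting the factors and weights would need an optimizer, which is left out here. Forcing every component to be shared adds a rank-one term that each data set does not contain, so the fit error rises.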

Conclusions

We have proposed a structure-revealing data fusion model that can jointly analyze heterogeneous, incomplete data sets with shared and unshared components and demonstrated its promising performance as well as potential limitations on both simulated and real data.

【 License 】

   
2014 Acar et al.; licensee BioMed Central Ltd.
