Journal article details
BioData Mining
Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends
Emad A Mohammed [2], Behrouz H Far [2], Christopher Naugler [1]
[1] Department of Pathology and Laboratory Medicine, University of Calgary and Calgary Laboratory Services, Calgary, AB, Canada
[2] Department of Electrical and Computer Engineering, Schulich School of Engineering, University of Calgary, Calgary, AB, Canada
Keywords: Distributed programming; Bioinformatics; Clinical data analysis; Clinical big data analysis; Big data; Hadoop; MapReduce
DOI: 10.1186/1756-0381-7-22
Received 2014-06-05; accepted 2014-10-18; published 2014
【 Abstract 】

The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. These so-called “big data” challenge traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital that these big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data.

The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework, and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphics processing unit (GPU) computing), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage, which yields reliable data processing by replicating the computing tasks and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation.
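
The Map and Reduce tasks described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code and does not use any Hadoop API; the record layout, the diagnosis codes, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made up for illustration. On a real cluster, the framework would run many map and reduce tasks in parallel on different nodes and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit one (code, 1) pair for every diagnosis code in a record."""
    return [(code, 1) for code in record["codes"]]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially in one process."""
    intermediate = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical patient records, invented for this example.
records = [
    {"patient": "p1", "codes": ["E11", "I10"]},
    {"patient": "p2", "codes": ["I10"]},
    {"patient": "p3", "codes": ["E11", "I10", "J45"]},
]

counts = run_job(records)
print(counts)  # {'E11': 2, 'I10': 3, 'J45': 1}
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` depends only on the values for its own key, both phases can be distributed across nodes without shared state, which is the property that makes the model suitable for batch processing of large datasets.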

In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and to highlight what might be needed to enhance the outcomes of clinical big data analytics tools. The paper concludes by summarizing the potential usage of the MapReduce programming framework and the Hadoop platform to process huge volumes of clinical data in medical health informatics-related fields.

【 License 】

   
2014 Mohammed et al.; licensee BioMed Central Ltd.

【 Figures 】

Figure 1.

Figure 2.

Figure 3.
