Journal article details
BMC Medical Informatics and Decision Making
A user-friendly tool to transform large scale administrative data into wide table format using a mapreduce program with a pig latin based script
Software
Hideki Hashimoto1  Hiromasa Horiguchi2  Hideo Yasunaga2  Kazuhiko Ohe3 
[1] Department of Health Economics and Epidemiology Research, School of Public Health, The University of Tokyo, Tokyo, Japan; Department of Health Management and Policy, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, 1138555, Bunkyo-ku, Tokyo, Japan; Department of Medical Informatics and Economics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
Keywords: MapReduce; Pig Latin; Large scale administrative data; User-defined functions
DOI: 10.1186/1472-6947-12-151
Received 2012-07-24, accepted 2012-12-13, published 2012
Source: Springer
【 Abstract 】

Background: Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand. MapReduce technology such as Hadoop is a promising tool for this purpose, though its use has been limited by the lack of user-friendly functions for transforming large scale data into wide table format, where each subject is represented by one row, for use in health services and clinical research. Since the original specification of Pig provides very few functions for column field management, we have developed a novel system called GroupFilterFormat to handle the definition of field and data content based on a Pig Latin script. We have also developed, as an open-source project, several user-defined functions to transform the table format using GroupFilterFormat and to deal with processing that considers date conditions.

Results: Having prepared dummy discharge summary data for 2.3 million inpatients and medical activity log data for 950 million events, we used the Elastic Compute Cloud environment provided by Amazon Inc. to execute processing speed and scaling benchmarks. In the speed benchmark test, the response time was significantly reduced and a linear relationship was observed between the quantity of data and processing time in both a small and a very large dataset. The scaling benchmark test showed clear scalability. In our system, doubling the number of nodes resulted in a 47% decrease in processing time.

Conclusions: Our newly developed system is widely accessible as an open resource. This system is very simple and easy to use for researchers who are accustomed to using declarative command syntax for commercial statistical software and Structured Query Language. Although our system needs further sophistication to allow more flexibility in scripts and to improve efficiency in data processing, it shows promise in facilitating the application of MapReduce technology to efficient data processing with large scale administrative data in health services and clinical research.
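
To make the notion of "wide table format" concrete, the following is a minimal single-machine sketch in Python of turning a long event log (one row per event) into one row per subject. It is an illustration only and is not the authors' GroupFilterFormat system or its Pig Latin syntax; the sample records, the column names, and the `to_wide` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format activity log: one row per (patient, date, activity).
events = [
    ("P001", "2012-04-01", "CT"),
    ("P001", "2012-04-03", "MRI"),
    ("P002", "2012-04-02", "CT"),
    ("P001", "2012-04-05", "CT"),
]


def to_wide(rows, columns):
    """Return one dict per patient with a count column for each activity."""
    # Group events by patient ID (the role of the map/shuffle step) and
    # count each activity of interest per patient (the role of the reduce step).
    counts = defaultdict(lambda: dict.fromkeys(columns, 0))
    for patient_id, _date, activity in rows:
        if activity in columns:
            counts[patient_id][activity] += 1
    # Emit one row per patient, sorted by patient ID for stable output.
    return [{"patient_id": pid, **counts[pid]} for pid in sorted(counts)]


wide = to_wide(events, ["CT", "MRI"])
for row in wide:
    print(row)
```

In this sketch a patient appears in the output only if at least one of their events matches a requested column; a MapReduce implementation would distribute the grouping and counting across nodes rather than hold everything in memory.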

【 License 】

CC BY   
© Horiguchi et al.; licensee BioMed Central Ltd. 2012
