High Performance Computing Symposium 2013
A case-comparison study of automatic document classification utilizing both serial and parallel approaches
Computer Science; Physics
Wilges, B.^1; Bastos, R.C.^1; Mateus, G.P.^2; Dantas, M.A.R.^2
^1 Department of Engineering and Knowledge Management (EGC), Federal University of Santa Catarina (UFSC), Florianópolis, SC 88040-900, Brazil
^2 Department of Informatics and Statistics (INE), Federal University of Santa Catarina (UFSC), Florianópolis, SC 88040-900, Brazil
Keywords: Comparison study; Differential information; Distributed processing; Document Classification; Map-reduce programming; Open source system; Software environments; Unstructured documents
Others: https://iopscience.iop.org/article/10.1088/1742-6596/540/1/012001/pdf DOI: 10.1088/1742-6596/540/1/012001
Subject classification: Computer Science (General)
Source: IOP
【 Abstract 】
A well-known problem faced by organizations today is the high volume of available data and the processing required to transform that volume into differential information. This case-comparison study presents an automatic document classification (ADC) approach implemented under both serial and parallel paradigms. The serial approach was implemented with the RapidMiner software tool, which is recognized as the world-leading open-source system for data mining. The parallel approach used the Hadoop software environment, which follows the MapReduce programming model. The main goal of this case-comparison study is to explore the differences between these two paradigms, especially when large volumes of data such as Web text documents are used to build a category database. Many studies in the literature report that distributed processing of unstructured documents with Hadoop yields efficient results. Results from our research indicate a threshold to such efficiency.
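As a rough illustration of the MapReduce programming model named in the abstract, the sketch below classifies documents by counting keyword hits per category. It is not the authors' implementation and uses neither RapidMiner nor Hadoop; the category names, keyword sets, and function names are hypothetical, chosen only to show the map and reduce steps in plain Python. Each document is classified independently, which is the property a distributed runtime can use to spread the work across nodes.

```python
from collections import Counter
from functools import reduce

# Hypothetical category keywords; not taken from the paper.
CATEGORIES = {
    "sports": {"match", "team", "goal"},
    "finance": {"market", "stock", "bank"},
}


def map_document(text):
    """Map step: emit one (category, 1) pair per keyword hit in a document."""
    words = text.lower().split()
    return [
        (category, 1)
        for word in words
        for category, keywords in CATEGORIES.items()
        if word in keywords
    ]


def reduce_counts(acc, pair):
    """Reduce step: sum the counts emitted for each category."""
    category, n = pair
    acc[category] += n
    return acc


def classify(text):
    """Return the category with the most keyword hits, or None if there are none."""
    counts = reduce(reduce_counts, map_document(text), Counter())
    return counts.most_common(1)[0][0] if counts else None


docs = [
    "The team scored a late goal in the match",
    "The bank reported stock losses",
]
# Serial execution: one document after another on a single machine.
labels = [classify(d) for d in docs]
```

In a serial setting the final list comprehension runs on one machine; under a MapReduce runtime the same map and reduce functions would be applied to partitions of the document collection in parallel.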