Frontiers in Applied Mathematics and Statistics
Text Classification Using the N-Gram Graph Representation Model Over High Frequency Data Streams
Tserpes, Konstantinos [1]; Violos, John [1]; Varvarigou, Theodora [2]; Varlamis, Iraklis [2]
[1] Department of Electrical and Computer Engineering, National Technical University of Athens, Greece; Department of Informatics and Telematics, Harokopio University of Athens, Greece
Keywords: text classification; text streaming; N-gram graph; Beam; cloud computing
DOI: 10.3389/fams.2018.00041
Subject classification: Mathematics (general)
Source: Frontiers
【 Abstract 】
A prominent challenge of the information age is classification over high-frequency data streams. In this research, we propose an innovative and highly accurate text stream classification model that is designed in an elastic, distributed way and can serve text loads of fluctuating frequency. In this model, text is represented as N-Gram Graphs, and classification combines text preprocessing, graph similarity, and feature classification techniques following the supervised machine learning approach. The work analyzes many variations of the proposed model and its parameters, such as different representations of text as N-Gram Graphs, graph comparison metrics, and classification methods, in order to identify the most accurate setup. To address scalability, availability, and timely response under high-frequency text, we employ the Beam programming model. With Beam, the classification process runs as a sequence of distinct tasks, which facilitates the distributed implementation of the most computationally demanding tasks of the inference stage. The proposed model and its parameters are evaluated experimentally, with the high-frequency stream emulated using two public datasets (20NewsGroup and Reuters-21578) that are commonly used in the text classification literature.
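The abstract does not give the graph construction or the similarity metrics in detail, so the following is only a minimal sketch of the general idea, under stated assumptions: vertices are character n-grams, an edge links two n-grams that occur within a small window of each other, and a document is compared with one merged graph per class. The function names, the default `n` and `window` values, the summing of edge weights when merging, and the exact forms of the two similarity measures are illustrative choices, not the authors' implementation.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of undirected weighted edges.

    Vertices are character n-grams; an edge links two n-grams whose start
    positions are at most `window` apart, weighted by co-occurrence count.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + window + 1, len(grams))):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


def containment_similarity(g1, g2):
    """Share of the smaller graph's edges that also exist in the other graph."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


def value_similarity(g1, g2):
    """Like containment, but each shared edge counts by its min/max weight ratio."""
    if not g1 or not g2:
        return 0.0
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in g1 if e in g2)
    return total / max(len(g1), len(g2))


def class_graphs(labelled_docs, n=3, window=3):
    """Merge the training documents of each class into one graph.

    Assumption: edge weights are simply summed, which is a simplification.
    """
    graphs = {}
    for label, text in labelled_docs:
        graphs.setdefault(label, Counter()).update(ngram_graph(text, n, window))
    return graphs


def similarity_features(text, graphs, n=3, window=3):
    """Return (containment, value) similarity of `text` to every class graph."""
    doc = ngram_graph(text, n, window)
    return {
        label: (containment_similarity(doc, cg), value_similarity(doc, cg))
        for label, cg in graphs.items()
    }
```

In a setup like the one the abstract describes, the per-class similarity values returned by `similarity_features` would serve as the feature vector passed to a supervised classifier. Because each step is a pure function of its input, the steps could in principle be mapped onto separate stages of a distributed pipeline.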
【 License 】
CC BY