Journal article

【Abstract】

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another used only for testing the generality of our models. Our results point to the conclusion that only four of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.

【License】

CC BY

Journal of Big Data
Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Yahya Albalawi¹ Nikola S. Nikolov² Jim Buckley²
[1] Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland; Department of Computer and Information Sciences, College of Arts and Science, University of Taibah, Al-Ula, Saudi Arabia; The Irish Software Research Centre, Lero, University of Limerick, Limerick, Ireland
[2] Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland; The Irish Software Research Centre, Lero, University of Limerick, Limerick, Ireland
Keywords: Deep learning; Health information; Pre-trained word embeddings; Social media; Machine learning; Natural language processing; Twitter
DOI : 10.1186/s40537-021-00488-w
Source: Springer