International Conference on Computing and Applied Informatics 2016
Detecting spam comments on Indonesia's Instagram posts
Physics; Computer Science
Akbar Septiandri, Ali^1; Wibisono, Okiriza^1
School of Informatics, University of Edinburgh, 11 Crichton St, Edinburgh EH8 9LE, United Kingdom^1
Keywords: Bag of words; Classification algorithm; Latent Semantic Analysis; McNemar's tests; Products and services; Programming packages; Public figure; Sets of features
Others: https://iopscience.iop.org/article/10.1088/1742-6596/801/1/012069/pdf DOI: 10.1088/1742-6596/801/1/012069
Subject classification: Computer Science (General)
Source: IOP
【 Abstract 】
In this paper we experiment with several feature sets for detecting spam comments on social media content authored by Indonesian public figures. We define spam comments as comments that have promotional purposes (e.g. referring other users to products and services) and are thus unrelated to the content on which they are posted. Three sets of features are evaluated for detecting spam: (1) hand-engineered features such as comment length, number of capital letters, and number of emojis; (2) keyword features such as whether the comment contains advertising words or product-related words; and (3) text features, namely bag-of-words, TF-IDF, and fastText embeddings, each combined with latent semantic analysis. With 24,000 manually annotated comments scraped from Instagram posts authored by more than 100 Indonesian public figures, we compare the performance of these feature sets and their combinations using three popular classification algorithms: Naive Bayes, SVM, and XGBoost. We find that using all three feature sets (with fastText embeddings for the text features) gives the best F1-score of 0.9601 on a holdout dataset. More interestingly, fastText embeddings combined with hand-engineered features (i.e. without keyword features) yield a similar F1-score of 0.9523, and McNemar's test fails to reject the null hypothesis that the two results do not differ. This result is important because keyword features depend largely on the dataset and may not generalise as well as the other feature sets when applied to new data. For future work, we hope to collect a bigger and more diverse dataset of Indonesian spam comments, improve our model's performance and generalisability, and publish a programming package that lets others reliably detect spam comments.
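As an illustration of two ideas named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It computes the three hand-engineered features listed above and a two-sided exact McNemar p-value from the counts of holdout comments on which two classifiers disagree. The function names, the emoji pattern (a rough, non-exhaustive range of Unicode emoji blocks), and the choice of the exact binomial form of the test are all assumptions made for this example; the abstract does not state which variant of McNemar's test was used.

```python
import math
import re

# Assumption: a rough emoji matcher over common Unicode emoji blocks.
# It is not exhaustive and is not the authors' definition of "emoji".
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def hand_engineered_features(comment: str) -> dict:
    """Return the three example features named in the abstract."""
    return {
        "length": len(comment),  # number of Unicode code points
        "n_capitals": sum(1 for ch in comment if ch.isupper()),
        "n_emojis": len(EMOJI_RE.findall(comment)),
    }


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (hypothetical helper).

    b: comments model A classified correctly and model B incorrectly
    c: comments model A classified incorrectly and model B correctly
    Under the null hypothesis, each disagreement is equally likely to
    favour either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A large p-value from `mcnemar_exact` would correspond to the situation the abstract describes, where the test fails to reject the hypothesis that the two feature combinations perform the same; only the discordant counts `b` and `c` enter the test, so comments that both models classify identically do not affect it.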