Journal Article Details
IEEE Access
An Unsupervised Approach for Content-Based Clustering of Emails Into Spam and Ham Through Multiangular Feature Formulation
Sami Azam [1], Asif Karim [1], Krishnan Kannoorpatti [1], Bharanidharan Shanmugam [1]
[1] College of Engineering, IT and Environment, Charles Darwin University, Darwin, NT, Australia
Keywords: Machine learning; unsupervised learning; clustering; spam detection; spam email; spam filtering
DOI: 10.1109/ACCESS.2021.3116128
Source: DOAJ
【 Abstract 】

The rapid growth of spam email attacks, and the inherent malicious dynamism of those attacks on a range of social, personal and business activities, warrant an intelligent and automated anti-spam framework. Attempts at malware propagation, identity theft, pilfering of sensitive data, and monetary as well as reputational damage are sharply increasing, endangering the privacy of the victim. Current solutions are rather incomplete when the multidimensional feature range of email is taken into account. We believe a methodology based on Artificial Intelligence, especially unsupervised machine learning, is the way forward. This research investigates the application of unsupervised learning to the clustering of Spam and Ham emails. The overall goal is to develop a framework that depends solely on unsupervised methodologies, through a clustering approach that includes multiple algorithms and primarily uses the email content (body) and the subject header. The clustering was performed on a novel binary dataset of 22,000 ham and spam emails, described by ten features (reduced from eleven by feature reduction). Seven of these ten features are unique to this study, engineered to represent impactful analytical email characteristics from a multiangular point of view. Of the five clustering algorithms investigated in this work, OPTICS produced the optimum clustering, demonstrating a 0.26% higher average efficacy than its nearest performer, DBSCAN. The average balanced accuracy for OPTICS and DBSCAN was found to be ≈75.76%.
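
The abstract reports balanced accuracy for cluster assignments. As a rough illustration only, the sketch below shows one way such a score could be computed for a two-class clustering. It assumes that each cluster is mapped to the majority true label of its members before scoring; the abstract does not state that the authors used this convention. The function names and the toy data are hypothetical and do not come from the paper's dataset.

```python
from collections import Counter


def map_clusters_to_labels(cluster_ids, true_labels):
    """Assign each cluster the majority true label of its members.

    Assumption: this majority-vote mapping is an illustrative convention,
    not necessarily the evaluation procedure used in the paper.
    """
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, Counter())[label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}


def balanced_accuracy(true_labels, predicted_labels):
    """Return the mean of the per-class recall values."""
    recalls = []
    for cls in sorted(set(true_labels)):
        total = sum(1 for t in true_labels if t == cls)
        hits = sum(
            1 for t, p in zip(true_labels, predicted_labels) if t == cls and p == cls
        )
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical toy data: eight emails, two clusters.
true_labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]
score = balanced_accuracy(true_labels, predicted)
print(mapping, score)
```

Because balanced accuracy averages recall over the two classes, it is not inflated when one class dominates the data, which makes it a natural choice for ham/spam evaluation.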

【 License 】

Unknown   
