Journal Article Details
The International Arab Journal of Information Technology
A Topic-Specific Web Crawler using Deep Convolutional Networks
article
Saed ALqaraleh [1], Hatice Meltem Nergız Sırın [2]
[1] Department of Computer Engineering, Hasan Kalyoncu University; [2] Software Engineering Department, Hasan Kalyoncu University
Keywords: CNN; natural language processing; text classification; topic-specific crawler; focused crawler; web crawling
DOI: 10.34028/iajit/20/3/3
Subject classification: Computer Science (General)
Source: Zarqa University
【 Abstract 】

This paper presents a new focused crawler that efficiently supports the Turkish language. The architecture is divided into multiple units: a control unit, a crawler unit, a link extractor unit, a link sorter unit, and a natural language processing unit. The crawler's units can work in parallel to process the massive number of published websites. The proposed Convolutional Neural Network (CNN) based natural language processing unit can proficiently classify Turkish text and web pages. Extensive experiments on three datasets were performed to illustrate the performance of the developed approach. The first dataset contains 50,000 Turkish web pages downloaded by the developed crawler, while the other two are publicly available and consist of 28,567 and 22,431 Turkish web pages, respectively. In addition, the Vector Space Model (VSM) in general, and state-of-the-art word embedding techniques in particular, were investigated to find the most suitable one for the Turkish language. Overall, the results indicate that the developed approach achieves good performance, robustness, and stability when processing the Turkish language. Bidirectional Encoder Representations from Transformers (BERT) was found to be the most appropriate embedding for building an efficient Turkish language classification system. Finally, the experiments showed that the developed natural language processing unit outperforms seven state-of-the-art CNN classification systems: its accuracy is 10% higher than that of the second-best system and 47% higher than that of the lowest-performing one.

【 License 】

Unknown   
