Conference Paper Details
2019 2nd International Conference on Advanced Materials, Intelligent Manufacturing and Automation
Multilingual Focused Crawler System based on Web Content Extraction and Path Configuration
Wang, Jie^1,2; Deng, Sanhong^1,2; Wang, Lijuan^3
School of Information Management, Nanjing University, Nanjing 210023, China^1
Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023, China^2
Geological Survey of Jiangsu Province, Nanjing 210018, China^3
Keywords: Distribution lines; Foreign language; Multiple languages; Network information; Network resource; Path configuration; Path informations; Web content extractions
Others: https://iopscience.iop.org/article/10.1088/1757-899X/569/5/052030/pdf
DOI: 10.1088/1757-899X/569/5/052030
Source: IOP
【 Abstract 】

The multilingual focused crawler system combines web content extraction with path configuration, drawing on the advantages of both to collect network information automatically in multiple languages. First, the system selects foreign-language keywords according to the language of the webpage being crawled and the Chinese keywords, and uses an initial link to obtain webpage information. Then, it uses either path configuration information or a web content extraction algorithm based on the distribution line block to get the webpage content, and applies rules or configuration information to acquire new links, the publication time, and the title. Next, the keywords are used to filter out irrelevant information. Finally, the results are presented as a list. When using the focused crawler system, users may configure the webpage path information or leave it unconfigured, according to their requirements, and the collected network resources can also be searched or filtered.
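
The steps in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the abstract gives no algorithmic detail, so the keyword table, the function names, the `content_pattern` regular expression standing in for a path configuration, and the `block_size` and `threshold` values are all assumptions. The fallback extractor is a simple text-density heuristic over blocks of consecutive lines, written in the spirit of "line block" extraction. Fetching the page and parsing the publication time are left out.

```python
import re
from html import unescape

# Hypothetical keyword table: Chinese keyword -> foreign keywords per language code.
KEYWORD_TABLE = {
    "地质调查": {"en": ["geological survey"], "fr": ["levé géologique"]},
}


def select_keywords(chinese_keywords, lang):
    """Step 1: choose foreign-language keywords for the page language."""
    selected = []
    for keyword in chinese_keywords:
        selected.extend(KEYWORD_TABLE.get(keyword, {}).get(lang, []))
    return selected


def text_lines(html_text):
    """Remove scripts, styles and tags; return one stripped text line per source line."""
    no_code = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html_text)
    no_tags = re.sub(r"<[^>]+>", "", no_code)
    return [unescape(line).strip() for line in no_tags.split("\n")]


def extract_by_line_blocks(html_text, block_size=3, threshold=60):
    """Step 2b: keep the densest run of line blocks (illustrative parameters)."""
    lines = text_lines(html_text)
    # blocks[i] is the character count of lines i .. i + block_size - 1.
    blocks = [sum(len(line) for line in lines[i:i + block_size])
              for i in range(len(lines))]
    best, start = [], None
    for i, size in enumerate(blocks + [0]):  # trailing 0 closes an open run
        if size >= threshold and start is None:
            start = i
        elif size < threshold and start is not None:
            # Blocks start .. i-1 are dense; they cover lines start .. i+block_size-2.
            candidate = [line for line in lines[start:i - 1 + block_size] if line]
            if sum(map(len, candidate)) > sum(map(len, best)):
                best = candidate
            start = None
    return "\n".join(best)


def extract_content(html_text, path_config=None):
    """Step 2: use the configured path if one is given, else fall back."""
    if path_config and path_config.get("content_pattern"):
        match = re.search(path_config["content_pattern"], html_text, re.S)
        if match:
            return "\n".join(line for line in text_lines(match.group(1)) if line)
    return extract_by_line_blocks(html_text)


def process_page(html_text, chinese_keywords, lang, path_config=None):
    """Steps 2-4 for one fetched page; returns a result row, or None if irrelevant."""
    content = extract_content(html_text, path_config)
    keywords = select_keywords(chinese_keywords, lang)
    if not any(k.lower() in content.lower() for k in keywords):
        return None  # Step 3: drop pages that match no keyword
    title = re.search(r"(?is)<title>(.*?)</title>", html_text)
    return {
        "title": unescape(title.group(1)).strip() if title else "",
        "links": re.findall(r'(?i)href=["\']([^"\']+)["\']', html_text),
        "content": content,
    }
```

Calling `process_page(html_text, ["地质调查"], "en")` on each fetched page and collecting the rows that are not `None` gives the list of results described in the last step. A real system would likely express the path configuration as XPath or CSS selectors rather than a regular expression; the regex is used here only to keep the sketch within the standard library.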
