| 2019 2nd International Conference on Advanced Materials, Intelligent Manufacturing and Automation | |
| Multilingual Focused Crawler System based on Web Content Extraction and Path Configuration | |
| Wang, Jie^1^2 ; Deng, Sanhong^1^2 ; Wang, Lijuan^3 | |
| School of Information Management, Nanjing University, Nanjing 210023, China^1 | |
| Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023, China^2 | |
| Geological Survey of Jiangsu Province, Nanjing 210018, China^3 | |
| Keywords: Distribution lines; Foreign language; Multiple languages; Network information; Network resource; Path configuration; Path informations; Web content extractions; | |
| Others : https://iopscience.iop.org/article/10.1088/1757-899X/569/5/052030/pdf DOI : 10.1088/1757-899X/569/5/052030 |
| Source: IOP | |
【 Abstract 】
The multilingual focused crawler system combines web content extraction with path configuration, drawing on the advantages of both to collect network information in multiple languages automatically. First, the system selects foreign-language keywords according to the language of the webpage being crawled and the supplied Chinese keywords, and uses an initial link to obtain webpage information. Then it obtains the webpage content either from path configuration information or from a web content extraction algorithm based on the distribution of line blocks, and uses rules or configuration information to acquire new links, the publication time, and the title. Next, keywords are used to filter out irrelevant information. Finally, the results are presented as a list. When using the focused crawler system, users can choose whether or not to configure webpage path information according to their requirements, and can also search or filter the collected network resources.
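
The abstract does not give the details of the extraction algorithm, so the following is only a minimal sketch of the general idea behind line-block content extraction followed by keyword filtering. It is not the authors' implementation. The function names (`visible_lines`, `extract_main_text`, `is_relevant`), the centered window, and the `radius` and `threshold` values are all assumptions made for illustration. The path-configuration branch, link discovery, and keyword translation are not covered.

```python
import re


def visible_lines(html):
    """Drop script/style bodies and tags, keeping one stripped text line per source line."""
    html = re.sub(r"(?is)<(script|style)\b[^>]*>.*?</\1>", "", html)
    text = re.sub(r"(?s)<[^>]+>", "", html)
    return [line.strip() for line in text.split("\n")]


def extract_main_text(html, radius=1, threshold=100):
    """Return the densest run of text lines, or "" if no block reaches the threshold.

    Assumption: a "line block" here is a window of 2*radius+1 lines centered on
    each line; both parameters are illustrative, not values from the paper.
    """
    lines = visible_lines(html)
    n = len(lines)
    # block_len[i] = number of characters in the window centered on line i
    block_len = [
        sum(len(s) for s in lines[max(0, i - radius): i + radius + 1])
        for i in range(n)
    ]
    best_start, best_end, best_chars = 0, 0, 0
    i = 0
    while i < n:
        if block_len[i] < threshold:
            i += 1
            continue
        # Grow a maximal run of consecutive lines whose block length stays high.
        j = i
        while j < n and block_len[j] >= threshold:
            j += 1
        chars = sum(len(s) for s in lines[i:j])
        if chars > best_chars:
            best_start, best_end, best_chars = i, j, chars
        i = j
    return "\n".join(s for s in lines[best_start:best_end] if s)


def is_relevant(text, keywords):
    """Keep a page if any keyword occurs in its text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)
```

In use, a crawler loop would call `extract_main_text` on each fetched page and keep the page only when `is_relevant` returns true for the keyword list of that page's language. Short navigation lines tend to fall below the threshold because their neighbouring lines are short as well, whereas the lines of an article body reinforce each other.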