2017 3rd International Conference on Environmental Science and Material Application | |
Design and Implementation of Distributed Crawler System Based on Scrapy | |
Ecological and Environmental Science; Materials Science | |
Fan, Yuhao^1 | |
Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu | |
611731, China^1 | |
Keywords: Bloom-filter algorithms; Design and implementations; Distributed crawler; Memory consumption; MongoDB; Search services; Single-machines; Web information; | |
Others : https://iopscience.iop.org/article/10.1088/1755-1315/108/4/042086/pdf DOI : 10.1088/1755-1315/108/4/042086 |
Source: IOP | |
【 Abstract 】
At present, large-scale search engines at home and abroad provide users only with non-customized search services, and a single-machine web crawler cannot solve this difficult task. In this paper, the original Scrapy framework is studied and then improved by combining Scrapy with Redis. A distributed crawler system for Web information, based on the Scrapy framework, is designed and implemented, and the Bloom filter algorithm is applied to the dupefilter module to reduce memory consumption. The movie information crawled from Douban is stored in MongoDB so that the data can be processed and analyzed. The results show that the distributed crawler system based on the Scrapy framework is more efficient and stable than a single-machine web crawler system.
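
The abstract's idea of using a Bloom filter for duplicate detection can be illustrated with a minimal sketch. It is not the paper's implementation: the class `BloomFilter`, the helper `request_seen`, the bit-array size, the number of hashes, and the example URLs are all assumptions chosen for illustration. The bit array here lives in local memory, whereas a distributed crawler would presumably keep it in a store shared by all workers, such as Redis.

```python
import hashlib


class BloomFilter:
    """In-memory Bloom filter for URL de-duplication (illustrative only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # num_bits and num_hashes are assumed values, not taken from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by double hashing over one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


def request_seen(bloom, url):
    """Hypothetical helper: report whether url was (probably) seen, else record it."""
    if url in bloom:
        return True
    bloom.add(url)
    return False


if __name__ == "__main__":
    bloom = BloomFilter()
    for url in ["https://example.com/movie/1",
                "https://example.com/movie/2",
                "https://example.com/movie/1"]:
        print(url, "duplicate" if request_seen(bloom, url) else "new")
```

The memory saving comes from storing a fixed number of bits regardless of URL length (128 KiB for the default size above), at the cost of a small false-positive rate: a URL that was never crawled may occasionally be reported as a duplicate, but a crawled URL is never reported as new.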