Conference Paper Details
2017 3rd International Conference on Environmental Science and Material Application
Design and Implementation of Distributed Crawler System Based on Scrapy
Ecological and Environmental Science; Materials Science
Fan, Yuhao^1
Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China^1
Keywords: Bloom-filter algorithms; Design and implementations; Distributed crawler; Memory consumption; MongoDB; Search services; Single-machines; Web information
Others  :  https://iopscience.iop.org/article/10.1088/1755-1315/108/4/042086/pdf
DOI  :  10.1088/1755-1315/108/4/042086
Source: IOP
【 Abstract 】

At present, some large-scale search engines at home and abroad provide users only with non-customized search services, and a single-machine web crawler cannot solve this difficult task. In this paper, through a study of the original Scrapy framework, the framework is improved by combining Scrapy with Redis. A distributed crawler system for Web information based on the Scrapy framework is designed and implemented, and the Bloom filter algorithm is applied to the dupefilter module to reduce memory consumption. The movie information captured from Douban is stored in MongoDB so that the data can be processed and analyzed. The results show that the distributed crawler system based on the Scrapy framework is more efficient and stable than the single-machine web crawler system.
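
To illustrate the Bloom filter idea mentioned in the abstract, the following is a minimal in-memory sketch of URL de-duplication. It is not the paper's implementation: the class name `BloomFilter`, the helper `request_seen`, the bit-array size, the number of hashes, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration. In a distributed setup the bit array would presumably live in a shared store such as Redis rather than in a local `bytearray`.

```python
import hashlib


class BloomFilter:
    """In-memory Bloom filter (hypothetical sketch, not the paper's code)."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        # num_bits and num_hashes are illustrative defaults, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Set every bit selected for this item.
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


def request_seen(bloom, url):
    """Return True if url was probably seen before; otherwise record it and return False."""
    if url in bloom:
        return True
    bloom.add(url)
    return False
```

With the default of 2^20 bits, the filter occupies 128 KiB regardless of how many URLs are recorded, which is the source of the memory saving over storing every fingerprint in a set. The trade-off is a small false-positive rate: a URL that was never crawled may occasionally be reported as seen and skipped.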
