Thesis

【Abstract】

A web crawler is a program that "walks" the Web to gather web resources. To scale to the ever-growing Web, multiple crawling agents may be deployed in a distributed fashion to retrieve web data cooperatively. A common approach is to divide the Web into many partitions and assign an agent to crawl within each one. If an agent obtains a web resource that does not belong to its partition, the resource is transferred to its rightful owner. This thesis proposes a novel approach to distributed web data gathering that partitions the Web by topic. The proposed approach employs multiple focused crawlers to retrieve pages on various topics. When a crawler retrieves a page that belongs to another topic, it transfers the page to the appropriate crawler. This approach is known as topic-oriented collaborative web crawling. An implementation of the system was built and experimentally evaluated. To identify the topic of a web page, a topic classifier was incorporated into the crawling system. As the classifier categorizes only English pages, a language identifier was also introduced to distinguish English pages from non-English ones. The experimental results show that redundant retrieval was low and that a resource retrieved by an agent is six times more likely to be retained by that agent than in a system that uses a conventional hashing approach. These numbers are viewed as strong indications that topic-oriented collaborative web crawling is a viable approach to web data gathering.
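The difference between the two ownership schemes described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the keyword table, the toy `classify_topic` function, the choice of MD5 over the host name, and all function names are assumptions made for illustration. The thesis uses a trained topic classifier and a language identifier, both of which are omitted here.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical topic vocabulary; a real system would use a trained classifier.
TOPIC_KEYWORDS = {
    "sports": {"football", "league", "score"},
    "science": {"physics", "experiment", "theory"},
}
# One agent per topic; an agent's id is its topic's index in this sorted list.
TOPICS = sorted(TOPIC_KEYWORDS)


def classify_topic(text):
    """Toy classifier: pick the topic sharing the most words with the page.

    Ties (including pages that match no keyword) go to the first topic.
    """
    words = set(text.lower().split())
    return max(TOPICS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))


def owner_by_hash(url, n_agents):
    """Conventional scheme (assumed form): owner depends only on the host name."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_agents


def owner_by_topic(text):
    """Topic-oriented scheme: owner is the agent crawling the page's topic."""
    return TOPICS.index(classify_topic(text))


def route(agent_id, owner_id):
    """An agent retains pages it owns and transfers all others to their owner."""
    return "retain" if agent_id == owner_id else f"transfer to agent {owner_id}"
```

Under the hash scheme, ownership is unrelated to page content, so a focused crawler following on-topic links may still have to hand most pages to other agents. Under the topic scheme, a page found by the agent for its own topic is retained, which is the effect the abstract's retention figure measures.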

【Preview】

Attachments
Files	Size	Format	View
Topic-Oriented Collaborative Web Crawling	716KB	PDF	download


Topic-Oriented Collaborative Web Crawling
Keywords: Computer Science; Web Crawling; Distributed System; Text Categorization
Chung, Chiasen
University of Waterloo
Others : https://uwspace.uwaterloo.ca/bitstream/10012/1040/1/c3chung2001.pdf
Switzerland | English
Source: UWSPACE Waterloo Institutional Repository

