Web Crawlers: Translated Foreign-Language Reference



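Crawling consumes resources: bandwidth to download pages, memory to hold private data structures, CPU to evaluate and select URLs, and disk storage to keep the text and links of fetched pages along with other persistent data.

B. Robot Protocol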
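
As a rough illustration of the exclusion directives this section covers, the sketch below checks URLs against a robots file using Python's standard `urllib.robotparser` module. The file contents, the `ExampleBot` user agent, and the `example.com` URLs are all hypothetical and chosen only for the example; they do not come from the paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots file; the path and delay values are made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots rules from a string instead of fetching them over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before every fetch whether the URL is excluded.
allowed_index = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")

# Some sites also declare a minimum delay between requests.
delay = parser.crawl_delay("ExampleBot")

print(allowed_index, allowed_private, delay)
```

In a real crawler the rules would be downloaded from the site rather than embedded in the program; parsing from a string here simply keeps the sketch runnable offline.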

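The robots file gives directives for excluding part of a Web site from being crawled. Similarly, a simple text file can provide information about the freshness and popularity of published objects. This information allows a crawler to optimize its strategy for refreshing collected data as well as its object replacement policy.

C. Meta Search Engine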







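The most crucial evaluation of focused crawling is the harvest ratio: the rate at which relevant pages are acquired and irrelevant pages are effectively filtered out during the crawl. This harvest ratio must be high; otherwise the focused crawler would spend a lot of time merely eliminating irrelevant pages, and an ordinary crawler might be the better choice.

B. Distributed Crawling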









Discussion on Web Crawlers of Search Engine

Abstract: With the precipitous expansion of the Web, extracting knowledge from the Web is becoming increasingly important and popular. This is due to the Web's convenience and richness of information. To find Web pages, one typically uses search engines that are based on the Web crawling framework. This paper describes the basic tasks performed by a search engine and gives an overview of how Web crawlers relate to search engines.

Keywords: Distributed Crawling, Focused Crawling, Web Crawlers


The World Wide Web (WWW) is a service that resides on computers connected to the Internet and allows end users to access data stored on those computers using standard interface software. It is the universe of network-accessible information, an embodiment of human knowledge.

A search engine is a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet. A search engine finds information for its database by accepting listings sent to it by authors who want exposure, or by getting the information from its “Web crawlers,” “spiders,” or “robots,” programs that roam the Internet storing links to and information about each page they visit.

A Web crawler is a program that fetches information from the World Wide Web in an automated manner. Web crawling is an important research issue. Crawlers are software components that visit portions of Web trees according to certain strategies and collect the retrieved objects in local repositories.
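
The visit-and-collect loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the breadth-first visiting order is one possible strategy picked for the example, and the `crawl` function, the `LinkExtractor` class, the `fetch` callback, and the in-memory `FAKE_WEB` pages are hypothetical names that do not appear in the paper.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML text or None."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisiting
    repository = {}            # local store: URL -> page content
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # page could not be retrieved; skip it
            continue
        repository[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links against the page URL
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return repository


# A tiny in-memory stand-in for the Web so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": '<a href="a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="missing.html">X</a>',
    "http://example.com/b.html": '<a href="http://example.com/">home</a>',
}

repo = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(repo))
```

Passing the fetch function in as a parameter keeps the traversal logic separate from network access, so a different download routine or a different frontier ordering could be swapped in without touching the rest of the loop.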


The rest of the paper is organized as follows. Section 2 explains the background of Web crawlers. Section 3 discusses the types of crawler, and Section 4 explains how a Web crawler works. Section 5 covers two advanced Web crawling techniques. Section 6 discusses the problem of selecting more interesting pages.


Web crawlers are almost as old as the Web itself. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of Mosaic. Several papers about Web crawling were presented at the first two World Wide Web conferences. However, at the time, the Web was three to four orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's Web.

Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility.

The original Google crawler (developed at Stanford) consisted of five functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes. Each crawler process ran on a different machine, was single-threaded, and used asynchronous I/O to fetch data from up to 300 Web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk. The pages were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file. A URL resolver process read the link file, resolved the relative URLs contained therein into absolute URLs, and saved them to the disk file that was read by the URL server. Typically, three to four crawler machines were used, so the entire system required between four and eight machines. Research on Web crawling continues at Stanford even after Google has been