网络爬虫外文翻译参考文献
A.资源约束
爬行消耗资源:下载页面的带宽,支持私人数据结构存储的内存,来评价和选择网址的CPU,以及存储文本和链接以及其他持久性数据的磁盘存储。 B.机器人协议
机器人文件给出排除一部分的网站被抓取的指令。类似地,一个简单的文本文件可以提供有关的新鲜和出版对象的流行信息。对信息允许抓取工具优化其收集的数据刷新策略以及更换对象的政策。 C.元搜索引擎
一个元搜索引擎是一种没有它自己的网页数据库的搜索引擎。它发出的搜索支持其他搜索引擎所有的数据库,从所有的搜索引擎查询并为用户提供的结果。较少的元搜索可以让您深入到最大,最有用的搜索引擎数据库。他们往往返回最小或免费的搜索引擎和其他免费目录并且通常是小和高度商业化的结果。
5.爬行技术
A:主题爬行
一个通用的网络爬虫根据一个URL的特点设置来收集网页。凡为主题爬虫的设计有一个特定的主题的文件,从而减少了网络流量和下载量。主题爬虫的目标是有选择地寻找相关的网页的主题进行预先定义的设置。指定的主题不使用关键字,但使用示范文件。
不是所有的收集和索引访问的Web文件能够回答所有可能的特殊查询,有一个主题爬虫爬行分析其抓起边界,找到链接,很可能是最适合抓取相关,并避免不相关的区域的Web。
这导致在硬件和网络资源极大地节省,并有助于于保持在最新状态的数据。主题爬虫有三个主要组成部分一个分类器,这能够判断相关网页,决定抓取链接的拓展,过滤器决定过滤器抓取的网页,以确定优先访问中心次序的措施,以及均受量词和过滤器动态重新配置的优先的控制的爬虫。
最关键的评价是衡量主题爬行收获的比例,这是在抓取过程中有多少比例相关网页被采用和不相干的网页是有效地过滤掉,这收获率最高,否则主题爬虫会花很多时间在消除不相关的网页,而且使用一个普通的爬虫可能会更好。 B:分布式检索
网络爬虫外文翻译参考文献
检索网络是一个挑战,因为它的成长性和动态性。随着网络规模越来越大,已经称为必须并行处理检索程序,以完成在合理的时间内下载网页。一个单一的检索程序,即使在是用多线程在大型引擎需要获取大量数据的快速上也存在不足。当一个爬虫通过一个单一的物理链接被所有被提取的数据所使用,通过分配多种抓取活动的进程可以帮助建立一个可扩展的易于配置的系统,它具有容错性的系统。拆分负载降低硬件要求,并在同一时间增加整体下载速度和可靠性。每个任务都是在一个完全分布式的方式,也就是说,没有中央协调器的存在。
6、挑战更多“有趣”对象的问题
搜索引擎被认为是一个热门话题,因为它收集用户查询记录。检索程序优先抓取网站根据一些重要的度量,例如相似性(对有引导的查询),返回链接数网页排名或者其他组合/变化最精Najork等。表明,首先考虑广泛优先搜索收集高品质页面,并提出一种网页排名。然而,目前,搜索策略是无法准确选择“最佳”路径,因为他们的认识仅仅是局部的。由于在互联网上可得到的信息数量非常庞大目前不可能实现全面的索引。因此,必须采用剪裁策略。主题爬行和智能检索,是发现相关的特定主题或主题集网页技术。
结论
在本文中,我们得出这样的结论实现完整的网络爬行覆盖是不可能实现,因为受限于整个万维网的巨大规模和资源的可用性。通常是通过一种阈值的设置(网站访问人数,网站上树的水平,与主题等规定),以限制对选定的网站上进行抓取的过程。此信息是在搜索引擎可用于存储/刷新最相关和最新更新的网页,从而提高检索的内容质量,同时减少陈旧的内容和缺页。
网络爬虫外文翻译参考文献
原文:
Discussion on Web Crawlers of Search Engine
Abstract-With the precipitous expansion of the Web,extracting knowledge from the
Web is becoming gradually important and popular.This is due to the Web?s
convenience and richness of information.To find Web pages, one typically uses search engines that are based on the Web crawling framework.This paper describes the basic task performed search engine.Overview of how the Web crawlers are related with search engine.
Keywords Distributed Crawling, Focused Crawling,Web Crawlers
Ⅰ.INTRODUCTION
WWW on the Web is a service that resides on computers that are connected to the Internet and allows end users to access data that is stored on the computers using standard interface software. The World Wide Web is the universe of network-accessible information,an embodiment of human knowledge.
Search engine is a computer program that searches for particular keywords and returns a list of documents in which they were found,especially a commercial service that scans documents on the Internet. A search engine finds information for its database by accepting listings sent it by authors who want exposure,or by getting the information from their “Web crawlers,””spiders,” or “robots,”programs that roam the Internet storing links to and information about each page they visit.
Web Crawler is a program, which fetches information from the World Wide Web in an automated manner.Web crawling is an important research issue. Crawlers are software components, which visit portions of Web trees, according to certain strategies,and collect retrieved objects in local repositories.
网络爬虫外文翻译参考文献
The rest of the paper is organized as: in Section 2 we explain the background details of Web crawlers.In Section 3 we discuss on types of crawler, in Section 4 we will explain the working of Web crawler. In Section 5 we cover the two advanced techniques of Web crawlers. In the Section 6 we discuss the problem of selecting more interesting pages.
Ⅱ.SURVEY OF WEB CRAWLERS
Web crawlers are almost as old as the Web itself.The first crawler,Matthew Gray?s Wanderer, was written in the spring of 1993,roughly coinciding with the first release Mosaic.Several papers about Web crawling were presented at the first two World Wide Web conference.However,at the time, the Web was three to four orders of magnitude smaller than it is today,so those systems did not address the scaling problems inherent in a crawl of today?s Web.
Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions:the Goole crawler and the Internet Archive crawler.Unfortunately,the descriptions of these crawlers in the literature are too terse to enable reproducibility.
The original Google crawler (developed at Stanford) consisted of five functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes.Each crawler process ran on a different machine,was single-threaded,and used asynchronous I/O to fetch data from up to 300 Web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk.The page were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file. A URLs resolver process read the link file, relative the URLs contained there in, and saved the absolute URLs to the disk file that was read by the URL server. Typically,three to four crawler machines were used, so the entire system required between four and eight machines. Research on Web crawling continues at Stanford even after Google has been