Web Crawling

8 Web Crawling Web crawlers, also known as spiders or robots, are programs that auto- matically download Web pages. Since information on the Web is scattered among billions of pages served by millions of servers around the globe, users who browse the Web can follow hyperlinks to access information, virtually moving from one page to the next. A crawler can visit many sites to collect information that can be analyzed and mined in a central location, either online (as it is downloaded) or off-line (after it is stored). Were the Web a static collection of pages, we would have little long term use for crawling. Once all the pages are fetched and saved in a reposi- tory, we are done. However, the Web is a dynamic entity evolving at rapid rates. Hence there is a continuous need for crawlers to help applications stay current as pages and links are added, deleted, moved or modified. There are many applications for Web crawlers. One is business intelli- gence, whereby organizations collect information about their competitors and potential collaborators. Another use is to monitor Web sites and pages of interest, so that a user or community can be notified when new informa- tion appears in certain places. There are also malicious applications of crawlers, for example, that harvest email addresses to be used by spam- mers or collect personal information to be used in phishing and other iden- tity theft attacks. The most widespread use of crawlers is, however, in sup- port of search engines. In fact, crawlers are the main consumers of Internet bandwidth. They collect pages for search engines to build their indexes. Well known search engines such as Google, Yahoo! and MSN run very ef- ficient universal crawlers designed to gather all pages irrespective of their content. Other crawlers, sometimes called preferential crawlers, are more targeted. They attempt to download only pages of certain types or topics. This chapter introduces the main concepts, algorithms and data struc- tures behind Web crawlers. After discussing the implementation issues that all crawlers have to address, we describe different types of crawlers: uni- versal, focused, and topical. We also discuss some of the ethical issues around crawlers. Finally, we peek at possible future uses of crawlers in support of alternative models where crawling and searching activities are distributed among a large community of users connected by a dynamic and adaptive peer network. By Filippo Menczer 274 8 Web Crawling 8.1 A Basic Crawler Algorithm In its simplest form, a crawler starts from a set of seed pages (URLs) and then uses the links within them to fetch other pages. The links in these pages are, in turn, extracted and the corresponding pages are visited. The process repeats until a sufficient number of pages are visited or some other objective is achieved. This simple description hides many delicate issues related to network connections, spider traps, URL canonicalization, page parsing, and crawling ethics. In fact, Google founders Sergey Brin and Lawrence Page, in their seminal paper, identified the Web crawler as the most sophisticated yet fragile component of a search engine [68]. Figure 8.1 shows the flow of a basic sequential crawler. Such a crawler fetches one page at a time, making inefficient use of its resources. Later in the chapter we discuss how efficiency can be improved by the use of mul- tiple processes, threads, and asynchronous access to resources. The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs which may be provided by the user or another program. In each iteration of its main loop, the crawler picks the next URL from the frontier, fetches the page corresponding to the URL through HTTP, parses the retrieved page to extract its URLs, adds newly discovered URLs to the frontier, and stores the page (or other extracted information, possibly index terms) in a local disk repository. The crawling process may be terminated when a certain number of pages have been crawled. The crawler may also be forced to stop if the frontier becomes empty, although this rarely hap- pens in practice due to the high average number of links (on the order of ten out-links per page across the Web). A crawler is, in essence, a graph search algorithm. The Web can be seen as a large graph with pages as its nodes and hyperlinks as its edges. A crawler starts from a few of the nodes (seeds) and then follows the edges to reach other nodes. The process of fetching a page and extracting the links within it is analogous to expanding a node in graph search. The frontier is the main data structure, which contains the URLs of un- visited pages. Typical crawlers attempt to store the frontier in the main memory for efficiency. Based on the declining price of memory and the spread of 64-bit processors, quite a large frontier size is feasible. Yet the crawler designer must decide which URLs have low priority and thus get discarded when the frontier is filled up. Note that given some maximum size, the frontier will fill up quickly due to the high fan-out of pages. Even more importantly, the crawler algorithm must specify the order in which new URLs are extracted from the frontier to be visited. These mechanisms determine the graph search algorithm implemented by the crawler. 8.1 A Basic Crawler Algorithm 275 Fig. 8.1. Flow chart of a basic sequential crawler. The main data operations are shown on the left, with dashed arrows. 8.1.1 Breadth-First Crawlers The frontier may be implemented as a first-in-first-out (FIFO) queue, cor- responding to a breadth-first crawler. The URL to crawl next comes from the head of the queue and new URLs are added to the tail of the queue. Once the frontier reaches its maximum size, the breadth-first crawler can add to the queue only one unvisited URL from each new page crawled. 276 8 Web Crawling The breadth-first strategy does not imply that pages are visited in “ran- dom” order. To understand why, we have to consider the highly skewed, long-tailed distribution of indegree in the Web graph. Some pages have a number of links pointing to them that are orders of magnitude larger than the mean. Indeed, the mean indegree is not statistically significant when the indegree k is distributed according to a power law Pr(k) ∼ k−γ with ex- ponent γ < 3 [437]. For the Web graph, this is the case, with γ ≈ 2.1 [69]. This means that the fluctuations of indegree are unbounded, i.e., the stan- dard deviation is bounded only by the finite size of the graph. Intuitively, popular pages have so many incoming links that they act like attractors for breadth-first crawlers. It is therefore not surprising that the order in which pages are visited by a breadth-first crawler is highly correlated with their PageRank or indegree values. An important implication of this phenome- non is an intrinsic bias of search engines to index well connected pages. Another reason that breadth-first crawlers are not “random” is that they are greatly affected by the choice of seed pages. Topical locality measures indicate that pages in the link neighborhood of a seed page are much more likely to be related to the seed pages than randomly selected pages. These and other types of bias are important to universal crawlers (Sect. 8.3). As mentioned earlier, only unvisited URLs are to be added to the fron- tier. This requires some data structure to be maintained with visited URLs. The crawl history is a time-stamped list of URLs fetched by the crawler tracking its path through the Web. A URL is entered into the history only after the corresponding page is fetched. This history may be used for post- crawl analysis and evaluation. For example, we want to see if the most relevant or important resources are found early in the crawl process. While history may be stored on disk, it is also maintained as an in-memory data structure for fast look-up, to check whether a page has been crawled or not. This check is required to avoid revisiting pages or wasting space in the limited-size frontier. Typically a hash table is appropriate to obtain quick URL insertion and look-up times (O(1)). The look-up process assumes that one can identify two URLs effectively pointing to the same page. This in- troduces the need for canonical URLs (see Sect. 8.2). Another important detail is the need to prevent duplicate URLs from be- ing added to the frontier. A separate hash table can be maintained to store the frontier URLs for fast look-up to check whether a URL is already in it. 8.1.2 Preferential Crawlers A different crawling strategy is obtained if the frontier is implemented as a priority queue rather than a FIFO queue. Typically, preferential crawl- 8.2 Implementation Issues 277 ers assign each unvisited link a priority based on an estimate of the value of the linked page. The estimate can be based on topological properties (e.g., the indegree of the target page), content properties (e.g., the similar- ity between a user query and the source page), or any other combination of measurable features. For example, the goal of a topical crawler is to follow edges that are expected to lead to portions of the Web graph that are rele- vant to a user-selected topic. The choice of seeds is even more important in this case than for breadth-first crawlers. We will discuss various preferen- tial crawling algorithms in Sects. 8.4 and 8.5. For now let us simply as- sume that some function exists to assign a priority value or score to each unvisited URL. If pages are visited in the order specified by the priority values in the frontier, then we have a best-first crawler. The priority queue may be a dynamic array that is always kept sorted by URL scores. At each step, the best URL is picked from the head of the queue. Once the corresponding page is fetched, the URLs extracted from it must, in turn, be scored. They are then added to the frontier in such a manner that the sorting order of the priority queue is maintained. As for breadth-first, best-first crawlers also need to avoid duplicate URLs in the frontier. Keeping a separate hash table for look-up is an efficient way to achieve this. The time complexity of inserting a URL into the priority queue is O(logF), where F is the frontier size (looking up the hash requires constant time). To dequeue a URL, it must first be removed from the prior- ity queue (O(logF)) and then from the hash table (again O(1)). Thus the parallel use of the two data structures yields a logarithmic total cost per URL. Once the frontier’s maximum size is reached, only the best URLs are kept; the frontier must be pruned after each new set of links is added. 8.2 Implementation Issues 8.2.1 Fetching To fetch pages, a crawler acts as a Web client; it sends an HTTP request to the server hosting the page and reads the response. The client needs to timeout connections to prevent spending unnecessary time waiting for re- sponses from slow servers or reading huge pages. In fact, it is typical to re- strict downloads to only the first 10-100 KB of data for each page. The cli- ent parses the response headers for status codes and redirections. Redirect loops are to be detected and broken by storing URLs from a redirection chain in a hash table and halting if the same URL is encountered twice. One may also parse and store the last-modified header to determine the age of the document, although this information is known to be unreliable. Er- 278 8 Web Crawling ror-checking and exception handling is important during the page fetching process since the same code must deal with potentially millions of remote servers. In addition, it may be beneficial to collect statistics on timeouts and status codes to identify problems or automatically adjust timeout val- ues. Programming languages such as Java, Python and Perl provide simple programmatic interfaces for fetching pages from the Web. However, one must be careful in using high-level interfaces where it may be harder to de- tect lower-level problems. For example, a robust crawler in Perl should use the Socket module to send HTTP requests rather than the higher-level LWP library (the World-Wide Web library for Perl). The latter does not al- low fine control of connection timeouts. 8.2.2 Parsing Once (or while) a page is downloaded, the crawler parses its content, i.e., the HTTP payload, and extracts information both to support the crawler’s master application (e.g., indexing the page if the crawler supports a search engine) and to allow the crawler to keep running (extracting links to be added to the frontier). Parsing may imply simple URL extraction from hy- perlinks, or more involved analysis of the HTML code. The Document Ob- ject Model (DOM) establishes the structure of an HTML page as a tag tree, as illustrated in Fig. 8.2. HTML parsers build the tree in a depth-first man- ner, as the HTML source code of a page is scanned linearly. Unlike program code, which must compile correctly or else will fail with a syntax error, correctness of HTML code tends to be laxly enforced by browsers. Even when HTML standards call for strict interpretation, de facto standards imposed by browser implementations are very forgiving. This, together with the huge population of non-expert authors generating Web pages, imposes significant complexity on a crawler's HTML parser. Many pages are published with missing required tags, tags improperly nested, missing close tags, misspelled or missing attribute names and val- ues, missing quotes around attribute values, unescaped special characters, and so on. As an example, the double quotes character in HTML is re- served for tag syntax and thus is forbidden in text. The special HTML en- tity " is to be used in its place. However, only a small number of au- thors are aware of this, and a large fraction of Web pages contains this illegal character. Just like browsers, crawlers must be forgiving in these cases; they cannot afford to discard many important pages as a strict parser would do. A wise preprocessing step taken by robust crawlers is to apply a tool such as tidy (www.w3.org/People/Raggett/tidy) to clean up the HTML content prior to parsing. To add to the complexity, there are many coexist- 8.2 Implementation Issues 279 ing HTML and XHTML reference versions. However, if the crawler only needs to extract links within a page and/or the text in the page, simpler parsers may suffice. The HTML parsers available in high-level languages such as Java and Perl are becoming increasingly sophisticated and robust. Fig. 8.2. Illustration of the DOM (or tag) tree built from a simple HTML page. Internal nodes (shown as ovals) represent HTML tags, with the tag as the root. Leaf nodes (shown as rectangles) correspond to text chunks. A growing portion of Web pages are written in formats other than HTML. Crawlers supporting large-scale search engines routinely parse and index documents in many open and proprietary formats such as plain text, PDF, Microsoft Word and Microsoft PowerPoint. Depending on the appli- 280 8 Web Crawling cation of the crawler, this may or may not be required. Some formats pre- sent particular difficulties as they are written exclusively for human inter- action and thus are especially unfriendly to crawlers. For instance, some commercial sites use graphic animations in Flash; these are difficult for a crawler to parse in order to extract links and their textual content. Other examples include image maps and pages making heavy use of Javascript for interaction. New challenges are going to come as new standards such as Scalable Vector Graphics (SVG), Asynchronous Javascript and XML (AJAX), and other XML-based languages gain popularity. 8.2.3 Stopword Removal and Stemming When parsing a Web page to extract the content or to score new URLs suggested by the page, it is often helpful to remove so-called stopwords, i.e., terms such as articles and conjunctions, which are so common that they hinder the discrimination of pages on the basis of content. Another useful technique is stemming, by which morphological variants of terms are conflated into common roots (stems). In a topical crawler where a link is scored based on the similarity between its source page and the query, stemming both the page and the query helps improve the matches between the two sets and the accuracy of the scoring function. Both stop-word removal and stemming are standard techniques in in- formation retrieval, and are discussed in greater detail in Chap. 6. 8.2.4 Link Extraction and Canonicalization HTML parsers provide the functionality to identify tags and associated at- tribute-value pairs in a given Web page. In order to extract hyperlink URLs from a page, we can use a parser to find anchor () tags and grab the values of the associated href attributes. However, the URLs thus ob- tained need to be further processed. First, filtering may be necessary to ex- clude certain file types that are not to be crawled. This can be achieved with white lists (e.g., only follow links to text/html content pages) or black lists (e.g., discard links to PDF files). The identification of a file type may rely on file extensions. However, they are often unreliable and sometimes missing altogether. We cannot afford to download a document and then decide whether we want it or not. A compromise is to send an HTTP HEAD request and inspect the content-type response header, which is usu- ally a more reliable label. Another type of filtering has to do with the static or dynamic nature of pages. A dynamic page (e.g., generated by a CGI script) may indicate a 8.2 Implementation Issues 281 query interface for a database or some other application in which a crawler may not be interested. In the early days of the Web, such pages were few and easily recognizable, e.g., by matching URLs against the /cgi-bin/ direc- tory name for CGI scripts, or against the special characters [?=&] used in CGI query strings. However, the use of dynamic content has become much more common; it is used in a variety of sites for content that is perfectly indexable. Most importantly, its dynamic nature is very difficult to recog- nize via URL inspection. For these reasons, most crawlers no longer make such distinction between static and dynamic content. While a crawler nor- mally would not create query URLs autonomously (unless it is designed to probe the so-called deep or hidden Web, which contain databases with query interfaces), it will happily crawl URLs hard-coded in HTML source of parsed pages. In other words, if a URL is found in a Web page, it is fair game. There is one important exception to this strategy, the spider trap, which is discussed below. Before links can be added to the frontier, relative URLs must be con- verted to absolute URLs. For example, the relative URL news/today.html in the page http://www.somehost.com/index.html is to be transformed into the absolute form http://www.somehost.com/news/today.html. There are various rules to convert relative URLs into absolute ones. A relative URL can be expressed as a relative or absolute path relative to the Web server’s document root directory. The base URL may be specified by an HTTP header or a meta-tag within an HTML page, or not at all–in the latter case the directory of the hyperlink’s source page is used as a base URL. Converting relative URLs is just one of many steps that make up the canonicalization process, i.e., the conversion of a URL into a canonical form. The definition of canonical form is somewhat arbitrary, so that dif- ferent crawlers may specify different rules. For example, one crawler may always specify the port number within the URL (e.g., http://www.somehost.com:80/), while another may specify the port number only when it is not 80 (the default HTTP port). As long as the canonical form is applied consistently by a crawler, such distinctions are inconse- quential. Some programming languages such as Perl provide modules to manage URLs, including methods for absolute/relative conversion and canonicalization. However, several canonicalization steps require the ap- plication of heuristic rules, and off-the-shelf tools typically do not provide such functionalities. A crawler may also need to use heuristics to detect when two URLs point to the same page in order to minimize the likelihood that the same page is fetched multiple times. Table 8.1 lists the steps typi- cally employed to canonicalize a URL. 282 8 Web Crawling Table 8.1. Some transformations to convert URLs to canonical forms. Stars indi- cate heuristic rules, where there is a tradeoff between the risk of altering the se- mantics of the URL (if a wrong guess is made) and the risk of missing duplicate URLs (if no transformation is applied) for the same target Description and transformation Example and canonical form Default port number http://cs.indiana.edu:80/ Remove http://cs.indiana.edu/ Root directory http://cs.indiana.edu Add trailing slash http://cs.indiana.edu/ Guessed directory* http://cs.indiana.edu/People Add trailing slash http://cs.indiana.edu/People/ Fragment http://cs.indiana.edu/faq.html#3 Remove http://cs.indiana.edu/faq.html Current or parent directory http://cs.indiana.edu/a/./../b/ Resolve path http://cs.indiana.edu/b/ Default filename* http://cs.indiana.edu/index.html Remove http://cs.indiana.edu/ Needlessly encoded characters http://cs.indiana.edu/%7Efil/ Decode http://cs.indiana.edu/~fil/ Disallowed characters http://cs.indiana.edu/My File.htm Encode http://cs.indiana.edu/My%20File.htm Mixed/upper-case host names http://CS.INDIANA.EDU/People/ Lower-case http://cs.indiana.edu/People/ 8.2.5 Spider Traps A crawler must be aware of spider traps. These are Web sites where the URLs of dynamically created links are modified based on the sequence of actions taken by the browsing user (or crawler). Some e-commerce sites such as Amazon.com may use URLs to encode which sequence of products each user views. This way, each time a user clicks a link, the server can log detailed information on the user's shopping behavior for later analysis. As an illustration, consider a dynamic page for product x, whose URL path is /x and that contains a link to product y. The URL path for this link would be /x/y to indicate that the user is going from page x to page y. Now sup- pose the page for y has a link back to product x. The dynamically created URL path for this link would be /x/y/x, so that the crawler would think this is a new page when in fact it is an already visited page with a new URL. As a side effect of a spider trap, the server may create an entry in a data- base every time the user (or crawler) clicks on certain dynamic links. An example might be a blog or message board where users can post com- ments. These situations create sites that appear infinite to a crawler, be- cause the more links are followed, the more new URLs are created. How- 8.2 Implementation Issues 283 ever these new “dummy” links do not lead to existing or new content, but simply to dynamically created form pages, or to pages that have already been visited. Thus a crawler could go on crawling inside the spider trap forever without actually fetching any new content. In practice spider traps are not only harmful to the crawler, which wastes bandwidth and disk space to download and store duplicate or use- less data. They may be equally harmful to the server sites. Not only does the server waste its bandwidth, the side effect of a crawler caught in a spi- der trap may also be filling a server-side database with bogus entries. The database may eventually become filled to capacity, and the site may be disabled as a result. This is a type of denial of service attack carried out unwittingly by the crawler. In some cases a spider trap needs the client to send a cookie set by the server for the dynamic URLs to be generated. So the problem is prevented if the crawler avoids accepting or sending any cookies. However, in most cases a more proactive approach is necessary to defend a crawler against spider traps. Since the dummy URLs often become larger and larger in size as the crawler becomes entangled in a spider trap, one common heuristic approach to tackle such traps is by limiting the URL sizes to some maxi- mum number of characters, say 256. If a longer URL is encountered, the crawler should simply ignore it. Another way is by limiting the number of pages that the crawler requests from a given domain. The code associated with the frontier can make sure that every consecutive sequence of, say, 100 URLs fetched by the crawler contains at most one URL from each fully qualified host name. This approach is also germane to the issue of crawler etiquette, discussed later. 8.2.6 Page Repository Once a page is fetched, it may be stored/indexed for the master application (e.g., a search engine). In its simplest form a page repository may store the crawled pages as separate files. In this case each page must map to a unique file name. One way to do this is to map each page's URL to a com- pact string using some hashing function with low probability of collisions, e.g., MD5. The resulting hash value is used as a (hopefully) unique file name. The shortcoming of this approach is that a large scale crawler would incur significant time and disk space overhead from the operating system to manage a very large number of small individual files. A more efficient solution is to combine many pages into one file. A naïve approach is to simply concatenate some number of pages (say 1,000) into each file, with some special markup to separate and identify the pages 284 8 Web Crawling within the file. This requires a separate look-up table to map URLs to file names and IDs within each file. A better method is to use a database to store the pages, indexed by (canonical) URLs. Since traditional RDBMSs impose high overhead, embedded databases such as the open-source Berkeley DB are typically preferred for fast access. Many high-level languages such as Java and Perl provide simple APIs to manage Berkeley DB files, for example as tied associative arrays. This way the storage management operations become nearly transparent to the crawler code, which can treat the page repository as an in-memory data structure. 8.2.7 Concurrency A crawler consumes three main resources: network, CPU, and disk. Each is a bottleneck with limits imposed by bandwidth, CPU speed, and disk seek/transfer times. The simple sequential crawler described in Sect. 8.1 makes a very inefficient use of these resources because at any given time two of them are idle while the crawler attends to the third. The most straightforward way to speed-up a crawler is through concur- rent processes or threads. Multiprocessing may be somewhat easier to im- plement than multithreading depending on the programming language and platform, but it may also incur a higher overhead due to the involvement of the operating system in the management (creation and destruction) of child processes. Whether threads or processes are used, a concurrent crawler may follow a standard parallel computing model [292] as illustrated in Fig. 8.3. Basically each thread or process works as an independent crawler, ex- cept for the fact that access to the shared data structures (mainly the fron- tier, and possibly the page repository) must be synchronized. In particular a frontier manager is responsible for locking and unlocking the frontier data structures so that only one process or thread can write to them at one time. Note that both enqueueing and dequeuing are write operations. Addi- tionally, the frontier manager would maintain and synchronize access to other shared data structures such as the crawl history for fast look-up of visited URLs. It is a bit more complicated for a concurrent crawler to deal with an empty frontier than for a sequential crawler. An empty frontier no longer implies that the crawler has reached a dead-end, because other processes may be fetching pages and adding new URLs in the near future. The proc- ess or thread manager may deal with such a situation by sending a tempo- rary sleep signal to processes that report an empty frontier. The process manager needs to keep track of the number of sleeping processes; when all the processes are asleep, the crawler must halt. 8.3 Universal Crawlers 285 Fig. 8.3. Architecture of a concurrent crawler The concurrent design can easily speed-up a crawler by a factor of 5 or 10. The concurrent architecture however does not scale up to the perform- ance needs of a commercial search engine. We discuss in Sect. 8.3 further steps that can be taken to achieve more scalable crawlers. 8.3 Universal Crawlers General purpose search engines use Web crawlers to maintain their indices [25], amortizing the cost of crawling and indexing over the millions of queries received between successive index updates (though indexers are 286 8 Web Crawling designed for incremental updates [101]. These large-scale universal crawl- ers differ from the concurrent breadth-first crawlers described above along two major dimensions: 1. Performance: They need to scale up to fetching and processing hun- dreds of thousands of pages per second. This calls for several architec- tural improvements. 2. Policy: They strive to cover as much as possible of the most important pages on the Web, while maintaining their index as fresh as possible. These goals are, of course, conflicting so that the crawlers must be de- signed to achieve good tradeoffs between their objectives. Next we discuss the main issues in meeting these requirements. 8.3.1 Scalability Figure 8.4 illustrates the architecture of a large-scale crawler, based on the accounts in the literature [68, 85, 238]. The most important change from the concurrent model discussed earlier is the use of asynchronous sockets in place of threads or processes with synchronous sockets. Asynchronous sockets are non-blocking, so that a single process or thread can keep hun- dreds of network connections open simultaneously and make efficient use of network bandwidth. Not only does this eliminate the overhead due to managing threads or processes, it also makes locking access to shared data structures unnecessary. Instead, the sockets are polled to monitor their states. When an entire page has been fetched into memory, it is processed for link extraction and indexing. This “pull” model eliminates contention for resources and the need for locks. The frontier manager can improve the efficiency of the crawler by main- taining several parallel queues, where the URLs in each queue refer to a single server. In addition to spreading the load across many servers within any short time interval, this approach allows to keep connections with servers alive over many page requests, thus minimizing the overhead of TCP opening and closing handshakes. The crawler needs to resolve host names in URLs to IP addresses. The connections to the Domain Name System (DNS) servers for this purpose are one of the major bottlenecks of a naïve crawler, which opens a new TCP connection to the DNS server for each URL. To address this bottle- neck, the crawler can take several steps. First, it can use UDP instead of TCP as the transport protocol for DNS requests. While UDP does not guarantee delivery of packets and a request can occasionally be dropped, this is rare. On the other hand, UDP incurs no connection overhead with a 8.3 Universal Crawlers 287 significant speed-up over TCP. Second, the DNS server should employ a large, persistent, and fast (in-memory) cache. Finally, the pre-fetching of DNS requests can be carried out when links are extracted from a page. In addition to being added to the frontier, the URLs can be scanned for host names to be sent to the DNS server. This way, when a URL is later ready to be fetched, the host IP address is likely to be found in the DNS cache, obviating the need to propagate the request through the DNS tree. Fig. 8.4. High-level architecture of a scalable universal crawler 288 8 Web Crawling In addition to making more efficient use of network bandwidth through asynchronous sockets, large-scale crawlers can increase network band- width by using multiple network connections switched to multiple routers, thus utilizing the networks of multiple Internet service providers. Simi- larly, disk I/O throughput can be boosted via a storage area network con- nected to a storage pool through a fibre channel switch. 8.3.2 Coverage vs. Freshness vs. Importance Given the size of the Web, it is not feasible even for the largest-scale crawlers employed by commercial search engines to index all of the con- tent that could be accessed. Instead, search engines aim to focus on the most “important” pages, where importance is assessed based on various factors such as link popularity measures (indegree or PageRank) [102, 234]. At the time of this writing the three major commercial search engines report index sizes in the order of 1010 pages, while the indexable Web may be at least an order of magnitude larger. The simplest strategy to bias the crawler in favor of popular pages is to do nothing – given the long-tailed distribution of indegree discussed in Sect. 8.1, a simple breadth-first crawling algorithm will tend to fetch the pages with the highest PageRank by definition, as confirmed empirically [401]. In fact, one would have to apply a reverse bias to obtain a fair sam- ple of the Web. Suppose that starting with a random Web walk, we wanted a random sample of pages drawn with uniform probability distribution across all pages. We can write the posterior probability of adding a page p to the sample as Pr(accept(p)|crawl(p))⋅Pr(crawl(p)) where the first factor is the conditional probability of accepting the page into the sample given that it was crawled, and the second factor is the prior probability of crawl- ing the page in the random walk. We can find the acceptance strategy to obtain a uniform sample by setting the product to a constant, yielding Pr(accept(p)|crawl(p)) ∼ 1/Pr(crawl(p)). The prior Pr(crawl(p)) is given by the PageRank of p, and can be approximated during the random walk by the frequency f(p) that the crawler has encountered a link to p. So there- fore, each visited page p should be accepted with probability proportional to 1/f(p). Empirical tests on a simulated Web graph validate that this strat- egy yields a sample of the graph that is statistically representative of the original [235]. The goal to cover as many pages as possible (among the most important ones) is in conflict with the need to maintain a fresh index. Because of the highly dynamic nature of the Web, with pages being added, deleted, and modified all the time, it is necessary for a crawler to revisit pages already 8.4 Focused Crawlers 289 in the index in order to keep the index up-to-date. Many studies have been conducted to analyze the dynamics of the Web, i.e., the statistical proper- ties of the processes leading to change in Web structure and content 66, 101, 152, 177, 416]. They all indicate that the Web changes at very rapid rates. While early studies relied on the values reported by Web servers in the last-modified HTTP header, recently there is consensus that this infor- mation has little reliability. The most recent and exhaustive study at the time of this writing [416] reports that while new pages are created at a rate of about 8% per week, only about 62% of the content of these pages is really new because pages are often copied from existing ones. The link structure of the Web is more dynamic, with about 25% new links created per week. Once created, pages tend to change little so that most of the changes observed in the Web are due to additions and deletions rather than modifications. Finally, there is an agreement on the observation that the degree of change of a page is a better predictor of future change than the frequency of change [177, 416]. This suggests that crawler revisit strate- gies based on frequency of change [25, 101] may not be the most appropri- ate for achieving a good tradeoff between coverage and freshness. 8.4 Focused Crawlers Rather than crawling pages from the entire Web, we may want to crawl only pages in certain categories. One applications of such a preferential crawler would be to maintain a Web taxonomy such as the Yahoo! Direc- tory (dir.yahoo.com) or the volunteer-based Open Directory Project (ODP, dmoz.org). Suppose you are the ODP editor for a certain category; you may wish to launch such a crawler from an initial seed set of pages relevant to that category, and see if any new pages discovered should be added to the directory, either directly under the category in question or one of its subcategories. A focused crawler attempts to bias the crawler to- wards pages in certain categories in which the user is interested. Chakrabarti et al. [87] proposed a focused crawler based on a classifier. The idea is to first build a text classifier using labeled example pages from, say, the ODP. Then the classifier would guide the crawler by preferentially selecting from the frontier those pages that appear most likely to belong to the categories of interest, according to the classifier's prediction. To train the classifier, example pages are drawn from various categories in the tax- onomy as shown in Fig. 8.5. The classification algorithm used was the na- ïve Bayesian method (see Chap. 3). For each category c in the taxonomy we can build a Bayesian classifier to compute the probability Pr(c|p) that a crawled page p belongs to c (by definition, Pr(top|p) = 1 for the top or root 290 8 Web Crawling category). The user can select a set c* of categories of interest. Each crawled page is assigned a relevance score. ∑ ∈= * ).|Pr()( cc pcpR (1) 3 Two strategies were explored. In the “soft” focused strategy, the crawler uses the score R(p) of each crawled page p as a priority value for all unvis- ited URLs extracted from p. The URLs are then added to the frontier, which is treated as a priority queue (see Sect. 8.1.2). In the “hard” focused strategy, for a crawled page p, the classifier first finds the leaf category )(ˆ pc in the taxonomy most likely to include p: ˆ c (p) = arg max c: / ∃ c'⊂c Pr(c | p). (2) If an ancestor of )(ˆ pc is a focus category, i.e., ∃c’: )(ˆ pc ⊂ c’∧ c’∈ c*, then the URLs from the crawled page p are added to the frontier. Other- wise they are discarded. The idea is illustrated in Fig. 8.5 (left). For exam- ple, imagine a crawler focused on soccer (c' = soccer ∈ c*) visits a page p in the FIFA World Cup Germany 2006 site. If the classifier correctly as- signs p to the leaf category cˆ =Sports/Soccer/Competitions/World_Cup/2006, Fig. 8.5. Left: A taxonomy supporting a focused crawler. The areas in gray repre- sent the categories of interest c*. A crawler with hard focus would add to the fron- tier the links extracted from a page classified in the leaf category 1ˆc because its ancestor category c' is of interest to the user, while the links from a page classi- fied in 2ˆc would be discarded. Right: A context graph with L = 3 layers con- structed to train a context focused crawler from the target set in layer l = 0. 8.4 Focused Crawlers 291 the links extracted from p are added to the frontier because 2006 is a sub- category of Sports/Soccer (cˆ ⊂ soccer). The soft and hard focus strategies worked equally well in experiments. Another element of the focused crawler is the use of a distiller. The dis- tiller applies a modified version of the HITS algorithm [282] to find topical hubs. These hubs provide links to authoritative sources on a focus cate- gory. The distiller is activated at various times during the crawl and some of the top hubs are added to the frontier. Context-Focused Crawlers are another type of focused crawlers. They also use naïve Bayesian classifiers as a guide, but in this case the classifi- ers are trained to estimate the link distance between a crawled page and a set of relevant target pages [139]. To see why this might work, imagine looking for information on “machine learning.” One might go to the home pages of computer science departments and from there to faculty pages, which may then lead to relevant pages and papers. A department home page, however, may not contain the keywords “machine learning.” A typi- cal focused or best-first crawler would give such a page a low priority and possibly never follow its links. However, if the crawler could estimate that pages about machine learning are only two links away from a page con- taining the keywords “computer science department,” then it would give the department home page a higher priority. The context-focused crawler is trained using a context graph with L layers (Fig. 8.5 right). The seed (target) pages form the layer 0 of the graph. The pages corresponding to the in-links to the seed pages are in layer 1. The in-links to the layer 1 pages make up the layer 2, and so on. The in-links to any page can be obtained by submitting a link: query to a search engine. The seed pages in layer 0 (and possibly those in layer 1) are then concatenated into a single large document, and the top few terms ac- cording to the TF-IDF weighting scheme (see Chap. 6) are selected as the vocabulary (feature space) to be used for classification. A naïve Bayesian classifier is built for each layer in the context graph. A prior probability Pr(l ) = 1/L is assigned to each layer. All the pages in a layer are used to compute Pr(t|l ), the probability of occurrence of a term t given the layer (class) l . At the crawling time, these are used to compute Pr(p|l ) for each crawled page p. The posterior probability Pr(l |p) of p belonging to layer l can then be computed for each layer from Bayes’ rule. The layer l * with highest posterior probability wins: ).|Pr(maxarg)(* pp ll l = (3) 292 8 Web Crawling If Pr(l *|p) is less than a threshold, p is classified into the “other” class, which represents pages that do not have a good fit with any of the layers in the context graph. If Pr(l *|p) exceeds the threshold, p is classified into l *. The set of classifiers corresponding to the context graph provides a mechanism to estimate the link distance of a crawled page from a relevant page. If the mechanism works, the computer science department page in our example will get classified into layer 2. The crawler maintains a sepa- rate priority queue for each layer, containing the links extracted from vis- ited pages classified in that layer. Each queue is sorted by the scores Pr(l |p). The next URL to crawl is taken from the non-empty queue with the smallest l . So the crawler gives precedence to links that appear to be closest to relevant targets. It is shown in [139] that the context-focused crawler outperforms the standard focused crawler in experiments. While the majority of focused crawlers in the literature have employed the naïve Bayesian method as the classification algorithm to score unvis- ited URLs, an extensive study with hundreds of topics has provided strong evidence that classifiers based on SVM or neural networks can yield sig- nificant improvements in the quality of the crawled pages [433]. 8.5 Topical Crawlers For many preferential crawling tasks, labeled (positive and negative) ex- amples of pages are not available in sufficient numbers to train a focused crawler before the crawl starts. Instead, we typically have a small set of seed pages and a description of a topic of interest to a user or user commu- nity. The topic can consist of one or more example pages (possibly the seeds) or even a short query. Preferential crawlers that start with only such information are often called topical crawlers [85, 102, 377]. They do not have text classifiers to guide crawling. Even without the luxury of a text classifier, a topical crawler can be smart about preferentially exploring regions of the Web that appear rele- vant to the target topic by comparing features collected from visited pages with cues in the topic description. To illustrate a topical crawler with its advantages and limitations, let us consider the MySpiders applet (myspiders.informatics.indiana.edu). Figure 8.6 shows a screenshot of this application. The applet is designed to dem- onstrate two topical crawling algorithms, best-N-first and InfoSpiders, both discussed below [431]. MySpiders is interactive in that a user submits a query just like one would do with a search engine, and the results are then shown in a win- 8.5 Topical Crawlers 293 dow. However, unlike a search engine, this application has no index to search for results. Instead the Web is crawled in real time. As pages deemed relevant are crawled, they are displayed in a list that is kept sorted by a user-selected criterion: score or recency. The score is simply the con- tent (cosine) similarity between a page and the query (see Chap. 6); the re- cency of a page is estimated by the last-modified header, if returned by the server (as noted earlier this is not a very reliable estimate). Fig. 8.6. Screenshot of the MySpiders applet in action. In this example the user has launched a population of crawlers with the query “search censorship in france” using the InfoSpiders algorithm. The crawler reports some seed pages obtained from a search engine, but also a relevant blog page (bottom left) that was not re- turned by the search engine. This page was found by one of the agents, called Spi- der2, crawling autonomously from one of the seeds. We can see that Spider2 spawned a new agent, Spider13, who started crawling for pages also containing the term “italy.” Another agent, Spider5, spawned two agents one of which, Spi- der11, identified and internalized the relevant term “engine.” One of the advantages of topic crawling is that all hits are fresh by defi- nition. No stale results are returned by the crawler because the pages are visited at query time. This makes this type of crawlers suitable for applica- 294 8 Web Crawling tions that look for very recently posted documents, which a search engine may not have indexed yet. On the down side, the search is slow compared to a traditional search engine because the user has to wait while the crawler fetches and analyzes pages. If the user's client machine (where the applet runs) has limited bandwidth, e.g., a dial-up Internet connection, the wait is likely infeasible. Another disadvantage is that the ranking algorithms can- not take advantage of global prestige measures, such as PageRank, avail- able to a traditional search engine. Several research issues around topical crawlers have received attention. One key question is how to identify the environmental signals to which crawlers should attend in order to determine the best links to follow. Rich cues such as the markup and lexical (text) signals within Web pages, as well as features of the link graph built from pages already seen, are all rea- sonable sources of evidence to exploit. Crawlers can use the evidence available to them in different ways, for example more or less greedily. The goals of the application also provide crucial context. For example the desired properties of the pages to be fetched (similar pages, popular pages, authoritative pages, recent pages, and so on) can lead to significant differences in crawler design and imple- mentation. The task could be constrained by parameters like the maximum number of pages to be fetched (long crawls vs. short crawls) or the mem- ory available. A crawling task can thus be viewed as a constrained multi- objective search problem. The wide variety of objective functions, coupled with the lack of appropriate knowledge about the search space, make such a problem challenging. In the remainder of this section we briefly discuss the theoretical condi- tions necessary for topical crawlers to function, and the empirical evidence supporting the existence of such conditions. Then we review some of the machine learning techniques that have been successfully applied to iden- tify and exploit useful cues for topical crawlers. 8.5.1 Topical Locality and Cues The central assumption behind topical crawlers is that Web pages contain reliable cues about each other’s content. This is a necessary condition for designing a crawler that has a better-than-random chance to preferentially visit pages relevant with respect to a given topic. Indeed, if no estimates could be made about unvisited pages, then all we could do is a random walk through the Web graph, or an exhaustive search (using breadth-first or depth-first search algorithms). Fortunately, crawling algorithms can use cues from words and hyperlinks, associated respectively with a lexical and 8.5 Topical Crawlers 295 a link topology. In the former, two pages are close to each other if they have similar textual content; in the latter, if there is a short path between them (we will see what “short” means). Lexical metrics are text similarity measures derived from the vector space model (see Chap. 6). The cluster hypothesis behind this model is that a document lexically close to a relevant document (with respect to the given query) is also relevant with high probability [461]. Link metrics typically look at hyperlinks as directed edges in a graph, but a path can also be defined in an undirected sense, in which case two pages have a short link distance between them if they are co-cited or co- referenced, even if there is no directed path between them. Links are a very rich source of topical information about Web pages. From a crawler's perspective, there are two central questions: 1. link-content conjecture: whether two pages that link to each other are more likely to be lexically similar to each other, compared to two ran- domly selected pages; 2. link-cluster conjecture: whether two pages that link to each other are more likely to be semantically related to each other, compared to two randomly selected pages. A first answer to the link-content conjecture was obtained by computing the cosine similarity between linked and random pairs of pages, showing that the similarity is an order of magnitude higher in the former case [123]. The same study also showed that the anchor text tends to be a good (simi- lar) description of the target page. The link-content conjecture can be generalized by looking at the decay in content similarity as a function of link distance from a source page. This decay was measured by launching an exhaustive breadth-first crawl from seed sets of 100 topics in the Yahoo! directory [372]. Let us use the cosine similarity measure σ(p1, p2) between pages p1 and p2 (see Chap. 6). We can measure the link distance δ1(p1, p2) along the shortest directed path from p1 and p2, revealed by the breadth-first crawl. Both distances δ1(q, p) and similarities σ(q, p) were averaged for each topic q over all pages p in the crawl set q dP for each depth d: )(1),(),( 1 1 1 q i d i q iq d P NNiNpqdq p d − = −=≡ ∑δδ (4) ∑ ∈ =≡ p d p d Pp q d P pqNpqdq ),(1),(),( σσσ (5) 296 8 Web Crawling where Nd q is the size of the cumulative page set Pd q = {p | δ1(q, p) ≤ d}. The crawlers were stopped at depth d = 3, yielding 3000 data points {(p, d): q ∈{1, …, 100}, d ∈{1, 2, 3}}. These points were then used for fitting an exponential decay model: 21)1()( αδασσδσ − ∞∞ −+≈ e (6) where σ∞ is the noise level in similarity, measured empirically by averag- ing across random pairs of pages. The parameters α1 and α2 are set by fit- ting the data. This was done for pages in various top-level domains, and the resulting similarity decay curves are plotted in Fig. 8.7. Fig. 8.7. Illustration of the link-content conjecture. The curves plot, for each top- level domain, the decay in mean cosine similarity between pages as a function of their mean directed link distance, obtained by fitting data from 100 exhaustive breadth-first crawls starting from the 100 Yahoo! directory topics [372]. The curves provide us with a rough estimate of how far in link space one can make inferences about lexical content. We see that a weak signal is still present three links away from the starting pages for all but the .com domain, and even further for the .edu domain. Such heterogeneity is not surprising – academic pages are written carefully to convey information and proper pointers, while business sites often do not link to related sites because of competition. Therefore a topical crawler in the commercial do- main would have a harder task, other things being equal. A solution may 8.5 Topical Crawlers 297 be to use undirected links. More specifically, if a crawler can obtain in- links to good pages (by querying a search engine), it can use co-citation to detect hubs. If a page links to several good pages, it is probably a good hub and all its out-links should be given high priority. This strategy, related to the so-called sibling locality [3], has been used in focused crawlers [87] and in topical crawlers for business intelligence [432]. In addition to co- citation, one could look at bibliographic coupling: if several good pages link to a certain page, that target is likely to be a good authority so it and its in-links should be given high priority. Fig. 8.8 illustrates various ways in which crawlers can exploit co-citation and bibliographic coupling. Fig. 8.8. Crawling techniques exploiting co-citation (top) and bibliographic cou- pling (bottom). Dashed edges represent in-links, which require access to a search engine or connectivity server. Page A is a good hub, so it should be given high priority; once fetched, page B linked by it can be discovered and placed in the frontier with high priority since it is likely to be a good authority. Page C is also a good hub, so D should be given high priority. Page E is a good authority, so it should be given high priority. Its URL can also be used to discover F, which may be a good hub and should be placed in the frontier. G is also a good authority, so H should be given high priority and I should be placed in the frontier. 298 8 Web Crawling The link-cluster conjecture, also known as linkage locality [87], states that one can infer the meaning of a page by looking at its neighbors. This is actually more important than inferring lexical content, since the latter is only relevant insofar as it is correlated with the semantic content of pages. The same exhaustive crawl data used to validate the link-content conjec- ture can also be used to explore the link-cluster conjecture, namely the ex- tent to which relevance is preserved within link space neighborhoods and the decay in expected relevance as one browses away from a relevant page [372]. The link-cluster conjecture can be simply formulated in terms of the conditional probability that a page p is relevant with respect to some query q, given that page r is relevant and that p is within d links from r: Rq (d) ≡ Pr(relq (p)|relq (r)∧δ1(r, p) ≤ d] (7) where relq() is a binary relevance assessment with respect to q. In other words a page has a higher than random probability of being about a certain topic if it is in the neighborhood of other pages about that topic. Rq(d) is the posterior relevance probability given the evidence of a relevant page nearby. The conjecture is then represented by the likelihood ratio λ(q, d) between Rq(d) and the prior relevance probability Gq ≡ Pr(relq(p)), also known as the generality of the query. If semantic inferences are possible within a link radius d, then the following condition must hold: .1)(),( >≡ q q G dRdqλ (8) To illustrate the meaning of the link-cluster conjecture, consider a random crawler searching for pages about a topic q. Call ηq(t) the probability that the crawler hits a relevant page at time t. Solving the recursion ηq (t +1) = ηq (t)Rq (1) + (1 −ηq (t))Gq (9) for ηq(t+1) = ηq(t) yields the stationary hit rate .)1(1 * qq q q RG G −+=η (10) The link-cluster conjecture is a necessary and sufficient condition for such a crawler to have a better than chance hit rate: .1)1,(* >⇔> qGqq λη (11) Figure 8.9 plots the mean likelihood ratio λ(q, d) versus the mean link dis- tance δ(q, d) obtained by fitting an exponential decay function 8.5 Topical Crawlers 299 λ(δ) ≈1+ α3e−α4δ α5 (12) to the same 300 data points {(q, d)}. Note that this three-parameter model is more complex than the one used to validate the link-content conjecture, because λ(δ = 0) must also be estimated from the data (λ(q, 0) = 1/Gq). The fitted curve reveals that being within a radius of three links from a relevant page increases the relevance probability by a factor λ(q, d) >>1. This is very reassuring for the design of topical crawlers. It also suggests that crawlers should attempt to remain within a few links from some rele- vant source. In this range hyperlinks create detectable signals about lexical and semantic content, despite the Web's apparent lack of structure. Fig. 8.9. Illustration of the link-cluster conjecture. The curve plots the decay in mean likelihood ratio as a function of mean directed link distance from a relevant page, obtained by fitting data from 100 exhaustive breadth-first crawls starting from as many Yahoo! directory topics [372]. The link-content and link-cluster conjectures can be further developed by looking at the correlation between content-based, link-based, and se- mantic-based similarity measures. Using the ODP as a ground truth, we can express the semantic similarity between any two pages in the taxon- omy [373, 359] and see how it can be approximated by content and link similarity measures. For content one can consider for example cosine simi- larity based on TF or TF-IDF term weights. For link similarity one can similarly represent a page as a bag of links (in-links, out-links, or 300 8 Web Crawling both/undirected) and then apply a Jaccard coefficient or a cosine similar- ity. Figure 8.10 shows, for various topical domains from the ODP, the cor- relation between semantic similarity and two representative content and link similarity measures. We observe significant heterogeneity in the corre- lations, suggesting that topical crawlers have an easier job in some topics (e.g., “news”) than others (e.g., “games”). Another observation is that in some topical domains (e.g., “home”) textual content is a more reliable sig- nal, while in others (e.g., “computers”) links are more helpful. Fig. 8.10. Pearson correlation coefficients between the semantic similarity ex- tracted from ODP [359] and two representative content and link similarity meas- ures. The correlations are measured using a stratified sample of 150,000 URLs from the ODP, for a total of 4 billion pairs [373]. Content similarity is cosine with TF weights, and link similarity is the Jaccard coefficient with undirected links. 8.5.2 Best-First Variations The majority of crawling algorithms in the literature are variations of the best-first scheme described in Sect. 8.1.2. The difference is in the heuris- tics that they use to score unvisited URLs. A very simple instance is the case where each URL is queued into the frontier with priority given by the content similarity between the topic description and the page from which the URL was extracted. Content similarity can be measured with the stan- 8.5 Topical Crawlers 301 dard cosine similarity, using TF or TFIDF term weights (in the latter case the crawler must have global or topic-contextual term frequency informa- tion available). This simple crawler is also known as naïve best-first. Many variations of the naïve best-first crawlers are possible. Some give more importance to certain HTML markups, such as the title, or to text segments marked by special tags, such as headers. Other techniques focus on determining the most appropriate textual context to score a link. One al- ternative to using the entire page or just the anchor text as context, used by Fig. 8.11. Link context from distance-weighted window (top) and from the DOM tree (bottom). 302 8 Web Crawling InfoSpiders [375] and Clever [86], is a weighted window where topic key- words occurrences near the anchor count more toward the link score than those farther away, as shown in Fig. 8.11. Another approach is to consider the tag (DOM) tree of the HTML page [85]. The idea is to walk up the tree from the link anchor toward the root, stopping at an appropriate aggrega- tion node. The link context is then obtained by the text in the tag subtree rooted at the aggregation node (Fig. 8.11). SharkSearch [237] is an improved version of the earlier FishSearch crawler [60]. It uses a similarity measure like the one used in the naïve best-first crawler as a first step for scoring unvisited URLs. The similarity is computed for anchor text, a fixed-width link context, the entire source page, and ancestor pages. The ancestors of a URL are the pages that appear on the crawl path to the URL. SharkSearch, like its predecessor Fish- Search, maintains a depth bound. That is, if the crawler finds unimportant pages on a crawl path it stops crawling further along that path. To this end, each URL in the frontier is associated with a depth and a potential score. The score of an unvisited URL is obtained from a linear combina- tion of anchor text similarity, window context similarity, and an inherited score. The inherited score is the similarity of the source page to the topic, unless it is zero, in which case it is inherited from the source's parent (and recursively from its ancestors). The implementation of SharkSearch re- quires to preset three similarity coefficients in addition to the depth bound. This crawler does not perform as well as others described below. Rather than (or in addition to) improving the way we assign priority scores to unvisited URLs, we can also improve on a naïve best-first crawler by altering the priority scheme. A classic trade-off in machine learning is that between exploration and exploitation of information. A crawler is no different: it can greedily pursue the best-looking leads based on noisy qual- ity estimates, or be more explorative and visit some pages that seem less promising, but might lead to better pages. The latter approach is taken in many optimization algorithms in order to escape local optima and reach a global optimum with some probability. As it turns out, the same strategy is also advantageous for topical crawlers. Visiting some URLs with lower priority leads to a better overall quality of the crawler pages than strictly following the best-first order. This is demonstrated by best-N-first, a crawling algorithm that picks N URLs at a time from the frontier (the top N by priority score) and fetches them all. Once all N pages are visited, the newly extracted URLs are merge-sorted into the priority queue, and the cycle is repeated. The best-N-first crawler with N = 256 is a very strong competitor, outperforming most of the other topical crawlers in the litera- ture [434, 377]. Figure 8.12 shows a comparison with two crawlers dis- 8.5 Topical Crawlers 303 cussed thus far. Note that a concurrent implementation of a best-first crawler with N threads or processes is equivalent to a best-N-first crawler. 8.5.3 Adaptation All the crawlers discussed thus far use a static strategy both to evaluate unvisited URLs and to manage the frontier. Thus they do not learn from experience or adapt to the context of a particular topic in the course of the crawl. In this section we describe a number of machine learning techniques that have been incorporated into adaptive topical crawlers. The intelligent crawler uses a statistical model for learning to assign priorities to the URLs in the frontier, considering Bayesian interest factors derived from different features [3]. For example, imagine that the crawler is supposed to find pages about soccer and that 40% of links with the keyword football in the anchor text lead to relevant pages, versus a back- ground or prior frequency of only 2% of crawled pages being relevant. Then the crawler assigns an interest factor Fig. 8.12. Performance of best-N-first crawler with N = 256 (BFS256) compared with a naïve best-first crawler (BFS1) and a breadth-first crawler. Recall refers to sets of relevant pages that the crawlers are supposed to discover; averages and er- ror bars are computed across 100 crawls from as many ODP topics. 304 8 Web Crawling 20)](Pr[ )](|)(Pr[ ),( =∈= ∈ prel panchorfootballprel anchorfootballsoccer soccer soccer λ (13) to the feature “keyword football in anchor.” Recall relsoccer(p) is the binary relevance score (0 or 1) of page p to soccer. The interest factors are treated as independent sources of evidence, or likelihoods. They are combined by a linear combination of log-likelihoods, with user-defined weight parame- ters. The features employed by the intelligent crawler may be diverse, de- pending on the particular crawling task. They may include tokens extracted from candidate URLs, source page content and links, co-citation (sibling) relationships, and/or other characteristics of the visited and unvisited URLs. As more evidence is accumulated and stored throughout the crawl, the interest factors are recalculated and the priorities updated, so that the frontier is always sorted according to the most recent estimates. Thus intel- ligent crawlers adapt to the content and link structure of the Web neighborhoods being explored. The original focused crawlers described earlier also use machine learn- ing, in particular a classifier that guides the crawler. However the classifier is trained before the crawl is launched, and no learning occurs during the crawl. Therefore we do not consider it an adaptive crawler. However, in a later “accelerated” version of the focused crawler [85], an online learning apprentice was added to the system; the original (baseline) classifier then acts as a critic, providing the apprentice with training examples for learn- ing to classify outgoing links from the features of the pages from which they are extracted. Suppose page p1 is fetched and contains a link to page p2. Later, p2 is fetched and the baseline classifier assigns it to a relevant class. This information is passed to the apprentice, which uses the labeled example (“the link from p1 to p2 is good”) to learn to classify the link to p2 as good based on the textual features in the context of the anchor within p1. Future links with a similar context should be given high priority. Con- versely, if p2 is deemed irrelevant by the baseline classifier, the apprentice learns to predict (“bad link”) when it encounters a link with a similar con- text in the future. The features used to train the apprentice were textual to- kens associated with a link context based on the DOM tree, and the learn- ing algorithm used by the apprentice was a naïve Bayesian classifier. This approach led to a significant reduction in the number of irrelevant pages fetched by the focused crawler. While the accelerated focused crawler is not a topical crawler because it still needs labeled examples to train the baseline classifier prior to the crawl, the idea of training an apprentice online during the crawl can be ap- 8.5 Topical Crawlers 305 plied in topical crawlers as well. Indeed this is a type of reinforcement learning technique employed in several crawlers, using different features and/or different learning algorithms for the apprentice. In reinforcement learning [263] we have a network where nodes are states and directed links are actions. An action a ∈ A (think “anchor”) moves an agent from a state p ∈ P (think “page”) to another state according to a transition function L: P × A → P. Thus an adaptive crawler is seen as an agent moving from page to page. Actions are rewarded according to a function r: P × A → ℜ. We want to learn a policy mapping states to actions, π: P → A, that maximizes future reward discounted over time: ∑ ∞ = = 0 0 ),()( t tt t aprpV γπ (14) where we follow action (link) at=π(pt) from state (page) pt at each time step t. The parameter γ determines how future rewards are discounted (0 ≤ γ < 1). If γ = 0, the reinforcement learning policy is the greedy one em- ployed by the naïve best-first crawler. To learn an optimal policy, we de- fine the value of selecting action a from state p, and following the optimal policy thereafter: )],([),(),( * apLVaprapQ γ+= (15) where V* is the value function of the optimal policy π*(p) = argmaxaQ(p, a). The question then becomes how to estimate the function Q, i.e., to as- sign a value to a link a based on the context information in page p from which the link is extracted. However, the actions available to the crawler are not limited to the links from the last page visited; any of the actions corresponding to the URLs in the frontier are available. Furthermore, there is no reason why the Q value of a link should be a function of a particular source page; if links to the same target page are extracted from multiple source pages, the estimated values of the anchors can be combined, for ex- ample Q(u) = max{(p, a): L(p, a) = u}Q(p, a). This way Q values can be com- puted not for links (anchors), but for target pages (URLs); the state and ac- tion spaces are thus greatly reduced, basically collapsing all visited pages into a single degenerate state and all links to their target URLs. The policy π reduces to the simple selection of the URL in the frontier with the maxi- mum Q value. One way to calculate Q values is via a naïve Bayesian classifier. This method was found to work well compared to a breadth-first crawler for the tasks of crawling computer science research papers and company directory information [366, 459]. In this case, the classifier was trained off-line 306 8 Web Crawling rather than online while crawling, using labeled examples as in the focused crawler. Training the classifier to predict future reward (γ > 0) was better than only using immediate reward (γ = 0). For future reward the authors use a heavy discount γ = 0.5, arguing that it is optimal to be greedy in se- lecting URLs from the frontier, so that one can crawl toward the nearest relevant page. This assumes that all relevant targets are within reach. So there is no reason to delay reward. However, as discussed earlier, a crawler typically deals with noisy data, so the classifier’s Q estimates are not en- tirely reliable; more importantly, a typical crawler cannot possibly cover the entire search space. These factors suggest that it may be advantageous to occasionally give up some immediate reward in order to explore other directions, potentially leading to pockets or relevant pages unreachable by a greedy crawler (see Fig. 8.12). Using a previously trained classifier to compute Q values for URLs in the frontier means that supervised learning is combined with reinforcement learning. As for focused crawlers, labeled examples must be available prior to the start of the crawl. This may be possible in tasks such as the collec- tion of research articles, but is not a realistic assumption for typical topical crawlers. An adaptive crawling algorithm that actually uses reinforcement learning while crawling online, without any supervised learning, is InfoS- piders. This crawler employs various machine learning techniques to ex- ploit various types of Web regularities. InfoSpiders are inspired by artifi- cial life models in which a population of agents lives, learn, evolve, and die in an environment. Individual agents learn during their lifetimes based on their experiences, with the environment playing the role of a critic, pro- viding rewards and penalties for actions. Agents may also reproduce, giv- ing rise to new agents similar to them, and die. The environment is the Web, the actions consist of following links and visiting pages and the text and link features of pages are the signals that agents can internalize into their learned and evolved behaviors. Feedback from the environment con- sists of a finite energy resource, necessary for survival. Each action has an energy cost, which may be fixed or proportional to, say, the size of a fetched page or the latency of a page download [126]. Energy is gained from visiting new pages relevant to the query topic. A cache prevents an agent from accumulating energy by visiting the same page multiple times. In the recent version of InfoSpiders, each agent maintains its own frontier of unvisited URLs [377]. The agents can be implemented as concurrent processes/threads, with non-contentious access to their local frontiers. Fig. 8.13 illustrates the representation and flow of an individual agent. The adaptive representation of each InfoSpiders agent consists of a list of keywords (initialized with the topic description) and a neural net used to evaluate new links. Each input unit of the neural net receives a count of the 8.5 Topical Crawlers 307 frequency with which the keyword occurs in the vicinity of each link, weighted to give more importance to keywords occurring near the anchor and maximum for the anchor text (Fig. 8.11). The neural net has a single output unit whose activation is used as a Q value (score) for each link u in input. The agent’s neural net learns to predict the Q value of the link’s tar- get URL u given the inputs from the link's source page p. The reward func- tion r(u) is the cosine similarity between the agent’s terms and the target page u. The future discounted optimal value γV*(u) is approximated using the highest neural net prediction among the links subsequently extracted from u. This procedure is similar to the reinforcement learning algorithm described above, except that the neural net replaces the naïve Bayesian classifier. The neural net is trained by the back-propagation algorithm [469]. This mechanism is called connectionist reinforcement learning [335]. While the neural net can in principle model nonlinear relationships between term frequencies and pages, in practice we have used a simple perceptron whose prediction is a linear combination of the keyword weights. Such a learning technique provides each InfoSpiders agent with the capability to adapt its own link-following behavior in the course of a crawl by associating relevance estimates with particular patterns of key- word frequencies around links. Fig. 8.13. A single InfoSpiders agent. The link context is the weighted window as shown in Fig. 8.11: for each newly extracted URL and for each term in the agent's term list, this produces a weight that is fed into the neural network, whose output is stored as the link's priority score in the frontier. 308 8 Web Crawling The neural net's link scores are combined with estimates based on the cosine similarity between the agent's keyword list and the entire source page. A parameter α (0 ≤ α ≤ 1) regulates the relative importance given to the estimates based on the neural net versus the source page. Based on the combined score σ the agent uses a stochastic selector to pick one of the links in the frontier with probability ∑ ∈ = φ βσ βσ ' )'( )( )Pr( u u u e eu (16) where u is a URL in the local frontier φ. Parameter β regulates the greedi- ness of the link selector. Its value can be fixed or evolved with the agent. After a new page u has been fetched, the agent receives an energy pay- off proportional to the difference between the reward r(u) and the cost charged for the download. An agent dies when it runs out of energy. The energy level is also used to determine whether or not the agent should re- produce after visiting a page. An agent reproduces when the energy level passes a fixed threshold. The reproduction is meant to bias the search to- ward areas with pages relevant to the topic. Topical locality suggests that if an agent visits a few relevant pages in rapid sequence, more relevant pages are likely to be nearby (in the frontier). To exploit this, the accumulated energy results in a short-term doubling of the frequency with which the crawler explores this agent’s frontier. At reproduction, the agent’s energy and frontier are split in half with the offspring (new agent or thread). Ac- cording to ecological theory, this way the agent population is supposed to move toward an optimal cover of the Web graph in proportion to the local density of resources, or relevant pages. In addition to the individual's reinforcement learning and the popula- tion’s evolutionary bias, InfoSpiders employ a third adaptive mechanism. At reproduction, the offspring’s keyword vector is mutated (expanded) by adding a new term. The chosen term/keyword is the one that is most fre- quent in the parent’s last visited page, i.e., the page that triggered the re- production. This selective query expansion strategy, illustrated in Fig. 8.6, is designed to allow the population to diversify and expand its focus according to each agent’s local context. An InfoSpiders crawler incorpo- rating all of these adaptive techniques has been shown to outperform vari- ous versions of naïve best-first crawlers (Fig. 8.14) when visiting a suffi- ciently large number of pages (more than 10,000) so that the agents have time to adapt [377, 504]. 8.5 Topical Crawlers 309 Fig. 8.14. Performance plots [377]: average target recall 〈RT(t)〉 (top) and average precision 〈PD(t)〉 (similarity to topic description, bottom). The averages are calcu- lated over 10 ODP topics. After 50,000 pages crawled, one tailed t-tests reveal that both BFS256 and InfoSpiders outperform the breadth-first crawler on both performance metrics. InfoSpiders outperform BFS256 on recall, while the differ- ence in precision is not statistically significant. 310 8 Web Crawling 8.6 Evaluation Given the goal of building a “good” crawler, a critical question is how to evaluate crawlers so that one can reliably compare two crawling algo- rithms and conclude that one is “better” than the other. Since a crawler is usually designed to support some application, e.g., a search engine, it can be indirectly evaluated through the application it supports. However, attri- bution is problematic; if a search engine works better than another (assum- ing that were easy to determine!), how can we attribute this difference in performance to the underlying crawling algorithms as opposed to the rank- ing or indexing schemes? Thus it is desirable to evaluate crawlers directly. Often crawler evaluation has been carried out by comparing a few crawling algorithms on a limited number of queries/tasks without consider- ing the statistical significance. Such anecdotal results, while important, do not suffice for thorough performance comparisons. As the Web crawling field has matured, a need has emerged for evaluating and comparing dispa- rate crawling strategies on common tasks through well-defined perform- ance measures. Let us review the elements of such an evaluation frame- work, which can be applied to topical as well as focused crawlers. A comparison between crawlers must be unbiased and must allow one to measure statistically significant differences. This requires a sufficient number of crawl runs over different topics, as well as sound methodologies that consider the temporal nature of crawler outputs. Significant challenges in evaluation include the general unavailability of relevant sets for particu- lar topics or queries. Unfortunately, meaningful experiments involving real users for assessing the relevance of pages as they are crawled are ex- tremely problematic. In order to obtain a reasonable notion of crawl effec- tiveness one would have to recruit a very large number of subjects, each of whom would have to judge a very large number of pages. Furthermore, crawls against the live Web pose serious time constraints and would be overly burdensome to the subjects. To circumvent these problems, crawler evaluation typically relies on de- fining measures for automatically estimating page relevance and quality. The crawler literature reveals several performance measures used for these purposes. A page may be considered relevant if it contains some or all of the keywords in the topic/query. The frequency with which the keywords appear on the page may also be considered [102]. While the topic of inter- est to the user is often expressed as a short query, a longer description may be available in some cases. Similarity between the short or long description and each crawled page may be used to judge the page's relevance [237, 376, 504]. The pages used as the crawl's seed URLs may be combined to- gether into a single document, and the cosine similarity between this 8.6 Evaluation 311 document and a crawled page may serve as the page’s relevance score [18]. A classifier may be trained to identify relevant pages. The training may be done using seed pages or other pre-specified relevant pages as positive examples. The trained classifier then provides boolean or continu- ous relevance scores for each of the crawled pages [87, 139]. Note that if the same classifier, or a classifier trained on the same labeled examples, is used both to guide a (focused) crawler and to evaluate it, the evaluation is not unbiased. Clearly the evaluating classifier would be biased in favor of crawled pages. To partially address this issue, an evaluation classifier may be trained on a different set than the crawling classifier. Ideally the training sets should be disjoint. At a minimum the training set used for evaluation must be extended with examples not available to the crawler [433]. An- other approach is to start N different crawlers from the same seeds and let them run until each crawler gathers P pages. All of the N×P pages col- lected from the crawlers are ranked against the topic query/description us- ing a retrieval algorithm such as cosine. The rank provided by the retrieval system for each page is then used as a relevance score. Finally, one may use algorithms, such as PageRank or HITS, that provide authority or popu- larity estimates for each crawled page. A simpler method would be to use just the number of in-links to the crawled page to derive similar informa- tion [18, 102]. Many variations of link-based methods using topical weights may be applied to measure the topical quality of pages [52, 88]. Once each page is assessed, a method is needed to summarize the per- formance of a crawler across a set of crawled pages. Given a particular measure of page relevance and/or importance we can summarize the per- formance of the crawler with metrics that are analogous to the information retrieval notions of precision and recall (see Chap. 6). Lacking well- defined relevant sets, the classic boolean relevance is replaced by one of the scores outlined above. A few precision-like measures are found in the literature. In case we have boolean relevance scores, we could measure the rate at which “good” pages are found; if 100 relevant pages are found in the first 500 pages crawled, we have an acquisition rate or harvest rate of 20% at 500 pages [3]. If the relevance scores are continuous (e.g., from co- sine similarity or a trained classifier) they can be averaged over the crawled pages. The average relevance, as shown in Fig. 8.14, may be com- puted over the progress of the crawl [376]. Sometimes running averages are calculated over a window of a number of pages, e.g., the last 50 pages from a current crawl point [87]. Another measure from information re- trieval that has been applied to crawler evaluation is search length [375], defined as the number of pages (or the number of irrelevant pages) crawled before a certain percentage of the relevant pages are found. Search length is akin to the reciprocal of precision for a preset level of recall. 312 8 Web Crawling Recall-like measures would require normalization by the number of relevant pages. Since this number is unknown for Web crawling tasks, it might appear that recall cannot be applied to crawlers. However, even if unknown, the size of the relevant set is a constant. Therefore, it can be dis- regarded as a scaling factor when comparing two crawling algorithms on the same topical query. One can simply sum the quality or relevance esti- mates (obtained by one of the methods outlined above) over the course of a crawl, and obtain a total relevance as shown in Fig. 8.14. It is possible to design crawling experiments so that a set of relevant tar- get pages is known by the experimenter. Then precision and recall can be calculated from the fraction of these relevant targets that are discovered by the crawler, rather than based on relevance estimates. One way to obtain a set of relevant pages is from a public directory such as the ODP. This way one can leverage the classification already carried out by the volunteer edi- tors of the directory. The experimenter can select as topics a set of catego- ries from the ODP, whose distance from the root of the ODP taxonomy can be determined so as to obtain topics with generality/specificity appropriate for the crawling task [377, 504]. Figure 8.5 (left) illustrates how subtrees rooted at a chosen category can be used to harvest a set of relevant target pages. If a page is classified in a subtopic of a target topic, it can be con- sidered relevant with respect to the target topic. If a set of known relevant target pages is used to measure the perform- ance of a topical crawler, these same pages cannot be used as seeds for the crawl. Two approaches have been proposed to obtain suitable seed pages. One is to perform a back-crawl from the target pages [504]. By submitting link: queries to a search engine API, one can obtain a list of pages linking to each given target; the process can be repeated from these parent pages to find “grandparent” pages, and so on until a desired link distance is reached. The greater the link distance, the harder the task is for the crawler to locate the relevant targets from these ancestor seed pages. The procedure has the desired property that directed paths are guaranteed to exist from any seed page to some relevant targets. Given the potentially large fan-in of pages, sampling is likely required at each stage of the back-crawl to obtain a suit- able number of seeds. The process is similar to the construction of a con- text graph, as shown in Fig. 8.5 (right). A second approach is to split the set of known relevant pages into two sets; one set can be used as seeds, the other as targets. While there is no guarantee that the targets are reachable from the seeds, this approach is significantly simpler because no back- crawl is necessary. Another advantage is that each of the two relevant sub- sets can be used in turn as seeds and targets. In this way, one can measure the overlap between the pages crawled starting from the two disjoint sets. 8.6 Evaluation 313 A large overlap is interpreted as robustness of the crawler in covering relevant portions of the Web [87, 85]. The use of known relevant pages as proxies for unknown relevant sets implies an important assumption, which we can illustrate by the Venn dia- gram in Fig. 8.15. Here S is a set of crawled pages and T is the set of known relevant target pages, a subset of the relevant set R. Let us consider the measure of recall. Using T as if it were the relevant set means that we are estimating the recall |R ∩ S| / |R| by |T ∩ S| / |T|. This approximation only holds if T is a representative, unbiased sample of R independent of the crawl process. While the crawler attempts to cover as much as possible of R, it should not have any information about how pages in T are sampled from R. If T and S are not independent, the measure is biased and unreli- able. For example if a page had a higher chance of being selected in T be- cause it was in S, or vice versa, then the recall would be overestimated. The same independence assumption holds for precision-like measures, where we estimate |R ∩ S| / |S| by |T ∩ S| / |S|. A consequence of the inde- pendence requirement is that if the ODP is used to obtain T, the experi- menter must prevent the crawler from accessing the ODP. This would bias the results because, once a relevant ODP category page is found, all of the relevant target pages can be reached by the crawler in a short breadth-first sweep. Preventing access to the ODP may pose a challenge because so many ODP mirrors exist on the Web. They may not be known by the ex- perimenter, and not trivial to detect. To summarize, crawler performance measures [504] can be character- ized along two dimensions: the source of relevance assessments (target pages vs. similarity to their descriptions) and the normalization factor (av- Fig. 8.15. Illustration of precision and recall measures based on known relevant target pages and underlying independence assumption/requirement. 314 8 Web Crawling erage relevance, or precision, vs. total relevance, or recall). Using target pages as the relevant sets we can define crawler precision and recall as fol- lows: || ||),( θ θθ T TStR t T ∩= (17) || ||),( t t T S TStP θθ ∩= (18) where St is the set of pages crawled at time t (t can be wall clock time, network latency, number of pages visited, number of bytes downloaded, and so on). Tθ is the relevant target set, where θ represents the parameters used to select the relevant target pages. This could include for example the depth of ODP category subtrees used to extract topic-relevant pages. Analogously we can define crawler precision and recall based on similarity to target descriptions: || ),( ),( θ θσ θ T Dp tR tSp D ∑ ∈= (19) || ),( ),( t Sp D S Dp tP t∑ ∈= θσ θ (20) where Dθ is the textual description of the target pages, selected with pa- rameters θ, and σ is a text-based similarity function, e.g., cosine similarity (see Chap. 6). Figure 8.14 shows two examples of performance plots for three different crawlers discussed earlier in this chapter. The two plots de- pict RT and PD as a function of pages crawled. InfoSpiders and the BFS256 crawler are found to outperform the breadth-first crawler. InfoSpiders gain a slight edge in recall once the agents have had an opportunity to adapt. This evaluation involves each of the three crawlers visiting 50,000 pages for each of 10 topics, for a total of 1.5 million pages. Another set of evaluation criteria can be obtained by scaling or normal- izing any of the above performance measures by the critical resources used by a crawler. This way, one can compare crawling algorithms by way of performance/cost analysis. For example, with limited network bandwidth one may see latency as a major bottleneck for a crawling task. The time spent by a crawler on network I/O can be monitored and applied as a scal- ing factor to normalize precision or recall. Using such a measure, a crawler 8.7 Crawler Ethics and Conflicts 315 designed to preferentially visit short pages, or pages from fast servers [126], would outperform one that can locate pages of equal or even better quality but less efficiently. 8.7 Crawler Ethics and Conflicts Crawlers, especially when efficient, can put a significant strain on the re- sources of Web servers, mainly on their network bandwidth. A crawler that sends many page requests to a server in rapid succession, say ten or more per second, is considered impolite. The reason is that the server would be so busy responding to the crawler that its service to other requests, includ- ing those from human browsing interactively, would deteriorate. In the ex- treme case a server inundated with requests from an aggressive crawler would become unable to respond to other requests, resulting in an effective denial of service attack by the crawler. To prevent such incidents, it is essential for a crawler to put in place measures to distribute its requests across many servers, and to prevent any one server (fully qualified host name) from receiving requests at more than some reasonably set maximum rate (say, one request every few seconds). In a concurrent crawler, this task can be carried out by the frontier man- ager, when URLs are dequeued and passed to individual threads or proc- esses. This practice not only is required by politeness toward servers, but also has the additional benefits of limiting the impact of spider traps and not overloading the server, which will respond slowly. Preventing server overload is just one of a number of policies required of ethical Web agents [160]. Such policies are often collectively referred to as crawler etiquette. Another requirement is to disclose the nature of the crawler using the User-Agent HTTP header. The value of this header should include not only a name and version number of the crawler, but also a pointer to where Web administrators may find information about the crawler. Often a Web site is created for this purpose and its URL is in- cluded in the User-Agent field. Another piece of useful information is the email contact to be specified in the From header. Finally, crawler etiquette requires compliance with the Robot Exclusion Protocol. This is a de facto standard providing a way for Web server ad- ministrators to communicate which files may not be accessed by a crawler. This is accomplished via an optional file named robots.txt in the root direc- tory of the Web server (e.g., http://www.somehost.com/robots.txt). The file provides access policies for different crawlers, identified by the User-agent field. For any user-agent value (or the default “*”) a number of Disallow entries identify directory subtrees to be avoided. Compliant crawlers must 316 8 Web Crawling fetch and parse a server's robots.txt file before sending requests to that server. For example, the following policy in robots.txt: User-agent: * Disallow: / directs any crawler to stay away from the entire server. Some high-level languages such as Perl provide modules to parse robots.txt files. It is wise for a crawler to cache the access policies of recently visited servers, so that the robots.txt file need not be fetched and parsed every time a request is sent to the same server. Additionally, Web authors can indicate if a page may or may not be indexed, cached, or mined by a crawler using a special HTML meta-tag. Crawlers need to fetch a page in order to parse this tag, therefore this approach is not widely used. More details on the robot exclu- sion protocols can be found at http://www.robotstxt.org/wc/robots.html. When discussing the interactions between information providers and search engines or other applications that rely on Web crawlers, confusion sometime arises between the ethical, technical, and legal ramifications of the Robot Exclusion Protocol. Compliance with the protocol is an ethical issue, and non-compliant crawlers can justifiably be shunned by the Web community. However, compliance is voluntary, and a robots.txt file cannot enforce it. Servers can, however, block access to a client based on its IP address. Thus it is likely that a crawler which does not comply with the Exclusion Protocol and does not follow proper etiquette will be quickly blocked by many servers. Crawlers may disguise themselves as browsers by sending a browser's identifying string in the User-Agent header. This way a server administrator may not immediately detect lack of compliance with the Exclusion Protocol, but an aggressive request profile is likely to reveal the true nature of the crawler. To avoid detection, some mischievous crawlers send requests at low and randomized rates. While such behaviors may be reprehensible, they are not illegal – at least not at the time of this writing. Nonetheless, there have been cases of businesses bringing lawsuits against search organizations for not complying with the Robot Exclusion Protocol. In a recent lawsuit involving the Internet Archive's WayBack Machine (www.archive.org), a plaintiff not only attributed legal weight to the Exclusion Protocol, but also expected that a newly added robots.txt policy should have retroactive value! Deception does not occur only by crawlers against servers. Some servers also attempt to deceive crawlers. For example, Web administrators may at- tempt to improve the ranking of their pages in a search engine by provid- ing different content depending on whether a request originates from a browser or a search engine crawler, as determined by inspecting the re- quest's User-Agent header. This technique, called cloaking, is frowned 8.7 Crawler Ethics and Conflicts 317 upon by search engines, which remove sites from their indices when such abuses are detected. For more information about Web spam, see Chap. 6. One of the most serious challenges for crawlers originates from the ris- ing popularity of pay-per-click advertising. If a crawler is not to follow ad- vertising links, it needs to have a robust detection algorithm to discriminate ads from other links. A bad crawler may also pretend to be a genuine user who clicks on the advertising links in order to collect more money from merchants for the hosts of advertising links. The above examples suggest a view of the Web as a new playground for artificial intelligence (AI). Crawlers need to become increasingly sophisti- cated to prevent insidious forms of spam from polluting and exploiting the Web environment. Malicious crawlers are also becoming smarter in their efforts, not only to spam but also to steal personal information and in gen- eral to deceive people and crawlers for illicit gains. One chapter of this arms race has been the development of CAPTCHAs [14], graphics-based inverse Turing tests automatically generated by server sites to keep out malicious crawlers. Maybe a stronger AI will be a positive outcome of crawler evolution; maybe a less usable Web will be a hefty price to pay. Interestingly, the gap between humans and crawlers may be narrowing from both sides. While crawlers become smarter, some humans are dumbing down their content to make it more accessible to crawlers. For example some online news providers use simpler titles than can be easily classified and interpreted by a crawler as opposed or in addition to witty ti- tles that can only be understood by humans. Another gap that is getting narrower is the distinction between browsers and crawlers, with a growing gray area between the two. A business may wish to disallow crawlers from its site if it provides a service by which it wants to entice human users to visit the site, say to make a profit via ads on the site. A competitor crawling the information and mirroring it on its own site, with different ads, is a clear violator not only of the Robot Exclusion Protocol but also possibly of copyright law. What about an individual user who wants to access the information but automatically hide the ads? There are many browser extensions that allow users to perform all kinds of tasks that deviate from the classic browsing activity, including hiding ads, alter- ing the appearance and content of pages, adding and deleting links, adding functionality to pages, pre-fetching pages, and so on. Such extensions have some of the functionalities of crawlers. Should they identify themselves through the User-Agent header as distinct from the browser with which they are integrated? Should a server be allowed to exclude them? And should they comply with such exclusion policies? These too are questions about ethical crawler behaviors that remain open for the moment. 318 8 Web Crawling 8.8 Some New Developments The typical use of (universal) crawlers thus far has been for creating and maintaining indexes for general purpose search engines. However a more diverse use of (topical) crawlers is emerging both for client and server based applications. Topical crawlers are becoming important tools to sup- port applications such as specialized Web portals (a.k.a. “vertical” search engines), live crawling, and competitive intelligence. Another characteristic of the way in which crawlers have been used by search engines up to now is the one-directional relationship between users, search engines, and crawlers. Users are consumers of information provided by search engines, search engines are consumers of information provided by crawlers, and crawlers are consumers of information provided by users (authors). This one-directional loop does not allow, for example, informa- tion to flow from a search engine (say, the queries submitted by users) to a crawler. It is likely that commercial search engines will soon leverage the huge amounts of data collected from their users to focus their crawlers on the topics most important to the searching public. To investigate this idea in the context of a vertical search engine, a system was built in which the crawler and the search engine engage in a symbiotic relationship [430]. The crawler feeds the search engine which in turn helps the crawler. It was found that such a symbiosis can help the system learn about a community's interests and serve such a community with better focus. As discussed in Sect. 8.3, universal crawlers have to somehow focus on the most “important” pages given the impossibility to cover the entire Web and keep a fresh index of it. This has led to the use of global prestige measures such as PageRank to bias universal crawlers, either explicitly [102, 234] or implicitly through the long-tailed structure of the Web graph [401]. An important problem with these approaches is that the focus is dic- tated by popularity among “average” users and disregards the heterogene- ity of user interests. A page about a mathematical theorem may appear quite uninteresting to the average user, if one compares it to a page about a pop star using indegree or PageRank as a popularity measure. Yet the math page may be highly relevant and important to a small community of users (mathematicians). Future crawlers will have to learn to discriminate be- tween low-quality pages and high-quality pages that are relevant to very small communities. Social networks have recently received much attention among Web us- ers as vehicles to capture commonalities of interests and to share relevant information. We are witnessing an explosion of social and collaborative engines in which user recommendations, opinions, and annotations are ag- gregated and shared. Mechanisms include tagging (e.g., del.icio.us and 8.8 Some New Developments 319 flickr.com), ratings (e.g., stumbleupon.com), voting (e.g., digg.com), and hierarchical similarity (GiveALink.org) [363]. One key advantage of social systems is that they empower humans rather than depending on crawlers to discover relevant resources. Further, the aggregation of user recommenda- tions gives rise to a natural notion of trust. Crawlers could be designed to expand the utility of information collected through social systems. For ex- ample it would be straightforward to obtain seed URLs relevant to specific communities of all sizes. Crawlers would then explore the Web for other resources in the neighborhood of these seed pages, exploiting topical local- ity to locate and index other pages relevant to those communities. Social networks can emerge not only by mining a central repository of user-provided resources, but also by connecting hosts associated with indi- vidual users or communities scattered across the Internet. Imagine a user creating its own micro-search engine by employing a personalized topical crawler, seeded for example with a set of bookmarked pages. Desktop search applications make it easy to also share certain local files, if so de- sired. Can federations of such micro-engine agents emerge on the basis of mutual interests? Peer-to-peer (P2P) networks are beginning to be seen as robust architectures ideal for brokering among individual needs and cater- ing to communities [354]. Adaptive peer-based search systems driven by simple distributed adap- tive query routing algorithms can spontaneously organize into networks with efficient communication and with emerging clusters capturing seman- tic locality. Specifically, in a P2P search application called 6Search (6S), each peer crawls the Web in a focused way, guided by its user’s informa- tion context. Each peer submits and responds to queries to/from its neighbors. This search process has no centralized control. Peers depend on local adaptive routing algorithms to dynamically change the topology of the peer network and search for the best neighbors to answer their queries. Machine learning techniques are being explored to improve local adaptive routing. Validation of the 6S framework and network via simulations with 70−500 model users based on actual Web crawls has yielded encouraging preliminary results. The network topology rapidly converges from a ran- dom network to a small-world network, with clusters emerging to match user communities with shared interests [15]. Additionally the quality of the results is significantly better than obtained by centralized search engines built with equivalent resources, and comparable with the results from much larger search engines such as Google [553, 554]. The integration of effective personalized/topical crawlers with adaptive query routing algorithms is the key to the success of peer-based social search systems. Many synergies may be exploited in this integration by leveraging contextual information about the local peer that is readily avail- 320 8 Web Crawling able to the crawler, as well as information about the peer's neighbors that can be mined through the stream of queries and results routed through the local peer. An open-source prototype of 6S enabling sharing of bookmarks, one-click crawling, and distributed collaborative search is available (http://homer.informatics.indiana.edu/~nan/6S/). If successful, this kind of application could create a new paradigm for crawling and searching where universal crawlers and search engines are complemented with swarms of personal crawlers and micro-engines tuned to the specialized information needs of individual users and dynamic self-organized social networks. Bibliographic Notes General ideas and techniques about crawling can be found in [68, 85, 263], little is known about implementation details of commercial crawlers. Fo- cused crawling discussed in this chapter is based on [87, 85, 139]. Litera- ture on topical crawling algorithms is extensive [e.g., 3, 60, 102, 126, 237, 366, 369, 375, 377, 432, 434, 459]. Topical crawlers have been used for building focused repositories, automating resource discovery, and support- ing software agents. For example, topical crawlers are used to collect pa- pers for building scientific literature digital libraries such as CiteSeer and Google Scholar [308, 366, 550]. Applications of topical crawlers to busi- ness and competitive intelligence are discussed in [432], and biomedical applications in [503]. Controversial applications to harvest personal infor- mation for spam and phishing purposes are illustrated in [251]. On best-first crawlers, various methods have been used to determine an appropriate textual context in which to evaluate and score unvisited links. Using the anchor text is one strategy [123]. Another strategy is to use win- dows of a fixed size, e.g., 50 words around the anchor, in place of/in addi- tion to the anchor text [237]. The weighted window used by InfoSpiders [375] yields a weight for each link, which is then fed to a neural network to score each link. In the tag (DOM) tree approach [85], using the parent node of the anchor as aggregation node worked well in a business intelli- gence crawling task [432]. There is a tradeoff analogous to that between precision and recall when we consider the optimal size of a link context: small contexts (e.g., anchor text) have the highest average similarities to the target page, but also highest chance to miss important cues about the target. Larger contexts (e.g., parent or grand-parent aggregator node) have lower average similarities to the target, but lower chance to miss all the keywords in the target. This suggests a greedy optimization scheme: climb the DOM tree from the anchor until sufficient terms are present in the link context [429]. This approach outperformed both the fixed-window method Bibliographic Notes 321 (with optimal window size) and the DOM tree method with a fixed aggre- gator depth (anchor, parent, or grandparent). Early versions of InfoSpiders were described in [369, 374, 375, 376]. Certain aspects of evolutionary computation have also been used in other topical crawlers such as the itsy bitsy spider [96]. Another adaptive mecha- nism for topical crawlers inspired by natural processes is ant colony opti- mization [194]. The idea is that a population of agents leaves a trail of pheromone along the paths that lead to relevant pages, gradually biasing the crawl toward promising portions of the Web graph. A more extensive review of adaptive topical crawling algorithms can be found in [380].





下载需要 6 金币 [金币充值 ]
亲,您也可以通过 分享原创pdf 来获得金币奖励!