To offer the best possible results, search engines must attempt to discover all the public pages on the World Wide Web and then present the ones that best match the user’s search query. The first step in this process is crawling the Web. The search engines start with a seed set of sites known to be of very high quality, and then visit the links on each page of those sites to discover other web pages.
The link structure of the Web serves to bind together all of the pages that have been made public as a result of someone linking to them. Through links, search engines’ automated robots, called crawlers or spiders, can reach the many billions of interconnected documents.
Figure 2-10 shows the home page of http://www.usa.gov, the official US government website, with the links on the page outlined. Crawling this page would start with loading the page, analyzing its content, and then seeing what other pages it links to.
Figure 2-10. Crawling the US government website
The search engine will then load those other pages and analyze their content as well. This process repeats over and over again until crawling is complete. It is an enormously complex undertaking, as the Web is a vast and densely interconnected place.
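The repeated load-analyze-follow cycle described above is, at its core, a breadth-first traversal of the link graph. The following sketch illustrates that idea using a toy in-memory "web" (a hypothetical dictionary standing in for real HTTP fetching and HTML link extraction); it is a simplified illustration, not how any actual search engine is implemented.

```python
from collections import deque

# Toy in-memory "web": page URL -> list of outgoing links.
# (Hypothetical stand-in for fetching pages and parsing out their links.)
TOY_WEB = {
    "seed.example/": ["seed.example/a", "other.example/"],
    "seed.example/a": ["seed.example/", "other.example/b"],
    "other.example/": ["other.example/b"],
    "other.example/b": [],
}

def crawl(seeds):
    """Breadth-first crawl: load each page, record it, then queue its links."""
    seen = set(seeds)       # pages already discovered (avoid re-crawling)
    frontier = deque(seeds) # pages waiting to be loaded
    order = []              # order in which pages were crawled
    while frontier:
        url = frontier.popleft()
        order.append(url)   # "load and analyze" the page
        for link in TOY_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl(["seed.example/"]))
```

Starting from the single seed page, the crawl discovers every reachable page exactly once; pages with no inbound links from the seed set would never be found, which is why links are what make pages discoverable.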
Search engines do not attempt to crawl the entire Web every day. In fact, they may become aware of pages that they choose not to crawl because those pages are not likely to be important enough to return in a search result. We will discuss the role of importance in the next section, “Retrieval and Rankings”.
Once the engines have retrieved a page during a crawl, their next job is to parse the code from that page and store selected pieces of it in massive arrays of hard drives, to be recalled when needed for a query. The first step in this process is to build a dictionary of terms: a massive database that catalogs all the significant terms on each page crawled by a search engine. A lot of other data is also recorded, such as a map of all the pages that each page links to, the anchor text of those links, whether or not those links are considered ads, and more. To accomplish the monumental task of holding data on hundreds of billions (or trillions) of pages that can be accessed in a fraction of a second, the search engines have constructed massive data centers.
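The dictionary of terms described above is commonly implemented as an inverted index: a mapping from each term to the set of pages that contain it. The sketch below builds one from a couple of hypothetical crawled pages (the URLs and text are invented for illustration); real engines record far more, such as term positions and link data.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> extracted page text.
PAGES = {
    "usa.gov/": "official government services and information",
    "usa.gov/jobs": "government jobs and employment services",
}

def build_index(pages):
    """Map each term to the set of pages containing it (an inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.split():   # naive tokenization for illustration
            index[term].add(url)
    return index

index = build_index(PAGES)
print(sorted(index["government"]))  # every page containing "government"
```

Looking up a query term is then a single dictionary access rather than a scan of every stored page, which is what makes sub-second retrieval over billions of documents feasible.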
One key concept in building a search engine is deciding where to begin a crawl of the Web. Although you could theoretically start from many different places on the Web, you would ideally begin your crawl with a trusted seed set of websites.
Starting with a known trusted set of websites enables search engines to measure how much they trust the other websites that they find through the crawling process. We will discuss the role of trust in search algorithms in more detail in “How Links Influence Search Engine Rankings”.