
The Internet booms...

``We had the sky up there, all speckled with stars, and we used to lay on our backs and look up at them, and discuss about whether they was made, or only just happened.'' - Huckleberry Finn

The Internet is booming, both in the number of hosts and in the number of users. In particular, the World Wide Web (WWW) is still growing exponentially [1,2]. However, this growth is difficult to measure, especially if a statement is to be made about the quality of the available information rather than its mere quantity. A host count in July 1998 estimated some 36,739,000 hosts worldwide, of which 6,529,000 replied to a ping [3]. The estimate of a total of 800 million Webpages is also well known. However, because of dynamic Webpages and mirror Websites, such totals are less relevant than the quality and currency of the pages that are actually available to crawl.

The challenge of creating a large-scale archive of the WWW that conserves the history of its change is similar in magnitude to that of the top-secret NSA project Echelon [4], or of search engines like Inktomi [5] and Google [6,2], with the additional complication of a time dimension. Technically, all of the above-mentioned examples run on a network of workstations [7,8]. Today's high-end Internet search engines hold databases of about 150 million indexed Webpages and crawl more than 10 million Webpages per day [9], which are stored in the database. This crawling speed will most likely have to be increased in the future, because the average lifetime of a Webpage is only 44 days [10] and the exponential growth of the total storage size of all Webpages will be sustained for quite some time. Most search engines store the page and perform some simple page-relevance ranking; further postprocessing of the search information, however, differs considerably between them. According to the authors' experience with scanning Webpages with robots [11,1], the average Webpage is about 300 words long. So, at 10 million pages per day, a typical crawling system is left with the processing of about $3\cdot 10^9$ words per day, and it will also serve several hundred thousand searches a day, which requires large-scale computing power and fast access to massive storage systems.
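The workload figures quoted above can be checked with a short back-of-envelope calculation. The following Python sketch uses only the numbers given in the text and reads the $3\cdot 10^9$ figure as the daily crawl rate multiplied by the average page length. The function names are hypothetical, and the "full pass" estimate is an illustrative assumption (a constant crawl rate and a fixed index size), not a measurement from the cited sources.

```python
# Back-of-envelope crawl workload, using only the figures quoted in the text.
PAGES_INDEXED = 150_000_000         # pages in a high-end search-engine database
PAGES_CRAWLED_PER_DAY = 10_000_000  # lower bound on the daily crawl rate
WORDS_PER_PAGE = 300                # average page length reported by the authors
PAGE_LIFETIME_DAYS = 44             # average lifetime of a Webpage


def words_per_day(pages_per_day: int, words_per_page: int) -> int:
    """Words a crawler has to process each day."""
    return pages_per_day * words_per_page


def full_pass_days(pages_indexed: int, pages_per_day: int) -> float:
    """Days needed to revisit every indexed page once.

    Assumption: constant crawl rate and no growth of the index.
    """
    return pages_indexed / pages_per_day


daily_words = words_per_day(PAGES_CRAWLED_PER_DAY, WORDS_PER_PAGE)
pass_days = full_pass_days(PAGES_INDEXED, PAGES_CRAWLED_PER_DAY)

print(f"words per day: {daily_words:.1e}")
print(f"days for one full pass: {pass_days:.0f}")
print(f"share of a page lifetime: {pass_days / PAGE_LIFETIME_DAYS:.2f}")
```

Under these assumptions one full pass over the index takes about a third of the average page lifetime, which gives a rough sense of why the crawl rate has to keep pace with both page turnover and index growth.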


A. S. A. Roehrl
2/14/2000