``The moment of enlightenment is when a person's dreams of possibilities become images of probabilities.'' - Vic Braden
Some recent Web robots, such as the Web robot ``Ellen", that is described in the previous section, use modern statistical methods to avoid the usual brute-force approach. This section illustrates some of the statistical techniques used by the authors' robot, using a simple example. For more details, please see [31,32,33,34].
Again, the starting point is to archive data by themes of interest through all the periods of time, languages and so forth. Consider the problem of a student in Cyberworld who is interested in in French literature, in particular in the great French ``Poètes Maudits" (the Damned Poets) of the 19th century.
Firstly, to cope with the size problem, we use a general statistical method based on capture-recapture techniques. These methods are currently used to get information about a population that is difficult to control. With constant improvements in technology, capture-recapture methods have become efficient tools to reduce costs and to save time. Since the 1940s the capture-recapture methodology has been widely developed to estimate the size of wildlife population and related parameters such as survival and immigration rates [35]. Recent applications are for example in software correctness, where capture-recapture samples represent errors detected by different viewers in parallel inspection (see for example [36] where an estimation of the number of undetected errors for a particular feature of the AT&T telephone switch) is given.
In the context of an Internet archive, the size estimation of a dataset on a
specific
theme is of interest and the use of capture-recapture techniques is
straightforward.
The simplest capture-recapture experiment to estimate the unknown size N
of that Cyberpopulation, consists of identifying a sample (sample 1) of n1
pages in Internet, and then to extract a second sample of n2 pages
(sample 2). Let m2 be
the number of pages already identified in sample 1. Assuming that
we get the estimate
![]()
![]()
The second problem on finding all the distinct famous French
poets of the 19th century can be reduced to a parallel version of the
coupon
collector problem. Suppose there are n different coupons. At each trial,
one is chosen
at random. These random choices are mutually independent. Let m be the
number of trials. The goal is to study the relationship between m and ``having
collected at least one of each type of coupon" [37]. This simple
setting can be mapped to our problem, by imaging every coupon as one poet,
and each trial checks whether
the current link contains a new poet not yet in the list. A way to improve the
rapidity of the procedure is to select from one page the next link directly from
that page. For example, imagine a robot identifying
the poet Arthur Rimbaud. The robot records that new poet and is then likely to visit a page related to Paul
Verlaine
.
This approach is based on the random surfer model, as used in Google's
pageranking [15]. Unless you already have an ``archive'' to query,
the above analysis becomes a classifier,
targeted random walk on the Web. Each page is a node and every link a directed
node. This can be treated as a random walk on an expander, and has very
rapidly-converging and easy to update properties, see
[34] and [31].
Random walks on the web [34] require convergence
of a Markov chain, which might not have yet reached stationarity and might be
biased.
The fundamental limitation involved in all these approaches is that,
it is difficult to know whether you have already sampled thoroughly enough for
your specific purpose. Since you do not know what might be of interest in 50 or
100
years, you might end up sampling too little, as that one student Webpage you did
not
scan regularly, might be the Webpage of the future president.