next up previous
Next: ... Web Archives and Up: The WEB archives: A Previous: ... and supercomputers and

... and statistical methods help chasing Webpages

 
``The moment of enlightenment is when a person's dreams of possibilities become images of probabilities.'' - Vic Braden

Some recent Web robots, such as the Web robot ``Ellen", that is described in the previous section, use modern statistical methods to avoid the usual brute-force approach. This section illustrates some of the statistical techniques used by the authors' robot, using a simple example. For more details, please see [31,32,33,34].

Again, the starting point is to archive data by themes of interest through all the periods of time, languages and so forth. Consider the problem of a student in Cyberworld who is interested in in French literature, in particular in the great French ``Poètes Maudits" (the Damned Poets) of the 19th century.

Two well-known statistical techniques are used to deal with these problems. The main motivation is to offer solutions as simple as possible, so that complexity can be kept low, and a balance is found in the trade-off between cost (computational, timing) and variance (measure of variability).

Firstly, to cope with the size problem, we use a general statistical method based on capture-recapture techniques. These methods are currently used to get information about a population that is difficult to control. With constant improvements in technology, capture-recapture methods have become efficient tools to reduce costs and to save time. Since the 1940s the capture-recapture methodology has been widely developed to estimate the size of wildlife population and related parameters such as survival and immigration rates [35]. Recent applications are for example in software correctness, where capture-recapture samples represent errors detected by different viewers in parallel inspection (see for example [36] where an estimation of the number of undetected errors for a particular feature of the AT&T telephone switch) is given.

In the context of an Internet archive, the size estimation of a dataset on a specific theme is of interest and the use of capture-recapture techniques is straightforward. The simplest capture-recapture experiment to estimate the unknown size N of that Cyberpopulation, consists of identifying a sample (sample 1) of n1 pages in Internet, and then to extract a second sample of n2 pages (sample 2). Let m2 be the number of pages already identified in sample 1. Assuming that \( 
\frac{m_{2}}{n_{2}} \approx \frac{n_{1}}{N} \) we get the estimate

\begin{displaymath}
\hat{N}=\frac{n_{1}n_{2}}{m_{2}}.\end{displaymath}

When \( (n_{1}+n_{2}\geq N) \), the modified estimate

\begin{displaymath}
\hat{N}=\frac{(n_{1}+1)(n_{2}+1)}{(m_{2}+1)}-1\end{displaymath}

gives a more accurate estimate. This simple experiment illustrates the well-known Peterson method. What is really done by the robot is more complicated, and relies also on Bayesian statistics, see [32,33]. The methods can be applied directly to our problem, and to cope with other relevant questions, such as ``What is the probability that a certain group of pages are modified during a specified period of time?" or ``What is the probability that a Web page survives during a specified period of time?". Compared to traditional statistical methods, the discussed methods here are very computing-intensive, because high-dimensional integration problems have to be solved. However, the experiment shows that the gain on not having to download and analysis all entries is much larger.

The second problem on finding all the distinct famous French poets of the 19th century can be reduced to a parallel version of the coupon collector problem. Suppose there are n different coupons. At each trial, one is chosen at random. These random choices are mutually independent. Let m be the number of trials. The goal is to study the relationship between m and ``having collected at least one of each type of coupon" [37]. This simple setting can be mapped to our problem, by imaging every coupon as one poet, and each trial checks whether the current link contains a new poet not yet in the list. A way to improve the rapidity of the procedure is to select from one page the next link directly from that page. For example, imagine a robot identifying the poet Arthur Rimbaud. The robot records that new poet and is then likely to visit a page related to Paul Verlaine[*]. This approach is based on the random surfer model, as used in Google's pageranking [15]. Unless you already have an ``archive'' to query, the above analysis becomes a classifier, targeted random walk on the Web. Each page is a node and every link a directed node. This can be treated as a random walk on an expander, and has very rapidly-converging and easy to update properties, see [34] and [31]. Random walks on the web [34] require convergence of a Markov chain, which might not have yet reached stationarity and might be biased. The fundamental limitation involved in all these approaches is that, it is difficult to know whether you have already sampled thoroughly enough for your specific purpose. Since you do not know what might be of interest in 50 or 100 years, you might end up sampling too little, as that one student Webpage you did not scan regularly, might be the Webpage of the future president.


next up previous
Next: ... Web Archives and Up: The WEB archives: A Previous: ... and supercomputers and
A. S. A. Roehrl
2/14/2000