Chris Ridings on PageRank

In 1998, search engines added link analysis to their bag of tricks. As a result, content spam and cloaking alone could no longer fool the link analysis engines and garner spammers unjustifiably high rankings. Spammers and SEOs adapted by learning how link analysis works. The SEO community has always been active – its members, then and now, hold conferences, write papers and books, host weblogs, and sell their secrets. The most famous and informative SEO papers were written by Chris Ridings: PageRank Explained: Everything you’ve always wanted to know about PageRank and PageRank Uncovered. These papers offer practical strategies for hoarding PageRank and avoiding such undesirable things as PageRank leak. Search engines consider unethical SEOs to be adversaries, but some web analysts call them an essential part of the web food chain because they drive innovation and research and development.

Langville & Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings, p. 44.

The Use of the Linear Algebra by Web Search Engines

Amy N. Langville and Carl D. Meyer Department of Mathematics, North Carolina State University Raleigh, NC 27695-8205
October 19, 2004

1 Introduction

Nearly all the major Web search engines today use link analysis to improve their search results. That’s exciting news for linear algebraists because link analysis, the use of the Web’s hyperlink structure, is built from fundamentals of matrix theory. Link analysis and its underlying linear algebra have helped revolutionize Web search, so much so that the pre-link analysis search (before 1998) pales in comparison to today’s remarkably accurate search.

HITS [13] and PageRank [2, 3] are two of the most popular link analysis algorithms. Both were developed around 1998 and both have dramatically improved the search business. In order to appreciate the impact of link analysis, recall for a minute the state of search prior to 1998. Because of the immense number of pages on the Web, a query to an engine often produced a very long list of relevant pages, sometimes thousands of pages long. A user had to sort carefully through the list to find the most relevant pages. The order of presentation of the pages was little help because spamming was so easy then. In order to trick a search engine into producing rankings higher than normal, spammers used meta-tags liberally, claiming their page used popular search terms that never appeared in the page. Meta-tags became useless for search engines. Spammers also repeated popular search terms in invisible text (white text on a white background) to fool engines.

2 The HITS Algorithm

HITS [13], a link analysis algorithm developed by Jon Kleinberg from Cornell University during his postdoctoral studies at IBM Almaden, aimed to focus this long, unruly query list. The HITS algorithm is based on a pattern Kleinberg noticed among Web pages. Some pages serve as hubs or portal pages, i.e., pages with many outlinks. Other pages are authorities on topics because they have many inlinks. Kleinberg noticed that good hubs seemed to point to good authorities and good authorities were pointed to by good hubs. So he decided to give each page i both a hub score h_i and an authority score a_i. In fact, for every page i he defined the hub score at iteration k, h_i^(k), and the authority score, a_i^(k), as

h_i^(k) = Σ_{j: e_ij ∈ E} a_j^(k)    and    a_i^(k) = Σ_{j: e_ji ∈ E} h_j^(k−1),

where e_ij represents a hyperlink from page i to page j and E is the set of hyperlinks. To compute the scores for a page, he started with uniform scores for all pages, i.e., h_i^(1) = 1/n and a_i^(1) = 1/n, where n is the number of pages in a so-called neighborhood set for the query list. The neighborhood set consists of

all pages in the query list plus all pages pointing to or from the query pages. Depending on the query, the neighborhood set could contain just a hundred pages or a hundred thousand pages. (The neighborhood set allows latent semantic associations to be made.) The hub and authority scores are iteratively refined until convergence to stationary values.

Using linear algebra we can replace the summation equations with matrix equations. Let h and a be column vectors holding the hub and authority scores. Let L be the adjacency matrix for the neighborhood set. That is, L_ij = 1 if page i links to page j, and 0 otherwise. These definitions show that

a^(k) = L^T h^(k−1)    and    h^(k) = L a^(k).

Using some algebra, we have

a^(k) = L^T L a^(k−1)    and    h^(k) = L L^T h^(k−1).

These equations make it clear that Kleinberg’s algorithm is really the power method applied to the positive semi-definite matrices L^T L and L L^T. L^T L is called the authority matrix and L L^T the hub matrix. Thus, HITS amounts to solving the eigenvector problems L^T L a = λ_1 a and L L^T h = λ_1 h, where λ_1 is the largest eigenvalue of L^T L (and of L L^T), and a and h are corresponding eigenvectors.
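
As a small numerical illustration (not from the paper; the 4-page link pattern below is made up for the example), the following R sketch runs the HITS iteration on a tiny adjacency matrix; the resulting scores are proportional to the dominant eigenvectors of the authority matrix L^T L and the hub matrix L L^T:

# HITS power iteration on a hypothetical 4-page neighborhood graph
L = matrix(c(0, 1, 1, 0,
             0, 0, 1, 0,
             0, 0, 0, 1,
             1, 0, 0, 0), nrow = 4, byrow = TRUE)   # L[i,j] = 1 if page i links to page j
n = nrow(L)
h = rep(1/n, n)
a = rep(1/n, n)
for (k in 1:100) {
  a = t(L) %*% h            # a^(k) = L^T h^(k-1)
  h = L %*% a               # h^(k) = L a^(k)
  a = a / sum(a)            # normalize so the scores stay bounded
  h = h / sum(h)
}
cbind(authority = round(a, 4), hub = round(h, 4))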

While this is the basic linear algebra required by the HITS method, there are many more issues to be considered. For example, important issues include convergence, existence, uniqueness, and numerical computation of these scores [5, 7, 14]. Several modifications to HITS have been suggested, each bringing various advantages and disadvantages [4, 6, 8]. A variation of Kleinberg’s HITS concept is at the base of the search engine TEOMA (http://www.teoma.com), which is owned by Ask Jeeves, Inc.

3 The PageRank Algorithm

PageRank, the second link analysis algorithm from 1998, is the heart of Google. Both PageRank and Google were conceived by Sergey Brin and Larry Page while they were computer science graduate students at Stanford University. Brin and Page use a recursive scheme similar to Kleinberg’s. Their original idea was that a page is important if it is pointed to by other important pages. That is, they decided that the importance of your page (its PageRank score) is determined by summing the PageRanks of all pages that point to yours. In building a mathematical definition of PageRank, Brin and Page also reasoned that when an important page points to several places, its weight (PageRank) should be distributed proportionately. In other words, if YAHOO! points to your Web page, that’s good, but you shouldn’t receive the full weight of YAHOO! because they point to many other places. If YAHOO! points to 999 pages in addition to yours, then you should only get credit for 1/1000 of YAHOO!’s PageRank.

This reasoning led Brin and Page to formulate a recursive definition of PageRank. They defined

r_i^(k+1) = Σ_{j ∈ I_i} r_j^(k) / |O_j|,

where r_i^(k) is the PageRank of page i at iteration k, I_i is the set of pages pointing into page i, and |O_j| is the number of outlinks from page j. Like HITS, PageRank starts with a uniform rank for all pages, i.e., r_i^(0) = 1/n, and successively refines these scores, where n is the total number of Web pages.

Like HITS, we can write this process using matrix notation. Let the row vector π^(k)T be the PageRank vector at the kth iteration. As a result, the summation equation for PageRank can be written compactly

as

π^(k+1)T = π^(k)T H,

where H is a row-normalized hyperlink matrix, i.e., h_ij = 1/|O_i| if there is a link from page i to page j, and 0 otherwise. Unfortunately, this iterative procedure has convergence problems: it can cycle, or the limit may depend on the starting vector.
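
To see the convergence problem concretely, here is a small R illustration (ours, not from the paper): with two pages that link only to each other, H is a permutation matrix and the iterates simply swap back and forth forever.

# Two pages that link only to each other: the raw iteration cycles with period 2
H = matrix(c(0, 1,
             1, 0), nrow = 2, byrow = TRUE)
pi = c(1, 0)                 # a non-uniform starting vector
for (k in 1:4) {
  pi = pi %*% H              # pi^(k+1)T = pi^(k)T H
  print(pi)                  # alternates between (0 1) and (1 0); no limit exists
}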

To fix these problems, Brin and Page revised their basic PageRank concept. Still using the hyperlink structure of the Web, they build an irreducible aperiodic Markov chain characterized by a primitive (irreducible with only one eigenvalue on the spectral circle) transition probability matrix. The irreducibility guarantees the existence of a unique stationary distribution vector π^T, which becomes the PageRank vector. The power method with a primitive stochastic iteration matrix will always converge to π^T independent of the starting vector, and the asymptotic rate of convergence is governed by the magnitude of the subdominant eigenvalue λ_2 of the transition matrix [19].

Here’s how Google turns the hyperlink structure of the Web into a primitive stochastic matrix. If there are n pages in the Web, let H be the n × n matrix whose element h_ij is the probability of moving from page i to page j in one click of the mouse. The simplest model is to take h_ij = 1/|O_i|, which means that starting from any Web page we assume that it is equally likely to follow any of the outgoing links to arrive at another page.

However, some rows of H may contain all zeros, so H is not necessarily stochastic. This occurs whenever a page contains no outlinks; many such pages exist on the Web and are called dangling nodes. An easy fix is to replace all zero rows with e^T/n, where e^T is the row vector of all ones. The revised (now stochastic) matrix S can be written as a rank-one update to the sparse H. Let a be the dangling node vector in which a_i = 1 if page i is a dangling node and a_i = 0 otherwise. Then,

S = H + a e^T/n.

Actually, any probability vector p^T > 0 with p^T e = 1 can be used in place of the uniform vector e^T/n.

We’re not home yet because the adjustment that produces the stochastic matrix S isn’t enough to insure the existence of a unique stationary distribution vector (needed to make PageRank well defined). Irreducibility on top of stochasticity is required. But the link structure of the Web is reducible—the Web graph is not strongly connected. Consequently, an adjustment to make S irreducible is needed. This last adjustment brings us to the Google matrix, which is defined to be

G = αS + (1 − α)E,

where 0 < α < 1 and E = e e^T/n. Google eventually replaced the uniform vector e^T/n with a more general probability vector v^T (so that E = e v^T) to allow them the flexibility to make adjustments to PageRanks as well as to personalize them. See [10, 15] for more about the personalization vector v^T.

Because G is a convex combination of the two stochastic matrices S and E, it follows that G is both stochastic and irreducible. Furthermore, every node is now directly connected to every other node (although the probability of transition may be very small in some cases), so G > 0. Consequently, G is a primitive matrix, and this insures that the power method π^(k+1)T = π^(k)T G will converge, independent of the starting vector, to a unique stationary distribution π^T [19]. This is the mathematical part of Google’s PageRank vector.

3.1 The Power Method

While it doesn’t always excite numerical analysts, the power method has been Google’s computational method of choice, and there are some good reasons for this. First, consider iterates of the power method applied to G (a completely dense matrix, were it to be formed explicitly). If we take E = evT, then

π^(k)T = π^(k−1)T G = απ^(k−1)T S + (1 − α)v^T = απ^(k−1)T H + (απ^(k−1)T a + (1 − α))v^T.

Written in this way, it becomes clear that the power method applied to G can be implemented with vector-matrix multiplications on the extremely sparse H, and G and S are never formed or stored. A matrix-free method such as the power method is required due to the size of the matrices and vectors involved (Google’s index is currently 4.3 billion pages). Fortunately, since H is sparse, each vector-matrix multiplication required by the power method can be computed in nnz(H) flops, where nnz(H) is the number of nonzeros in H. And since the average number of nonzeros per row in H is significantly less than 10, we have O(nnz(H)) ≈ O(n). Furthermore, at each iteration, the power method only requires the storage of one vector, the current iterate, whereas other accelerated matrix-free methods, such as restarted GMRES or BiCGStab, require storage of at least several vectors, depending on the size of the subspace chosen. Finally, the power method applied in this way converges quickly. Brin and Page report success using only 50 to 100 power iterations [3]. This is due in large part to the fact that it can be proven [15] that the subdominant eigenvalue of G satisfies |λ_2| ≤ α, and Google originally set α = 0.85.

Like HITS, the basic concepts of PageRank are simple, but there are many subtle issues that lurk just below the surface. For example, there are complicated and unresolved issues concerning personalization, computation, accelerated computation, sensitivity, and updating—more information is available in [7, 11, 12, 21, 15, 17, 18, 20].

This brief introduction describes only the mathematical component of Google’s ranking system. However, it’s known that there are non-mathematical “metrics” that are also considered when Google responds to a query, so the results seen by a user are in fact PageRank tempered by other metrics. While Google is secretive about these other metrics, they state on their Web site (http://www.google.com/technology) that “The heart of our software is PageRank… .”

4 Books

SIAM is publishing a second edition of the popular Understanding Search Engines: Mathematical Modeling and Text Retrieval [1] by Michael W. Berry and Murray Browne in 2005. The new edition contains a chapter devoted to link analysis. As a result, readers can see how link analysis and ranking algorithms fit into the overall search process.

Also due out in 2005 is our book, Understanding Web Search Engine Rankings: Google’s PageRank, Teoma’s HITS, and other ranking algorithms [16]. This book from Princeton University Press will contain over 250 pages devoted to link analysis algorithms with several introductory chapters, examples, and code, as well as chapters dealing with more advanced issues in Web search ranking.

References

[1] Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, Philadelphia, 1999.

[2] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 33:107-117, 1998.

[3] Sergey Brin, Lawrence Page, R. Motwami, and Terry Winograd. The PageRank citation ranking: bringing order to the web. Technical report, Computer Science Department, Stanford University, 1998.

[4] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. PageRank, HITS and a unified framework for link analysis.

In Proceedings of the 25th ACM SIGIR Conference, pages 353-354, Tampere, Finland, August 2002.

[5] Chris H. Q. Ding, Hongyuan Zha, Xiaofeng He, Parry Husbands, and Horst D. Simon. Link analysis: hubs and authorities on the World Wide Web. SIAM Review, 46(2):256-268, 2004.

[6] Ayman Farahat, Thomas Lofaro, Joel C. Miller, Gregory Rae, F. Schaefer, and Lesley A. Ward. Modifications of Kleinberg’s HITS algorithm using matrix exponentiation and web log records. In ACM SIGIR Conference, pages 444-445, September 2001.

[7] Ayman Farahat, Thomas Lofaro, Joel C. Miller, Gregory Rae, and Lesley A. Ward. Existence and uniqueness of ranking vectors for linear link analysis. In ACM SIGIR Conference, September 2001.

[8] Francois Fouss, Jean-Michel Renders, and Marco Saerens. Some relationships between Kleinberg’s hubs and authorities, correspondence analysis, and the SALSA algorithm. In Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data (JADT 2004), pages 445-455, 2004.

[9] Taher H. Haveliwala and Sepandar D. Kamvar. The second eigenvalue of the Google matrix. Technical report, Stanford University, 2003.

[10] Taher H. Haveliwala, Sepandar D. Kamvar, and Glen Jeh. An analytical comparison of approaches to personalizing PageRank. Technical report, Stanford University, 2003.

[11] Sepandar D. Kamvar and Taher H. Haveliwala. The condition number of the PageRank problem. Technical report, Stanford University, 2003.

[12] Sepandar D. Kamvar, Taher H. Haveliwala, and Gene H. Golub. Adaptive methods for the computation of PageRank. Technical report, Stanford University, 2003.

[13] Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 1999.

[14] Amy N. Langville and Carl D. Meyer. A survey of eigenvector methods of web information retrieval. The SIAM Review, 2003. Accepted in December 2003.

[15] Amy N. Langville and Carl D. Meyer. Deeper inside PageRank. Internet Mathematics Journal, 2004. Accepted in February 2004.

[16] Amy N. Langville and Carl D. Meyer. Understanding Web Search Engine Rankings: Google’s PageRank, Teoma’s HITS, and other ranking algorithms. Princeton University Press, Princeton, 2005.

[17] Chris Pan-Chi Lee, Gene H. Golub, and Stefanos A. Zenios. Partial state space aggregation based on lumpability and its application to PageRank. Technical report, Stanford University, 2003.

[18] Ronny Lempel and Shlomo Moran. Rank-stability and rank-similarity of link-based web ranking algorithms in authority-connected graphs. In Second Workshop on Algorithms and Models for the Web-Graph (WAW 2003), Budapest, Hungary, May 2003.

[19] Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, Philadelphia, 2000.

[20] Andrew Y. Ng, Alice X. Zheng, and Michael I. Jordan. Link analysis, eigenvectors and stability. In Seventh International Joint Conference on Artificial Intelligence, 2001.

[21] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. Extrapolation methods for accelerating PageRank computations. In Twelfth International World Wide Web Conference, 2003.


PageRank Computation



The PageRank computation

How do we compute PageRank values? Recall the definition of a left eigenvector from Equation 214; the left eigenvectors of the transition probability matrix $P$ are $N$-vectors $\vec{\pi}$ such that

\begin{displaymath}
\vec{\pi} P = \lambda \vec{\pi}.
\end{displaymath} (255)

 

The $N$ entries in the principal eigenvector $\vec{\pi}$ are the steady-state probabilities of the random walk with teleporting, and thus the PageRank values for the corresponding web pages. We may interpret Equation 255 as follows: if $\vec{\pi}$ is the probability distribution of the surfer across the web pages, he remains in the steady-state distribution $\vec{\pi}$. Given that $\vec{\pi}$ is the steady-state distribution, we have that $\pi P = 1\pi$, so 1 is an eigenvalue of P. Thus if we were to compute the principal left eigenvector of the matrix $P$ – the one with eigenvalue 1 — we would have computed the PageRank values.

There are many algorithms available for computing left eigenvectors; the references at the end of Chapter 18 and the present chapter are a guide to these. We give here a rather elementary method, sometimes known as power iteration. If $\vec{x}$ is the initial distribution over the states, then the distribution at time $t$ is $\vec{x}P^t$. As $t$ grows large, we would expect that the distribution $\vec{x}P^t$ is very similar to the distribution $\vec{x}P^{t+1}$, since for large $t$ we would expect the Markov chain to attain its steady state. By Theorem 21.2.1 this is independent of the initial distribution $\vec{x}$. The power iteration method simulates the surfer’s walk: begin at a state and run the walk for a large number of steps $t$, keeping track of the visit frequencies for each of the states. After a large number of steps $t$, these frequencies “settle down” so that the variation in the computed frequencies is below some predetermined threshold. We declare these tabulated frequencies to be the PageRank values.

We consider the web graph in Exercise 21.2.3 with $\alpha=0.5$. The transition probability matrix of the surfer’s walk with teleportation is then

 

\begin{displaymath}
P=\left(
\begin{array}{ccc}
1/6 & 2/3 & 1/6 \\
5/12 & 1/6 & 5/12 \\
1/6 & 2/3 & 1/6 \\
\end{array} \right).
\end{displaymath} (256)

 

Imagine that the surfer starts in state 1, corresponding to the initial probability distribution vector $\vec{x_0}=(1\ 0\ 0)$. Then, after one step the distribution is

\begin{displaymath}
\vec{x_0}P=\left(
\begin{array}{ccc}
1/6 & 2/3 & 1/6 \\
\end{array} \right)=\vec{x_1}.
\end{displaymath} (257)

 

After two steps it is

\begin{displaymath}
\vec{x_1}P=\left(
\begin{array}{ccc}
1/6 & 2/3 & 1/6 \\
\end{array} \right) P = \left(
\begin{array}{ccc}
1/3 & 1/3 & 1/3 \\
\end{array}\right) = \vec{x_2}.
\end{displaymath} (258)

 

Continuing in this fashion gives a sequence of probability vectors as shown in Figure 21.3 .

 

Figure 21.3: The sequence of probability vectors $\vec{x_1}, \vec{x_2}, \ldots$, converging to the steady state $\vec{x}$ = (5/18  4/9  5/18).

Continuing for several steps, we see that the distribution converges to the steady state of $\vec{x}=(5/18\quad 4/9\quad 5/18)$. In this simple example, we may directly calculate this steady-state probability distribution by observing the symmetry of the Markov chain: states 1 and 3 are symmetric, as evident from the fact that the first and third rows of the transition probability matrix in Equation 256 are identical. Postulating, then, that they both have the same steady-state probability and denoting this probability by $p$, we know that the steady-state distribution is of the form $\vec{\pi}=(p\ \ 1-2p\ \ p)$. Now, using the identity $\vec{\pi}=\vec{\pi}P$, we solve a simple linear equation to obtain $p=5/18$ and consequently, $\vec{\pi}=(5/18\ 4/9\ 5/18)$.
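
The steady state can also be checked numerically. The short R sketch below (our own illustration, not from the book) iterates $\vec{x} \leftarrow \vec{x}P$ on the transition matrix of Equation 256 and converges to $(5/18\ \ 4/9\ \ 5/18) \approx (0.278\ \ 0.444\ \ 0.278)$:

# Power iteration on the 3-state transition matrix of Equation 256
P = matrix(c(1/6,  2/3,  1/6,
             5/12, 1/6,  5/12,
             1/6,  2/3,  1/6), nrow = 3, byrow = TRUE)
x = c(1, 0, 0)                 # start in state 1, as in the text
for (k in 1:50) x = x %*% P    # x_{t+1} = x_t P
print(round(x, 4))             # approx (0.2778, 0.4444, 0.2778)
print(round(c(5/18, 4/9, 5/18), 4))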

The PageRank values of pages (and the implicit ordering amongst them) are independent of any query a user might pose; PageRank is thus a query-independent measure of the static quality of each web page (recall such static quality measures from Section 7.1.4 ). On the other hand, the relative ordering of pages should, intuitively, depend on the query being served. For this reason, search engines use static quality measures such as PageRank as just one of many factors in scoring a web page on a query. Indeed, the relative contribution of PageRank to the overall score may again be determined by machine-learned scoring as in Section 15.4.1 .

 

Figure 21.4: A small web graph. Each arc is annotated with the word that occurs in the anchor text of the corresponding link.

Worked example. Consider the graph in Figure 21.4 . For a teleportation rate of 0.14 its (stochastic) transition probability matrix is:

\begin{displaymath}
P=\left(
\begin{array}{ccccccc}
0.02 & 0.02 & 0.88 & 0.02 & 0.02 & 0.02 & 0.02 \\
\multicolumn{7}{c}{\cdots} \\
0.02 & 0.02 & 0.02 & 0.31 & 0.31 & 0.02 & 0.31 \\
\end{array} \right).
\end{displaymath} (259)

 

The PageRank vector of this matrix is:

\begin{displaymath}
\vec{x} = (
0.05\quad
0.04\quad
0.11\quad
0.25\quad
0.21\quad
0.04\quad
0.31
)\end{displaymath} (260)

 

Observe that in Figure 21.4, $q_2$, $q_3$, $q_4$ and $q_6$ are the nodes with at least two in-links. Of these, $q_2$ has the lowest PageRank since the random walk tends to drift out of the top part of the graph – the walker can only return there through teleportation. End worked example.

 



© 2008 Cambridge University Press

Playing with Google’s PageRank

The excellent book Google’s PageRank and Beyond by Amy N. Langville and Carl D. Meyer elegantly describes the science of search engine rankings in a rigorous yet playful style. In the following, we briefly recall how the PageRank method works.

Before giving the mathematical formulation of the problem solved by the PageRank method, we provide an intuitive interpretation of PageRank in terms of random walks on graphs. The Web is viewed as a directed graph of pages connected by hyperlinks. A random surfer starts from an arbitrary page and simply keeps clicking on successive links at random, bouncing from page to page. The PageRank value of a page corresponds to the relative frequency with which the random surfer visits that page, assuming that the surfer goes on infinitely. The more often the random surfer visits a page, the higher the status, measured in terms of PageRank, of the page.

In mathematical terms, the method can be described as follows. Let us denote by qi the number of forward (outgoing) links of page i. Let H=(hi,j) be a square transition matrix such that hi,j=1/qi if there exists a link from page i to page j and hi,j=0 otherwise. Page self-links are ignored, hence hi,i=0 for all i. The value hi,j can be interpreted as the probability that the random surfer moves from page i to page j by clicking on one of the hyperlinks of page i. The PageRank πj of page j is recursively defined as:

πj = ∑i πi hi,j

or, in matrix notation:

π = πH

Hence, the PageRank of page j is the sum of the PageRank scores of pages i linking to j, weighted by the probability of going from i to j.

Unfortunately, the described ideal model has two problems that prevent the solution of the system. The first one is due to the presence of dangling nodes, that is, pages with no forward links. These pages capture the random surfer indefinitely. Notice that a dangling node corresponds to a row in H with all entries equal to 0. To tackle the problem, such rows are replaced by a bookmark vector b such that bi>0 for all i and ∑i bi=1. This means that the random surfer escapes from the dangling page by jumping to a completely unrelated page according to the bookmark probability distribution. Let S be the resulting matrix.

The second problem with the ideal model is that the surfer can get trapped by a closed cycle in the Web graph, for instance, two pages A and B such that A links only to B and B links only to A. The solution proposed by Brin and Page is to replace matrix S by the Google matrix:

G = αS + (1-α)T

where T is the teleportation matrix with identical rows each equal to the bookmark vector b, and α is a free balancing parameter of the algorithm. The interpretation is that, with probability α, the random surfer moves forward by following links, and, with the complementary probability 1-α, the surfer gets bored of following links and is teleported to a random page chosen in accordance with the bookmark distribution. The inventors of PageRank propose to set α=0.85, meaning that after about five link clicks the random surfer chooses from bookmarks. The PageRank vector is then defined as the solution of the equation:

π = πG

The figure above shows a PageRank instance with solution. Each node is labelled with its PageRank score; scores have been normalized to sum to 100 and are approximated to the first decimal point. We assumed α=0.85 and each bookmark probability equal to 1/n, where n=11 is the number of nodes. Node A is a dangling node, and nodes B and C form a dangling cycle. Notice the dynamics of the method: node C receives just one link, but from the most important page B, hence its importance is higher than that of E, which receives many more links (5) but from anonymous pages. Pages G, H, I, L, and M do not receive endorsements; their scores correspond to the minimum amount of status assigned to each page.

We can use the iterative Power Method on G to compute the PageRank vector as follows:

π0 = e/n
πk+1 = πk G

where e is a vector of 1′s. Furthermore, notice that the adjacency matrix H is typically a sparse matrix, but G is a dense matrix. We can apply the Power Method directly to H as follows. Let a be a dangling vector such that ai=1 if i is a dangling node and ai=0 otherwise. With a little algebra we have that:

G = αS + (1-α)T = α(H + abT) + (1-α)ebT = αH + (αa + (1-α)e)bT

Hence, the Power Method becomes:

π0 = e/n
πk+1 = απkH + (απka + (1-α)πke)bT

Let’s do all this in R:

# EXACT SMALL-SCALE COMPUTATION OF PAGERANK
# H = adjacency matrix
# b = bookmark vector
# alpha = balancing parameter

exact.pr = function (H, b, alpha = 0.85) {

  n = dim(H)[1]
  # normalize adjacency matrix by row sum and 
  # replace dangling rows with bookmark vector (S matrix)
  S = H
  # compute row sums
  rs = H %*% rep(1,n)
  for (i in 1:n) {
    if (rs[i] == 0) {
      S[i,] = b
    } else {
      S[i,] = S[i,] / rs[i]   
    }  
  }

  # build teleportation matrix 
  T = rep(1, n) %*% t(b)

  # build Google matrix 
  G = alpha * S + (1-alpha) * T

  # compute eigenpairs and retrieve the leading eigenvector
  eig = eigen(t(G))
  pi = Re(eig$vectors[,1])
  pi = pi / sum(pi)
  return(pi)
}

# APPROXIMATED SMALL-SCALE COMPUTATION OF PAGERANK (uses dense matrix G)
# H = adjacency matrix
# b = bookmark vector
# alpha = balancing parameter
# t = number of digits of precision

approx.pr = function (H, b, alpha = 0.85, t = 3) {

  n = dim(H)[1]
  # normalize adjacency matrix by row sum and 
  # replace dangling rows with bookmark vector (S matrix)
  S = H
  # compute row sums
  rs = H %*% rep(1,n)
  for (i in 1:n) {
    if (rs[i] == 0) {
      S[i,] = b
    } else {
      S[i,] = S[i,] / rs[i]   
    }  
  }

  # build teleportation matrix 
  T = rep(1, n) %*% t(b)

  # build Google matrix 
  G = alpha * S + (1-alpha) * T

  pi0 = rep(0, n)
  pi1 = rep(1/n, n)
  eps = 1/10^t
  iter = 0
  while (sum(abs(pi0 - pi1)) > eps) {
    pi0 = pi1
    pi1 = pi1 %*% G
    iter = iter + 1
  } 
  pi1 = pi1 / sum(pi1)
  return(list(pi = pi1, iter = iter))
}

# APPROXIMATED LARGE-SCALE COMPUTATION OF PAGERANK (uses sparse matrix H)
# H = adjacency matrix
# b = bookmark vector
# alpha = balancing parameter
# t = number of digits of precision

approx.fast.pr = function (H, b, alpha = 0.85, t = 3) {

  n = dim(H)[1]
  # normalize adjacency matrix by row sum and compute dangling node vector
  a = rep(0, n)
  # compute row sums
  rs = H %*% rep(1,n)
  for (i in 1:n) {
    if (rs[i] == 0) {
      a[i] = 1
    } else {
      H[i,] = H[i,] / rs[i]   
    }  
  }

  e = rep(1, n)
  pi0 = rep(0, n)
  pi1 = rep(1/n, n)
  eps = 1/10^t
  iter = 0
  while (sum(abs(pi0 - pi1)) > eps) {
    pi0 = pi1
    pi1 = alpha * pi1 %*% H + drop(alpha * pi1 %*% a + (1 - alpha) * pi1 %*% e) * b
    iter = iter + 1
  } 
  pi1 = pi1 / sum(pi1)
  return(list(pi = pi1, iter = iter))
}

We first try with the toy example depicted in the above graph:

n = 11
H = matrix(scan("pagerank.txt",0), nrow=n, byrow=TRUE)
b = rep(1/n, n)
alpha = 0.85
t = 10

> exact.pr(H, b, alpha)
 [1] 0.03278149 0.38440095 0.34291029 0.03908709 0.08088569 0.03908709
 [7] 0.01616948 0.01616948 0.01616948 0.01616948 0.01616948

> approx.pr(H, b, alpha, t)
$pi
           [,1]      [,2]      [,3]       [,4]      [,5]       [,6]       [,7]
[1,] 0.03278149 0.3844009 0.3429103 0.03908709 0.0808857 0.03908709 0.01616948
           [,8]       [,9]      [,10]      [,11]
[1,] 0.01616948 0.01616948 0.01616948 0.01616948

$iter
[1] 137

> approx.fast.pr(H, b, alpha, t)
$pi
           [,1]      [,2]      [,3]       [,4]      [,5]       [,6]       [,7]
[1,] 0.03278149 0.3844009 0.3429103 0.03908709 0.0808857 0.03908709 0.01616948
           [,8]       [,9]      [,10]      [,11]
[1,] 0.01616948 0.01616948 0.01616948 0.01616948

$iter
[1] 137

In the following we generate random graphs and use them to benchmark the different PageRank computation methods:

# COMPUTE A RANDOM DIRECTED GRAPH
# n = number of nodes
# p = outlink probability (each node will have p * n successors on average) 

randG = function(n, p) {
  adj =  matrix(nrow=n, ncol=n)
  for (i in 1:n) {
    adj[i,] = sample(c(0,1), n, replace=TRUE, prob=c(1-p, p))
  }
  return(adj)
}

n = 1000;
p = 0.1;
H = randG(n, p)
b = rep(1/n, n)
alpha = 0.85
t = 10

> system.time(exact.pr(H, b, alpha))
   user  system elapsed 
 62.424   0.932 101.097 

> system.time(approx.pr(H, b, alpha, t))
   user  system elapsed 
  1.372   0.036   1.408 

> system.time(approx.fast.pr(H, b, alpha, t))
   user  system elapsed 
  0.904   0.000   0.904

The performance of approx.fast.pr can be significantly enhanced using a sparse representation for matrix H.

Let’s test how the balancing parameter α influences the computational speed (number of iterations) of the approximated methods:

n = 100;
p = 0.1;
H = randG(n, p)
b = rep(1/n, n)
aseq = seq(0, 1, 0.05)
t = 10
i = 1
iter = NULL
for (alpha in aseq) {
  iter[i] = approx.fast.pr(H, b, alpha, t)$iter
  i = i + 1
}

names(iter) = aseq

> iter
   0 0.05  0.1 0.15  0.2 0.25  0.3 0.35  0.4 0.45  0.5 0.55  0.6 0.65  0.7 0.75 
   1    6    7    8    9   10   10   11   12   12   13   14   14   15   16   16 
 0.8 0.85  0.9 0.95    1 
  17   18   19   20   20 

plot(aseq, iter, xlab="alpha", ylab="iterations")

The asymptotic rate of convergence of the power method depends on the ratio of the two eigenvalues of G that are largest in magnitude, denoted λ1 and λ2. Specifically, it is the rate at which |λ2/λ1|^k goes to 0. Since G is stochastic, λ1 = 1, so |λ2| governs the convergence. It turns out that λ2 = αμ2, where μ2 is the subdominant eigenvalue of S. It is expected that, for matrices representing the link structure of the Web, μ2 is close to 1, so the convergence rate is α^k. Hence, to produce scores with τ digits of precision we must have 10^-τ = α^k, and thus about k = -τ / log10(α) iterations must be completed. Is the observed computation faster than expected?

exp.iter = ceiling(-t / log10(aseq))
names(exp.iter) = aseq

> exp.iter
   0 0.05  0.1 0.15  0.2 0.25  0.3 0.35  0.4 0.45  0.5 0.55  0.6 0.65  0.7 0.75 
   0    8   10   13   15   17   20   22   26   29   34   39   46   54   65   81 
 0.8 0.85  0.9 0.95    1 
 104  142  219  449 -Inf 

> iter
   0 0.05  0.1 0.15  0.2 0.25  0.3 0.35  0.4 0.45  0.5 0.55  0.6 0.65  0.7 0.75 
   1    6    7    8    9   10   10   11   12   12   13   14   14   15   16   16 
 0.8 0.85  0.9 0.95    1 
  17   18   19   20   20 

plot(aseq, exp.iter, xlab="alpha", ylab="iterations")
points(aseq, iter, pch=2)
legend(x = "topleft", legend = c("expected", "observed"), pch = c(1,2))

Why so fast? Let’s check the dominant and subdominant eigenvalues of the matrices S and G:

alpha = 0.85
S = H
rs = H %*% rep(1,n)
for (i in 1:n) {
  if (rs[i] == 0) {
    S[i,] = b
  } else {
    S[i,] = S[i,] / rs[i]   
  }  
}

T = rep(1, n) %*% t(b)
G = alpha * S + (1-alpha) * T

val.S = eigen(S)$val
val.G = eigen(G)$val

> abs(val.S[1:2])
[1] 1.0000000 0.3322384

> abs(val.G[1:2])
[1] 1.0000000 0.2824027

> abs(val.S[2]) * alpha
[1] 0.2824027

Notice that the subdominant eigenvalue of S is far from 1, and this speeds up the computation. In fact, the generated matrix H is a random graph, and not a scale-free network like the link structure of the Web. Let us simulate the Web graph with scale-free networks:

# Pareto distribution
dpareto=function(x, shape=1, location=1) shape*location^shape/x^(shape+1);
ppareto=function(x, shape=1, location=1) (x >= location)*(1-(location/x)^shape);
qpareto=function(u, shape=1, location=1) location/(1-u)^(1/shape);
rpareto=function(n, shape=1, location=1) qpareto(runif(n),shape,location);

# GENERATE A SCALE-FREE DIRECTED GRAPH
# n = number of nodes
# shape = the exponent parameter of power law 
# location = the min value parameter of power law

paretoG = function(n, shape=1, location=1) {
  degree = round(rpareto(n, shape, location))
  degree[degree > n] = n
  adj =  matrix(0, nrow=n, ncol=n)
  for (i in 1:n) {
    index = sample(1:n, degree[i])
    adj[i, index] = 1 
  }
  return(adj)
}

In the following we test how α influences the number of iterations on scale-free graphs:

n = 100
shape = 1.5
location = 1
sf.H = paretoG(n, shape, location)
b = rep(1/n, n)
aseq = seq(0, 1, 0.05)
t = 10
i = 1
sf.iter = NULL
for (a in aseq) {
  sf.iter[i] = approx.fast.pr(sf.H, b, a, t)$iter
  i = i + 1
}
names(sf.iter) = aseq

> sf.iter
   0 0.05  0.1 0.15  0.2 0.25  0.3 0.35  0.4 0.45  0.5 0.55  0.6 0.65  0.7 0.75 
   1    8    9   11   13   15   17   19   22   24   28   31   36   41   48   58 
 0.8 0.85  0.9 0.95    1 
  69   86  112  157  251 

> exp.iter
   0 0.05  0.1 0.15  0.2 0.25  0.3 0.35  0.4 0.45  0.5 0.55  0.6 0.65  0.7 0.75 
   0    8   10   13   15   17   20   22   26   29   34   39   46   54   65   81 
 0.8 0.85  0.9 0.95    1 
 104  142  219  449 -Inf 

> iter
   0 0.05  0.1 0.15  0.2 0.25  0.3 0.35  0.4 0.45  0.5 0.55  0.6 0.65  0.7 0.75 
   1    6    7    8    9   10   10   11   12   12   13   14   14   15   16   16 
 0.8 0.85  0.9 0.95    1 
  17   18   19   20   20 

plot(aseq, exp.iter, xlab="alpha", ylab="iterations")
points(aseq, sf.iter, pch=2)
points(aseq, iter, pch=3)
legend(x = "topleft", legend = c("expected", "scale-free", "random"), pch = c(1, 2, 3))

Let’s check the subdominant eigenvalue of S:

S = sf.H
rs = sf.H %*% rep(1,n)
for (i in 1:n) {
  if (rs[i] == 0) {
    S[i,] = b
  } else {
    S[i,] = S[i,] / rs[i]   
  }  
}
val.S = eigen(S)$val

> abs(val.S[2])
[1] 0.9194817

Notice that the subdominant eigenvalue of S is now much closer to 1. This explains why the convergence rate is slower than what we observed on random networks.

We now check the sensitivity of the PageRank vector to variations of the balancing parameter α. From theory, we know that, for small α, PageRank is insensitive to slight variations in α. As α becomes larger, PageRank becomes increasingly more sensitive to small perturbations in α.

n = 100;
p = 0.1;
H = randG(n, p)
b = rep(1/n, n)
aseq = seq(0, 1, 0.05)
t = 10
i = 1
Pi = matrix(nrow = length(aseq), ncol = n)
for (a in aseq) {
  Pi[i,] = approx.fast.pr(H, b, a, t)$pi
  i = i + 1
}

We plot the PageRank vectors for different values of α:

plot(sort(Pi[1,]), type = "l", lty = 1, xlab="index", ylab="PR", ylim = c(min(Pi), max(Pi)))
for (i in 2:length(aseq)) {
  lines(sort(Pi[i,]), lty = i)
}

We furthermore compute and plot the (norm-1) distance between PageRank vectors for consecutive values of α:

d = NULL
for (i in 2:length(aseq)) {
  d[i-1] = sum(abs(Pi[i,] - Pi[i-1,]))
}
plot(aseq[2:length(aseq)], d, xlab="alpha", ylab="sensitivity")

For α close to 1, the degree of sensitivity is governed (again) by the subdominant eigenvalue of S. Hence we might expect higher sensitivity on scale-free graphs. Let’s check it:

n = 100
shape = 1.5
location = 1
sf.H = paretoG(n, shape, location)
b = rep(1/n, n)
aseq = seq(0, 1, 0.05)
t = 10
i = 1
Pi = matrix(nrow = length(aseq), ncol = n)
for (a in aseq) {
  Pi[i,] = approx.fast.pr(sf.H, b, a, t)$pi
  i = i + 1
}

plot(sort(Pi[1,]), type = "l", lty = 1, xlab="index", ylab="PR", ylim = c(min(Pi), max(Pi)))
for (i in 2:length(aseq)) {
  lines(sort(Pi[i,]), lty = i)
}

sf.d = NULL
for (i in 2:length(aseq)) {
  sf.d[i-1] = sum(abs(Pi[i,] - Pi[i-1,]))
}

plot(aseq[2:length(aseq)], sf.d, xlab="alpha", ylab="sensitivity", ylim=c(0, max(sf.d)))
points(aseq[2:length(aseq)], d, pch=2)
legend(x = "topleft", legend = c("scale-free", "random"), pch = c(1, 2))

Mathematics of Google Search

After http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus

Linear Algebra in a Nutshell

Denote first the set of all real numbers by R. A vector of length n is simply a list of n elements, called vector entries. The notation of a (column) vector v having entries x1, x2, … ,xn is the following:

remus - image 00

Example 1: The entries of a vector can in principle be anything: words, colors, numbers, etc. The following are all examples of vectors:

remus - image 01

However, we will only look at vectors whose entries are real numbers (that is, x1, x2, … ,xn are all real numbers). As in the case of real numbers, real vectors can be added or multiplied by a real number (called a scalar) in a straightforward way:

remus - image 02
Example 2: Using the two operations above we can easily check that:

remus - image 04

In a similar way we define the notion of a matrix. An m × n matrix is a rectangular array of entries having height m and width n. This can also be seen as stacking n vectors of length m one next to the other. Again, take the entries to be real numbers. We denote an m × n matrix A with entries aij by:

remus - image 05

This means that on row i and column j we find the real number aij. For example, on the first row, second column lies element a12. Notice that a vector of length m is just an m × 1 matrix and a real number is a 1 × 1 matrix. If m = n then the matrix is called a square matrix of size n. This is what we will consider from now on. It makes sense now to talk about the diagonal of a matrix, which consists of those elements for which the row number equals the column number (the elements a11, a22, up to ann).

Example 3: The following are all examples of matrices, but only the last three are square:

remus - image 06

Notice that the last two matrices are rather “special”: they are symmetric. This means that the entry on row i and column j is equal to the entry on row j and column i (or, otherwise said, aij = aji). As an easy exercise, prove that any diagonal matrix (one that has zeros everywhere except on the diagonal) is symmetric. The matrix that has ones on the diagonal and zeros everywhere else is called the identity matrix and is denoted by I. Matrix multiplication is done in the following general way:

remus - image 07

where the element cij is given by the formula:

remus - image 08

Multiplying a matrix with a vector is done as follows:

remus - image 09

where every element yi from the resulting vector y is given by the formula

remus - image 10

The other two matrix operations, addition and scalar multiplication, are done as in the case of vectors. Adding matrices A and B gives a matrix C which has the entry on row i and column j equal to the sum of the corresponding entries of A and B. Multiplying matrix A with a real number a is the same as multiplying every element of A with a.
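
For readers who want to experiment, these operations translate directly into R; the small matrices below are our own examples, not the ones shown in the figures:

# Matrix operations in R: product, matrix-vector product, sum, scalar multiple, transpose
A = matrix(c(1, 2,
             3, 4), nrow = 2, byrow = TRUE)
B = matrix(c(0, 1,
             1, 0), nrow = 2, byrow = TRUE)
x = c(1, -1)

A %*% x      # matrix times vector: entries (1*1 + 2*(-1), 3*1 + 4*(-1)) = (-1, -1)
A %*% B      # matrix product
A + B        # entrywise sum
3 * A        # scalar multiplication
t(A)         # transpose; note that t(A %*% B) equals t(B) %*% t(A)
diag(2)      # the 2 x 2 identity matrix I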

Example 4: All above operations on matrices will become more obvious through the following examples:

remus - image 11

Problem 2: Consider the following three matrices:

remus - image 12

Show that (A+B)+C = A+(B+C) and that (A·B)·C = A·(B·C).

One can observe that there is some sort of similarity between matrices A and C in Problem 2 above. The similarity comes from the fact that the entry on row i and column j in matrix A is equal to the entry on row j and column i in matrix C. Of course, the elements on the diagonal are the same. We then say that A is the transpose of C, or equivalently that C is the transpose of A. In general, for a matrix A we denote its transpose by At. More intuitively, given a matrix we find its transpose by interchanging the element at row i, column j with the element at row j, column i. If we do this twice we notice that the transpose of the transpose of a matrix is the matrix itself, or (At)t = A.

Fact 2: For any matrices A, B, (A·B)t = Bt·At, and (A+B)t = At+Bt.

We now introduce two important notions in the theory of matrices: eigenvector and eigenvalue. We say that the real number z is an eigenvalue for A if there exists a real vector v of length n such that A·v = z·v. Such a vector v is called an eigenvector corresponding to the eigenvalue z. This is not the most general definition, but it will suffice for our purposes. In general eigenvalues and eigenvectors are complex, and not real. If we assume that A is a (real) symmetric matrix of size n, then we know that it has n real eigenvalues and all eigenvectors are also real. In fact, a matrix of size n can have at most n real eigenvalues.

In order to make these definitions more clear, consider the following explicit example:

Example 5: The matrix remus - image 13   has eigenvector remus - image 14 and eigenvalue z = 3 corresponding to this eigenvector. Indeed, the definition is verified because:

remus - image 15

Fact 3: Any matrix A has the same eigenvalues as its transpose At.

Of course, in general a matrix A and its transpose At do not have the same eigenvectors that correspond to the common eigenvalues. For the matrix in the above example, remus - image 16 has eigenvalue z = 3 but the corresponding eigenvector is remus - image 19 . This follows from the computation below

remus - image 20

An important observation is that a matrix A may (in most cases) have more than one eigenvector corresponding to an eigenvalue. These eigenvectors that correspond to the same eigenvalue may have no relation to one another. They can however be related, as for example if one is a scalar multiple of another. More precisely, in the last example, the vector whose entries are 0 and 1 is an eigenvector, but also the vector whose entries are 0 and 2 is an eigenvector. It is a good exercise to check this by direct computation as shown in Example 5.

A matrix is called column-stochastic if all of its entries are greater than or equal to zero (nonnegative) and the sum of the entries in each column is equal to 1. If all entries of a matrix are nonnegative, then we say that the matrix itself is nonnegative. Furthermore, a matrix is positive if all its entries are positive (greater than zero) real numbers.

Example 6: Consider a matrix A with transpose At:

remus - image 21

It is easy to see that A is column-stochastic, while At is not. However, the sum of the elements on each row of At is equal to 1. We first show that z = 1 is an eigenvalue for At, with corresponding eigenvector

v = (1, 1, 1, 1)t, the vector with all entries equal to 1.

This is true since At·v = 1·v. Then, from Fact 3, 1 is an eigenvalue for the matrix A as well.

Next we wish to find an eigenvector that corresponds to this eigenvalue for A. Let u be such an eigenvector. In order to find u explicitly we write the equation

remus - image 23

Here x1, x2, x3, and x4 are all real numbers that we don’t yet know. If we multiply term by term we find that

remus - image 24

Substituting remus - image 25 in the third relation we obtain remus - image 26 and so the vector u is of the form

remus - image 27

As x1 is just a real number (hence a scalar) we can take x1 = 12 and we have just proved that the vector whose entries are 12, 4, 9, and 6 (from top to bottom) is an eigenvector for A.

In the first part of the previous example we have just shown that 1 is an eigenvalue in that particular case. However, this is true for any column-stochastic matrix, as stated below.

Fact 4: Any column-stochastic matrix has z = 1 as eigenvalue.

Notice also that the eigenvector that we found in the second part of the example above is rather “special” itself. We have chosen x1 = 12 and we obtained an eigenvector with positive entries. If however we choose x1 = -12 then we obtain an eigenvector with negative entries (smaller than 0).

Fact 5: If A is a positive column-stochastic matrix, then any eigenvector corresponding to the eigenvalue z = 1 has either only positive entries or only negative entries.

When we are working with positive column-stochastic matrices A it is possible to find an eigenvector v associated to the eigenvalue z = 1 such that all its entries are positive. Hence A·v = v and the entries of v are all positive.

Fact 6: If A is a positive column-stochastic matrix, then there is a unique eigenvector corresponding to the eigenvalue z = 1 such that it has only positive entries and the sum of its entries equals 1.
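
Facts 4-6 are easy to check numerically in R. The sketch below uses an arbitrary positive column-stochastic matrix (our own example, not one from the lecture) and extracts the eigenvector with eigenvalue 1, normalized so that its entries sum to 1:

# A positive column-stochastic matrix: every column sums to 1
A = matrix(c(0.1, 0.3, 0.5,
             0.4, 0.3, 0.2,
             0.5, 0.4, 0.3), nrow = 3, byrow = TRUE)
colSums(A)                    # all equal to 1

e = eigen(A)
round(Mod(e$values), 6)       # the largest eigenvalue in modulus is 1 (Fact 4)

v = Re(e$vectors[, 1])        # eigenvector for the eigenvalue 1
v = v / sum(v)                # rescale so the entries sum to 1 (Fact 6)
v                             # all entries are positive (Fact 5)
A %*% v - v                   # essentially zero, confirming A·v = v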

Lecture #2: Directed Graphs – Transition Matrices

A graph is an object that consists of a non-empty set of vertices and another set of edges. When working with real-world examples of graphs, we sometimes refer to them as networks. The vertices are often called nodes or points, while edges are referred to as links or lines. The set of edges may be empty, in which case the graph is just a collection of points.

Example 1: A simple example of a graph with vertices 1,2,3,4, and directed edges from vertex 1 to vertex 2, vertex 3 to vertices 2 and 4, and vertex 4 to vertex 1.

Example of a graph

In this lecture we will only work with directed graphs and real-world examples of those (Internet graphs), but for other properties of graphs we refer to the Math Explorer’s Club website. The central example in this module is the web graph, in which web pages are represented as vertices and the links between them are represented as edges. An example of such a graph is a subgraph of the BGP (Border Gateway Protocol) web graph, consisting of major Internet routers. It has about 6400 vertices and 13000 edges and it was produced by Ross Richardson and rendered by Fan Chung Graham.

Subgraph of a BGP graph

Although Internet graphs are very large, with the number of vertices on the order of 30 billion (and growing), all graphs in this module are considered finite (finite number of vertices and edges).

We say that two vertices i and j of a directed graph are joined or adjacent if there is an edge from i to j or from j to i. If such an edge exists, then i and j are its endpoints. If there is an edge from i to j then i is often called the tail, while j is called the head. In Example 1, vertices 1 and 2 are joined because there is an edge from 1 to 2, while vertices 1 and 3 are not joined. There is however no edge from node 2 to node 1. Notice that there can be no more than two edges between any two vertices. There is a strong relation between graphs and matrices, previously introduced in Lecture 1. Suppose we are given a directed graph with n vertices. Then we construct an n × n adjacency matrix A associated to it as follows: if there is an edge from node i to node j, then we put 1 as the entry on row i, column j of the matrix A.

Example 2: The adjacency matrix for the graph in Example 1 is

A =
0 1 0 0
0 0 0 0
0 1 0 1
1 0 0 0

If one can walk from node i to node j along the edges of the graph then we say that there is a path from i to j. If we walked on k edges, then the path has length k. For matrices, we denote by Ak the matrix obtained by multiplying A with itself k times. The entry on row i, column j of A2 = A·A corresponds to the number of paths of length 2 from node i to node j in the graph. For Example 2, the square of the adjacency matrix is

A2 =
0 0 0 0
0 0 0 0
1 0 0 0
0 1 0 0

 

This means that there is a path from vertex 4 to vertex 2, because the entry on fourth row and second column is 1. Similarly there is a path from 3 to 1, as one can easily see from Example 1.

Fact 1: Consider a directed graph and a positive integer k. Then the number of directed walks from node i to node j of length k is the entry on row i and column j of the matrix Ak, where A is the adjacency matrix.

 

In general, a matrix is called primitive if there is a positive integer k such that Ak is a positive matrix. A graph is called connected if for any two different nodes i and j there is a directed path either from i to j or from j to i. On the other hand, a graph is called strongly connected if starting at any node i we can reach any other different node j by walking on its edges. In terms of matrices, this means that if there is a positive integer k such that the matrix B = I + A + A2 + A3 + … +Ak is positive, then the graph is strongly connected. We add the identity matrix I in order to deal with edges from a vertex to itself. In other words, if there is at least one path from node i to node j of length at most k, then we can travel from node i to j. Thus if matrix B has a positive entry on row i and column j then it is possible to reach node j starting from i. If this happens for all nodes, then the graph is strongly connected.

One can easily see that the graph in Example 1 is connected, but not strongly connected because there is no directed path from vertex 1 to vertex 3. For the matrix in Example 2, we notice that A4 is a matrix having only zeros, and so for all k greater than 4, Ak will be a matrix filled with zeros. Then for any k greater than 4, the matrix B = I + A + A2 + A3 + … + Ak is:

B =
1 1 0 0
0 1 0 0
1 2 1 1
1 1 0 1

 

Since the matrix B is not positive, the graph in Example 1 is not strongly connected as we already saw.
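
The same computation can be reproduced in R, using the adjacency matrix of Example 2:

# Adjacency matrix of the graph in Example 1 (edges 1->2, 3->2, 3->4, 4->1)
A = matrix(c(0, 1, 0, 0,
             0, 0, 0, 0,
             0, 1, 0, 1,
             1, 0, 0, 0), nrow = 4, byrow = TRUE)

A %*% A                                   # A^2: entry (4,2) is 1, so there is a length-2 path 4 -> 2
A %*% A %*% A %*% A                       # A^4 is the zero matrix

B = diag(4) + A + A %*% A + A %*% A %*% A # B = I + A + A^2 + A^3 (higher powers are zero)
B                                         # not positive, so the graph is not strongly connected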

Problem 1: Construct the adjacency matrix of the graph below. Is this graph connected or strongly connected? How many paths of length 3 from node 3 to node 2 are there?

exercise 1

In the examples above we noticed that for every vertex i there is a number of edges that enter that vertex (i is a head) and a number of edges that exit that vertex (i is a tail). Thus we define the indegree of i as the number of edges for which i is a head. Similarly, the outdegree of i is the number of edges for which i is a tail. For example, for the graph in Problem 1, the indegree of node 2 is 2 and the outdegree of node 1 is 1. The transition matrix A associated to a directed graph is defined as follows. If there is an edge from i to j and the outdegree of vertex i is di, then on column i and row j we put 1/di. Otherwise we mark column i, row j with zero. Notice that we first look at the column, then at the row. We usually write 1/di on the edge going from vertex i to an adjacent vertex j, thus obtaining a weighted graph. This will become clear through the following example.

Example 3: Consider the graph from Example 1 with weights on its edges as described above.


Then the transition matrix associated to it is:

A =
0  0  0    1
1  0  1/2  0
0  0  0    0
0  0  1/2  0

Notice that the sum of the entries on the first column is 1. The same holds for the third and fourth column. In general, more is true.
Fact 2: For a strongly connected graph, the transition matrix is column-stochastic.

We use the transition matrix to model the behavior of a random surfer on a web graph. The surfer chooses a page at random, then follows its links to other web pages for as long as he/she wishes. At each step the probability that the surfer moves from node i to node j is zero if there is no link from i to j, and 1/di otherwise. Recall that di is the outdegree of vertex i. Initially the probability of each page to be chosen as a starting point is

v = [1/n  1/n  …  1/n]t, the column vector with all entries equal to 1/n, where n is the number of nodes.

 

At step 1, the probability of each node to be visited after one click is A·v. At step 2, the probability of each node to be visited after two clicks is A2·v. The probability of a page to be visited at step k is thus Ak·v. If the matrix A is primitive and column-stochastic, then this process converges to a unique stationary probability distribution vector p, where

A·p = p.

 

The meaning of the ith entry of p is that the surfer visits page i at any given time with probability pi.

Problem 2: Let A be a transition matrix associated with a graph and B be a matrix of size n with every entry equal to 1/n. Consider x a positive real number smaller than 1. Then the matrix C = x·A + (1-x)·B is column-stochastic. Show that this is true in the special case of Problem 1.
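
As a quick numerical check of the statement in Problem 2 (using a small strongly connected graph of our own, since the graph of Problem 1 is only shown in the figure), the following R lines verify that C = x·A + (1-x)·B has columns summing to 1:

# Transition matrix of a small strongly connected graph: edges 1->2, 1->3, 2->3, 3->1
A = matrix(c(0,   0, 1,
             1/2, 0, 0,
             1/2, 1, 0), nrow = 3, byrow = TRUE)
n = nrow(A)
B = matrix(1/n, nrow = n, ncol = n)   # every entry equal to 1/n
x = 0.85
C = x * A + (1 - x) * B
colSums(C)                            # each column sums to x*1 + (1-x)*1 = 1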

Lecture #3: PageRank Algorithm – The Mathematics of Google Search

We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine, like Google, AltaVista, Yahoo, type in the key words, and the search engine will display the pages relevant for your search. But how does a search engine really work?

At first glance, it seems reasonable to imagine that what a search engine does is to keep an index of all web pages, and when a user types in a query search, the engine browses through its index and counts the occurrences of the key words in each web file. The winners are the pages with the highest number of occurrences of the key words. These get displayed back to the user.

This used to be the correct picture in the early 90s, when the first search engines used text-based ranking systems to decide which pages are most relevant to a given query. There were however a number of problems with this approach. A search about a common term such as “Internet” was problematic. The first page displayed by one of the early search engines was written in Chinese, with repeated occurrences of the word “Internet” and containing no other information about the Internet. Moreover, suppose we wanted to find some information about Cornell. We type in the word “Cornell” and expect that “www.cornell.edu” would be the most relevant site to our query. However there may be millions of pages on the web using the word Cornell, and www.cornell.edu may not be the one that uses it most often. Suppose we decided to write a web site that contains the word “Cornell” a billion times and nothing else. Would it then make sense for our web site to be the first one displayed by a search engine? The answer is obviously no. However, if all a search engine does is to count occurrences of the words given in the query, this is exactly what might happen.

The usefulness of a search engine depends on the relevance of the result set it gives back. There may of course be millions of web pages that include a particular word or phrase; however some of them will be more relevant, popular, or authoritative than others. A user does not have the ability or patience to scan through all pages that contain the given query words. One expects the relevant pages to be displayed within the top 20-30 pages returned by the search engine.

Modern search engines employ methods of ranking the results to provide the “best” results first that are more elaborate than just plain text ranking. One of the best known and most influential algorithms for computing the relevance of web pages is the PageRank algorithm used by the Google search engine. It was invented by Larry Page and Sergey Brin while they were graduate students at Stanford, and it became a Google trademark in 1998. The idea that PageRank brought up was that the importance of any web page can be judged by looking at the pages that link to it. If we create a web page i and include a hyperlink to the web page j, this means that we consider j important and relevant for our topic. If there are a lot of pages that link to j, this means that the common belief is that page j is important. If, on the other hand, j has only one backlink, but that comes from an authoritative site k (like www.google.com, www.cnn.com, www.cornell.edu), we say that k transfers its authority to j; in other words, k asserts that j is important. Whether we talk about popularity or authority, we can iteratively assign a rank to each web page, based on the ranks of the pages that point to it.

To this aim, we begin by picturing the Web net as a directed graph, with nodes represented by web pages and edges represented by the links between them.

Suppose for instance, that we have a small Internet consisting of just 4 web sites www.page1.com, www.page2.com, www.page3.com, www.page4.com, referencing each other in the manner suggested by the picture:

Example of 4 web pages which reference each other

 

We “translate” the picture into a directed graph with 4 nodes, one for each web site. When web site i references j, we add a directed edge between node i and node j in the graph. For the purpose of computing their page rank, we ignore any navigational links such as back, next buttons, as we only care about the connections between different web sites. For instance, Page1 links to all of the other pages, so node 1 in the graph will have outgoing edges to all of the other nodes. Page3 has only one link, to Page 1, therefore node 3 will have one outgoing edge to node 1. After analyzing each web page, we get the following graph:

In our model, each page should transfer its importance evenly to the pages that it links to. Node 1 has 3 outgoing edges, so it will pass on 1/3 of its importance to each of the other 3 nodes. Node 3 has only one outgoing edge, so it will pass on all of its importance to node 1. In general, if a node has k outgoing edges, it will pass on 1/k of its importance to each of the nodes that it links to. Let us better visualize the process by assigning weights to each edge.

Let us denote by A the transition matrix of the graph. With the weights assigned above (page 1 links to pages 2, 3 and 4, page 2 to pages 3 and 4, page 3 to page 1, and page 4 to pages 1 and 3), its (i, j) entry is the fraction of page j's importance passed to page i:

    A = [  0    0    1   1/2
          1/3   0    0    0
          1/3  1/2   0   1/2
          1/3  1/2   0    0  ]

Note that each column of A sums to 1, i.e., A is column stochastic.
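For readers who want to experiment, here is a small sketch in Python with NumPy (not part of the original lecture; the variable names are ours) that builds this column stochastic transition matrix directly from the link structure:

    import numpy as np

    # Links of the small 4-page web described above:
    # page 1 -> 2, 3, 4;  page 2 -> 3, 4;  page 3 -> 1;  page 4 -> 1, 3
    links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}
    n = 4

    # Column j of A holds the importance that page j passes to each page:
    # a page with k outgoing links gives 1/k to each page it links to.
    A = np.zeros((n, n))
    for j, outlinks in links.items():
        for i in outlinks:
            A[i - 1, j - 1] = 1.0 / len(outlinks)

    print(A)
    print(A.sum(axis=0))   # each column sums to 1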

Dynamical systems point of view: Suppose that initially the importance is uniformly distributed among the 4 nodes, each getting ¼. Denote by v the initial rank vector, having all entries equal to ¼. Each incoming link increases the importance of a web page, so at step 1, we update the rank of each page by adding to the current value the importance of the incoming links. This is the same as multiplying the matrix A with v. At step 1, the new importance vector is v1 = Av. We can iterate the process; thus at step 2, the updated importance vector is v2 = A(Av) = A^2 v. Numeric computations give:

Numeric Computation

 

We notice that the sequence of iterates v, Av, …, A^k v tends to the equilibrium value v* ≈ (0.39, 0.13, 0.29, 0.19)^t. We call this the PageRank vector of our web graph.
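The numeric computations can be reproduced with a few lines (again Python/NumPy, assuming the transition matrix reconstructed above); after a handful of iterations the vector settles near (0.39, 0.13, 0.29, 0.19):

    import numpy as np

    # Transition matrix of the 4-page example (columns sum to 1)
    A = np.array([[0,   0,   1, 1/2],
                  [1/3, 0,   0, 0  ],
                  [1/3, 1/2, 0, 1/2],
                  [1/3, 1/2, 0, 0  ]])

    v = np.full(4, 0.25)                 # initial rank vector, all entries 1/4
    for step in range(1, 21):
        v = A @ v                        # v_k = A v_{k-1}
        if step <= 3 or step == 20:
            print(step, np.round(v, 3))  # settles near [0.387, 0.129, 0.290, 0.194]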

Linear algebra point of view: Let us denote by x1, x2, x3, and x4 the importance of the four pages. Analyzing the situation at each node we get the system:

Computing the weights x_i:

    x1 = x3 + (1/2) x4
    x2 = (1/3) x1
    x3 = (1/3) x1 + (1/2) x2 + (1/2) x4
    x4 = (1/3) x1 + (1/2) x2

This is equivalent to asking for the solutions of the matrix equation Ax = x, where x = (x1, x2, x3, x4)^t, that is, for the eigenvectors of A corresponding to the eigenvalue 1. From Example 6 in Lecture 1 we know that these eigenvectors are of the form c(12, 4, 9, 6)^t. Since PageRank should reflect only the relative importance of the nodes, and since the eigenvectors are just scalar multiples of each other, we can choose any of them to be our PageRank vector. Choose v* to be the unique eigenvector with the sum of all entries equal to 1. (We will sometimes refer to it as the probabilistic eigenvector corresponding to the eigenvalue 1.) The eigenvector v* = (12/31, 4/31, 9/31, 6/31)^t ≈ (0.39, 0.13, 0.29, 0.19)^t is our PageRank vector.
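The same vector can be obtained directly from an eigenvector computation; a minimal sketch (Python/NumPy, assuming the matrix A reconstructed above) extracts an eigenvector for the eigenvalue 1 and rescales it so that its entries sum to 1:

    import numpy as np

    A = np.array([[0,   0,   1, 1/2],
                  [1/3, 0,   0, 0  ],
                  [1/3, 1/2, 0, 1/2],
                  [1/3, 1/2, 0, 0  ]])

    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmin(np.abs(eigvals - 1.0))     # pick the eigenvalue (numerically) equal to 1
    w = np.real(eigvecs[:, k])
    print(w / w.sum())                       # approx. [0.387, 0.129, 0.290, 0.194]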

Probabilistic point of view: Since the importance of a web page is measured by its popularity (how many incoming links it has), we can view the importance of page i as the probability that a random surfer on the Internet, who opens a browser to any page and starts following hyperlinks, visits the page i. We can interpret the weights we assigned to the edges of the graph in a probabilistic way: a random surfer that is currently viewing web page 2 has ½ probability to go to page 3, and ½ probability to go to page 4. We can model the process as a random walk on graphs. Each page has equal probability ¼ to be chosen as a starting point. So, the initial probability distribution is given by the column vector x = [¼ ¼ ¼ ¼]t. The probability that page i will be visited after one step is equal to the ith entry of Ax, and so on. The probability that page i will be visited after k steps is equal to the ith entry of A^k x. The sequence Ax, A^2 x, A^3 x, …, A^k x, … converges in this case to a unique probabilistic vector v*. In this context v* is called the stationary distribution and it will be our PageRank vector. Moreover, the ith entry in the vector v* is simply the probability that at each moment a random surfer visits page i. The computations are identical to the ones we did in the dynamical systems interpretation; only the meaning we attribute to each step is slightly different.

The PageRank vector v* we have computed by different methods indicates that page 1 is the most relevant page. This might seem surprising, since page 1 has 2 backlinks, while page 3 has 3 backlinks. If we take a look at the graph, we see that node 3 has only one outgoing edge, to node 1, so it transfers all its importance to node 1. Equivalently, once a web surfer that only follows hyperlinks visits page 3, he can only go to page 1. Notice also how the rank of each page is not trivially just the weighted sum of the edges that enter the node. Intuitively, at step 1, one node receives an importance vote from its direct neighbors, at step 2 from the neighbors of its neighbors, and so on.

Changing the web graph might lead to certain problems.

Nodes with no outgoing edges (dangling nodes)

Node 3 is a dangling node

We iteratively compute the rank of the 3 pages:

Computing the Page Rank.

So in this case the rank of every page converges to 0: because node 3 has no outgoing edges, its column in the transition matrix is zero, and the total importance drains out of the system at each iteration. This is counterintuitive, as page 3 has 2 incoming links, so it must have some importance!

Disconnected components

2 connected components

A random surfer that starts in the first connected component has no way of getting to web page 5, since nodes 1 and 2 have no links to node 5 that he can follow. Linear algebra fails to help as well: because the graph is disconnected, the transition matrix of this graph is block diagonal, and each connected component contributes its own eigenvector for the eigenvalue 1. These two eigenvectors are not just scalar multiples of each other. So, both in theory and in practice, the notion of ranking pages from the first connected component relative to the ones from the second connected component is ambiguous.

The web is very heterogeneous by its nature, and certainly huge, so we do not expect its graph to be connected. Likewise, there will be pages that are purely descriptive and contain no outgoing links. What is to be done in these cases? We need an unambiguous meaning of the rank of a page, for any directed Web graph with n nodes.

The solution of Page and Brin: In order to overcome these problems, fix a positive constant p between 0 and 1, which we call the damping factor (a typical value for p is 0.15). Define the PageRank matrix (also known as the Google matrix) of the graph by

    M = (1 - p) A + p B,

where B is the n × n matrix with all entries equal to 1/n.
Problem 1. Prove that M remains a column stochastic matrix. Prove that M has only positive entries.

The matrix M models the random surfer model as follows: most of the time, a surfer will follow links from a page: from a page i the surfer will follow the outgoing links and move on to one of the neighbors of i. A smaller, but positive, percentage of the time, the surfer will dump the current page and choose arbitrarily a different page from the web and “teleport” there. The damping factor p reflects the probability that the surfer quits the current page and “teleports” to a new one. Since he/she can teleport to any web page, each page has probability 1/n to be chosen. This justifies the structure of the matrix B.
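A minimal sketch of the construction, assuming the definition M = (1 − p)A + pB reconstructed above (the helper name google_matrix is ours); it also checks the two claims of Problem 1 numerically for the 4-page example:

    import numpy as np

    def google_matrix(A, p=0.15):
        """M = (1 - p) * A + p * B, where B has every entry equal to 1/n."""
        n = A.shape[0]
        return (1 - p) * A + p * np.full((n, n), 1.0 / n)

    A = np.array([[0,   0,   1, 1/2],
                  [1/3, 0,   0, 0  ],
                  [1/3, 1/2, 0, 1/2],
                  [1/3, 1/2, 0, 0  ]])
    M = google_matrix(A, p=0.15)
    print(np.allclose(M.sum(axis=0), 1))   # columns still sum to 1
    print((M > 0).all())                   # and every entry is positive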

Problem 2. Redo the computations for the PageRank with the transition matrix A replaced by the matrix M, for the graphs representing the dangling node and the disconnected components examples above. Do the problems mentioned there still occur?

Intuitively, the matrix M “connects” the graph and gets rid of the dangling nodes. A node with no outgoing edges now has a positive probability of moving to any other node. Rigorously, for the matrix M, the following theorems apply:

Perron-Frobenius Theorem: If M is a positive, column stochastic matrix, then:

  1. 1 is an eigenvalue of multiplicity one.
  2. 1 is the largest eigenvalue: all the other eigenvalues are in modulus smaller than 1.
  3. the eigenvector corresponding to eigenvalue 1 has all entries positive. In particular, for the eigenvalue 1 there exists a unique eigenvector with the sum of its entries equal to 1.
Power Method Convergence Theorem: Let M be a positive, column stochastic n × n matrix. Denote by v* its probabilistic eigenvector corresponding to the eigenvalue 1. Let z be the column vector with all entries equal to 1/n. Then the sequence z, Mz, …, M^k z, … converges to the vector v*.

In view of everything discussed above, we conclude that:

Fact: The PageRank vector for a web graph with transition matrix A, and damping factor p, is the unique probabilistic eigenvector of the matrix M, corresponding to the eigenvalue 1.

From the mathematical point of view, once we have M, computing the eigenvectors corresponding to the eigenvalue 1 is, at least in theory, a straightforward task. As in Lecture 1, just solve the system Mx = x! But when the matrix M has size 30 billion (as it does for the real Web graph), even mathematical software such as Matlab or Mathematica is clearly overwhelmed.

An alternative way of computing the probabilistic eigenvector corresponding to the eigenvalue 1 is given by the Power Method. The theorem guarantees that the method works for positive, column stochastic matrices. We reasoned that the iteration process corresponds to the way importance distributes over the net following the link structure (recall the random surfer model). Computationally speaking, it is much easier, starting from a vector x with all entries equal, to compute x, Mx, …, M^k x until convergence than it is to compute the eigenvectors of M directly. In fact, in this case, one only needs to compute the first few iterates in order to get a good approximation of the PageRank vector. For a random matrix, the power method is in general known to be slow to converge. What makes it work fast in this case, however, is the fact that the web graph is sparse. This means that a node i has a small number of outgoing links (a few hundred at most, which is extremely small compared to the 30 billion nodes it could theoretically link to). Hence the transition matrix A has a lot of entries equal to 0.
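To illustrate the role of sparsity, here is a sketch that never forms the dense matrix M: since M = (1 − p)A + pB and, for a probability vector v, Bv is the constant vector with entries 1/n, each multiplication by M only needs the sparse matrix A. This is a minimal illustration using SciPy's sparse matrices, not Google's actual implementation:

    import numpy as np
    from scipy.sparse import csr_matrix

    def pagerank_power(A_sparse, p=0.15, tol=1e-10, max_iter=100):
        """Power method on M = (1-p)A + pB without building the dense matrix M."""
        n = A_sparse.shape[0]
        v = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            # For a probability vector v, B v has all entries equal to 1/n,
            # so one multiplication by M only needs the sparse matrix A.
            v_new = (1 - p) * (A_sparse @ v) + p / n
            if np.abs(v_new - v).sum() < tol:
                break
            v = v_new
        return v_new

    A = csr_matrix(np.array([[0,   0,   1, 1/2],
                             [1/3, 0,   0, 0  ],
                             [1/3, 1/2, 0, 1/2],
                             [1/3, 1/2, 0, 0  ]]))
    print(pagerank_power(A))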

We end the lecture by proposing the following problems:

Problem 3. Compute the PageRank vector of the following graph, considering the damping constant p to be, successively, p = 0, p = 0.15, p = 0.5, and p = 1.

Graph for exercise 3

Problem 4. Compute the PageRank vector of the directed tree depicted below, considering that the damping constant p = 0.15. Interpret your results in terms of the relationship between the number of incoming links that each node has and its rank.

Tree graph for exercise 4

 

Lecture #4: HITS Algorithm – Hubs and Authorities on the Internet

At the same time that PageRank was being developed, Jon Kleinberg, a professor in the Department of Computer Science at Cornell, came up with his own solution to the Web search problem. He developed an algorithm that made use of the link structure of the web in order to discover and rank pages relevant for a particular topic. HITS (hyperlink-induced topic search) is now part of the Ask search engine (www.Ask.com).

One of the interesting points that he brought up was that the human perspective on how a search process should go is more complex than just comparing a list of query words against a list of documents and returning the matches. Suppose we want to buy a car and type in a general query phrase like “the best automobile makers in the last 4 years”, perhaps with the intention of getting back a list of top car brands and their official web sites. When you ask this question to your friends, you expect them to understand that automobile means car or vehicle, and that automobile is a general concept that includes vans, trucks, and other types of vehicles. When you ask this question to a computer that is running a text based ranking algorithm, things might be very different. That computer will count all occurrences of the given words in a given set of documents, but will not do intelligent rephrasing for you. The list of top pages we get back, while algorithmically correct, might be very different from what we expected. One problem is that most official web sites are not sufficiently self-descriptive. They might not advertise themselves the way the general public perceives them. Top companies like Hyundai and Toyota might not even use the term “automobile makers” on their web sites. They might use the term “car manufacturer” instead, or just describe their products and their business.

What is to be done in this case? It would of course be great if computers had a dictionary or ontology, such that for any query they could figure out synonyms and equivalent meanings of phrases. This might improve the quality of search; nevertheless, in the end, we would still have a text based ranking system for the web pages. We would still be left with the initial problem of sorting the huge number of pages that are relevant to the different meanings of the query phrase. We can easily convince ourselves that this is the case. Just remember one of our first examples, about a page that repeats the phrase “automobile makers = cars manufacturers = vehicle designers” a billion times. This web page would be the first one displayed by the query engine. Nevertheless, this page contains practically no usable information.

The conclusion is that, even if finding pages that contain the query words should be the starting point, a different ranking system is needed in order to find those pages that are authoritative for a given query. Page i is called an authority for the query “automobile makers” if it contains valuable information on the subject. Official web sites of car manufacturers, such as www.bmw.com, HyundaiUSA.com, www.mercedes-benz.com would be authorities for this search. Commercial web sites selling cars might be authorities on the subject as well. These are the ones truly relevant to the given query. These are the ones that the user expects back from the query engine. However, there is a second category of pages relevant to the process of finding the authoritative pages, called hubs. Their role is to advertise the authoritative pages. They contain useful links towards the authoritative pages. In other words, hubs point the search engine in the “right direction”. In real life, when you buy a car, you are more inclined to purchase it from a certain dealer that your friend recommends. Following the analogy, the authority in this case would be the car dealer, and the hub would be your friend. You trust your friend, therefore you trust what your friend recommends. In the world wide web, hubs for our query about automobiles might be pages that contain rankings of the cars, blogs where people discuss the cars that they purchased, and so on.

Hubs and Authorities

Jon Kleinberg’s algorithm called HITS identifies good authorities and hubs for a topic by assigning two numbers to a page: an authority and a hub weight. These weights are defined recursively. A higher authority weight occurs if the page is pointed to by pages with high hub weights. A higher hub weight occurs if the page points to many pages with high authority weights.

In order to get a set rich in both hubs and authorities for a query Q, we first collect the top 200 documents that contain the highest number of occurrences of the search phrase Q. These, as pointed out before, may not be of tremendous practical relevance, but one has to start somewhere. Kleinberg points out that the pages from this set, called the root set RQ, are essentially very heterogeneous and in general contain only a few (if any) links to each other. So the web subgraph determined by these nodes is almost totally disconnected; in particular, we cannot apply PageRank techniques to RQ.

Authorities for the query Q are not extremely likely to be in the root set RQ. However, they are likely to be pointed to by at least one page in RQ. So it makes sense to extend the subgraph RQ by including all edges coming from or pointing to nodes of RQ. We denote by SQ the resulting subgraph and call it the seed of our search. Notice that the graph SQ we have constructed is reasonably small (it is certainly much smaller than the 30 billion node web graph!). It is also likely to contain a lot of authoritative sources for Q. The question that remains is how to recognize and rate them. Heuristically, authorities on the same topic should have a lot of common pages from SQ pointing to them. Using our previous terminology, there should be a great overlap in the set of hubs that point to them.

From here on, we translate everything into mathematical language. We associate to each page i two numbers: an authority weight ai, and a hub weight hi. We consider pages with a higher ai number as being better authorities, and pages with a higher hi number as being better hubs. Given the weights {ai} and {hi} of all the nodes in SQ, we dynamically update the weights as follows:

Updating the hub weight of node p:

    h_p = sum of a_q over all nodes q to which p points (all edges p -> q)

Updating the authority weight of node p:

    a_p = sum of h_q over all nodes q that point to p (all edges q -> p)
A good hub increases the authority weight of the pages it points to. A good authority increases the hub weight of the pages that point to it. The idea is then to apply the two operations above alternately until equilibrium values for the hub and authority weights are reached.
Let A be the adjacency matrix of the graph SQ and denote the authority weight vector by v and the hub weight vector by u, where

    v = (a1, a2, …, an)^t and u = (h1, h2, …, hn)^t.

Let us notice that the two update operations described above translate to: v = A^t u and u = A v.

If we consider that the initial hub weights of the nodes are u0 = (1, 1, …, 1)^t, then after k steps we get the recurrence:

    v_k = A^t u_{k-1} = (A^t A)^(k-1) A^t u0,
    u_k = A v_k = (A A^t)^k u0.

Example: Let us consider a very simple graph:

Example

The adjacency matrix of the graph is

    A = [ 0 0 1
          0 0 1
          0 0 0 ],

with transpose

    A^t = [ 0 0 0
            0 0 0
            1 1 0 ].

Assume the initial hub weight vector is u0 = (1, 1, 1)^t.
We compute the authority weight vector by v1 = A^t u0 = (0, 0, 2)^t.
Then, the updated hub weight is u1 = A v1 = (2, 2, 0)^t.

This already corresponds to our intuition that node 3 is the most authoritative, since it is the only one with incoming edges, and that nodes 1 and 2 are equally important hubs. If we repeat the process further, we will only obtain scalar multiples of the vectors v and u computed at step 1. So the relative weights of the nodes remain the same.
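The computation above is easy to check by machine; a short sketch (Python/NumPy, using the adjacency matrix of this example) runs the alternating updates for a few steps:

    import numpy as np

    A = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 0]], dtype=float)   # the example graph: edges 1 -> 3 and 2 -> 3

    u = np.ones(3)                            # u_0 = (1, 1, 1)^t
    for k in range(1, 4):
        v = A.T @ u                           # authority update: v = A^t u
        u = A @ v                             # hub update:       u = A v
        print(k, v, u)                        # (0,0,2), (2,2,0), then scalar multiples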

For more complicated examples of graphs, we would expect the convergence to be problematic, and the equilibrium solutions (if there are any) to be more difficult to find.

Theorem: Under the assumption that A A^t and A^t A are primitive matrices, the following statements hold:

  1. If v1, …, vk, … is the sequence of authority weights we have computed, then V1, …, Vk, … converges to the unique probabilistic vector corresponding to the dominant eigenvalue of the matrix A^t A. With a slight abuse of notation, we denote here by Vk the vector vk normalized so that the sum of its entries is 1.
  2. Likewise, if u1, …, uk, … are the hub weights that we have iteratively computed, then U1, …, Uk, … converges to the unique probabilistic vector corresponding to the dominant eigenvalue of the matrix A A^t. We use the same notation, Uk = (1 ⁄ c) uk, where c is the sum of the entries of the vector uk.

 

So the authority weight vector is the probabilistic eigenvector corresponding to the largest eigenvalue of A^t A, while the hub weights of the nodes are given by the probabilistic eigenvector of the largest eigenvalue of A A^t.

In the background we rely on the following mathematical theorems:

Theorems:

  1. The matrices A A^t and A^t A are real and symmetric, so they have only real eigenvalues.
  2. Perron-Frobenius. If M is a primitive matrix, then:
    1. The largest eigenvalue λ of M is positive and of multiplicity 1.
    2. Every other eigenvalue of M is in modulus strictly less than λ.
    3. The largest eigenvalue λ has a corresponding eigenvector with all entries positive.
  3. Let M be a non-negative, symmetric, primitive matrix and let v be the eigenvector corresponding to its largest eigenvalue, normalized so that the sum of its entries equals 1. Let z be a column vector with all entries non-negative (and not all zero). If we normalize the vectors z, Mz, …, M^k z in the same way, then the sequence converges to v.

We use the notion of “convergence” in here in a loose sense. We say that a sequence of vectors zk converges to a vector v in the intuitive sense that as k gets big, the entries in the column vector zk are very close to the corresponding entries of the column vector v. Without going into the technical details of the proof, the power method works because we have only one largest eigenvalue that dominates the behavior.
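To connect the iteration with the theorem, the following sketch (Python/NumPy, again for the small example above) computes the dominant eigenvectors of A^t A and A A^t directly and normalizes them to sum to 1; for this tiny example the primitivity assumption does not literally hold, but the result still matches the normalized limits of the iteration:

    import numpy as np

    A = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 0]], dtype=float)     # edges 1 -> 3 and 2 -> 3

    def dominant_probabilistic_eigenvector(S):
        """Eigenvector of the largest eigenvalue of S, rescaled to sum to 1."""
        eigvals, eigvecs = np.linalg.eigh(S)   # S is symmetric, so eigh applies
        w = np.abs(eigvecs[:, np.argmax(eigvals)])   # fix the sign
        return w / w.sum()

    print(dominant_probabilistic_eigenvector(A.T @ A))   # authorities: [0, 0, 1]
    print(dominant_probabilistic_eigenvector(A @ A.T))   # hubs: [0.5, 0.5, 0]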

The HITS algorithm is in the same spirit as PageRank: they both make use of the link structure of the Web graph in order to decide the relevance of the pages. The difference is that, unlike the PageRank algorithm, HITS only operates on a small subgraph (the seed SQ) of the web graph. This subgraph is query dependent; whenever we search with a different query phrase, the seed changes as well. HITS ranks the seed nodes according to their authority and hub weights. The highest ranking pages are then displayed to the user by the query engine.

Problem 1: Prove that for any square matrix A, the matrices A^t A and A A^t are symmetric.

Problem 2: Compute the hub and authority weights for the following graph:

Graph S

Hint: Compute the adjacency matrix A and show that the eigenvalues of A A^t are 0, 1, and 3. The normalized eigenvector for the largest eigenvalue λ = 3 is the hub weight vector. What can you say about the authority weights of the 4 nodes? Does this correspond to your intuition?

 

Lecture #5: Internet Mathematics – Further Directions

The development of the World Wide Web led to a whole new field of mathematics called Internet Mathematics. It is a mixture of probability, linear algebra, graph theory and dynamical systems, designed to answer pressing questions about how information propagates over the net.

In the previous lectures you have seen how Google uses PageRank to determine the relevance of web pages, thus making Internet search simple and efficient. From a web designer’s perspective, it is important not only to create a nice web site featuring interesting graphical effects, but also to pay attention to which other pages link to it. A good PageRank can turn your business into a very profitable one. From Google’s point of view, however, it is important to keep PageRank computations accurate and to prevent fraud attempts. Link spam is defined as the intentional attempt to improve the ranking of a page on the search engine by increasing the number of pages that link to it. Link farms involve creating strongly connected communities of web pages referencing each other, humorously known as “mutual admiration societies”. Although some link farms can be created by hand, most are created through automated programs and services. It is important therefore to recognize dense subgraphs of the web graph designed to do just that, and to eliminate them from PageRank computations. Although at its core PageRank works as described in the previous lectures, in the real world Google uses improved search algorithms to deal with these problems; these algorithms can assign different weights to outgoing links, or decide that certain links should not transfer PageRank at all, to name just a few of the extra features. It is important to notice, however, that as stated on Google’s website, “while we have dozens of engineers working to improve every aspect of Google on a daily basis, PageRank continues to provide the basis for all of our web search tools”.

Another research direction is the analysis and comparison of different ranking algorithms. One property that makes an algorithm good is stability. The World Wide Web is in constant change, pages are added and deleted at every moment. Stability measures how much rankings change if the Web graph is slightly perturbed. Ideally we would like the ranking algorithms to be stable so that, for instance, we would not have to constantly recompute the PageRank.

With a lot of interest coming from both mathematicians and computer scientists, understanding the structure of the Internet graph is a key research problem. Developing models for an Internet graph that has billions of nodes is, however, far from trivial. The web is often described as having a bow-tie structure. The knot of the bow-tie is represented by a strongly connected component of the graph, called the core. One side of the bow-tie consists of pages that link to the core, while the other side contains pages that receive at least one link from the core. The rest of the web pages are either in disconnected components, or in tendrils that only link to pages outside of the core. Within this structure, social networks and web communities are defined as subgraphs with more internal links than external ones. Spectral methods are used to analyse the distribution of these communities.

Semi-random Internet topology. The Internet nodes are red dots and linkages are green lines.

Understanding the Internet graph can help answer questions about how information propagates over the net. Computer virus propagation over the net is of particular interest. Understanding the rate at which computer viruses propagate can help antivirus software makers anticipate and quarantine infected areas.

Although in our previous lectures we only talked about finite graphs, we can also view the web as an infinite graph. The reason for this is that web pages are of two kinds: static and dynamic. Static pages contain information that remains unmodified. Dynamic pages are those built “on the spot”, as a result of a query. For instance, if we do a Google search with the phrase “mathematics and computer science”, Google returns a page with relevant links and their descriptions. Of course, Google does not have a page prepared beforehand for every possible query a random user might have. For every query it receives, it searches through its index to track the documents which contain the given search words, then runs the PageRank algorithm to rank the relevant hits. Only after that does it build up the answer to the user. We call such a page a dynamic page. While the number of static pages is finite (over 30 billion, but still finite), the number of dynamic pages is virtually infinite. Such a complete view of the web, including both static and dynamic content, can be given by infinite graph theory. The mathematics behind it is both interesting and challenging, and it is an ongoing research field.

The book “A Course on the Web Graph” by Anthony Bonato and the online journal “Internet Mathematics”, established in 2003, provide excellent sources of information on how mathematics improves our Internet experience. All the topics discussed above are covered in detail, with a lot of entertaining examples.

References

We found most useful in preparing these lectures “A Course on the Web Graph” by Anthony Bonato and “Authoritative sources in a hyperlinked environment” by Jon Kleinberg, both of which are written in a very clear and engaging style. We used the second for presenting the philosophy of searching the web and the solution given by the HITS algorithm. “The $25,000,000,000 Eigenvector: The Linear Algebra behind Google” provides a comprehensive treatment of the PageRank algorithm while remaining elementary and accessible. For a more computer-science-oriented view of a search engine, it is very useful to read “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by the Google founders Larry Page and Sergey Brin. The book “A Course on the Web Graph” is addressed to both mathematicians and computer scientists; it provides a comprehensive treatment of the state of the art of many applications of graph theory to the study of the Web graph. The last reference is to the online journal Internet Mathematics.

 

  1. Anthony Bonato, A Course on The Web Graph, AMS-AARMS, Graduate Studies in Mathematics v. 89, 2008. http://www.math.ryerson.ca/~abonato/webgraph.html
  2. Kurt Bryan, Tanya Leise, The $25,000,000,000 Eigenvector: The Linear Algebra behind Google, SIAM Review, Vol. 48, No. 3. (2006).
    http://www.siam.org/journals/sirev/48-3/62328.html
  3. Sergey Brin, Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Seventh International World-Wide Web Conference (WWW 1998).
    http://infolab.stanford.edu/pub/papers/google.pdf
  4. M. Brin, G. Stuck, Introduction to Dynamical Systems , Cambridge University Press, 2002.
  5. David Austin, How Google Finds Your Needle in the Web’s Haystack.
    http://www.ams.org/featurecolumn/archive/pagerank.html
  6. Jon Kleinberg, Authoritative sources in a hyperlinked environment, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
    http://www.cs.cornell.edu/home/kleinber/auth.pdf
  7. Internet Mathematics Journal
    http://www.internetmathematics.org/

PageRank – eFactory.de

http://pr.efactory.de/e-pagerank-algorithm.shtml

A Survey of Google’s PageRank

Within the past few years, Google has become by far the most utilized search engine worldwide. A decisive factor, besides high performance and ease of use, was the superior quality of its search results compared to other search engines. This quality of search results is substantially based on PageRank, a sophisticated method to rank web documents.

The aim of these pages is to provide a broad survey of all aspects of PageRank. The contents of these pages primarily rest upon papers by Google founders Lawrence Page and Sergey Brin from their time as graduate students at Stanford University.

It is often argued that, especially considering the dynamics of the internet, too much time has passed since the scientific work on PageRank for it still to be the basis of the ranking methods of the Google search engine. There is no doubt that many changes, adjustments and modifications of Google’s ranking methods have taken place within the past years, but PageRank was absolutely crucial for Google’s success, so at least the fundamental concept behind PageRank should still be in place.

The PageRank Concept

Since the early stages of the world wide web, search engines have developed different methods to rank web pages. To this day, the occurrence of a search phrase within a document is one major factor within the ranking techniques of virtually any search engine. The occurrence of a search phrase can thereby be weighted by the length of a document (ranking by keyword density) or by its accentuation within a document by HTML tags.

For the purpose of better search results and especially to make search engines resistant against automatically generated web pages based upon the analysis of content specific ranking criteria (doorway pages), the concept of link popularity was developed. Following this concept, the number of inbound links for a document measures its general importance. Hence, a web page is generally more important if many other web pages link to it. The concept of link popularity often prevents good rankings for pages which are only created to deceive search engines and which don’t have any significance within the web, but numerous webmasters circumvent it by creating masses of inbound links for doorway pages from equally insignificant other web pages.

Contrary to the concept of link popularity, PageRank is not simply based upon the total number of inbound links. The basic approach of PageRank is that a document is in fact considered the more important the more other documents link to it, but those inbound links do not count equally. First of all, a document ranks high in terms of PageRank, if other high ranking documents link to it.

So, within the PageRank concept, the rank of a document is given by the rank of those documents which link to it. Their rank again is given by the rank of documents which link to them. Hence, the PageRank of a document is always determined recursively by the PageRank of other documents. Since – even if marginal and via many links – the rank of any document influences the rank of any other, PageRank is, in the end, based on the linking structure of the whole web. Although this approach seems to be very broad and complex, Page and Brin were able to put it into practice by a relatively trivial algorithm.

The PageRank Algorithm

The original PageRank algorithm was described by Lawrence Page and Sergey Brin in several publications. It is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

where

PR(A) is the PageRank of page A,
PR(Ti) is the PageRank of pages Ti which link to page A,
C(Ti) is the number of outbound links on page Ti and
d is a damping factor which can be set between 0 and 1.

So, first of all, we see that PageRank does not rank web sites as a whole, but is determined for each page individually. Further, the PageRank of page A is recursively defined by the PageRanks of those pages which link to page A.

The PageRank of pages Ti which link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less will page A benefit from a link to it on page T.

The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A’s PageRank.

Finally, the sum of the weighted PageRanks of all pages Ti is multiplied with a damping factor d which can be set between 0 and 1. Thereby, the extent of PageRank benefit for a page by another page linking to it is reduced.

The Random Surfer Model

In their publications, Lawrence Page and Sergey Brin give a very simple intuitive justification for the PageRank algorithm. They consider PageRank as a model of user behaviour, where a surfer clicks on links at random with no regard towards content.

The random surfer visits a web page with a certain probability which derives from the page’s PageRank. The probability that the random surfer clicks on one link is solely given by the number of links on that page. This is why one page’s PageRank is not completely passed on to a page it links to, but is divided by the number of links on the page.

So, the probability for the random surfer reaching one page is the sum of probabilities for the random surfer following links to this page. Now, this probability is reduced by the damping factor d. The justification within the Random Surfer Model, therefore, is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random.

The probability that the random surfer keeps clicking on links is given by the damping factor d, which is set to a value between 0 and 1. The higher d is, the more likely the random surfer will keep clicking links. Since the surfer jumps to another page at random after he stops clicking links, this probability is implemented as the constant (1-d) in the algorithm. Regardless of inbound links, the probability of the random surfer jumping to a page is always (1-d), so a page always has a minimum PageRank.

A Different Notation of the PageRank Algorithm

Lawrence Page and Sergey Brin have published two different versions of their PageRank algorithm in different papers. In the second version of the algorithm, the PageRank of page A is given as

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

where N is the total number of all pages on the web. The second version of the algorithm, indeed, does not differ fundamentally from the first one. Regarding the Random Surfer Model, the second version’s PageRank of a page is the actual probability for a surfer reaching that page after clicking on many links. The PageRanks then form a probability distribution over web pages, so the sum of all pages’ PageRanks will be one.

In contrast, in the first version of the algorithm the probability of the random surfer reaching a page is weighted by the total number of web pages. So, in this version PageRank is an expected value for the random surfer visiting a page, when he restarts this procedure as often as the web has pages. If the web had 100 pages and a page had a PageRank value of 2, the random surfer would reach that page on average twice if he restarted the procedure 100 times.

As mentioned above, the two versions of the algorithm do not differ fundamentally from each other. A PageRank which has been calculated by using the second version of the algorithm has to be multiplied by the total number of web pages to get the according PageRank that would have been calculated by using the first version. Even Page and Brin mixed up the two algorithm versions in their most popular paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, where they claim the first version of the algorithm to form a probability distribution over web pages with the sum of all pages’ PageRanks being one.

In the following, we will use the first version of the algorithm. The reason is that PageRank calculations by means of this algorithm are easier to compute, because we can disregard the total number of web pages.

The Characteristics of PageRank

The characteristics of PageRank shall be illustrated by a small example.

We regard a small web consisting of three pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. The exact value of the damping factor d admittedly has effects on PageRank, but it does not influence the fundamental principles of PageRank. So, we get the following equations for the PageRank calculation:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These equations can easily be solved. We get the following PageRank values for the single pages:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615

It is obvious that the sum of all pages’ PageRanks is 3 and thus equals the total number of web pages. As shown above this is not a specific result for our simple example.

For our simple three-page example it is easy to solve the according equation system to determine PageRank values. In practice, the web consists of billions of documents and it is not possible to find a solution by inspection.
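For the three-page example, though, a computer can solve the linear system directly; a small sketch in Python with NumPy (our own matrix formulation of the three equations, not part of the original text) confirms the values above:

    import numpy as np

    d = 0.5
    # L[i, j] = 1 / C(j) if page j links to page i (pages A, B, C indexed 0, 1, 2)
    L = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])

    # The equations PR = (1 - d) + d * L * PR rewritten as (I - d L) PR = (1 - d)
    PR = np.linalg.solve(np.eye(3) - d * L, (1 - d) * np.ones(3))
    print(PR)   # approx. [1.0769, 0.7692, 1.1538] = [14/13, 10/13, 15/13]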

The Iterative Computation of PageRank

Because of the size of the actual web, the Google search engine uses an approximative, iterative computation of PageRank values. This means that each page is assigned an initial starting value and the PageRanks of all pages are then calculated in several computation cycles based on the equations determined by the PageRank algorithm. The iterative calculation shall again be illustrated by our three-page example, whereby each page is assigned a starting PageRank value of 1.

Iteration PR(A) PR(B) PR(C)
0 1 1 1
1 1 0.75 1.125
2 1.0625 0.765625 1.1484375
3 1.07421875 0.76855469 1.15283203
4 1.07641602 0.76910400 1.15365601
5 1.07682800 0.76920700 1.15381050
6 1.07690525 0.76922631 1.15383947
7 1.07691973 0.76922993 1.15384490
8 1.07692245 0.76923061 1.15384592
9 1.07692296 0.76923074 1.15384611
10 1.07692305 0.76923076 1.15384615
11 1.07692307 0.76923077 1.15384615
12 1.07692308 0.76923077 1.15384615

We see that we get a good approximation of the real PageRank values after only a few iterations. According to publications of Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.
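The table can be reproduced in a few lines if each pass updates the pages in the order A, B, C and reuses the values already updated within the same pass, which is what the figures above correspond to; this is only a sketch of the first-version formula with d = 0.5, not Google's actual computation:

    # Minimal sketch mirroring the table above (d = 0.5, starting value 1 for every page)
    PR_A = PR_B = PR_C = 1.0
    for i in range(1, 13):
        PR_A = 0.5 + 0.5 * PR_C                    # A is linked to by C
        PR_B = 0.5 + 0.5 * (PR_A / 2)              # B is linked to by A (2 outbound links)
        PR_C = 0.5 + 0.5 * (PR_A / 2 + PR_B)       # C is linked to by A and B
        print(i, round(PR_A, 8), round(PR_B, 8), round(PR_C, 8))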

The Implementation of PageRank in the Google Search Engine

Regarding the implementation of PageRank, first of all, it is important how PageRank is integrated into the general ranking of web pages by the Google search engine. The procedure has been described by Lawrence Page and Sergey Brin in several publications. Initially, the ranking of web pages by the Google search engine was determined by three factors:

Page specific factors
Anchor text of inbound links
PageRank

Page specific factors are, besides the body text, for instance the content of the title tag or the URL of the document. It is more than likely that since the publications of Page and Brin more factors have joined the ranking methods of the Google search engine. But this shall not be of interest here.

In order to provide search results, Google computes an IR score out of page specific factors and the anchor text of inbound links of a page, which is weighted by position and accentuation of the search term within the document. This way the relevance of a document for a query is determined. The IR score is then combined with PageRank as an indicator for the general importance of the page. To combine the IR score with PageRank, the two values are multiplied. It is obvious that they cannot be added, since otherwise pages with a very high PageRank would rank high in search results even if the page is not related to the search query.

Especially for queries consisting of two or more search terms, there is a far bigger influence of the content related ranking criteria, whereas the impact of PageRank is mainly visible for unspecific single word queries. If webmasters target search phrases of two or more words it is possible for them to achieve better rankings than pages with high PageRank by means of classical search engine optimisation.

If pages are optimised for highly competitive search terms, it is essential for good rankings to have a high PageRank, even if a page is well optimised in terms of classical search engine optimisation. The reason is that the increase in IR score diminishes the more often the keyword occurs within the document or the anchor texts of inbound links, in order to avoid spam by extensive keyword repetition. Thereby, the potential of classical search engine optimisation is limited and PageRank becomes the decisive factor in highly competitive areas.

The PageRank Display of the Google Toolbar

PageRank became widely known by the PageRank display of the Google Toolbar. The Google Toolbar is a browser plug-in for Microsoft Internet Explorer which can be downloaded from the Google web site. The Google Toolbar provides some features for searching Google more comfortably.

The Google Toolbar displays PageRank on a scale from 0 to 10. First of all, the PageRank of the currently visited page can be estimated by the width of the green bar within the display. If the user holds his mouse over the display, the Toolbar also shows the PageRank value. Caution: The PageRank display is one of the advanced features of the Google Toolbar, and if those advanced features are enabled, Google collects usage data. Additionally, the Toolbar is self-updating and the user is not informed about updates. So, Google has access to the user’s hard drive.

If we take into account that PageRank can theoretically have a maximum value of up to dN+(1-d), where N is the total number of web pages and d is usually set to 0.85, PageRank has to be scaled for the display on the Google Toolbar. It is generally assumed that the scaling is not linear but logarithmic. At a damping factor of 0.85, and therefore a minimum PageRank of 0.15, and an assumed logarithmic base of 6, we get a scaling as follows:

Toolbar PR  Actual PR
0/10 0.15 - 0.9
1/10 0.9 - 5.4
2/10 5.4 - 32.4
3/10 32.4 - 194.4
4/10 194.4 - 1,166.4
5/10 1,166.4 - 6,998.4
6/10 6,998.4 - 41,990.4
7/10 41,990.4 - 251,942.4
8/10 251,942.4 - 1,511,654.4
9/10 1,511,654.4 - 9,069,926.4
10/10 9,069,926.4 - 0.85 × N + 0.15

It is uncertain whether a logarithmic scaling in a strictly mathematical sense actually takes place. More likely there is a manual scaling which follows a logarithmic scheme, so that Google has control over the number of pages within the single Toolbar PageRank ranges. The logarithmic base for this scheme should be between 6 and 7, which can, for instance, be roughly deduced from the number of inbound links of pages with a high Toolbar PageRank from pages with a Toolbar PageRank higher than 4, which are shown by Google using the link command.
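Under the assumed logarithmic scheme with base 6 and a minimum PageRank of 0.15, the ranges in the table above are simply 0.15 × 6^k up to 0.15 × 6^(k+1); a tiny Python sketch recomputes them (the base of 6 is, as discussed, only an assumption, and the top range is in reality capped at the theoretical maximum 0.85 × N + 0.15):

    base, min_pr = 6, 0.15
    for toolbar_pr in range(11):
        low = min_pr * base ** toolbar_pr
        high = min_pr * base ** (toolbar_pr + 1)
        # the 10/10 range is actually capped at 0.85 * N + 0.15, not at 0.15 * 6**11
        print(f"{toolbar_pr}/10:", round(low, 2), "-", round(high, 2))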

The Toolbar’s PageRank Files

Even webmasters who do not want to use the Google Toolbar or Internet Explorer permanently, for security and privacy concerns, have the possibility to check the PageRank values of their pages. Google submits PageRank values in simple text files to the Toolbar. In former times, this happened via XML. The switch to text files occurred in August 2002.

The PageRank files can be requested directly from the domain www.google.com. Basically, the URLs for those files look like follows (without line breaks):

http://www.google.com/search?client=navclient-auto&

ch=0123456789&features=Rank&q=info:http://www.domain.com/

There is only one line of text in the PageRank files. The last number in this line is the PageRank.

The parameters incorporated in the URL shown above are required for displaying the PageRank files in a browser. The value “navclient-auto” for the parameter “client” identifies the Toolbar. Via the parameter “q” the URL is submitted. The value “Rank” for the parameter “features” determines that the PageRank files are requested; if it is omitted, Google’s servers still transmit XML files. The parameter “ch” transfers a checksum for the URL to Google, whereby this checksum can only change when the Toolbar version is updated by Google.

Thus, it is necessary to install the Toolbar at least once to find out the checksums of one’s URLs. To track the communication between the Toolbar and Google, the use of packet sniffers, local proxies and similar tools is often suggested. But this is not strictly necessary, since the PageRank files are cached by Internet Explorer. So, the checksums can simply be found by having a look at the Temporary Internet Files folder. Knowing the checksums of your URLs, you can view the PageRank files in your browser and you do not have to accept Google’s cookies, which last 36 years.

Since the PageRank files are kept in the browser cache and, thus, are clearly visible, and as long as requests are not automated, watching the PageRank files in a browser should not be a violation of Google’s Terms of Service. However, you should be cautious. The Toolbar submits its own User-Agent to Google. It is:

Mozilla/4.0 (compatible; GoogleToolbar 1.1.60-deleon; OS SE 4.10)

1.1.60-deleon is a Toolbar version which may of course change. OS is the operating system that you have installed. So, Google is able to identify requests by browsers, if they do not go out via a proxy and if the User-Agent is not modified accordingly.

Taking a look at IE’s cache, one will normally notice that the PageRank files are not requested from the domain www.google.com but from IP addresses like 216.239.33.102. Additionally, the PageRank files’ URLs often contain a parameter “failedip” that is set to values like “216.239.35.102;1111” (its function is not absolutely clear). The IP addresses are each related to one of Google’s seven data centers, and the reason for the Toolbar querying IP addresses is most likely to control the PageRank display in a better way, especially in times of the “Google Dance”.

The PageRank Display at the Google Directory

Webmasters who do not want to check the PageRank files that are used by the toolbar have another possibility to receive information about the PageRank of their sites by means of the Google Directory (directory.google.com).

The Google Directory is a dump of the Open Directory Project (dmoz.org), which shows the PageRank for listed documents scaled similarly to the Google Toolbar display, by means of a green bar. In contrast to the Toolbar, the scale is from 1 to 7. The exact value is not displayed, but if one is not sure by looking at the bar, it can be determined from the divided bar, i.e. the width of the single graphics, in the source code of the page.

By comparing the Toolbar PageRank of a document with its Directory PageRank, a more exact estimation of a page’s PageRank can be deduced, if the page is listed with the ODP. This connection was first mentioned by Chris Raimondi (www.searchnerd.com/pagerank).

Especially for pages with a Toolbar PageRank of 5 or 6, one can judge whether the page is at the upper or the lower end of its Toolbar range. It should be noted that the Toolbar PageRank of 0 was not taken into account for this comparison. It can easily be verified that this is appropriate by looking at pages with a Toolbar PageRank of 3. However, for such a verification, pages of the Google Directory (i.e., the ODP) with a Toolbar PageRank of 4 or lower have to be chosen, since otherwise no pages linked from there with a Toolbar PageRank of 3 will be found.

The Effect of Inbound Links

It has already been shown that each additional inbound link for a web page always increases that page’s PageRank. Taking a look at the PageRank algorithm, which is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

one may assume that an additional inbound link from page X increases the PageRank of page A by

d × PR(X) / C(X)

where PR(X) is the PageRank of page X and C(X) is the total number of its outbound links. But page A usually links to other pages itself. Thus, these pages get a PageRank benefit also. If these pages link back to page A, page A will have an even higher PageRank benefit from its additional inbound link.

The single effects of additional inbound links shall be illustrated by an example.

We regard a website consisting of four pages A, B, C and D which are linked to each other in a circle. Without external inbound links to one of these pages, each of them obviously has a PageRank of 1. We now add a page X to our example, for which we presume a constant PageRank PR(X) of 10. Further, page X links to page A by its only outbound link. Setting the damping factor d to 0.5, we get the following equations for the PageRank values of the single pages of our site:

PR(A) = 0.5 + 0.5 (PR(X) + PR(D)) = 5.5 + 0.5 PR(D)
PR(B) = 0.5 + 0.5 PR(A)
PR(C) = 0.5 + 0.5 PR(B)
PR(D) = 0.5 + 0.5 PR(C)

Since the total number of outbound links for each page is one, the outbound links do not need to be considered in the equations. Solving them gives us the following PageRank values:

PR(A) = 19/3 = 6.33
PR(B) = 11/3 = 3.67
PR(C) = 7/3 = 2.33
PR(D) = 5/3 = 1.67

We see that the initial effect of the additional inbound link of page A, which was given by

d × PR(X) / C(X) = 0.5 × 10 / 1 = 5

is passed on by the links on our site.
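The four equations of this example can also be handed to a linear solver; a small sketch (Python/NumPy, our own matrix formulation) confirms the values above:

    import numpy as np

    d, PR_X = 0.5, 10.0
    # Unknowns ordered as (A, B, C, D); page X contributes a constant term to A's equation:
    # PR(A) - d*PR(D) = (1-d) + d*PR(X),  PR(B) - d*PR(A) = (1-d),  and so on.
    coeff = np.array([[ 1,  0,  0, -d],
                      [-d,  1,  0,  0],
                      [ 0, -d,  1,  0],
                      [ 0,  0, -d,  1]], dtype=float)
    rhs = np.array([(1 - d) + d * PR_X, 1 - d, 1 - d, 1 - d])
    print(np.linalg.solve(coeff, rhs))   # approx. [6.33, 3.67, 2.33, 1.67]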

The Influence of the Damping Factor

The degree of PageRank propagation from one page to another by a link is primarily determined by the damping factor d. If we set d to 0.75 we get the following equations for our above example:

PR(A) = 0.25 + 0.75 (PR(X) + PR(D)) = 7.75 + 0.75 PR(D)
PR(B) = 0.25 + 0.75 PR(A)
PR(C) = 0.25 + 0.75 PR(B)
PR(D) = 0.25 + 0.75 PR(C)

Solving these equations gives us the following PageRank values:

PR(A) = 419/35 = 11.97
PR(B) = 323/35 = 9.23
PR(C) = 251/35 = 7.17
PR(D) = 197/35 = 5.63

First of all, we see that there is a significantly higher initial effect of the additional inbound link for page A, which is given by

d × PR(X) / C(X) = 0.75 × 10 / 1 = 7.5

This initial effect is then propagated even more strongly by the links on our site. In this way, the PageRank of page A is almost twice as high at a damping factor of 0.75 as it is at a damping factor of 0.5. At a damping factor of 0.5, the PageRank of page A is almost four times as high as the PageRank of page D, while at a damping factor of 0.75 it is only a little more than twice as high. So, the higher the damping factor, the larger the effect of an additional inbound link for the PageRank of the page that receives the link, and the more evenly PageRank is distributed over the other pages of a site.

The Actual Effect of Additional Inbound Links

At a damping factor of 0.5, the accumulated PageRank of all pages of our site is given by

PR(A) + PR(B) + PR(C) + PR(D) = 14

Hence, by a page with a PageRank of 10 linking to one page of our example site by its only outbound link, the accumulated PageRank of all pages of the site is increased by 10. (Before adding the link, each page has had a PageRank of 1.) At a damping factor of 0.75 the accumulated PageRank of all pages of the site is given by

PR(A) + PR(B) + PR(C) + PR(D) = 34

This time the accumulated PageRank increases by 30. The accumulated PageRank of all pages of a site always increases by

(d / (1-d)) × (PR(X) / C(X))

where X is a page additionally linking to one page of the site, PR(X) is its PageRank and C(X) its number of outbound links. The formula presented above is only valid, if the additional link points to a page within a closed system of pages, as, for instance, a website without outbound links to other sites. As far as the website has links pointing to external pages, the surplus for the site itself diminishes accordingly, because a part of the additional PageRank is propagated to external pages.
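A quick check of this formula against the two examples above (PR(X) = 10, C(X) = 1), e.g. in Python:

    PR_X, C_X = 10.0, 1
    for d in (0.5, 0.75):
        # accumulated PageRank gain of the closed system of pages
        print(d, (d / (1 - d)) * (PR_X / C_X))   # 10.0 for d = 0.5, 30.0 for d = 0.75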

The justification of the above formula is given by Raph Levien and it is based on the Random Surfer Model. The walk length of the random surfer follows a geometric distribution with a mean of (d/(1-d)). When the random surfer follows a link into a closed system of web pages, he visits on average (d/(1-d)) pages within that closed system. So, this much more PageRank of the linking page – weighted by the number of its outbound links – is distributed to the closed system.

For the actual PageRank calculations at Google, Lawrence Page and Sergey Brin claim to usually set the damping factor d to 0.85. Thereby, the boost for a closed system of web pages by an additional link from page X is given by

(0.85 / 0.15) × (PR(X) / C(X)) = 5.67 × (PR(X) / C(X))

So, inbound links have a far larger effect than one may assume.

The PageRank-1 Rule

Users of the Google Toolbar often notice that pages with a certain Toolbar PageRank have an inbound link from a page with a Toolbar PageRank which is higher by one. Some take this observation to doubt the validity of the PageRank algorithm presented here for the actual ranking methods of the Google search engine. It shall be shown, however, that the PageRank-1 rule complies with the PageRank algorithm.

Basically, the PageRank-1 rule proves the fundamental principle of PageRank. Web pages are important themselves if other important web pages link to them. It is not necessary for a page to have many inbound links to rank well. A single link from a high ranking page is sufficient.

To show that the PageRank-1 rule is actually consistent with the PageRank algorithm, several factors have to be taken into consideration. First of all, the Toolbar PageRank is a logarithmically scaled version of real PageRank values. If the PageRank value of one page is one higher than that of another page in terms of Toolbar PageRank, its real PageRank can be higher by a factor of up to the logarithmic base used for the scaling of Toolbar PageRank. If the logarithmic base for the scaling is 6 and the Toolbar PageRank of a linking page is 5, then the real PageRank of the page which receives the link can be up to 6 times smaller and that page will still get a Toolbar PageRank of 4.

However, the number of outbound links on the linking page counteracts the effect of the logarithmic base, because the PageRank propagated from one page to another is divided by the number of outbound links on the linking page. But it has already been shown that the PageRank benefit of a link is higher than the PageRank algorithm’s term d(PR(Ti)/C(Ti)) suggests. The reason is that the PageRank benefit for one page is further distributed to other pages within the site. If those pages link back, as usually happens, the PageRank benefit for the page which initially received the link is accordingly higher. If we assume that, at a high damping factor, the logarithmic base for PageRank scaling is 6 and a page receives a PageRank benefit which is twice as high as the PageRank of the linking page divided by the number of its outbound links, then the linking page could have up to 12 outbound links and the Toolbar PageRank of the page receiving the link would still be at most one lower than the Toolbar PageRank of the linking page.

A number of 12 outbound links admittedly seems relatively small. But normally, if a page has an external inbound link, this is not the only one for that page. Most likely other pages link to that page and propagate PageRank to it. And if there are examples where a page receives a single link from another page and the PageRanks of both pages comply with the PageRank-1 rule although the linking page has many outbound links, this is first of all an indication that the linking page’s Toolbar PageRank is at the upper end of its range. The linking page could be a “high” 5 and the page receiving the link could be a “low” 4. In this way, the linking page could have up to 72 outbound links. This number rises accordingly if we assume a higher logarithmic base for the scaling of Toolbar PageRank.

The Effect of Outbound Links

Since PageRank is based on the linking structure of the whole web, it is inescapable that if the inbound links of a page influence its PageRank, its outbound links also have some impact. To illustrate the effects of outbound links, we take a look at a simple example.

We regard a web consisting of two websites, each having two web pages. One site consists of pages A and B, the other consists of pages C and D. Initially, the two pages of each site solely link to each other. It is obvious that each page then has a PageRank of one. Now we add a link which points from page A to page C. At a damping factor of 0.75, we get the following equations for the single pages' PageRank values:

PR(A) = 0.25 + 0.75 PR(B)
PR(B) = 0.25 + 0.375 PR(A)
PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A)
PR(D) = 0.25 + 0.75 PR(C)

Solving the equations gives us the following PageRank values for the first site:

PR(A) = 14/23
PR(B) = 11/23

We therefore get an accumulated PageRank of 25/23 for the first site. The PageRank values of the second site are given by

PR(C) = 35/23
PR(D) = 32/23

So, the accumulated PageRank of the second site is 67/23. The total PageRank for both sites is 92/23 = 4. Hence, adding a link has no effect on the total PageRank of the web. Additionally, the PageRank benefit for one site equals the PageRank loss of the other.
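These values can be verified with a few lines of Python; the sketch below simply iterates the four equations until they converge (the page names and the damping factor are those of the example):

d = 0.75
pr = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}

for _ in range(100):
    pr["A"] = (1 - d) + d * pr["B"]                 # B links only to A
    pr["B"] = (1 - d) + d * pr["A"] / 2             # A links to B and C
    pr["C"] = (1 - d) + d * (pr["D"] + pr["A"] / 2)
    pr["D"] = (1 - d) + d * pr["C"]

print(pr)                   # A = 14/23, B = 11/23, C = 35/23, D = 32/23
print(sum(pr.values()))     # 4.0 - the total PageRank is unchanged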

The Actual Effect of Outbound Links

As has already been shown, the PageRank benefit for a closed system of web pages from an additional inbound link is given by

(d / (1-d)) × (PR(X) / C(X)),

where X is the linking page, PR(X) is its PageRank and C(X) is the number of its outbound links. Hence, this value also represents the PageRank loss of a formerly closed system of web pages, when a page X within this system of pages now points by a link to an external page.

The validity of the above formula requires that the page which receives the link from the formerly closed system of pages does not link back to that system, since the system would otherwise regain some of the lost PageRank. Of course, the same effect occurs when it is not the receiving page itself that links back directly, but another page which has an inbound link from it. However, this effect can be disregarded because of the damping factor if there are enough other web pages in between along the link path back. The validity of the formula also requires that the linking site has no other external outbound links. If it has, the PageRank loss of the regarded site diminishes, and the pages which already receive a link from the linking page lose PageRank accordingly.

Even if the actual PageRank values of the pages of an existing website were known, it would not be possible to calculate to what extent an added outbound link diminishes the site's PageRank loss, since the formula presented above refers to the state after the link has been added.

Intuitive Justification of the Effect of Outbound Links

According to the Random Surfer Model, the intuitive justification for the loss of PageRank caused by an additional external outbound link is that, by adding such a link to one page, the surfer becomes less likely to follow an internal link on that page. So, the probability of the surfer reaching other pages within the site diminishes. If those other pages of the site link back to the page to which the external outbound link has been added, this page's own PageRank decreases as well.

We can conclude that external outbound links diminish the total PageRank of a site and probably also the PageRank of each single page of the site. But since links between websites are the foundation of PageRank and indispensable for its functioning, it is possible that outbound links have positive effects within other parts of Google's ranking criteria. After all, relevant outbound links contribute to the quality of a web page, and a webmaster who points to other pages integrates their content, in a way, into his own site.

Dangling Links

An important aspect of outbound links is the lack of them on web pages. When a web page has no outbound links, its PageRank cannot be distributed to other pages. Lawrence Page and Sergey Brin characterise links to those pages as dangling links.

The effect of dangling links shall be illustrated by a small example website. We take a look at a site consisting of three pages A, B and C. In our example, the pages A and B link to each other. Additionally, page A links to page C. Page C itself has no outbound links to other pages. At a damping factor of 0.75, we get the following equations for the single pages’ PageRank values:

PR(A) = 0.25 + 0.75 PR(B)
PR(B) = 0.25 + 0.375 PR(A)
PR(C) = 0.25 + 0.375 PR(A)

Solving the equations gives us the following PageRank values:

PR(A) = 14/23
PR(B) = 11/23
PR(C) = 11/23

So, the accumulated PageRank of all three pages is 36/23, which is just over half the value we could have expected if page C had a link to one of the other pages. According to Page and Brin, the number of dangling links in Google's index is fairly high. One reason for this is that many linked pages are not indexed by Google, for example because indexing is disallowed by a robots.txt file. Additionally, Google meanwhile indexes several file types and not only HTML. PDF or Word files usually have no outbound links and, hence, dangling links could have a major impact on PageRank.

In order to protect PageRank from the negative effects of dangling links, pages without outbound links have to be removed from the database before the PageRank values are computed. According to Page and Brin, the number of outbound links on pages which link to dangling pages is thereby normalised. Removing one page can cause other pages to become dangling links in turn and, hence, pages have to be removed iteratively. After the PageRank calculation is finished, PageRank can be assigned to the formerly removed pages, again based on the PageRank algorithm, and this requires as many iterations as the removal did. A removed page C could, for instance, be processed before the page B that links to it: at that point, page B has no PageRank yet and, so, page C will not receive any either. Then, page B receives PageRank from page A and, during the second iteration, page C also gets its PageRank.

Regarding our example website for dangling links, removing page C from the database results in pages A and B each having a PageRank of 1. Afterwards, page C is assigned a PageRank of 0.25 + 0.375 PR(A) = 0.625. So, the accumulated PageRank does not equal the number of pages, but at least the pages which do have outbound links are not harmed by the dangling links problem.
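A minimal sketch of this two-step treatment for the example above; the way the outbound links are normalised after removing page C follows the description given here, not a published Google implementation:

d = 0.75

# Step 1: compute PageRank with the dangling page C removed.
# After the removal, A's only remaining outbound link points to B.
pr_a = pr_b = 1.0
for _ in range(100):
    pr_a = (1 - d) + d * pr_b
    pr_b = (1 - d) + d * pr_a
print(pr_a, pr_b)           # both stay at 1.0

# Step 2: assign PageRank to page C afterwards; A originally has
# two outbound links, so C receives d * PR(A) / 2.
pr_c = (1 - d) + d * pr_a / 2
print(pr_c)                 # 0.625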

Because dangling links are removed from the database, they do not have any negative effects on the PageRank of the rest of the web. Since PDF files are usually dangling links, links to PDF files do not diminish the PageRank of the linking page or site. So, PDF files can be a good means of search engine optimisation for Google.

The Effect of the Number of Pages

Since the accumulated PageRank of all pages of the web equals the total number of web pages, it follows directly that an additional web page increases the accumulated PageRank of the web by one. But far more interesting than the effect on the accumulated PageRank of the web is the impact of additional pages on the PageRank of actual websites.

To illustrate the effects of additional web pages, we take a look at a hierarchically structured website consisting of three pages A, B and C, which are joined by an additional page D on the hierarchically lower level of the site. The site has no outbound links. Page A receives a link from an external page X which has a PageRank of 10 and no other outbound links. At a damping factor d of 0.75, the equations for the single pages' PageRank values before adding page D are given by

PR(A) = 0.25 + 0.75 (10 + PR(B) + PR(C))
PR(B) = PR(C) = 0.25 + 0.75 (PR(A) / 2)

Solving the equations gives us the following PageRank values:

PR(A) = 260/14
PR(B) = 101/14
PR(C) = 101/14

After adding page D, the equations for the pages’ PageRank values are given by

PR(A) = 0.25 + 0.75 (10 + PR(B) + PR(C) + PR(D))
PR(B) = PR(C) = PR(D) = 0.25 + 0.75 (PR(A) / 3)

Solving these equations gives us the following PageRank values:

PR(A) = 266/14
PR(B) = 70/14
PR(C) = 70/14
PR(D) = 70/14

As is to be expected, since our example site has no outbound links, the accumulated PageRank of all pages increases by one from 33 to 34 after adding page D. Further, the PageRank of page A rises marginally. In contrast, the PageRank of pages B and C drops substantially.

The Reduction of PageRank by Additional Pages

When pages are added to a hierarchically structured website, the consequences for the already existing pages are nonuniform. The consequences for websites with a different structure shall be shown by another example.

We take a look at a website consisting of three pages A, B and C which link to each other in a circle. The pages are then joined by page D, which fits into the circular linking structure. The regarded site has no outbound links. Again, page A receives a link from an external page X which has a PageRank of 10 and no other outbound links. At a damping factor d of 0.75, the equations for the single pages' PageRank values before adding page D are given by

PR(A) = 0.25 + 0.75 (10 + PR(C))
PR(B) = 0.25 + 0.75 × PR(A)
PR(C) = 0.25 + 0.75 × PR(B)

Solving the equations gives us the following PageRank values:

PR(A) = 517/37 = 13.97
PR(B) = 397/37 = 10.73
PR(C) = 307/37 = 8.30

After adding page D, the equations for the pages’ PageRank values are given by

PR(A) = 0.25 + 0.75 (10 + PR(D))
PR(B) = 0.25 + 0.75 × PR(A)
PR(C) = 0.25 + 0.75 × PR(B)
PR(D) = 0.25 + 0.75 × PR(C)

Solving these equations gives us the following PageRank values:

PR(A) = 419/35 = 11.97
PR(B) = 323/35 = 9.23
PR(C) = 251/35 = 7.17
PR(D) = 197/35 = 5.63

Again, after adding page D, the accumulated PageRank of all pages increases by one from 33 to 34. But now, every page which already existed before page D was added loses PageRank. The more uniformly PageRank is distributed by the links within a site, the more likely this effect is to occur.

Since adding pages to a site often reduces the PageRank of already existing pages, it becomes obvious that the PageRank algorithm tends to privilege smaller websites. Of course, bigger websites can counterbalance this effect by being more attractive for other webmasters to link to, simply because they have more content.

Nonetheless, it is also possible to increase the PageRank of existing pages by adding pages. To achieve this, care has to be taken that as little PageRank as possible is distributed to the additional pages.

The Distribution of PageRank Regarding Search Engine Optimisation

Up to this point, it has been described how the number of pages and the number of inbound and outbound links, respectively, influence PageRank. Here, it will mainly be discussed how PageRank can be influenced for the purpose of search engine optimisation by a website's internal linking structure.

In most cases, websites are hierarchically structured to a certain extent, as illustrated by our example of a website consisting of the pages A, B and C. Normally, the root page is also the page optimised for the most important search phrase. In our example, the optimised page A has an external inbound link from page X, which has a PageRank of 10 and no other outbound links. The pages B and C each receive a link from page A and link back to it. If we set the damping factor d to 0.5, the equations for the single pages' PageRank values are given by

PR(A) = 0.5 + 0.5 (10 + PR(B) + PR(C))
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2)

Solving the equations gives us the following PageRank values:

PR(A) = 8
PR(B) = 2.5
PR(C) = 2.5

For the purpose of search engine optimisation, it is generally not advisable to work solely on the root page of a site. In most cases, it is more reasonable to optimise each page of a site for a different search phrase.

We now assume that the root page of our example website provides satisfactory results for its search phrase, but the other pages of the site do not, and therefore we modify the linking structure of the website. We add links from page B to page C and vice versa to our formerly hierarchically structured example site. Again, page A has an external inbound link from page X which has no other outbound links and a PageRank of 10. At a damping factor d of 0.5, the equations for the single pages’ PageRank values are given by

PR(A) = 0.5 + 0.5 (10 + PR(B) / 2 + PR(C) / 2)
PR(B) = 0.5 + 0.5 (PR(A) / 2 + PR(C) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B) / 2)

Solving the equations gives us the following PageRank values:

PR(A) = 7
PR(B) = 3
PR(C) = 3

The result of adding internal links is an increase of the PageRank values of pages B and C, so that they likely will rise in search engine result pages for their targeted keywords. On the other hand, of course, page A will likely rank lower because of its diminished PageRank.

Generally speaking, the more the hierarchically lower pages of a site are interlinked, the more equally PageRank is distributed among the pages of the site for the purpose of search engine optimisation.

Well Directed PageRank Distribution by Concentration of Outbound Links

It has already been demonstrated that external outbound links tend to have negative effects on the PageRank of a website’s web pages. Here, it shall be illustrated how this effect can be reduced for the purpose of search engine optimisation by the systematic arrangement of external outbound links.

We take a look at another hierarchically structured example site consisting of the pages A, B, C and D. Page A has links to the pages B, C and D. Besides a link back to page A, each of the pages B, C and D has one external outbound link. None of those external pages which receive links from the pages B, C and D link back to our example site. If we assume a damping factor d of 0.5, the equations for the calculation of the single pages’ PageRank values are given by

PR(A) = 0.5 + 0.5 (PR(B) / 2 + PR(C) / 2 + PR(D) / 2)
PR(B) = PR(C) = PR(D) = 0.5 + 0.5 (PR(A) / 3)

Solving the equations gives us the following PageRank values:

PR(A) = 1
PR(B) = 2/3
PR(C) = 2/3
PR(D) = 2/3

Now, we modify our example site so that page D carries all three external outbound links while pages B and C have no external outbound links any more. Apart from this, the general conditions of our example stay the same as above. None of the external pages which receive a link from page D link back to our example site. If we again assume a damping factor d of 0.5, the equations for the calculation of the single pages' PageRank values are given by

PR(A) = 0.5 + 0.5 (PR(B) + PR(C) + PR(D) / 4)
PR(B) = PR(C) = PR(D) = 0.5 + 0.5 (PR(A) / 3)

Solving these equations gives us the following PageRank values:

PR(A) = 17/13
PR(B) = 28/39
PR(C) = 28/39
PR(D) = 28/39

As a result of our modifications, we see that the PageRank value of each single page of our site has increased. Regarding search engine optimisation, it is therefore advisable to concentrate external outbound links on as few pages as possible, as long as this does not lessen a site's usability.

Link Exchanges for the purpose of Search Engine Optimisation

For the purpose of search engine optimisation, many webmasters exchange links with others to increase link popularity. As has already been shown, adding links within a closed system of web pages has no effect on the accumulated PageRank of those pages. So, it is questionable whether link exchanges have positive consequences in terms of PageRank at all.

To show the effects of link exchanges, we take a look at an example of two hierarchically structured websites consisting of pages A, B and C and pages D, E and F, respectively. Within the first site, page A links to pages B and C, and those link back to page A. The second site is structured accordingly, so that the PageRank values of its pages do not have to be computed explicitly. At a damping factor d of 0.5, the equations for the single pages' PageRank values are given by

PR(A) = 0.5 + 0.5 (PR(B) + PR(C))
PR(B) = PR(C) = 0.5 + 0.5 (PR(A) / 2)

Solving the equations gives us the following PageRank values for the first site

PR(A) = 4/3
PR(B) = 5/6
PR(C) = 5/6

and accordingly for the second site

PR(D) = 4/3
PR(E) = 5/6
PR(F) = 5/6

Now, two pages of our example sites start a link exchange. Page A links to page D and vice versa. If we leave the general conditions of our example the same as above and, again, set the damping factor d to 0.5, the equations for the calculations of the single pages’ PageRank values are given by

PR(A) = 0.5 + 0.5 (PR(B) + PR(C) + PR(D) / 3)
PR(B) = PR(C) = 0.5 + 0.5 (PR(A) / 3)
PR(D) = 0.5 + 0.5 (PR(E) + PR(F) + PR(A) / 3)
PR(E) = PR(F) = 0.5 + 0.5 (PR(D) / 3)

Solving these equations gives us the following PageRank values:

PR(A) = 3/2
PR(B) = 3/4
PR(C) = 3/4
PR(D) = 3/2
PR(E) = 3/4
PR(F) = 3/4

We see that the link exchange benefits pages A and D in terms of PageRank, while all the other pages lose PageRank. Regarding search engine optimisation, this is exactly the opposite effect of interlinking the hierarchically lower pages internally. A link exchange is thus advisable if one page (e.g. the root page of a site) is to be optimised for one important key phrase.

A basic premise for the positive effects of a link exchange is that both involved pages propagate a similar amount of PageRank to each other. If one of the involved pages has a significantly higher PageRank or fewer outbound links, it is likely that all the pages of its site lose PageRank. Here, an important influencing factor is the size of a site. The more pages a website has, the more of the PageRank from an inbound link is distributed to other pages of the site, regardless of the number of outbound links on the page that is involved in the link exchange. In this way, the page involved in a link exchange benefits less from the exchange itself and cannot propagate as much PageRank to the other page involved. All the influencing factors should be weighed against each other before one trades links.

Finally, it shall be noted that it is possible for all pages of a site to benefit from a link exchange in terms of PageRank without the other site taking part in the exchange losing PageRank. This may occur when the page involved in the link exchange already has a certain number of external outbound links whose target pages do not link back to the site. In this case, less PageRank is lost through the already existing outbound links.

The Yahoo Bonus and its Impact on Search Engine Optimization

Many experts in search engine optimization assume that certain websites receive a special PageRank evaluation from the Google search engine which requires manual intervention and does not derive directly from the PageRank algorithm. Mostly, the directories Yahoo and the Open Directory Project (dmoz.org) are believed to receive this special treatment. In the context of search engine optimization, this assumption would mean that an entry in the above-mentioned directories has a big impact on a site's PageRank.

An approach that is often mentioned when talking about influencing the PageRank of certain websites is to assign higher starting values to them before the iterative computation of PageRank begins. This procedure shall be reviewed by looking at a simple example web consisting of two pages, each of which solely links to the other. We assign an initial PageRank of 10 to one page and a PageRank of 1 to the other. The damping factor d is set to 0.1, because the lower d is, the faster the PageRank values converge during the iterations. So, we get the following equations for the computation of the pages' PageRank values:

PR(A) = 0.9 + 0.1 PR(B)
PR(B) = 0.9 + 0.1 PR(A)

During the iterations, we get the following PageRank values:

Iteration PR(A) PR(B)
0 1 10
1 1.9 1.09
2 1.009 1.0009
3 1.00009 1.000009

It is obvious that, despite the different starting values, both PageRank values converge to 1, just as they would have done if no special initial values had been assigned. Hence, starting values have no effect on PageRank if a sufficient number of iterations takes place. If the computation is performed with only a few iterations, the starting values would indeed influence PageRank, but we have to consider that in our example the PageRank relation between the two pages already reverses after the first iteration. Note that our computation uses the values already updated within the current iteration and not the ones from the previous iteration; if the latter were used, the PageRank relation would alternate after each iteration.
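The difference between the two update schemes can be seen in a short sketch (a toy illustration only; which scheme Google actually uses is not known):

d = 0.1

# Updates that reuse values from the current iteration, as in the table above:
a, b = 1.0, 10.0
for i in range(1, 4):
    a = (1 - d) + d * b
    b = (1 - d) + d * a
    print(i, a, b)          # compare with the table above

# Updates based only on the previous iteration: the relation
# between the two values now alternates from step to step.
a, b = 1.0, 10.0
for i in range(1, 4):
    a, b = (1 - d) + d * b, (1 - d) + d * a
    print(i, a, b)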

Modification of the PageRank Algorithm

Even if assigning special starting values at the beginning of the PageRank calculations has no effect on the results of the computation, this does not mean that it is impossible to influence the PageRank of websites or web pages by intervening in the PageRank algorithm. Lawrence Page, for instance, describes a method for a special evaluation of web pages in his PageRank patent specifications (United States Patent 6,285,999). The starting point for his consideration is that the random surfer of the Random Surfer Model may still get bored and stop following links with a constant probability, but when he restarts, he does not jump to a random page of the web; rather, he jumps to certain web pages with a higher probability than to others. This behaviour is closer to that of a real user, who would more likely use, for instance, directories like Yahoo or the ODP as a starting point for surfing.

If a special evaluation of certain web pages shall take place, the original PageRank algorithm has to be modified. With another expected value implemented, the algorithm is given as follows:

PR(A) = E(A) (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

Here, (1-d) is the probability of the random surfer no longer following links. E(A) is the probability of the random surfer going to page A after he has stopped following links, weighted by the number of web pages. So, E is another expected value whose average over all pages is 1. In this way, the average of the PageRank values of all pages of the web continues to converge to 1. Thus, the PageRank values do not fluctuate because of the special evaluation of web pages, and the impact of PageRank on the general ranking of web pages remains stable.

In our example, we set the probability of the random surfer going to page A after he has stopped following links to 0.1, and the probability of him going to page B to 0.9. Since our web consists of two pages, E(A) equals 0.2 and E(B) equals 1.8. At a damping factor d of 0.5, we get the following equations for the calculation of the single pages' PageRank values:

PR(A) = 0.2 × 0.5 + 0.5 × PR(B)
PR(B) = 1.8 × 0.5 + 0.5 × PR(A)

If we solve these equations we get the following PageRank values:

PR(A) = 11/15
PR(B) = 19/15

The sum of the PageRank values remains 2. The higher probability of the random surfer jumping to page B is reflected in its higher PageRank. Admittedly, the uniform interlinking between the two pages keeps our intervention from having a more significant impact on the example pages' PageRank values.
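A short sketch of this two-page example with the modified algorithm; the E values are the ones assumed above and the computation is a plain fixed-point iteration:

d = 0.5
E = {"A": 0.2, "B": 1.8}      # expected values; their average over all pages is 1
pr = {"A": 1.0, "B": 1.0}

for _ in range(100):
    pr["A"] = E["A"] * (1 - d) + d * pr["B"]   # B's only link points to A
    pr["B"] = E["B"] * (1 - d) + d * pr["A"]   # A's only link points to B

print(pr["A"], pr["B"])       # 11/15 and 19/15, the sum stays 2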

So, it is possible to implement a special evaluation of certain web pages in the PageRank algorithm without changing it fundamentally. It remains unclear, however, which criteria would be used for such an evaluation. Lawrence Page explicitly suggests the use of real usage data in his PageRank patent specifications, and Google meanwhile collects usage data by means of the Google Toolbar. Google would not even need as much data as if the whole ranking were based solely on usage data: a limited sample would be sufficient to determine the 1,000 or 10,000 most important pages on the web. The PageRank algorithm can then fill the holes in the usage data and thereby deliver a more accurate picture of the web.

Of course, all statements regarding the influence of real usage data on PageRank are pure speculation. Whether there is a special evaluation of certain web pages at all will, in the end, remain a secret of the people at Google.

Nonetheless, Assigning Starting Values?

Although assigning special starting values to pages at the beginning of the PageRank calculations has no effect on the resulting PageRank values, it can nonetheless be reasonable.

We take a look at our example web consisting of the pages A, B and C, where page A links to the pages B and C, page B links to page C and page C links to page A. In this case, the damping factor d is set to 0.75. So, we get the following equations for the iterative computation of the single pages' PageRank values:

PR(A) = 0.25 + 0.75 PR(C)
PR(B) = 0.25 + 0.75 (PR(A) / 2)
PR(C) = 0.25 + 0.75 (PR(A) / 2 + PR(B))

Basically, it is not necessary to assign starting values to the single pages before the computation begins. They simply start with a value of 0 and we get the following PageRank values during the iterations:

Iteration PR(A) PR(B) PR(C)
0 0 0 0
1 0.25 0.34375 0.60156
2 0.70117 0.51294 0.89764
3 0.92323 0.59621 1.04337
4 1.03253 0.63720 1.11510
5 1.08632 0.65737 1.15040
6 1.11280 0.66730 1.16777
7 1.12583 0.67219 1.17633
8 1.13224 0.67459 1.18054
9 1.13540 0.67578 1.18261
10 1.13696 0.67636 1.18363
11 1.13772 0.67665 1.18413
12 1.13810 0.67679 1.18438
13 1.13828 0.67686 1.18450
14 1.13837 0.67689 1.18456
15 1.13842 0.67691 1.18459
16 1.13844 0.67692 1.18460
17 1.13845 0.67692 1.18461
18 1.13846 0.67692 1.18461
19 1.13846 0.67692 1.18461
20 1.13846 0.67692 1.18461
21 1.13846 0.67692 1.18461
22 1.13846 0.67692 1.18462

If we assign 1 to each page before the computation starts, we get the following PageRank values during the iterations:

Iteration PR(A) PR(B) PR(C)
0 1 1 1
1 1 0.625 1.09375
2 1.07031 0.65137 1.13989
3 1.10492 0.66434 1.16260
4 1.12195 0.67073 1.17378
5 1.13034 0.67388 1.17928
6 1.13446 0.67542 1.18199
7 1.13649 0.67618 1.18332
8 1.13749 0.67656 1.18398
9 1.13798 0.67674 1.18430
10 1.13823 0.67684 1.18446
11 1.13835 0.67688 1.18454
12 1.13840 0.67690 1.18458
13 1.13843 0.67691 1.18460
14 1.13845 0.67692 1.18461
15 1.13845 0.67692 1.18461
16 1.13846 0.67692 1.18461
17 1.13846 0.67692 1.18461
18 1.13846 0.67692 1.18461
19 1.13846 0.67692 1.18462

If we now assign a starting value to each page, which is closer to its effective PageRank (1.1 for page A, 0.7 for page B and 1.2 for page C), we get the following results:

Iteration PR(A) PR(B) PR(C)
0 1.1 0.7 1.2
1 1.15 0.68125 1.19219
2 1.14414 0.67905 1.18834
3 1.14126 0.67797 1.18645
4 1.13984 0.67744 1.18552
5 1.13914 0.67718 1.18506
6 1.13879 0.67705 1.18483
7 1.13863 0.67698 1.18472
8 1.13854 0.67695 1.18467
9 1.13850 0.67694 1.18464
10 1.13848 0.67693 1.18463
11 1.13847 0.67693 1.18462
12 1.13847 0.67692 1.18462
13 1.13846 0.67692 1.18462

So, the closer the assigned starting values are to the results we would get by solving the equations, the faster the PageRank values converge in the iterative computation. Fewer iterations are needed, which can be useful for providing more up-to-date search results, especially considering the growth rate of the web. A good starting point for an accurate estimate of the actual PageRank distribution is the set of PageRank values from a former PageRank calculation. Pages which are new in the index could simply be given an initial PageRank of 1, which should already be much closer to their effective PageRank values after the first few iterations.
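The effect of the starting values on the required number of iterations can be checked with a small sketch; the convergence threshold of 0.00001 is an arbitrary choice for this illustration:

def iterations_needed(start, d=0.75, eps=0.00001):
    # Example web: A links to B and C, B links to C, C links to A.
    a, b, c = start
    n = 0
    while True:
        n += 1
        a_new = (1 - d) + d * c
        b_new = (1 - d) + d * a_new / 2
        c_new = (1 - d) + d * (a_new / 2 + b_new)
        if max(abs(a_new - a), abs(b_new - b), abs(c_new - c)) < eps:
            return n
        a, b, c = a_new, b_new, c_new

print(iterations_needed((0.0, 0.0, 0.0)))   # starting from zero: slowest
print(iterations_needed((1.0, 1.0, 1.0)))   # starting from one: faster
print(iterations_needed((1.1, 0.7, 1.2)))   # good guess: fastest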

Additional Factors Influencing PageRank

It has been widely discussed whether additional criteria beyond the link structure of the web have been implemented in the PageRank algorithm since Lawrence Page and Sergey Brin published their scientific work on PageRank. Lawrence Page himself outlines the following potential influencing factors in his patent specifications for PageRank:

Visibility of a link
Position of a link within a document
Distance between web pages
Importance of a linking page
Up-to-dateness of a linking page

First of all, the implementation of additional criteria in PageRank would result in a better approximation of human usage within the Random Surfer Model. Considering the visibility of a link and its position within a document implies that a user does not click on links completely at random, but rather follows links which are highly and immediately visible, regardless of their anchor text. The other criteria would give Google more flexibility than the methods described so far in determining how important an inbound link of a page should be considered.

Whether or not the above-mentioned factors are actually implemented in PageRank cannot be proven empirically and shall not be discussed here. It shall rather be illustrated how additional influencing factors can be implemented in the PageRank algorithm and which options the Google search engine thereby gains for influencing PageRank values.

Modification of the PageRank Algorithm

To implement additional factors in PageRank, the original PageRank algorithm again has to be modified. Since we have to assume that PageRank calculations are still based on numerous iterations, and in the interest of short computation times, the number of database queries during the iterations should be kept as small as possible. Therefore, the following modification of the PageRank algorithm shall be assumed:

PR(A) = (1-d) + d (PR(T1)×L(T1,A) + … + PR(Tn)×L(Tn,A))

Here, L(Ti,A) represents the evaluation of a link which points from page Ti to page A. It replaces the weighting of page Ti's PageRank by the number of outbound links on page Ti, which was given by 1/C(Ti). L(Ti,A) may consist of several factors, each of which has to be determined only once and can be merged into one value before the iterative PageRank calculation begins. So, the number of database queries during the iterations stays the same, although, admittedly, a much larger database has to be queried at each step compared to the computation with the original algorithm, since there is now an evaluation of each link instead of an evaluation of pages (by the number of their outbound links).

Different Evaluation of Links within a Document

Two of the criteria for the evaluation of links mentioned by Lawrence Page in his PageRank patent specifications are the visibility of a link and its position within a document. Regarding the Random Surfer Model, these criteria reflect the probability of the random surfer clicking on a particular link on a specific web page. In the original PageRank algorithm, this probability is given by the term (1/C(Ti)), so that the probability is equal for each link on a page.

Assigning different probabilities to each link on a page can, for instance, be realized as follows:

We take a look at a web consisting of three pages A, B and C, where each of these pages has outbound links to both of the other pages. Links are weighted by two evaluation criteria X and Y. X represents the visibility of a link: X equals 1 if a link is not particularly emphasized, and 2 if the link is, for instance, bold or italic. Y represents the position of a link within a document: Y equals 1 if the link is on the lower half of the page, and 3 if the link is on the upper half of the page. If we assume a multiplicative combination of X and Y, the links in our example are evaluated as follows:

X(A,B) × Y(A,B) = 1 × 3 = 3
X(A,C) × Y(A,C) = 1 × 1 = 1
X(B,A) × Y(B,A) = 2 × 3 = 6
X(B,C) × Y(B,C) = 2 × 1 = 2
X(C,A) × Y(C,A) = 2 × 3 = 6
X(C,B) × Y(C,B) = 2 × 1 = 2

For the purpose of determining the single factors L, the evaluated links must not simply be weighted by the number of outbound links on a page, but by the total evaluation of all links on that page. Thereby, we get the following weighting quotients Z(Ti) for the single pages Ti:

Z(A) = X(A,B) × Y(A,B) + X(A,C) × Y(A,C) = 4
Z(B) = X(B,A) × Y(B,A) + X(B,C) × Y(B,C) = 8
Z(C) = X(C,A) × Y(C,A) + X(C,B) × Y(C,B) = 8

The evaluation factors L(T1,T2) for a link which is pointing from page T1 to page T2 are hence given by

L(T1,T2) = X(T1,T2) × Y(T1,T2) / Z(T1)

Their values regarding our example are as follows:

L(A,B) = 0.75
L(A,C) = 0.25
L(B,A) = 0.75
L(B,C) = 0.25
L(C,A) = 0.75
L(C,B) = 0.25

At a damping factor d of 0.5, we get the following equations for the calculation of PageRank values:

PR(A) = 0.5 + 0.5 (0.75 PR(B) + 0.75 PR(C))
PR(B) = 0.5 + 0.5 (0.75 PR(A) + 0.25 PR(C))
PR(C) = 0.5 + 0.5 (0.25 PR(A) + 0.25 PR(B))

Solving these equations gives us the following PageRank values for our example:

PR(A) = 819/693
PR(B) = 721/693
PR(C) = 539/693

First of all, we see that page A has the highest PageRank of all three pages. This is because page A receives the more highly evaluated link from page B as well as from page C.

Furthermore, we see that even with the evaluation of single links, the sum of the PageRank values of all pages equals 3 (2079/693) and thus the total number of pages. So, the PageRank values computed by our modified PageRank algorithm can be used for the general ranking of web pages by Google without any normalisation being needed.
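The whole example can be reproduced with the following sketch, which first derives the evaluation factors L from the visibility and position weights X and Y and then iterates the modified algorithm; it follows the modification described above, not any known Google implementation:

d = 0.5
# X: visibility (1 = plain, 2 = emphasized), Y: position (1 = lower, 3 = upper half)
XY = {("A", "B"): 1 * 3, ("A", "C"): 1 * 1,
      ("B", "A"): 2 * 3, ("B", "C"): 2 * 1,
      ("C", "A"): 2 * 3, ("C", "B"): 2 * 1}

# Weight every link by the total evaluation of all links on the linking page
Z = {}
for (src, dst), w in XY.items():
    Z[src] = Z.get(src, 0) + w
L = {(src, dst): w / Z[src] for (src, dst), w in XY.items()}

pr = {"A": 1.0, "B": 1.0, "C": 1.0}
for _ in range(100):
    for page in pr:
        pr[page] = (1 - d) + d * sum(pr[src] * L[(src, dst)]
                                     for (src, dst) in L if dst == page)

print(pr, sum(pr.values()))   # A = 819/693, B = 721/693, C = 539/693, total 3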

Different Evaluation of Links by Page Specific Criteria

Besides the unequal evaluation of links within a document, Lawrence Page mentions the possibility of evaluating links according to criteria based on the linking page. At first glance, this does not seem necessary, since the main principle of PageRank is to rank pages higher the more high-ranking pages link to them. But already at the time of their scientific work on PageRank, Page and Brin recognized that their algorithm is vulnerable to the artificial inflation of PageRank.

An artificial influence on PageRank might be exerted by webmasters who generate a multitude of web pages whose links distribute PageRank in such a way that single pages within that system receive a special importance. Those pages can achieve a high PageRank without being linked to from other pages with high PageRank. So, not only is the concept of PageRank undermined, but the search engine's index is also spammed with innumerable web pages created solely to influence PageRank.

In his patent specifications for PageRank, Lawrence Page presents the evaluation of links by the distance between pages as a means to avoid the artificial inflation of PageRank, because the bigger the distance between two pages, the less likely it is that one webmaster has control over both. One criterion for the distance between two pages may be whether they are on the same domain or not; in this way, internal links would be weighted less than external links. In the end, any general measure of the distance between pages can be used to determine such a weighting. This includes whether pages are on the same server or not, and also the geographical distance between servers.

As another indicator of the importance of a document, Lawrence Page mentions the up-to-dateness of the documents which link to it. The reasoning is that the information on a page is less likely to be outdated the more recently modified pages link to it. In contrast, the original PageRank concept, just like any method of measuring link popularity, favours older documents, which gained their inbound links in the course of their existence and are more likely to have been modified less recently than new documents. Basically, recently modified documents could be given a higher evaluation by weighting the factor (1-d). In this way, both those recently modified documents and the pages they link to would receive a higher PageRank. But the fact that a page has been modified recently is not necessarily an indicator of the importance of the information presented on it. So, as suggested by Lawrence Page, it is advisable not to favour recently modified pages themselves but only their outbound links.

Finally, Page mentions the importance of the web location of a page as an indicator of the importance of its outbound links. As an example of an important web location he names the root page of a domain but, in the end, such a criterion would allow Google to influence PageRank entirely at its own discretion.

To implement the evaluation of the linking page into PageRank, the evaluation factor of the modified algorithm must consist of several single factors. For a link that points from page Ti to page A, it can be given as follows:

L(Ti,A) = K(Ti,A) × K1(Ti) × … × Km(Ti)

where K(Ti,A) is the weighting of a single link within a page by its visibility or position, as presented above. Additionally, page Ti is evaluated by m criteria, which are represented by the factors Kj(Ti).

To implement the evaluation of the linking pages, not only the algorithm but also the procedure of the PageRank calculation has to be modified. This shall be illustrated by an example.

We take a look at a web consisting of three pages A, B and C, where page A links to the pages B and C, page B links to page C and page C links to page A. The outbound links of a page are evaluated equally, so there is no weighting by visibility or position. But now, the pages themselves are evaluated by one criterion: an inbound link from page C shall be considered four times as important as an inbound link from one of the other pages. After normalising by the number of pages, so that the average evaluation equals 1, we get the following evaluation factors:

K(A) = 0.5
K(B) = 0.5
K(C) = 2

At a damping factor d of 0.5, the equations for the computation of the PageRank values are given by

PR(A) = 0.5 + 0.5 × 2 PR(C)
PR(B) = 0.5 + 0.5 × 0.5 × 0.5 PR(A)
PR(C) = 0.5 + 0.5 (0.5 PR(B) + 0.5 × 0.5 PR(A))

Solving the equations gives us the following PageRank values:

PR(A) = 4/3
PR(B) = 2/3
PR(C) = 5/6

With this modification of the PageRank algorithm, the accumulated PageRank of all pages no longer equals the number of pages. The reason is that normalising the page evaluation by the number of pages was not appropriate. To determine the proper weighting, the web's linking structure would have to be anticipated, which is not possible for the actual WWW. Therefore, the PageRank calculated with an evaluation of linking pages has to be normalised if there are to be no unfounded effects on the general ranking of pages by Google. Within the iterative calculation, a normalisation would have to take place after each iteration to minimise unintentional distortions.
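One conceivable form of such a per-iteration normalisation is sketched below for the example above; rescaling so that the accumulated PageRank equals the number of pages is an assumption made for illustration, since Google's actual procedure, if any, is unknown:

d = 0.5
K = {"A": 0.5, "B": 0.5, "C": 2.0}              # page evaluation factors of the example
out = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

pr = {"A": 1.0, "B": 1.0, "C": 1.0}
for _ in range(100):
    new = {}
    for page in pr:
        incoming = sum(K[src] * pr[src] / len(out[src])
                       for src in out if page in out[src])
        new[page] = (1 - d) + d * incoming
    # rescale so that the accumulated PageRank again equals the number of pages
    scale = len(new) / sum(new.values())
    pr = {p: value * scale for p, value in new.items()}

print(pr, sum(pr.values()))   # the sum is forced back to 3 after each iteration

With this rescaling, the accumulated PageRank is pushed back to the number of pages after every iteration, which keeps the evaluated PageRank values comparable to those of the original algorithm.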

In a small web, the evaluation of pages often causes severe distortions. In the actual WWW, these distortions should normally even out because of the large number of pages. It is to be expected, though, that evaluating the distance between pages will distort PageRank, since pages with many inbound links surely tend to be linked to from different geographical regions. But such effects can be anticipated from the experience of previous calculation periods, so that only a marginal normalisation would be necessary.

In either case, implementing additional factors in PageRank is possible, although the computation of the PageRank values would take more time.

Theme-based PageRank

For many years now, the topic- or theme-based homogeneity of websites has been discussed as a possible ranking criterion of search engines. There are various theoretical approaches for the implementation of themes in search engine algorithms, all of which have in common that web pages are no longer ranked based only on their own content, but also based on the content of other web pages. For example, the content of all pages of a website may influence the ranking of one single page of that website. On the other hand, it is also conceivable that a page's ranking is based on the content of the pages which link to it or which it links to.

The potential implementation of theme-based ranking in the Google search engine is a controversial topic. In search engine optimization forums and on websites on this topic, we repeatedly find the advice that inbound links from sites with a theme similar to our own have a larger influence on PageRank than links from unrelated sites. This hypothesis shall be discussed here. To do so, we first take a look at two relatively new approaches for integrating themes into the PageRank technique: on the one hand the "intelligent surfer" by Matthew Richardson and Pedro Domingos, and on the other hand the Topic-Sensitive PageRank by Taher Haveliwala. Subsequently, we look at the possibility of using content analyses in order to compare the text of web pages, which can serve as a basis for weighting links within the PageRank technique.

The “Intelligent Surfer” by Richardson and Domingos

Matthew Richardson and Pedro Domingos resort to the Random Surfer Model in order to explain their approach for the implementation of themes in the PageRank technique. Instead of a surfer who follows links completely at random, they suggest a more intelligent surfer who, on the one hand, only follows links which are related to an original search query and, on the other hand, also jumps only to pages which relate to the original query after "getting bored".

So, for Richardson and Domingos' "intelligent surfer", only pages which contain the search term of the initial query are relevant. But since the Random Surfer Model does nothing but illustrate the PageRank technique, the question is how an "intelligent" behaviour of the random surfer influences PageRank. The answer is that a separate PageRank calculation has to be conducted for every term occurring on the web, and each calculation is based solely on links between pages which contain that term.

Computing PageRank this way causes some problems, which appear especially for search terms that do not occur very often on the web. For a page to make it into the PageRank calculation for a specific search term, that term has to appear not only on the page itself but also on the pages that link to it. So, the search results would often be based on small subsets of the web and may omit relevant sites. In addition, with such small subsets of the web, the algorithm is more vulnerable to spam in the form of automatically generated pages.

Additionally, there are serious problems regarding scalability. Richardson and Domingos estimate the memory and computing time requirements for several hundred thousand terms to be 100-200 times higher than for the original PageRank calculations. Given the large number of small subsets of the web, these numbers appear realistic.

The higher memory requirements should not be much of a problem, because Richardson and Domingos correctly state that the term-specific PageRank values constitute only a fraction of the data volume of Google's inverse index. However, the computing time requirements are indeed a large problem. If we assume just five hours for a conventional PageRank calculation, a calculation based on Richardson and Domingos' model would last about three weeks, which makes it unsuitable for actual deployment.

Taher Haveliwala’s Topic-Sensitive PageRank

The approach of Taher Haveliwala seems more practical for actual deployment. Just like Richardson and Domingos, Haveliwala suggests computing different PageRanks for different topics. However, Topic-Sensitive PageRank does not consist of hundreds of thousands of PageRanks for different terms, but of a few PageRanks for different topics. Topic-Sensitive PageRank is based on the link structure of the whole web, where the topic sensitivity implies a different weighting for each topic.

The basic principle of Haveliwala's approach has already been described in our section on the "Yahoo Bonus", where we discussed the possibility of assigning a particular importance to certain web pages. In terms of the Random Surfer Model, this is realized by increasing the probability of the random surfer jumping to a page after "getting bored". Via links, this manual intervention in the PageRank technique influences the PageRank of every page on the web. More precisely, we influenced PageRank by implementing an additional expected value E in the PageRank algorithm:

PR(A) = E(A) (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

Haveliwala’s Topic-Sensitive-PageRank goes one step further. Instead of assigning a universally higher value to a website or a web page, Haveliwala differentiates on the basis of different topics. For each of these topics, he identifies other authority pages. On the basis of this evaluation, different PageRanks are calculated – each separately, but for the entire web.

For his experiments on Topic-Sensitive PageRank, Haveliwala has chosen the 16 top-level categories of the Open Directory Project both for the identification of topics and for the intervention in PageRank. More precisely, Haveliwala assigns a higher value E to the pages of those ODP categories for which he calculates PageRank. If, for example, he calculates the PageRank for the topic health, all the ODP pages in the health category receive a relatively higher value E and they pass this value in the form of PageRank on to the pages which are linked from there. Of course, this PageRank is passed on to other pages and, if we assume that health-related websites tend to link more often to other websites within that topic, pages on the topic health generally receive a higher PageRank.

Haveliwala acknowledges the shortcomings of using the Open Directory Project to identify topics, which, for example, results in a high degree of dependence on ODP editors and in a rather rough subdivision into topics. But, as Haveliwala states, his method shows good results and can surely be improved without great effort.

However, one crucial point in Haveliwala's work on Topic-Sensitive PageRank is the identification of the user's preferences. Having a topic-specific PageRank is useless as long as we do not know which topics an actual user is interested in. In the end, search results must be based on the PageRank that matches the user's preferences best, so Topic-Sensitive PageRank can only be used if these preferences are known.

Indeed, Haveliwala supplies some practicable approaches for identifying user preferences. He describes, for example, search in context, triggered by highlighting terms on a web page; the content of that page could then be an indicator of what the user is looking for. At this point, we want to note the potential of the Google Toolbar. The Toolbar submits data about search terms and visited pages to Google. This data can be used to create user profiles, which can then serve as a basis for identifying a user's preferences. However, even without such techniques, it is conceivable that a user simply chooses the topic he is interested in before submitting a query.

The Weighting of Links Based on Content Analyses

That it is possible to weight single links within the PageRank technique has already been shown. The thought behind weighting links based on content analyses is to avoid the corruption of PageRank. By weighting links this way, it is theoretically possible to diminish the influence of links between thematically unrelated pages which have been set for the sole purpose of boosting the PageRank of one page. It is questionable, however, whether such a weighting can actually be realized on the basis of content analyses.

The fundamentals of content analyses are based on Gerard Salton's work in the 1960s and 1970s. In his vector space model of information retrieval, documents are modeled as vectors which are built upon terms and their weighting within the document. These term vectors allow comparisons between the contents of documents, for instance by calculating the cosine measure (the normalised inner product) of the vectors. In its basic form, the vector space model has some weaknesses. For instance, the assumption that the extent to which the same words appear in two documents is an indicator of their similarity is often criticized. However, numerous enhancements have been developed that solve most of the problems of the vector space model.

One person with notable publications based on Salton's vector space model is Krishna Bharat. This is interesting because Bharat is meanwhile a member of Google's staff and, particularly, because he is deemed to be the developer of "Google News" (news.google.com). Google News is a service that crawls news websites, evaluates articles and then provides them categorized and grouped by subject on the Google News website. According to Google, all these procedures are completely automated. Certainly, other criteria, such as the time when an article is published, are taken into account, but without manual intervention the clustering of articles is most certainly only possible if the contents of the articles are actually compared to each other. The question is: how can this be realized?

In their publication on a term vector database, Raymie Stata, Krishna Bharat and Farzin Maghoul describe how the contents of web pages can be compared on the basis of term vectors and, particularly, how some of the problems with the vector space model can be solved. Firstly, not all terms in documents are suitable for content analysis. Very frequent terms provide only little discrimination across vectors, so the most frequent third of all terms is eliminated from the database. Infrequent terms, on the other hand, do not provide a good basis for measuring similarity. Such terms are, for example, misspellings: they appear only on a few pages which are likely unrelated in terms of theme, but because the terms are so infrequent, the term vectors of those pages appear to be closely related. Hence, the least frequent third of all terms is also eliminated from the database.

Even if only one third of all terms is included in the term vectors, this selection is still not very efficient. Stata, Bharat and Maghoul therefore perform another filtering step, so that each term vector is based on a maximum of 50 terms. But these are not the 50 most frequent terms on a page: they weight a term by dividing the number of times it appears on a page by the number of times it appears on all pages, and the 50 terms with the highest weight are included in the term vector of a page. This selection allows a real differentiation between the contents of pages.
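A rough sketch of this kind of term selection and weighting follows; the function names are invented for illustration, and the input total_freq is assumed to already exclude the most and least frequent thirds of all terms as described above:

from collections import Counter

def term_vector(page_terms, total_freq, max_terms=50):
    # Weight each term by its frequency on the page divided by its
    # frequency on all pages, then keep only the highest weighted terms.
    counts = Counter(page_terms)
    weights = {t: counts[t] / total_freq[t]
               for t in counts if t in total_freq}
    top = sorted(weights, key=weights.get, reverse=True)[:max_terms]
    return {t: weights[t] for t in top}

# page_terms is the list of terms on one page; total_freq maps each
# remaining term to its number of occurrences on all pages.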

The methods described above are standard for the vector space model. If, for example, the inner product of two term vectors is rather high, the contents of the corresponding pages tend to be similar. This may allow content comparisons in many areas, but it is doubtful whether it is a good basis for weighting links within the PageRank technique. Most of all, synonyms and terms that describe similar things cannot be identified. There are algorithms for word stemming which work well for the English language, but in other languages word stemming is much more complicated. Different languages are a general problem: unless, for instance, brand names or loan words are used, texts in different languages normally do not contain the same terms, and if they do, these terms usually have a completely different meaning, so that comparing content across languages is not possible. However, Stata, Bharat and Maghoul provide an approach that resolves these problems.

Stata, Bharat and Maghoul present a concrete application of their Term Vector Database by classifying pages thematically. Bharat has also published on this issue together with Monika Henzinger, presently Google's Research Director, under the name "topic distillation". Topic distillation is based on calculating so-called topic vectors. Topic vectors are term vectors, but they do not only include the terms of one page but rather the terms of many pages which are on the same topic. So, in order to create topic vectors, a certain number of web pages on several pre-defined topics has to be known. To achieve this, they resort to web directories.

For their application, Stata, Bharat and Maghoul crawled about 30,000 links within each of the then 12 main categories of Yahoo to create topic vectors which include about 10,000 terms each. Then, in order to identify the topic of any other web page, they matched its term vector against all the topic vectors created from the Yahoo crawl. The topic of a web page was derived from the topic vector which matched the page's term vector best. That such a classification of web pages works can again be observed by means of Google News. Google News does not only group articles into news topics, but also arranges them into the categories World, U.S., Business, Sci/Tech, Sports, Entertainment and Health. Unless this categorization is based on the structure of the websites the articles come from (which is unlikely), the actual topic of an article does in fact have to be computed.

At the time he published on term vectors, Krishna Bharat did not work on PageRank but rather on Kleinberg's algorithm, so he was more interested in filtering off-topic links than in weighting links. But there is only a small step from classifying pages to weighting links based on content comparisons. Instead of matching the term vectors of two pages directly, it is much more efficient to match the topics of two pages. We can, for instance, create a "topic affinity vector" for each page based on the degree of affinity between the page's term vector and each of the topic vectors. The better the topic affinity vectors of two pages match, the more likely they are on the same topic and the higher a link between them should be weighted.
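One conceivable way to turn this idea into link weights is sketched below; the topic vectors (e.g. built from directory categories) and the use of the cosine measure for both steps are assumptions made for illustration, not a reconstruction of any actual implementation:

import math

def cosine(v, w):
    # normalised inner product of two sparse vectors
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    na = math.sqrt(sum(x * x for x in v.values()))
    nb = math.sqrt(sum(x * x for x in w.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_affinity_vector(term_vec, topic_vectors):
    # Affinity of a page to each predefined topic: cosine between the
    # page's term vector and each topic vector.
    return {topic: cosine(term_vec, tv)
            for topic, tv in topic_vectors.items()}

def link_weight(terms_a, terms_b, topic_vectors):
    # The better the topic affinity vectors of two pages match, the
    # higher a link between them is weighted (a value between 0 and 1).
    a = topic_affinity_vector(terms_a, topic_vectors)
    b = topic_affinity_vector(terms_b, topic_vectors)
    return cosine(a, b)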

Using topic vectors has one big advantage over comparing term vectors directly: a topic vector can include terms in different languages by being based on, for instance, the links in the different national Yahoo versions. Diverging site structures of the national versions can most certainly be reconciled manually. Using the ODP may be even better, because the structure of the sub-categories of its "World" category mirrors the main ODP structure. In this way, topic similarities between pages in different languages can be measured, so that a really useful weighting of links based on text analyses appears to be possible.

Is there an Actual Implementation of Themes in PageRank?

It is obvious that neither Haveliwala's approach nor that of Richardson and Domingos is currently used by Google – one would notice it when using Google. However, a weighting of links based on text analyses would not be apparent immediately. It has been shown that such a weighting is theoretically possible, but it is doubtful that it is actually implemented.

We do not want to claim that we have shown the only way of weighting links on the basis of text analyses; there are certainly dozens of others. However, the approach provided here is based on publications by important members of Google's staff and, thus, we want to base a critical evaluation on it.

As always when talking about PageRank, the question is whether our approach is sufficiently scalable. On the one hand, it causes additional memory requirements. After all, Stata, Bharat and Maghoul describe the system architecture of a term vector database which differs from Google's inverse index, since it maps from page ids to terms and can therefore hardly be integrated into the existing architecture. At the current size of Google's index, the additional memory requirements should be several hundred GB to a few TB. However, this should not be much of a problem, since Google's index is most certainly several times bigger. In fact, the time requirements for building the database and for computing the weightings appear to be the critical part.

Building a term vector database should be approximately as time-consuming as building an inverse index. Of course, many processes can probably be used for building both, but if, for instance, the weighting of terms in the term vectors has to differ from the weighting of terms in the inverse index, the time requirements remain substantial. If we assume that, as in our approach, content analyses are based on computing the inner products of topic affinity vectors, which have to be calculated by matching term vectors and topic vectors, this process should be approximately as time-consuming as computing PageRank. Moreover, we have to consider that the PageRank calculations themselves become more complicated by weighting links.

So, the additional time requirements are definitely not negligible, which is why we have to ask ourselves whether weighting links based on text analysis is useful at all. Links between thematically unrelated pages, set for the sole purpose of boosting one page’s PageRank, may be annoying, but they are most certainly only a small fraction of all links. Additionally, the web itself is thoroughly inhomogeneous. Google, Yahoo and the ODP do not owe their high PageRank solely to links from other search engines or directories. A huge part of the links on the web are simply not set for the purpose of pointing visitors to more thematically related information; indeed, the motivations for placing links are manifold. Moreover, the most popular websites are probably completely inhomogeneous in terms of theme. Think of portals like Yahoo or of news websites, which contain articles covering almost every subject of life. A strong weighting of links as described here could influence those websites’ PageRanks significantly.

If the PageRank technique is not to become totally futile, links can only be weighted to a small extent. This, of course, raises the question of whether the effort it requires is justifiable. After all, there are certainly other ways to eliminate spam, which often reaches the top of search results through thematically unrelated and probably bought links.

PR0 – Google’s PageRank 0 Penalty

At the end of 2001, the Google search engine introduced a new kind of penalty for websites that use questionable search engine optimization tactics: a PageRank of 0. In search engine optimization forums it is called PR0, and this term shall also be used here. Characteristic of PR0 is that all, or at least many, pages of a website show a PageRank of 0 in the Google Toolbar, even if they have high-quality inbound links. Those pages are not completely removed from the index, but they always appear at the end of search results and, thus, can hardly be found.

A PageRank of 0 does not always mean a penalty. Sometimes, websites which seem to be penalized simply lack inbound links with a sufficiently high PageRank. But if pages of a website which formerly ranked well in search results suddenly show the dreaded white PageRank bar, and if there have not been any substantial changes to the inbound links of that website, this, according to the prevailing opinion, almost certainly means a penalty by Google.

We can do nothing but speculate about the causes of PR0, because Google representatives rarely publish new information on Google’s algorithms. Nonetheless, because of its serious effects on search engine optimization, we want to present a theoretical approach to the way PR0 may work.

The Background of PR0

Spam has always been one of the biggest problems that search engines have had to deal with. When spam is detected, the usual procedure is to ban the offending pages, websites, domains or even IP addresses from the index. But removing websites from the index manually requires a large amount of personnel, which causes costs and definitely runs contrary to Google’s scalability goals. So it appears necessary to filter spam automatically.

Filtering spam automatically carries the risk of penalizing innocent webmasters and, hence, the filters have to react rather cautiously to potential spam. But then a lot of spam can pass the filters, and additional measures may be necessary. In order to filter spam effectively, it might be useful to take a look at links.

That Google uses link analysis to detect spam has been confirmed more or less clearly in WebmasterWorld’s Google News Forum by a Google employee who posts as “GoogleGuy”. Over and over again, he advises webmasters to avoid “linking to bad neighbourhoods”. In the following, we want to make “linking to bad neighbourhoods” more precise and discuss how spam can be identified by analyzing link structures. In particular, we want to show how entire networks of spam pages, which may even be spread across many different domains, can be detected.

BadRank as the Opposite of PageRank

The theoretical approach to PR0 presented here was initially brought up by Raph Levien (www.advogato.org/person/raph). We want to introduce a technique that, just like PageRank, analyzes link structures, but that, unlike PageRank, does not determine the general importance of a web page but rather measures its negative characteristics. For the sake of simplicity, this technique shall be called “BadRank”.

BadRank is in principle based on “linking to bad neighbourhoods”: if one page links to another page with a high BadRank, the first page receives a high BadRank itself through this link. The similarities to PageRank are obvious. The difference is that BadRank is based not on the evaluation of a web page’s inbound links but on its outbound links. In this sense, BadRank is a reversal of PageRank. In a direct adaptation of the PageRank algorithm, BadRank would be given by the following formula:

BR(A) = E(A) (1-d) + d (BR(T1)/C(T1) + … + BR(Tn)/C(Tn))

where

BR(A) is the BadRank of page A,
BR(Ti) is the BadRank of the pages Ti which page A links to,
C(Ti) is the number of inbound links of page Ti and
d is, again, the necessary damping factor.

In the previously discussed modifications of the PageRank algorithm, E(A) represented the special evaluation of certain web pages. In the BadRank algorithm, this value reflects whether or not a page has been detected by a spam filter. Without the value E(A), the BadRank algorithm would be useless, because it would be nothing but another analysis of link structures that takes no further criteria into account.

By means of the BadRank algorithm, spam pages can first of all be evaluated directly: a filter assigns them a numeric value E(A), which can, for example, be based on the degree of spamming or, maybe even better, on their PageRank. Again, the sum of all E(A) values has to equal the total number of web pages. In the course of an iterative computation, BadRank is not only transferred to pages that link to spam pages. In fact, BadRank is able to identify regions of the web where spam tends to occur relatively often, just as PageRank identifies regions of the web that are of general importance.
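
As a minimal sketch, the iterative computation that this formula implies could look as follows; the data structures and the fixed number of iterations are illustrative, and a real implementation would iterate until the values converge:

def badrank(outlinks, e, d=0.85, iterations=100):
    # outlinks maps each page to the list of pages it links to; e maps each
    # page to its spam-filter score E(A), whose values are assumed to sum to
    # the total number of pages.
    pages = list(outlinks)
    # C(T) in the formula: the number of inbound links of page T. Every page
    # that appears as a link target has at least one inbound link, so the
    # division below is safe.
    c = {p: 0 for p in pages}
    for targets in outlinks.values():
        for t in targets:
            c[t] += 1
    br = {p: 1.0 for p in pages}          # uniform start values
    for _ in range(iterations):
        br = {p: e[p] * (1 - d) + d * sum(br[t] / c[t] for t in outlinks[p])
              for p in pages}
    return br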

Of course, BadRank and PageRank show significant differences, especially because they use outbound and inbound links, respectively. Our example shows a simple, hierarchically structured website that reflects common link structures fairly well. Each page links to every page on a higher hierarchical level of its branch of the website’s tree structure. Each page also links to the pages arranged directly below it and, additionally, pages on the same branch and the same hierarchical level link to each other.

The following table shows the distribution of inbound and outbound links for the hierarchical levels of such a site.

Level   Inbound links   Outbound links
0       6               2
1       4               4
2       2               3

As is to be expected, the number of inbound links decreases level by level from the index page downwards. In contrast, we find the highest number of outbound links on the website’s mid-level. We see similar results when we add another level of pages to our website while keeping the linking rules described above.

Level   Inbound links   Outbound links
0       14              2
1       8               4
2       4               5
3       2               4

Again, there is a concentration of outbound links on the website’s mid-levels. Above all, the outbound links are distributed much more evenly than the inbound links.
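
Both tables follow directly from the stated linking rules. The following small sketch builds such a site as a binary tree with a given number of levels and counts the inbound and outbound links of one page per level; representing pages by their path from the index page is purely an illustrative choice:

def build_site(levels):
    # Pages are represented by their path from the index page: () is the
    # index page, (0,) and (1,) its two children, and so on.
    pages = [()]
    for lvl in range(1, levels):
        pages += [p + (b,) for p in pages if len(p) == lvl - 1 for b in (0, 1)]
    links = set()
    for p in pages:
        for i in range(len(p)):              # all ancestors on the same branch
            links.add((p, p[:i]))
        if len(p) < levels - 1:              # pages arranged directly below
            links.add((p, p + (0,)))
            links.add((p, p + (1,)))
        if p:                                # sibling on the same branch and level
            links.add((p, p[:-1] + (1 - p[-1],)))
    return pages, links

def print_level_table(levels):
    pages, links = build_site(levels)
    print("Level  Inbound links  Outbound links")
    for lvl in range(levels):
        page = next(p for p in pages if len(p) == lvl)   # all pages of a level behave alike
        inbound = sum(1 for src, dst in links if dst == page)
        outbound = sum(1 for src, dst in links if src == page)
        print(f"{lvl:<6} {inbound:<14} {outbound}")

print_level_table(3)   # 6/2, 4/4, 2/3 as in the first table
print_level_table(4)   # 14/2, 8/4, 4/5, 2/4 as in the second table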

If we assign a value of 100 to the index page’s E(A) in our original example, while all other E values equal 1, and if the damping factor d is 0.85, we get the following BadRank values:

Page      BadRank
A         22.39
B/C       17.39
D/E/F/G   12.21

First of all, we see that the BadRank spreads from the index page to all other pages of the website. The combination of PageRank and BadRank will be discussed in detail below, but no matter how that combination is realized, it is obvious that the two can neutralize each other very well. After all, we can assume that a page’s PageRank also decreases with each lower hierarchy level, so that a PR0 can easily be achieved for all pages.

If we now assume that the hierarchically inferior page G links to a page X with a constant BadRank BR(X) = 10, that the link from page G is the only inbound link of page X, and that all E values of our example website equal 1, we get the following values at a damping factor d of 0.85:

Page   BadRank
A      4.82
B      7.50
C      14.50
D      4.22
E      4.22
F      11.22
G      17.18

In this case, the distribution of BadRank is less homogeneous than in the first scenario, but BadRank still spreads to all pages of the website. The relatively low BadRank of the index page A is remarkable, however: it could be hard to neutralize its PageRank, which should be higher than that of the rest of the pages. This effect is not really desirable, but it reflects the experience of numerous webmasters. Quite often we see the phenomenon that all pages of a website except the index page show a PR0 in the Google Toolbar, while the index page often has a Toolbar PageRank between 2 and 4. We can therefore probably assume that this particular variant of PR0 is caused not by the website being caught by a spam filter, but rather by a penalty for “linking to bad neighbourhoods”. It is also possible that this variant of PR0 occurs when only hierarchically inferior pages of a website get trapped in a spam filter.
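
Both result tables can be reproduced (up to rounding) with the BadRank formula. The following sketch hard-codes the example site’s link structure and runs the iteration for both scenarios; it extends the earlier BadRank sketch with a parameter for pages whose value is held constant, which is how we model the external page X here:

def badrank(outlinks, e, d=0.85, fixed=None, iterations=100):
    # Iterative BadRank computation; pages listed in `fixed` keep a constant value.
    fixed = fixed or {}
    c = {p: 0 for p in outlinks}                  # C(T): inbound link counts
    for targets in outlinks.values():
        for t in targets:
            c[t] += 1
    br = {p: fixed.get(p, 1.0) for p in outlinks}
    for _ in range(iterations):
        br = {p: fixed[p] if p in fixed
                 else e[p] * (1 - d) + d * sum(br[t] / c[t] for t in outlinks[p])
              for p in outlinks}
    return br

# The hierarchical example site: A is the index page, B and C its children,
# D/E below B and F/G below C; every page links to its ancestors, to the
# pages directly below it and to its sibling.
site = {
    "A": ["B", "C"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "F", "G"],
    "D": ["A", "B", "E"],
    "E": ["A", "B", "D"],
    "F": ["A", "C", "G"],
    "G": ["A", "C", "F"],
}

# Scenario 1: the index page was caught by a spam filter, E(A) = 100.
e1 = dict.fromkeys(site, 1.0)
e1["A"] = 100.0
print(badrank(site, e1))
# approximately A = 22.39, B = C = 17.39, D..G = 12.21

# Scenario 2: all E values equal 1, but G links to an external page X
# whose BadRank is held constant at 10 and whose only inbound link is from G.
site2 = dict(site, G=["A", "C", "F", "X"], X=[])
print(badrank(site2, dict.fromkeys(site2, 1.0), fixed={"X": 10.0}))
# approximately A = 4.82, B = 7.50, C = 14.50, D = E = 4.22, F = 11.22, G = 17.18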

The Combination of PageRank and BadRank to PR0

If we assume that BadRank exists in the form presented here, the question is how BadRank and PageRank can be combined in order to penalize as many spammers as possible while penalizing as few innocent webmasters as possible.

Intuitively, implementing BadRank directly in the actual PageRank computation seems to make sense. For instance, one could calculate BadRank first and then divide a page’s PageRank by its BadRank at each step of the iterative PageRank calculation. This would have the advantage that a page with a high BadRank could pass on little or no PageRank to the pages it links to. After all, one can argue that if a page links to one suspect page, all the other links on that page may also be suspect.

However, such a direct connection between PageRank and BadRank is very risky. Above all, the actual influence of BadRank on PageRank cannot be estimated in advance. We also have to consider that we would create many pages which cannot pass on PageRank to the pages they link to. In effect, these pages behave like dangling links and, as discussed in the section on outbound links, it is absolutely necessary to avoid dangling links while computing PageRank.

So it would be advisable to run separate iterative calculations for PageRank and BadRank and to combine them afterwards, for instance by simple arithmetical operations. In principle, a subtraction would have the desirable consequence that relatively small BadRank values can hardly have a large influence on relatively high PageRank values. But with a subtraction it would certainly be difficult to achieve PR0 for a large number of pages; we would rather see a PageRank devaluation for many pages.

The effects that we know as PR0 seem easier to achieve by dividing PageRank by BadRank. But this would give BadRank an extremely high importance and, since the average BadRank equals 1, a large part of the BadRank values is smaller than 1, so a normalization is necessary. Probably the best results would come from normalizing and scaling BadRank to values between 0 and 1, so that “good” pages have values close to 1 and “bad” pages have values close to 0, and then multiplying these values with PageRank.

A very effective and easy-to-realize alternative would probably be a simple stepped evaluation of PageRank and BadRank. It would be reasonable for a BadRank above a certain value to always lead to PR0. The same could happen when the ratio of PageRank to BadRank falls below a certain value. Additionally, it would make sense that if BadRank and/or the ratio of BadRank to PageRank is below a certain value, BadRank has no influence at all.

Only if none of these cases applies would an actual combination of PageRank and BadRank, for instance by dividing PageRank by BadRank, be necessary. In this way, all unwanted effects could be avoided.
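
As a sketch of such a stepped evaluation, with all threshold values chosen purely for illustration (whether Google uses anything of this kind, and with which values, is unknown):

def toolbar_rank(pagerank, badrank,
                 br0_threshold=50.0,      # BadRank above this always means PR0
                 ratio_pr0=0.1,           # PR/BR below this also means PR0
                 br_ignore=0.5,           # BadRank below this has no influence
                 ratio_ignore=10.0):      # PR/BR above this: BadRank ignored
    # Stepped combination of PageRank and BadRank as described above.
    if badrank <= br_ignore:
        return pagerank                   # negligible BadRank: no influence
    if badrank >= br0_threshold:
        return 0.0                        # very high BadRank: always PR0
    ratio = pagerank / badrank
    if ratio <= ratio_pr0:
        return 0.0                        # PageRank dwarfed by BadRank: PR0
    if ratio >= ratio_ignore:
        return pagerank                   # BadRank still negligible in relation
    return pagerank / badrank             # only now combine the two values

Only the last branch actually mixes the two values; in every other case the result is either the unchanged PageRank or a PR0, which matches the all-or-nothing behaviour observed in the Toolbar.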

A Critical View on BadRank and PR0

How Google would realize the combination of PageRank and BadRank is of rather minor importance. However, computing the two separately and combining them afterwards has the consequence that the actual effect of a high BadRank may not be visible in the Toolbar. If a page has a high PageRank in the original sense, the influence of its BadRank can be negligible; but for another page that links to it, the consequences could be quite serious.

An even bigger problem lies in the direct reversal of the PageRank algorithm as we have presented it here: just as an additional inbound link can only increase a page’s PageRank, an additional outbound link can only increase its BadRank, because BadRank values are summed in the BadRank formula. So it does not matter how many “good” outbound links a page has; one link to a spam page can be enough to lead to PR0.

Admittedly, this problem may appear in exceptional cases only. In our direct reversal of the PageRank algorithm, a page’s BadRank is divided by the number of its inbound links, so a single link to a page with a high BadRank transfers only part of that BadRank. Google’s Matt Cutts remarked on this issue: “If someone accidentally does a link to a bad site, that may not hurt them, but if they do twenty, that’s a problem.” (searchenginewatch.com/sereport/02/11-searchking.html)

However, as long as all links are weighted uniformly within the BadRank computation, there is another problem: if two pages differ widely in PageRank and both link to the same page with a high BadRank, the page with the higher PageRank may suffer far less from the transferred BadRank than the page with the low PageRank. We have to hope that Google knows how to deal with such problems. Nevertheless, it should be noted that, in the procedure presented here, outbound links can do nothing but harm.

Of course, all statements about how PR0 works are pure speculation. But in principle, an analysis of link structures along the lines of the PageRank technique seems to be the natural way for Google, in particular, to deal with spam.
