Adapted from http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus

## Linear Algebra in a Nutshell

First, denote the set of all real numbers by **R**. A **vector** of length `n` is simply a list of `n` elements, called vector *entries*. The notation of a (column) vector `v` having entries `x`_{1}, `x`_{2}, … , `x`_{n} is the following:

**Example 1:** The entries of a vector can in principle be anything: words, colors, numbers, etc. The following are all examples of vectors:

However, we will only look at vectors whose entries are real numbers (that is, `x`_{1}, `x`_{2}, … , `x`_{n} are all real numbers). As in the case of real numbers, real vectors can be added or multiplied by a real number (called a *scalar*) in a straightforward way:

**Example 2:** Using the above two notations, we can easily check that:
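The two operations can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original notes; vectors are stored as plain lists.

```python
# Vectors as plain Python lists of real numbers (illustrative sketch).
def vec_add(u, v):
    """Entrywise sum: (u + v)_i = u_i + v_i."""
    return [a + b for a, b in zip(u, v)]

def vec_scale(c, v):
    """Scalar multiplication: (c * v)_i = c * v_i."""
    return [c * a for a in v]

u = [1, 2, 3]
v = [4, 5, 6]
print(vec_add(u, v))    # [5, 7, 9]
print(vec_scale(2, u))  # [2, 4, 6]
```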

In a similar way we define the notion of a **matrix**. An `m` × `n` matrix is a rectangular array of entries having height `m` and width `n`. This can also be seen as stacking `n` vectors of length `m` one next to the other. Again, take the entries to be real numbers. We denote an `m` × `n` matrix `A` with entries `a`_{ij} by:

This means that on row `i` and column `j` we find the real number `a`_{ij}. For example, on the first row, second column lies element `a`_{12}. Notice that a vector of length `m` is just an `m` × `1` matrix and a real number is a `1` × `1` matrix. If `m` = `n`, then the matrix is called a *square matrix* of *size* `n`. This is what we will consider from now on. It makes sense now to talk about the *diagonal* of a matrix, which consists of those elements for which the row number equals the column number (the elements `a`_{11}, `a`_{22}, up to `a`_{nn}).

**Example 3:** The following are all examples of matrices, but only the last three are square:

Notice that the last two matrices are rather “special”: they are *symmetric*. This means that the entry on row `i` and column `j` is equal to the entry on row `j` and column `i` (or, otherwise said, `a`_{ij} = `a`_{ji}). As an easy exercise, prove that any diagonal matrix (one that has zeros everywhere except possibly on the diagonal) is symmetric. The matrix that has ones on the diagonal and zeros everywhere else is called the *identity* matrix and is denoted by `I`. Matrix multiplication is done in the following general way:

where the element `c`_{ij} is given by the formula `c`_{ij} = `a`_{i1}`b`_{1j} + `a`_{i2}`b`_{2j} + … + `a`_{in}`b`_{nj} (row `i` of the first matrix times column `j` of the second).

Multiplying a matrix with a vector is done as follows:

where every element `y`_{i} of the resulting vector `y` is given by the formula `y`_{i} = `a`_{i1}`x`_{1} + `a`_{i2}`x`_{2} + … + `a`_{in}`x`_{n}.

The other two matrix operations, addition and scalar multiplication, are done as in the case of vectors. Adding matrices `A` and `B` gives a matrix `C` which has the entry on row `i` and column `j` equal to the sum of the corresponding entries of `A` and `B`. Multiplying matrix `A` with a real number `a` is the same as multiplying every element of `A` with `a`.

**Example 4:** All above operations on matrices will become more obvious through the following examples:
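The three matrix operations described above can be sketched as short Python functions; this is an illustration added for concreteness, with matrices stored as lists of rows.

```python
# Minimal sketches of matrix multiplication, addition, and scalar
# multiplication, with a matrix stored as a list of its rows.
def mat_mul(A, B):
    """c_ij = a_i1*b_1j + a_i2*b_2j + ... (row i of A times column j of B)."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def mat_add(A, B):
    """Entrywise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_scale(c, A):
    """Multiply every entry of A by the scalar c."""
    return [[c * a for a in row] for row in A]

I = [[1, 0], [0, 1]]
A = [[1, 2], [3, 4]]
print(mat_mul(A, I))    # [[1, 2], [3, 4]]  (multiplying by I changes nothing)
print(mat_add(A, A))    # [[2, 4], [6, 8]]
print(mat_scale(3, A))  # [[3, 6], [9, 12]]
```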

**Problem 2:** Consider the following three matrices:

Show that (`A`+`B`)+`C` = `A`+(`B`+`C`) and that (`A`·`B`)·`C` = `A`·(`B`·`C`).

One can observe that there is some sort of similarity between matrices `A` and `C` in Problem 2 above. The similarity comes from the fact that the entry on row `i` and column `j` in matrix `A` is equal to the entry on row `j` and column `i` in matrix `C`. Of course, the elements on the diagonal are the same. We then say that `A` is the **transpose** of `C`, or equivalently that `C` is the transpose of `A`. In general, for a matrix `A` we denote its transpose by `A`^{t}. More intuitively, given a matrix, we find its transpose by interchanging the element at row `i`, column `j` with the element at row `j`, column `i`. If we do this twice, we notice that the transpose of the transpose of a matrix is the matrix itself, or (`A`^{t})^{t} = `A`.

**Fact 2:** For any matrices `A`, `B`, (`A`·`B`)^{t} = `B`^{t}·`A`^{t}, and (`A`+`B`)^{t} = `A`^{t}+`B`^{t}.
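Fact 2 is easy to check numerically. The sketch below (an illustration, using two arbitrary small matrices) verifies both identities on an example:

```python
# Numerical spot-check of Fact 2: (A*B)^t = B^t * A^t, (A^t)^t = A.
def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def mat_mul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [5, 2]]
assert transpose(mat_mul(A, B)) == mat_mul(transpose(B), transpose(A))
assert transpose(transpose(A)) == A  # (A^t)^t = A
print("Fact 2 verified on this example")
```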

We now introduce two important notions in the theory of matrices: eigenvector and eigenvalue. We say that the real number `z` is an **eigenvalue** for `A` if there exists a nonzero real vector `v` of length `n` such that `A`·`v` = `z`·`v`. Such a vector `v` is called an **eigenvector** corresponding to the eigenvalue `z`. This is not the most general definition, but it will suffice for our purposes. In general, eigenvalues and eigenvectors are complex, not real. If we assume that `A` is a (real) symmetric matrix of size `n`, then we know that it has `n` real eigenvalues (counted with multiplicity) and all eigenvectors are also real. In fact, a matrix of size `n` can have at most `n` real eigenvalues.

In order to make these definitions more clear, consider the following explicit example:

**Example 5:** The matrix has eigenvector and eigenvalue `z` = 3 corresponding to this eigenvector. Indeed, the definition is verified because:

**Fact 3:** Any matrix `A` has the same eigenvalues as its transpose `A`^{t}.

Of course, in general a matrix `A` and its transpose `A`^{t} *do not* have the same eigenvectors corresponding to the common eigenvalues. The transpose of the matrix in the above example also has eigenvalue `z` = 3, but the corresponding eigenvector is different. This follows from the computation below:

An important observation is that a matrix `A` may (in most cases) have more than one eigenvector corresponding to an eigenvalue. These eigenvectors that correspond to the same eigenvalue may have no relation to one another. They can however be related, as for example if one is a scalar multiple of another. More precisely, in the last example, the vector whose entries are 0 and 1 is an eigenvector, but also the vector whose entries are 0 and 2 is an eigenvector. It is a good exercise to check this by direct computation as shown in Example 5.
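The matrix of Example 5 is not reproduced in this text; the matrix below is an assumption chosen to be consistent with the surrounding discussion (eigenvalue 3, with (0, 1) an eigenvector of the transpose). The sketch checks the eigenvector definition directly:

```python
# Assumed stand-in for the matrix of Example 5 (not shown in the text):
# eigenvalue 3, and (0, 1) is an eigenvector of its transpose.
A = [[1, 2],
     [0, 3]]

def mat_vec(A, v):
    """Matrix-vector product: y_i = a_i1*x_1 + ... + a_in*x_n."""
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

v = [1, 1]
print(mat_vec(A, v))        # [3, 3] == 3 * v, so v is an eigenvector for z = 3

At = [[1, 0], [2, 3]]       # the transpose of A
print(mat_vec(At, [0, 1]))  # [0, 3]: (0, 1) is an eigenvector of A^t for z = 3
print(mat_vec(At, [0, 2]))  # [0, 6]: so is its scalar multiple (0, 2)
```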

A matrix is called **column-stochastic** if all of its entries are greater than or equal to zero (nonnegative) and the sum of the entries in each column is equal to 1. If all entries of a matrix are nonnegative, then we say that the matrix itself is **nonnegative**. Furthermore, a matrix is **positive** if all its entries are positive (greater than zero) real numbers.

**Example 6:** Consider a matrix `A` with transpose `A`^{t}:

It is easy to see that `A` is column-stochastic, while `A`^{t} is not. However, the sum of the elements on each row of `A`^{t} is equal to 1. We first show that `z` = 1 is an eigenvalue for `A`^{t}, with corresponding eigenvector

This is true since `A`^{t}·`v` = 1·`v`. Then, from Fact 3, 1 is an eigenvalue for the matrix `A` as well.

Next we wish to find an eigenvector that corresponds to this eigenvalue for `A`. Let `u` be such an eigenvector. In order to find `u` explicitly we write the equation

`x`_{1}, `x`_{2}, `x`_{3}, and `x`_{4} are all real numbers that we don’t yet know. If we multiply term by term we find that

Substituting in the third relation we obtain and so the vector `u` is of the form

As `x`_{1} is just a real number (hence a scalar) we can take `x`_{1} = 12 and we have just proved that the vector whose entries are 12, 4, 9, and 6 (from top to bottom) is an eigenvector for `A`.
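The matrices of Example 6 are not reproduced in this text; the matrix below is an assumption consistent with the worked answer (column sums equal to 1, and an eigenvector with entries 12, 4, 9, 6). The sketch verifies both claims with exact fractions:

```python
# Assumed stand-in for the matrix A of Example 6 (not shown in the text),
# checked with exact rational arithmetic.
from fractions import Fraction as F

A = [
    [F(0),    F(0),    F(1), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(0)],
    [F(1, 3), F(1, 2), F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]

# Column-stochastic: every column sums to 1.
assert all(sum(A[r][c] for r in range(4)) == 1 for c in range(4))

u = [12, 4, 9, 6]
Au = [sum(A[i][j] * u[j] for j in range(4)) for i in range(4)]
print(Au == u)  # True: A*u = 1*u, so u is an eigenvector for z = 1
```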

In the first part of the previous example we have just shown that 1 is an eigenvalue for that particular matrix. However, this is true for any column-stochastic matrix, as stated below.

**Fact 4:** Any column-stochastic matrix has `z` = 1 as an eigenvalue.

Notice also that the eigenvector that we found in the second part of the example above is rather “special” itself. We have chosen `x`_{1} = 12 and we obtained an eigenvector with positive entries. If however we choose `x`_{1} = -12 then we obtain an eigenvector with negative entries (smaller than 0).

**Fact 5:** If `A` is a positive column-stochastic matrix, then any eigenvector corresponding to the eigenvalue `z` = 1 has either only positive entries or only negative entries.

When we are working with positive column-stochastic matrices `A` it is possible to find an eigenvector `v` associated to the eigenvalue `z` = 1 such that all its entries are positive. Hence `A`·`v` = `v` and the entries of `v` are all positive.

**Fact 6:** If `A` is a positive column-stochastic matrix, then there is a *unique* eigenvector corresponding to the eigenvalue `z` = 1 such that it has only positive entries and the sum of its entries equals 1.

# Lecture #2: Directed Graphs – Transition Matrices

A *graph* is an object that consists of a non-empty set of *vertices* and another set of *edges*. When working with real-world examples of graphs, we sometimes refer to them as *networks*. The vertices are often called *nodes* or *points*, while edges are referred to as *links* or *lines*. The set of edges may be empty, in which case the graph is just a collection of points.

**Example 1:** A simple example of a graph with vertices 1,2,3,4, and directed edges from vertex 1 to vertex 2, vertex 3 to vertices 2 and 4, and vertex 4 to vertex 1.

In this lecture we will only work with *directed* graphs and real-world examples of those (Internet graphs); for other properties of graphs we refer to the Math Explorer’s Club website. The central example in this module is the *web graph*, in which web pages are represented as vertices and the links between them are represented as edges. An example of such a graph is a subgraph of the BGP (Border Gateway Protocol) web graph, consisting of major Internet routers. It has about 6400 vertices and 13000 edges, and it was produced by Ross Richardson and rendered by Fan Chung Graham.

Although Internet graphs are very large, with the number of vertices on the order of 30 billion (and growing), all graphs in this module are considered finite (finitely many vertices and edges).

We say that two vertices `i` and `j` of a directed graph are *joined* or *adjacent* if there is an edge from `i` to `j` or from `j` to `i`. If such an edge exists, then `i` and `j` are its *endpoints*. If there is an edge from `i` to `j`, then `i` is often called the *tail*, while `j` is called the *head*. In Example 1, vertices 1 and 2 are joined because there is an edge from 1 to 2, while vertices 1 and 3 are not joined. There is, however, no edge from node 2 to node 1. Notice that there can be no more than two edges between any two vertices (one in each direction). There is a strong relation between graphs and matrices, previously introduced in Lecture 1. Suppose we are given a directed graph with `n` vertices. Then we construct an `n` × `n` **adjacency matrix** `A` associated to it as follows: if there is an edge from node `i` to node `j`, then we put 1 as the entry on row `i`, column `j` of the matrix `A`; otherwise that entry is 0.

**Example 2:** Adjacency matrix for the graph in Example 1: .

If one can walk from node `i` to node `j` along the edges of the graph, then we say that there is a *path* from `i` to `j`. If we walked on `k` edges, then the path has length `k`. For matrices, we denote by `A`^{k} the matrix obtained by multiplying `A` with itself `k` times. The entry on row `i`, column `j` of `A`^{2} = `A`·`A` corresponds to the number of paths of length 2 from node `i` to node `j` in the graph. For Example 2, the square of the adjacency matrix is

This means that there is a path from vertex 4 to vertex 2, because the entry on fourth row and second column is 1. Similarly there is a path from 3 to 1, as one can easily see from Example 1.
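The computation can be reproduced directly. The adjacency matrix below is reconstructed from the edges listed in Example 1 (1→2, 3→2, 3→4, 4→1):

```python
# Adjacency matrix reconstructed from Example 1's edge list:
# a_ij = 1 iff there is an edge from node i to node j (1-based in the text).
A = [
    [0, 1, 0, 0],  # edges leaving node 1
    [0, 0, 0, 0],  # node 2 has no outgoing edges
    [0, 1, 0, 1],  # edges leaving node 3
    [1, 0, 0, 0],  # edges leaving node 4
]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A2 = mat_mul(A, A)
# A2[3][1] == 1: one path of length 2 from node 4 to node 2 (4 -> 1 -> 2).
# A2[2][0] == 1: one path of length 2 from node 3 to node 1 (3 -> 4 -> 1).
print(A2)  # [[0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
```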

**Fact 1:** Consider a directed graph and a positive integer `k`. Then the number of directed walks from node `i` to node `j` of length `k` is the entry on row `i` and column `j` of the matrix `A`^{k}, where `A` is the adjacency matrix.

In general, a matrix is called **primitive** if there is a positive integer `k` such that `A`^{k} is a positive matrix. A graph is called **connected** if for any two different nodes `i` and `j` there is a directed path either from `i` to `j` or from `j` to `i`. On the other hand, a graph is called **strongly connected** if starting at any node `i` we can reach any other different node `j` by walking on its edges. In terms of matrices, this means that if there is a positive integer `k` such that the matrix `B` = `I` + `A` + `A`^{2} + `A`^{3} + … +`A`^{k} is positive, then the graph is strongly connected. We add the identity matrix `I` in order to deal with edges from a vertex to itself. In other words, if there is at least one path from node `i` to node `j` of length at most `k`, then we can travel from node `i` to `j`. Thus if matrix `B` has a positive entry on row `i` and column `j` then it is possible to reach node `j` starting from `i`. If this happens for all nodes, then the graph is strongly connected.

One can easily see that the graph in Example 1 is connected, but not strongly connected, because there is no directed path from vertex 1 to vertex 3. For the matrix in Example 2, we notice that `A`^{4} is a matrix having only zeros, and so for all `k` greater than or equal to 4, `A`^{k} will be a matrix filled with zeros. Then for any such `k`, the matrix `B` = `I` + `A` + `A`^{2} + `A`^{3} + … + `A`^{k} is:

Since the matrix `B` is not positive, the graph in Example 1 is not strongly connected as we already saw.
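This strong-connectivity test can be sketched in code. The function below forms `B` = `I` + `A` + … + `A`^{k} and checks positivity, using the adjacency matrix reconstructed from Example 1 (edges 1→2, 3→2, 3→4, 4→1):

```python
# Strong-connectivity check via B = I + A + A^2 + ... + A^k (illustrative sketch).
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_strongly_connected(A, k):
    n = len(A)
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    B, P = identity, identity
    for _ in range(k):
        P = mat_mul(P, A)  # P is now the next power of A
        B = [[B[i][j] + P[i][j] for j in range(n)] for i in range(n)]
    return all(all(entry > 0 for entry in row) for row in B)

A = [[0, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 1, 0, 1],
     [1, 0, 0, 0]]
print(is_strongly_connected(A, 5))  # False: e.g. no directed path from 1 to 3
```

By contrast, a directed 3-cycle passes the test, since every node can reach every other node.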

**Problem 1:** Construct the adjacency matrix of the graph below. Is this graph connected or strongly connected? How many paths of length 3 from node 3 to node 2 are there?

In the examples above we noticed that for every vertex `i` there is a number of edges that enter that vertex (`i` is a head) and a number of edges that exit that vertex (`i` is a tail). Thus we define the **indegree** of `i` as the number of edges for which `i` is a head, and the **outdegree** of `i` as the number of edges for which `i` is a tail. For example, for the graph in Problem 1, the indegree of node 2 is 2 and the outdegree of node 1 is 1. The **transition matrix** `A` associated to a directed graph is defined as follows. If there is an edge from `i` to `j` and the outdegree of vertex `i` is `d`_{i}, then on column `i` and row `j` we put 1/`d`_{i}. Otherwise we mark column `i`, row `j` with zero. Notice that we first look at the column, then at the row. We usually write the weight 1/`d`_{i} on the edge going from vertex `i` to an adjacent vertex `j`, thus obtaining a *weighted graph*. This will become clear through the following example.

**Example 3:** Consider the graph from Example 1 with weights on its edges as described above.

Then the transition matrix associated to it is:

Notice that the sum of the entries on the first column is 1. The same holds for the third and fourth column. In general, more is true.
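The construction can be sketched as follows, using the edge list of Example 1 (1→2, 3→2, 3→4, 4→1) and exact fractions. Note that the second column sums to 0, not 1, because node 2 has no outgoing edges:

```python
# Building the transition matrix of the Example 1 graph (illustrative sketch).
from fractions import Fraction

edges = {1: [2], 2: [], 3: [2, 4], 4: [1]}  # node -> list of link targets
n = len(edges)
T = [[Fraction(0)] * n for _ in range(n)]
for i, targets in edges.items():
    for j in targets:
        # 1/d_i goes on column i, row j (column first, then row).
        T[j - 1][i - 1] = Fraction(1, len(targets))

col_sums = [sum(T[r][c] for r in range(n)) for c in range(n)]
print([int(s) for s in col_sums])  # [1, 0, 1, 1]: column 2 fails (dangling node)
```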

**Fact 2:** For a strongly connected graph, the transition matrix is column-stochastic.

We use the transition matrix to model the behavior of a random surfer on a web graph. The surfer chooses a page at random, then follows its links to other web pages for as long as he/she wishes. At each step the probability that the surfer moves from node `i` to node `j` is zero if there is no link from `i` to `j`, and 1/`d`_{i} otherwise. Recall that `d`_{i} is the outdegree of vertex `i`. Initially each page is equally likely to be chosen as a starting point, so every entry of the initial probability vector `v` equals 1/`n`.

At step 1, the probability of each node being visited after one click is given by `A`·`v`. At step 2, the probability of each node being visited after two clicks is `A`^{2}·`v`. The probability of a page being visited at step `k` is thus given by the corresponding entry of `A`^{k}·`v`. If the matrix is primitive and column-stochastic, then this process converges to a unique *stationary probability distribution* vector `p`, where `A`·`p` = `p`.

The meaning of the `i`^{th} entry of `p` is that the surfer visits page `i` at any given time with probability `p`_{i}.

**Problem 2:** Let `A` be a transition matrix associated with a graph and `B` be a matrix of size `n` with all entries equal to 1/`n`. Consider `x` a positive real number smaller than 1. Then the matrix `C` = `x`·`A` + (1-`x`)·`B` is column-stochastic. Show that this is true in the special case of Problem 1.
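The graph of Problem 1 is not reproduced in this text, so the sketch below uses a stand-in (the 3-cycle 1→2→3→1, an assumption) to check the same column-sum argument numerically: each column of `C` sums to `x`·1 + (1-`x`)·1 = 1.

```python
# Column-stochasticity of C = x*A + (1-x)*B, checked on a stand-in graph
# (the 3-cycle 1 -> 2 -> 3 -> 1), since Problem 1's graph is not shown here.
from fractions import Fraction

A = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]                       # transition matrix of the 3-cycle
n = len(A)
B = [[Fraction(1, n)] * n for _ in range(n)]
x = Fraction(85, 100)                 # any 0 < x < 1 works the same way
C = [[x * A[i][j] + (1 - x) * B[i][j] for j in range(n)] for i in range(n)]

# Every column of C sums to x + (1 - x) = 1.
print(all(sum(C[r][c] for r in range(n)) == 1 for c in range(n)))  # True
```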

# Lecture #3: PageRank Algorithm – The Mathematics of Google Search

We live in a computer era. The Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine, like Google, AltaVista, or Yahoo, type in the key words, and the search engine will display the pages relevant to your search. But how does a search engine really work?

At first glance, it seems reasonable to imagine that what a search engine does is to keep an index of all web pages, and when a user types in a search query, the engine browses through its index and counts the occurrences of the key words in each web file. The winners are the pages with the highest number of occurrences of the key words. These get displayed back to the user.

This used to be the correct picture in the early 90s, when the first search engines used *text-based ranking systems* to decide which pages are most relevant to a given query. There were, however, a number of problems with this approach. A search about a common term such as “Internet” was problematic. The first page displayed by one of the early search engines was written in Chinese, with repeated occurrences of the word “Internet” and containing no other information about the Internet. Moreover, suppose we wanted to find some information about Cornell. We type in the word “Cornell” and expect that “www.cornell.edu” would be the most relevant site for our query. However, there may be millions of pages on the web using the word Cornell, and www.cornell.edu may not be the one that uses it most often. Suppose we decided to write a web site that contains the word “Cornell” a billion times and nothing else. Would it then make sense for our web site to be the first one displayed by a search engine? The answer is obviously no. However, if all a search engine does is count occurrences of the words given in the query, this is exactly what might happen.

The usefulness of a search engine depends on the *relevance* of the result set it gives back. There may of course be millions of web pages that include a particular word or phrase; however some of them will be more relevant, popular, or authoritative than others. A user does not have the ability or patience to scan through all pages that contain the given query words. One expects the relevant pages to be displayed within the top 20-30 pages returned by the search engine.

Modern search engines employ methods of ranking the results to provide the “best” results first that are more elaborate than just plain *text ranking*. One of the best-known and most influential algorithms for computing the relevance of web pages is the PageRank algorithm used by the Google search engine. It was invented by Larry Page and Sergey Brin while they were graduate students at Stanford, and it became a Google trademark in 1998. The idea that PageRank brought up was that the importance of any web page can be judged by looking at the pages that link to it. If we create a web page `i` and include a hyperlink to the web page `j`, this means that we consider `j` important and relevant for our topic. If there are a lot of pages that link to `j`, this means that the common belief is that page `j` is important. If, on the other hand, `j` has only one backlink, but that comes from an authoritative site `k` (like www.google.com, www.cnn.com, www.cornell.edu), we say that `k` transfers its authority to `j`; in other words, `k` asserts that `j` is important. Whether we talk about popularity or authority, we can iteratively assign a rank to each web page, based on the ranks of the pages that point to it.

To this end, we begin by picturing the Web as a directed graph, with nodes represented by web pages and edges represented by the links between them.

Suppose for instance, that we have a small Internet consisting of just 4 web sites www.page1.com, www.page2.com, www.page3.com, www.page4.com, referencing each other in the manner suggested by the picture:

We “translate” the picture into a directed graph with 4 nodes, one for each web site. When web site `i` references `j`, we add a directed edge from node `i` to node `j` in the graph. For the purpose of computing their PageRank, we ignore any navigational links such as back or next buttons, as we only care about the connections between different web sites. For instance, Page1 links to all of the other pages, so node 1 in the graph will have outgoing edges to all of the other nodes. Page3 has only one link, to Page1, therefore node 3 will have one outgoing edge to node 1. After analyzing each web page, we get the following graph:

In our model, each page should transfer its importance evenly to the pages that it links to. Node 1 has 3 outgoing edges, so it will pass on ⅓ of its importance to each of the other 3 nodes. Node 3 has only one outgoing edge, so it will pass on all of its importance to node 1. In general, if a node has `k` outgoing edges, it will pass on 1/`k` of its importance to each of the nodes that it links to. Let us better visualize the process by assigning weights to each edge.

Let us denote by `A` the transition matrix of the graph, `A` = .

**Dynamical systems point of view:** Suppose that initially the importance is uniformly distributed among the 4 nodes, each getting ¼. Denote by `v` the initial rank vector, having all entries equal to ¼. Each incoming link increases the importance of a web page, so at step 1 we update the rank of each page by adding to the current value the importance of the incoming links. This is the same as multiplying the matrix `A` with `v`. At step 1, the new importance vector is `v`_{1} = `Av`. We can iterate the process; thus at step 2, the updated importance vector is `v`_{2} = `A`(`Av`) = `A`^{2}`v`. Numeric computations give:

We notice that the sequence of iterates `v`, `Av`, …, `A`^{k}`v` tends to an equilibrium value `v`^{*}. We call this the PageRank vector of our web graph.
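The iteration can be carried out in a few lines. The transition matrix is not printed in this text; the one below is an assumption consistent with Example 6 of Lecture 1 (equilibrium proportional to 12, 4, 9, 6):

```python
# Power iteration v <- A v on the assumed 4-page transition matrix
# (reconstructed to match the eigenvector with entries 12, 4, 9, 6).
A = [[0,     0,   1, 0.5],
     [1 / 3, 0,   0, 0],
     [1 / 3, 0.5, 0, 0.5],
     [1 / 3, 0.5, 0, 0]]

v = [0.25, 0.25, 0.25, 0.25]  # initial rank vector, all entries 1/4
for _ in range(100):
    v = [sum(A[i][j] * v[j] for j in range(4)) for i in range(4)]

print([round(entry, 4) for entry in v])  # approaches [12/31, 4/31, 9/31, 6/31]
```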

**Linear algebra point of view:** Let us denote by `x`_{1}, `x`_{2}, `x`_{3}, and `x`_{4} the importance of the four pages. Analyzing the situation at each node we get the system:

This is equivalent to asking for the solutions of the equation `x` = `A`·`x`. From Example 6 in Lecture 1 we know that the eigenvectors corresponding to the eigenvalue 1 are scalar multiples of the vector with entries 12, 4, 9, and 6. Since PageRank should reflect only the relative importance of the nodes, and since the eigenvectors are just scalar multiples of each other, we can choose any of them to be our PageRank vector. Choose `v`^{*} to be the unique eigenvector with the sum of all entries equal to 1. (We will sometimes refer to it as the probabilistic eigenvector corresponding to the eigenvalue 1.) This eigenvector is our PageRank vector.

**Probabilistic point of view:** Since the importance of a web page is measured by its popularity (how many incoming links it has), we can view the importance of page `i` as the probability that a random surfer on the Internet who opens a browser to any page and starts following hyperlinks visits page `i`. We can interpret the weights we assigned to the edges of the graph in a probabilistic way: a random surfer currently viewing web page 2 has probability ½ to go to page 3, and probability ½ to go to page 4. We can model the process as a random walk on graphs. Each page has equal probability ¼ to be chosen as a starting point. So the initial probability distribution is given by the column vector `x` = [¼ ¼ ¼ ¼]^{t}. The probability that page `i` will be visited after one step is equal to the `i`^{th} entry of `Ax`, and so on. The probability that page `i` will be visited after `k` steps is equal to the corresponding entry of `A`^{k}`x`. The sequence `Ax`, `A`^{2}`x`, `A`^{3}`x`, …, `A`^{k}`x`, … converges in this case to a unique probabilistic vector `v`^{*}. In this context `v`^{*} is called the stationary distribution and it will be our PageRank vector. Moreover, the `i`^{th} entry in the vector `v`^{*} is simply the probability that at each moment a random surfer visits page `i`. The computations are identical to the ones we did in the dynamical systems interpretation; only the meaning we attribute to each step is slightly different.

The Page Rank vector `v`^{*} we have computed by different methods, indicates that page 1 is the most relevant page. This might seem surprising since page 1 has 2 backlinks, while page 3 has 3 backlinks. If we take a look at the graph, we see that node 3 has only one outgoing edge to node 1, so it transfers all its importance to node 1. Equivalently, once a web surfer that only follows hyperlinks visits page 3, he can only go to page 1. Notice also how the rank of each page is not trivially just the weighted sum of the edges that enter the node. Intuitively, at step 1, one node receives an importance vote from its direct neighbors, at step 2 from the neighbors of its neighbors, and so on.

Changing the web graph might lead to certain problems.

**Nodes with no outgoing edges (dangling nodes)**

We iteratively compute the rank of the 3 pages:


So in this case the rank of every page is 0. This is counterintuitive, as page 3 has 2 incoming links, so it must have some importance!

**Disconnected components**

A random surfer that starts in the first connected component has no way of getting to web page 5, since nodes 1 and 2 have no links to node 5 that he can follow. Linear algebra fails to help as well. The transition matrix for this graph is . Notice that there are two eigenvectors corresponding to the eigenvalue 1 that are not just trivially scalar multiples of each other. So, both in theory and in practice, the notion of ranking pages from the first connected component relative to the ones from the second connected component is ambiguous.

The web is very heterogeneous by its nature, and certainly huge, so we do not expect its graph to be connected. Likewise, there will be pages that are plainly descriptive and contain no outgoing links. What is to be done in this case? We need an unambiguous meaning of the rank of a page, for any directed web graph with `n` nodes.

**The solution of Page and Brin:** In order to overcome these problems, fix a positive constant `p` between 0 and 1, which we call the damping factor (a typical value for `p` is 0.15). Define the PageRank matrix (also known as the Google matrix) of the graph by `M` = (1-`p`)·`A` + `p`·`B`, where `B` is the matrix of size `n` with all entries equal to 1/`n`.

The matrix `M` models the random surfer model as follows: most of the time, a surfer will follow links from a page: from a page `i` the surfer will follow the outgoing links and move on to one of the neighbors of `i`. A smaller, but positive, percentage of the time, the surfer will dump the current page and choose arbitrarily a different page from the web and “teleport” there. The damping factor `p` reflects the probability that the surfer quits the current page and “teleports” to a new one. Since he/she can teleport to any web page, each page has probability 1/`n` to be chosen. This justifies the structure of the matrix `B`.

**Problem 2.** Redo the computations for the PageRank with the transition matrix `A` replaced with the matrix `M`, for the graphs representing the dangling nodes, respectively the disconnected components. Do the problems mentioned there still occur?

Intuitively, the matrix `M` “connects” the graph and gets rid of the dangling nodes. A node with no outgoing edges now has a positive probability of moving to any other node. Rigorously, for the matrix `M`, the following theorems apply:
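Forming `M` and iterating can be sketched as follows. The transition matrix `A` below is the assumed 4-page matrix from Lecture 3 (reconstructed, not printed in this text), with `p` = 0.15 as suggested above:

```python
# Sketch: Google matrix M = (1 - p) * A + p * B, followed by power iteration.
# A is an assumed reconstruction of the 4-page transition matrix of Lecture 3.
p = 0.15
n = 4
A = [[0,     0,   1, 0.5],
     [1 / 3, 0,   0, 0],
     [1 / 3, 0.5, 0, 0.5],
     [1 / 3, 0.5, 0, 0]]
B = [[1 / n] * n for _ in range(n)]
M = [[(1 - p) * A[i][j] + p * B[i][j] for j in range(n)] for i in range(n)]

# M is positive and column-stochastic, so (by the theorems below) the power
# method converges to the unique probabilistic eigenvector for eigenvalue 1.
v = [1 / n] * n
for _ in range(100):
    v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

print([round(entry, 3) for entry in v])  # entries are positive and sum to 1
```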

**Perron-Frobenius Theorem:** If `M` is a positive, column stochastic matrix, then:

- 1 is an eigenvalue of multiplicity one.
- 1 is the largest eigenvalue: all the other eigenvalues are in modulus smaller than 1.
- the eigenvector corresponding to eigenvalue 1 has all entries positive. In particular, for the eigenvalue 1 there exists a unique eigenvector with the sum of its entries equal to 1.

**Power Method Convergence Theorem:** Let `M` be a positive, column-stochastic `n` × `n` matrix. Denote by `v`^{*} its probabilistic eigenvector corresponding to the eigenvalue 1. Let `z` be the column vector with all entries equal to 1/`n`. Then the sequence `z`, `Mz`, …, `M`^{k}`z` converges to the vector `v`^{*}.

In view of everything discussed above, we conclude that:

**Fact:** The PageRank vector for a web graph with transition matrix `A` and damping factor `p` is the unique probabilistic eigenvector of the matrix `M` corresponding to the eigenvalue 1.

From the mathematical point of view, once we have `M`, computing the eigenvectors corresponding to the eigenvalue 1 is, at least in theory, a straightforward task. As in Lecture 1, just solve the system `M`·`x` = `x`! But when the matrix `M` has size 30 billion (as it does for the real Web graph), even mathematical software such as Matlab or Mathematica is clearly overwhelmed.

An alternative way of computing the probabilistic eigenvector corresponding to the eigenvalue 1 is given by the Power Method. The theorem guarantees that the method works for positive, column-stochastic matrices. We reasoned that the iteration process corresponds to the way importance distributes over the net following the link structure (recall the random surfer model). Computationally speaking, it is much easier, starting from a vector `x`, to compute `x`, `M``x`, …, `M`^{n}`x` until convergence than it is to compute the eigenvectors of `M` directly. In fact, in this case one needs to compute only the first couple of iterates in order to get a good approximation of the PageRank vector. For a random matrix, the power method is in general known to be slow to converge. What makes it work fast in this case, however, is the fact that the web graph is *sparse*. This means that a node `i` has a small number of outgoing links (a couple of hundred at best, which is extremely small compared to the 30 billion nodes it could theoretically link to). Hence the transition matrix `A` has a lot of entries equal to 0.

We end the lecture by proposing the following problems:

**Problem 3.** Compute the PageRank vector of the following graph, considering the damping constant `p` to be successively `p` = 0, `p` = 0.15, `p` = 0.5, and respectively `p` = 1.

**Problem 4.** Compute the PageRank vector of the directed tree depicted below, considering that the damping constant `p` = 0.15. Interpret your results in terms of the relationship between the number of incoming links that each node has and its rank.

# Lecture #4: HITS Algorithm – Hubs and Authorities on the Internet

Around the same time that PageRank was being developed, Jon Kleinberg, a professor in the Department of Computer Science at Cornell, came up with his own solution to the Web Search problem. He developed an algorithm that made use of the link structure of the web in order to discover and rank pages relevant for a particular topic. **HITS (hyperlink-induced topic search)** is now part of the **Ask** search engine (www.Ask.com).

One of the interesting points that he brought up was that the human perspective on how a search process should go is more complex than just comparing a list of query words against a list of documents and returning the matches. Suppose we want to buy a car and type in a general query phrase like “the best automobile makers in the last 4 years”, perhaps with the intention of getting back a list of top car brands and their official web sites. When you ask this question of your friends, you expect them to understand that automobile means car, vehicle, and that automobile is a general concept that includes vans, trucks, and other types of cars. When you ask this question of a computer that is running a text-based ranking algorithm, things might be very different. That computer will count all occurrences of the given words in a given set of documents, but will not do intelligent rephrasing for you. The list of top pages we get back, while algorithmically correct, might be very different from what we expected. One problem is that most official web sites are not sufficiently self-descriptive. They might not advertise themselves the way the general public perceives them. Top companies like Hyundai and Toyota might not even use the term “automobile makers” on their web sites. They might use the term “car manufacturer” instead, or just describe their products and their business.

What is to be done in this case? It would of course be great if computers had a dictionary or ontology, such that for any query they could figure out synonyms and equivalent meanings of phrases. This might improve the quality of search; nevertheless, in the end, we would still have a text-based ranking system for the web pages. We would still be left with the initial problem of sorting the huge number of pages that are relevant to the different meanings of the query phrase. We can easily convince ourselves that this is the case. Just remember one of our first examples, about a page that repeats the phrase “automobile makers = cars manufacturers = vehicle designers” a billion times. This web page would be the first one displayed by the query engine. Nevertheless, it contains practically no usable information.

The conclusion is that even if trying to find pages that contain the query words should be the starting point, a different ranking system is needed in order to find those pages that are **authoritative** for a given query. Page `i` is called an *authority* for the query “automobile makers” if it contains valuable information on the subject. Official web sites of car manufacturers, such as www.bmw.com, HyundaiUSA.com, www.mercedes-benz.com, would be authorities for this search. Commercial web sites selling cars might be authorities on the subject as well. These are the pages truly relevant to the given query, the ones the user expects back from the query engine. However, there is a second category of pages relevant to the process of finding the authoritative pages, called *hubs*. Their role is to advertise the authoritative pages: they contain useful links towards the authoritative pages. In other words, hubs point the search engine in the “right direction”. In real life, when you buy a car, you are more inclined to purchase it from a dealer that your friend recommends. Following the analogy, the authority in this case would be the car dealer, and the hub would be your friend. You trust your friend, therefore you trust what your friend recommends. On the world wide web, hubs for our query about automobiles might be pages that contain rankings of cars, blogs where people discuss the cars they purchased, and so on.

Jon Kleinberg’s algorithm, called **HITS**, identifies good authorities and hubs for a topic by assigning two numbers to each page: an authority weight and a hub weight. These weights are defined recursively. A page receives a high authority weight if it is pointed to by pages with high hub weights, and a high hub weight if it points to many pages with high authority weights.

In order to get a set rich in both hubs and authorities for a query `Q`, we first collect the top 200 documents that contain the highest number of occurrences of the search phrase `Q`. These, as pointed out before, may not be of tremendous practical relevance, but one has to start somewhere. Kleinberg points out that the pages from this set, called the root set (`R`_{Q}), are essentially very heterogeneous and in general contain only a few (if any) links to each other. So the web subgraph determined by these nodes is almost totally disconnected; in particular, we cannot apply PageRank techniques on `R`_{Q}.

Authorities for the query `Q` are not extremely likely to be in the root set `R`_{Q}. However, they are likely to be pointed to by at least one page in `R`_{Q}. So it makes sense to extend the subgraph `R`_{Q} by including all edges coming from or pointing to nodes of `R`_{Q}. We denote the resulting subgraph by `S`_{Q} and call it the *seed* of our search. Notice that the `S`_{Q} we have constructed is a reasonably small graph (it is certainly much smaller than the 30-billion-node web graph!). It is also likely to contain a lot of authoritative sources for `Q`. The question that remains is how to recognize and rate them. Heuristically, authorities on the same topic should have a lot of common pages from `S`_{Q} pointing to them. Using our previous terminology, there should be a great overlap in the sets of hubs that point to them.
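The seed construction just described can be sketched in a few lines of Python. The helper name `build_seed` and the tiny edge list below are our own, purely for illustration; a real crawl would of course work over the web graph itself.

```python
# Sketch: build the seed S_Q from a root set R_Q by keeping every edge
# of the graph that starts or ends at a root node, together with the
# endpoints of those edges. Edge list and node names are hypothetical.
def build_seed(edges, root):
    seed_edges = [(i, j) for (i, j) in edges if i in root or j in root]
    seed_nodes = set(root)
    for i, j in seed_edges:
        seed_nodes.update((i, j))   # include both endpoints of each kept edge
    return seed_nodes, seed_edges

# Toy "web graph": a chain of five pages.
web = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
nodes, links = build_seed(web, {"b"})
# nodes -> {"a", "b", "c"}; links -> [("a", "b"), ("b", "c")]
```

Note that pages like `"a"` and `"c"`, which are not in the root set, enter the seed only because they share an edge with a root page, exactly as in the text.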

From here on, we translate everything into mathematical language. We associate to each page `i` two numbers: an authority weight `a`_{i} and a hub weight `h`_{i}. We consider pages with a higher `a`_{i} number as being better authorities, and pages with a higher `h`_{i} number as being better hubs. Given the weights {`a`_{i}} and {`h`_{i}} of all the nodes in `S`_{Q}, we dynamically update the weights as follows: the new authority weight of page `i` is the sum of the hub weights of all pages pointing to `i`, while the new hub weight of page `i` is the sum of the authority weights of all pages that `i` points to.

A good hub increases the authority weight of the pages it points to. A good authority increases the hub weight of the pages that point to it. The idea is then to apply the two operations above alternately until equilibrium values for the hub and authority weights are reached.
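A minimal sketch of this alternating update, assuming the graph is given as a list of directed edges; the function name `hits`, the normalization step, and the toy graph are our own choices, not from the text:

```python
# Sketch of the alternating HITS updates on a graph with nodes 0..n-1
# and directed edges given as (source, target) pairs.
def hits(n, edges, steps=50):
    """Return (authority, hub) weight lists after `steps` iterations."""
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(steps):
        # A good hub increases the authority weight of the pages it points to.
        auth = [sum(hub[i] for (i, j) in edges if j == p) for p in range(n)]
        # A good authority increases the hub weight of the pages pointing to it.
        hub = [sum(auth[j] for (i, j) in edges if i == p) for p in range(n)]
        # Normalize so the weights do not blow up as we iterate.
        a_sum, h_sum = sum(auth) or 1.0, sum(hub) or 1.0
        auth = [a / a_sum for a in auth]
        hub = [h / h_sum for h in hub]
    return auth, hub

# Hypothetical toy graph: nodes 0 and 1 both point to node 2.
authority, hub = hits(3, [(0, 2), (1, 2)])
# authority -> [0.0, 0.0, 1.0]; hub -> [0.5, 0.5, 0.0]
```

The normalization inside the loop only rescales the vectors, so the relative weights — which are all that matter for ranking — are unchanged.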

Let `A` be the adjacency matrix of the graph `S`_{Q} and denote the authority weight vector by `v` and the hub weight vector by `u`, where `v` = (`a`_{1}, `a`_{2}, … , `a`_{n})^{t} and `u` = (`h`_{1}, `h`_{2}, … , `h`_{n})^{t}.

Let us notice that the two update operations described above translate to: `v` = `A`^{t}`u` and `u` = `Av`.

If we consider that the initial hub weights of the nodes are `u`_{0} = (1, 1, … , 1)^{t}, then after `k` steps we get the system:

`v`_{k} = (`A`^{t}`A`)^{k−1}`A`^{t}`u`_{0} and `u`_{k} = (`AA`^{t})^{k}`u`_{0}.

**Example:** Let us consider a very simple graph on three nodes, in which node 1 and node 2 each have a single edge pointing to node 3.

The adjacency matrix of the graph is the 3 × 3 matrix `A` whose only nonzero entries are `a`_{13} = `a`_{23} = 1; its transpose `A`^{t} has nonzero entries `a`_{31} = `a`_{32} = 1. Assume the initial hub weight vector is `u`_{0} = (1, 1, 1)^{t}.

We compute the authority weight vector by `v`_{1} = `A`^{t}`u`_{0} = (0, 0, 2)^{t}.

Then, the updated hub weight is `u`_{1} = `Av`_{1} = (2, 2, 0)^{t}.

This already corresponds to our intuition that node 3 is the most authoritative, since it is the only one with incoming edges, and that nodes 1 and 2 are equally important hubs. If we repeat the process further, we will only obtain scalar multiples of the vectors v and u computed at step 1. So the relative weights of the nodes remain the same.
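We can double-check these numbers mechanically. The adjacency matrix below assumes, as the discussion above suggests, that the graph consists of the edges 1 → 3 and 2 → 3 (nodes are indexed from 1 in the text, from 0 in the code); the tiny `matvec` and `transpose` helpers are our own:

```python
# Verify the worked example in matrix form, under the assumption that
# the graph has exactly the edges 1 -> 3 and 2 -> 3.
def matvec(M, x):
    """Multiply matrix M (list of rows) by column vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]                    # a_13 = a_23 = 1, zeros elsewhere
u0 = [1, 1, 1]                     # initial hub weights

v1 = matvec(transpose(A), u0)      # authority weights: A^t u0 = (0, 0, 2)
u1 = matvec(A, v1)                 # updated hub weights: A v1 = (2, 2, 0)
v2 = matvec(transpose(A), u1)      # next authority step: (0, 0, 4) = 2 * v1
```

As claimed, the second authority vector is just a scalar multiple of the first, so further iterations change only the scale, not the ranking.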

For more complicated examples of graphs, we would expect the convergence to be problematic, and the equilibrium solutions (if there are any) to be more difficult to find.

**Theorem:** Under the assumptions that `AA`^{t} and `A`^{t}`A` are primitive matrices, the following statements hold:

- If `v`_{1}, … , `v`_{k}, … is the sequence of authority weights we have computed, then `V`_{1}, … , `V`_{k}, … converges to the unique probabilistic vector corresponding to the dominant eigenvalue of the matrix `A`^{t}`A`. With a slight abuse of notation, we denote here by `V`_{k} the vector `v`_{k} normalized so that the sum of its entries is 1.
- Likewise, if `u`_{1}, … , `u`_{k}, … are the hub weights that we have iteratively computed, then `U`_{1}, … , `U`_{k}, … converges to the unique probabilistic vector corresponding to the dominant eigenvalue of the matrix `AA`^{t}. We use the same notation, that is, `U`_{k} = (1 ⁄ `c`)`u`_{k}, where `c` is the scalar equal to the sum of the entries of the vector `u`_{k}.

So the authority weight vector is the probabilistic eigenvector corresponding to the largest eigenvalue of `A`^{t}`A`, while the hub weights of the nodes are given by the probabilistic eigenvector of the largest eigenvalue of `AA`^{t}.

In the background we rely on the following mathematical theorems:

**Theorems:**

- The matrices `AA`^{t} and `A`^{t}`A` are real and symmetric, so they have only real eigenvalues.
- **Perron–Frobenius.** If `M` is a primitive matrix, then:
  - The largest eigenvalue λ of `M` is positive and of multiplicity 1.
  - Every other eigenvalue of `M` is strictly less than λ in modulus.
  - The largest eigenvalue λ has a corresponding eigenvector with all entries positive.
- Let `M` be a non-negative, symmetric, primitive matrix and let `v` be the eigenvector of its largest eigenvalue, normalized so that the sum of its entries equals 1. If `z` is a column vector with all entries non-negative, then the sequence of normalized vectors `z`, `Mz`, … , `M`^{k}`z` converges to `v`.

We use the notion of “convergence” here in a loose sense. We say that a sequence of vectors `z`_{k} converges to a vector `v` in the intuitive sense that as `k` gets big, the entries of the column vector `z`_{k} are very close to the corresponding entries of the column vector `v`. Without going into the technical details of the proof, the power method works because there is only one largest eigenvalue, and it dominates the behavior.
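The last theorem above is exactly the power method. The sketch below shows it on a small symmetric primitive matrix; the 2 × 2 matrix is a hypothetical example of our own (dominant eigenvalue 3, eigenvector proportional to (1, 1)), not a matrix from the text:

```python
# Sketch of the power method: repeatedly apply M and renormalize so the
# entries sum to 1; the sequence converges to the probabilistic
# eigenvector of the dominant eigenvalue.
def power_method(M, z, steps=100):
    for _ in range(steps):
        z = [sum(m * v for m, v in zip(row, z)) for row in M]  # z <- M z
        total = sum(z)
        z = [x / total for x in z]   # normalize entries to sum to 1
    return z

M = [[2.0, 1.0],
     [1.0, 2.0]]                     # symmetric, primitive, non-negative
v = power_method(M, [1.0, 0.0])
# Eigenvalues of M are 3 and 1, so v approaches (0.5, 0.5).
```

The error at each step shrinks roughly by the ratio of the second-largest to the largest eigenvalue (here 1/3), which is why the single dominant eigenvalue guarantees convergence.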

The **HITS** algorithm is in the same spirit as **PageRank**. They both make use of the link structure of the Web graph in order to decide the relevance of the pages. The difference is that unlike the **PageRank** algorithm, **HITS** only operates on a small subgraph (the seed `S`_{Q}) of the web graph. This subgraph is query dependent; whenever we search with a different query phrase, the seed changes as well. **HITS** ranks the seed nodes according to their authority and hub weights. The highest ranking pages are displayed to the user by the query engine.

**Problem 1:** Prove that for any square matrix `A`, the matrices `A`^{t}`A` and `AA`^{t} are symmetric.

**Problem 2: **Compute the hub and authority weights for the following graph:

*Hint:* Compute the adjacency matrix `A` and show that the eigenvalues of `AA`^{t} are 0, 1, and 3. The normalized eigenvector for the largest eigenvalue λ = 3 is the hub weight vector. What can you say about the authority weights of the 4 nodes? Does this correspond to your intuition?

# Lecture #5: Internet Mathematics – Further Directions

The development of the World Wide Web has led to a whole new field of mathematics called Internet Mathematics. It is a mixture of probability, linear algebra, graph theory, and dynamical systems, designed to answer pressing questions about how information propagates over the net.

In the previous lectures you have seen how Google uses PageRank to determine the relevance of web pages, thus making Internet search simple and efficient. From a web designer’s perspective, it is important not only to create a nice web site featuring interesting graphical effects, but also to attract links from other pages to your web site. A good PageRank can turn your business into a very profitable one. From Google’s point of view, however, it is important to keep PageRank computations accurate and to prevent fraud attempts. *Link spam* is defined as the intentional attempt to improve the ranking of a page on the search engine by increasing the number of pages that link to it. *Link farms* involve creating strongly connected communities of web pages referencing each other, humorously known as “mutual admiration societies”. Although some link farms can be created by hand, most are created through automated programs and services. It is important therefore to recognize dense subgraphs of the web graph designed to do just that, and to eliminate them from PageRank computations. Although at its core PageRank is as described in the previous section, in the real world Google uses improved search algorithms to deal with these problems; these algorithms can assign different weights to outgoing links, or decide that certain links should not transfer PageRank at all, to name just a few of the extra features. It is important to notice, however, that as stated on Google’s website, “while we have dozens of engineers working to improve every aspect of Google on a daily basis, PageRank continues to provide the basis for all of our web search tools”.

Another research direction is the analysis and comparison of different ranking algorithms. One property that makes an algorithm good is stability. The World Wide Web is in constant change, pages are added and deleted at every moment. Stability measures how much rankings change if the Web graph is slightly perturbed. Ideally we would like the ranking algorithms to be stable so that, for instance, we would not have to constantly recompute the PageRank.

With a lot of interest coming from both mathematicians and computer scientists, understanding the structure of the Internet graph is a key research problem. Developing models for an Internet graph that has billions of nodes is, however, far from trivial. The web is often described as having a *bow-tie* structure. The knot of the bow-tie is represented by a strongly connected component of the graph, called the *core*. One side of the bow-tie consists of pages that link to the core, while the other side contains pages that receive at least one link from the core. The rest of the web pages are either in disconnected components or in *tendrils* that only link to pages outside of the core. In this spider net, *social networks* and web communities are defined as subgraphs with more internal links than external ones. Spectral methods are used to analyse the distribution of these communities.

*Semi-random Internet topology. The Internet nodes are red dots and linkages are green lines.*

Understanding the Internet graph can help answering questions about how information propagates over the net. Computer virus propagation over the net is of particular interest. Understanding the rate at which computer viruses propagate can help antivirus software makers to anticipate and quarantine infected areas.

Although in our previous lectures we only talked about finite graphs, we can also view the web as an infinite graph. The reason for this is that web pages are of two kinds: static and dynamic. Static pages contain information that remains unmodified. Dynamic pages are those built “on the spot”, as the result of a query. For instance, if we do a Google search with the phrase “mathematics and computer science”, Google returns a page with relevant links and their descriptions. Of course, Google does not have a page prepared beforehand for every possible query a random user might have. For every query it receives, it searches through its index to track the documents which contain the given search words, then runs the PageRank algorithm to rank the relevant hits. Only after that does it build up the answer to the user. We call such a page a dynamic page. While the number of static pages is finite (over 30 billion, but still finite), the number of dynamic pages is virtually infinite. Such a complete view of the web, including both static and dynamic content, can be given by infinite graph theory. The mathematics behind it is both interesting and challenging, and it is an ongoing research field.

The book *A Course on the Web Graph* by Anthony Bonato and the online journal *Internet Mathematics*, established in 2003, provide excellent sources of information on how mathematics improves our Internet experience. All the topics discussed above are covered in detail, with a lot of entertaining examples.

# References

We found most useful in preparing these lectures *A Course on the Web Graph* by Anthony Bonato and *Authoritative sources in a hyperlinked environment* by Jon Kleinberg, which are both written in a very clear and engaging style. We used the second one for presenting the philosophy of searching the web and the solution given by the HITS algorithm. *The $25,000,000,000 Eigenvector: The Linear Algebra behind Google* provides a comprehensive treatment of the PageRank algorithm while still remaining elementary and accessible. For a more computer-science-oriented view of a search engine, it is very useful to read *The Anatomy of a Large-Scale Hypertextual Web Search Engine* by the Google founders Larry Page and Sergey Brin. The book *A Course on the Web Graph* is addressed to both mathematicians and computer scientists. It provides a comprehensive treatment of the state of the art in many applications of graph theory to the study of the Web graph. The last reference is to the online journal *Internet Mathematics*.

- Anthony Bonato, *A Course on The Web Graph*, AMS–AARMS, Graduate Studies in Mathematics v. 89, 2008. http://www.math.ryerson.ca/~abonato/webgraph.html
- Kurt Bryan, Tanya Leise, *The $25,000,000,000 Eigenvector: The Linear Algebra behind Google*, SIAM Review, Vol. 48, No. 3 (2006). http://www.siam.org/journals/sirev/48-3/62328.html
- Sergey Brin, Lawrence Page, *The Anatomy of a Large-Scale Hypertextual Web Search Engine*, Seventh International World-Wide Web Conference (WWW 1998). http://infolab.stanford.edu/pub/papers/google.pdf
- M. Brin, G. Stuck, *Introduction to Dynamical Systems*, Cambridge University Press, 2002.
- David Austin, *How Google Finds Your Needle in the Web’s Haystack*. http://www.ams.org/featurecolumn/archive/pagerank.html
- Jon Kleinberg, *Authoritative sources in a hyperlinked environment*, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. http://www.cs.cornell.edu/home/kleinber/auth.pdf
- Internet Mathematics Journal. http://www.internetmathematics.org/