Simple PHP Scraper Class

I gave a presentation entitled “The SEO’s Guide to Scraping Everything” on May 10th at the SEOmoz and SEER Interactive Meetup in Philadelphia, PA.  Since I only had 8 minutes to present, I figured I’d augment my presentation by providing a simple PHP scraper class that people can use (and extend) to get started with scraping.

You can download the scraper class here.

There’s a quick sample of how to use the scraper class below my slide deck from the meetup:

Using the Scraper:
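A minimal usage sketch — the URL is a placeholder, and the property names (`_title`, `_h1`, `_img_alt_pct`) follow the class listing further down:

```php
// Hypothetical usage of the Eppie_Service_Scraper class defined below.
$scraper = new Eppie_Service_Scraper();

// Optionally supply your own list of host:port proxy strings.
$scraper->setProxies(array('127.0.0.1:8080'));

// Fetch the page and populate the object's properties.
$scraper->scrape('http://www.example.com/');

echo $scraper->_title . "\n";        // the page <title>
print_r($scraper->_h1);              // array of <h1> texts
echo $scraper->_img_alt_pct . "\n";  // % of images that carry alt text
```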




And here’s the actual scraper class:


class Eppie_Service_Scraper
{
    public function __construct()
    {
        // set proxies -- you can add your own here or use the setProxies method
        $this->_proxies = array();
    }

    public function scrape($url)
    {
        $this->_url = $url;
        $dom = new DOMDocument();
        $proxy = $this->_pickProxy();

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_REFERER, "");
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19");
        if ($proxy !== false) {
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
            curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
        }
        $body = curl_exec($ch);
        curl_close($ch);

        // suppress libxml warnings; real-world HTML is often malformed
        @$dom->loadHTML($body);

        $this->_curl_result = $body;
        $this->_dom = $dom;
        $this->_parseDOM();
    }

    public function setProxies($proxies)
    {
        $this->_proxies = $proxies;
    }

    private function _pickProxy()
    {
        if (count($this->_proxies) > 0) {
            return $this->_proxies[rand(0, count($this->_proxies) - 1)];
        }
        return false;
    }

    public function setKeyword($keyword)
    {
        $this->_keyword = $keyword;
    }

    private function _parseDOM()
    {
        $xpath = new DOMXPath($this->_dom);
        $title = $xpath->query("//head/title");
        $meta_desc = $xpath->query("//head/meta[@name='description']/@content");
        $meta_kw = $xpath->query("//head/meta[@name='keywords']/@content");
        $img = $xpath->query("//img");
        $img_alt = $xpath->query("//img[@alt!='']/@alt");
        $strong = $xpath->query("//strong | //b");

        if ($title->length > 0) {
            $this->_title = $title->item(0)->nodeValue;
        }

        if ($meta_desc->length > 0) {
            $this->_meta_desc = $meta_desc->item(0)->nodeValue;
        }

        if ($meta_kw->length > 0) {
            $this->_meta_kw = $meta_kw->item(0)->nodeValue;
        }

        // collect the text of every h1-h6 into $this->_h1 ... $this->_h6
        foreach (array('h1', 'h2', 'h3', 'h4', 'h5', 'h6') as $tag) {
            $nodes = $xpath->query("//" . $tag);
            $prop = '_' . $tag;
            $this->$prop = array();
            for ($i = 0; $i < $nodes->length; $i++) {
                $this->{$prop}[] = $nodes->item($i)->nodeValue;
            }
        }

        $this->_img_alt = array();
        for ($i = 0; $i < $img_alt->length; $i++) {
            $this->_img_alt[] = $img_alt->item($i)->nodeValue;
        }

        // guard against division by zero on pages with no images
        $this->_img_alt_pct = ($img->length > 0)
            ? ($img_alt->length / $img->length) * 100
            : 0;

        $this->_strong = array();
        for ($i = 0; $i < $strong->length; $i++) {
            $this->_strong[] = $strong->item($i)->nodeValue;
        }
    }
}



Using DOMXPath for Parsing Page Content in PHP

The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine; that approach is adequate for a limited, one-time job (from the community Wiki). For anything beyond that, a proper DOM parser is the safer choice.

To apply XPath methods, once we have downloaded the page content we need to clean it up and load it into DOM and DOMXPath objects.

Here are the basic steps for working with the DOMXPath class:
  1. Initialize a DOMDocument class instance from the page content (working with the HTML as with XML).
  2. Initialize a DOMXPath class instance from the DOMDocument instance.
  3. Parse the DOMXPath object.

1. Initializing a DOMDocument class instance from page content

  • create a new DOMDocument class instance
$DOM = new DOMDocument;
When using this function, be sure to clear the internal libxml error buffer with libxml_clear_errors(). If you don’t, and you use this in a long-running process, you may find that all your memory is used up. See the ‘enable user error handling’ bullet point below.
  • load the HTML text into the DOMDocument object
  • enable user error handling
libxml_use_internal_errors(true);
if (!$DOM->loadHTML($page)) {
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br>";
    }
    print "libxml errors:<br>$errors";
    libxml_clear_errors();
}

Now the DOMDocument object (named ‘$DOM’) contains all the target text as an HTML DOM structure, ready for different methods and properties to be applied.

2. Initializing a DOMXPath object from the DOMDocument object

  • Initialize a DOMXPath object for further parsing
$xpath = new DOMXPath($DOM);

Now XPath methods are applicable to the content.

3. Parsing the DOMXPath object

As a test page I took the Blocks Testing Ground page and wrote code using XPath to retrieve data.

$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = 'div[not(@class="ads")]/span[1]';
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry) {
    echo " {$entry->firstChild->nodeValue} <br /> ";
}


How the libxml library reacts to malformed HTML

The libxml library gives no warning about malformed HTML that is unrelated to the part of the DOM being directly parsed, yet it does issue an error when the malformed HTML is itself the subject of the parse:

  • No warning for this case: <p><p><p>
  • For a missing bracket, <div prod='name1' <div …>, and then for an extra opened tag, <div prod='name1' ><div>, the library issued an exception from the DOMXPath ‘query’ method.
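To see this behavior yourself, here is a small self-contained check; exact error counts depend on your libxml version, so the script just reports what it finds:

```php
// Route libxml complaints into an internal buffer instead of PHP warnings.
libxml_use_internal_errors(true);

$DOM = new DOMDocument;

// Repeated unclosed <p> tags: libxml treats an open <p> as implicitly
// closed, so this typically parses without complaint.
$DOM->loadHTML('<html><body><p><p><p>text</body></html>');
printf("unclosed <p> tags: %d error(s)\n", count(libxml_get_errors()));
libxml_clear_errors();

// A genuinely broken tag (missing closing bracket) does get recorded.
$DOM->loadHTML("<html><body><div prod='name1' <div>text</div></body></html>");
printf("missing bracket: %d error(s)\n", count(libxml_get_errors()));
libxml_clear_errors();  // always clear the buffer in long-running processes
```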

The whole Scraper Listing

libxml_use_internal_errors(true); // collect libxml errors instead of emitting warnings
$curl = curl_init('');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the page rather than printing it
$page = curl_exec($curl);
if (curl_errno($curl)) { // check for execution errors
    echo 'Scraper error: ' . curl_error($curl);
}
$DOM = new DOMDocument;
if (!$DOM->loadHTML($page)) {
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br>";
    }
    print "libxml errors:<br>$errors";
    libxml_clear_errors();
}
$xpath = new DOMXPath($DOM);
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = 'div[not(@class="ads")]/span[1]';
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry) {
    echo " {$entry->firstChild->nodeValue} <br /> ";
}


User-Sensitive PageRank

Problems with Conventional PageRank

Conventional PageRank computes authority weights of different HTML pages based on a Random Surfer Model.

In this model a steady-state distribution of the Markov chain is computed based on a transition matrix defined by a surfer that uniformly randomly follows the page out-links. To meet certain mathematical requirements, a blend of such a random surfer with uniform teleportation is typically used.

In such an approach, a surfer either:

  • randomly selects an out-link to follow with probability c, or
  • “gets bored” and randomly selects some unconnected page to jump (or teleport) to, with probability 1−c.

According to a conventional formulation, PageRank can be introduced as a vector defined over all nodes of a Web graph that satisfies the following PageRank linear system:

p = c·Pᵀ·p + (1 − c)·v.  (1)

Here P is a Markov transition matrix in which

P_ij = 1 / deg(i) if there is a link i → j, and
P_ij = 0 if there is no link i → j;
c is a teleportation coefficient, usually picked around 0.85–0.9;
v = (1/n, 1/n, . . . , 1/n) is a uniform teleportation vector;
n is the total number of Web pages.

The system can be rewritten in a more straightforward component-wise way that explicitly uses the Web graph structure (deg(i) is the out-degree of node i):

p_j = c·Σ_{i→j} p_i / deg(i) + (1 − c)·v_j  (2)

Many iterative methods of solving PageRank equation (1) have been proposed. 1
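To make the iteration concrete, here is a toy power-iteration sketch for equation (1); the four-page graph, the damping value, and the iteration count are all invented for illustration:

```php
// Power iteration for p = c*P^T*p + (1 - c)*v on a made-up 4-page graph.
// $links[$i] lists the pages that page $i links out to.
$links = array(
    0 => array(1, 2),
    1 => array(2),
    2 => array(0),
    3 => array(0, 2),
);
$n = count($links);
$c = 0.85;                        // teleportation coefficient
$v = array_fill(0, $n, 1 / $n);   // uniform teleportation vector
$p = array_fill(0, $n, 1 / $n);   // initial PageRank guess

for ($iter = 0; $iter < 50; $iter++) {
    $next = array_fill(0, $n, 0);
    foreach ($links as $i => $outs) {
        foreach ($outs as $j) {
            $next[$j] += $c * $p[$i] / count($outs); // P_ij = 1/deg(i)
        }
    }
    for ($j = 0; $j < $n; $j++) {
        $next[$j] += (1 - $c) * $v[$j];
    }
    $p = $next;
}
print_r($p); // pages 0 and 2, which collect the most in-links, rank highest
```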

And though the numerical properties of PageRank are relatively well studied, the usefulness of conventional formulations of PageRank in the relevancy ranking of query search results (one of its primary uses) is debatable. This is due in large part to the fact that some of the basic assumptions underlying widely used PageRank formulations are either flawed or not reflective of reality. Indeed, this fact is evidenced in the many attempts which have been made to adjust PageRank formulations to more realistic settings from the time of its introduction.

For example, the assumption that all the outgoing links in a Web page are followed by a surfer uniformly randomly is unrealistic. In reality, links can be classified into different groups, some of which are followed rarely if at all (e.g., disclaimer links). Such “internal links” are known to be less reliable and more self-promotional than “external links” yet are often weighted equally. Attempts to assign weights to links based on IR similarity measures have been made but are not widely used. 2

The uniform teleportation jump to all the Web pages is another example of an unrealistic assumption upon which conventional PageRank formulations are based. That is, nothing is further from reality than the assumption that users begin new sessions on major portals and obscure home pages with equal probability. Alternatively, it is sometimes assumed that teleportation is restricted to a trusted set of pages or sites. 3 However, this assumption is equally flawed in that it is intended to combat link spam rather than being reflective of real-world user behavior. An additional and less recognized problem is that attrition from different pages is very different and therefore cannot accurately be described by the same scalar coefficient 1−c.

Conventional PageRank formulations have another issue which relates to the manner in which they are used in practice. That is, because of the vast number of pages on the Web, PageRank computing is typically implemented with regard to aggregations of pages by site, host, or domain, also referred to as “blocked” PageRank. 4 In formulating viable blocked PageRank computations, links between pages have to be somehow aggregated to a block level. Unfortunately, most heuristics for performing this aggregation do not work well.

In view of the foregoing, new formulations of PageRank are needed which address these shortcomings.

User-Sensitive PageRank - P. Berkhin et al (2006)

Techniques are provided for generating an authority value of a first one of a plurality of documents and/or for generating a variety of ways to compute PageRank with reference to various types of data corresponding to actual user behavior.

Figure 1 is a flow diagram which illustrates this general idea. User data 100 which reflect the behavior and/or demographics of an underlying user population are collected and indexed (102). At least some of these data track the navigational behavior of the user population with regard to documents, pages, sites, and domains visited, and links selected. The user population, the computing context, and the techniques for collecting these data may vary considerably.

Figure 1.

First Component

A first component of the authority value is generated with reference to outbound links associated with the first document. The outbound links enable access to a first subset of the plurality of documents.

According to a more specific embodiment, generation of the first component includes assigning a weight to each of the outbound links. Each of the weights is derived with reference to a portion of the user data representing a frequency with which the corresponding outbound link was selected by a population of users.

Second Component

A second component of the authority value is generated with reference to a second subset of the plurality of documents. Each of the second subset of documents represents a potential starting point for a user session.

According to another specific embodiment, generation of the second component of the authority value includes generating a teleportation distribution which includes a term for each of the second subset of documents. Each of the terms is derived with reference to a portion of the user data representing relevance of the corresponding document among a population of users.

Third Component

A third component of the authority value is generated representing a likelihood that a user session initiated by any of a population of users will end with the first document.

The first, second, and third components of the authority value are combined to generate the authority value.

At least one of the first, second, and third components of the authority value is computed with reference to user data relating to at least some of the outbound links and the second subset of documents.

According to some embodiments, authority value components generated with reference to user data may also be generated with reference to conventional formulations for these components such as, for example, the components represented in equation (1). Moreover, these new and conventional components may be blended together to varying degrees to generate the authority value components of the present invention.

According to yet another embodiment, an authority value of a first one of a plurality of documents is generated. Text associated with each of a plurality of inbound links enabling access to the first document is identified. A weight is assigned to the text associated with each of the inbound links. Each of the weights is derived with reference to user data representing a frequency with which the corresponding inbound link was selected by a population of users. The authority value is generated with reference to the weights.


PageRank computation is performed for a plurality of pages and/or documents using a PageRank formulation constructed according to the present invention (104). Such PageRank formulations include at least one component which is derived with reference to the user data. In addition, the PageRank computation may be performed for each page/document on the Web or at some higher level of aggregation (e.g., site, host, domain, etc.). The PageRank computations may then be employed in support of a wide variety of applications (106) such as, for example, in relevancy determinations for the ranking of search results in response to user queries. And because the set of pages, the connections between them, and user behavior may change over time, the user data collection and PageRank computations may be iterated (dashed line) to ensure that they reflect the most current conditions in the computing environment.

Various embodiments of the present invention may employ PageRank formulations which incorporate or make use of user data in a variety of ways which address one or more of the issues described above. For example, as noted above the assumption of uniform endorsement along all outward-bound links associated with a page is unrealistic, e.g., internal links (e.g., disclaimer links) are typically not equal to external links. To the contrary, users “vote” by their behavior in terms of the links they actually select. Moreover, the popularity of links selected is not static, but changes over time.

Therefore, according to various embodiments, empirical data corresponding to link selection behavior by users are employed to weight outbound links in a PageRank formulation such that this user behavior is taken into account. According to a specific embodiment, the number of users who browsed from page i to page j along a link connecting the two pages is employed to assign to the link a weight which reflects a likelihood that a user will move along the directed edge corresponding to the link. Additional details regarding exemplary techniques by which this weighting may be accomplished are provided in U.S. Pat. No. 6,792,419 for System And Method For Ranking Hyperlinked Documents Based On A Stochastic Backoff Processes, the entire disclosure of which is incorporated herein by reference for all purposes.

Because most pages have very little traffic associated with them, and the traffic they do have corresponds to a low confidence estimate of user intent, according to a specific embodiment of the invention, the terms in the Markov transition matrix of equation (1) may instead be derived as follows:


w_ij = (1 + α·n_{i→j}) / (deg(i) + α·Σ_{i→j} n_{i→j})  (3)

where α ≥ 0 reflects some Laplace smoothing factor, and n_{i→j} is the number of users following a particular link. It should be noted that the coefficient α = 0 corresponds to a conventional formulation of this component. Notice also that higher values of n_{i→j} represent a higher impact on w_ij, in agreement with the fact that higher values imply higher confidence.
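A sketch of equation (3) in code, with invented click counts; note that the weights out of page i always sum to 1, since the numerators sum to deg(i) + α·Σ n_{i→j}:

```php
// Laplace-smoothed link weights, per equation (3):
//   w_ij = (1 + alpha * n_ij) / (deg(i) + alpha * sum_j n_ij)
function userLinkWeights(array $clicks, $alpha)
{
    $deg = count($clicks);        // out-degree of page i
    $total = array_sum($clicks);  // all observed traversals out of i
    $w = array();
    foreach ($clicks as $j => $n) {
        $w[$j] = (1 + $alpha * $n) / ($deg + $alpha * $total);
    }
    return $w;
}

// Page i has three out-links; users overwhelmingly follow 'b'.
$clicks = array('a' => 2, 'b' => 90, 'c' => 8);

print_r(userLinkWeights($clicks, 0));   // alpha = 0: uniform 1/deg(i)
print_r(userLinkWeights($clicks, 0.1)); // alpha = 0.1: skewed toward 'b'
```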

While equation (3) does incorporate some measure of the likelihood that specific links will be selected by users, more specific embodiments of the invention are contemplated which reflect further refinement of the underlying assumptions. That is, for example, users are not equal. Rather, they are part of a social network in which different weights can be assigned to different users based on a variety of factors. In addition, because the popularity of pages and links change over time, the incorporation of one or more recency factors into the PageRank formulation may be desirable. Third, the use of user data enables the creation of a targeted PageRank by aggregating user behavior over a particular user segment as defined by demographics, behavioral characteristics, user profile, etc.

According to a more specific embodiment of the invention, these refinements result in the following generalization of equation (3), in which u denotes a user and S stands for a particular user segment:


w_ij = (1 + α·Σ_{u∈S, u∈i→j} f(u)) / (deg(i) + α·Σ_{u∈S, u∈i→j} f(u))  (3A)

Here u ∈ i→j means that user u followed link i→j. According to one formulation, f(u) reflects user meta-data which may include, but are not limited to, weight, recency, tenure, and time spent on a page, thus yielding:


w_ij = (1 + α·Σ_{u∈S, u∈i→j} f(u_weight, u_recency, u_tenure, u_time-spent-on-j)) / (deg(i) + α·Σ_{u∈S, u∈i→j} f(u_weight, u_recency, u_tenure))  (3B)

Yet another specific embodiment reflects a further generalization of this idea. That is, conditioning on a user segment may assume use of a step function that is equal to one for users within S and to zero for users outside S. However, it should be noted that this idea may be generalized to any probability distribution ρ (in practice we can assign different significance levels to different user segments), thus yielding:


w_ij = (1 + α·Σ_{u∈i→j} ρ_u·f(u)) / (deg(i) + α·Σ_{u∈i→j} ρ_u·f(u))  (3C)

It should be noted that embodiments of the invention may work on any level of aggregation (i.e., for blocked PageRank formulations). For example, for a site- or host-level graph, a link between site I and site J exists if there are pages i ∈ I and j ∈ J connected by a hyperlink. Now we can assign weights W_IJ to the link I→J using a formula similar to any of (3)-(3C), with N_IJ being the count of users who proceeded from any page i in site I to any page j in site J.

Because of “dangling” pages, i.e., pages having no out-links, and because of the requirement of a graph’s strong connectivity (i.e., the Markov transition matrix P has to be irreducible), a degree of teleportation is added to the PageRank formulation of equation (1) as described above. And a typical teleportation distribution v = (v_j) used in a conventional PageRank formulation is selected either uniformly or uniformly among a subset of trusted pages. As noted above, both approaches have shortcomings. That is, users do not start from obscure pages with the same probability as from popular hubs (e.g., think of the effect of bookmarks), and uniform teleportation actually leads to link-based spam. On the other hand, what can be trusted is in dispute, and a restrictive definition of trust defeats the purpose of creating a strongly-connected graph.

Therefore, according to various embodiments of the invention, user data are utilized to meaningfully estimate a teleportation distribution for a PageRank formulation. Consider different user sessions. Each session has a first, or starting, page. Let m_j be the count of how many times page j was the first page in a session. Then, according to a specific embodiment, a realistic teleportation distribution v′ can be defined as a blend of a more conventional distribution (e.g., v as defined above) with a user-data-based component as follows:


v′_j = β·v_j + (1 − β)·m_j / Σ_j m_j  (4)
where 0 ≤ β ≤ 1 is a tuning parameter which adjusts the degree of blending of the two components. Again, it should be noted that β = 1 corresponds to a conventional formulation of this component. A higher β means a larger degree of exploration and a lesser degree of reliance on behavioral data. According to one exemplary embodiment, β = 0.2 is recommended as a reasonable tradeoff. It should be noted that equation (4) can be generalized in a manner similar to the generalization of equation (3) to equations (3A)-(3C) to incorporate user network utility, user tenure, recency, and time spent on a page. Even if relatively few pages on the Web actually have a non-zero count m_j, the idea leads to a good teleportation distribution, with a small β accounting for a degree of exploration. The fact that only a small fraction of pages on the Web would have a significant teleportation component agrees with the well-known fact that a small portion of pages actually carries the bulk of the PageRank distribution. Again, in deriving this teleportation distribution, we can take into account many other characteristics beyond frequency counts, as was done for equations (3A)-(3C).

The above-described embodiments suggest simple yet powerful frameworks for addressing two of the faulty assumptions underlying conventional PageRank formulations, i.e., uniform link weighting and uniform teleportation. According to further embodiments of the invention, another shortcoming of conventional PageRank formulations, the teleportation coefficient c, is addressed. Previously, it has been assumed that given a particular page, a random surfer “becomes bored” and jumps or “teleports” to a new session (i.e., at a new page) with uniform probability (1 − c). In reality, a uniform dropout rate is a very poor approximation. Therefore, according to various embodiments of the invention, user data are utilized to estimate individual teleportation coefficients for specific pages or blocks.
Let g_i be the fraction of sessions ending on page i among all sessions containing i. Then, according to a specific embodiment, a page-specific estimate of the dropout rate may be given by:

(1 − c_i) = (1 − c)·γ + (1 − γ)·g_i  (5)

where c is a conventional teleportation coefficient, and 0 ≤ γ ≤ 1 is a tuning parameter which enables varying degrees of blending of conventional teleportation coefficients with page-specific data. Here γ = 1 corresponds to a conventional formulation, with γ = 0.25 being a reasonable default.
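The two blends in equations (4) and (5) are straightforward to compute; a sketch with invented session counts and the suggested defaults β = 0.2 and γ = 0.25:

```php
// Blended teleportation distribution, per equation (4):
//   v'_j = beta * v_j + (1 - beta) * m_j / sum(m)
function blendedTeleportation(array $m, $beta)
{
    $n = count($m);
    $total = array_sum($m);
    $vprime = array();
    foreach ($m as $j => $mj) {
        $vprime[$j] = $beta * (1 / $n) + (1 - $beta) * ($mj / $total);
    }
    return $vprime;
}

// Page-specific dropout rate, per equation (5):
//   (1 - c_i) = (1 - c) * gamma + (1 - gamma) * g_i
function pageDropout($c, $gamma, $gi)
{
    return (1 - $c) * $gamma + (1 - $gamma) * $gi;
}

// Four pages; nearly all observed sessions start on the hub page 0.
$m = array(97, 1, 2, 0);
print_r(blendedTeleportation($m, 0.2));

// 40% of sessions containing page i end there.
echo pageDropout(0.85, 0.25, 0.40); // (0.15)(0.25) + (0.75)(0.40) = 0.3375
```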

As discussed above, equations (3), (4), and (5) compute quantities related to PageRank formulations with reference to data corresponding to actual user behavior. In addition, further generalizations make it possible to account for other elements of user behavior such as, for example, user network utility, user recency, user tenure, time spent on a page, etc., e.g., equations (3A)-(3C). However, because the confidence levels for user behavior estimates relating to infrequently visited pages are low, some regularization may be desirable for specific embodiments of the invention.

It can be argued that the fraction of pages for which user data are available is small in comparison with the realm of all Web pages. Were it not so, the count of visits per page would serve as a good approximation of authority. Therefore, as described above, embodiments of the invention utilize authority propagation from conventional PageRank formulations while deriving out-link weights, teleportation vectors, and teleportation coefficients based on user behavioral data, thus blending these two types of data to varying degrees. Thus, embodiments of the invention provide more accurate PageRank authority of all pages, including pages that have little or no visitation.

Put another way, embodiments of the present invention consolidate conventional formulations applicable to all pages with new formulations applicable to the relatively few frequently visited, and therefore high-authority, pages. According to some of the exemplary formulations described herein, this consolidation may be achieved to varying degrees using a kind of Laplace smoothing represented in equations (3)-(5) by the parameters α, β, γ. For α = 0 and β = γ = 1 the formulations reduce to the conventional formulations represented by equation (1). On the other hand, if any one of these three parameters departs from these values, some level of blending occurs and is therefore within the scope of the invention. Thus, it should be noted that embodiments of the invention are contemplated in which these tuning parameters range in value such that only one, two, or all three of the corresponding components are in play.

Further refinements and applications of the present invention will now be described.

User Segment Personalized PageRank

Many attempts have been made to define personalized PageRank formulations. For example, by selecting a narrow set of topic specific pages and restricting teleportation to these pages, a topical PageRank formulation can be constructed. According to specific embodiments of the present invention, PageRank formulations (or individual components thereof) derived in accordance with the present invention may be flexibly and straightforwardly applied to or used with any type of personalized PageRank formulation.

For example, user segmentation is commonly used in targeted advertising. A user segment can be defined in terms of a user demographic profile (e.g., age, gender, income, etc.), user location, user behavior, etc. Any or all of equations (3)-(5) above can then be specified to reflect any such user segment in that they are constructed with reference to user data corresponding to an underlying population which, in turn, can be restricted to the relevant user segment. Moreover, as discussed above, such formulations can take into account any probabilistic distribution of user relevancy such as, for example, assigning weights to different users on the basis of an age range distribution.

Blocked PageRank

As discussed above, PageRank formulations are often applied to aggregations at the host, site, or domain levels, often referred to as blocked PageRank. Blocked PageRank is useful in acceleration of PageRank computing and in PageRank personalization. To construct a blocked PageRank formulation, parameters for a factorized directed graph are defined. For example, equal weights may be assigned for any link from one block to another as between two blocks having nodes connected by a directed edge. However, such a formulation would not distinguish between a pair of blocks connected by a single spurious link, and a pair of blocks connected by multiple direct edges. A variety of schemes have been developed to derive weights for block super-edges, but performance in practice has yielded mixed results.

However, because user behavior naturally aggregates at the various different “block” levels (i.e., site, host, domain, etc.), the various PageRank formulations of the present invention naturally scale up to the various block levels.

Overall PageRank Iterations

PageRank computing is related to the so-called simple power iteration method. This method depends on parameters such as edge probability distribution and teleportation described above. Equations (3)-(5) above and the generalization exemplified by equation (3A) thus lead to the following:


p_j^(n+1) = Σ_{i→j} c_i·w_ij·p_i^(n) + (1 − c_i)·v_j  (6)
where the transition weights w_ij are defined by equation (3) or its analogs (e.g., equations (3A)-(3C)), the teleportation distribution v is defined by equation (4) or its analogs, and the teleportation coefficients c_i are defined by equation (5) or its analogs. Any derived iterative schemes that accelerate PageRank convergence and/or construct or compute blocked PageRank which employ any of the PageRank formulations or components thereof described herein are within the scope of the present invention.

Time Dynamics

In principle, PageRank should be periodically recomputed because the Web graph grows and its topology changes with time. Beyond purely topological change, core pages with the same in- and out-links still come in and out of fashion or significance over time. This is particularly important given that there is no “garbage collection” on the Web. Yet another advantage of the PageRank formulations of the present invention is that it is relatively straightforward to incorporate time dynamics. For example, a discount procedure such as exponential averaging could readily be applied to the user behavior counts to emphasize recent events and discount old ones. Not only does such a modification capture temporally dependent changes in page popularity, it also operates as a de facto Web garbage collection utility.

Other Applications

As will be understood, the various PageRank formulations of the present invention may be used in conjunction with other information to evaluate page relevance in ranking search results according to any of a wide variety of techniques. However, it should be noted that the PageRank formulations of the present invention may be used in a wide variety of other applications. An example of one such application is controlling the manner in which a web crawling application crawls the Web. That is, the PageRank formulations of the present invention may be used to support decision making by a web crawler to determine whether and on which links associated with a given page to crawl.

Moreover, the basic principles described herein can be generalized beyond PageRank formulations. Consider anchor text, which is known as one of the most useful features used in ranking retrieved Web search results. It is usually assembled through aggregation of the text strings of the different <a href> HTML tags on incoming links. However, since incoming links have different popularity, this text can be supplied with weights derived according to the present invention. That is, given a target page j, the anchor texts corresponding to incoming links i→j are weighted with the user behavior scores w_ij computed as described above. As will be understood, various formulas may be used in relevancy ranking to aggregate hyperlink anchor text. Any of those formulas may be modified in accordance with the present invention to reflect link weights corresponding to user behavior in a manner similar to equations (3)-(3C).
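A sketch of the weighted anchor-text idea; the anchor strings and w_ij scores are invented, and the simple per-term weight sum stands in for whatever relevancy formula is actually used:

```php
// Aggregate anchor text for a target page j, weighting each anchor
// string by the user-behavior score w_ij of the link it arrived on.
$anchors = array(
    array('text' => 'php scraper class', 'w' => 0.77),
    array('text' => 'click here',        'w' => 0.05),
    array('text' => 'php scraper class', 'w' => 0.14),
);

$scores = array();
foreach ($anchors as $a) {
    $key = strtolower($a['text']);
    if (!isset($scores[$key])) {
        $scores[$key] = 0;
    }
    $scores[$key] += $a['w'];  // anchors on popular links contribute more
}
arsort($scores);
print_r($scores); // 'php scraper class' => 0.91, 'click here' => 0.05
```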

Embodiments of the present invention may be employed to compute PageRank or similar formulations in any of a wide variety of computing contexts. For example, as illustrated in FIG. 2, implementations are contemplated in which the relevant population of users interact with a diverse network environment via any type of computer (e.g., desktop, laptop, tablet, etc.) 202, media computing platforms 203 (e.g., cable and satellite set top boxes and digital video recorders), handheld computing devices (e.g., PDAs) 204, cell phones 206, or any other type of computing or communication platform.

And according to various embodiments, user data processed in accordance with the invention may be collected using a wide variety of techniques. For example, collection of data representing a user’s interaction with specific Web pages may be accomplished using any of a variety of well known mechanisms for recording a user’s online behavior. However, it should be understood that such methods of data collection are merely exemplary and that user data may be collected in many other ways. For example, user data may be collected when a user registers with, for example, a particular web site or service.

Once collected, the user data are processed and stored in some centralized manner. This is represented in FIG. 2 by server 208 and data store 210 which, as will be understood, may correspond to multiple distributed devices and data stores. The invention may also be practiced in a wide variety of network environments (represented by network 212) including, for example, TCP/IP-based networks, telecommunications networks, wireless networks, etc. In addition, the computer program instructions with which embodiments of the invention are implemented may be stored in any type of computer-readable media, and may be executed according to a variety of computing models including a client/server model, a peer-to-peer model, on a stand-alone computing device, or according to a distributed computing model in which various of the functionalities described herein may be effected or employed at different locations.

  1. For an introduction to this subject see “A Survey on PageRank Computing”, P. Berkhin, Internet Mathematics, Vol. 2, No. 1, pp. 73-120, 2005.
  2. See, for example, “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank”, M. Richardson and P. Domingos, Advances in Neural Information Processing Systems 14, MIT Press, 2002.
  3. See, for example, “Combating Web Spam with TrustRank”, Z. Gyongyi, H. Garcia-Molina, J. Pedersen, in Proceedings of the 30th VLDB Conference, Toronto, Canada, ACM Press, 2004.
  4. See, for example, “Exploiting the Block Structure of the Web for Computing PageRank”, S. Kamvar, T. Haveliwala, C. Manning, G. Golub, Stanford University Technical Report, 2003.

A Simple Guide to Five Normal Forms in Relational Database Theory

William Kent, “A Simple Guide to Five Normal Forms in Relational Database Theory”, Communications of the ACM 26(2), Feb. 1983, 120-125. Also IBM Technical Report TR03.159, Aug. 1981. Also presented at SHARE 62, March 1984, Anaheim, California. Also in A.R. Hurson, L.L. Miller and S.H. Pakzad, Parallel Architectures for Database Systems, IEEE Computer Society Press, 1989. [12 pp]

Copyright 1996 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept., ACM Inc., fax +1 (212) 869-0481.

A Simple Guide to Five Normal Forms in Relational Database Theory

William Kent
Sept 1982

> 1 INTRODUCTION
> 2 FIRST NORMAL FORM
> 3 SECOND AND THIRD NORMAL FORMS
>> 3.1 Second Normal Form
>> 3.2 Third Normal Form
>> 3.3 Functional Dependencies
> 4 FOURTH AND FIFTH NORMAL FORMS
>> 4.1 Fourth Normal Form
>>> 4.1.1 Independence
>>> 4.1.2 Multivalued Dependencies
>> 4.2 Fifth Normal Form
> 5 UNAVOIDABLE REDUNDANCIES
> 6 INTER-RECORD REDUNDANCY
> 7 CONCLUSION
> 8 ACKNOWLEDGMENT
> 9 REFERENCES


1 INTRODUCTION

The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date [4].

The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.


2 FIRST NORMAL FORM

First normal form [1] deals with the “shape” of a record type.

Under first normal form, all occurrences of a record type must contain the same number of fields.

First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn’t deal with records having a variable number of fields.
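
A small illustration (with invented data) of what the rule excludes: a record containing a variable repeating group is replaced by occurrences that all have the same fields.

```python
# Hypothetical example: a record with a repeating group of skills violates
# first normal form; flattening gives records of a fixed shape.
unnormalized = {"employee": "Smith", "skills": ["cook", "type"]}

# One record per skill -- every occurrence now has the same two fields.
normalized = [{"employee": unnormalized["employee"], "skill": s}
              for s in unnormalized["skills"]]
print(normalized)
# [{'employee': 'Smith', 'skill': 'cook'}, {'employee': 'Smith', 'skill': 'type'}]
```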


3 SECOND AND THIRD NORMAL FORMS

Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.

Under second and third normal forms, a non-key field must provide a fact about the key, use the whole key, and nothing but the key. In addition, the record must satisfy first normal form.

We deal now only with “single-valued” facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase “Y is a fact about X” signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.

3.1 Second Normal Form

Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:

---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------

The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:

  • The warehouse address is repeated in every record that refers to a part stored in that warehouse.
  • If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.
  • Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.
  • If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse’s address.

To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:

-------------------------------  ---------------------------------
| PART | WAREHOUSE | QUANTITY |  | WAREHOUSE | WAREHOUSE-ADDRESS |
====================-----------  =============--------------------

When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization. The term “normalization” is sometimes used relative to a particular normal form. Thus a set of records may be normalized with respect to second normal form but not with respect to third.

The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the unnormalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs.
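
The decomposition and the extra retrieval step can be sketched in SQL (shown here via Python's sqlite3; the table names and sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Normalized design: QUANTITY is a fact about the whole PART+WAREHOUSE
    -- key; WAREHOUSE-ADDRESS lives in its own record type.
    CREATE TABLE inventory (part TEXT, warehouse TEXT, quantity INTEGER,
                            PRIMARY KEY (part, warehouse));
    CREATE TABLE warehouse (warehouse TEXT PRIMARY KEY, address TEXT);
    INSERT INTO inventory VALUES ('P1', 'W1', 40), ('P1', 'W2', 15);
    INSERT INTO warehouse VALUES ('W1', '10 Dock St.'), ('W2', '99 Pier Rd.');
""")

# Addresses of all warehouses stocking part P1: the application now has to
# connect the two record types.
rows = sorted(con.execute("""
    SELECT w.address
    FROM inventory i JOIN warehouse w ON w.warehouse = i.warehouse
    WHERE i.part = 'P1'
""").fetchall())
print(rows)  # [('10 Dock St.',), ('99 Pier Rd.',)]
```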

3.2 Third Normal Form

Third normal form is violated when a non-key field is a fact about another non-key field, as in

------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------

The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT — in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:

  • The department’s location is repeated in the record of every employee assigned to that department.
  • If the location of the department changes, every such record must be updated.
  • Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.
  • If a department has no employees, there may be no record in which to keep the department’s location.

To satisfy third normal form, the record shown above should be decomposed into the two records:

-------------------------  -------------------------
| EMPLOYEE | DEPARTMENT |  | DEPARTMENT | LOCATION |
============-------------  ==============-----------

To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.

3.3 Functional Dependencies

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is “functionally dependent” on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can’t be two records having the same X value.
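
The definition lends itself to a direct mechanical check. The helper below is not from the paper; it simply tests, over a set of records, whether any two share an X-value while differing on Y:

```python
def functionally_dependent(records, x_fields, y_field):
    """True iff no two records share an X-value but differ on Y."""
    seen = {}
    for r in records:
        x = tuple(r[f] for f in x_fields)
        if x in seen and seen[x] != r[y_field]:
            return False
        seen[x] = r[y_field]
    return True

# Same name, two addresses -> ADDRESS is not functionally
# dependent on PERSON, even if each person has only one address.
rows = [{"PERSON": "John Smith", "ADDRESS": "123 Main St., New York"},
        {"PERSON": "John Smith", "ADDRESS": "321 Center St., San Francisco"}]
print(functionally_dependent(rows, ["PERSON"], "ADDRESS"))  # False
```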

There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person’s address is a single-valued fact, i.e., a person has only one address. If we don’t provide unique identifiers for people, then there will not be a functional dependency in the data:

|   PERSON   |       ADDRESS                 |
| John Smith | 123 Main St., New York        |
| John Smith | 321 Center St., San Francisco |

Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact.

Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.

|   PERSON   |       ADDRESS          |
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC   |

We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.

For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.

| EMPLOYEE  |  FATHER    |  FATHER'S-ADDRESS             |
| Art Smith | John Smith | 123 Main St., New York        |
| Bob Smith | John Smith | 123 Main Street, NYC          |
| Cal Smith | John Smith | 321 Center St., San Francisco |

However, in formal terms, there is no functional dependency here between FATHER’S-ADDRESS and FATHER, and hence no violation of third normal form.


4 FOURTH AND FIFTH NORMAL FORMS

Fourth [5] and fifth [6] normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By “many-to-many” we mean that an employee may have several skills, and a skill may belong to several employees.

Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.

In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.

4.1 Fourth Normal Form

Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.

The term “independent” will be discussed after considering an example.

Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================

Instead, they should be represented in the two records

--------------------  -----------------------
| EMPLOYEE | SKILL |  | EMPLOYEE | LANGUAGE |
====================  =======================

Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.

The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record:

(1) A disjoint format, in which a record contains either a skill or a language, but not both:

| Smith    | cook  |          |   
| Smith    | type  |          |
| Smith    |       | French   |
| Smith    |       | German   |
| Smith    |       | Greek    |

This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.)

(2) A random mix, with three variations:

(a) Minimal number of records, with repetitions:

| Smith    | cook  | French   |   
| Smith    | type  | German   |
| Smith    | type  | Greek    |

(b) Minimal number of records, with null values:

| Smith    | cook  | French   |   
| Smith    | type  | German   |
| Smith    |       | Greek    |

(c) Unrestricted:

| Smith    | cook  | French   |   
| Smith    | type  |          |
| Smith    |       | German   |
| Smith    | type  | Greek    |

(3) A “cross-product” form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages:

| Smith    | cook  | French   |
| Smith    | cook  | German   |
| Smith    | cook  | Greek    |
| Smith    | type  | French   |
| Smith    | type  | German   |
| Smith    | type  | Greek    |

Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy:

  • If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent.
  • Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.
  • Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn’t leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn’t also been deleted.

Fourth normal form minimizes such update problems.

4.1.1 Independence

We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee:skill and employee:language, are “independent” in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That’s precisely why all the maintenance policies mentioned earlier can be allowed.

In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:

| Smith    | cook  | French   |
| Smith    | type  | French   |
| Smith    | type  | German   |
| Smith    | type  | Greek    |

Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.

4.1.2 Multivalued Dependencies

For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the “cross-product” maintenance policy mentioned above. That is, for our example, every one of an employee’s skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no “information” in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.
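
The cross-product characterization can be checked mechanically. The helper below is an illustration invented for this purpose; the data values follow the Smith example:

```python
from itertools import product

def skills_languages_independent(rows):
    """rows: (employee, skill, language) triples. True iff, for every
    employee, the stored pairs equal the cross product of that employee's
    skills and languages -- i.e., the multivalued dependency holds."""
    by_emp = {}
    for emp, skill, lang in rows:
        by_emp.setdefault(emp, set()).add((skill, lang))
    return all(pairs == set(product({s for s, _ in pairs},
                                    {l for _, l in pairs}))
               for pairs in by_emp.values())

cross = [("Smith", "cook", "French"), ("Smith", "cook", "German"),
         ("Smith", "type", "French"), ("Smith", "type", "German")]
dependent = [("Smith", "cook", "French"), ("Smith", "type", "German")]
print(skills_languages_independent(cross))      # True: all pairings present
print(skills_languages_independent(dependent))  # False: pairings carry information
```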

We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:

  • An employee uses certain skills on certain projects.
  • An employee uses certain languages on certain projects.

If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where “EP” represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.

4.2 Fifth Normal Form

Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.

We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:

-----------------------------
| AGENT | COMPANY | PRODUCT |
=============================
| Smith | Ford    | car     |
| Smith | GM      | truck   |
-----------------------------

This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not.

But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.

-----------------------------
| AGENT | COMPANY | PRODUCT |
=============================
| Smith | Ford    | car     |
| Smith | Ford    | truck   |
| Smith | GM      | car     |
| Smith | GM      | truck   |
| Jones | Ford    | car     |
-----------------------------

In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:

-------------------   ---------------------   -------------------
| AGENT | COMPANY |   | COMPANY | PRODUCT |   | AGENT | PRODUCT |
===================   =====================   ===================
| Smith | Ford    |   | Ford    | car     |   | Smith | car     |
| Smith | GM      |   | Ford    | truck   |   | Smith | truck   |
| Jones | Ford    |   | GM      | car     |   | Jones | car     |
-------------------   | GM      | truck   |   -------------------
                      ---------------------

These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.

Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.

Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.

One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized form it may be repeated many times.

It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the unnormalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the unnormalized form.
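
The additive-versus-multiplicative growth is easy to check for a hypothetical new agent selling x = 5 products for y = 3 companies:

```python
x, y = 5, 3  # products sold, companies represented (invented values)

# Normalized: x new AGENT/PRODUCT records plus y new AGENT/COMPANY records
# (the COMPANY/PRODUCT facts are already on file).
normalized_new = x + y
# Unnormalized: one three-field record per company/product pairing.
unnormalized_new = x * y
print(normalized_new, unnormalized_new)  # 8 15
```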

It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we can’t determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.
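
The reconstruction argument can be demonstrated directly (data values follow the example above; the join is spelled out in plain Python rather than SQL):

```python
# Join the three binary projections and check that exactly the original
# triples come back. A triple survives only if all three pairwise facts hold,
# which is why Jones/Ford/truck is (correctly) excluded.
original = {("Smith", "Ford", "car"), ("Smith", "Ford", "truck"),
            ("Smith", "GM", "car"), ("Smith", "GM", "truck"),
            ("Jones", "Ford", "car")}

agent_company = {(a, c) for a, c, p in original}
company_product = {(c, p) for a, c, p in original}
agent_product = {(a, p) for a, c, p in original}

reconstructed = {(a, c, p)
                 for a, c in agent_company
                 for c2, p in company_product if c2 == c
                 if (a, p) in agent_product}

print(reconstructed == original)  # True
```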

The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.

-----------------------------
| AGENT | COMPANY | PRODUCT |
=============================
| Smith | Ford    | car     |
| Smith | Ford    | truck   |
| Smith | GM      | car     |
| Smith | GM      | truck   |
| Jones | Ford    | car     |
| Jones | Ford    | truck   |
| Brown | Ford    | car     |
| Brown | GM      | car     |
| Brown | Toyota  | car     |
| Brown | Toyota  | bus     |
-----------------------------

-------------------   ---------------------   -------------------
| AGENT | COMPANY |   | COMPANY | PRODUCT |   | AGENT | PRODUCT |
===================   =====================   ===================
| Smith | Ford    |   | Ford    | car     |   | Smith | car     | Fifth
| Smith | GM      |   | Ford    | truck   |   | Smith | truck   | Normal
| Jones | Ford    |   | GM      | car     |   | Jones | car     | Form
| Brown | Ford    |   | GM      | truck   |   | Jones | truck   |
| Brown | GM      |   | Toyota  | car     |   | Brown | car     |
| Brown | Toyota  |   | Toyota  | bus     |   | Brown | bus     |
-------------------   ---------------------   -------------------

Observe that:

  • Jones sells cars and GM makes cars, but Jones does not represent GM.
  • Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
  • Brown represents Ford and Brown sells buses, but Ford does not make buses.

Fourth and fifth normal forms both deal with combinations of multivalued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pairwise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of the resulting records that still violates fourth normal form is decomposed again, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.

In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.


5 UNAVOIDABLE REDUNDANCIES

Normalization certainly doesn’t remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown in Section 4.1.1, it seems unavoidable that we record the fact that “Smith can type” several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that “Smith sells cars” several times.


6 INTER-RECORD REDUNDANCY

The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the “ultimate” normal form with respect to such redundancies.

Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy:

-------------------------  -------------------------  -----------------------
| EMPLOYEE | DEPARTMENT |  | DEPARTMENT | LOCATION |  | EMPLOYEE | LOCATION |
============-------------  ==============-----------  ============-----------

In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy.

Inter-record redundancy has been recognized for some time [1], and has recently been addressed in terms of normal forms and normalization [8].


7 CONCLUSION

While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed:

  • Single-valued vs. multi-valued facts.
  • Dependency on the entire key.
  • Independent vs. dependent facts.
  • The presence of mutual constraints.
  • The presence of non-unique or non-singular representations.

And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications.


8 ACKNOWLEDGMENT

I am very grateful to Ted Codd and Ron Fagin for reading earlier drafts and making valuable comments, and especially to Chris Date for helping clarify some key points.


9 REFERENCES

  1. E.F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Comm. ACM 13 (6), June 1970, pp. 377-387. The original paper introducing the relational data model.
  2. E.F. Codd, “Normalized Data Base Structure: A Brief Tutorial”, ACM SIGFIDET Workshop on Data Description, Access, and Control, Nov. 11-12, 1971, San Diego, California, E.F. Codd and A.L. Dean (eds.). An early tutorial on the relational model and normalization.
  3. E.F. Codd, “Further Normalization of the Data Base Relational Model”, R. Rustin (ed.), Data Base Systems (Courant Computer Science Symposia 6), Prentice-Hall, 1972. Also IBM Research Report RJ909. The first formal treatment of second and third normal forms.
  4. C.J. Date, An Introduction to Database Systems (third edition), Addison-Wesley, 1981. An excellent introduction to database systems, with emphasis on the relational model.
  5. R. Fagin, “Multivalued Dependencies and a New Normal Form for Relational Databases”, ACM Transactions on Database Systems 2 (3), Sept. 1977. Also IBM Research Report RJ1812. The introduction of fourth normal form.
  6. R. Fagin, “Normal Forms and Relational Database Operators”, ACM SIGMOD International Conference on Management of Data, May 31-June 1, 1979, Boston, Mass. Also IBM Research Report RJ2471, Feb. 1979. The introduction of fifth normal form.
  7. W. Kent, “A Primer of Normal Forms”, IBM Technical Report TR02.600, Dec. 1973. An early, formal tutorial on first, second, and third normal forms.
  8. T.-W. Ling, F.W. Tompa, and T. Kameda, “An Improved Third Normal Form for Relational Databases”, ACM Transactions on Database Systems 6 (2), June 1981, pp. 329-346. One of the first treatments of inter-relational dependencies.

Creating a Table From a Spreadsheet

Importing Spreadsheets vs. Connecting to Spreadsheets

Base can use data from spreadsheets in two different ways. One option is to connect to an existing spreadsheet. This method is described in the Using a Calc spreadsheet as a data source article. When using this method, the data can only be edited within the source spreadsheet; data cannot be edited from within Base.

The other option is to import the data from the spreadsheet into the Base document. The data can then be edited in Base. (The original spreadsheet will not be altered when changes are made in Base.) In our case, we want to import the data, but first the spreadsheet needs some work.

Formatting the Spreadsheet

Before importing the data, we need to make sure that it is correctly formatted.

  • The first row of the data range (not necessarily of the document) should contain the field names.
  • Field names should not include spaces. Change the cell containing the text English Name to EnglishName.
  • Any fields that must not be empty need to have data in them.
In the case of the AOU list, a careful inspection will show that the Order, Family, Subfamily, and genus fields only contain data for the first record with that order, family, subfamily, or genus. In our database, this information needs to be in each record. To fix this:

  • Click in a cell that contains an Order (or Family, etc.) name. The cell border becomes black, and a small black square is visible in the lower right-hand corner of the cell.
  • Drag this square downward; as you do so, a red border drags with you. Keep dragging until the border encompasses all of the empty cells, up to but not including the next cell with text in it, then release the mouse button. The empty cells fill with the text content of the first cell.
  • Repeat this procedure until the empty fields are filled. (Note that in some cases, Subfamily will remain empty.)
  • Any rows with empty (null) cells in the English Name column must be deleted.
A careful examination of the spreadsheet will now show that there are some rows that contain Order, Family, Subfamily, etc. names, but do not include an entry in the English Name column, nor in any column to the right of English Name. These are the rows that originally formed the “headers” for the various sections. To delete these rows, right-click or control-click in the rectangle to the far left of the “empty” row to select the whole row and choose Delete Rows from the contextual menu.
  • The order of the AOU checklist is a standard order, used in birding books and checklists. To ensure that we can reproduce this order in the database, it would be wisest to add an ID field, with the first record given the number 1, the next 2, etc. To do so:
  • Insert a new column to the left of Column A (click on column A and then choose Columns from the Insert menu).
  • Type AOU_ID into the cell in new column A just to the left of the cell labeled Order.
  • Type a 1 into the first cell under AOU_ID.
  • Follow the instructions on the Selecting and Filling a Cell Range wiki page to fill the column so that each cell contains a number one greater than the one above it (i.e. 1, 2, 3, etc.).
  • I also changed some of the field names, especially those that were originally just one letter; adjust these names as you see fit.
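The fill-down and numbering steps above can also be scripted. Here is a minimal Python sketch of the same prep work, with plain dicts standing in for spreadsheet rows; the sample data and column names are illustrative assumptions, not the actual AOU list:

```python
# Sketch: replicate the spreadsheet prep steps on a list of rows.
# Rows are dicts; None marks an empty cell. Data is illustrative only.
rows = [
    {"Order": "Anseriformes", "EnglishName": None},  # old section-header row
    {"Order": None,           "EnglishName": "Snow Goose"},
    {"Order": None,           "EnglishName": "Brant"},
]

# 1. Fill empty Order cells down from the last non-empty value
#    (the spreadsheet drag-to-fill step).
last = None
for row in rows:
    if row["Order"] is not None:
        last = row["Order"]
    else:
        row["Order"] = last

# 2. Drop rows with an empty EnglishName (the old section headers).
rows = [r for r in rows if r["EnglishName"] is not None]

# 3. Add a sequential AOU_ID field: 1, 2, 3, ...
for i, row in enumerate(rows, start=1):
    row["AOU_ID"] = i

for row in rows:
    print(row["AOU_ID"], row["Order"], row["EnglishName"])
```

The same three transformations (fill down, filter empties, number sequentially) are what the manual spreadsheet steps accomplish.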

Importing the Data

We are now ready to import the data.

  1. Select the entire range of the data, including the row with column names. The easiest way to do this is probably to click in the upper left hand cell of the range (probably the cell labeled ID), then scroll down to the last row of data and Shift-click in the far right hand cell of the range.
  2. Copy this selected range.
  3. Open the Birder’s journal database.
  4. Click on the Tables icon in the Database column on the right.
  5. In the lower rectangular area, labeled Tables, right-click or control-click in the white space. (If there is an existing table, make sure not to click on it.)
  6. In the contextual menu that comes up, choose Paste. The Copy Table window appears.
  7. Give the table a name in the Table Name field; I chose AOU_Birdlist.
  8. Make sure the Definition and Data radio button is marked.
  9. Because we added an ID field while prepping the spreadsheet, we want to leave the Create primary key checkbox unchecked. If you were importing data that did not already include a primary key field, you would want to check this box.
  10. Click Next.
  11. The Apply columns window that appears allows you to choose which fields you want to import. In our case we want to import all of them, so you can click on the >> button. Then click Next.
  12. The Type Formatting window appears. Highlight each field in turn in the vertical column to the left, and adjust the Field Name, Field Type, Entry Required, and Length settings as needed.
     • In this table, all the fields are TEXT except for the ID field, whose field type should be INTEGER.
     • Most of the Lengths are OK, although the Hawaii, Introduced, Accidental, and Nonbreeding fields could be reduced to a length of 5 or so, since the data in those columns is generally only one character.
     • The ID and English Name fields should be marked for Entry Required.
  13. Right-click or control-click on the “ID” field name in the left hand column. Select Primary Key from the contextual menu (it is the only option). A yellow key appears to the left of the field name.
  14. Once the fields are properly defined, click on Create.

Note: Make certain that all the field definitions are correct before clicking Create. Attempting to change field types, lengths, etc. later will destroy data in those fields.
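What the Copy Table wizard builds corresponds to an ordinary SQL CREATE TABLE statement. Here is a rough sketch of the equivalent DDL, using Python's built-in sqlite3 as a stand-in for HSQLDB (the column set is abbreviated and the VARCHAR lengths are advisory in SQLite; only the types and constraints follow the wizard settings described above):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the HSQLDB back-end
con.execute("""
    CREATE TABLE AOU_Birdlist (
        AOU_ID      INTEGER PRIMARY KEY,    -- the ID field we added
        "Order"     VARCHAR(50),            -- quoted: Order is a reserved word
        Family      VARCHAR(50),
        Subfamily   VARCHAR(50),
        Genus       VARCHAR(50),
        EnglishName VARCHAR(100) NOT NULL,  -- Entry Required
        Hawaii      VARCHAR(5)              -- one-character flag column
    )
""")
con.execute('''INSERT INTO AOU_Birdlist (AOU_ID, "Order", EnglishName)
               VALUES (1, 'Anseriformes', 'Snow Goose')''')
print(con.execute("SELECT EnglishName FROM AOU_Birdlist").fetchone()[0])
```

Note that a column named Order must be quoted because ORDER is an SQL keyword, which is one more reason renaming one-letter or reserved-word field names in the spreadsheet is worthwhile.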

If you forget to set the primary key, a No primary key warning window will appear. You can click Cancel and set the primary key as above, or you can click No and identify the primary key as follows:

  1. Once the new table has appeared in the Tables section of the document window, right-click or control-click on the new table. Select Edit from the contextual menu. The table opens in design view.
  2. Click on the line containing the ID field. A green flippy triangle appears in the “cell” to the left of the word ID.
  3. Control-click or right-click on the green triangle and choose Primary Key from the contextual menu. A key icon appears in the box.
  4. Click on the disk-drive icon in the toolbar to save your changes, then close the window.

You have successfully imported data from a spreadsheet.

Related Articles

Using a Calc spreadsheet as a data source

Bringing data into OpenOffice 2.0's database

Base – Designing a Database

For a discussion of the formal theory of database design, see the Sum-it Database Design Course.

First Questions

Begin planning by asking yourself these questions:

  • What kind of data do I need to store?
  • What do I want to do with the data? What kind of reports do I want to be able to generate?
  • How should the data be grouped?

Simple linear databases only allowed you to work with one table or group of data in a database. This limited what kind of data you could store and how it could relate. Relational databases like Base allow you to work with multiple tables of relatively independent data that can relate to each other in a variety of ways.

Figuring out how to group your data, or what tables you need, is essential. Think carefully about what data is closely tied together, and what data needs to be independent of other groups (or tables) of data. In a simple database to catalog music, for example, you might want three groups: Albums (Name, date, etc.), Songs, and Composers. These should be separate because an Album contains more than one song, and Composers (or Artists) produce more than one Album.
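For the music-catalog example, the three groups map to three related tables. Here is a hedged sketch using Python's sqlite3 (table and field names are illustrative, not prescribed by Base):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Composers (
        Composer_ID INTEGER PRIMARY KEY,
        Name        TEXT NOT NULL
    );
    CREATE TABLE Albums (
        Album_ID    INTEGER PRIMARY KEY,
        Name        TEXT NOT NULL,
        ReleaseDate TEXT,
        Composer_ID INTEGER REFERENCES Composers(Composer_ID)
    );
    -- An album contains more than one song: one-to-many via Album_ID.
    CREATE TABLE Songs (
        Song_ID  INTEGER PRIMARY KEY,
        Title    TEXT NOT NULL,
        Album_ID INTEGER REFERENCES Albums(Album_ID)
    );
""")

con.execute("INSERT INTO Composers (Composer_ID, Name) VALUES (1, 'Bach')")
con.execute("""INSERT INTO Albums (Album_ID, Name, Composer_ID)
               VALUES (1, 'Goldberg Variations', 1)""")
con.execute("INSERT INTO Songs (Title, Album_ID) VALUES ('Aria', 1)")
con.execute("INSERT INTO Songs (Title, Album_ID) VALUES ('Variation 1', 1)")

# Count the songs belonging to one album via the shared Album_ID field.
n = con.execute("""SELECT COUNT(*) FROM Songs
                   JOIN Albums ON Songs.Album_ID = Albums.Album_ID
                   WHERE Albums.Name = 'Goldberg Variations'""").fetchone()[0]
print(n)  # 2
```

The shared ID fields (Album_ID, Composer_ID) are exactly the "common fields" that let separate tables relate to each other.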

Planning Tables and Fields

Once you have given thought to these questions, you are ready to begin planning what tables and fields you need.

Names of Tables and Fields

  • Field names must begin with a letter and must only contain letters, numbers, and underscores. No spaces or other characters.
  • Table names can contain spaces, but the standard is to use underscores instead.
  • Field names should be easily understandable by anyone using the database
  • Field names should be unique across the whole database. You can have a field called “ID” in each table, and the database will work (because Neo keeps track of what table it comes from), but it is more confusing for the User. It is better to name these fields Query_ID, Page_ID, etc., so it is clear in the field name what table it relates to.

Note: For more guidelines for creating good tables, see the book SQL Queries for Mere Mortals: A Hands-On Guide to Data Manipulation in SQL (2nd Edition) by John L. Viescas and Michael J. Hernandez and/or the Wikipedia article on Database Normalization.

List of Tables and Fields

Now make a list of the tables you need.

  • Include the names of all the fields that need to be included in each table.
  • Make sure to include a “Primary Key” in each table. Primary Keys are fields whose entry is unique for each record. Most often, it is best for the Primary Key to be a number (integer) that Base sets itself. But sometimes it makes sense for the primary key to be something else.
  • For each field, decide what type of field it needs to be (text, date, number, yes/no, etc.).
  • Decide how the tables will relate to each other. What are the common fields?
  • If certain fields will usually (but not always) have a certain value (e.g. Country, etc.) make note of a default value to be entered when defining fields and creating the table.
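A noted default value translates directly into a DEFAULT clause when the table is eventually created. A small sketch, again with sqlite3 standing in for HSQLDB and with illustrative field names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Sightings (
        Sighting_ID INTEGER PRIMARY KEY,
        Place       TEXT,
        Country     TEXT DEFAULT 'USA'   -- the usual (but not always) value
    )
""")
# Country is omitted here, so the default is filled in automatically.
con.execute("INSERT INTO Sightings (Place) VALUES ('Cape May')")
country = con.execute("SELECT Country FROM Sightings").fetchone()[0]
print(country)  # USA
```

Recording these defaults in your planning notes now means less typing for the user later, since the engine supplies the common value whenever it is not entered.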

Creating the Database Document

It is finally time to create a new database document. Under the File menu, choose New and then Database. In the Database Wizard window, make sure the Create a new database radio button is selected. Click Next. Make sure that the Yes, register the database for me and Open the Database for Editing settings are selected.
Click Finish.

The Next Steps

You are now ready to begin creating your tables. Table creation is covered in the following articles:

Creating a Table From a Spreadsheet
Creating a Table in Design View

Be sure to keep your notes about the design of your database. More decisions will need to be made later. It will be easier to make those decisions if you can refer to this list of tables and fields.

Base Wizard – Portable ‘Split’ HSQL Database Template

Download: Split_HSQLDB_Wizard_v3c.odb …includes HSQLDB 2.3.0.


This Base template (.odb) generates a split database as necessary for data-reliability and other considerations. The template combines various presets, macro-automation, and the popular HSQL database engine into one of the most powerful and flexible, cross-platform, database options available for the desktop today.

A split-database involves multiple files, but the template maintains everything in a dedicated folder for portability and backup purposes. This folder may be moved, renamed, or zipped for database distribution. This template is configured for single-user access, suitable for non-concurrent internet access, but the database may be run in server mode for concurrent read/write access through Base or other front-ends on a LAN or web-server.


    1. Ensure global macro security in *Office is set to Medium (or Low): Tools (*Office) > Options (Preferences) > *Office > Security > Macro Security > Medium.
    2. Remove any global Class Path to hsqldb.jar that you may have setup manually in *Office: Tools (*Office) > Options (Preferences) > *Office > Java/Advanced > Class Path > Remove
    3. Create a new folder on your computer to serve as a dedicated split-database folder (one per database)
    4. Download the above template and place this (.odb) file in the empty folder.
    5. Then click to open the template in Base while selecting Enable Macros.

The integral macro-wizard runs automatically. It adds a copy of the HSQLDB engine to the split-database folder and configures the template for new/existing database access, as applicable. The macro then runs quietly for a split-second each time you open the .odb in support of database portability.

The database location (file path) is derived and saved as a setting within the Base .odb file, so you can safely disable the macro if you never plan to move or rename your split-database files or folder. You can rename the .odb file in any case. The template is a standard Base (.odb) front-end file used for storing your Queries, Forms, Reports and Macros.

Consider moving the entire split-database folder into a cloud folder (such as Dropbox, Google Drive or SpiderOak) to gain automated session backup with file-history and personal encryption, as well as non-concurrent internet access to the database. You can also zip-archive the folder for electronic distribution purposes as a single file.

This macro-driven Base (.odb) file must be opened within the dedicated split-database folder to provide these features, although you can subsequently disable the macro and move the .odb file independent of a fixed database location. You can register this .odb file in *Office as a ‘data source’ (Tools > Options > Base > Databases) for seamless access (F4) from any Writer or Calc document including standalone forms. You can also maintain a desktop shortcut to this file and/or folder as desired.

Open Office Base


Base is a graphical user interface (GUI) used primarily to connect OpenOffice documents to various databases (data sources). Base generates a connection file (.odb) which provides access to data sources through special software drivers. Base contributes only “front-end” components including table-views, stored queries, forms, reports and macros. OpenOffice is bundled with various “back-end” options including spreadsheet access, dBase (flat-file) table creation and access, as well as an SQL (relational) database-engine (HSQLDB). The bundled HSQLDB engine is often sufficient, but HSQLDB 2.x takes Base to new levels with all of the latest desktop database features and functions.

Database Connection

Base is a database connection tool. Base can be used to generate connections to various data-sources, thereby enabling data access/integration across the *Office suite. To provide this function, a Base wizard is used to create a connection file (.odb). This standalone file is then ‘registered’ in *Office for universal access to the data-source. This process may be repeated to gain access to additional data-sources. Simply press the ‘F4’ key from any *Office application to view all registered data-sources.
Typical data-source usage includes:

  • Writer (mail-merge, forms),
  • Calc (pivot tables, graphs, forms),
  • Impress (dynamic slide data, graphs),
  • Base (stored queries, forms, reports).
Supported data sources include:

  • spreadsheets (tabular data; read-only),
  • delimited text files (CSV; read-only),
  • popular address book formats,
  • dBase tables (one table per file; read/write to include table creation),
  • SQL relational databases (multiple tables; read/write plus many additional functions).


Base is a well-rounded front-end. Base provides database management tools (DDL/DML), as well as stored queries, forms, reports and macros. The associated tools include:

Database Application Development

Base provides a database application development environment. Base is merely a front-end, but it’s bundled with a few ‘back-end’ database options. You’ll find both flat-file (non-relational) and relational database functionality included.

“Flat-file” connection drivers (dBase, Spreadsheet and/or Text) may offer SQL-like functions, but don’t expect them to be as extensive as those available when connecting Base to an SQL engine. Some additional function is provided by Base SubForms, which support filtering and ad hoc relationships even among flat-file tables. Beyond these Base functions, Calc ‘pivot tables’ offer pseudo-queries including aggregate functions, while Calc charting is suitable for reports. So flat-file tables can provide entry-level database functionality, sufficient for many home projects including a simple contact list, inventory, etc.

But when data redundancies become a liability, or the full power of SQL is desired, it’s time to connect Base to an SQL engine such as the bundled HSQL database engine. In this case, the ‘existing database’ wizard should be used to create a ‘JDBC’ connection with HSQL (see the ‘split database’ setup steps below). This engine adds a plethora of features, functions, and query options to the mix. So, while you’re not limited to the installed back-end database options, everything required to create a complete *Office database application is included with Base. A quick summary of the bundled back-end options includes:

  • flat-file table (Calc spreadsheet with tabular data, CSV, or dBase file)
  • flat-file table queries (dBase file-format generated through Base; single-table query support; one table per file; similar to MS Works)
  • flat-file based functions (provided by Base through selected flat-file ‘drivers’ such as dBase, Text, etc.)
  • flat-file filtering and ad hoc relationships (provided by SubForms in Base/Writer/Calc)
  • flat-file pseudo queries (provided by Calc through Pivot Tables)
  • SQL relational database (multiple related tables and extensive functionality through the included HSQL database engine; similar to MS Access + MS Jet engine).

Embedded Database

Base offers to create an all-in-one ‘embedded database’ file similar to MS Access. In this configuration, the front-end components (queries, forms, reports and macros) are packaged together with the database files produced by HSQLDB, all within a single Base (.odb) file. Unfortunately, this single-file concept has proven unreliable with Base, leading to rampant data corruption. So it’s best to avoid the ‘Create a new database’ wizard in Base, thereby avoiding ‘embedded database’ files (perhaps with the exception of developing prototypes or distributing examples).

Fortunately, Base offers an alternative wizard which produces a robust ‘split-database’ configuration in which the database itself (tables and user-data) is maintained as a separate file (or files), as necessary for proper stability. Steps for creating a new ‘split-database’ utilizing the built-in HSQL database engine with Base are outlined below.


Java is used to run various wizards and data sources such as the built-in “HSQL relational database engine.” With a Java Runtime Environment (JRE) installed (even portably), Base runs seamlessly with various Java databases including the built-in HSQLDB.

This has some advantages over SQLite, and perhaps over other relational database (RDBMS) options. Like SQLite, Java database engines are light-weight, fast, cross-platform and run seamlessly (auto start/stop) with Base. But Java databases enjoy broad support by the Base GUI, including most database creation (DDL) and data manipulation (DML) tasks. And Java databases are particularly rich in features. For example, modern Java databases like HSQLDB 2.x, H2, and perhaps Apache Derby offer advanced features including: ACID transactions, modern SQL syntax, data-typing, custom functions, stored procedures, encryption, hot backups, and advanced multi-user support (MVCC).

These additional features can ease project development to a degree. For instance, Base queries, forms and reports thrive in a full-featured SQL environment, thereby avoiding macros. Something as simple as ‘data-typing’ enables flexible queries based on date/time fields; something that becomes a nightmare with a weakly data-typed engine like SQLite. And as the need arises, these Java engines scale modestly in support of multiple, concurrent users. So Java database engines occupy a niche on the desktop and pair well with Base, expanding the scope and potential of Base projects while easing application development.

Given the advantages, it’s no wonder that HSQLDB 1.8 (released July 2005) was selected over SQLite as the default with Base, albeit a rather dated/crippled version of HSQLDB by today’s standards. But you’re not confined to the bundled version/engine. HSQLDB 2.x and H2 are much more advanced and particularly well-suited for Base projects. You’ll also find wide support for HSQLDB and H2 in the Base community forums and other support channels.


HSQLDB is an open source, cross-platform, SQL, relational database management system (RDBMS). It’s also known as HyperSQL or HSQL database engine. It’s written in Java so a JRE (Java Runtime Environment) must be installed (or available as a portable app) in order to utilize this database engine.

Version 1.8 (circa 2005) of HSQLDB is integrated with *Office (AOO/LibO).

The integrated JDBC driver supports several HSQLDB operating modes, including two used with Base:

  • In-process (‘file mode’) provides seamless, single-user access with Base. “Seamless” implies automated database access with no need to start/stop a database server; similar to H2, dBase, and SQLite.
  • Client-server (‘server mode’) provides multi-user access on a network or locally from client-applications such as Base.

HSQLDB is a robust, transactional database engine with ACID properties and a mature code base. These same technologies are at the heart of PostgreSQL, MySQL, Oracle, DB2, MS SQL Server, etc. In addition, HSQLDB employs a transaction log with auto-recovery. An automatic backup file is also maintained if enabled.

The end-user can supplement these measures by storing the database files in a folder protected by automatic backup software or cloud-sync services. It’s also prudent to run database applications on a computer protected by an ‘uninterruptible power supply’ (UPS) or on a laptop with the battery installed. These measures avoid loss (rollback) of any buffered data in the event of a power failure.

Speaking of buffered data, all transactional database engines employ a write-delay buffer to enhance performance. The buffer can be set to 0 seconds at the expense of write performance. HSQLDB uses a modest 500 ms buffer by default, so new data is safely written to disk after a half-second delay. Base sets this write-delay buffer to 60 seconds for all ‘embedded database’ files, which is excessive, but most folks can recover from the loss of one minute’s productivity in the event of a power failure or similar event. In all cases, buffered data is written immediately to disk upon normal shutdown.

Advanced, enterprise-level reliability measures such as hot backups, clustering and replication, designed for 24/7 availability, are not available with HSQLDB 1.8, although hot backups are supported by HSQLDB 2.x, while clustering is, oddly enough, a feature of the H2 database.
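The write-delay behaviour described above can be pictured as a buffer that flushes after a timeout or on normal shutdown. Here is a purely conceptual Python sketch (this is not HSQLDB code; the class and its behaviour are an illustrative model only):

```python
import time

class WriteDelayBuffer:
    """Toy model of a transactional engine's write-delay buffer."""
    def __init__(self, delay_seconds=0.5):  # HSQLDB's default is 500 ms
        self.delay = delay_seconds
        self.pending = []   # committed but not yet on disk
        self.disk = []      # durable storage
        self.last_flush = time.monotonic()

    def write(self, record):
        self.pending.append(record)
        # Flush once the delay has elapsed (a real engine uses a timer).
        if time.monotonic() - self.last_flush >= self.delay:
            self.flush()

    def flush(self):
        self.disk.extend(self.pending)
        self.pending.clear()
        self.last_flush = time.monotonic()

    def close(self):
        # Normal shutdown always writes buffered data immediately.
        self.flush()

buf = WriteDelayBuffer()
buf.write("row 1")   # sits in the buffer, not yet durable
buf.close()          # shutdown flushes it to disk
print(buf.disk)
```

The point of the model is the trade-off in the text: a longer delay means better write performance but a larger window of loss on power failure, while a clean shutdown always drains the buffer.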

HSQL Database Template

The macro-enhanced Base template (.odb) offered in the link below is highly recommended for single-user, split-database access. These templates do not require manual Class Path setup as outlined below. The templates use a macro-derived ‘session’ Class Path as opposed to a manually set up ‘global’ Class Path. So unless you’re running a split HSQL database in server mode (multi-user support), there’s no good reason to set up the global Class Path manually in Base.

See Base Wizard – Portable ‘Split’ HSQL Database Template.

Re-presenting CTRs as a function of Query x Position on SERP

Scraping Top Queries in GWT yields (after a fairly labour-intensive process) a table of Query-specific CTRs expressed as a function of Position on the SERP.

We want to re-present the CTRs in a more meaningful way – including grouping the CTRs into these categories: (Position 1 to 5 on Page 1 of SERP), (Position 6 to 10 on Page 1), (Page 1), (Page 2), and (Page 3 or more).

  1. Open the original data file from scraping the Top Queries: “<start date>-<end date>-<filter>-positions-tq.xls.”
  2. Open the processing file: “CTRs as a function of tq x pos on SERP – Template.xls.”
  3. Save #2 with appropriate identifiers.
  4. Copy #1 to sheet#2:”tq x pos orig”.
  5. Copy column = “Query” from sheet #2:”tq x pos orig” to sheet#2:”queries from tq x pos”.
  6. Apply Filter -> Advanced Filter to sheet#2:”queries from tq x pos” to copy unique Queries only to an adjacent column (see Figure below).
  7. Copy column of unique Queries from sheet#2:”queries from tq x pos” to sheet#2:”queries unique”.
  8. Copy column = “Query” from sheet#2:”queries unique” to column = “Query” of sheet#2:”CTRs dyn”.
  9. After some time the CTRs will be re-presented.
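The grouping the template performs can be sketched in Python: given per-position CTRs for a query, bucket them into the segments listed above. The position-to-page mapping assumes 10 results per page, and the sample numbers are illustrative; the combined Page 1 group is simply the sum of the first two segments:

```python
# Map a SERP position (1-based, 10 results per page) to a segment label.
def segment(position):
    if position <= 5:
        return "Page 1, positions 1-5"
    if position <= 10:
        return "Page 1, positions 6-10"
    if position <= 20:
        return "Page 2"
    return "Page 3+"

# Per-query CTRs keyed by SERP position (illustrative numbers).
ctrs = {1: 0.32, 4: 0.08, 9: 0.02, 13: 0.01, 27: 0.002}

grouped = {}
for pos, ctr in ctrs.items():
    grouped.setdefault(segment(pos), []).append(ctr)

for seg, values in grouped.items():
    print(seg, sum(values))
```

This is the same consolidation the "CTRs flat" sheet arrives at, just expressed as code rather than spreadsheet formulas.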

[Figure: Excel Advanced Filter copying unique Queries to an adjacent column]

Our final step is to consolidate the CTRs into groups corresponding to different Segments or Pages of the SERP results.

  1. Copy and Paste Special = Values the columns headed “BLOCK #1” from sheet#2:”CTRs dyn” to the columns headed “BLOCK #1” in sheet#2:”CTRs flat”.
  2. Ditto for columns headed “BLOCK #2” and “BLOCK #3”.
  3. Save the file.

… and we’re done!

Further analysis of these data is usually done on the Copy and Paste Special = Values of sheet#2:”CTRs flat”.