Using DOMXPath for Parsing Page Content in PHP

The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
"If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine… this works well for a limited, one-time job" (from a community wiki).

If we are going to apply XPath methods, then after we fetch the page content we had better clean it up so that it can be loaded into DOMDocument and DOMXPath objects.

Here are the basic steps for using the DOMXPath class:
  1. Initialize a DOMDocument class instance from the page content (working with HTML as with XML).
  2. Initialize a DOMXPath class instance from the DOMDocument class instance.
  3. Query the DOMXPath object.
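
The three steps above can be sketched end to end on a tiny inline page. This is only a minimal illustration: the markup in $page is made up for the example and stands in for real downloaded content.

```php
<?php
// Hypothetical inline markup standing in for a downloaded page.
$page = '<html><body><ul><li>alpha</li><li>beta</li></ul></body></html>';

// Step 1: build a DOMDocument from the HTML text.
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$dom->loadHTML($page);
libxml_clear_errors();              // empty the buffer once we are done with it

// Step 2: wrap the document in a DOMXPath object.
$xpath = new DOMXPath($dom);

// Step 3: run an XPath query and walk the resulting DOMNodeList.
$items = [];
foreach ($xpath->query('//li') as $li) {
    $items[] = $li->textContent;
}
echo implode(', ', $items), PHP_EOL; // prints "alpha, beta"
```

Each step is covered in more detail in the sections that follow.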

1. Initializing a DOMDocument class instance from page content

  • create a new DOMDocument class instance
$DOM = new DOMDocument;
libxml_use_internal_errors(true);
When using libxml_use_internal_errors(true), be sure to clear the internal error buffer with libxml_clear_errors(). If you don't, and you use this in a long-running process, the buffered errors may eventually use up all your memory (a tip borrowed from another source). See the 'enable user error handling' bullet point.
  • load the HTML text into the DOMDocument object
if (!$DOM->loadHTML($page))
  • enable user error handling
    {   $errors = "";
        foreach (libxml_get_errors() as $error) {
            $errors .= $error->message . '<br/>';
        }
        libxml_clear_errors();
        print "libxml errors:<br>$errors";
        return;
    }
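
As far as I know, loadHTML() usually returns true even for imperfect markup, because libxml recovers from most HTML errors, so the buffer is worth inspecting on success as well as on failure. Below is a hedged sketch of that idea. The helper name loadHtmlWithErrors() is invented for this example and is not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): loads HTML into $dom and
// returns whatever messages libxml buffered while parsing.
function loadHtmlWithErrors(DOMDocument $dom, string $html): array
{
    // Switch to internal error handling, remembering the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each item is a LibXMLError with level, code, line and message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Always empty the buffer so it cannot grow in a long-running process.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$dom = new DOMDocument();
$messages = loadHtmlWithErrors($dom, '<html><body><p>Hello</p></body></html>');
echo count($messages), " message(s) buffered\n";
```

Restoring the previous libxml_use_internal_errors() setting keeps the helper from changing error behaviour for the rest of the script.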

Now the DOMDocument object (named '$DOM') contains all the target text as an HTML DOM structure. It is ready for its methods and properties to be used.

2. Initializing a DOMXPath object from the DOMDocument object

  • initialize a DOMXPath object for further parsing
$xpath = new DOMXPath($DOM);

Now XPath methods can be applied to the content.
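
The two DOMXPath methods most often used at this point are query() and evaluate(). The sketch below contrasts them on a small made-up list. The markup and the "menu" id are assumptions for illustration only.

```php
<?php
// Hypothetical markup used only for this illustration.
$html = '<html><body><ul id="menu"><li>one</li><li>two</li><li>three</li></ul></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// query() returns a DOMNodeList of the matching nodes.
$nodes = $xpath->query('//ul[@id="menu"]/li');
echo $nodes->length, PHP_EOL;

// evaluate() can also return a typed scalar, such as a number for count()...
$count = $xpath->evaluate('count(//ul[@id="menu"]/li)');
echo $count, PHP_EOL;

// ...or a string for string().
$first = $xpath->evaluate('string(//ul[@id="menu"]/li[1])');
echo $first, PHP_EOL;
```

Use query() when you want to iterate over nodes, and evaluate() when the XPath expression itself computes a number or a string.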

3. Parsing the DOMXPath object

As a test page I took the Blocks Testing Ground page and wrote some code that uses XPath to retrieve data from it.

$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = 'div[not (@class="ads")]/span[1]';
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry) {
    echo " {$entry->firstChild->nodeValue} <br /> ";
}
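
To see how the relative query behaves without fetching the real page, here is a self-contained sketch. The markup is a guess at the general shape of such a block (a container with product divs and an "ads" div) and is not copied from the actual test page.

```php
<?php
// Hypothetical markup: product blocks mixed with an advertisement block.
$html = <<<'HTML'
<div id="case1">
  <div class="prod"><span>Product 1</span><span>$10</span></div>
  <div class="ads"><span>Advert</span></div>
  <div class="prod"><span>Product 2</span><span>$20</span></div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Absolute query: find the container; item(0) is null when nothing matches.
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
if ($case1 === null) {
    exit("Container not found\n");
}

// Relative query: no leading slash, so it is evaluated from $case1.
// It takes the first span of every child div that is not an "ads" div.
$names = [];
foreach ($xpath->query('div[not(@class="ads")]/span[1]', $case1) as $span) {
    $names[] = $span->textContent;
}
echo implode(', ', $names), PHP_EOL; // prints "Product 1, Product 2"
```

Passing the context node as the second argument of query() is what keeps the expression scoped to the container instead of the whole document.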

 

How the libxml library reacts to malformed HTML

The libxml library gave no warning about malformed HTML that lay outside the part of the DOM structure being parsed, yet it did issue an error for malformed HTML that was the direct subject of the parse:

  • No warning for this case: <p><p><p>
  • For a missing bracket, <div prod='name1' <div …>, and then for the extra opened tag, <div prod='name1' ><div>, the library issued an exception for the DOMXPath 'query' method.
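
If you want to repeat this kind of check yourself, a sketch like the one below prints whatever libxml reports for each sample. The helper libxmlMessagesFor() and the sample fragments are assumptions made for this example, and the exact messages depend on your PHP and libxml versions, so no particular output is promised. It assumes PHP 7.4 or later because of the arrow function.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns the libxml messages
// produced while parsing $html.
function libxmlMessagesFor(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Keep just the message text of every buffered LibXMLError.
    $messages = array_map(
        fn(LibXMLError $e) => trim($e->message),
        libxml_get_errors()
    );
    libxml_clear_errors();

    return $messages;
}

// Made-up samples loosely modelled on the two cases described above.
$samples = [
    'nested paragraphs' => '<p><p><p>',
    'missing bracket'   => "<div prod='name1' <div>text</div>",
];

foreach ($samples as $label => $html) {
    $messages = libxmlMessagesFor($html);
    echo $label, ': ', count($messages), " message(s)\n";
    foreach ($messages as $message) {
        echo '  - ', $message, "\n";
    }
}
```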

The whole Scraper Listing

<?php
// Fetch the test page with cURL.
$curl = curl_init('http://testing-ground.scraping.pro/blocks');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
if (curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
curl_close($curl);

// Load the HTML into a DOMDocument, buffering libxml errors.
$DOM = new DOMDocument;
libxml_use_internal_errors(true);
if (!$DOM->loadHTML($page))
{
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br/>";
    }
    libxml_clear_errors();
    print "libxml errors:<br>$errors";
    return;
}

// Query the document: first the container, then the spans relative to it.
$xpath = new DOMXPath($DOM);
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = 'div[not (@class="ads")]/span[1]';
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry) {
    echo " {$entry->firstChild->nodeValue} <br /> ";
}
?>

http://scraping.pro/5-best-xpath-cheat-sheets-and-quick-references/#more-5731

 


allenpg
