PHP in Action

Get links with XPath

dagfinn | 06 October, 2008 12:53

A tutorial called Get Links With DOM appeared recently. Planet PHP lists the author as Kevin Waterson, although his name is not mentioned on the page itself. Anyway, he claims:

Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it using regular expressions. The job can be done with regular expressions, however, there is a high overhead in having preg loop over the entire document many times. The correct way, and the faster, and infinitely cooler ways is to use DOM.

Yes, of course it's cooler. But I'm a little surprised at the claim that it's the "correct" (implying the only) way, since there's at least one more that I find even cooler: XPath. Admittedly, it may well be slower, but it's a more powerful query language.

In his example, we just need to add a line to create an XPath object after we've created the DOM object:

$xpath = new DOMXPath($dom);
 

Then, instead of the DOM call:

/*** get the links from the HTML ***/
$links = $dom->getElementsByTagName('a');
 

we can use an XPath query:

/*** get the links from the HTML ***/
$links = $xpath->query('//a');
 

That's all. So why is that cooler? Because you can do more powerful searches easily. The DOM just happens to have a simple call to find all elements with a certain tag name, so there's not much difference in this case. More complex queries are another matter. For instance, we can select just the URLs (as href attribute nodes) with a single expression:

$links = $xpath->query('//a/@href');
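
To see the pieces together, here is a minimal, self-contained sketch. The inline sample markup and the variable names are my own stand-ins for the tutorial's downloaded page, not part of the original example. Note that //a/@href yields attribute nodes (DOMAttr), so the URL string is read from each node's value property:

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<html><body>'
      . '<a href="/one">One</a>'
      . '<a name="top">No href here</a>'
      . '<a href="/two">Two</a>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Same element nodes as $dom->getElementsByTagName('a').
$anchors = $xpath->query('//a');

// Attribute nodes only; anchors without an href are not matched.
$urls = array();
foreach ($xpath->query('//a/@href') as $attr) {
    $urls[] = $attr->value;
}

printf("%d anchors, %d URLs\n", $anchors->length, count($urls));
print_r($urls);
```

The anchor without an href is counted by //a but contributes nothing to //a/@href, which is usually what you want when collecting URLs.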
 

Or we can get the URLs of only those links whose class attribute is exactly "bookmark":

$links = $xpath->query("//a[@class='bookmark']/@href");
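
One caveat: [@class='bookmark'] compares the whole attribute value, so a link with class="bookmark external" would not match. The sketch below contrasts it with a common XPath 1.0 idiom for matching one class among several. The sample markup and the hrefs() helper are invented for illustration:

```php
<?php
// Hypothetical sample markup: three links with similar class attributes.
$html = '<html><body>'
      . '<a class="bookmark" href="/a">A</a>'
      . '<a class="bookmark external" href="/b">B</a>'
      . '<a class="bookmarks" href="/c">C</a>'
      . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Illustrative helper: string values of the attribute nodes a query returns.
function hrefs(DOMXPath $xpath, $expression) {
    $urls = array();
    foreach ($xpath->query($expression) as $attr) {
        $urls[] = $attr->value;
    }
    return $urls;
}

// Exact comparison of the whole class attribute.
$exact = hrefs($xpath, "//a[@class='bookmark']/@href");

// Whitespace-delimited token match, so "bookmark external" also counts.
$token = hrefs($xpath,
    "//a[contains(concat(' ', normalize-space(@class), ' '), ' bookmark ')]/@href");

print_r($exact);
print_r($token);
```

The exact comparison finds only /a, while the token match finds /a and /b. Neither matches class="bookmarks", because the padded spaces force a whole-word comparison.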
 

I've been using this for ages when testing web pages. Then there's the not quite official SimpleTest DOM tester, which uses CSS selectors to specify paths. But I won't go into that right now.

Comments

jQuery

David M | 06/10/2008, 19:42

What's even cooler: jQuery :-)

jQuery

dagfinn | 06/10/2008, 22:48

Yes, but for slightly different purposes. ;-)

hi

daliada | 06/10/2008, 23:48

Thank you for promoting the right tools for php

Benchmarks

Zilvinas | 07/10/2008, 00:00

Did you try to benchmark this? I think not. Here is the code for the XPath approach to get all IMDb links:

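$dom = new DOMDocument;
// Set parser options before loading; afterwards they cannot affect parsing.
$dom->preserveWhiteSpace = false;
// @ suppresses the warnings that invalid real-world HTML triggers.
@$dom->loadHTML(file_get_contents('http://www.imdb.com/'));
$xpath = new DOMXPath($dom);
$links = $xpath->query('//a');
$ret = array();
foreach ($links as $tag) {
    // textContent is safe even for anchors that have no child nodes.
    $ret[$tag->getAttribute('href')] = $tag->textContent;
}
print_r($ret);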

And here's the code to do it with regular expressions:

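// Capture the quoted href value and the link text of each anchor.
preg_match_all('/<a\s[^>]*?href=(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is', file_get_contents('http://www.imdb.com/'), $matches);
print_r($matches);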

Sadly, DOM and XPath do not make a big difference here. Regular expressions are even slightly faster on my PC. DOM and XPath leave a bigger memory footprint than preg in this particular case. And the DOM approach is a hell of a lot uglier than preg_match :]
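
For anyone who wants to repeat the comparison without network latency in the measurement, here is a rough timing sketch. It assumes a synthetic page built in memory. The function names, page size, and run count are arbitrary choices for illustration, and it makes no claim about which approach wins on a given machine:

```php
<?php
// Build a synthetic page so the timing excludes the download.
$html = '<html><body>';
for ($i = 0; $i < 5000; $i++) {
    $html .= '<p><a href="/title/' . $i . '/">Title ' . $i . '</a></p>';
}
$html .= '</body></html>';

// Hypothetical helper: href => link text via DOM and XPath.
function linksViaXpath($html) {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    $ret = array();
    foreach ($xpath->query('//a[@href]') as $a) {
        $ret[$a->getAttribute('href')] = $a->textContent;
    }
    return $ret;
}

// Hypothetical helper: href => link text via a regular expression.
// Only handles double-quoted href values, which is all the synthetic page uses.
function linksViaRegex($html) {
    preg_match_all('/<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)<\/a>/is', $html, $m);
    return array_combine($m[1], $m[2]);
}

// Average wall-clock seconds per run, plus the number of links found.
function timeIt($fn, $html, $runs = 20) {
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $result = call_user_func($fn, $html);
    }
    return array((microtime(true) - $start) / $runs, count($result));
}

foreach (array('linksViaXpath', 'linksViaRegex') as $fn) {
    list($seconds, $count) = timeIt($fn, $html);
    printf("%-14s %d links, %.4f s per run\n", $fn, $count, $seconds);
}
echo 'Peak memory: ', memory_get_peak_usage(), " bytes\n";
```

Both helpers return the same array for this tidy synthetic page. Real pages with unquoted attributes or nested markup are where the two approaches start to disagree.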

Using XQuery

William Candillon | 07/10/2008, 04:36

Using XQuery can also help: http://www.zorba-xquery.com/index.php/24/

Regular expressions

dagfinn | 07/10/2008, 05:28

@Zilvinas: That's correct, I haven't benchmarked it. I wasn't the one who claimed the DOM is faster.

And I'm not saying it's a sin to use regular expressions. But DOM and XPath tend to be more precise since they respect the structure of the HTML document. For instance, you can search for the presence of an attribute without worrying about the possibility that another attribute might be inserted between the tag and the attribute you're searching for.
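
A small sketch of that point, using invented sample markup in which the same two attributes appear in different orders. The regular expression is deliberately naive, to show the failure mode rather than the best pattern one could write:

```php
<?php
// Hypothetical sample markup: same attributes, different order and neighbours.
$html = '<html><body>'
      . '<a class="bookmark" href="/a">A</a>'
      . '<a href="/b" class="bookmark">B</a>'
      . '<a class="bookmark" title="C" href="/c">C</a>'
      . '</body></html>';

// A naive pattern that assumes href directly follows class.
preg_match_all('/<a class="bookmark" href="([^"]*)"/', $html, $matches);
$viaRegex = $matches[1];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// The XPath query does not care where the attributes sit in the tag.
$viaXpath = array();
foreach ($xpath->query("//a[@class='bookmark']/@href") as $attr) {
    $viaXpath[] = $attr->value;
}

print_r($viaRegex);
print_r($viaXpath);
```

The pattern finds only /a, while the XPath query finds all three links. A more careful pattern could cope with this, but it has to anticipate every variation explicitly.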

And thanks for the other comment, which I've deleted for reasons I'm sure you understand.

Re: Get links with XPath

Anup | 20/10/2008, 09:41

Using // is a real performance killer as it causes node traversal of every single element in the document.

Admittedly finding links throughout a document means you need to use some kind of traversal through lots of unknown elements.

To address this to some extent it is good to be as specific as you can.

In (X)HTML documents you could start off by trying an xpath such as this:

/html/body//a

This saves traversal of all head elements.

If you want all anchors inside a div with id content that is immediately inside body you could use something like this:

/html/body/div[@id='content']//a

If the div could appear at a depth you cannot easily predict or control, then something like this is okay:

/html/body//div[@id='content']//a

XPath is quite flexible so you can do a lot more if you know something about the document you are traversing.

Basically, the more precise you can be, the less wasteful traversal you'll need. That will really help with performance, especially for large documents.

(Of course, the more precise your XPath the more likely it will fail when the source HTML changes, so this needs to be considered carefully.)
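
To make the trade-off concrete, here is a small sketch that runs several such expressions against one document and counts the matches. The sample markup, with its nav, wrapper and content divs, is invented for illustration:

```php
<?php
// Hypothetical page: the content div is nested one level inside a wrapper.
$html = '<html><head><title>Sample</title></head><body>'
      . '<div id="nav"><a href="/home">Home</a></div>'
      . '<div id="wrapper"><div id="content">'
      . '<p><a href="/post">Post</a></p>'
      . '<a href="/more">More</a>'
      . '</div></div>'
      . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$expressions = array(
    "//a",                                // every anchor in the document
    "/html/body//a",                      // every anchor below body
    "/html/body/div[@id='content']//a",   // content div directly under body
    "/html/body//div[@id='content']//a",  // content div at any depth
    "/html/body//div[@id='content']/a",   // direct child anchors only
);

$counts = array();
foreach ($expressions as $expression) {
    $counts[$expression] = $xpath->query($expression)->length;
    printf("%-38s %d\n", $expression, $counts[$expression]);
}
```

The third expression finds nothing here because the content div sits inside a wrapper rather than directly under body, which is exactly the fragility mentioned above. The last one finds only the anchor that is a direct child of the div, since a single slash does not descend into the paragraph.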

 