Using regex to parse HTML is probably not the best way to go.
You might take a look at DOMDocument::loadHTML, which will allow you to work with an HTML document using DOM methods (and XPath queries, for instance, if you know those).
You might also want to take a look at Zend_Dom and Zend_Dom_Query, btw, which are quite nice if you can use some parts of Zend Framework in your application.
They are used to fetch data from HTML pages when doing functional testing with Zend_Test, for instance, and they work quite well ;-)
It may seem harder at first... But, considering the mess some HTML pages are, it is probably a much wiser idea...
EDIT after the comment and the edit of the OP
Here are a couple of thoughts about, to begin with something "simple", an input tag :
- it can spread across several lines
- it can have many attributes
- considering only name and value are of interest to you, you have to deal with the fact that those two can be in any possible order
- attributes can have double-quotes, single-quotes, or even nothing around their values
- tags / attributes can be either lower-case or upper-case
- tags don't always have to be closed
Well, some of those points are not valid HTML ; but they still work in the most common web browsers, so they have to be taken into account...
With only those points, I wouldn't like to be the one writing the regex ^^
But I suppose there might be other difficulties I didn't think about.
On the other hand, you have DOM and XPath... To get the value of an input name="q" (example is this page), it's a matter of something like this :
$url = 'http://www.google.fr/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
    // yep, not necessarily valid-html...
    $xpath = new DOMXpath($dom);
    $nodeList = $xpath->query('//input[@name="q"]');
    if ($nodeList->length > 0) {
        for ($i=0 ; $i<$nodeList->length ; $i++) {
            $node = $nodeList->item($i);
            var_dump($node->getAttribute('value'));
        }
    }
} else {
    // too bad...
}
What matters here ? The XPath query, and only that... And is there anything static/constant in it ?
Well, I say I want all <input> tags that have a name attribute that is equal to "q".
And it just works : I'm getting this result :
string 'test' (length=4)
string 'test' (length=4)
(I checked : there are two input name="q" on the page ^^ )
Do I know the structure of the page ? Absolutely not ;-)
I just know I/you/we want input tags named q ;-)
And that's what we get ;-)
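To tie this back to the list of difficulties above, here is a minimal sketch run against a made-up HTML string instead of a live page. The inputValues() helper and the sample markup are my own assumptions for illustration, not something from the question. It uses libxml_use_internal_errors() rather than @ to keep parse warnings quiet, and it compares the name in PHP so the field name never has to be spliced into the XPath string.

```php
<?php
/**
 * Hypothetical helper (not part of any library): returns the value
 * attribute of every <input> whose name attribute equals $name.
 */
function inputValues(string $html, string $name): array
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return [];
    }

    $xpath = new DOMXPath($dom);
    $values = [];
    // Select every named input, then filter on the name in PHP.
    foreach ($xpath->query('//input[@name]') as $node) {
        if ($node->getAttribute('name') === $name) {
            $values[] = $node->getAttribute('value');
        }
    }
    return $values;
}

// Made-up markup: multi-line tag, upper-case names, mixed quoting,
// attributes in different orders, and unclosed tags.
$html = <<<'HTML'
<html><body><form>
<input name="q" value="one">
<INPUT VALUE='two'
       TYPE=text
       NAME=q>
<input type="hidden" value=three name='q' />
<input name="other" value="ignored">
</form></body></html>
HTML;

print_r(inputValues($html, 'q'));
```

The parser normalizes tag and attribute names to lower-case, so the same lower-case query should cover all three variants; a regex would need a separate branch for each of them.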
EDIT 2 : and a bit of fun with select and options :
Well, just for fun, here's what I came up with for select and option :
$url = 'http://www.google.fr/language_tools?hl=fr';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
    // yep, not necessarily valid-html...
    $xpath = new DOMXpath($dom);
    $nodeListSelects = $xpath->query('//select');
    if ($nodeListSelects->length > 0) {
        for ($i=0 ; $i<$nodeListSelects->length ; $i++) {
            $nodeSelect = $nodeListSelects->item($i);
            $name = $nodeSelect->getAttribute('name');
            // We want options that are inside the current select
            $nodeListOptions = $xpath->query('option[@selected="selected"]', $nodeSelect);
            if ($nodeListOptions->length > 0) {
                for ($j=0 ; $j<$nodeListOptions->length ; $j++) {
                    $nodeOption = $nodeListOptions->item($j);
                    $value = $nodeOption->getAttribute('value');
                    var_dump("name='$name' => value='$value'");
                }
            }
        }
    }
} else {
    // too bad...
}
And I get as an output :
string 'name='sl' => value='fr'' (length=23)
string 'name='tl' => value='en'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
Which is what I expected.
Some explanations ?
Well, first of all, I get all the select tags of the page, and keep their name in memory.
Then, for each one of those, I get the selected option tags that are its children (there's always only one here, btw).
And here, I have the value.
A bit more complicated than the previous example... But still much easier than regex, I believe... Took me maybe 10 minutes, not more... And I still won't have the courage (madness ?) to start thinking about some kind of mutant regex that would be able to do that :-D
Oh, and, as a sidenote : I still have no idea what the structure of the HTML document looks like : I have not even taken a single look at its source ^^
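A slightly more tolerant variant of the select/option walk, as a sketch only: the markup below is made up, and the two changes are my own assumptions about what other pages might contain. The relative path option only matches direct children of the select, so options wrapped in an optgroup would be missed; .//option reaches them. And [@selected] only tests that the attribute exists, so it does not depend on whether the page writes selected or selected="selected".

```php
<?php
// Made-up markup: one bare "selected" attribute, one selected="selected",
// and one select that wraps its options in an <optgroup>.
$html = <<<'HTML'
<html><body><form>
<select name="sl">
  <option value="fr" selected>French</option>
  <option value="en">English</option>
</select>
<select name="tl">
  <optgroup label="Common">
    <option value="fr">French</option>
    <option value="en" selected="selected">English</option>
  </optgroup>
</select>
</form></body></html>
HTML;

// Keep parse warnings out of the output without using @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$selected = [];
foreach ($xpath->query('//select') as $select) {
    // ".//" searches all descendants of the current select, not just children.
    foreach ($xpath->query('.//option[@selected]', $select) as $option) {
        $selected[$select->getAttribute('name')] = $option->getAttribute('value');
    }
}
print_r($selected);
```

Keying the result by the select's name assumes one selected option per select, which was the case on the page above; a multiple select would need an array per name instead.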
I hope this helps a bit more...
Who knows, maybe I'll convince you that regexes are not a good idea when it comes to parsing HTML... maybe ? ;-)
Still : have fun !