Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

php - Extract form fields using RegEx

I'm looking for a way to get all the form inputs and respective values from a page given a specific URL and form name.

function GetForm($url, $name)
{
    return array
    (
        'field_name_1' => 'value_1',
        'field_name_2' => 'value_2',
        'select_field_name' => array('option_1', 'option_2', 'option_3'),
    );
}

GetForm('http://www.google.com/', 'f');

Can anyone provide me with the necessary regular expressions to accomplish this?

EDIT: I understand that querying the DOM would be far more reliable. However, what I'm looking for is a website-agnostic solution that allows me to get all the fields of a given form. I don't believe this is possible with DOM without knowing the document nodes first, am I wrong?

I don't need a bulletproof solution, just something that works on standard web pages. For the FORM tag I've come up with the following RegEx:

'~<form.*?name=[\'"]?' . preg_quote($name, '~') . '[\'"]?.*?>(.+?)</form>~is'

I believe that doing something similar for input fields won't be difficult. What I find most challenging is the RegEx for the select and option fields.


1 Reply

Using regex to parse HTML is probably not the best way to go.

You might take a look at DOMDocument::loadHTML, which will allow you to work with an HTML document using DOM methods (and XPath queries, for instance, if you know those).

You might also want to take a look at Zend_Dom and Zend_Dom_Query, which are quite nice if you can use some parts of Zend Framework in your application.
They are used to fetch data from HTML pages when doing functional testing with Zend_Test, for instance, and work quite well ;-)

It may seem harder at first... But, considering the mess some HTML pages are, it is probably a much wiser idea...


EDIT after the comment and the edit of the OP

Here are a couple of thoughts about, to begin with something "simple", an input tag:

  • it can spread across several lines
  • it can have many attributes
  • considering that only name and value are of interest to you, you have to deal with the fact that those two can appear in any order
  • attributes can have double quotes, single quotes, or even nothing around their values
  • tags / attributes can be either lower-case or upper-case
  • tags don't always have to be closed

Well, some of those points are not valid HTML, but they still work in the most common web browsers, so they have to be taken into account...

With those points alone, I wouldn't like to be the one writing the regex ^^
But I suppose there might be other difficulties I didn't think about.
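
To make the points above a bit more concrete, here is a small sketch. The pattern and the sample tags are made up for illustration (they come neither from the question nor from a real page): the pattern is the kind of thing one writes first, with name before value and double quotes around both.

```php
<?php
// Hypothetical "first attempt" pattern: name first, then value, both double-quoted.
$naive = '~<input[^>]*\bname="([^"]*)"[^>]*\bvalue="([^"]*)"[^>]*>~';

// Made-up samples that a browser would treat as the same field.
$samples = array(
    'plain'         => '<input type="text" name="q" value="test">',
    'reversed'      => '<input value="test" name="q">',
    'single quotes' => "<input name='q' value='test'>",
    'no quotes'     => '<input name=q value=test>',
    'upper case'    => '<INPUT NAME="q" VALUE="test">',
);

// Record, for each sample, whether the naive pattern finds the field.
$matched = array();
foreach ($samples as $label => $tag) {
    $matched[$label] = preg_match($naive, $tag) === 1;
}

var_dump($matched);
```

Only the 'plain' sample should match here. Each of the other variants needs its own extension of the pattern (alternation for the quotes, the i modifier, a second branch for the reversed order...), and that is before select and option come into play.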


On the other side, you have DOM and XPath... Getting the value of an input named "q" (the example uses a Google search results page) is a matter of something like this:

$url = 'http://www.google.fr/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a';
$html = file_get_contents($url);
$dom = new DOMDocument();
// "@" silences the warnings libxml emits for markup that is not valid HTML
if ($html !== false && @$dom->loadHTML($html)) {
    $xpath = new DOMXPath($dom);

    // Every <input> whose name attribute equals "q", wherever it sits in the page
    $nodeList = $xpath->query('//input[@name="q"]');
    foreach ($nodeList as $node) {
        var_dump($node->getAttribute('value'));
    }
} else {
    // The page could not be fetched or parsed
}

What matters here? The XPath query, and only that... And is there anything in it that is tied to one particular page?
Well, I only say that I want every <input> whose name attribute is equal to "q".
And it just works. I'm getting this result:

string 'test' (length=4)
string 'test' (length=4)

(I checked: there are two input name="q" on the page ^^ )

Do I know the structure of the page? Absolutely not ;-)
I just know I/you/we want input tags named q ;-)

And that's what we get ;-)
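
The same idea can be narrowed down to one form, which is closer to what the question asks for. What follows is only a sketch under a few assumptions: the HTML is a made-up string standing in for a fetched page, and $formName is a hypothetical variable playing the role of the 'f' passed to GetForm() in the question.

```php
<?php
// Made-up page standing in for a fetched document; the markup is sloppy on purpose.
$html = '<html><body>
<form name="f" action="/search">
  <INPUT type=hidden name=hl value="en">
  <div><input value="test" name="q"></div>
  <input type="submit" name="btnG" value="Search">
</form>
<form name="other"><input name="ignored" value="x"></form>
</body></html>';

$formName = 'f'; // hypothetical, mirrors GetForm(..., 'f') in the question

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// All named <input> elements anywhere below the form with that name.
// $formName is concatenated into the query, so it must not contain a double quote.
$fields = array();
foreach ($xpath->query('//form[@name="' . $formName . '"]//input[@name]') as $input) {
    $fields[$input->getAttribute('name')] = $input->getAttribute('value');
}

var_dump($fields);
```

With this sample, $fields should hold hl, q and btnG with their values, and nothing from the second form. The query still says nothing about where the form sits in the page or how deeply the inputs are nested.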


EDIT 2: and a bit of fun with select and options:

Well, just for fun, here's what I came up with for select and option:

$url = 'http://www.google.fr/language_tools?hl=fr';
$html = file_get_contents($url);
$dom = new DOMDocument();
// "@" silences the warnings libxml emits for markup that is not valid HTML
if ($html !== false && @$dom->loadHTML($html)) {
    $xpath = new DOMXPath($dom);

    // Every <select> in the page
    foreach ($xpath->query('//select') as $nodeSelect) {
        $name = $nodeSelect->getAttribute('name');
        // Passing $nodeSelect as context limits the query to options that are children of the current select
        $nodeListOptions = $xpath->query('option[@selected="selected"]', $nodeSelect);
        foreach ($nodeListOptions as $nodeOption) {
            $value = $nodeOption->getAttribute('value');
            var_dump("name='$name' => value='$value'");
        }
    }
} else {
    // The page could not be fetched or parsed
}

And I get this output:

string 'name='sl' => value='fr'' (length=23)
string 'name='tl' => value='en'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)

Which is what I expected.


Some explanations?

Well, first of all, I get all the select tags of the page and keep each one's name.
Then, for each of those, I get the selected option tags that are its children (in a regular single-choice select there is only one, btw).
And there I have the value.
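
One caveat: the query used above only sees options that sit directly under the select and carry selected="selected". A slightly more tolerant variant is sketched below. The markup is made up for the occasion, and the changed query is my own suggestion rather than something that was run against the Google page.

```php
<?php
// Made-up markup: one bare "selected" attribute, one option nested in an <optgroup>.
$html = '<html><body><form name="f">
<select name="sl">
  <option value="fr">French</option>
  <option value="en" selected>English</option>
</select>
<select name="tl">
  <optgroup label="Europe">
    <option value="de">German</option>
    <option value="it" selected="selected">Italian</option>
  </optgroup>
</select>
</form></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$selected = array();
foreach ($xpath->query('//select[@name]') as $select) {
    // ".//option" searches all descendants of the current select;
    // "[@selected]" only asks that the attribute exists, whatever its value.
    foreach ($xpath->query('.//option[@selected]', $select) as $option) {
        $selected[$select->getAttribute('name')][] = $option->getAttribute('value');
    }
}

var_dump($selected);
```

Values are collected in an array per select, so a select with the multiple attribute would simply yield several entries.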

A bit more complicated than the previous example... But still much easier than regex, I believe... Took me maybe 10 minutes, not more... And I still don't have the courage (madness?) to start thinking about some kind of mutant regex that would be able to do that :-D

Oh, and, as a side note: I still have no idea what the structure of the HTML document looks like. I have not even taken a single look at its source ^^
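
Putting the pieces together, something shaped like the GetForm() from the question could look like the sketch below. Take it as a starting point and not as a finished implementation: getFormFields() is a name I made up, it takes the HTML as a string instead of a URL so it can be tried offline, and the choice to return every option value for a select (not only the selected one) follows the array shown in the question.

```php
<?php
/**
 * Hypothetical helper: returns array(name => value) for inputs and textareas,
 * and array(name => array(option values)) for selects, for one named form.
 */
function getFormFields($html, $formName)
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup quietly
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return array();
    }

    $xpath = new DOMXPath($dom);
    $fields = array();

    // The form name is compared in PHP, which avoids building an XPath
    // string out of $formName.
    foreach ($xpath->query('//form[@name]') as $form) {
        if ($form->getAttribute('name') !== $formName) {
            continue;
        }
        foreach ($xpath->query('.//input[@name]', $form) as $input) {
            $fields[$input->getAttribute('name')] = $input->getAttribute('value');
        }
        foreach ($xpath->query('.//textarea[@name]', $form) as $textarea) {
            $fields[$textarea->getAttribute('name')] = $textarea->textContent;
        }
        foreach ($xpath->query('.//select[@name]', $form) as $select) {
            $options = array();
            foreach ($xpath->query('.//option', $select) as $option) {
                $options[] = $option->getAttribute('value');
            }
            $fields[$select->getAttribute('name')] = $options;
        }
        break; // the first form with that name wins
    }

    return $fields;
}

// Made-up markup for a quick try.
$html = '<form name="f"><input name="q" value="test">'
      . '<select name="sl"><option value="fr">French</option>'
      . '<option value="en" selected>English</option></select>'
      . '<textarea name="note">hi</textarea></form>';

var_dump(getFormFields($html, 'f'));
```

For a real page, the $html argument would come from file_get_contents($url) as in the examples above. Radio buttons and checkboxes that share a name will overwrite each other in this sketch, so that part would need more care.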


I hope this helps a bit more...
Who knows, maybe I'll convince you that regexes are not a good idea when it comes to parsing HTML... maybe? ;-)

Still: have fun!

