Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Selective data extraction from forum site using DOM PHP web crawler

I have a PHP DOM web crawler that works fine. It extracts a specified tag, along with its link, from an external forum site onto my page.

Recently, though, I ran into a problem. This is the HTML of the forum data:

<tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>

Suppose the table above is the only content on that site, and I extract it with a web crawler like this:

<?php
require_once('dom/simple_html_dom.php');

$html = file_get_html('http://www.sitename.com/');
foreach ($html->find('td.FootNotes2') as $element) {
    echo $element;
}
?>

It extracts all the data inside every cell with the class name "FootNotes2".

Now, what if I want to extract only specific data from a cell, for example the names "dreamer1984" and "monariyadh" from the first cell of each row?

And what if I want to extract data from only the 3rd cell, skipping the rest, when they all have the same class name?

I hope the problem is clear. Any help is appreciated.


1 Reply


I suggest using a regex. Here is an example of what you need:

$subject = <<<EOF
<tbody>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td>
</tr>
</tbody>
EOF;

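// With "/" as the delimiter, every literal slash must be escaped, and the digit class is \d.
// The "s" flag lets "." match newlines, so one match can span several <td> lines.
preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matches);

// The named groups line up by index: $matches['name'][$k] belongs with $matches['date'][$k].
foreach ($matches['name'] as $k => $v) {
    var_dump('name: ' . $v, 'relative date: ' . $matches['date'][$k]);
}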
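
If you would rather not depend on a regex, the same extraction can probably be done with PHP's built-in DOMDocument and DOMXPath classes. The following is only a sketch under a few assumptions: the function name extractRows and the array keys are made up for illustration, the sample markup is the question's table with the spacer cells and width/height attributes trimmed, and "the 3rd" is taken to mean the third FootNotes2 cell in each row.

```php
<?php
// Sample markup from the question, trimmed to the cells that carry data.
$html = <<<'HTML'
<table><tbody>
<tr>
    <td class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td class="FootNotes2" align="center">02/28/17 01:42</td>
    <td align="Center" class="FootNotes2">0</td>
    <td align="Center" class="FootNotes2">200</td>
</tr>
<tr>
    <td class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td>
    <td class="FootNotes2" align="center">02/27/17 23:12</td>
    <td align="Center" class="FootNotes2">0</td>
    <td align="Center" class="FootNotes2">108</td>
</tr>
</tbody></table>
HTML;

// Hypothetical helper (not part of any library): one array per table row.
function extractRows(string $html): array
{
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; silence parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows  = [];

    foreach ($xpath->query('//tr') as $tr) {
        // Only the cells of this row whose class list contains FootNotes2.
        $cells = $xpath->query(
            './td[contains(concat(" ", normalize-space(@class), " "), " FootNotes2 ")]',
            $tr
        );
        if ($cells->length < 3) {
            continue; // not a data row
        }

        $first = $cells->item(0);
        $link  = $xpath->query('./a', $first)->item(0);

        // The user name is the text that sits directly in the cell, after the link.
        $tail = '';
        foreach ($first->childNodes as $node) {
            if ($node->nodeType === XML_TEXT_NODE) {
                $tail .= $node->nodeValue;
            }
        }

        $rows[] = [
            'title'      => $link ? trim($link->textContent) : null,
            'href'       => $link ? $link->getAttribute('href') : null,
            'name'       => trim(preg_replace('/^\s*-\s*/', '', $tail)), // drop the leading " - "
            'date'       => trim($cells->item(1)->textContent),
            'third_cell' => trim($cells->item(2)->textContent),          // assumption: "3rd" = third FootNotes2 cell
        ];
    }

    return $rows;
}

$rows = extractRows($html);
foreach ($rows as $row) {
    echo $row['name'], ' | ', $row['date'], ' | ', $row['third_cell'], PHP_EOL;
}
```

Picking cells by position within each row, instead of looping over every td.FootNotes2 on the page, is what lets you skip the cells you do not need. For a live page you would fetch the HTML first (for example with file_get_contents) and pass it to the same function.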
