Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Extracting data from HTML table

I am looking for a way to extract certain information from an HTML file in a Linux shell environment.

This is the bit that I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

I want to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:

Tests         : 103
Failures      : 24
Success Rate  : 76.70 %
and so on.

What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this info.

But using Java here seems like overkill, since the runnable jar would have to be bundled with the "wrapper" script that gets executed.

I'm sure there are scripting languages that can do the same, e.g. Perl, Python or Bash.

My problem is that I have zero experience with these. Can somebody help me resolve this "fairly easy" issue?

Quick update:

I forgot to mention that the .html document contains more tables and more rows. Sorry about that (early morning).

Update #2:

I tried to install BeautifulSoup like this, since I don't have root access:

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py   # pasted the code from Tichodromas' answer (exactly what I pasted: http://pastebin.com/4Je11Y9q)

Running the file gives this error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives this error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

Any ideas?

1 Reply

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: Using class="details" to select the table):

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

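# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

# Every later tr is one data row; pair each cell with its heading.
datasets = []
for row in table.find_all("tr")[1:]:
    # list() keeps the pairs printable on Python 3, where zip() is lazy.
    dataset = list(zip(headings, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)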

The result looks like this (the u prefixes appear under Python 2):

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]
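
If installing BeautifulSoup is not an option (the question mentions having no root access), a similar structure can be built with the standard library alone. What follows is only a minimal sketch, assuming Python 3 is available: TableParser and extract_datasets are names made up for this illustration, the sample HTML is shortened, and nested tables are not handled. It collects every table in the document rather than only the one with class="details", which also covers a document with several tables and rows:

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table and row."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one entry per <table>: a list of rows
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("th", "td") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def extract_datasets(html):
    """Return, for each table, its data rows as lists of (heading, value)."""
    parser = TableParser()
    parser.feed(html)
    result = []
    for rows in parser.tables:
        if not rows:
            continue
        headings = rows[0]  # assumption: the first row holds the headings
        result.append([list(zip(headings, row)) for row in rows[1:]])
    return result


sample = """
<table class="details">
  <tr><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
  <tr class="Failure"><td>103</td><td>24</td><td>76.70%</td></tr>
</table>
"""

for table in extract_datasets(sample):
    for row in table:
        for heading, value in row:
            print("{0:<16}: {1}".format(heading, value))
```

Each table becomes a list of rows, and each row is a list of (heading, value) pairs, so the formatting loop used for the BeautifulSoup version works on a row unchanged.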

Edit2: To produce the desired output, use something like this:

for dataset in datasets:
    for field in dataset:
        # Left-align the field name in a 16-character column.
        print("{0:<16}: {1}".format(field[0], field[1]))

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
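
The question also asks about storing the values in shell variables. One possible approach, sketched here under assumptions, is to let Python print NAME=value lines that the calling shell then evaluates. The helper to_shell_assignments is a made-up name, the pairs are hard-coded in the same shape as one dataset above, headings are assumed to start with a letter, and Python 3 is assumed for shlex.quote:

```python
import re
import shlex

# Pairs in the same shape as one entry of the datasets list.
dataset = [
    ("Tests", "103"),
    ("Failures", "24"),
    ("Success Rate", "76.70%"),
    ("Average Time", "71 ms"),
]


def to_shell_assignments(pairs):
    """Turn (heading, value) pairs into lines like AVERAGE_TIME='71 ms'."""
    lines = []
    for heading, value in pairs:
        # Shell variable names may only contain letters, digits and underscores.
        name = re.sub(r"\W+", "_", heading.strip()).upper()
        # shlex.quote adds quotes only where the shell would need them.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines


print("\n".join(to_shell_assignments(dataset)))
```

If this were saved as extract.py (a hypothetical file name), the wrapper script could run eval "$(python3 extract.py)" and then read $TESTS, $FAILURES and so on. Quoting each value keeps spaces, as in "71 ms", from splitting the assignment.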
