Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Parse an HTML file with a table using Python

I have a problem with my Python parser. This is part of my file:

<tr>
    <td class="zeit"><div>03.12. 10:45:00</div></td>
    <td class="system"><div><a target="_blank" href="detail.php?host=CG&factor=2&delay=1&Y=15">CG</div></a></td>
    <td class="fehlertext"><div>System steht nicht zur Verfügung!</div></td>
</tr>

<tr>
    <td class="zeit"><div>03.12. 10:10:01</div></td>
    <td class="system"><div><a target="_blank" href="detail.php?host=DEXProd&factor=2&delay=5&Y=15">DEX</div></a></td>
    <td class="fehlertext"><div>ssh: Connection refused Couldn't read packet: Connection reset by peer</div></td>
</tr>

<tr>
    <td class="zeit"><div>03.12. 06:23:06</div></td>
    <td class="system"><div><a target="_blank" href="detail.php?host=FRAUD&factor=2&delay=1&Y=15">Boni</div></a></td>
    <td class="fehlertext"><div>ID Fehler</div></td>
</tr>

Now I want to extract three pieces of information from each row:

1) DATE 2) NAME 3) ERROR

So for the first row the output should be:

03.12. 10:45:00 CG System steht nicht zur Verfügung!

I have been reading about BS4 (Beautiful Soup 4), but I have no idea how to adapt the Python script below to do this.

-bash-3.2$ cat out2.py

from bs4 import BeautifulSoup


with open("file.txt", "r") as myfile:
    html = myfile.read().replace('\n', '')

soup = BeautifulSoup(html, "html.parser")
tag = soup.find_all('a')  # all "a" tags in a list

count = 0
passx = 0
for i in tag:
    if count > 3:
        print("-------------------------------")
        #FILE.write("-------------------------------" + "\n")
        count = 0
        passx = 0
    if passx == 0:
        print(i['href'])
        #FILE.write(i['href'] + "\n")
        passx = 1
    print(i.text)
    count = count + 1

#FILE.close()

1 Reply


Find all tr tags and get td tags by class attribute:

# encoding: utf-8
from bs4 import BeautifulSoup

data = u"""
<table>
    <tr>
        <td class="zeit"><div>03.12. 10:45:00</div></td>
        <td class="system"><div><a target="_blank" href="detail.php?host=CG&factor=2&delay=1&Y=15">CG</div></a></td>
        <td class="fehlertext"><div>System steht nicht zur Verfügung!</div></td>
    </tr>

    <tr>
        <td class="zeit"><div>03.12. 10:10:01</div></td>
        <td class="system"><div><a target="_blank" href="detail.php?host=DEXProd&factor=2&delay=5&Y=15">DEX</div></a></td>
        <td class="fehlertext"><div>ssh: Connection refused Couldn't read packet: Connection reset by peer</div></td>
    </tr>

    <tr>
        <td class="zeit"><div>03.12. 06:23:06</div></td>
        <td class="system"><div><a target="_blank" href="detail.php?host=FRAUD&factor=2&delay=1&Y=15">Boni</div></a></td>
        <td class="fehlertext"><div>ID Fehler</div></td>
    </tr>
</table>
"""

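# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")
for tr in soup.find_all('tr'):
    # Each cell is looked up by its class; get_text(strip=True) drops surrounding whitespace
    zeit = tr.find('td', class_='zeit').get_text(strip=True)
    system = tr.find('td', class_='system').get_text(strip=True)
    fehlertext = tr.find('td', class_='fehlertext').get_text(strip=True)

    print(zeit, system, fehlertext)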

Prints:

03.12. 10:45:00 CG System steht nicht zur Verfügung!
03.12. 10:10:01 DEX ssh: Connection refused Couldn't read packet: Connection reset by peer
03.12. 06:23:06 Boni ID Fehler
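
If the real file contains rows that lack one of the three cells (a header row, for example), tr.find() returns None and the loop above would fail on .get_text(). The following is a minimal sketch of one way to guard against that, not the only way. The function name parse_rows, the SAMPLE string and its extra third row are assumptions made for illustration; only the class names zeit, system and fehlertext come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two rows from the question plus one row without the expected classes
SAMPLE = """
<table>
<tr>
    <td class="zeit"><div>03.12. 10:45:00</div></td>
    <td class="system"><div><a target="_blank" href="detail.php?host=CG&factor=2&delay=1&Y=15">CG</div></a></td>
    <td class="fehlertext"><div>System steht nicht zur Verfügung!</div></td>
</tr>
<tr>
    <td class="zeit"><div>03.12. 06:23:06</div></td>
    <td class="system"><div><a target="_blank" href="detail.php?host=FRAUD&factor=2&delay=1&Y=15">Boni</div></a></td>
    <td class="fehlertext"><div>ID Fehler</div></td>
</tr>
<tr><td>row without the expected classes</td></tr>
</table>
"""


def parse_rows(html):
    """Return a list of (zeit, system, fehlertext) tuples, skipping incomplete rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Look up the three cells by class; a missing cell comes back as None
        cells = [tr.find("td", class_=name) for name in ("zeit", "system", "fehlertext")]
        if any(cell is None for cell in cells):
            continue  # skip rows that do not have all three cells
        rows.append(tuple(cell.get_text(strip=True) for cell in cells))
    return rows


rows = parse_rows(SAMPLE)
for row in rows:
    print(" ".join(row))
```

For the file from the question, passing the result of myfile.read() to parse_rows inside the with open(...) block should work the same way. Returning tuples instead of printing directly keeps the parsing separate from the output, so the rows can later be written to a file or filtered.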
