I am trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the cells in the first column use a <th> tag, and the other column of interest uses a <td> tag. So far, all I've been able to get out is a list of the column's cells with the tags still attached, but I just want the text. table is already a list, so I can't call findAll(text=True) on it. I'm not sure how to get the contents of the first column in another form.
from BeautifulSoup import BeautifulSoup
from sys import argv
import re
filename = argv[1]
html_doc = open(filename, 'r').read() #get HTML file as a string
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.findAll('th') #The relevant table is the first one
print table
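One possible approach, sketched under a few assumptions: since findAll returns a list, the text has to be pulled from each element of that list rather than from the list itself. The sketch below walks the table row by row and reads the text of the first and third cells. It assumes the newer bs4 package (BeautifulSoup 4) instead of the BeautifulSoup 3 import used above, and both SAMPLE_HTML and the helper name first_and_third_columns are hypothetical stand-ins, since the real page is not shown.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the BeautifulSoup 3 import above

# Hypothetical stand-in for the real page: first column in <th>, the rest in <td>.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>alpha</th><td>1</td><td>one</td></tr>
    <tr><th>beta</th><td>2</td><td>two</td></tr>
  </tbody>
</table>
"""


def first_and_third_columns(html_doc):
    """Return (first, third) text pairs, one per row of the first table."""
    soup = BeautifulSoup(html_doc, "html.parser")
    table = soup.find_all("table")[0]  # the relevant table is the first one
    pairs = []
    for row in table.find_all("tr"):
        # Collect every cell of the row, <th> or <td>, in document order.
        cells = row.find_all(["th", "td"])
        if len(cells) < 3:
            continue  # skip rows too short to have a third column
        first = cells[0].get_text(strip=True)
        third = cells[2].get_text(strip=True)
        pairs.append((first, third))
    return pairs


if __name__ == "__main__":
    for first, third in first_and_third_columns(SAMPLE_HTML):
        print(first, third)
```

With BeautifulSoup 3 the same idea should carry over by looping over the list and joining the strings of each cell, for example ''.join(cell.findAll(text=True)) for each cell in table.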