I want to make search engine and I follow tutorial in some web.
I want to test parse html
from bs4 import BeautifulSoup
def parse_html(filename):
"""Extract the Author, Title and Text from a HTML file
which was produced by pdftotext with the option -htmlmeta."""
with open(filename) as infile:
html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
d = {'text': html.pre.text}
if html.title is not None:
d['title'] = html.title.text
for meta in html.findAll('meta'):
try:
if meta['name'] in ('Author', 'Title'):
d[meta['name'].lower()] = meta['content']
except KeyError:
continue
return d
parse_html("C:\pdf\pydf\data\muellner2011.html")
and it getting error
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>enter code here
I saw some solutions on the Web using the encode(). But I don't know how to insert encode() function in code. Can anyone help me?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…