I am fetching data from a web page using urllib2. The content of all the pages is in English, so there is no issue of dealing with non-English text. The pages are encoded, however, and they sometimes contain HTML entities such as £ or the copyright symbol, etc.
I want to check whether portions of a page contain certain keywords; however, I want to do a case-insensitive check (for obvious reasons).
What is the best way to convert the returned page content into all lower case letters?
    import urllib2

    def get_page_content_as_lower_case(url):
        request = urllib2.Request(url)
        page = urllib2.urlopen(request)
        temp = page.read()
        return str(temp).lower()  # this doesn't work because the page contains UTF-8 data
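One approach I am considering (a rough sketch only; get_page_text_lower is a hypothetical helper, and it assumes the server declares a charset in the Content-Type header, falling back to UTF-8 when it doesn't):

    import urllib2

    def get_page_text_lower(url):
        # Decode the raw bytes using the charset the server declares,
        # falling back to UTF-8, then lower-case the unicode result.
        response = urllib2.urlopen(url)
        charset = response.info().getparam('charset') or 'utf-8'
        return response.read().decode(charset).lower()

This still breaks when the server omits or misreports the charset, which is part of what I am asking about.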
Update:
I don't have to use urllib2 to get the data; in fact, I may use BeautifulSoup instead, since I need to retrieve data from specific elements on the page, a task for which BS is a much better choice. I have changed the title to reflect this.
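To illustrate, the kind of thing I have in mind looks roughly like this (element_text_lower and the id-based lookup are placeholders for my real selection logic; BS detects the page's encoding and returns unicode, so lower-casing is safe):

    import urllib2
    from bs4 import BeautifulSoup

    def element_text_lower(url, element_id):
        # BeautifulSoup sniffs the document's encoding (UTF-8,
        # ISO-8859-1, ...) and hands back unicode text.
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        node = soup.find(id=element_id)
        return node.get_text().lower() if node else u''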
HOWEVER, the problem remains that the fetched data is in some non-ASCII encoding (supposedly UTF-8). I did check one of the pages, and its encoding was actually ISO-8859-1.
Since I am only concerned with the English language, I want to know how I can obtain a lower-case ASCII string version of the data retrieved from the page, so that I can carry out a case-insensitive test as to whether a keyword is found in the text.
I am assuming that the fact that I have restricted myself to English (from English-speaking websites) reduces the possible encodings. I don't know much about encodings, but I assume the valid choices are limited to those already mentioned: ASCII, ISO-8859-1, and UTF-8.
Is that a valid assumption, and if so, is there a way to write a 'robust' function that accepts an encoded string containing English text and returns a lower-case ASCII version of it?
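Something along these lines is what I am imagining (a minimal sketch, assuming the data is either UTF-8 or ISO-8859-1; to_lower_ascii is a hypothetical name):

    import unicodedata

    def to_lower_ascii(raw):
        # Try UTF-8 first, since it fails loudly on bytes that are not
        # valid UTF-8; ISO-8859-1 accepts any byte string, so it works
        # as a fallback.
        try:
            text = raw.decode('utf-8')
        except UnicodeDecodeError:
            text = raw.decode('iso-8859-1')
        # NFKD normalisation splits accented letters into base letter
        # plus combining mark; encoding to ASCII with 'ignore' then
        # drops whatever has no ASCII equivalent (e.g. the pound sign).
        text = unicodedata.normalize('NFKD', text)
        return text.encode('ascii', 'ignore').lower()

Is something like this reasonable, or is there a better-established way?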