Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Returning a lower case ASCII string from a (possibly encoded) string fetched using urllib2 or BeautifulSoup

I am fetching data from a web page using urllib2. The content of all the pages is in English, so there is no issue of dealing with non-English text. The pages are encoded, however, and they sometimes contain HTML entities such as £ or the copyright symbol.

I want to check if portions of a page contain certain keywords - however, I want to do a case insensitive check (for obvious reasons).

What is the best way to convert the returned page content into all lower case letters?

import urllib2

def get_page_content_as_lower_case(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    temp = page.read()

    return str(temp).lower() # this doesn't work because the page contains utf-8 data

[[Update]]

I don't have to use urllib2 to get the data, in fact I may use BeautifulSoup instead, since I need to retrieve data from a specific element(s) in the page - for which BS is a much better choice. I have changed the title to reflect this.

HOWEVER, the problem still remains that the fetched data is in some non-ASCII encoding, supposedly UTF-8. I did check one of the pages and its encoding was iso-8859-1.

Since I am only concerned with the English language, I want to know how I can obtain a lower case ASCII string version of the data retrieved from the page - so that I can carry out a case insensitive test as to whether a keyword is found in the text.

I am assuming that the fact that I have restricted myself to only English (from English speaking websites) reduces the choices of encoding. I don't know much about encoding, but I am assuming that the valid choices are:

  • ASCII
  • iso-8859-1
  • utf-8

Is that a valid assumption, and if yes, perhaps there is a way to write a 'robust' function that accepts an encoded string containing English text and returns a lower case ASCII string version of it?



1 Reply


Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect to match both STRASSE and Straße with the search term Straße, but 'STRASSE'.lower() == 'strasse' (and you can't simply replace a double s with ß - there's no ß in Trasse). Other languages (in particular Turkish) will have similar complications as well.

If you're looking to support languages other than English, you should therefore use a library that can handle proper casefolding (such as Matthew Barnett's regex module).
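To illustrate the difference between lower-casing and case folding, here is a minimal sketch. It assumes Python 3.3 or later, where the built-in str.casefold() is available (the code in this thread targets Python 2, which does not have it). The helper name contains_keyword is hypothetical, and the sketch does not handle locale-specific rules such as the Turkish dotless i.

```python
def contains_keyword(text, keyword):
    """Return True if keyword occurs in text, ignoring case (hypothetical helper)."""
    # casefold() is a more aggressive lower(): for example, it maps 'ß' to 'ss'.
    return keyword.casefold() in text.casefold()


# lower() alone misses the match, casefold() finds it.
print('straße' in 'STRASSE'.lower())            # False
print(contains_keyword('STRASSE', 'Straße'))    # True
print(contains_keyword('Trasse', 'Straße'))     # False
```

For English-only pages, lower() and casefold() give the same result on plain ASCII letters, so the distinction only matters once non-English text shows up.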

That being said, the way to extract the page's content is:

import contextlib
import urllib2

def get_page_content(url):
  # This assumes the page is UTF-8; pages in another encoding may raise UnicodeDecodeError
  with contextlib.closing(urllib2.urlopen(url)) as uh:
    content = uh.read().decode('utf-8')
  return content
  # You can call .lower() on the result, but that won't work in general
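
For the 'robust' function the question asks about, one possible sketch follows. It assumes Python 3.4 or later (for html.unescape), works on bytes that have already been fetched, and tries only the encodings from the question's list. The names decode_page and to_lower_ascii and the encodings parameter are hypothetical. ASCII is a subset of UTF-8, so trying UTF-8 first covers both; ISO-8859-1 accepts any byte sequence, so it acts as a last resort that never fails, although it may mis-decode pages that use some other encoding.

```python
import html
import unicodedata


def decode_page(raw, encodings=('utf-8', 'iso-8859-1')):
    """Decode raw page bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate fails; replace undecodable bytes.
    return raw.decode('ascii', errors='replace')


def to_lower_ascii(raw):
    """Return a lower-case, ASCII-only approximation of the page text."""
    text = html.unescape(decode_page(raw))   # '&pound;' -> '£', '&copy;' -> '©'
    # NFKD splits accented letters into a base letter plus a combining accent.
    text = unicodedata.normalize('NFKD', text.lower())
    # Drop everything that has no ASCII equivalent (accents, '£', '©', ...).
    return text.encode('ascii', 'ignore').decode('ascii')


print(to_lower_ascii(b'Caf\xc3\xa9 MENU'))   # UTF-8 input
print(to_lower_ascii(b'Caf\xe9 MENU'))       # ISO-8859-1 input
```

Note that symbols such as £ are silently dropped by the ASCII step, so a keyword containing them would never match; if that matters, stop after the lower-casing step and compare Unicode strings instead.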
