python - Returning a lower case ASCII string from a (possibly encoded) string fetched using urllib2 or BeautifulSoup

Question


posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am fetching data from a web page using urllib2. The content of all the pages is in English, so there is no issue of dealing with non-English text. The pages are encoded, however, and they sometimes contain HTML entities such as £ or the copyright symbol.

I want to check whether portions of a page contain certain keywords. However, I want to do a case-insensitive check (for obvious reasons).

What is the best way to convert the returned page content into all lower case letters?

import urllib2

def get_page_content_as_lower_case(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    temp = page.read()

    return str(temp).lower()  # this doesn't work because the page contains UTF-8 data

[[Update]]

I don't have to use urllib2 to get the data, in fact I may use BeautifulSoup instead, since I need to retrieve data from a specific element(s) in the page - for which BS is a much better choice. I have changed the title to reflect this.

HOWEVER, the problem still remains that the fetched data is in some non-ASCII encoding, supposedly UTF-8. I did check one of the pages, though, and its encoding was ISO-8859-1.

Since I am only concerned with the English language, I want to know how I can obtain a lower case ASCII string version of the data retrieved from the page, so that I can carry out a case-insensitive test as to whether a keyword is found in the text.

I am assuming that restricting myself to English (from English-speaking websites) reduces the choices of encoding. I don't know much about encodings, but I am assuming that the valid choices are:

ASCII
iso-8859-1
utf-8

Is that a valid assumption, and if yes, perhaps there is a way to write a 'robust' function that accepts an encoded string containing English text and returns a lower case ASCII string version of it?


1 Reply

深蓝 · Answer 1 · 2021-10-17T03:08:52+0000

Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect to match both STRASSE as well as Straße with the search term Straße, but 'STRASSE'.lower() == 'strasse' (and you can't simply replace a double s with ß - there's no ß in Trasse). Other languages (in particular Turkish) will have similar complications as well.

If you're looking to support languages other than English, you should therefore use a library that can handle proper case folding (such as Matthew Barnett's regex module).
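As a rough illustration of the Straße problem, here is a minimal sketch that assumes Python 3, where the standard library offers str.casefold() for full Unicode case folding. The helper name contains_caseless is made up for this example, and casefold() is locale-independent, so it does not address the Turkish dotted/dotless i.

```python
# Python 3 sketch: compare .lower() with .casefold() on the German example.
words = ["STRASSE", "Straße", "strasse"]

# .lower() keeps the sharp s, so the three spellings do not collapse to one form.
lowered = {w.lower() for w in words}

# .casefold() maps the sharp s to "ss", so all three spellings become identical.
folded = {w.casefold() for w in words}


def contains_caseless(haystack, needle):
    """Hypothetical helper: substring test after case folding both sides."""
    return needle.casefold() in haystack.casefold()
```

With this, contains_caseless("Die HAUPTSTRASSE ist lang", "Straße") finds the keyword, whereas a comparison based on .lower() would not.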

That being said, the way to extract the page's content is:

import contextlib
import urllib2

def get_page_content(url):
    with contextlib.closing(urllib2.urlopen(url)) as uh:
        # Assumes the page is UTF-8; use the page's actual charset if it differs
        content = uh.read().decode('utf-8')
    # You can call .lower() on the result, but that won't work in general
    return content
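For the English-only case described in the question, one possible approach is sketched below. It assumes Python 3 and that the raw bytes have already been fetched. The function name to_lower_ascii and the choice to try UTF-8 first and fall back to ISO-8859-1 are assumptions made for this sketch, not something the original answer prescribes. Characters with no ASCII equivalent (such as £ or ©) are simply dropped.

```python
import html
import unicodedata


def to_lower_ascii(raw, encodings=("utf-8", "iso-8859-1")):
    """Hypothetical helper: bytes in, lower-case ASCII str out."""
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        # Only reached if every candidate fails; ISO-8859-1 accepts any
        # byte sequence, so this needs a custom `encodings` tuple to trigger.
        text = raw.decode("ascii", errors="ignore")

    # Turn entities such as &copy; or &pound; into real characters.
    text = html.unescape(text)
    # Split accented letters into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    # Drop everything that is not ASCII, then lower-case what is left.
    return text.encode("ascii", "ignore").decode("ascii").lower()
```

A keyword check then becomes something like "cafe" in to_lower_ascii(page_bytes). Trying UTF-8 first matters here: ISO-8859-1 never raises a decode error, so placing it first would silently misread UTF-8 pages.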
