Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

encoding - Python 3.4.0 -- 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128) -- Unix 14.04

While trying to retrieve some data from the web using urllib and lxml, I got an error and have no idea how to fix it.

import urllib.request

url = 'http://sum.in.ua/?swrd=автор'
page = urllib.request.urlopen(url)

The error itself:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128)

This time the URL contains Ukrainian letters. But when I request a URL without any Ukrainian letters in it, like here:

import urllib.request
import lxml.html

url = "http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=8&Itemid=9"
page = urllib.request.urlopen(url)
pageWritten = page.read()  # raw bytes of the response body
pageReady = pageWritten.decode('utf-8')  # decode the bytes to str
xmldata = lxml.html.document_fromstring(pageReady)
text1 = xmldata.xpath('//p[@class="MsoNormal"]//text()')

it gets me the data in Ukrainian and everything works just fine.

1 Reply

URLs can only use a subset of printable ASCII codepoints; everything else must be properly encoded using URL percent encoding.
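To see where the error comes from, here is a minimal sketch. It assumes that the HTTP client builds the request line as text and encodes it as ASCII before sending it. The request_line string below only illustrates that step and is not code taken from the library. The reported position 11-15 is consistent with this assumption, since автор sits at exactly those character positions in the line.

```python
# Assumption: the request line has the form "<method> <path> HTTP/1.1"
# and is encoded as ASCII before it goes over the wire.
request_line = 'GET /?swrd=автор HTTP/1.1'

error = None
try:
    request_line.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc
    # Same message as in the question, including "position 11-15".
    print(exc)
```

So every non-ASCII character has to be percent-encoded before the URL is handed to urlopen().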

You can best achieve that by letting Python handle your parameters. The urllib.parse.urlencode() function can convert a dictionary (or a sequence of key-value pairs) for use in URLs:

import urllib.request
from urllib.parse import urlencode

url = 'http://sum.in.ua/'
parameters = {'swrd': 'автор'}
url = '{}?{}'.format(url, urlencode(parameters))

page = urllib.request.urlopen(url)

This will first encode the parameters to UTF-8 bytes, then convert those bytes to percent-encoding sequences:

>>> from urllib.parse import urlencode
>>> parameters = {'swrd': 'автор'}
>>> urlencode(parameters)
'swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
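The two steps can also be reproduced by hand, which may help when debugging. A small sketch; the variable names are only illustrative, and the manual formatting works here because every byte of the word is non-ASCII and therefore has to be escaped:

```python
from urllib.parse import quote_plus, urlencode

word = 'автор'

# Step 1: encode the text to UTF-8 bytes (two bytes per Cyrillic letter here).
utf8_bytes = word.encode('utf-8')

# Step 2: write each byte as a %XX percent-encoding sequence.
manual = ''.join('%{:02X}'.format(byte) for byte in utf8_bytes)

print(manual)                     # %D0%B0%D0%B2%D1%82%D0%BE%D1%80
print(manual == quote_plus(word)) # True
print(urlencode({'swrd': word}) == 'swrd=' + manual)  # True
```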

If you did not construct this URL yourself, you'll need to 'repair' the encoding. You can split off the query string with urllib.parse.urlparse(), parse it into a dictionary with urllib.parse.parse_qs(), then pass that dictionary to urlencode() and put the result back into the URL:

from urllib.parse import urlparse, parse_qs, urlencode

url = 'http://sum.in.ua/?swrd=автор'
parsed_url = urlparse(url)
parameters = parse_qs(parsed_url.query)
url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()

This splits the URL into its constituent parts, parses out the query string, re-encodes it, and rebuilds the URL afterwards:

>>> from urllib.parse import urlparse, parse_qs, urlencode
>>> url = 'http://sum.in.ua/?swrd=автор'
>>> parsed_url = urlparse(url)
>>> parameters = parse_qs(parsed_url.query)
>>> parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
'http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
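If you need this in more than one place, the same idea can be wrapped in a helper. The sketch below is one possible variant, not the only way to do it: repair_url is a hypothetical name, it uses urlsplit() and parse_qsl() instead of urlparse() and parse_qs(), it passes keep_blank_values=True so that empty parameters are not dropped (by default they are), and it also percent-encodes non-ASCII characters in the path. Non-ASCII host names are not handled here.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def repair_url(url):
    """Return url with non-ASCII path and query characters percent-encoded.

    Hypothetical helper for illustration; host names are left untouched.
    """
    parts = urlsplit(url)
    # Decode the query into (key, value) pairs, keeping empty values.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Re-encode the pairs: UTF-8 bytes first, then %XX sequences.
    query = urlencode(pairs)
    # Keep '/' and '%' as-is so existing escapes are not encoded twice.
    path = quote(parts.path, safe='/%')
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


fixed = repair_url('http://sum.in.ua/?swrd=автор')
print(fixed)  # http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80
```

Running the helper on a URL that is already encoded returns it unchanged, because the query is decoded before it is encoded again.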
