Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - urllib.urlencode doesn't like unicode values: how about this workaround?

If I have an object like:

d = {'a':1, 'en': 'hello'}

...then I can pass it to urllib.urlencode, no problem:

from urllib import urlencode
percent_escaped = urlencode(d)
print percent_escaped

But if I try to pass an object with a value of type unicode, game over:

d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(d2)
print percent_escaped # This fails with a UnicodeEncodeError

So my question is about a reliable way to prepare an object to be passed to urlencode.

I came up with this function where I simply iterate through the object and encode values of type string or unicode:

def encode_object(object):
  for k,v in object.items():
    if type(v) in (str, unicode):
      object[k] = v.encode('utf-8')
  return object

This seems to work:

d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(encode_object(d2))
print percent_escaped

And that outputs a=1&en=hello&pt=ol%C3%A1 (the key order may vary), ready for passing to a POST call or whatever.

But my encode_object function just looks really shaky to me. For one thing, it doesn't handle nested objects.

For another, I'm nervous about that if statement. Are there any other types that I should be taking into account?

And is comparing the type() of something to the native object like this good practice?

type(v) in (str, unicode) # not so sure about this...

Thanks!


1 Reply


You should indeed be nervous. The whole idea that you might have a mixture of bytes and text in some data structure is horrifying. It violates the fundamental principle of working with string data: decode at input time, work exclusively in unicode, encode at output time.
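
As a rough illustration of that principle, here is a minimal sketch written to run on both Python 2.7 and Python 3. The function name build_query, the latin-1 input encoding, and the strip/lower processing step are assumptions made up for this example; none of them come from the question.

```python
# Sketch of "decode at input, work in unicode, encode at output".
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3


def build_query(raw_value, input_encoding='latin-1'):
    # raw_value: bytes from some external source; the latin-1 default
    # is an assumption for this example only.
    text = raw_value.decode(input_encoding)  # 1. decode at input time
    text = text.strip().lower()              # 2. work exclusively on text
    encoded = text.encode('utf-8')           # 3. encode at output time
    return urlencode({'pt': encoded})


query = build_query(b' OL\xc1 ')
print(query)  # pt=ol%C3%A1
```

The point is that only one place decodes and only one place encodes, so the processing in between never has to wonder which kind of string it is holding.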

Update in response to comment:

You are about to output some sort of HTTP request. This needs to be prepared as a byte string. The fact that urllib.urlencode is not capable of properly preparing that byte string if there are unicode characters with ordinal >= 128 in your dict is indeed unfortunate. If you have a mixture of byte strings and unicode strings in your dict, you need to be careful. Let's examine just what urlencode() does:

>>> import urllib
>>> tests = ['\x80', '\xe2\x82\xac', 1, '1', u'1', u'\x80', u'\u20ac']
>>> for test in tests:
...     print repr(test), repr(urllib.urlencode({'a':test}))
...
'\x80' 'a=%80'
'\xe2\x82\xac' 'a=%E2%82%AC'
1 'a=1'
'1' 'a=1'
u'1' 'a=1'
u'\x80'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\python27\lib\urllib.py", line 1282, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)

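The last two tests demonstrate the problem with urlencode(): it calls str() on a unicode value, which fails for any character outside ASCII. (The loop stops at the first exception, so the final test is never printed, but it fails the same way.) Now let's look at the str tests.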

If you insist on having a mixture, then you should at the very least ensure that the str objects are encoded in UTF-8.

'\x80' is suspicious -- it is not the result of any_valid_unicode_string.encode('utf8').
'\xe2\x82\xac' is OK; it's the result of u'\u20ac'.encode('utf8').
'1' is OK -- all ASCII characters are OK on input to urlencode(), which will percent-encode characters such as '%' where necessary.
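
Those three verdicts can be checked mechanically. The helper below is a minimal sketch that should run unchanged on Python 2.7 and Python 3; its name, is_valid_utf8, is made up for this example.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b'\x80'))          # False: a lone continuation byte
print(is_valid_utf8(b'\xe2\x82\xac'))  # True: the UTF-8 form of U+20AC
print(is_valid_utf8(b'1'))             # True: plain ASCII
```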

Here's a suggested converter function. Unlike yours, it does not mutate the input dict and then return it; it builds and returns a new dict. It raises an exception if a value is a str object that is not valid UTF-8. By the way, your concern about it not handling nested objects is a little misdirected: your code works only with dicts, and urlencode() expects a flat mapping (or a sequence of two-element tuples), so the concept of nested dicts doesn't really fly.

def encoded_dict(in_dict):
    out_dict = {}
    for k, v in in_dict.iteritems():
        if isinstance(v, unicode):
            v = v.encode('utf8')
        elif isinstance(v, str):
            # Must already be valid UTF-8; decode() raises UnicodeDecodeError if not
            v.decode('utf8')
        out_dict[k] = v
    return out_dict

and here's the output, using the same tests in reverse order (because the nasty one is at the front this time):

>>> for test in tests[::-1]:
...     print repr(test), repr(urllib.urlencode(encoded_dict({'a':test})))
...
u'\u20ac' 'a=%E2%82%AC'
u'\x80' 'a=%C2%80'
u'1' 'a=1'
'1' 'a=1'
1 'a=1'
'\xe2\x82\xac' 'a=%E2%82%AC'
'\x80'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 8, in encoded_dict
  File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
>>>
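
On the nesting question: the closest thing urlencode() offers is its doseq argument, which expands a list or tuple value into repeated key=value pairs. The sketch below extends the same conversion idea to such values and is written to run on Python 2.7 and Python 3. The names encode_value and encoded_dict_multi are hypothetical, and handling only list and tuple values is an assumption of this example.

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

text_type = type(u'')  # unicode on Python 2, str on Python 3


def encode_value(v):
    # Text becomes UTF-8 bytes; byte strings must already be valid UTF-8.
    if isinstance(v, text_type):
        return v.encode('utf-8')
    if isinstance(v, bytes):
        v.decode('utf-8')  # raises UnicodeDecodeError if not UTF-8
    return v


def encoded_dict_multi(in_dict):
    # Returns a new dict; list and tuple values are converted item by item.
    out_dict = {}
    for k, v in in_dict.items():
        if isinstance(v, (list, tuple)):
            out_dict[k] = [encode_value(item) for item in v]
        else:
            out_dict[k] = encode_value(v)
    return out_dict


params = {'pt': [u'ol\xe1', u'\u20ac']}
query = urlencode(encoded_dict_multi(params), doseq=True)
print(query)  # pt=ol%C3%A1&pt=%E2%82%AC
```

Anything deeper than one level of sequences has no standard representation in a query string, so it would need a convention agreed with the receiving server.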

Does that help?

