
What I Thought I Knew About Unicode in Python Amounted To Nothing

After reading through the many helpful tips on unicode in Python, I gathered this much: to represent languages other than English you need to work in unicode, and the most popular encoding is "UTF-8." I found enough examples to get some tests passing with non-English characters, but I still didn't fully get it. Specifically, I kept seeing the infamous error...

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 2: ordinal not in range(128)

...and I wasn't really sure why. I had to back up, way up, and once it finally clicked I wrote up this explanation of what confused me the most. Disclaimer: this could all be very wrong! If so, please offer corrections. UPDATED on 6/15/2007 to fix some inaccuracies pointed out by Leandro Lameiro. Thanks!

First and foremost, simply placing sys.setdefaultencoding('utf-8') into your sitecustomize.py file is a bad idea. If you do so, your code will seamlessly encode unicode as utf-8 on your machine, but it will break on any machine still using Python's stock default encoding, which is "ascii." [sigh]
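If you're not sure what your interpreter falls back to, you can ask it directly; a stock install reports ascii, which is exactly what makes the examples below blow up:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'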

Encode what? Huh? This is where I got confused. The Most Important Thing To Know: the str object in Python stores its value as bytes, i.e. 8-bit binary sequences, a.k.a. byte strings. A single byte can only hold a number from 0-255 (and ascii itself only defines 0-127), which is nowhere near enough to represent Russian, Arabic, Japanese, etc. The unicode object in Python stores a sequence of code points (internally as 16-bit or 32-bit units, depending on how Python was built) and thus can represent just about any character in any language.
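To make that distinction concrete, here is a tiny interactive sketch (the variable names and the euro example are mine; the euro sign is just a convenient non-English character):

>>> s = 'abc'      # a str: a sequence of bytes
>>> [ord(c) for c in s]
[97, 98, 99]
>>> u = u'\u20ac'  # a unicode object: a sequence of code points
>>> ord(u)
8364

A byte tops out at 255, and the euro sign's code point is 8364, so it simply can't fit into one byte; that's why an encoding like UTF-8 has to spread it across several bytes.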

Hurray! I'll just use unicode for everything, right? No, you can't. Specifically, you can't write unicode to a file (or most streams, like a terminal) because the file wants an 8-bit string. Since Python is smart it will try to encode it for you, using the default encoding, and that is exactly why I kept seeing UnicodeEncodeError: 'ascii' codec can't encode character...etc. I thought: but unicode is better, can't it just stay that way?

Here is exactly what I'm talking about:

>>> price_info = u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>
>>> f = open('priceinfo.txt','wb')
>>> f.write(price_info)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 9: ordinal not in range(128)
>>> 

What happened? When trying to write the text "it costs € 5" to a file, Python tried to encode the unicode data into an 8-bit string using the default encoding, which is ascii. I was bright enough to know that ascii does not contain the euro sign. You with me? The solution:

>>> price_info_enc = price_info.encode('utf-8') # <- encodes the string as UTF-8
>>> price_info_enc
'it costs \xe2\x82\xac 5'
>>> type(price_info_enc)
<type 'str'>
>>> f.write(price_info_enc)
>>> f.close()

It's just as simple as that. Notice how it takes 3 bytes to represent the euro sign in a str object when it only takes one code point, \u20ac, in unicode.
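If you're curious, len() shows the same thing (this continues the session above with the same price_info and price_info_enc values):

>>> len(price_info)      # code points in the unicode object
12
>>> len(price_info_enc)  # bytes in the encoded str
14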

When you read the value back in you will get the same encoded string, and you'll need to decode it into unicode before you can do anything useful with it (like encode it to some other encoding for comparison, which is yet another thing Python will try to do for you automatically). This should be straightforward enough:

>>> f = open('priceinfo.txt','rb')
>>> price_info_enc = f.read()
>>> price_info = price_info_enc.decode('utf-8') # <- create a unicode object out of UTF-8 encoded bytes
>>> price_info
u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>

And just so you know, calling the str.decode() method is the same as instantiating the unicode() type:

>>> price_info = unicode(price_info_enc, 'utf-8')
>>> price_info
u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>

(And yet people still complain about this so-called "only one way" policy in Python :))

You've probably realized that it will get annoying to encode and decode all the time. From what I can gather, most modules solve this by keeping everything as unicode for as long as possible. Better yet, Python 3 will make all strings unicode, which fixes the problem at the root. There is also another solution in the codecs module: codecs.open(), which handles the encoding and decoding for you. For example:

>>> import codecs
>>> price_info = u'it costs \u20ac 5'
>>> f = codecs.open('priceinfo.txt','wb','utf-8')
>>> f.write(price_info)
>>> f.close()
>>> f = codecs.open('priceinfo.txt','rb','utf-8')
>>> f.read()
u'it costs \u20ac 5'

However, it's not always possible to work in unicode end to end, because not everything supports it. As just one example, the standard csv module can't read unicode directly, so you need a small wrapper that decodes (or encodes) the data around the csv parsing.
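As a rough illustration, here is a minimal sketch in the spirit of the recipe in the csv module docs (not the recipe itself; the name unicode_csv_rows is mine). It lets csv parse the encoded bytes and then decodes each cell on the way out:

import csv

def unicode_csv_rows(fileobj, encoding='utf-8', **kwargs):
    # The csv module only understands byte strings, so hand it the
    # raw (encoded) file and decode each cell after parsing.
    for row in csv.reader(fileobj, **kwargs):
        yield [cell.decode(encoding) for cell in row]

# usage:
# for row in unicode_csv_rows(open('prices.csv', 'rb')):
#     ...every cell is now a unicode object...

This works for UTF-8 because a multi-byte sequence can never contain an ascii comma or quote character; for encodings that aren't ascii-compatible you'd want the full recipe from the csv docs.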

Want to get into the nitty gritty details? This Unicode HOWTO by AMK pretty much covers it. I also found Python Unicode Objects by Fredrik Lundh to have some handy tricks and useful tips, especially for writing regular expressions. There are also some gotchas, like the BOM (byte order mark), which is explained in How To Use UTF-8 in Python by Evan Jones. Last but not least, as the title suggests, there is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

