What I Thought I Knew About Unicode in Python Amounted To Nothing
After reading through all the many helpful tips on unicode in Python, I gathered this much: to represent languages other than English one needs to work in unicode and the most popular encoding is "UTF-8." I found enough examples to get some tests passing using non-English characters, but I still didn't fully get it. Specifically, I would see the infamous error...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 2: ordinal not in range(128)
...and I wasn't really sure why. I had to back up, way up, until it finally clicked. Here is an explanation of what confused me the most. Disclaimer: This could all be very wrong! If so, please offer corrections. UPDATED on 6/15/2007 to fix some inaccuracies pointed out by Leandro Lameiro. Thanks!
First and foremost, simply placing sys.setdefaultencoding('utf-8') into your sitecustomize.py file is a bad idea. If you do so, your code will seamlessly encode unicode as utf-8 on your machine but not on one with the default Python encoding, which is "ascii." [sigh]
Encode what? Huh? This is where I got confused. The Most Important Thing To Know: the str object in Python stores its value as bytes, which are 8-bit binary sequences, a.k.a. strings. Each byte can represent a number from 0-255 (and ascii only defines 0-127), which is not enough if you want to represent Russian, Arabic, Japanese, etc. The unicode object in Python stores a sequence of code points, using 16 or 32 bits for each one (depending on how Python was built), and thus can represent just about any character in any language.
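To make the difference between bytes and code points concrete, here is a minimal sketch. It assumes Python 3, where the text type is called str (the old unicode) and the 8-bit type is called bytes (the old str); the variable names are only for illustration.

```python
# Python 3 sketch: str holds code points, bytes holds 8-bit values.
text = 'it costs \u20ac 5'          # text: a sequence of code points
data = text.encode('utf-8')         # bytes: the same text encoded as UTF-8

code_points = len(text)             # len() of text counts code points
byte_count = len(data)              # len() of bytes counts bytes
euro_ordinal = ord('\u20ac')        # 0x20AC, far too big for a single byte
fits_in_ascii = euro_ordinal < 128  # ascii only defines 0-127

print(code_points, byte_count, euro_ordinal, fits_in_ascii)
```

The euro sign is one code point but needs more than one byte once it is encoded, which is why the two lengths differ.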
Hurray! I'll just use unicode for everything, right? No, you can't. Specifically, you can't write unicode to a file (or most streams, like a terminal) because it wants an 8-bit string. Since Python is smart it will actually try to encode it for you, but alas this is why I was confused to see UnicodeEncodeError: 'ascii' codec can't encode character...etc. I thought: but unicode is better, can't it just stay that way?
Here is exactly what I'm talking about:
>>> price_info = u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>
>>> f = open('priceinfo.txt','wb')
>>> f.write(price_info)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 9: ordinal not in range(128)
>>>
What happened? When trying to write the text "it costs € 5" to a file, Python tried to encode the unicode data into an 8-bit string using the default encoding, which is ascii. I was bright enough to know that ascii does not contain the euro sign. You with me? The solution:
>>> price_info_enc = price_info.encode('utf-8') # <- encodes the string as UTF-8
>>> price_info_enc
'it costs \xe2\x82\xac 5'
>>> type(price_info_enc)
<type 'str'>
>>> f.write(price_info_enc)
>>> f.close()
It's just as simple as that. Notice how it takes 3 bytes (\xe2\x82\xac) to represent the Euro sign in a UTF-8 encoded str object, whereas it is a single code point, \u20ac, in a unicode object.
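How many bytes the Euro sign needs depends entirely on the encoding you pick. The following is a small sketch, assuming Python 3 and a handful of encodings chosen only for illustration, that counts the bytes for the same single code point.

```python
# Python 3 sketch: one code point, different byte counts per encoding.
euro = '\u20ac'

sizes = {}
for encoding in ('utf-8', 'utf-16-le', 'utf-32-le', 'cp1252'):
    sizes[encoding] = len(euro.encode(encoding))

# ascii has no euro sign at all, so encoding fails instead.
try:
    euro.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(sizes, ascii_ok)
```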
When you want to read the value back in you will get the same encoded string and you'll need to convert it to unicode if you want to do useful things with it (like encode it to something else for comparison, which is another thing Python will try to do automatically). This should be straightforward enough:
>>> f = open('priceinfo.txt','rb')
>>> price_info_enc = f.read()
>>> price_info = price_info_enc.decode('utf-8') # <- create a unicode object out of UTF-8 encoded bytes
>>> price_info
u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>
And just so you know, calling the str.decode() method is the same as instantiating the unicode() type:
>>> price_info = unicode(price_info_enc, 'utf-8')
>>> price_info
u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>
(And yet people still complain about this so-called "only one way" policy in Python :))
You've probably realized that it will get annoying to encode and decode all the time. From what I can gather, most modules solve this by keeping all strings as unicode for as long as possible. And better yet, Python 3 will do everything in unicode to fix the problem. There is also another solution in the codecs module: codecs.open(), which claims to handle all this encoding for you. For example:
>>> import codecs
>>> price_info = u'it costs \u20ac 5'
>>> f = codecs.open('priceinfo.txt','wb','utf-8')
>>> f.write(price_info)
>>> f.close()
>>> f = codecs.open('priceinfo.txt','rb','utf-8')
>>> f.read()
u'it costs \u20ac 5'
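For comparison, here is a sketch of the same round trip using io.open() with an explicit encoding. It assumes Python 3 (io.open is also available in Python 2.6 and later), and the temporary directory is only there so the example cleans up after itself.

```python
# Python 3 sketch: let the file object do the encoding and decoding.
import io
import os
import tempfile

price_info = 'it costs \u20ac 5'

# Work in a temporary directory so nothing is left behind.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'priceinfo.txt')

# Text mode with an encoding: text is encoded as it is written...
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(price_info)

# ...and decoded as it is read.
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()

# Binary mode shows the bytes that actually ended up on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

os.remove(path)
os.rmdir(tmp_dir)
```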
However, it's not always possible to work with unicode all the time because not everything supports it. As just one example, you'll need to create a wrapper that temporarily encodes / decodes data when reading a csv file using the standard csv module.
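As a rough sketch of the wrapper idea, the example below decodes a byte stream before the csv module sees it. It assumes Python 3, where csv works on text, so the wrapper is just io.TextIOWrapper; under Python 2 a wrapper would instead have to encode each line to UTF-8 for csv and decode each cell afterwards. The sample data is made up.

```python
# Python 3 sketch: decode a UTF-8 byte stream before handing it to csv.
import csv
import io

# Stand-in for the raw bytes of a UTF-8 encoded CSV file.
raw = 'item,price\ncoffee,\u20ac 5\n'.encode('utf-8')

byte_stream = io.BytesIO(raw)
# newline='' lets the csv module handle line endings itself.
text_stream = io.TextIOWrapper(byte_stream, encoding='utf-8', newline='')

rows = list(csv.reader(text_stream))
print(rows)
```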
Want to get into the nitty gritty details? This Unicode HOWTO by AMK pretty much covers it. I also found Python Unicode Objects by Fredrik Lundh to have some handy tricks and useful tips, especially for writing regular expressions. There are also some gotchas, like the BOM (byte order mark), which is explained in How To Use UTF-8 in Python by Evan Jones. Last but not least, as the title suggests, there is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
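To give a taste of the BOM gotcha, here is a minimal sketch, assuming Python 3 and data that starts with a UTF-8 byte order mark (as some editors write it). Decoding with plain 'utf-8' keeps the mark as a stray U+FEFF character, while the 'utf-8-sig' codec drops it.

```python
# Python 3 sketch: a UTF-8 BOM survives a plain 'utf-8' decode.
import codecs

text = 'it costs \u20ac 5'
with_bom = codecs.BOM_UTF8 + text.encode('utf-8')

plain = with_bom.decode('utf-8')          # BOM kept as a leading U+FEFF
stripped = with_bom.decode('utf-8-sig')   # BOM removed while decoding

print(repr(plain[0]), stripped == text)
```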

Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Martijn Faassen on Thursday Jun 14th, 2007 at 5:18p.m.
Some discussion on this topic from a few years ago that you might find interesting.
http://faassen.n--tree.net/blog/view/weblog/2005/08/02/0
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Leandro Lameiro on Thursday Jun 14th, 2007 at 11:59p.m.
Hi Kumar.
I would just like to point out one small inaccuracy in this excellent post.
You said "Notice how it takes 3 bytes to represent the Euro sign in a str object when it only takes one byte, \u20ac, in unicode."
Maybe you meant the Euro sign takes 1 unicode character? Or that the unicode string u"€" length is 1 (instead of 3)?
It doesn't take 1 byte to represent Euro sign in unicode. There is no such thing as "one byte in unicode".
It takes 1 unicode code point (in this case, U+20AC). That may use 2 bytes, 3 bytes, or 8 bytes; it depends on the encoding used. UTF-8 will take 3 bytes but UCS-4 will use 4 bytes. cp1252 will use 1 byte.
Python internally uses UCS-2 to represent Unicode strings (or UCS-4 depending on the compiler/libc/platform being used.)
There is an excellent post from Joel Spolsky about it: http://www.joelonsoftware.com/articles/Unicode.html
And PEP 100: http://www.python.org/dev/peps/pep-0100/
Best regards
Leandro Lameiro
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Greg Jorgensen on Friday Jun 15th, 2007 at 12:09a.m.
Nice simple explanation. I had to learn all of this recently. You've distilled hours of trial-and-error into a few easy to understand examples.
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Remco Gerlich on Friday Jun 15th, 2007 at 2:57a.m.
That Joel Spolsky article is essential.
What I try to get confused coworkers to understand is this: Unicode is not an encoding. Unicode is an abstract idea.
Since the way your programming language handles that abstract idea is implementation dependent, you need to translate to bytes to communicate with the outside world, and for that you need to pick an encoding.
When they get that it usually becomes easier to understand the rest of the details.
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Michael Schlenker on Friday Jun 15th, 2007 at 5:04a.m.
You've seen the ugly face of Python's Unicode handling, in contrast to some other languages (like Tcl, which has way better unicode integration than Python). Maybe it gets better with Py3k.
I always wondered why the Python file object has an encoding attribute, but only as read-only. That's a totally insane design, forcing you to use error-prone encode/decode all the time instead of simply setting the encoding for the file. OK, codecs.open() fixes it, but it feels a bit tacked on (like all of Python's unicode support) instead of integrated into the language.
My favorite Python glitch with unicode is exception printing: you really have fun when your exception-catching code tries to print an exception (for logging, for example) and the printing fails with an exception because you got a unicode string.
One slight comment on a comment here:
Since PEP 261, Python no longer uses UCS-2 internally, but uses UTF-16 with surrogates.
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Kumar McMillan on Friday Jun 15th, 2007 at 5:15p.m.
Leandro Lameiro, thanks for the important details, I've updated the post.