Farm Development

What I Thought I Knew About Unicode in Python Amounted To Nothing

After reading through all the many helpful tips on unicode in Python, I gathered this much: to represent languages other than English one needs to work in unicode and the most popular encoding is "UTF-8." I found enough examples to get some tests passing using non-English characters, but I still didn't fully get it. Specifically, I would see the infamous error...

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 2: ordinal not in range(128)

...and I wasn't really sure why. I had to back up, way up, until it finally clicked. Here is an explanation of what confused me the most. Disclaimer: This could all be very wrong! If so, please offer corrections. UPDATED on 6/15/2007 to fix some inaccuracies pointed out by Leandro Lameiro. Thanks!

First and foremost, simply placing sys.setdefaultencoding('utf-8') into your file is a bad idea. If you do so, your code will seamlessly encode unicode as utf-8 on your machine but not on one with the default Python encoding, which is "ascii." [sigh]

Encode what? Huh? This is where I got confused. The Most Important Thing To Know: the str object in Python stores its value as bytes, i.e. a sequence of 8-bit values. Each byte can represent a number from 0-255 (and ascii only defines 0-127), which is not enough if you want to represent Russian, Arabic, Japanese, etc. The unicode object in Python stores a sequence of code points, using 16 or 32 bits each (depending on how Python was built), and thus can represent just about any character in any language.
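To make the code point idea concrete, here is a minimal sketch. It is written in Python 3 syntax so it runs on a current interpreter; that is my assumption, not what the sessions in this post use (in Python 3 the text type is called str and the 8-bit type is called bytes). The variable names are made up for illustration.

```python
# Python 3 syntax (an assumption): str is the text type, bytes is the 8-bit type.
text = 'it costs \u20ac 5'

# One integer code point per character.
code_points = [ord(ch) for ch in text]

# Code points above 255 cannot be stored in a single byte.
too_big_for_a_byte = [hex(cp) for cp in code_points if cp > 255]

print(len(text))
print(too_big_for_a_byte)
```

Only the euro sign ends up in the second list, which is why a plain 8-bit string cannot hold this text without an encoding step.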

Hurray! I'll just use unicode for everything, right? No, you can't. Specifically, you can't write unicode to a file (or most streams, like a terminal) because it wants an 8-bit string. Since Python is smart it will actually try to encode it for you, but alas this is why I was confused to see UnicodeEncodeError: 'ascii' codec can't encode character...etc. I thought: but unicode is better, can't it just stay that way?

Here is exactly what I'm talking about:

>>> price_info = u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>
>>> f = open('priceinfo.txt','wb')
>>> f.write(price_info)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 9: ordinal not in range(128)

What happened? When trying to write the text "it costs € 5" to a file, Python tried to encode the unicode data into an 8-bit string using the default encoding, which is ascii. I was bright enough to know that ascii does not contain the euro sign. You with me? The solution:

>>> price_info_enc = price_info.encode('utf-8') # <- encodes the string as UTF-8
>>> price_info_enc
'it costs \xe2\x82\xac 5'
>>> type(price_info_enc)
<type 'str'>
>>> f.write(price_info_enc)
>>> f.close()

It's just as simple as that. Notice how it takes 3 bytes to represent the Euro sign in a str object when it only takes one code point, \u20ac, in unicode.
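The failure and the fix can also be put side by side in a small script. This is a sketch in Python 3 syntax (my assumption; the sessions above are Python 2), where the ascii encoding has to be requested explicitly instead of happening behind your back. The names failed_at and euro_utf8 are made up for illustration.

```python
price_info = 'it costs \u20ac 5'

# Encoding with a codec that lacks the euro sign fails, just like the
# implicit ascii encoding did when writing to the file.
try:
    price_info.encode('ascii')
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the first character that could not be encoded
    print(exc)

# UTF-8 can represent the euro sign; it needs three bytes to do so.
euro_utf8 = '\u20ac'.encode('utf-8')
print(len(euro_utf8))
print(list(euro_utf8))
```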

When you want to read the value back in you will get the same encoded string, and you'll need to convert it to unicode if you want to do useful things with it (like encode it to something else or compare it with other unicode text, a conversion Python will otherwise try to do automatically). This should be straightforward enough:

>>> f = open('priceinfo.txt','rb')
>>> price_info_enc = f.read()
>>> price_info = price_info_enc.decode('utf-8') # <- create a unicode object out of UTF-8 encoded bytes
>>> price_info
u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>

And just so you know, calling the str.decode() method is the same as instantiating the unicode() type:

>>> price_info = unicode(price_info_enc, 'utf-8')
>>> price_info
u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>

(And yet people still complain about this so-called "only one way" policy in Python :))
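Here is the whole round trip as one script, as a rough sketch in Python 3 syntax (my assumption; there the constructor is str(raw, 'utf-8') rather than unicode(raw, 'utf-8')). It writes to a temporary file instead of priceinfo.txt, and it also decodes with the wrong codec to show why you have to know which encoding the bytes were written in.

```python
import os
import tempfile

price_info = 'it costs \u20ac 5'

# Scratch file for the demo (a temporary path, not the post's priceinfo.txt).
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, 'wb') as f:
        f.write(price_info.encode('utf-8'))
    with open(path, 'rb') as f:
        raw = f.read()  # the same encoded bytes come back
finally:
    os.remove(path)

decoded_method = raw.decode('utf-8')  # method on the byte string
decoded_ctor = str(raw, 'utf-8')      # constructor of the text type
wrong = raw.decode('latin-1')         # succeeds, but yields different text

print(decoded_method == decoded_ctor == price_info)
print(wrong == price_info)
```

Decoding with latin-1 does not raise an error because every byte value is valid in that encoding; it just quietly produces the wrong characters.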

You've probably realized that it will get annoying to encode and decode all the time. From what I can gather, most modules solve this by ensuring all strings are unicode for as long as possible. And better yet, Python 3 will do everything in unicode to fix the problem. There is also another solution in the codecs module: codecs.open(), which claims to handle all this encoding for you. For example:

>>> import codecs
>>> price_info = u'it costs \u20ac 5'
>>> f = codecs.open('priceinfo.txt','wb','utf-8')
>>> f.write(price_info)
>>> f.close()
>>> f = codecs.open('priceinfo.txt','rb','utf-8')
>>> f.read()
u'it costs \u20ac 5'
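A closely related option, which I am swapping in here for comparison, is io.open(). It takes an encoding argument (it exists in Python 2.6 and later, and in Python 3 the built-in open is the same function). The sketch below is in Python 3 syntax and uses a temporary file; both are my assumptions for the sake of a runnable example.

```python
import io
import os
import tempfile

price_info = 'it costs \u20ac 5'

# Scratch file for the demo (a temporary path, not the post's priceinfo.txt).
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Text goes in; the file object encodes it to UTF-8 on write.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(price_info)
    # The file object decodes UTF-8 on read; text comes back out.
    with io.open(path, 'r', encoding='utf-8') as f:
        read_back = f.read()
    # Reading in binary mode shows the bytes that actually hit the disk.
    with io.open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)

print(read_back == price_info)
print(raw)
```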

However, it's not always possible to work with unicode all the time because not everything supports it. As just one example, you'll need to create a wrapper that temporarily encodes / decodes data when reading a csv file using the standard csv module.
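Here is roughly what such a wrapper can look like. Take it as a sketch under assumptions: it is written for Python 3, where the csv module works on text, so the wrapper decodes the incoming bytes before csv sees them (in Python 2 the csv module works on byte strings, so the decoding would happen on each cell after parsing instead). The function name read_utf8_csv and the sample data are made up for illustration.

```python
import csv
import io


def read_utf8_csv(byte_stream, encoding='utf-8'):
    """Yield rows of text from a binary stream holding encoded CSV data."""
    # newline='' lets the csv module handle line endings itself.
    text_stream = io.TextIOWrapper(byte_stream, encoding=encoding, newline='')
    for row in csv.reader(text_stream):
        yield row


# Stand-in for a file opened in binary mode.
raw = 'item,price\ncoffee,\u20ac 5\n'.encode('utf-8')
rows = list(read_utf8_csv(io.BytesIO(raw)))
print(rows)
```

The point of the design is that the rest of the program only ever sees decoded text; the encoding is dealt with once, at the boundary where the bytes come in.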

Want to get into the nitty gritty details? This Unicode HOWTO by AMK pretty much covers it. I also found Python Unicode Objects by Fredrik Lundh to have some handy tricks and useful tips, especially for writing regular expressions. There are also some gotchas, like the BOM (byte order mark), which is explained in How To Use UTF-8 in Python by Evan Jones. Last but not least, as the title suggests, there is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
