What I Thought I Knew About Unicode in Python Amounted To Nothing
After reading through all the many helpful tips on unicode in Python, I gathered this much: to represent languages other than English one needs to work in unicode and the most popular encoding is "UTF-8." I found enough examples to get some tests passing using non-English characters, but I still didn't fully get it. Specifically, I would see the infamous error...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 2: ordinal not in range(128)
...and I wasn't really sure why. I had to back up, way up, and now that it has finally clicked, here is an explanation of what confused me the most. Disclaimer: This could all be very wrong! If so, please offer corrections. UPDATED on 6/15/2007 to fix some inaccuracies pointed out by Leandro Lameiro. Thanks!
First and foremost, simply placing sys.setdefaultencoding('utf-8') into your sitecustomize.py file is a bad idea. If you do so, your code will seamlessly encode unicode as utf-8 on your machine but will break on any machine that still uses the default Python encoding, which is "ascii." [sigh]
Encode what? Huh? This is where I got confused. The Most Important Thing To Know: the str object in Python stores its value as bytes, i.e. as a sequence of 8-bit values, a.k.a. an 8-bit string. Each byte can represent a number from 0-255 (and ascii only defines 0-127), which is not enough if you want to represent Russian, Arabic, Japanese, etc. The unicode object in Python stores a sequence of code points, using 16 or 32 bits for each (depending on how Python was built), and thus can represent just about any character in any language.
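A quick sketch of that gap, using the euro sign as the test character. The variable names are mine, and the u'' prefix is kept so the snippet matches the other examples in this post (it should also run on Python 3.3 and later, where the prefix is accepted but optional):

```python
# The euro sign as a single code point.
euro = u'\u20ac'

# Its code point number is far beyond what one byte (0-255) can hold.
code_point = ord(euro)  # 0x20AC

# ascii only defines 0-127, so encoding the euro sign to ascii has to fail.
try:
    euro.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The explicit euro.encode('ascii') call here is the same step Python attempts implicitly when it is handed unicode where it wants bytes.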
Hurray! I'll just use unicode for everything, right? No, you can't. Specifically, you can't write unicode to a file (or most streams, like a terminal) because it wants an 8-bit string. Since Python is smart it will actually try to encode it for you using the default encoding, and that implicit step is what produced the UnicodeEncodeError: 'ascii' codec can't encode character...etc. that confused me. I thought: but unicode is better, can't it just stay that way?
Here is exactly what I'm talking about:
>>> price_info = u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>
>>> f = open('priceinfo.txt','wb')
>>> f.write(price_info)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 9: ordinal not in range(128)
>>>
What happened? When trying to write the text "it costs € 5" to a file, Python tried to encode the unicode data into an 8-bit string using the default encoding, which is ascii. I was bright enough to know that ascii does not contain the euro sign. You with me? The solution:
>>> price_info_enc = price_info.encode('utf-8') # <- encodes the string as UTF-8
>>> price_info_enc
'it costs \xe2\x82\xac 5'
>>> type(price_info_enc)
<type 'str'>
>>> f.write(price_info_enc)
>>> f.close()
It's just as simple as that. Notice how it takes three bytes (\xe2\x82\xac) to represent the Euro sign in a UTF-8 encoded str object, while in a unicode object it is a single character, \u20ac.
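To see that the byte count belongs to the encoding and not to the character, here is a small sketch that encodes the euro sign a few different ways. Apart from UTF-8, the encodings in the list are my own picks for illustration:

```python
euro = u'\u20ac'

# One code point, regardless of how it is later encoded.
n_code_points = len(euro)

# Number of bytes the same code point needs in a few encodings.
# The '-le' variants are used so that no byte order mark is prepended.
byte_counts = {}
for name in ('utf-8', 'utf-16-le', 'utf-32-le', 'cp1252'):
    byte_counts[name] = len(euro.encode(name))
```

The unicode object always holds one character here; only the encoded str changes size.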
When you want to read the value back in you will get the same encoded string, and you'll need to convert it to unicode if you want to do useful things with it (like encode it to something else for comparison, which is another thing Python will try to do automatically). This should be straightforward enough:
>>> f = open('priceinfo.txt','rb')
>>> price_info_enc = f.read()
>>> price_info = price_info_enc.decode('utf-8') # <- create a unicode object out of UTF-8 encoded bytes
>>> price_info
u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>
And just so you know, calling the str.decode() method is the same as instantiating the unicode() type:
>>> price_info = unicode(price_info_enc, 'utf-8')
>>> price_info
u'it costs \u20ac 5'
>>> type(price_info)
<type 'unicode'>
(And yet people still complain about this so-called "only one way" policy in Python :))
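Putting the write and read steps together, here is a rough round-trip sketch. It uses a temporary file instead of priceinfo.txt (my choice, so nothing is left behind), and it also decodes the same bytes with latin-1 purely to show what happens when the codec does not match the one used for writing:

```python
import os
import tempfile

price_info = u'it costs \u20ac 5'

# Hypothetical scratch file standing in for 'priceinfo.txt'.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Encode explicitly before writing bytes to the file.
    with open(path, 'wb') as f:
        f.write(price_info.encode('utf-8'))
    # Reading in binary mode returns the encoded bytes unchanged.
    with open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)

# Decoding with the codec used for writing restores the original text.
round_trip = raw.decode('utf-8')

# Decoding with a different codec does not fail here, but it yields
# one character per byte instead of the euro sign.
wrong = raw.decode('latin-1')
```

The wrong-codec case is the sneaky one: no exception is raised, the text is just silently different.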
You've probably realized that it will get annoying to encode and decode all the time. From what I can gather, most modules solve this by ensuring all strings are unicode for as long as possible. And better yet, Python 3 will do everything in unicode to fix the problem. There is also another solution in the codecs module: codecs.open(), which claims to handle all this encoding for you. For example:
>>> price_info = u'it costs \u20ac 5'
>>> import codecs
>>> f = codecs.open('priceinfo.txt','wb','utf-8')
>>> f.write(price_info)
>>> f.close()
>>> f = codecs.open('priceinfo.txt','rb','utf-8')
>>> f.read()
u'it costs \u20ac 5'
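For comparison, here is a sketch of the same idea using io.open(), which is not something covered above: as far as I know it is available from Python 2.6 on and is the same function as the built-in open() in Python 3. The temporary file is again my own stand-in for priceinfo.txt:

```python
import io
import os
import tempfile

price_info = u'it costs \u20ac 5'

# Hypothetical scratch file standing in for 'priceinfo.txt'.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Text mode with an explicit encoding: write() takes unicode, not bytes.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(price_info)
    # Reading in text mode decodes on the way in.
    with io.open(path, 'r', encoding='utf-8') as f:
        text = f.read()
    # Binary mode shows the bytes that actually landed on disk.
    with io.open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)
```

Either way, the encoding is named once when the file is opened, and the rest of the code only ever sees unicode.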
However, it's not always possible to work with unicode all the time because not everything supports it. As just one example, you'll need to create a wrapper that temporarily encodes / decodes data when reading a csv file using the standard csv module.
Want to get into the nitty gritty details? This Unicode HOWTO by AMK pretty much covers it. I also found Python Unicode Objects by Fredrik Lundh to have some handy tricks and useful tips, especially for writing regular expressions. There are also some gotchas, like the BOM (byte order mark), which is explained in How To Use UTF-8 in Python by Evan Jones. Last but not least, as the title suggests, there is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
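As a small taste of the BOM gotcha, here is a sketch that fakes a file saved with a UTF-8 byte order mark and decodes it two ways. The input bytes are built by hand as an assumption about what a BOM-writing editor produces; codecs.BOM_UTF8 and the 'utf-8-sig' codec are both in the standard library:

```python
import codecs

text = u'it costs \u20ac 5'

# Bytes as a BOM-writing editor might save them (an assumed input).
raw = codecs.BOM_UTF8 + text.encode('utf-8')

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode('utf-8')

# 'utf-8-sig' strips a leading BOM if one is present.
without_bom = raw.decode('utf-8-sig')
```

The stray U+FEFF is invisible when printed, which is why it tends to show up as a comparison that mysteriously fails.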

Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Martijn Faassen on Thursday Jun 14th, 2007 at 5:18p.m.
Some discussion on this topic from a few years ago that you might find interesting.
http://faassen.n--tree.net/blog/view/weblog/2005/08/02/0
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Leandro Lameiro on Thursday Jun 14th, 2007 at 11:59p.m.
Hi Kumar.
I would just like to point out one small inaccuracy in this excellent post.
You said "Notice how it takes 3 bytes to represent the Euro sign in a str object when it only takes one byte, \u20ac, in unicode."
Maybe you meant the Euro sign takes 1 unicode character? Or that the unicode string u"€" length is 1 (instead of 3)?
It doesn't take 1 byte to represent Euro sign in unicode. There is no such thing as "one byte in unicode".
It takes 1 unicode code point (in this case, U+20AC). That may use 2 bytes, 3 bytes, 8-bytes, it depends on the encoding used. UTF-8 will take 3 bytes but UCS-4 will use 4 bytes. cp1252 will use 1 byte.
Python internally uses UCS-2 to represent Unicode strings (or UCS-4 depending on the compiler/libc/platform being used.)
There is an excellent post from Joel Spolsky about it: http://www.joelonsoftware.com/articles/Unicode.html
And PEP 100: http://www.python.org/dev/peps/pep-0100/
Best regards
Leandro Lameiro
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Greg Jorgensen on Friday Jun 15th, 2007 at 12:09a.m.
Nice simple explanation. I had to learn all of this recently. You've distilled hours of trial-and-error into a few easy to understand examples.
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Remco Gerlich on Friday Jun 15th, 2007 at 2:57a.m.
That Joel Spolsky article is essential.
What I try to get confused coworkers to understand is this: Unicode is not an encoding. Unicode is an abstract idea.
Since the way your programming language handles that abstract idea is implementation dependent, you need to translate to bytes to communicate with the outside world, and for that you need to pick an encoding.
When they get that it usually becomes easier to understand the rest of the details.
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Michael Schlenker on Friday Jun 15th, 2007 at 5:04a.m.
You've seen the ugly face of Python's Unicode handling, in contrast to some other languages (like Tcl, which has far better Unicode integration than Python). Maybe it gets better with Py3k.
I always wondered why the Python file object has an encoding attribute, but only as read-only. That's a totally insane design, forcing you to use error-prone encode/decode all the time instead of simply setting the encoding for the file. OK, codecs.open() fixes it, but it feels a bit tacked on (like all of Python's unicode support) instead of integrated into the language.
My favorite Python glitch with unicode is exception printing: you really have fun when your exception-catching code tries to print an exception (for logging, for example) and the printing fails with an exception because you got a unicode string.
One slight comment on a comment here:
Since PEP 261, Python no longer uses UCS-2 internally, but UTF-16 with surrogates.
Re: What I Thought I Knew About Unicode in Python Amounted To Nothing
posted by Kumar McMillan on Friday Jun 15th, 2007 at 5:15p.m.
Leandro Lameiro, thanks for the important details, I've updated the post.