Farm Development

Unicode In Python, Completely Demystified (slides available)

Big thanks to everyone who attended my talk at PyCon today, Unicode In Python, Completely Demystified. It was an ambitious title; I highly doubt I "demystified" everything, but I was happy with how it went. There were a ton of great questions, including some I couldn't answer, of course. If you have any other questions, feel free to comment here and I'll do my best to answer.

For those who couldn't make the talk, or who just want to refer back to the talk and/or source code, I've posted the slides here: http://farmdev.com/talks/unicode/. The audio/video should be available soon, so I'll post again when it is.

Also, for reference, here is the primer:

This talk aims to make every single last person in the audience understand exactly how to write Unicode-aware applications in Python 2. If necessary, we will move to a Birds of a Feather gathering, to the bar, to your hotel room, I'll start hanging around your cube at work -- whatever it takes -- until you completely "get it." But it's really simple, so bring an open mind and a notepad, and get ready to create bulletproof Python software that can read and write text in Arabic, Russian, Chinese, Klingon, et cetera. As a citizen of the Python community you have the responsibility of creating Unicode-aware applications!

  • Re: Unicode In Python, Completely Demystified (slides available)

    There are some points you missed:

    1. Printing unicode strings is a complete joke in Python, and just an exercise in frustration. The official suggestion I've seen is to use the codecs module to wrap sys.stdout/sys.stderr in an encoder, but that just doesn't work.

    2. You'll probably just want to shoot yourself when you find out that the default encoding isn't reliable. The encoding Python uses when stdout/stderr is redirected isn't necessarily the same as what it uses when they're not.

    3. There are some Unicode characters that can be decomposed, which can screw up len() when you have an umlaut and the letter it's attached to taking up two characters. unicodedata.normalize() can fix this, but you have to know about it beforehand, and it's obscure as hell.

    4. Some stdlib modules do their own encoding, even though you never asked them to do that. optparse is a good example: it tries to encode your help output itself, which can throw you for a loop if you take the other official solution of "never printing unicode objects" and use UTF-8 everywhere. I was using gettext for my help messages, and I made it leave the original encoding intact, but then optparse tried to UTF-8-encode UTF-8-encoded strings, and well, that just raises that lovely encoding exception.
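
Point 3 is easy to reproduce. The following is a minimal sketch, written so that it runs on Python 2.6+ as well as Python 3.3+; the sample string is an illustrative assumption, not something taken from the talk. It builds a decomposed "ü" (the letter plus a combining diaeresis), shows that len() counts two code points, and then composes it with unicodedata.normalize().

```python
import unicodedata

# "u" followed by U+0308 COMBINING DIAERESIS: renders as one glyph,
# but is stored as two code points.
decomposed = u"u\u0308"

# NFC composes the pair into the single code point U+00FC where possible.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))         # 2
print(len(composed))           # 1
print(composed == u"\u00fc")   # True
```

NFC is only one of the available normalization forms; which one suits an application depends on its data, so treat the choice here as an example rather than a recommendation.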

    Unicode in Python is an absolute joke. It's handled terribly all around the standard library, or just not handled at all. Python itself has inconsistent behavior, and there's no quick and easy way to make printing behave in a sane manner.

    I'm sure I'm missing some other points here, but that touches on most of my frustrations with trying to build terminal applications that use Unicode with Python. Maybe Python 3.0 will finally solve this once and for all, but I'm very skeptical.
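
For reference, the codecs-based wrapping mentioned in point 1 looks roughly like the sketch below. To stay runnable anywhere it writes to an in-memory byte buffer rather than the real sys.stdout, and the helper name wrap_for_output and the UTF-8 fallback are assumptions made for illustration. It does not address the failures described in the comment; it only shows the pattern under discussion, with an explicit fallback for the case where no encoding is reported.

```python
import codecs
import io


def wrap_for_output(byte_stream, encoding=None, fallback="utf-8"):
    """Wrap a byte stream so unicode text is encoded on write.

    `encoding` would typically come from the stream itself; when it is
    missing (None), the hypothetical `fallback` is used instead.
    """
    chosen = encoding or fallback
    writer_class = codecs.getwriter(chosen)
    # errors="replace" substitutes "?" for characters the codec cannot
    # encode, instead of raising UnicodeEncodeError.
    return writer_class(byte_stream, errors="replace")


buf = io.BytesIO()                         # stands in for a redirected stdout
out = wrap_for_output(buf, encoding=None)  # no encoding reported: use fallback
out.write(u"caf\u00e9\n")
encoded = buf.getvalue()                   # UTF-8 bytes for the text above
```

In Python 2, the value passed as encoding would usually be sys.stdout.encoding, which can be None when output is piped or redirected. That is one way the inconsistency described in point 2 shows up.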

  • Re: Unicode In Python, Completely Demystified (slides available)

    The slides state that UTF-32 is not supported in Python. However Python 2.6 and 3.0 *will* support UTF-32.
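
On an interpreter that ships the codec (Python 2.6 or later, per the comment above), a quick round trip confirms it. This is a small sketch with an arbitrary sample string; the "utf-32-le" variant is used because it writes no byte order mark, which keeps the byte count predictable.

```python
text = u"A\u00e9"                       # two code points
encoded = text.encode("utf-32-le")      # little-endian, no byte order mark
decoded = encoded.decode("utf-32-le")

print(len(encoded))      # 8: four bytes per code point
print(decoded == text)   # True
```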

  • Re: Unicode In Python, Completely Demystified (slides available)

    @Brodie, thanks for the excellent points. I have been following the stdout/stderr discussions for Python 3 a *little*, and the design is still evolving. I'd highly suggest subscribing to the Python 3000 mailing list to listen in:

    http://mail.python.org/mailman/listinfo/python-3000

    They could use some help on it, especially feedback from someone like you trying to use Unicode in the terminal.

    @Walter, good news about UTF-32, I didn't know!

  • Re: Unicode In Python, Completely Demystified (slides available)

    By the way, the slideshow is very popular for explaining Unicode in the #python channel on irc.freenode.net. Only problem is, now it gets a 404. Could you re-upload it, or post a new link, or whatever? Thanks. :P

  • Re: Unicode In Python, Completely Demystified (slides available)

    I just switched servers and was reconfiguring everything in Nginx syntax -- forgot a trailing slash! Fixed now, thanks.

