Unicode In Python, Completely Demystified

Author: Kumar McMillan
Location: PyCon 2008, Chicago
URL: http://farmdev.com/talks/unicode/
Source: https://github.com/kumar303/unicode-in-python

What does this mean?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

Overview

Why use Unicode in Python?

Web application

images/text-lifecycle-web.png
[form input] => [Python] => [HTML]

Interacting with a database

images/text-lifecycle-db.png
[read from DB] => [Python] => [write to DB]

Command line script

images/text-lifecycle-script.png
[text files] => [Python] => [stdout]

Let's open a UTF-8 file

Ivan Krstić

>>> f = open('/tmp/ivan_utf8.txt', 'r')
>>> ivan_utf8 = f.read()
>>> ivan_utf8
'Ivan Krsti\xc4\x87'
  • Ivan Krstić is the director of security architecture at OLPC
  • pretend you opened this in a desktop text editor (nothing fancy like vi) and you saved it in UTF-8 format. This might not have been the default.
  • now you are opening the file in Python

What is it?

>>> ivan_utf8
'Ivan Krsti\xc4\x87'
>>> type(ivan_utf8)
<type 'str'>

Text is encoded

Ivan Krstić

'Ivan Krsti\xc4\x87'
  • This string is encoded in UTF-8 format
  • An encoding is a set of rules that assigns numeric values to each text character
  • Notice that ć (c with an acute accent) takes up 2 bytes
  • Other encodings might represent ć differently
  • Python stdlib supports over 100 encodings
  • ć is a letter of the Croatian alphabet
  • each encoding has its own byte representation of text
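
The last point can be checked directly. This is a minimal sketch in Python 3 syntax (an assumption; the slides use Python 2), where str plays the role of <type 'unicode'> and bytes plays the role of <type 'str'>. The three codecs are only illustrative picks.

```python
# Python 3 sketch: the same character becomes different bytes
# depending on which encoding is chosen.
letter = '\u0107'  # ć

encoded = {
    name: letter.encode(name)
    for name in ('utf-8', 'utf-16-le', 'iso-8859-2')
}

for name, raw in encoded.items():
    # len(raw) counts bytes, not characters
    print(name, raw, len(raw))
```

The text is one character in every case; only the byte representation changes.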

ASCII

char     I     v     a     n
hex      \x49  \x76  \x61  \x6e
decimal  73    118   97    110

ASCII

char     K     r     s     t     i     ć
hex      \x4b  \x72  \x73  \x74  \x69  nope
decimal  75    114   115   116   105   sorry
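
The ASCII values for these letters can be looked up with ord(), and the "nope" can be reproduced by asking the ascii codec to encode ć. A small sketch in Python 3 syntax (an assumption; the slides use Python 2):

```python
# Python 3 sketch: ASCII only covers code points 0 through 127.
ivan_values = [ord(ch) for ch in 'Ivan']
krsti_values = [ord(ch) for ch in 'Krsti']
print(ivan_values)
print([hex(value) for value in krsti_values])

try:
    '\u0107'.encode('ascii')  # ć has no ASCII value
except UnicodeEncodeError as err:
    # keep a reference; Python 3 clears "err" when the block ends
    ascii_error = err
    print('cannot encode:', err.reason)
```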

built-in string types

(Python 2)

<type 'basestring'>
   |
   +--<type 'str'>
   |
   +--<type 'unicode'>

Important methods

s.decode(encoding)

  • <type 'str'> to <type 'unicode'>

u.encode(encoding)

  • <type 'unicode'> to <type 'str'>
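
A round trip through both methods, sketched in Python 3 syntax (an assumption; there the types are named bytes and str instead of str and unicode):

```python
# Python 3 sketch of the two methods.
# Python 2 slides: str.decode() -> unicode, unicode.encode() -> str
# Python 3 here:   bytes.decode() -> str,   str.encode()     -> bytes
raw = b'Ivan Krsti\xc4\x87'        # encoded bytes, e.g. read from a file
text = raw.decode('utf-8')          # bytes -> text
round_trip = text.encode('utf-8')   # text -> bytes

print(len(raw), len(text))          # byte count vs. character count
print(round_trip == raw)
```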

The problem

Can't my Python text remain encoded?

Ivan Krstić

>>> ivan_utf8
'Ivan Krsti\xc4\x87'
>>> len(ivan_utf8)
12
>>> ivan_utf8[-1]
'\x87'
  • isn't encoded text good enough? No decoding errors anywhere
  • is the length of Ivan Krstić really 12?
    • what would happen if the text were encoded differently?
  • is the last character really hexadecimal 87? Is that what I wanted?

Unicode is more accurate

Ivan Krstić

>>> ivan_utf8
'Ivan Krsti\xc4\x87'
>>> ivan_uni = ivan_utf8.decode('utf-8')
>>> ivan_uni
u'Ivan Krsti\u0107'
>>> type(ivan_uni)
<type 'unicode'>

Unicode is more accurate

Ivan Krstić

>>> ivan_uni
u'Ivan Krsti\u0107'
>>> len(ivan_uni)
11
>>> ivan_uni[-1]
u'\u0107'

Unicode, what is it?

u'Ivan Krsti\u0107'

Unicode, the ideal

If ASCII, UTF-8, and other byte strings are "text" ...

...then Unicode is "text-ness";

it is the abstract form of text

Unicode is a concept

letter  Unicode Code Point
ć       \u0107

Byte Encodings

letter  UTF-8     UTF-16    Shift-JIS
ć       \xc4\x87  \x07\x01  \x85\xc9
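
The split between the abstract code point and its byte encodings can be inspected from Python. A minimal sketch in Python 3 syntax (an assumption; the slides use Python 2), using the standard unicodedata module:

```python
import unicodedata

# Python 3 sketch: one code point, several byte encodings.
letter = '\u0107'
code_point = ord(letter)             # the abstract number, independent of bytes
official_name = unicodedata.name(letter)

print(hex(code_point), official_name)
for codec in ('utf-8', 'utf-16-le'):
    # the bytes only exist once an encoding is chosen
    print(codec, letter.encode(codec))
```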

Unicode Transformation Format

>>> ab = unicode('AB')

UTF-8

>>> ab.encode('utf-8')
'AB'
  • variable byte representation
  • first 128 characters encoded just like ASCII
  • 1 byte (8 bits) to 4 bytes per code point
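
The variable width is easy to observe. A small sketch in Python 3 syntax (an assumption; the slides use Python 2), with four sample characters picked only to hit each width:

```python
# Python 3 sketch: UTF-8 uses 1 to 4 bytes per code point.
samples = ['A', '\u0107', '\u20ac', '\U0001f600']  # A, ć, €, an emoji
utf8_widths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_widths)

# The first 128 code points encode exactly as they do in ASCII.
same_as_ascii = 'AB'.encode('utf-8') == 'AB'.encode('ascii')
print(same_as_ascii)
```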

Unicode Transformation Format

>>> ab = unicode('AB')

UTF-16

>>> ab.encode('utf-16')
'\xff\xfeA\x00B\x00'
  • variable byte representation
  • 2 bytes (16 bits) to 4 bytes per code point
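
A comparable check for UTF-16, sketched in Python 3 syntax (an assumption; the slides use Python 2). The 'utf-16-le' codec is used for the widths because plain 'utf-16' prepends a byte order mark, which is where the leading '\xff\xfe' above comes from.

```python
import codecs

# Python 3 sketch: UTF-16 uses 2 bytes for most characters and
# 4 bytes (a surrogate pair) for code points above U+FFFF.
samples = ['A', '\u0107', '\U0001f600']
utf16_widths = [len(ch.encode('utf-16-le')) for ch in samples]
print(utf16_widths)

# Plain 'utf-16' adds a 2-byte BOM in the machine's native byte order.
with_bom = 'AB'.encode('utf-16')
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE), len(with_bom))
```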
  • optimized for languages residing in the 2 byte character range

Unicode Transformation Format

UTF-32

  • fixed-width byte representation, fastest to index
  • 4 bytes (32 bits) per code point
  • not supported in Python 2.5 and earlier (utf-32 codecs arrived in Python 2.6)
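
Current Python versions do ship UTF-32 codecs, so the fixed width can be checked directly. A minimal sketch in Python 3 syntax (an assumption; the slides use Python 2):

```python
# Python 3 sketch: every code point takes exactly 4 bytes in UTF-32.
samples = ['A', '\u0107', '\U0001f600']
utf32_widths = [len(ch.encode('utf-32-le')) for ch in samples]
print(utf32_widths)

# Plain 'utf-32' also prepends a 4-byte BOM.
ab_utf32 = 'AB'.encode('utf-32')
print(len(ab_utf32))
```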

Unicode chart

Ian Albert's Unicode chart

  • this guy decided to print the entire Unicode chart
  • 1,114,112 code points
  • 6 feet by 12 feet
  • 22,017 × 42,807 pixels

Unicode chart

images/unichart-printed.jpg

Ian Albert's Unicode chart. He says it cost him only $20 at Kinko's, but he was pretty sure they rang him up wrong.

Unicode chart 50 %

images/unichart-50.jpg

Unicode chart 100 %

images/unichart-100.jpg

Decoding text into Unicode

Python magic

>>> ivan_uni
u'Ivan Krsti\u0107'
>>> f = open('/tmp/ivan.txt', 'w')
>>> f.write(ivan_uni)
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0107' in position 10: ordinal not in range(128)

Python magic, revealed

>>> ivan_uni
u'Ivan Krsti\u0107'
>>> f = open('/tmp/ivan.txt', 'w')
>>> import sys
>>> f.write(ivan_uni.encode(
...         sys.getdefaultencoding()))
...
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0107' in position 10: ordinal not in range(128)
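
The exception object itself says what went wrong and where. A small sketch in Python 3 syntax (an assumption; the slides use Python 2). Python 3 no longer defaults to ASCII, so the codec is named explicitly here to reproduce the failing step.

```python
# Python 3 sketch: the hidden encode step, done by hand.
ivan_uni = 'Ivan Krsti\u0107'

try:
    ivan_uni.encode('ascii')
except UnicodeEncodeError as err:
    # codec name, index of the offending character, and the character itself
    failure = (err.encoding, err.start, err.object[err.start])
    print(failure)
```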

Gasp!

THE DEFAULT ENCODING FOR PYTHON 2 IS ASCII

Just reset it?!

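sys.setdefaultencoding('utf-8')

  • not a real fix: site.py deletes sys.setdefaultencoding at startup, so ordinary code cannot call it
  • the implicit conversions are still there; decode and encode explicitly instead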

Solution

  1. Decode early
  2. Unicode everywhere
  3. Encode late
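
The three steps can be sketched as one small function. This is Python 3 syntax (an assumption; the slides use Python 2), and shout is a hypothetical helper invented for illustration, not something from the talk.

```python
def shout(raw_input_bytes, encoding='utf-8'):
    """Hypothetical helper: upper-case some encoded text."""
    text = raw_input_bytes.decode(encoding)   # 1. decode early
    text = text.upper()                       # 2. work on Unicode text only
    return text.encode(encoding)              # 3. encode late


ivan_utf8 = b'Ivan Krsti\xc4\x87'
shouted = shout(ivan_utf8)
print(shouted)

# Skipping the decode step only touches the ASCII letters:
# the two bytes of ć are left as they were.
naive = ivan_utf8.upper()
print(naive)
```

Working on the decoded text lets upper() see ć as one character; working on the raw bytes does not.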

1. Decode early

Decode to <type 'unicode'> ASAP

>>> def to_unicode_or_bust(
...         obj, encoding='utf-8'):
...     if isinstance(obj, basestring):
...         if not isinstance(obj, unicode):
...             obj = unicode(obj, encoding)
...     return obj
...
>>>

Detects whether the object is a string and, if it is not already unicode, decodes it to unicode.
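
For comparison, a rough Python 3 counterpart. This is an assumption-laden sketch, not part of the talk: to_text_or_bust is a hypothetical name, and since Python 3 has no basestring or unicode types the check is against bytes instead.

```python
def to_text_or_bust(obj, encoding='utf-8'):
    """Hypothetical Python 3 counterpart of to_unicode_or_bust."""
    if isinstance(obj, bytes):
        # only encoded byte strings need decoding; str is already text
        obj = obj.decode(encoding)
    return obj


print(to_text_or_bust(b'Ivan Krsti\xc4\x87'))
print(to_text_or_bust('Ivan Krsti\u0107'))
print(to_text_or_bust(1234))
```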

2. Unicode everywhere

>>> to_unicode_or_bust(ivan_uni)
u'Ivan Krsti\u0107'
>>> to_unicode_or_bust(ivan_utf8)
u'Ivan Krsti\u0107'
>>> to_unicode_or_bust(1234)
1234

3. Encode late

Encode to <type 'str'> when you write to disk or print

>>> f = open('/tmp/ivan_out.txt','wb')
>>> f.write(ivan_uni.encode('utf-8'))
>>> f.close()

Shortcuts

codecs.open()

>>> import codecs
>>> f = codecs.open('/tmp/ivan_utf8.txt', 'r',
...                 encoding='utf-8')
...
>>> f.read()
u'Ivan Krsti\u0107'
>>> f.close()

Shortcuts

codecs.open()

>>> import codecs
>>> f = codecs.open('/tmp/ivan_utf8.txt', 'w',
...                 encoding='utf-8')
...
>>> f.write(ivan_uni)
>>> f.close()
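
The same shortcut exists as io.open(), which takes an encoding argument in the same way. A minimal sketch in Python 3 syntax (an assumption; the slides use Python 2), writing to a temporary directory instead of the /tmp paths used above:

```python
import io
import os
import tempfile

ivan_uni = 'Ivan Krsti\u0107'
# hypothetical location, chosen only so the example cleans up easily
path = os.path.join(tempfile.mkdtemp(), 'ivan_utf8.txt')

# The file object encodes on write and decodes on read.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(ivan_uni)
with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

# Reading in binary mode shows the bytes that actually hit the disk.
with io.open(path, 'rb') as f:
    raw = f.read()

print(read_back == ivan_uni, raw)
```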

Python 2 Unicode incompatibility

Python 2 Unicode workarounds

>>> ivan_bytes = ivan_uni.encode('utf-8')
>>> # do stuff
>>> ivan_bytes.decode('utf-8')
u'Ivan Krsti\u0107'

The BOM

Detecting the BOM

>>> f = open('/tmp/ivan_utf16.txt','r')
>>> sample = f.read(4)
>>> sample
'\xff\xfeI\x00'

Detecting the BOM

>>> import codecs
>>> (sample.startswith(codecs.BOM_UTF16_LE) or
...  sample.startswith(codecs.BOM_UTF16_BE))
...
True
>>> sample.startswith(codecs.BOM_UTF8)
False
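
The checks above can be folded into one helper. This is a sketch in Python 3 syntax (an assumption; the slides use Python 2), and sniff_bom is a hypothetical name. It relies on the 'utf-16' and 'utf-8-sig' codecs, which consume the BOM while decoding.

```python
import codecs

# (BOM bytes, codec that strips that BOM while decoding)
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(sample):
    """Hypothetical helper: return a codec name if sample starts with a BOM.

    Limitation: a UTF-32 little-endian BOM also starts with the UTF-16 one,
    so this sketch would misreport UTF-32 files.
    """
    for bom, codec in BOMS:
        if sample.startswith(bom):
            return codec
    return None


data = b'\xff\xfeI\x00v\x00'
codec = sniff_bom(data)
print(codec, data.decode(codec))
print(sniff_bom(b'Ivan'))
```

A missing BOM says nothing about the encoding, so None here means "unknown", not "ASCII".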

Do I have to remove the BOM?

How do you guess an encoding?

Summary of problems

Summary of solutions

Unicode in Python 3

Fin