
Except UnicodeError: battling Unicode demons in Python


DESCRIPTION

Issues that come up in practice when working with Unicode in Python, and how to avoid them.


Page 1: Except UnicodeError: battling Unicode demons in Python

except UnicodeError: # A practical guide to fighting Unicode demons

Aram Dulyan (@Aramgutang)
Sydney Python Users group (SyPy)

05 APR 2012

Page 2: Except UnicodeError: battling Unicode demons in Python
Page 3: Except UnicodeError: battling Unicode demons in Python

What is Unicode?

Page 4: Except UnicodeError: battling Unicode demons in Python

Looking inside:

Page 5: Except UnicodeError: battling Unicode demons in Python

In Python:

class unicode(basestring): ...

Page 6: Except UnicodeError: battling Unicode demons in Python

The great escapes:

>>> 'e' == u'e'
True
>>> '\xc9' == u'\xc9'
False
>>> u'\xc9' == u'\u00c9' == u'\U000000c9'
True
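For readers on Python 3, where every str is Unicode and the u prefix is optional, the same escapes can be sketched as follows. This is an illustrative equivalent, not part of the original slides, and the variable names are made up for the example.

```python
# Python 3 sketch: \x, \u, \U and \N escapes all name the same code point.
variants = [
    '\xc9',
    '\u00c9',
    '\U000000c9',
    '\N{LATIN CAPITAL LETTER E WITH ACUTE}',
]
all_same = len(set(variants)) == 1

# The code point is 0xC9, but its byte form depends on the encoding chosen.
code_point = ord('\xc9')
latin1_bytes = '\xc9'.encode('latin-1')
utf8_bytes = '\xc9'.encode('utf-8')
```

The byte string '\xc9' on the slide corresponds to latin1_bytes here, which is why it cannot be compared to the text u'\xc9' without knowing the encoding.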

Page 7: Except UnicodeError: battling Unicode demons in Python

UTF-8

● There is no difference between an ASCII-encoded and a UTF-8 encoded file if no “extended” characters appear in it.

● Except if there's a BOM (byte order mark):

  ● UTF-8: EF BB BF (shows up as ï»¿ when misread as Latin-1)
  ● UTF-16: FE FF big-endian, FF FE little-endian (the byte-swapped value U+FFFE is reserved for this very purpose, so the byte order can be detected)
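The two bullets above can be checked with a small Python 3 sketch. It uses only the standard codecs module and the built-in 'utf-8-sig' codec; the sample text and variable names are assumptions for illustration.

```python
import codecs

# Pure-ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = 'plain text'
same_bytes = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# A UTF-8 BOM is the three bytes EF BB BF at the start of the data.
with_bom = codecs.BOM_UTF8 + b'plain text'

# Plain 'utf-8' keeps the BOM as U+FEFF; 'utf-8-sig' strips it if present.
naive = with_bom.decode('utf-8')
clean = with_bom.decode('utf-8-sig')
```

Decoding with 'utf-8-sig' is a reasonable default when reading files of unknown origin, since it also accepts data without a BOM.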

NOT HELPFUL:

Page 8: Except UnicodeError: battling Unicode demons in Python

Encode/decode:

● Encode to bytes
● Decode to unicode

● or, forget decode completely:

>>> 'fort\xc3\xa9'.decode('utf-8')
u'fort\xe9'
>>> unicode('fort\xc3\xa9', 'utf-8')
u'fort\xe9'
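In Python 3 the same round trip looks like the sketch below, with bytes and str as the two types. Note that the UTF-8 bytes for é (U+00E9) are C3 A9. The variable names are illustrative only.

```python
# Python 3 sketch: bytes -> str is decode, str -> bytes is encode.
raw = b'fort\xc3\xa9'            # UTF-8 bytes for 'forté'

text = raw.decode('utf-8')       # method form
also_text = str(raw, 'utf-8')    # constructor form, no .decode() needed

round_trip = text.encode('utf-8')
```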

Page 9: Except UnicodeError: battling Unicode demons in Python

This is why we declare encodings:

RIGHT SINGLE QUOTATION MARK U+2019

>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'
>>> '\xe2\x80\x99'.decode('cp1252')
u'\xe2\u20ac\u2122'
>>> print u'\xe2\u20ac\u2122'
â€™

All because of a missing <meta charset="utf-8">
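The same mojibake can be reproduced, and in this simple case undone, in Python 3. This is a sketch assuming the text was UTF-8 misread as cp1252 exactly once; real data may have been mangled in other ways.

```python
# Python 3 sketch of UTF-8 bytes being misread as cp1252.
quote = '\u2019'                      # RIGHT SINGLE QUOTATION MARK

utf8_bytes = quote.encode('utf-8')    # three bytes: E2 80 99
mojibake = utf8_bytes.decode('cp1252')  # wrong codec gives three characters

# Reversing the wrong step recovers the original character.
repaired = mojibake.encode('cp1252').decode('utf-8')
```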

Page 10: Except UnicodeError: battling Unicode demons in Python

If you REALLY need ASCII:

>>> print u'r\xe9sum\xe9'
résumé
>>> print u'r\xe9sum\xe9'.encode('ascii', 'ignore')
rsum
>>> print u'r\xe9sum\xe9'.encode('ascii', 'replace')
r?sum?

$ pip install unidecode
>>> from unidecode import unidecode
>>> print unidecode(u'r\xe9sum\xe9')
resume
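When installing unidecode is not an option, accents can be folded with the standard library alone. The sketch below is a swapped-in technique (NFKD decomposition followed by dropping non-ASCII), not what unidecode does internally, and fold_to_ascii is a hypothetical helper name. It only handles characters that decompose into an ASCII base plus combining marks and does no transliteration.

```python
import unicodedata

text = 'r\xe9sum\xe9'  # 'résumé'

# The two error handlers from the slide, in Python 3 form.
ignored = text.encode('ascii', 'ignore')
replaced = text.encode('ascii', 'replace')


def fold_to_ascii(value):
    """Hypothetical helper: strip accents via NFKD, then drop non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', value)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


folded = fold_to_ascii(text)
```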

Page 11: Except UnicodeError: battling Unicode demons in Python

The “u” prefix:

>>> '%s %s' % (u'unicode', 'string')
u'unicode string'
>>> 'string ' + u'unicode'
u'string unicode'

class Loonie(object):
    def __str__(self):
        return 'Throatwobbler Mangrove'
    def __unicode__(self):
        return u'Richard Luxuryyacht'

>>> '%s' % Loonie()
'Throatwobbler Mangrove'
>>> u'%s' % Loonie()
u'Richard Luxuryyacht'

>>> '%s %s' % (Loonie(), u'is silly')
u'Throatwobbler Mangrove is silly'
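For comparison, Python 3 removes the implicit promotion shown above: mixing bytes and str raises instead of silently converting. The sketch below is an illustrative Python 3 analogue of the Loonie class, with __bytes__ standing in for the Python 2 __str__ and __str__ standing in for __unicode__.

```python
class Loonie:
    def __bytes__(self):
        return b'Throatwobbler Mangrove'

    def __str__(self):
        return 'Richard Luxuryyacht'


as_text = '%s' % Loonie()    # text formatting uses __str__
as_bytes = bytes(Loonie())   # explicit conversion uses __bytes__

# Concatenating bytes and str is a TypeError in Python 3.
mixing_raises = False
try:
    b'string ' + 'unicode'
except TypeError:
    mixing_raises = True
```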

Page 12: Except UnicodeError: battling Unicode demons in Python

Combining marks:

COMBINING DIAERESIS U+0308

LATIN SMALL LETTER E U+0065

LATIN SMALL LETTER E WITH DIAERESIS U+00EB

>>> print u'Zo\xeb'
Zoë
>>> print u'Zoe\u0308'
Zoë

>>> from unicodedata import normalize
>>> normalize('NFC', u'Zoe\u0308')
u'Zo\xeb'
>>> normalize('NFD', u'Zo\xeb')
u'Zoe\u0308'

OS X on HFS+ normalises filenames, others don't
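One practical consequence is that two strings can print identically yet compare unequal. A minimal Python 3 sketch of a normalisation-aware comparison follows; same_text is a hypothetical helper name, and NFC is chosen here only as one reasonable convention.

```python
from unicodedata import normalize

composed = 'Zo\xeb'        # single code point U+00EB
decomposed = 'Zoe\u0308'   # 'e' followed by COMBINING DIAERESIS

equal_raw = composed == decomposed  # different code point sequences


def same_text(a, b):
    """Hypothetical helper: compare after normalising both sides to NFC."""
    return normalize('NFC', a) == normalize('NFC', b)


equal_normalised = same_text(composed, decomposed)
```

Normalising once at the boundary, for example when reading filenames from disk, avoids having to remember this at every comparison.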

Page 13: Except UnicodeError: battling Unicode demons in Python

Warning:

Page 14: Except UnicodeError: battling Unicode demons in Python

PEP-8

Code in the core Python distribution should always use the ASCII or Latin-1 encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond, UTF-8 is preferred over Latin-1, see PEP 3120.

Files using ASCII should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.

For Python 3.0 and beyond, the following policy is prescribed for the standard library (see PEP 3131): All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the latin alphabet MUST provide a latin transliteration of their names.
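As a small illustration of the escape-based style the quoted policy prefers, the Python 3 sketch below keeps the source ASCII-only while still producing non-ASCII text. to_ascii_literal is a hypothetical helper built on the standard 'unicode_escape' codec and is not something PEP 8 itself prescribes.

```python
# ASCII-only source that still yields non-ASCII text at runtime.
name_escaped = 'Zo\u00eb'
name_by_name = 'Zo\N{LATIN SMALL LETTER E WITH DIAERESIS}'


def to_ascii_literal(text):
    """Hypothetical helper: render text using only ASCII escape sequences."""
    return text.encode('unicode_escape').decode('ascii')


escaped_form = to_ascii_literal(name_escaped)
```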


Page 15: Except UnicodeError: battling Unicode demons in Python

Libraries:

● unidecode
  ● For when you absolutely need ASCII – folds accents and transliterates from many languages.

● chardet
  ● Guesses most likely character encoding of a given bytestring. Based on Mozilla's code.

● unicode-nazi
  ● Yells about any implicit unicode/bytestring conversion in your code. Useful when porting code to Python 3.

Page 16: Except UnicodeError: battling Unicode demons in Python

Links:

● All About Python and Unicode
  ● A detailed reference on all things pertaining to Python and Unicode.

● Pragmatic Unicode
  ● PyCon 2012 talk on Unicode in Python, covering v3 as well.

● Love Hotels and Unicode
  ● A look at the inside politics and other quirky aspects of Unicode.

● Python Unicode – Fixing UTF-8 encoded as Latin-1
  ● Another poor soul who ran into this problem.

● Why the Obama tweet was garbled
  ● A quick explanation with comments from the people responsible.

● Unicode Support Shootout
  ● An advanced treatise on how most languages (including Python) fail at Unicode.