Except UnicodeError: battling Unicode demons in Python

except UnicodeError:
    # A practical guide to fighting Unicode demons

Aram Dulyan (@Aramgutang)
Sydney Python Users group (SyPy)

05 APR 2012

What is Unicode?

Looking inside:

In Python:

class unicode(basestring):
    ...

The great escapes:

>>> 'e' == u'e'
True
>>> '\xc9' == u'\xc9'
False
>>> u'\xc9' == u'\u00c9' == u'\U000000c9'
True
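For readers on Python 3, the same escapes can be sketched as below. This is an illustrative translation, not part of the original slides: in Python 3 a plain literal is already text and byte strings need a b prefix, so a bytes/text comparison is False even for ASCII (unlike the Python 2 result above). The variable names are made up for the example.

```python
import unicodedata

# In Python 3, a plain literal is text (Unicode) and b'...' is bytes.
e_acute = '\xc9'

# The three escape forms name the same code point, U+00C9.
same_escapes = e_acute == '\u00c9' == '\U000000c9'

# A byte string never compares equal to a text string, even for ASCII.
bytes_equal_text = b'e' == 'e'

# The \N{...} escape refers to a character by its Unicode name.
by_name = '\N{LATIN CAPITAL LETTER E WITH ACUTE}'

print(same_escapes, bytes_equal_text, by_name == e_acute)
print(hex(ord(e_acute)), unicodedata.name(e_acute))
```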

UTF-8

● There is no difference between an ASCII-encoded and a UTF-8 encoded file if no “extended” characters appear in it.

● Except if there's a BOM (byte order mark):

● UTF-8: EF BB BF ( ï»¿ )

● UTF-16: FE FF (big-endian) or FF FE (little-endian). The byte-swapped value U+FFFE is reserved for this very purpose, so the byte order can always be detected.
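A minimal Python 3 sketch of both points, using only the standard library (the codecs.BOM_UTF8 constant and the 'utf-8-sig' codec). The sample text and variable names are assumptions for illustration.

```python
import codecs

ascii_text = 'plain text'

# For pure-ASCII text the two encodings produce identical bytes.
identical = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# The 'utf-8-sig' codec writes the BOM on encode and strips it on decode.
with_bom = ascii_text.encode('utf-8-sig')
has_bom = with_bom.startswith(codecs.BOM_UTF8)  # the bytes EF BB BF

# Decoding with plain 'utf-8' keeps the BOM as a leading U+FEFF character.
kept = with_bom.decode('utf-8')
stripped = with_bom.decode('utf-8-sig')

print(identical, has_bom, repr(kept[0]), stripped)
```

Reading files of unknown origin with 'utf-8-sig' is a common way to tolerate a BOM, since that codec also accepts input that has none.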

NOT HELPFUL:

Encode/decode:

● Encode to bytes

● Decode to unicode

● or, forget decode completely:

>>> 'fort\xc3\xa9'.decode('utf-8')
u'fort\xe9'
>>> unicode('fort\xc3\xa9', 'utf-8')
u'fort\xe9'
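The same round trip in Python 3, as a rough sketch: text is encoded to bytes, bytes are decoded to text, and str(data, encoding) plays the role that unicode(data, encoding) plays in Python 2. The variable names are illustrative only.

```python
text = 'fort\xe9'                 # text: a sequence of code points

data = text.encode('utf-8')       # text -> bytes
roundtrip = data.decode('utf-8')  # bytes -> text

# str(bytes, encoding) is the Python 3 counterpart of unicode(bytes, encoding).
via_constructor = str(data, 'utf-8')

# A strict codec that cannot represent the bytes fails loudly.
try:
    data.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(data, roundtrip == text, via_constructor == text, type(ascii_error).__name__)
```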

This is why we declare encodings:

RIGHT SINGLE QUOTATION MARK U+2019

>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'
>>> '\xe2\x80\x99'.decode('cp1252')
u'\xe2\u20ac\u2122'
>>> print u'\xe2\u20ac\u2122'
â€™

All because of a missing <meta charset="utf-8">
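When the damage has already happened, the wrong decode can sometimes be reversed. The Python 3 sketch below assumes the text was UTF-8 misread as cp1252 and that nothing was lost along the way. That is not guaranteed in general, because some byte values have no cp1252 mapping. Variable names are illustrative.

```python
original = '\u2019'                    # RIGHT SINGLE QUOTATION MARK

utf8_bytes = original.encode('utf-8')  # the three bytes E2 80 99
garbled = utf8_bytes.decode('cp1252')  # wrong codec: three unrelated characters

# Undo the mistake: back to the raw bytes, then decode with the right codec.
repaired = garbled.encode('cp1252').decode('utf-8')

print(garbled, repaired == original)
```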

If you REALLY need ASCII:

>>> print u'r\xe9sum\xe9'
résumé
>>> print u'r\xe9sum\xe9'.encode('ascii', errors='ignore')
rsum
>>> print u'r\xe9sum\xe9'.encode('ascii', errors='replace')
r?sum?

$ pip install unidecode
>>> from unidecode import unidecode
>>> print unidecode(u'r\xe9sum\xe9')
resume
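Where installing a package is not an option, accent folding can be approximated with the standard library alone. The Python 3 sketch below decomposes the text with NFKD and then drops whatever is not ASCII. fold_to_ascii is a hypothetical helper name, and this is a cruder substitute for a transliteration library: characters with no decomposition, such as ß or anything in a non-Latin script, simply disappear.

```python
import unicodedata


def fold_to_ascii(text):
    """Hypothetical helper: split accents off with NFKD, then drop non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(fold_to_ascii('r\xe9sum\xe9'))
```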

The “u” prefix:

>>> '%s %s' % (u'unicode', 'string')
u'unicode string'
>>> 'string ' + u'unicode'
u'string unicode'

class Loonie(object):
    def __str__(self):
        return 'Throatwobbler Mangrove'
    def __unicode__(self):
        return u'Richard Luxuryyacht'

>>> '%s' % Loonie()
'Throatwobbler Mangrove'
>>> u'%s' % Loonie()
u'Richard Luxuryyacht'

>>> '%s %s' % (Loonie(), u'is silly')
u'Throatwobbler Mangrove is silly'
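For contrast, a short Python 3 sketch of what replaced these implicit conversions. It is not from the original slides, and the variable names are made up. Concatenating text and bytes raises a TypeError, while %-formatting bytes into text succeeds but embeds the bytes repr, which is rarely what was intended.

```python
text = 'unicode'
data = b'string'

# Concatenating bytes and text raises instead of converting silently.
try:
    combined = data + text
except TypeError:
    combined = None

# %-formatting bytes into text does not fail, but it embeds the repr.
formatted = '%s %s' % (text, data)

# The explicit route: decode first, then combine.
explicit = '%s %s' % (text, data.decode('ascii'))

print(combined, formatted, explicit)
```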

Combining marks:

COMBINING DIAERESIS U+0308

LATIN SMALL LETTER E U+0065

LATIN SMALL LETTER E WITH DIAERESIS U+00EB

>>> print u'Zo\xeb'
Zoë
>>> print u'Zoe\u0308'
Zoë

>>> from unicodedata import normalize
>>> normalize('NFC', u'Zoe\u0308')
u'Zo\xeb'
>>> normalize('NFD', u'Zo\xeb')
u'Zoe\u0308'

OS X on HFS+ normalises filenames (to a decomposed form close to NFD); most other filesystems don't
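The practical consequence is that two strings that print identically may not compare equal, for example a name typed by a user against one read back from a filesystem. A Python 3 sketch of a normalising comparison follows. same_text is a hypothetical helper, and the choice of NFC as the default form is an assumption.

```python
import unicodedata

composed = 'Zo\xeb'       # precomposed U+00EB
decomposed = 'Zoe\u0308'  # 'e' followed by COMBINING DIAERESIS


def same_text(a, b, form='NFC'):
    """Hypothetical helper: compare two strings after normalising both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed, len(composed), len(decomposed))
print(same_text(composed, decomposed))
```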

Warning:

PEP-8

Code in the core Python distribution should always use the ASCII or Latin-1 encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond, UTF-8 is preferred over Latin-1, see PEP 3120.

Files using ASCII should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.

For Python 3.0 and beyond, the following policy is prescribed for the standard library (see PEP 3131): All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the latin alphabet MUST provide a latin transliteration of their names.

Libraries:

● unidecode

● For when you absolutely need ASCII – folds accents and transliterates from many languages.

● chardet

● Guesses most likely character encoding of a given bytestring. Based on Mozilla's code.

● unicode-nazi

● Yells about any implicit unicode/bytestring conversion in your code. Useful when porting code to Python 3.

http://pypi.python.org/pypi/Unidecode

http://pypi.python.org/pypi/chardet

http://pypi.python.org/pypi/unicode-nazi

Links:

● All About Python and Unicode

● A detailed reference on all things pertaining to Python and Unicode.

● Pragmatic Unicode

● PyCon 2012 talk on Unicode in Python, covering v3 as well.

● Love Hotels and Unicode

● A look at the inside politics and other quirky aspects of Unicode.

● Python Unicode – Fixing UTF-8 encoded as Latin-1

● Another poor soul who ran into this problem.

● Why the Obama tweet was garbled

● A quick explanation with comments from the people responsible.

● Unicode Support Shootout

● An advanced treatise on how most languages (including Python) fail at Unicode.

http://boodebr.org/main/python/all-about-python-and-unicode

http://nedbatchelder.com/text/unipain.html

http://www.reigndesign.com/blog/love-hotels-and-unicode/

http://www.red-mercury.com/blog/eclectic-tech/python-unicode-fixing-utf-8-encoded-as-latin-1-iso-8859-1/

http://www.hanselman.com/blog/WhyTheAskObamaTweetWasGarbledOnScreenKnowYourUTF8UnicodeASCIIAndANSIDecodingMrPresident.aspx

http://training.perl.com/tcpc/OSCON2011/gbu/gbu.html
