ℙƴ☂ℌ ø ⒝⒴⒯⒠⒮ ⒝⒴⒯⒠⒮ D Σ MY ƧƬ IFI Σ D Boris FELD - PyParis , Paris - 2017

PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Page 1: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

ℙƴ☂ℌøἤ ⒝⒴⒯⒠⒮⒝⒴⒯⒠⒮

DΣMYƧƬIFIΣD

Boris FELD - PyParis, Paris - 2017

Page 2: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Boris FELD

Python developer

Mercurial and Python consultant at Octobus

https://lothiraldan.github.io/

@lothiraldan

/me

Page 3: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Unicode is ���!

Page 4: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Let's test it!

Page 5: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

What is the length of this Unicode string in Python 2?

len(u'😎')

1

2

3

4

1. Unicode length

Page 6: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

It depends on your Python:

DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64
$> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27mu/bin/python \
    -c "print len(u'\U0001f60e')"
1

But it can also be:

DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64
$> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27m/bin/python \
    -c "print len(u'\U0001f60e')"
2

Unicode length

Page 7: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

When could you see this error message?

UnicodeEncodeError: 'ascii' codec can't encode character

When doing .encode('ascii')

When doing .decode('ascii')

When doing .decode('utf-8')

In all of these situations

2. UnicodeEncodeError

Page 8: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

In all of these situations!

>>> x = u'é'
>>> x.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> x.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> x.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

UnicodeEncodeError

Page 9: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

When should you use chr and unichr?

You should always use chr.

You should always use unichr.

You should use chr for ASCII and unichr for Unicode.

3. Chr vs unichr

Page 10: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Prefer using unichr for everything.

Chr vs unichr

Page 11: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Skeptical dog is skeptical

Page 12: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

We have to go back!

Page 13: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

The 60s

Page 14: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Apollo 11

Page 15: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Woodstock

Page 16: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Something important

Page 17: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Something huge

Page 18: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

ASCII was born

Page 19: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

In the 1960s, the American Standards Association wanted to answer the question:

How to represent text digitally?

The important question

Page 20: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Problem: computers only speak bits. How do we transform text into bits?

Problem

Page 21: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

We know how to convert an integer to binary:

0 = 0000000
1 = 0000001
2 = 0000010
3 = 0000011
...
127 = 1111111

Let's assign each character an integer from 0 to 127, named a "code point".

Pretty simple solution

Page 22: PyParis 2017 / Unicode and bytes demystified, by Boris Feld
Page 23: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

ASCII with Python

Page 24: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Let's take a string:

"pyparis"

A string is a sequence of characters:

assert list("pyparis") == ['p', 'y', 'p', 'a', 'r', 'i', 's']

What is a string?

Page 25: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

assert type("pyparis"[0]) == str
assert len("pyparis"[0]) == 1

A character (from the Greek χαρακτήρ "engraved or stamped mark" on coins or seals, "branding mark, symbol") is a sign or symbol.

- Wikipedia

A character is basically anything. It could represent a letter, a digit or even an emoji.

What is a character?

Page 26: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

To retrieve the ASCII code point of a character, we can use ord:

assert ord("p") == 112

To reverse the process, we can use chr:

assert chr(112) == "p"

Code points in Python

Page 27: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Character    p    y    p    a    r    i    s
Code point   112  121  112  97   114  105  115

Code points

Page 28: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Character    p        y        p        a        r        i        s
Code point   112      121      112      97       114      105      115
Binary       1110000  1111001  1110000  1100001  1110010  1101001  1110011

code point -> encode -> binary
code point <- decode <- binary

ASCII encoding
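
The table above can be rebuilt in a few lines. This is a minimal sketch for Python 3 (the deck's own snippets target Python 2); the variable names are illustrative and not taken from the talk.

```python
text = "pyparis"

# Character -> code point, one integer per character.
code_points = [ord(ch) for ch in text]

# Code point -> 7-bit binary, the width ASCII needs.
binary = [format(cp, "07b") for cp in code_points]

# encode does the same job in one call: str -> bytes.
encoded = text.encode("ascii")

# Each byte of the encoded form is exactly the code point of one character.
assert list(encoded) == code_points

# decode reverses the process: bytes -> str.
assert encoded.decode("ascii") == text

for ch, cp, bits in zip(text, code_points, binary):
    print(ch, cp, bits)
```

For ASCII, encoding is nothing more than writing each code point down as one byte, which is why the round trip is lossless.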

Page 29: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

encode is meant to transform a string into some bytes:

import binascii
string = 'abc'
data = string.encode('ascii')  # text -> bytes
assert binascii.hexlify(data) == b'616263'

decode is meant to transform some bytes into a string:

import binascii
data = binascii.unhexlify('616263')
string = data.decode('ascii')  # bytes -> text
assert string == 'abc'

Each of these methods accepts an encoding parameter that names the conversion algorithm to use.

Encode vs Decode

Page 30: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Everything is awesome...

Page 31: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

...right?

Page 32: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Small problem

Page 33: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

ASCII solved the problem for the USA but not for everyone else.

Not everyone speaks English

Page 34: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

ASCII only uses the 7 lower bits of a byte.

01100001

But on most computers a byte is actually 8 bits, so we can support more characters.

And so new standards were born...

Other standards

Page 35: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Some were based on ASCII and used the 8th bit to add support for accents, for example Latin-1, which defines the character É with code point 201.

Some others were not compatible at all, like EBCDIC, used on IBM mainframes, where 1001011 (code point 75) represents the punctuation mark "." while in ASCII it represents "K".

Of course they were not all cross-compatible...

Other standards
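
The two examples above can be checked with Python's built-in codecs. This is a minimal Python 3 sketch; it assumes cp037, one of the EBCDIC code pages shipped with Python, is a fair stand-in for the EBCDIC variant the slide has in mind.

```python
# The same byte, read with two incompatible standards.
byte = bytes([0b1001011])  # 75

ascii_char = byte.decode("ascii")
ebcdic_char = byte.decode("cp037")  # assumption: cp037 as the EBCDIC variant

assert ascii_char == "K"
assert ebcdic_char == "."

# Latin-1 uses the 8th bit to fit accented letters in a single byte.
e_acute = "É".encode("latin-1")
assert e_acute == bytes([201])

# ASCII has no such character at all.
try:
    "É".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

assert not ascii_ok
```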

Page 36: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

It was a mess

Page 37: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Initial text           a         b         ã         é
Latin-1 code point     97        98        227       233
Latin-1 encoding       01100001  01100010  11100011  11101001
ASCII decoding         a         b         ERROR     ERROR
Mac OS Roman decoding  a         b         „         È
EBCDIC decoding        /         ERROR     T         Z

Example
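
The table above can be replayed with Python's codecs. This is a rough Python 3 sketch, not the author's code. It assumes cp037 as the EBCDIC code page; that code page assigns a character to the second byte, so it shows a character where the table says ERROR.

```python
# Encode once with Latin-1...
data = "abãé".encode("latin-1")
assert data == b"ab\xe3\xe9"

# ...then decode the very same bytes with other standards.
as_mac_roman = data.decode("mac_roman")
as_ebcdic = data.decode("cp037")  # assumption: cp037 as the EBCDIC variant

# ASCII cannot decode bytes above 127; errors="replace" marks them instead of crashing.
as_ascii = data.decode("ascii", errors="replace")

print("Mac OS Roman:", as_mac_roman)
print("EBCDIC (cp037):", as_ebcdic)
print("ASCII:", as_ascii)
```

None of these calls knows that the bytes were produced with Latin-1, which is exactly why the results differ.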

Page 38: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Here comes our savior!

Page 39: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

One Standard to rule them all,

One Standard to find them,

One Standard to bring them all

and in the greater good bind them

Unicode the savior

Page 40: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

- Wikipedia

It all started in 1987-1988 as a coordination between Joe Becker from Xerox and Lee Collins and Mark Davis from Apple.

The Unicode code points are, fortunately for us, ASCII compatible.

What is Unicode?

Page 41: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

The latest version of Unicode contains a repertoire of 128,237 characters covering 135 modern and historic scripts, as well as multiple symbol sets.

- Wikipedia

ASCII defined 128 characters, so Unicode defines about 1000 times more characters.

It defines several blocks:

Basic Latin: a b ... X Y Z

Greek, Aramaic, Cherokee: Δ ע Ꮧ

Right-to-left scripts, Cuneiform, hieroglyphs:

Mahjong Tiles, Domino Tiles, Playing cards:

Emoticons, Musical notations:

Unicode size

Page 42: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Remember the ASCII table?

Unicode vs ASCII

Page 43: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Unicode with Python

Page 44: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Let's take a Unicode character: €.

First, declare the encoding of your Python source file as utf-8:

# -*- coding: utf-8 -*-

Then, you can write it this way:

u'€'

Or:

u'\u20AC'

Its code point is 8364:

ord(u'€') == 8364

How to write Unicode in Python

Page 45: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Let's convert the code point into binary:

Code point         8364
Naive conversion   0010000010101100

Problem

Page 46: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

It doesn't fit into 1 byte.

The problems when you start using more than 1 byte are multiple and annoying:

How to order the bytes? Big and little endian problems, anyone?

How to recognize which byte you are reading in a file or stream?

How to detect and correct transmission errors where only some bytes went missing?

8364 in binary takes two bytes. Unicode code points go well beyond 1,000,000 (even though many are not allocated yet), taking up to 3 bytes.

Multi-bytes
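
The byte-order question can be made concrete with plain integers. This is a small Python 3 sketch with illustrative variable names; it only shows the naive conversion and is not an actual Unicode encoding.

```python
code_point = ord("€")
assert code_point == 8364

# 14 significant bits: more than one byte can hold.
assert code_point.bit_length() == 14

# The same two bytes can be written in two different orders.
big_endian = code_point.to_bytes(2, "big")
little_endian = code_point.to_bytes(2, "little")
assert big_endian == b"\x20\xac"
assert little_endian == b"\xac\x20"

# Reading little-endian bytes as big-endian silently yields another number.
misread = int.from_bytes(little_endian, "big")
assert misread != code_point

# The highest Unicode code point needs 21 bits, so 3 bytes in a naive conversion.
highest = 0x10FFFF
bytes_needed = (highest.bit_length() + 7) // 8
assert bytes_needed == 3
```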

Page 47: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

As ASCII was simple, transforming ASCII code points into binary was straightforward.

But the presence of high code point characters in Unicode complicates the process. There are multiple ways of doing it, called encodings:

UTF-8

UTF-16

UTF-32

Multiple encodings

Page 48: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

If you are not sure, use UTF-8: it is compatible with every character, works well most of the time and solves multi-byte related problems elegantly.

If you process more Asian characters than Latin ones, use UTF-16 so you use less space and memory.

If you need to interact with another program, use the other program's default encoding (CSV anyone?).

Comparison of Unicode encodings - Wikipedia

Choose an encoding
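
The space trade-off behind this advice is easy to measure. This is a minimal Python 3 sketch; the helper name encoded_sizes and the sample strings are illustrative assumptions, and the little-endian codec variants are used so that no byte order mark inflates the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` takes in each Unicode encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


latin_sizes = encoded_sizes("hello")
cjk_sizes = encoded_sizes("日本語")

print("hello :", latin_sizes)
print("日本語:", cjk_sizes)

# Latin text is smallest in UTF-8, these CJK characters are smallest in UTF-16.
assert latin_sizes["utf-8"] < latin_sizes["utf-16-le"]
assert cjk_sizes["utf-16-le"] < cjk_sizes["utf-8"]
```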

Page 49: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

UTF-8 Everywhere Manifesto

UTF-8 everywhere

Page 50: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Character          A                                 €
Code point         65                                8364
Naive conversion   01000001                          0010000010101100
UTF-8              01000001                          111000101000001010101100
UTF-16             0000000001000001                  0010000010101100
UTF-32             00000000000000000000000001000001  00000000000000000010000010101100

What are the differences?
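
The rows of this table can be regenerated from Python's codecs. This is a minimal Python 3 sketch; the helper name bits is an illustrative assumption, and the big-endian codec variants are chosen so that the output matches the byte order shown in the table and carries no byte order mark.

```python
def bits(text, encoding):
    """Encode `text` and return its bytes as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, "|", bits("A", encoding), "|", bits("€", encoding))
```

UTF-8 keeps "A" on a single byte, like ASCII, and spends three bytes on "€"; UTF-16 and UTF-32 use fixed 2-byte and 4-byte units for both.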

Page 51: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Let's clarify something:

encode is meant to transform a Unicode string into some bytes:

import binascii
binascii.hexlify(u'é'.encode('utf-8')) == b'c3a9'

decode is meant to transform some bytes into a Unicode string:

import binascii
binascii.unhexlify('c3a9').decode('utf-8') == u'é'

Encode vs Decode

Page 52: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Python 2

Page 53: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Counting the length of an ASCII string is easy: count the number of bytes!

But it's much harder with Unicode strings.

Python 2 tries hard to get you a correct answer.

Let's take back our example: 😎. Its code point is 128526.

1. String length

Page 54: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Python 2 comes in several flavors, two of which are related to Unicode: it is either a narrow build or a wide build. This basically changes how Python stores its strings.

For code points <= 65535, everything works the same: Python stores each character separately, as one and only one character.

For code points > 65535, it differs. The wide build character size is enough for all Unicode code points. But the narrow build character size is not big enough for code points > 65535, so it stores upper code points as a pair of characters.

The narrow build uses less memory, but this explains why it returns 2 for len(u'😎'): Python 2 actually stores two characters.

Multiple flavors of Python 2
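
The "pair of characters" of a narrow build is a UTF-16 surrogate pair. Here is a rough Python 3 sketch of how such a pair is computed; the function name surrogate_pair is an illustrative assumption, and the code emulates the arithmetic rather than calling anything from Python 2.

```python
def surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its two 16-bit UTF-16 units."""
    if code_point <= 0xFFFF:
        raise ValueError("code point fits in a single 16-bit unit")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


emoji = "\U0001f60e"
assert ord(emoji) == 128526

high, low = surrogate_pair(ord(emoji))
print(hex(high), hex(low))

# Python 3 counts code points, while UTF-16 needs two 16-bit units here,
# which matches the 2 reported by a narrow Python 2 build.
utf16_units = len(emoji.encode("utf-16-be")) // 2
assert len(emoji) == 1
assert utf16_units == 2
```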

Page 55: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Remember the meaning of encode and decode?

Encode transforms a Unicode string into some bytes.

Decode transforms some bytes into a Unicode string.

2. Encoding/Decoding in Python 2

Page 56: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Python has always had a string type; the unicode type was introduced in Python 2.0.

Python 2 str is badly named as it's basically a bag of bytes. When you display it, Python will try to decode it for you. So for ASCII-only strings, encode and decode will return the same value.

x = 'abc'
assert x.encode('ascii') == x
assert x.decode('ascii') == x

Python 2 type system

Page 57: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Python is a strongly typed language, meaning that Python shouldn't coerce types behind your back:

'012' + 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects

But it doesn't respect this property with strings. Remember that decode converts bytes into a Unicode string in Python?

x = u'é'
x.decode('utf-8')

As decode is called on a Unicode instance, which isn't bytes, Python first tries to make some bytes out of the string and effectively does:

x = u'é'
x.encode('ascii').decode('utf-8')

That's why you can see a UnicodeEncodeError while trying to decode a Unicode string in Python 2.

Python 2 type coercion
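
The hidden step can be written out explicitly. This is a Python 3 sketch that mimics the Python 2 behavior described above; py2_style_decode is a hypothetical helper, and it assumes the implicit encoding is ASCII, as in the slide.

```python
def py2_style_decode(text, encoding):
    """Mimic Python 2's unicode.decode: implicit ASCII encode, then decode."""
    as_bytes = text.encode("ascii")   # the coercion Python 2 did behind your back
    return as_bytes.decode(encoding)  # the decode you actually asked for


# ASCII-only text survives the hidden step, so the bug stays invisible.
assert py2_style_decode("abc", "utf-8") == "abc"

# Any non-ASCII character fails in the hidden encode, not in the decode.
try:
    py2_style_decode("é", "utf-8")
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__

print(error_name)
```

The call asks for a decode, yet the exception comes from the encode step, which is why the error message looks so surprising in Python 2.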

Page 58: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

You can use chr to get the character of a code point:

assert chr(65) == 'A'

But it only works for code points below 256!

chr(8364)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)

For Unicode you need to use unichr:

assert unichr(8364) == u'€'

3. Python 2 chr vs unichr

Page 59: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Python 3 ♥ ♥ ♥ ♥

Page 60: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Python 3 now always stores its strings the same way and len returns the right answer no matter what:

x = '😎'
assert len(x) == 1

1. Python 3 single flavor

Page 61: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Python 3's biggest change was to change the type system of strings.

           Byte strings   Unicode strings
Python 2   str            unicode
Python 3   bytes          str

2. Python 3 big change

Page 62: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Now that Python 3 has separate types for bytes and strings, we can no longer mess with encode and decode:

string = ''
string.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

Decoding a Unicode string never made sense anyway.

bytes = b''
bytes.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'

So you always know what types you are dealing with.

2. Python 3 coherent type system
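
The same separation can be checked without reading tracebacks. This is a short Python 3 sketch with illustrative variable names; the concatenation check goes slightly beyond the slide but relies on standard Python 3 behavior.

```python
text = "abc"
data = b"abc"

# Each type only carries the method that makes sense for it.
assert not hasattr(text, "decode")
assert not hasattr(data, "encode")

# Mixing the two types is refused instead of being coerced silently.
try:
    text + data
    mixing_failed = False
except TypeError:
    mixing_failed = True

assert mixing_failed

# The only way across is an explicit encode or decode.
assert data.decode("ascii") == text
assert text.encode("ascii") == data
```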

Page 63: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Unicode strings are now the norm, so Python 3 dropped the u prefix for Unicode strings and replaced it with a b prefix for bytes, so you directly write:

x = ' '

Python 3.3 reintroduced the u prefix for codebases that need to be compatible with both Python 2 and Python 3, so this also works:

x = u' '

2. No more u prefix

Page 64: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Python 3 no longer has separate chr and unichr functions, just use chr.

assert chr(65) == 'A'
assert chr(8364) == '€'

3. Python 3 chr

Page 65: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Pain relief tips

Page 66: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Thanks to the new type system, it is now easier to identify which parts of the code need to encode strings and decode bytes.

bytes     Outside world
decode    Library
unicode
          Business logic
unicode
encode    Library
bytes     Outside world

1. Unicode sandwich

Page 67: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.

- Python doc on unicode

Unicode sandwich
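
In code, the sandwich boils down to decoding at the entrance and encoding at the exit. This is a minimal Python 3 sketch; the function shout and its toy "business logic" are hypothetical and only illustrate the three layers.

```python
def shout(raw, encoding="utf-8"):
    """Toy request handler shaped like a Unicode sandwich."""
    text = raw.decode(encoding)     # bottom slice: bytes -> str, as early as possible
    result = text.upper() + "!"     # filling: business logic only ever sees str
    return result.encode(encoding)  # top slice: str -> bytes, as late as possible


incoming = "café".encode("utf-8")
outgoing = shout(incoming)

assert isinstance(outgoing, bytes)
assert outgoing.decode("utf-8") == "CAFÉ!"
```

Because the logic in the middle only handles str, the encoding shows up in exactly two places, both at the edges of the program.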

Page 68: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

You cannot infer the encoding of bytes, so rely on the declared one:

Content-Type: text/html; charset=ISO-8859-4

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

<?xml version="1.0" encoding="UTF-8"?>

# -*- coding: iso8859-1 -*-

If you really really really really need to guess the encoding, you can use chardet, but remember, it's a best-effort scenario.

2. Use declared encoding
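
Reading the declared encoding from an HTTP-style header can be done with the standard library. This is a rough Python 3 sketch; declared_charset is a hypothetical helper, and a real HTTP client library would normally expose the charset for you.

```python
from email.message import Message


def declared_charset(content_type, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset(default)


charset = declared_charset("text/html; charset=ISO-8859-4")
print(charset)

# Bytes produced with the declared encoding...
text = "ā"
payload = text.encode(charset)

# ...come back intact only when decoded with that same encoding.
assert payload.decode(charset) == text
assert payload.decode("latin-1") != text

# No declaration means no answer: fall back explicitly instead of guessing.
fallback = declared_charset("text/html", default="utf-8")
```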

Page 69: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

encode and decode accept a second argument for error handling. By default it is set to strict, which means crash:

x = u'abcé'
x.encode('ascii', errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3...

You can also use replace to replace invalid characters with ?:

assert x.encode('ascii', errors='replace') == 'abc?'

Or you can simply ignore them:

assert x.encode('ascii', errors='ignore') == 'abc'

Finally you can replace them with their XML character reference:

assert x.encode('ascii', errors='xmlcharrefreplace') == 'abc&#233;'

3. Error handling
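
The same handlers exist in Python 3, where the results are bytes objects. This is a minimal Python 3 sketch; the backslashreplace handler and the decode examples are additions that are not on the slide.

```python
text = "abcé"

# Encoding: what to do with characters the target encoding cannot represent.
replaced = text.encode("ascii", errors="replace")
ignored = text.encode("ascii", errors="ignore")
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")
escaped = text.encode("ascii", errors="backslashreplace")

assert replaced == b"abc?"
assert ignored == b"abc"
assert xml_refs == b"abc&#233;"
assert escaped == b"abc\\xe9"

# Decoding: what to do with bytes that are invalid in the source encoding.
raw = b"abc\xe9"
decoded_replaced = raw.decode("ascii", errors="replace")
decoded_ignored = raw.decode("ascii", errors="ignore")

assert decoded_replaced == "abc\ufffd"  # U+FFFD, the replacement character
assert decoded_ignored == "abc"
```

Every handler except strict loses or alters information, so they fit display and logging better than data you need to round-trip.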

Page 70: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Use Unicode anytime possible.

Use Python 3.

Explicitly encode unicode and decode str in Python 2; it might solve bugs in your code and ease Python 3 conversion.

Unicode sandwich.

Never guess an encoding!

Use error handling.

Conclusion

Page 71: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

# Python 2: print a range of animal and object emoji
for c in range(0x1F410, 0x1F4f0):
    print (r"\U%08x" % c).decode("unicode-escape"),

Python fun

Page 72: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

Thank you!

Page 73: PyParis 2017 / Unicode and bytes demystified, by Boris Feld

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Pragmatic Unicode

Unicode In Python, Completely Demystified

What every programmer absolutely, positively needs to know about encodings and character sets to work with text

Holy batman

Reddit on unicode

References