UTF-8, Perl and You
By Rafael Almeria
Chapter 1: Introduction
1 - Introduction
This talk does not deal with the motivation for using UTF-8.
This talk is about:
Implementation details.
Understanding UTF-8.
Converting your data.
Knowing how to fix common problems.
Some assumptions:
Language: Perl
Unix Operating System
Input encoded as: ASCII, ISO-8859-1/Latin-1 or Windows-1252.
Output encoded as: UTF-8
What we’ll cover in this talk:
A primer on character encoding
A simplifying principle
UTF-8
Perl & UTF-8
Making the Browser Happy
Encoding Hell
Chapter 2: A Very Brief Primer on Character Encoding
2 - A Very Brief Primer on Character Encoding.
What is a character encoding?
It’s a specific way to represent the characters of a given character set as bytes.
A character set may have a numerical ordering on it for use with a given character encoding.
The number given to a specific character in an ordered character set is its code point.
Do not confuse a character’s code point with its representation!
The two may be the same for ASCII, ISO-8859-1 and Windows-1252, and they may be the same for 1-byte UTF-8, but they are definitely not the same for multi-byte UTF-8.
Confusing them is a common problem. So don’t!
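A minimal sketch of the difference, assuming the core Encode module is available. The helper name hex_bytes is made up for this example. It takes one character, "ñ", and shows that its code point stays the same while its byte representation depends on the encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper for this example: show a byte string as hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# "ñ" has code point 0xF1 in both ISO-8859-1 and Unicode.
my $char = "\x{F1}";
printf "code point:       0x%02X\n", ord($char);

my $latin1 = hex_bytes(encode('ISO-8859-1', $char));   # one byte
my $utf8   = hex_bytes(encode('UTF-8', $char));        # two bytes
print "ISO-8859-1 bytes: $latin1\n";                   # F1
print "UTF-8 bytes:      $utf8\n";                     # C3 B1
```

In ISO-8859-1 the byte equals the code point. In UTF-8 it does not.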
Chapter 3: A Simplifying Principle
3 - A Simplifying Principle
If all of our data is encoded using only the following encodings (code point ranges are in parentheses):
ASCII (0x00 - 0x7F)
ISO-8859-1/Latin-1 (0x00 - 0xFF)
Windows-1252 (0x00 - 0xFF)
and if we only care about printable content, then
ASCII ⊂ ISO-8859-1 ⊂ Windows-1252
We can treat everything as Windows-1252!
This should be OK when we are sure that every document uses one of these three encodings but we don’t know which one a particular document uses.
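As a sketch of this principle, assuming the core Encode module (which calls Windows-1252 "cp1252"), the three sample inputs below are invented for illustration. Each one decodes correctly when treated as Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One sample byte string per input encoding (made-up data).
my %input = (
    ascii  => "Nino",
    latin1 => "Ni\xF1o",            # 0xF1 is "ñ" in ISO-8859-1
    cp1252 => "\x93Ni\xF1o\x94",    # 0x93 and 0x94 are curly quotes in Windows-1252
);

# Decode everything as Windows-1252, whatever its real origin.
my %text;
for my $name (keys %input) {
    $text{$name} = decode('cp1252', $input{$name});
}

# Print the resulting code points.
for my $name (sort keys %text) {
    my @points = map { sprintf('U+%04X', ord($_)) } split //, $text{$name};
    print "$name: @points\n";
}
```

The bytes 0x80 to 0x9F are control codes in ISO-8859-1, which is why the shortcut only holds for printable content.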
Chapter 4: UTF-8. A Brave New World
4 - UTF-8. A Brave New World
It supports every language you’ll probably ever need.
No need for Windows-1252 this and Windows-1253 that.
It covers the whole Unicode code point range, from 0x00 to 0x10FFFF.
It uses a variable-length encoding of 1 to 4 bytes per character.
1-byte UTF-8 is used for code points in the range 0x00 to 0x7F.
1-byte UTF-8 == ASCII
MSBit is 0
code point == representation
Examples of 1-byte UTF-8:
“A” -> 0100 0001
“&” -> 0010 0110
“5” -> 0011 0101
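These bit patterns can be reproduced with a few lines of Perl. This is only a sketch: for code points up to 0x7F the single UTF-8 byte is the code point itself, so ord and a binary format are enough.

```perl
use strict;
use warnings;

# For code points up to 0x7F the UTF-8 byte equals the code point,
# so printing the code point in binary shows the encoded byte.
my %bits = map { ($_ => sprintf('%08b', ord($_))) } ('A', '&', '5');

print "$_ -> $bits{$_}\n" for sort keys %bits;
```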
2-byte UTF-8 is used for code points in the range 0x0080 to 0x07FF.
2-byte UTF-8: code point != representation
The code point, written as an 11-bit number, is broken apart into two pieces.
The five MSBits of the code point are assigned to the first byte and the six LSBits are assigned to the second byte.
For the first byte of 2-byte UTF-8:
The three MSBits are set to 110.
The remaining five bits are the five MSBits of the code point.
For the second byte of 2-byte UTF-8:
The two MSBits are set to 10.
The remaining six bits are the six LSBits of the code point.
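The 2-byte rule can be sketched by hand and compared against the core Encode module. The function name utf8_2byte is made up for this example, and "ñ" (code point 0xF1) is just a convenient test value.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a code point in 0x80..0x7FF as two UTF-8 bytes.
sub utf8_2byte {
    my ($cp) = @_;
    die "code point is outside the 2-byte range" if $cp < 0x80 || $cp > 0x7FF;
    my $first  = 0xC0 | ($cp >> 6);      # 110xxxxx: marker plus the five MSBits
    my $second = 0x80 | ($cp & 0x3F);    # 10xxxxxx: marker plus the six LSBits
    return pack('C2', $first, $second);
}

my $by_hand = utf8_2byte(0xF1);              # "ñ"
my $by_perl = encode('UTF-8', chr(0xF1));    # the same character via Encode

print uc(unpack('H*', $by_hand)), "\n";      # C3B1
print $by_hand eq $by_perl ? "same bytes\n" : "different bytes\n";
```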
3-byte UTF-8 is used for code points in the range 0x0800 to 0xFFFF.
3-byte UTF-8: code point != representation
The code point, written as a 16-bit number, is broken apart into three pieces.
The four MSBits of the code point are assigned to the first byte.
The middle six bits are assigned to the second byte.
The six LSBits are assigned to the third byte.
For the first byte of 3-byte UTF-8:
The four MSBits are set to 1110.
The remaining four bits are the four MSBits of the code point.
For the second byte of 3-byte UTF-8:
The two MSBits are set to 10.
The remaining six bits are the six middle bits of the code point.
For the third byte of 3-byte UTF-8:
The two MSBits are set to 10.
The remaining six bits are the six LSBits of the code point.
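Reading the same layout in the other direction recovers the code point from the bytes. This is a sketch with a made-up helper name, decode_3byte. It only checks the marker bits, not every rule a full decoder would enforce.

```perl
use strict;
use warnings;

# Hypothetical helper: recover the code point from one 3-byte UTF-8 sequence.
sub decode_3byte {
    my ($bytes) = @_;
    die "need exactly three bytes" unless length($bytes) == 3;
    my ($b1, $b2, $b3) = unpack('C3', $bytes);
    die "not a 3-byte sequence"
        unless ($b1 & 0xF0) == 0xE0 && ($b2 & 0xC0) == 0x80 && ($b3 & 0xC0) == 0x80;
    # Strip the marker bits and glue the 4 + 6 + 6 payload bits back together.
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

printf "U+%04X\n", decode_3byte("\xE2\x82\xAC");   # the euro sign, U+20AC
```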
4-byte UTF-8 is used for code points in the range 0x10000 to 0x10FFFF.
4-byte UTF-8: code point != representation
The code point, written as a 21-bit number, is broken apart into four pieces.
The three MSBits of the code point are assigned to the first byte.
The next six bits are assigned to the second byte.
The six bits after those are assigned to the third byte.
The six LSBits are assigned to the fourth byte.
For the first byte of 4-byte UTF-8:
The five MSBits are set to 11110.
The remaining three bits are the three MSBits of the code point.
For the second byte of 4-byte UTF-8:
The two MSBits are set to 10.
The remaining six bits are the next six bits of the code point.
For the third byte of 4-byte UTF-8:
The two MSBits are set to 10.
The remaining six bits are the six bits after those.
For the fourth byte of 4-byte UTF-8:
The two MSBits are set to 10.
The remaining six bits are the six LSBits of the code point.
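Putting the four cases together gives a small encoder. This is a sketch, not a replacement for Encode: the name utf8_bytes is made up, and unlike a real encoder it does not reject the surrogate range 0xD800 to 0xDFFF. The loop compares it with Encode for one code point per length.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one code point (0x00..0x10FFFF) as UTF-8 bytes.
# Simplification: surrogates (0xD800..0xDFFF) are not rejected here.
sub utf8_bytes {
    my ($cp) = @_;
    die sprintf('code point 0x%X is out of range', $cp) if $cp < 0 || $cp > 0x10FFFF;
    return pack('C', $cp) if $cp <= 0x7F;
    return pack('C2', 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp <= 0x7FF;
    return pack('C3', 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F)) if $cp <= 0xFFFF;
    return pack('C4', 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xF1, 0x20AC, 0x1F600) {
    my $mine   = utf8_bytes($cp);
    my $theirs = encode('UTF-8', chr($cp));
    printf "U+%04X -> %s (%s)\n", $cp, uc(unpack('H*', $mine)),
        $mine eq $theirs ? 'matches Encode' : 'differs from Encode';
}
```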
Chapter 5: Perl & UTF-8
5 - Perl & UTF-8
If you want to create strings in your Perl code that can be written out as UTF-8, all you have to do is write each non-ASCII character with the following notation, where codepoint is the code point in hexadecimal:
\x{codepoint}
For example, to create the string “niño”:
my $str = "ni\x{f1}o";
To write this string to STDOUT as UTF-8 you might do this:
binmode STDOUT, ":utf8";
print $str;
To undo it, so that STDOUT goes back to writing raw bytes, do this:
binmode STDOUT;
print $str;
Or to write UTF-8 data to disk, you could do this:
open(OFILE, ">:utf8", $filename) or die "Cannot open $filename: $!";
print OFILE $str;
To read UTF-8 data from disk, you could do this:
open(IFILE, "<:utf8", $filename) or die "Cannot open $filename: $!";
my $str = <IFILE>;
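To convert Windows-1252 to UTF-8, you could do something like this:
use Text::Iconv;
use Encode;
# Convert the Windows-1252 bytes to UTF-8 bytes.
my $utf8_str = Text::Iconv->new("WINDOWS-1252", "UTF-8")->convert($str);
# Tell Perl that those bytes are UTF-8 so it treats them as characters.
Encode::_utf8_on($utf8_str);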
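An alternative sketch that uses only core modules. This is a substitute technique, not the approach shown above: Encode::decode turns the Windows-1252 bytes into a character string, and the ":encoding(UTF-8)" layer handles the conversion on the way to and from disk. The temporary file and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

my $cp1252_bytes = "ni\xF1o";                    # Windows-1252 input bytes
my $str = decode('cp1252', $cp1252_bytes);       # now a character string

# Write it as UTF-8; the temporary file stands in for a real output file.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);
open(my $out, '>:encoding(UTF-8)', $filename) or die "Cannot write $filename: $!";
print {$out} $str;
close($out);

# Read it back as characters, then once more as raw bytes.
open(my $in, '<:encoding(UTF-8)', $filename) or die "Cannot read $filename: $!";
my $round_trip = <$in>;
close($in);

open(my $raw, '<:raw', $filename) or die "Cannot read $filename: $!";
my $on_disk = <$raw>;
close($raw);

printf "characters: %d, bytes on disk: %d\n", length($round_trip), length($on_disk);
```

The string has four characters but takes five bytes on disk, because "ñ" is stored as the two bytes C3 B1.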
Chapter 6: Making the Browser Happy
6 - Making the Browser Happy
All the efforts up to now will be for naught if the browser doesn’t
understand how the page is encoded.
To make the browser aware of the encoding of the data, either send this HTTP response header:
Content-Type: text/html; charset=utf-8
or, if you want to tag each document:
For XML, add this declaration at the top of the document:
<?xml version="1.0" encoding="utf-8"?>
For HTML, add this declaration at the top of the <head> section of the document:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
For XHTML, add this declaration at the top of the <head> section of the document:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
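A sketch of how a CGI-style Perl script might combine the header and the HTML declaration. The helper name utf8_response and the page content are invented for this example. The body is encoded explicitly so that the bytes sent match the declared charset.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a complete CGI-style response as bytes.
sub utf8_response {
    my ($body_text) = @_;
    my $html = qq{<html><head>}
             . qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">}
             . qq{</head><body>$body_text</body></html>\n};
    # The header is ASCII; the document is encoded to UTF-8 bytes.
    return "Content-Type: text/html; charset=utf-8\r\n\r\n" . encode('UTF-8', $html);
}

binmode STDOUT;                       # the bytes go out untouched
print utf8_response("ni\x{F1}o");
```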
Chapter 7: Encoding Hell
7 - Encoding Hell
So now we think we understand UTF-8, and we think we understand how to process this data in Perl, but there is still SO MUCH OPPORTUNITY for things to go wrong!
The Byte Order Mark (code point 0xFEFF) is one of them.
The intention is probably good but it can cause much grief.
The solution is to cut out the byte sequence EF BB BF, which is the UTF-8 encoding of the BOM, from the beginning of the document.
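Cutting the BOM out can be sketched as a single substitution on the raw bytes. The helper name strip_bom is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: drop a leading UTF-8 BOM from a string of raw bytes.
sub strip_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;    # only at the very start of the data
    return $bytes;
}

my $with_bom = "\xEF\xBB\xBFhello";
my $clean    = strip_bom($with_bom);
printf "%d bytes before, %d bytes after\n", length($with_bom), length($clean);
```

This assumes the data is still undecoded bytes. If it has already been decoded into characters, the BOM shows up as the single character "\x{FEFF}" instead.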
Encoded Gibberish
(It takes several forms.)
All Gibberish
If it’s all gibberish then maybe the data is OK but you’re looking at it through the wrong pair of glasses. Change the document’s encoding declaration, or try changing your browser’s or application’s encoding setting.
Partially Gibberish
(Two Cases)
First Case: What does it look like?
Niño vs Ni?o
Niño vs Ni o
You likely have the dreaded “mixed encoding” nightmare. Probably someone has poured ISO-8859-1 or Windows-1252 into a UTF-8 document or vice versa. You will need to figure out which bytes are which and clean the document up to make it pure UTF-8.
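One way to sort out which bytes are which is a heuristic sketch like the following, which is an assumption of this example and not a complete solution. Anything that looks like a well-formed multi-byte UTF-8 sequence is decoded as UTF-8, and any other non-ASCII byte is treated as Windows-1252. The helper name fix_mixed is made up, and the pattern is simplified (it does not exclude overlong forms or surrogates).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simplified pattern for a multi-byte UTF-8 sequence.
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | [\xE0-\xEF][\x80-\xBF]{2}
    | [\xF0-\xF4][\x80-\xBF]{3}
/x;

# Hypothetical helper: turn bytes that mix UTF-8 and Windows-1252
# into one clean character string.
sub fix_mixed {
    my ($bytes) = @_;
    my $out = '';
    while (length $bytes) {
        if ($bytes =~ s/\A([\x00-\x7F]+)//) {
            $out .= $1;                                   # plain ASCII
        }
        elsif ($bytes =~ s/\A($utf8_seq)//) {
            my $seq = $1;
            $out .= decode('UTF-8', $seq);                # looks like UTF-8
        }
        else {
            my $byte = substr($bytes, 0, 1, '');          # take one stray byte
            $out .= decode('cp1252', $byte);              # assume Windows-1252
        }
    }
    return $out;
}

# "Niño" as UTF-8 followed by "niño" as Windows-1252 in the same document.
my $fixed = fix_mixed("Ni\xC3\xB1o y ni\xF1o");
print join(' ', map { sprintf('U+%04X', ord($_)) } split //, $fixed), "\n";
```

The heuristic can guess wrong: a genuine Windows-1252 pair such as "Ã±" is indistinguishable from UTF-8 "ñ", so the result still deserves a look.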
Second Case: What does it look like?
niÃ±o (viewed in UTF-8 mode)
niÃƒÂ±o (viewed in Windows-1252 mode)
You likely have the double encoding problem. Sometimes some of the data gets encoded as UTF-8 twice! Again, you’ll need to look at the bytes and fix it.
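When the whole string was encoded twice, the fix can be sketched as undoing one layer. This assumes the second encoding pass treated the UTF-8 bytes as ISO-8859-1 characters. If they were treated as Windows-1252 instead, 'cp1252' would take the place of 'ISO-8859-1' below.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "niño" encoded as UTF-8 twice: ñ became C3 B1, then C3 83 C2 B1.
my $double = "ni\xC3\x83\xC2\xB1o";

# Undo the outer layer: decode once, then map each character back to one byte.
my $once = encode('ISO-8859-1', decode('UTF-8', $double));   # bytes 6E 69 C3 B1 6F

# What remains is ordinary UTF-8.
my $str = decode('UTF-8', $once);

printf "%s -> %s -> U+%04X\n",
    uc(unpack('H*', $double)), uc(unpack('H*', $once)), ord(substr($str, 2, 1));
```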
Now some odds and ends…
HTML::Entities::decode_entities doesn’t always do what you think. Sometimes it returns ISO-8859-1 instead of UTF-8. Caveat programmer!
Be careful if you’re using the encode or decode routines from Encode.pm: they may not set the string’s UTF-8 flag appropriately.
As a checklist of sorts, when you’re debugging make sure that:
The data has been encoded properly.
The data has been flagged as UTF-8.
The data has been written out properly.
The document has the appropriate encoding declaration.
Your terminal or browser has been set to the correct encoding.
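For the first two items on the checklist, a small inspection routine can help. This is a sketch with a made-up helper name, describe. It reports Perl's internal UTF-8 flag (via the built-in utf8::is_utf8), the code points Perl sees, and the bytes the string would produce if written out as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: describe a string for debugging.
sub describe {
    my ($str) = @_;
    my $flag  = utf8::is_utf8($str) ? 'on' : 'off';
    my $chars = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
    my $bytes = join ' ', map { sprintf('%02X', ord($_)) } split //, encode('UTF-8', $str);
    return "flag=$flag chars=[$chars] utf8=[$bytes]";
}

print describe("ni\x{F1}o"), "\n";      # four characters, as intended
print describe("ni\xC3\xB1o"), "\n";    # UTF-8 bytes mistaken for characters
```

If the second form turns up in your data, writing it through a UTF-8 layer produces the double encoding described earlier.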
Conclusion
The transition from traditional encodings to UTF-8 is not easy to navigate, but with perseverance it is doable. We have covered the common encodings, how to process our data in this environment and how to tackle the common issues that might arise.