Unicode Normalize Engine
Submitted by: Jose Yallouz
Shlomi Ben-Shabat
Supervisor: Maxim Gurevich
Agenda
Project Goals
Background
Preliminary Examination
Unicode Normalize Design
Application Analyses
Summary and Conclusion
Project Goals
Recognition of web pages’ encoding.
Translation of web pages to UTF-8.
Normalization of the web to a single encoding standard: UTF-8.
Background - Definitions
Character Set – collection of characters that can be represented.
Character Encoding – bit representation of a character set.
Unicode – a character set that covers the characters of most of the world's writing systems.
UTF-8 – a character encoding of Unicode, widely used on the web.
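To make the difference between a character set and a character encoding concrete, here is a minimal Java sketch. The class name EncodingDemo and the helper hex are illustrative and not part of the project; the example only shows that one character (é, U+00E9) gets different byte representations in two encodings.

```java
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Formats bytes as space-separated uppercase hex, e.g. "C3 A9". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00E9"; // the single character é
        // Same character, two encodings, two different byte sequences.
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // E9
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 A9
    }
}
```

A page whose bytes are read with the wrong encoding shows the wrong characters, which is why the encoding has to be recognized before translation.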
Recognizing Encodings
HTML meta tag: <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
HTTP protocol: Content-Type: text/html; charset=windows-1255
BOM (byte order mark) tag: the bytes EF BB BF at the start of a UTF-8 file.
Auto detection – based on Firefox.
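The first three recognition methods can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the project's implementation: the class EncodingHints, its method names, and the regular expression are all hypothetical. The same charset pattern is applied to an HTTP Content-Type value and to a meta tag; a real recognizer would restrict the search to the relevant header or tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingHints {
    // Matches charset=<name>, allowing spaces around '=' and an optional quote.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** True if the data starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns the declared charset name, or null if none is found. */
    static String charsetFrom(String headerOrMetaTag) {
        Matcher m = CHARSET.matcher(headerOrMetaTag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("Content-Type: text/html; charset =windows-1255"));
        System.out.println(charsetFrom(
                "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=Shift_JIS\">"));
    }
}
```

Auto-detection is not sketched here because it relies on statistical analysis of the page bytes rather than on a declared label.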
Preliminary Examination System
Test set: the first 100 Google search results, for all languages supported by Google.
Goals:
Success rate of each recognition method
Contradiction cases
Encodings supported by Java
Examination Results
The BOM tag is very reliable.
When the HTTP header and the meta tag contradict each other, the HTTP header is mostly correct.
Auto-detection is very reliable when recognizing UTF-8.
For encodings other than UTF-8, auto-detection is reliable only when a language indication is given.
[Figure: system overview. Inputs: URL, HTML, HTTP header. Recognition methods: BOM tag, HTTP, META, auto-detection. Stages: Decision, Translation. Output: Unicode.]
Unicode Normalize Design
Recognition System: the four methods mentioned above, combined by a heuristic decision tree.
Translation System: translates a web page into UTF-8, using the Java translation mechanism.
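As a rough illustration of the translation step, the sketch below decodes the page bytes with the recognized charset and re-encodes them as UTF-8 using the standard java.nio.charset API. The class and method names are hypothetical, and the project's Translator may well be built differently (its class diagram mentions a charsetEncoder attribute).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8TranslatorSketch {
    /**
     * Decodes raw page bytes using the recognized charset and returns the
     * same text encoded as UTF-8. Charset.forName throws an
     * IllegalArgumentException subclass if the name is illegal or unsupported.
     */
    static byte[] toUtf8(byte[] raw, String recognizedCharset) {
        Charset source = Charset.forName(recognizedCharset);
        String text = new String(raw, source); // malformed input becomes U+FFFD
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};               // é in ISO-8859-1
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");  // two bytes: C3 A9
        System.out.println(utf8.length);
    }
}
```

Which charsets beyond the standard ones are available depends on the Java runtime, which is why the preliminary examination also checked the encodings supported by Java.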
Class Diagram
UnicodeNormalizer
    +NormalizeHtml(in html_file : string, in output_file : string, in ouput_type : Output) : string
    +NormalizeUrl(in url, in file_name : string, in output_file : string, in output_type : Output) : string
    +NormalizeHttp_html(in html_file : string, in http_file : string, in output_file : string, in output_type : Output) : string
HtmlPage
    -html_file : string
    -html_string : string
    +init(in html_file : string)
    +isHtmlPage() : bool
    +getPage() : string
    +getFileName() : string
HttpHeader
    -http_header : string
    +init(in http_header : string)
    +getContentType() : string
EncodingDetector
    +init(in htmlPage : HtmlPage, in httpHeader : HttpHeader)
    +getDecision() : Encoding
«interface» Recognizer
    +Recognize() : Encoding
AutoRecognizer
    -SupportedEncoding : hash_table
    +init(in html_page : HtmlPage)
    +Recognize() : Encoding
    +SupportEncoding(in encoding : string) : bool
    +getEncodingLanguage(in encoding : string) : int
    +SetDetectiontype(in encoding : string) : void
HttpRecognizer
    +init(in html_page : HtmlPage)
    +Recognize() : Encoding
MetaTagRecognizer
    +init(in html_page : HtmlPage)
    +Recognize() : Encoding
BomRecognizer
    +init(in html_page : HtmlPage)
    +Recognize() : Encoding
refreshTagRecognizer
    +recognize() : string
Encoding
    -encodingName : string
    -chanonicalName : string
    -charset
charsetAliasTable
    -hashCharset : hash_table
    +getFixedCharsetName(in encoding : string) : string
Downloader
    +DownloadPage(in url : string, in file_name : string) : string
    +RecursiveDownloadPage(in url : string, in file_name : string) : string
Filedownloader
    +downloadFile(in urlStr : string, in fileName : string)
Translator
    -charsetEncoder
    +init()
    +translate(in html_page : HtmlPage, in encoding : Encoding) : string
Output
    +init(in fileName : string, in output_type : OutputType)
OutputType
    -html_file
    -text_file
HtmlTextInverter
    +InvertText(in text : string) : string
MetaTagConvertor
    +ConvertMetaTag(in str : string) : string
«exception» InvalidEncodingDetection
«exception» InvalidHtmlException
«exception» InvalidURLException
Recognition System
Classes: EncodingDetector, Recognizer (interface), AutoRecognizer, HttpRecognizer, MetaTagRecognizer, BomRecognizer, charsetAliasTable.
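The charsetAliasTable class maps an encoding label to a fixed charset name through a hash table. A minimal sketch of that idea follows; the class name, method name, and the two extra alias entries are illustrative assumptions, not the project's actual table. Labels that Java already knows are resolved through Charset.forName, which accepts registered aliases and ignores case.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CharsetAliasSketch {
    // Illustrative extra aliases (assumed entries, not the project's table).
    private static final Map<String, String> EXTRA = new HashMap<>();
    static {
        EXTRA.put("x-sjis", "Shift_JIS");
        EXTRA.put("iso-8859-8-i", "ISO-8859-8");
    }

    /** Returns the canonical Java charset name, or null if unsupported. */
    static String fixedCharsetName(String label) {
        String key = label.trim().toLowerCase(Locale.ROOT);
        String mapped = EXTRA.getOrDefault(key, key);
        try {
            return Charset.forName(mapped).name();
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(fixedCharsetName(" utf8 "));
        System.out.println(fixedCharsetName("latin1"));
    }
}
```

Normalizing labels this way lets the HTTP, meta tag, and auto-detection results be compared by name in the decision step.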
Decision heuristic
[Figure: decision tree that combines the four recognition methods. Tests at the nodes: BOM?; Http?; Meta?; Http==Meta?; Auto?; Auto include http?; Auto include meta?; Auto include Http or meta?; Auto==Meta?; Http==Auto?; (Http==Auto) or (meta==Auto)?; (Auto==Ascii) or (Auto==UTF-8)?. Outcomes at the leaves: UTF-8, http, Meta, Auto, null.]
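The following Java sketch shows the general shape of such a decision. It is a deliberately simplified priority order inferred from the examination results (the BOM is very reliable, HTTP usually wins over the meta tag, auto-detection is trusted for UTF-8); it does not reproduce the branching of the project's decision tree, and the class and parameter names are hypothetical.

```java
final class DecisionSketch {
    /**
     * Picks an encoding name from the four recognition results.
     *
     * @param bom  true if the page starts with the UTF-8 byte order mark
     * @param http charset from the HTTP header, or null
     * @param meta charset from the HTML meta tag, or null
     * @param auto charset from auto-detection, or null
     * @return the chosen charset name, or null when nothing was detected
     */
    static String decide(boolean bom, String http, String meta, String auto) {
        if (bom) {
            return "UTF-8";                  // the BOM is very reliable
        }
        if (http != null && http.equalsIgnoreCase(meta)) {
            return http;                     // two declarations agree
        }
        if ("UTF-8".equalsIgnoreCase(auto)) {
            return "UTF-8";                  // auto-detection is trusted for UTF-8
        }
        if (http != null) {
            return http;                     // HTTP is mostly right on conflict
        }
        if (meta != null) {
            return meta;
        }
        return auto;                         // may be null: no detection
    }
}
```

For example, decide(false, "windows-1255", "ISO-8859-8", null) returns "windows-1255", following the observation that the HTTP header is mostly correct when it contradicts the meta tag.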
Translation System
Classes: Translator, Output, OutputType, HtmlTextInverter, MetaTagConvertor.
Problems and solutions
Problem (left-to-right Hebrew): the ISO-8859-8 (visual Hebrew) encoding stores Hebrew text in visual order, so the Hebrew characters appear in inverted order relative to the reading order.
Solution: the system checks for the ISO-8859-8 encoding and, when it is detected, inverts the order of the Hebrew characters.
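A minimal sketch of such an inversion is given below. It assumes that reversing each run of Hebrew characters (together with the spaces between Hebrew words) is enough; the class and method names are hypothetical, and the project's HtmlTextInverter may handle more cases, such as digits, punctuation, and HTML markup inside a Hebrew run.

```java
final class HebrewInverterSketch {
    /** True for characters in the Unicode Hebrew block (U+0590 to U+05FF). */
    static boolean isHebrew(char c) {
        return c >= 0x0590 && c <= 0x05FF;
    }

    /** Reverses every run of Hebrew characters, leaving other text in place. */
    static String invertHebrewRuns(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            if (!isHebrew(text.charAt(i))) {
                out.append(text.charAt(i));
                i++;
                continue;
            }
            // Extend the run over Hebrew characters and the spaces between
            // them; 'end' always points at the last Hebrew character seen.
            int end = i;
            int j = i + 1;
            while (j < text.length()
                    && (isHebrew(text.charAt(j)) || text.charAt(j) == ' ')) {
                if (isHebrew(text.charAt(j))) {
                    end = j;
                }
                j++;
            }
            out.append(new StringBuilder(text.substring(i, end + 1)).reverse());
            i = end + 1;
        }
        return out.toString();
    }
}
```

The inversion is applied after the bytes have been decoded to a Java string, so it is independent of the UTF-8 encoding step.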
Translate Example
[Figure: a web page before and after translation.]
Application Analyses
Two kinds of analysis were performed on our application:
Google analysis: this analysis checks the first 100 Google results in each language Google supports, about 10,000 web pages in total. The average detection rate over all languages is about 97 percent.
Application Analyses (cont.)
ODP analysis: the Open Directory Project (ODP) is a widely distributed database of web content classified by humans. This analysis checks about 150,000 random pages from the ODP database. The average detection rate over all languages is about 92.6 percent.
Google analysis
[Figure: "Google Analysis" chart of recognized encodings per language. Vertical axis: 0 to 120. Languages: Portuguese, Japanese, Armenian, French, Swedish, Chinese (traditional), Norwegian, English, Hebrew, German, Russian, Persian, Ukrainian, Estonian, Serbian, Slovak, Polish, Vietnamese, Arabic, Belarusian, Filipino, Indonesian, Turkish, Slovenian, Hungarian, Icelandic, Romanian, Greek, Chinese (simplified), Dutch, Korean, Finnish, Czech, Esperanto, Thai, Spanish, Italian, Lithuanian, Danish, Latvian, Bulgarian, Catalan. Legend: ISO-8859-15, KOI8-R, EUC-JP, UTF-8, KOI8-U, GB2312, Shift_JIS, ISO-8859-2, windows-1251, ISO-8859-1, windows-1250, ISO-8859-4, ISO-8859-8, ISO-8859-7, GBK, ISO-8859-9, windows-1257, windows-1256, windows-1255, no-detect, windows-1254, EUC-KR, windows-1253, windows-1252.]
ODP analysis
[Figure: "ODP Analysis" chart of recognized encodings per top-level domain. Vertical axis: 0% to 100%. Domains: org, cz, nl, be, com, gov, it, fr, ru, biz, us, in, edu, za, ch, au, uk, ie, ca, gr, jp, mil, net, dk, pl, nz, info, de, se. Legend: TIS-620, US-ASCII, x-windows-874, Big5, ISO-8859-15, KOI8-R, UTF-8, EUC-JP, KOI8-U, GB2312, windows-31j, Shift_JIS, ISO-8859-2, windows-1251, windows-1250, ISO-8859-1, UTF-16, ISO-8859-6, ISO-8859-5, ISO-8859-8, GBK, ISO-8859-7, ISO-8859-9, windows-1257, windows-1256, windows-1255, no-detect, windows-1254, EUC-KR, windows-1253, windows-1252.]
Application Usage
Client usage – a client browser can use this system to display different web pages in a single encoding format, UTF-8.
Server usage – a web server can use this system to translate its stored pages into UTF-8.
Processing usage – web page processing systems, such as search engines, can use our system to convert pages into the standard Unicode encoding.
Future Project Proposals
Implementation of the application on Firefox Browser
Implementation of the application on Apache Server
Design of a new auto-detection method (based on an encoding dictionary)
Summary and Conclusion
We built an efficient system that translates a page to UTF-8.
The analyses show a success rate of 93 percent.
Implementing the application will improve the web surfing experience for millions of users all over the world.
Questions
Thank you!