8/3/2019 Optimizing the Usage of Normalization
21st International Unicode Conference, Dublin, Ireland, May 2002
Optimizing the Usage of
Normalization
Vladimir Weinstein
Globalization Center of Competency, San Jose, CA
Introduction
1. The Unicode Standard has multiple ways to encode equivalent strings
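[Figure: the word "résumé" in both forms. NFC: r é s u m é, with precomposed é (U+00E9). NFD: r e + acute s u m e + acute, with each e followed by a combining acute accent (U+0301).]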
2. Accents that don't interact are put into a unique order
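
The two points above can be illustrated with a minimal Python sketch. It assumes the standard `unicodedata` module; the helper `code_points` and the sample strings are illustrative choices, not part of the original slides.

```python
import unicodedata

def code_points(s):
    """Render a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Point 1: the same word has a composed and a decomposed encoding.
composed = "r\u00e9sum\u00e9"                        # NFC: precomposed e-acute
decomposed = unicodedata.normalize("NFD", composed)  # NFD: e + combining acute
print(code_points(composed))
print(code_points(decomposed))

# Point 2: accents that do not interact get one canonical order.
# Acute (combining class 230) typed before dot below (combining class 220).
unordered = "a\u0301\u0323"
ordered = unicodedata.normalize("NFD", unordered)
print(code_points(ordered))  # U+0061 U+0323 U+0301
```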
Introduction (contd.)
Normalization provides a way to transform a string to a unique form (NFD, NFC)
Strings that can be transformed to the same form are called canonically equivalent
Time-critical applications need to minimize the number of passes over the text
ICU gives a number of tools to deal with this problem
We will use collation (language-sensitive string comparison) as an example
Avoiding Normalization
Force users to provide already normalized data
The performance problem does not go away
When the strings are processed many times, it could be beneficial to normalize them beforehand
Forcing users to provide a specific form can be unpopular
Check for Normalized Text
Most strings are already in normalized form
Quick Check is significantly faster than full normalization
Needs canonical combining class data and additional data for checking the relation between a code point and a normalization form
Algorithm in UAX #15 Annex 8 (http://www.unicode.org/unicode/reports/tr15/#Annex8)
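
As a rough illustration of "check first, normalize only when needed", here is a minimal Python sketch. It assumes Python 3.8 or later, where `unicodedata.is_normalized` is available; the function name `ensure_nfc` is a hypothetical helper, and this is not the ICU API.

```python
import unicodedata

def ensure_nfc(text):
    """Return text in NFC, skipping the normalization pass when the check passes."""
    if unicodedata.is_normalized("NFC", text):
        return text  # common case: the string is already normalized
    return unicodedata.normalize("NFC", text)
```

When most input is already normalized, the full normalization call is reached only for the few strings that fail the check.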
Normalize Incrementally
Instead of normalizing the whole string at once,
normalize one piece at a time
This technique is usually combined with an
incremental Quick Check
Useful for procedures with early exit, such as
string comparing or scanning
Normalizes up to the next safe point
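
The idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not ICU's implementation: it targets NFD only, it treats a position as a safe point when the next character's canonical decomposition begins with a starter (combining class 0), and the names `starts_segment`, `iter_nfd`, and `canonically_equal` are hypothetical.

```python
import unicodedata
from itertools import zip_longest

def starts_segment(ch):
    """True if splitting before ch is safe for NFD (its decomposition begins with a starter)."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0]) == 0

def iter_nfd(text):
    """Lazily yield the NFD form of text, one safe segment at a time."""
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or starts_segment(text[i]):
            segment = text[start:i]
            # Incremental quick check: normalize only segments that need it.
            if not unicodedata.is_normalized("NFD", segment):
                segment = unicodedata.normalize("NFD", segment)
            yield from segment
            start = i

def canonically_equal(a, b):
    """Compare two strings for canonical equivalence with early exit."""
    return all(x == y for x, y in zip_longest(iter_nfd(a), iter_nfd(b)))
```

Because `iter_nfd` is a generator and `all` stops at the first mismatch, text after the first difference is never normalized.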
Incremental Normalization: Example
[Figure: incremental normalization of the string "résumé". The initial string is quick checked, and incremental normalization processes just the parts that fail the quick check. With non-incremental (regular) normalization, the whole string is processed by normalization.]
Optimized Concatenation
Simple concatenation of two normalized strings
can yield a string that is not normalized
One option is to normalize the result
Unnecessarily duplicates normalization
Optimized Concatenation: Example
[Figure: two ways to concatenate normalized strings. Option 1: concatenate, then normalize the whole result. Option 2: find the boundaries, concatenate, and normalize only up to the boundaries.]
It is enough to normalize the boundary parts
Incremental normalization is used
Much faster than redoing the whole resulting string
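
A minimal Python sketch of boundary-only normalization follows. To keep the boundary test simple it assumes both inputs are already in NFD, where a starter (combining class 0) marks a safe boundary; the NFC case needs a stricter boundary test and is not shown. The function name `concat_nfd` is hypothetical.

```python
import unicodedata

def concat_nfd(left, right):
    """Concatenate two strings that are each already in NFD, fixing only the seam."""
    # Boundary in right: the first starter (combining class 0).
    j = 0
    while j < len(right) and unicodedata.combining(right[j]) != 0:
        j += 1
    if j == 0:
        return left + right  # right begins with a starter: nothing to fix

    # Boundary in left: the last starter (or the start of the string).
    i = len(left)
    while i > 0:
        i -= 1
        if unicodedata.combining(left[i]) == 0:
            break

    # Only the span between the two boundaries can be out of canonical order.
    seam = unicodedata.normalize("NFD", left[i:] + right[:j])
    return left[:i] + seam + right[j:]
```

For example, `"a\u0301"` followed by `"\u0323b"` is not in canonical order after plain concatenation; only the three characters around the seam are reprocessed.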
Accepting the FCD Form
Fast Composed or Decomposed form is a
partially normalized form
Not unique
More lenient than NFD or NFC form
It requires that the procedure has support for all the canonically equivalent strings on input
It is possible to quick check the FCD format
FCD Form: Examples
SEQUENCE             FCD  NFC  NFD
A-ring               Y    Y
Angstrom             Y
A + ring             Y         Y
A + grave            Y         Y
A-ring + grave       Y    Y
A + cedilla + ring   Y         Y
A + ring + cedilla
A-ring + cedilla          Y
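
The FCD column can be reproduced with a short Python sketch. It assumes the usual formulation of the FCD test: decompose each character on its own, without reordering across characters, and require that the combining classes never decrease across a character boundary unless the next character starts with a starter. The function name `is_fcd` is hypothetical, and this is not ICU's implementation.

```python
import unicodedata

def is_fcd(text):
    """Return True if text is in FCD form (simplified check)."""
    prev_trail = 0  # combining class of the last code point seen so far
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        trail = unicodedata.combining(decomposed[-1])
        # A non-starter may not follow a mark with a higher combining class.
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = trail
    return True

print(is_fcd("\u00C5\u0300"))  # A-ring + grave
print(is_fcd("\u00C5\u0327"))  # A-ring + cedilla
```

A-ring ends in a ring (class 230), so a following grave (230) is acceptable while a following cedilla (202) is not.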
Canonical Closure
Preprocessing data to support the FCD form
Ensures that if data is assigned to a sequence (or
a code point) it will also be assigned to all
canonically equivalent FCD sequences
[Figure: canonical closure. If the value X is assigned to A-ring (U+00C5), the same value X is also assigned to the Angstrom sign (U+212B) and to A + combining ring above (U+0041 U+030A).]
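
A simplified build-time closure can be sketched in Python. This is an illustration under strong assumptions, not ICU's algorithm: it only adds the NFD form of each key and the single code points that share that NFD form, and it scans a partial, arbitrarily chosen code point range. All function names are hypothetical.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

def single_code_point_equivalents(first=0x00A0, last=0x2FFF):
    """Index decomposable code points in a (partial) range by their NFD form."""
    by_nfd = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomposed = nfd(ch)
        if decomposed != ch:
            by_nfd.setdefault(decomposed, []).append(ch)
    return by_nfd

def canonical_closure(table, by_nfd):
    """Copy each value to the equivalent sequences found in by_nfd."""
    closed = dict(table)
    for seq, value in table.items():
        key = nfd(seq)
        closed.setdefault(key, value)      # fully decomposed form
        for ch in by_nfd.get(key, ()):     # equivalent single code points
            closed.setdefault(ch, value)
    return closed

table = {"\u00C5": "X"}  # value X assigned to A-ring only
closed = canonical_closure(table, single_code_point_equivalents())
```

After the closure, `closed` also maps the Angstrom sign and A + combining ring above to X, so a lookup succeeds without normalizing the input first.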
Collation
Locale specific sorting of strings
Relation between code points and collation
elements
Context sensitive:
Contractions: H < Z, but CZ < CH
Expansions: OE < Œ < OF
Both: < or >
See Collation in ICU by Mark Davis
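
To make the contraction case concrete, here is a toy Python sketch. The weights and the single "ch" contraction are invented for illustration and are not ICU collation data; real collation uses multi-level weights.

```python
# Toy primary weights: a=10, b=20, ..., z=260 (hypothetical values).
WEIGHTS = {ch: (i + 1) * 10 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
# Contraction: "ch" is one collation element that sorts after "h" and before "i".
CONTRACTIONS = {"ch": WEIGHTS["h"] + 5}

def collation_elements(text):
    """Map text to a list of toy weights, matching contractions first."""
    text = text.lower()
    elements = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            elements.append(CONTRACTIONS[pair])
            i += 2
        else:
            elements.append(WEIGHTS.get(text[i], 0))  # unknown characters sort first
            i += 1
    return elements

print(collation_elements("H") < collation_elements("Z"))    # True
print(collation_elements("CZ") < collation_elements("CH"))  # True
```

In plain code point order "CH" sorts before "CZ"; with the contraction, "CH" is a single element that sorts after every word starting with "C" plus another letter up to "H".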
Collation Implementation in ICU
Two modes of operation:
Normalization OFF: expects the users to pass in FCD strings
Normalization ON: accepts any strings
Some locales require normalization to be turned on
Canonical closure done for contractions and regular mappings
Two important services:
Sort key generation
String compare function
More about ICU at the end of presentation
FCD Support in Collation
Much higher performance
Values assigned to a code point or a contraction
are equal to those for its FCD canonically
equivalent sequences
This process is time consuming, but it is done at
build time
May increase the size of the data set
Sort Key Generation
Whole strings are processed
Sort keys tend to get reused, so the emphasis is on producing sort keys that are as short as possible
Two modes of operation:
Normalization ON: strings are quick checked and normalization is performed, if required
Normalization OFF: depends on strings being in FCD form. The performance increases by 20% to 50%
String Compare
Very time critical
Result is usually determined before fully
processing both strings
First step is binary comparison for equality
When it fails, comparison continues from a safe
spot
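[Figure: where comparison resumes after the binary comparison fails. Normal situation: no need to back up. Difference inside a contraction (c h versus c z): must back up to the start of the contraction. Difference inside a combining sequence: must back up to the normalization safe spot.]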
String Compare Continued
Normalization ON: incremental FCD check and
incremental FCD normalization if required
Normalization OFF: assumes that the source
strings are FCD
Most locales don't require normalization to be on, and thus are 20% faster by using FCD
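
The compare strategy (binary comparison first, then back up to a safe spot) can be sketched in Python. This is a simplified stand-in, not ICU's compare function: it orders strings by the code points of their NFD form rather than by collation weights, it does not model backing up over contractions, and it normalizes the remaining tails in one call instead of incrementally. The names `is_boundary` and `compare` are hypothetical.

```python
import unicodedata

def is_boundary(text, i):
    """True if text can be split before index i without changing its NFD form."""
    if i == 0 or i >= len(text):
        return True
    return unicodedata.combining(unicodedata.normalize("NFD", text[i])[0]) == 0

def compare(a, b):
    """Return -1, 0, or 1, ordering by the NFD form of a and b."""
    # Step 1: binary comparison; skip the identical prefix without normalizing.
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    if i == len(a) == len(b):
        return 0  # binary equal

    # Step 2: back up to a normalization safe spot in both strings.
    while not (is_boundary(a, i) and is_boundary(b, i)):
        i -= 1

    # Step 3: normalize and compare only the tails.
    tail_a = unicodedata.normalize("NFD", a[i:])
    tail_b = unicodedata.normalize("NFD", b[i:])
    return (tail_a > tail_b) - (tail_a < tail_b)
```

For `"a\u0323\u0301"` and `"a\u0301\u0323"` the first difference falls inside a combining sequence, so the comparison backs up to the base letter and then finds the strings canonically equivalent.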
International Components for Unicode
International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
The ICU normalization engine supports the optimizations mentioned here
Library services accept FCD strings as input
Wide variety of supported platforms
Open source (X license, non-viral)
C/C++ and Java versions
http://oss.software.ibm.com/icu/
Conclusion
The presented techniques allow much faster string processing
In the case of collation, sort key generation gets up to 50% faster than if normalizing beforehand
String compare function becomes up to 3 times faster!
May increase data size
Canonical closure preprocessing takes more time to build, but pays off at runtime
Q & A
Summary
Introduction
Avoiding normalization
Check for normalized text
Normalize incrementally
Concatenation of normalized strings
Accepting the FCD form
Implementation of collation in ICU