Optimizing the Usage of Normalization


  • 8/3/2019 Optimizing the Usage of Normalization


    21st International Unicode Conference, Dublin, Ireland, May 2002

    Optimizing the Usage of Normalization

    Vladimir Weinstein

    [email protected]

    Globalization Center of Competency, San Jose, CA


    Introduction

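    1. The Unicode standard has multiple ways to encode equivalent strings

       NFD: r e + U+0301 s u m e + U+0301 (résumé, each accent as a separate combining mark)

       NFC: r é s u m é (résumé, each é as a single precomposed character)

    2. Accents that don't interact are put into a unique order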


    Introduction (contd.)

    Normalization provides a way to transform a string to a unique form (NFD, NFC)

    Strings that can be transformed to the same form are called canonically equivalent

    Time-critical applications need to minimize the number of passes over the text

    ICU gives a number of tools to deal with this problem

    We will use collation (language-sensitive string comparison) as an example
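    As a small illustration of canonical equivalence, the sketch below uses Python's standard unicodedata module rather than ICU (an assumption made only to keep the example self-contained). The helper name canonically_equivalent is hypothetical.

```python
import unicodedata

composed = "r\u00e9sum\u00e9"        # each é is the single code point U+00E9
decomposed = "re\u0301sume\u0301"    # each é is e followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if they share one normalized form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(composed == decomposed)                        # False: different code points
print(canonically_equivalent(composed, decomposed))  # True: same NFD form

# Accents that do not interact are put into a unique order:
# acute (class 230) and cedilla (class 202) end up in the same order either way.
print(canonically_equivalent("a\u0301\u0327", "a\u0327\u0301"))  # True
```

    Note that this simple helper makes a full normalization pass over both strings, which is exactly the cost the following techniques try to avoid.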


    Avoiding Normalization

    Force users to provide already normalized data

    The performance problem does not go away

    When the strings are processed many times, it could be beneficial to normalize them beforehand

    Forcing users to provide a specific form can be unpopular


    Check for Normalized Text

    Most strings are already in normalized form

    Quick Check is significantly faster than the full normalization

    Needs canonical class data and additional data for checking the relation between a code point and a normalization form

    Algorithm in UAX #15 Annex 8 (http://www.unicode.org/unicode/reports/tr15/#Annex8)
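    The shape of a quick check can be sketched as follows. This is a simplified version for NFD only, written against Python's unicodedata module instead of ICU's data tables (an assumption for illustration). For NFD the answer is always yes or no; the forms that can also answer "maybe" (such as NFC) are not covered. The function name quick_check_nfd is hypothetical.

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables


def quick_check_nfd(text):
    """Return True if text is already in NFD, using one pass and no allocation."""
    last_class = 0
    for ch in text:
        # Canonical class data: combining marks must be in non-decreasing order.
        cls = unicodedata.combining(ch)
        if cls != 0 and cls < last_class:
            return False
        # Relation between the code point and the form: a character with a
        # canonical decomposition can never appear in NFD. Compatibility
        # decompositions are tagged with "<...>" and do not count.
        decomp = unicodedata.decomposition(ch)
        if decomp and not decomp.startswith("<"):
            return False
        # Hangul syllables decompose algorithmically and have no table entry.
        if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
            return False
        last_class = cls
    return True


def nfd_if_needed(text):
    """Normalize only when the quick check fails."""
    return text if quick_check_nfd(text) else unicodedata.normalize("NFD", text)
```

    Since most strings are already normalized, nfd_if_needed usually returns its input unchanged after a single cheap scan.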


    Normalize Incrementally

    Instead of normalizing the whole string at once, normalize one piece at a time

    This technique is usually combined with an incremental Quick Check

    Useful for procedures with early exit, such as string comparing or scanning

    Normalizes up to the next safe point
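    A minimal sketch of incremental normalization, again using Python's unicodedata module rather than ICU. It assumes NFD as the target form and treats a character as a safe point when its decomposition begins with a character of combining class 0. The names is_safe_point, nfd_pieces, nfd_chars and canonically_equal are hypothetical.

```python
import unicodedata
from itertools import zip_longest


def is_safe_point(ch):
    """Assumed safe point for NFD: the decomposition of ch starts with a starter,
    so nothing before ch can be reordered past it."""
    first = unicodedata.normalize("NFD", ch)[0]
    return unicodedata.combining(first) == 0


def nfd_pieces(text):
    """Lazily yield NFD pieces, each one normalized up to the next safe point."""
    start = 0
    for i in range(1, len(text)):
        if is_safe_point(text[i]):
            yield unicodedata.normalize("NFD", text[start:i])
            start = i
    if start < len(text):
        yield unicodedata.normalize("NFD", text[start:])


def nfd_chars(text):
    """Flatten the pieces into a lazy stream of normalized characters."""
    for piece in nfd_pieces(text):
        yield from piece


def canonically_equal(a, b):
    """Early-exit comparison: stops normalizing at the first difference."""
    return all(x == y for x, y in zip_longest(nfd_chars(a), nfd_chars(b)))
```

    Because the generators are lazy, comparing two long strings that differ in the first few characters normalizes only the first piece of each.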


    Incremental Normalization: Example

    [Figure: the initial string (the résumé example) is run through quick check, then through incremental normalization]

    Incremental normalization: normalize just the parts that fail quick check

    Non-incremental normalization: if normalized regularly, the whole string is processed by normalization


    Optimized Concatenation

    Simple concatenation of two normalized strings can yield a string that is not normalized

    One option is to normalize the result

    Unnecessarily duplicates normalization


    Optimized Concatenation: Example

    [Figure: two ways to join the parts of the résumé example. "Concatenate then normalize" processes the whole result. "Find boundaries, concatenate and normalize up to the boundaries" processes only the text around the join.]

    It is enough to normalize the boundary parts

    Incremental normalization is used

    Much faster than redoing the whole resulting string
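    The boundary technique can be sketched as below. The sketch assumes both inputs are already in NFD and uses Python's unicodedata module, not ICU. For NFC the boundary test would have to be stricter, because some characters with combining class 0 can still compose with what precedes them. The names is_safe_point and concat_nfd are hypothetical.

```python
import unicodedata


def is_safe_point(ch):
    """Assumed safe point for NFD: the decomposition of ch starts with a starter."""
    first = unicodedata.normalize("NFD", ch)[0]
    return unicodedata.combining(first) == 0


def concat_nfd(left, right):
    """Concatenate two NFD strings, normalizing only the text around the join."""
    # Find the boundary in the right string: its first safe point.
    j = 0
    while j < len(right) and not is_safe_point(right[j]):
        j += 1
    if j == 0:
        # The right string starts at a safe point, so nothing can interact.
        return left + right
    # Find the boundary in the left string: its last safe point.
    i = len(left)
    while i > 0:
        i -= 1
        if is_safe_point(left[i]):
            break
    # Only the span between the two boundaries needs normalization.
    middle = unicodedata.normalize("NFD", left[i:] + right[:j])
    return left[:i] + middle + right[j:]


left = "a\u0301"      # a + acute (class 230), already NFD
right = "\u0327b"     # cedilla (class 202) + b, already NFD
joined = concat_nfd(left, right)
print(joined == left + right)                                 # False: marks were reordered
print(joined == unicodedata.normalize("NFD", left + right))   # True
```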


    Accepting the FCD Form

    Fast Composed or Decomposed form is a partially normalized form

    Not unique

    More lenient than NFD or NFC form

    It requires that the procedure has support for all the canonically equivalent strings on input

    It is possible to quick check the FCD format


    FCD Form: Examples

    SEQUENCE              FCD  NFC  NFD

    A-ring                Y    Y

    Angstrom              Y

    A + ring              Y         Y

    A + grave             Y         Y

    A-ring + grave        Y    Y

    A + cedilla + ring    Y         Y

    A + ring + cedilla

    A-ring + cedilla           Y


    Canonical Closure

    Preprocessing data to support the FCD form

    Ensures that if data is assigned to a sequence (or a code point) it will also be assigned to all canonically equivalent FCD sequences

    [Figure: a value X assigned to one of the following sequences is assigned to all of them]

    A-ring (U+00C5) = X

    Angstrom sign (U+212B) = X

    A + combining ring above (U+0041 U+030A) = X
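    A deliberately limited sketch of the closure idea follows. It covers only two kinds of equivalents: single code points with the same canonical decomposition, and the fully decomposed form. A real implementation would need more (for example Hangul syllables and partially composed sequences). It uses Python's unicodedata module rather than ICU, and the name canonical_closure is hypothetical.

```python
import sys
import unicodedata


def canonical_closure(table):
    """Return a copy of table in which some canonical equivalents of each key
    map to the same value. This is a build-time step, so a slow scan is fine."""
    # Index every code point that has a canonical decomposition by its NFD form.
    by_nfd = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if decomp and not decomp.startswith("<"):   # skip compatibility mappings
            by_nfd.setdefault(unicodedata.normalize("NFD", ch), []).append(ch)

    closed = dict(table)
    for key, value in table.items():
        nfd = unicodedata.normalize("NFD", key)
        closed.setdefault(nfd, value)               # fully decomposed sequence
        for equivalent in by_nfd.get(nfd, []):      # equivalent single code points
            closed.setdefault(equivalent, value)
    return closed


closed = canonical_closure({"\u00c5": "X"})         # data assigned to A-ring only
print(closed["\u212b"], closed["A\u030a"])          # X X
```

    The table grows from one entry to several, which illustrates the trade-off: a larger data set in exchange for not having to normalize FCD input at runtime.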


    Collation

    Locale-specific sorting of strings

    Relation between code points and collation elements

    Context sensitive:

        Contractions: H < Z, but CZ < CH

        Expansions: OE < Œ < OF

        Both: < or >

    See "Collation in ICU" by Mark Davis


    Collation Implementation in ICU

    Two modes of operation:

        Normalization OFF: expects the users to pass in FCD strings

        Normalization ON: accepts any strings

    Some locales require normalization to be turned on

    Canonical closure done for contractions and regular mappings

    Two important services:

        Sort key generation

        String compare function

    More about ICU at the end of the presentation


    FCD Support in Collation

    Much higher performance

    Values assigned to a code point or a contraction are equal to those for its FCD canonically equivalent sequences

    This process is time consuming, but it is done at build time

    May increase data set


    Sort Key Generation

    Whole strings are processed

    Sort keys tend to get reused, so the emphasis is on producing sort keys that are as short as possible

    Two modes of operation:

        Normalization ON: strings are quick checked and normalization is performed, if required

        Normalization OFF: depends on strings being in FCD form. The performance increases by 20% to 50%


    String Compare

    Very time critical

    Result is usually determined before fully processing both strings

    First step is binary comparison for equality

    When it fails, comparison continues from a safe spot

    [Figure: where comparison resumes after the binary comparison finds a difference]

    No need to back up: normal situation

    Must back up to the start of the contraction (example: c h vs. c z)

    Must back up to the normalization safe spot
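    The "binary comparison first, then back up to a safe spot" idea can be sketched as follows. This is not a collator: it compares NFD code points rather than collation elements, and it does not handle backing up over contractions. It assumes NFD safe points as characters whose decomposition begins with combining class 0, and uses Python's unicodedata module rather than ICU. All function names are hypothetical.

```python
import unicodedata


def is_safe_point(ch):
    """Assumed safe point for NFD: the decomposition of ch starts with a starter."""
    first = unicodedata.normalize("NFD", ch)[0]
    return unicodedata.combining(first) == 0


def at_safe_point(text, i):
    """The end of the string counts as a safe point."""
    return i >= len(text) or is_safe_point(text[i])


def compare_canonical(a, b):
    """Return -1, 0 or 1, ordering a and b by the code points of their NFD forms."""
    # Step 1: binary comparison for equality, skipping the identical prefix.
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    if i == len(a) and i == len(b):
        return 0
    # Step 2: back up to a spot that is safe in both strings.
    while i > 0 and not (at_safe_point(a, i) and at_safe_point(b, i)):
        i -= 1
    # Step 3: only the remainders need normalization.
    x = unicodedata.normalize("NFD", a[i:])
    y = unicodedata.normalize("NFD", b[i:])
    return (x > y) - (x < y)
```

    For two long strings that share a long identical prefix, the prefix is never normalized; only the tails from the safe spot onward are.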


    String Compare Continued

    Normalization ON: incremental FCD check and incremental FCD normalization if required

    Normalization OFF: assumes that the source strings are FCD

    Most locales don't require normalization to be on and thus are 20% faster by using FCD


    International Components for Unicode

    International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support

    The ICU normalization engine supports the optimizations mentioned here

    Library services accept FCD strings as input

    Wide variety of supported platforms

    Open source (X license, non-viral)

    C/C++ and Java versions

    http://oss.software.ibm.com/icu/


    Conclusion

    The presented techniques allow much faster string processing

    In the case of collation, sort key generation gets up to 50% faster than if normalizing beforehand

    String compare function becomes up to 3 times faster!

    May increase data size

    Canonical closure preprocessing takes more time to build, but pays off at runtime


    Q & A


    Summary

    Introduction

    Avoiding normalization

    Check for normalized text

    Normalize incrementally

    Concatenation of normalized strings

    Accepting the FCD form

    Implementation of collation in ICU
