BSDCONV Buganini Q Since 2009

Journey of Bsdconv

DESCRIPTION

Unicode, Charset, Encoding, Conversion, Detection, Variants

Charset & Encoding

  • Character set: a collection of characters
  • Encoding: a binary representation of those characters

Charset & Encoding

Figure: Character Sets. Sets shown: Unicode (32-bit address space), Unicode up to U+10FFFF, Unicode BMP (Basic Multilingual Plane, up to U+FFFF), GB18030, CNS11643, CP950, Latin1.

Charset & Encoding

  Character set                     Encodings
  Unicode (32-bit address space)    UTF-32, UCS4
  Unicode up to U+10FFFF            UTF-8 [1], UTF-16
  Unicode BMP (up to U+FFFF)        UCS2
  GB18030                           GB18030
  CNS11643                          CNS11643
  CP950                             CP950 (DBCS)
  Latin1                            ISO-8859-1, EBCDIC-037 [2]

[1] Could cover more, but restricted by RFC 3629.
[2] Aka IBM-37; some control characters differ from ISO-8859-1.

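Encoding: UTF-32 / UCS4

  • Fixed length: 4 bytes
  • File size is 4 times that of an ASCII text file
  • Incompatible with the C-style string convention (the encoded text contains zero bytes)
  • Endianness concern

Encoding: UCS2

  • Fixed length: 2 bytes
  • File size is 2 times that of an ASCII text file
  • Incompatible with the C-style string convention (the encoded text contains zero bytes)
  • Endianness concern
  • BMP-only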

Encoding: UTF-16

  • Variable length: 2 bytes, or 4 bytes (surrogate pairs)
  • Surrogates use U+D800..U+DFFF
  • Incompatible with the C-style string convention
  • Endianness concern

  High surrogate    110110xx xxxxxxxx
  Low surrogate     110111xx xxxxxxxx

Table: UTF-16 Structure
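
The surrogate structure above can be sketched in a few lines of Python. This is only an illustration of the bit layout, not code from bsdconv; the function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 | (offset >> 10)       # 110110 prefix + top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # 110111 prefix + low 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own big-endian UTF-16 encoder.
packed = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
assert packed == "\U0001F600".encode("utf-16-be")
```

The byte order of each 16-bit unit still has to be chosen when the pair is written out, which is the endianness concern listed on the slide.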

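Encoding: UTF-8

  • Variable length: 1~6 bytes (restricted to at most 4 bytes by RFC 3629)
  • Compatible with the C-style string convention
  • Self-synchronizing
  • Endian-neutral
  • Byte-wise sorting order = code point order

  1 byte     0xxxxxxx (ASCII)
  2 bytes    110xxxxx 10xxxxxx
  3 bytes    1110xxxx 10xxxxxx 10xxxxxx
  4 bytes    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  5 bytes    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  6 bytes    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Table: UTF-8 Structure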
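
As a rough illustration of the UTF-8 structure table, the sketch below builds the 1 to 4 byte forms by hand (the range allowed by RFC 3629) and compares the result with Python's built-in encoder. The function name `utf8_encode` is hypothetical and chosen only for this example.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the 1-4 byte UTF-8 patterns."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if code_point < 0x80:                # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:               # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:             # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:           # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("beyond U+10FFFF")


for ch in "Aα驚😀":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # agrees with the built-in codec
    print(f"U+{ord(ch):04X}", encoded.hex(" ").upper())
```

Because every continuation byte starts with 10, a decoder that lands in the middle of a character can skip forward to the next lead byte, which is what "self-synchronizing" refers to.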

Encoding: CNS11643 (全字庫)
http://www.cns11643.gov.tw/

  • Only used by the Taiwan government
  • NOT a subset of Unicode
  • Not just a charset/encoding

  Font
  Pronunciation      ㄇㄥˊ méng
  Radical            艸
  Component          艹日月
  Stroke
  Tra/Sim mapping    萌蕄

Table: Examples of some of the information provided by 全字庫 for「萌」

Encoding: CCCII

  • Variants: variant glyphs sit at different planes
  • Mostly used for library indexing

  強    21 3D 48
  彊    2D 3D 48
  强    33 3D 48

Encoding: Big5

Many incompatible variations (abusing the PUA); none of the standard tools can rule them all.
http://moztw.org/docs/big5/

  Scenario       Dominating encoding
  Microsoft      CP950
  Taiwan BBS     UAO (Unicode-at-Once)
  gov.tw         Big5-2003
  gov.hk         HKSCS (1999/2001/2004)

Special characters conflict: the second byte could be 0x5C (\), 0x7C (|) or 0x7E (~), which may have a special meaning in certain contexts.

Bsdconv: Decoding and Encoding

Alternative to iconv:

  ISO-8859-1:UTF-8
    from: ISO-8859-1
    to:   UTF-8

Figure: Basic two-phase conversion

Bsdconv: Codecs & Fallback

Optionally produce a question mark (U+003F) as replacement:

  UTF-8,3F:ASCII,3F
    from: UTF-8,3F
    to:   ASCII,3F

Figure: Fallback codec

Transliteration:

  UTF-8:CP936,CP936-TRANS,3F
    from: UTF-8
    to:   CP936,CP936-TRANS,3F

Figure: Multiple fallback codecs
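
The fallback idea (try each codec in order, and emit a replacement only when all of them fail) can be imitated with Python's standard codecs. This is a conceptual sketch, not bsdconv's implementation: the helper `encode_with_fallback` is hypothetical, and real bsdconv fallback codecs such as a transliteration table all target the same output encoding, whereas this toy simply chains whatever codec names it is given.

```python
def encode_with_fallback(text: str, codecs: list[str],
                         replacement: bytes = b"?") -> bytes:
    """Encode character by character, trying each codec in order."""
    out = bytearray()
    for ch in text:
        for name in codecs:
            try:
                out += ch.encode(name)
                break                      # first codec that works wins
            except UnicodeEncodeError:
                continue                   # fall through to the next codec
        else:
            out += replacement             # nothing matched: last resort
    return bytes(out)


print(encode_with_fallback("naïve 測試", ["ascii"]))
print(encode_with_fallback("naïve 測試", ["ascii", "iso-8859-1"]))
```

Making the replacement an explicit last entry in the chain, as the slides do with `3F`, keeps "fail on unknown input" and "substitute on unknown input" as a per-conversion choice rather than a global mode.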

Big5 5C issue (許功蓋)

BIG5:BIG5-5C,BIG5

                  Input               Output
  Big5 literal    "成功"              "成功\"
  ASCII/Hex       "\xA6\xA8\xA5\"     "\xA6\xA8\xA5\\"

BIG5-5C,BIG5:BIG5

                  Input               Output
  Big5 literal    "成功\"             "成功"
  ASCII/Hex       "\xA6\xA8\xA5\\"    "\xA6\xA8\xA5\"
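
To see why 許功蓋 are the classic problem characters, the sketch below uses Python's built-in `big5` codec (not bsdconv) to show that the second byte of each is 0x5C, the ASCII backslash. The final step doubles every 0x5C byte, which only imitates the effect shown for BIG5-5C in the table above; it is an assumption about the visible behaviour, not the codec's actual logic.

```python
for ch in "許功蓋":
    raw = ch.encode("big5")
    # The trail byte collides with the ASCII backslash.
    print(ch, raw.hex(" ").upper(), raw[1] == 0x5C)

raw = "成功".encode("big5")
print(raw)

# Naive imitation of the escaping: double every 0x5C byte so that a
# backslash-aware consumer (e.g. a C string literal) sees a literal one.
escaped = raw.replace(b"\\", b"\\\\")
print(escaped)
```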

Traditional/Simplified Chinese

  • NOT a one-to-one mapping: Traditional 乾 / 幹 / 干 vs. Simplified 干 / 干 / 干
  • Context dependent: 之後 ("after"), 夜之后 ("Queen of the Night"), 入夜之後 ("after nightfall")
  • Variants: 峰 / 峯

Project: Chvar (1/2)
https://github.com/buganini/chvar

Figure: Two-level grouping in Chvar. 签 and 簽 form one canonical group, 籖 and 籤 form another canonical group, and both canonical groups belong to one compatibility group.

            签    簽    籖    籤
  TW        簽    -     籤    -
  CN        -     签    -     籖
  CP950     簽    -     籤    -
  GB2312    -     签    ×     ×

Table: Canonical Group

            签    簽    籖    籤
  TW        簽    -     簽    簽
  CN        -     签    签    签
  CP950     簽    -     簽    簽
  GB2312    -     签    签    签

Table: Compatibility Group

Project: Chvar (2/2)
https://github.com/buganini/chvar

  Use                         Equivalence applied
  Normalization               Canonical Equivalence
  Transliteration             Converted, or Canonical Equivalence, or Compatibility Equivalence
  Fuzzy character matching    Compatibility Equivalence

Bsdconv: Phases

Traditional Chinese ⇔ Simplified Chinese:

  UTF-8:ZHTW:UTF-8
    from:  UTF-8
    inter: ZHTW
    to:    UTF-8

Figure: Conversion with inter-mapping phase

Furthermore, phrase mapping:

  UTF-8:ZHTW:ZHTW-WORDS:UTF-8
    from:  UTF-8
    inter: ZHTW
    inter: ZHTW-WORDS
    to:    UTF-8

Figure: Conversion with multiple inter-mapping phases

Unicode Casing IS complicated

  Lowercase    Uppercase
  a            A
  i            I
Table: English

  Lowercase    Uppercase
  ı            I
  i            İ
Table: Turkic

  Lowercase    Uppercase
  a            A
  à            A
Table: French

  Lowercase    Uppercase
  σ            Σ
  ς            Σ
Table: Greek

Default Case Folding
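
A few of these casing pitfalls can be observed with Python's built-in string methods, which apply the locale-independent Unicode default mappings (so the Turkic rules in the table are not applied). This is a small demonstration only and has nothing to do with bsdconv's own tables.

```python
# Greek: two lowercase sigmas share one uppercase form.
print("σ".upper(), "ς".upper())

# Lowercasing is context sensitive: a word-final capital sigma becomes ς.
print("ΟΔΟΣ".lower())

# Default (non-Turkic) mappings: dotless and dotted i are not round-tripped.
print("i".upper(), "ı".upper(), len("İ".lower()))

# Case folding is the tool for caseless comparison, not lower()/upper().
print("ß".casefold(), "ς".casefold())
print("Straße".casefold() == "STRASSE".casefold())
```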

Unicode Normalization Forms (1/2) (UAX #15)

Uses:
  • Indexing
  • Identification / security: user names, domain names

  Combining sequence             Ç  ⇔  C + ◌̧
  Ordering of combining marks    q + ◌̇ + ◌̣  ⇔  q + ◌̣ + ◌̇
  Hangul                         가  ⇔  ᄀ + ᅡ
  Singleton                      Ω (U+2126)  ⇔  Ω (U+03A9)

Table: Canonical Equivalence

Unicode Normalization Forms (2/2) (UAX #15)

  Font variants              ℌ → H
  Breaking differences       NBSP → SP
  Cursive forms              ن (positional forms) → ن
  Circled                    ① → 1
  Width, size, rotated       ｶ → カ, ︷ → {
  Superscripts/subscripts    ⁹ → 9
  Squared characters         ㍿ → 株 + 式 + 会 + 社
  Fractions                  ¾ → 3 + ⁄ + 4
  Others                     ǆ → d + z + ◌̌

Table: Compatibility Equivalence
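
Both equivalence tables can be checked against Python's `unicodedata` module, which implements the UAX #15 normalization forms. The snippet is a sanity check of the examples, written with escapes so that the exact code points are unambiguous; it does not involve bsdconv.

```python
import unicodedata as ud

# Canonical equivalence (NFC / NFD)
assert ud.normalize("NFD", "\u00C7") == "C\u0327"            # Ç -> C + cedilla
assert ud.normalize("NFD", "q\u0307\u0323") == "q\u0323\u0307"  # marks reordered
assert ud.normalize("NFD", "\uAC00") == "\u1100\u1161"       # Hangul syllable
assert ud.normalize("NFC", "\u2126") == "\u03A9"             # Ohm sign -> omega

# Compatibility equivalence (NFKC / NFKD)
assert ud.normalize("NFKC", "\u210C") == "H"                 # Fraktur H
assert ud.normalize("NFKC", "\u00A0") == " "                 # NBSP -> SP
assert ud.normalize("NFKC", "\u2460") == "1"                 # circled one
assert ud.normalize("NFKC", "\uFF76") == "\u30AB"            # halfwidth KA
assert ud.normalize("NFKC", "\u337F") == "株式会社"           # squared form
assert ud.normalize("NFKD", "\u00BE") == "3\u20444"          # 3/4 fraction
assert ud.normalize("NFKD", "\u01C6") == "dz\u030C"          # dz with caron

# The canonical forms leave compatibility characters alone.
assert ud.normalize("NFC", "\u2460") == "\u2460"
print("all normalization examples hold")
```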

Normalization for fuzzy matching

UTF-8:UPPER:UTF-8
  Input:  aăⅷDžбᾥ
  Output: AĂⅧDŽБᾭ

UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
  Input:  ¼ℌℍăDž⁹灣湾ド195082鬒 æß
  Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss

                   Composition    Decomposition
  Canonical        NFC            NFD
  Compatibility    NFKC           NFKD

Table: The four Unicode normalization forms and the transformations
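
A rough Python counterpart of the NFKD plus case-folding step is the "compatibility caseless" key from the Unicode Standard, NFKD(toCasefold(NFKD(toCasefold(NFD(X))))). The sketch below covers only that generic part; the Chinese fuzzy mapping and the kana-to-phonetic step in the conversion above are bsdconv-specific and are not reproduced here. The function name `fuzzy_key` is made up for this example.

```python
import unicodedata


def fuzzy_key(text: str) -> str:
    """Compatibility caseless key: NFKD(casefold(NFKD(casefold(NFD(x)))))."""
    step = unicodedata.normalize("NFD", text).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


for sample in ["\u210C\u210D", "\u2079", "\u00BC", "\u01C5", "Straße"]:
    print(repr(sample), "->", repr(fuzzy_key(sample)))

# Two spellings that should match under fuzzy comparison.
assert fuzzy_key("Straße") == fuzzy_key("STRASSE")
```

Combining marks survive this key (for example the caron from U+01C5), so a search that should also ignore accents needs an extra step that drops them.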

Bsdconv: Cascade

Re-encode:

  UTF-8:ISO-8859-1|BIG5:UTF-8
    from: UTF-8
    to:   ISO-8859-1
    from: BIG5
    to:   UTF-8

Figure: Cascaded conversions

  Input     Output
  ¥x¥_      台北
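
The same repair can be reproduced with Python's built-in codecs, which makes the two cascaded steps explicit: first undo the wrong ISO-8859-1 decoding to recover the raw bytes, then decode those bytes as Big5. This only mirrors the idea of the cascade and is not how bsdconv is invoked.

```python
mojibake = "¥x¥_"                       # Big5 bytes once misread as ISO-8859-1

raw = mojibake.encode("iso-8859-1")     # step 1: back to the original bytes
print(raw.hex(" ").upper())

repaired = raw.decode("big5")           # step 2: decode with the right charset
print(repaired)
```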

Bsdconv: Codec argument

Other than a question mark:

  UTF-8,ANY#0121:ASCII,ANY#21
    from: UTF-8,ANY#0121
    to:   ASCII,ANY#21

Figure: Codec argument

Or more than one character:

  UTF-8,ANY#013F.0121:ASCII,ANY#21
    from: UTF-8,ANY#013F.0121
    to:   ASCII,ANY#21

Figure: Data list separated by dot

Bsdconv: Alias

  Phase     Alias            Expands to
  from      3F               ANY#013F&ERROR
  to        3F               ANY#3F&ERROR
  from      UTF-8            ASCII,_UTF-8
  inter     NFKD             _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
  inter     NFKC             NFKD:_NFC:_NF-HANGUL-COMPOSITION
  inter     NFKD-CASEFOLD    NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
  filter    01               UNICODE

Bsdconv: Types

  (01) Unicode
  (02) CNS11643
  (03) Byte
  (04) Chinese components
  (1B) ANSI control sequences
  (00) Bsdconv special characters

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
  Input:  功夫不好不要艹我
  Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
  Input:  功夫不好不要艹我
  Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
  Input:  功夫不好不要艹我
  Output: pu nao yao [uh]2

Bsdconv: Flags

  • FREE - memory management
  • MARK - identifier

Look-through

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

  Entity    Unicode    UTF-8 Hex
  α         U+03B1     CEB1
  β         U+03B2     CEB2

In the traces below, each bracket is one item of internal data and its first byte is the type (01 Unicode, 03 Byte).

(1/4) ESCAPE decoder
  Input (UTF-8 literal):  \u03B1%CE%B2
  Internal data (out):    [01 03 B1] [03 CE] [03 B2]

(2/4) PASS#MARK&FOR=1,BYTE encoder
  Internal data (in):     [01 03 B1] [03 CE] [03 B2]
  Internal data (out):    [01 03 B1, MARK] CE B2

(3/4) PASS#UNMARK,UTF-8 decoder
  Internal data (in):     [01 03 B1, MARK] CE B2
  Internal data (out):    [01 03 B1] [01 03 B2]

(4/4) UTF-8 encoder
  Internal data (in):     [01 03 B1] [01 03 B2]
  Internal data (out):    CE B1 ("α") CE B2 ("β")
  Output (UTF-8 literal): αβ
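
The look-through steps can be mimicked with a small Python model: escapes become typed items, items that are already Unicode pass through untouched, and runs of raw bytes are decoded as UTF-8. This is a toy model of the data flow only. The item representation, the function names and the `%XX` byte-escape syntax are assumptions made for this sketch, and text outside escapes is simply ignored.

```python
import re

# \uXXXX yields a Unicode item, %XX yields a raw byte item (assumed syntax).
TOKEN = re.compile(r"\\u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def escape_decode(text: str) -> list[tuple[str, int]]:
    """Step 1: turn escapes into typed items ('unicode' or 'byte')."""
    items = []
    for match in TOKEN.finditer(text):
        if match.group(1):
            items.append(("unicode", int(match.group(1), 16)))
        else:
            items.append(("byte", int(match.group(2), 16)))
    return items


def look_through(items: list[tuple[str, int]]) -> str:
    """Steps 2-4: pass Unicode items as-is, decode byte runs as UTF-8."""
    out, pending = [], bytearray()
    for kind, value in items:
        if kind == "byte":
            pending.append(value)            # collect the raw byte run
        else:
            out.append(pending.decode("utf-8"))
            pending.clear()
            out.append(chr(value))           # marked item: passed through
    out.append(pending.decode("utf-8"))
    return "".join(out)


items = escape_decode(r"\u03B1%CE%B2")
print(items)
print(look_through(items))
```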

Unicode East Asian Width (1/2) (UAX #11)

Figure: Venn diagram showing the set relations for six categories: Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral.

Unicode East Asian Width (2/2) (UAX #11)

  Resolved width    Class    Example
  Narrow            N        ऊ
  Narrow            Na       A
  Narrow            H        ｶ
  Ambiguous         A        Я
  Wide              F        Ａ
  Wide              W        カ
  Wide              W        咦

Table: Examples for each character class and their resolved widths

  Na    Narrow
  N     Neutral, usually treated as Narrow
  W     Wide
  F     Fullwidth
  H     Halfwidth

Table: Width attributes
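
The width classes can be looked up with Python's `unicodedata.east_asian_width`, which returns the UAX #11 property value for a character. The sample characters mirror the table above and are written as escapes where the halfwidth and fullwidth forms would otherwise be hard to tell apart.

```python
import unicodedata

samples = [
    "\u090A",   # ऊ  Devanagari letter UU
    "A",        # ASCII letter
    "\uFF76",   # halfwidth katakana KA
    "\u042F",   # Я  Cyrillic capital YA
    "\uFF21",   # fullwidth Latin capital A
    "\u30AB",   # カ katakana KA
    "\u54A6",   # 咦
]

for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))
```

Ambiguous characters such as Я resolve to wide or narrow depending on context (for example the terminal's locale), which is why the property alone cannot give a final column count.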

String width measurement

  echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
  FULL 2
  HALF 7
  AMBI 2
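
The FULL / HALF / AMBI counts can be approximated with `unicodedata`. The bucketing below (W and F count as full, A as ambiguous, everything else as half) is an assumption chosen to match the numbers shown on the slide; bsdconv's own rules, for instance for control or combining characters, may differ. The function name `measure` is hypothetical.

```python
import unicodedata
from collections import Counter


def measure(text: str) -> dict[str, int]:
    """Count characters by resolved East Asian width bucket."""
    counts = Counter(FULL=0, HALF=0, AMBI=0)
    for ch in text:
        width = unicodedata.east_asian_width(ch)
        if width in ("W", "F"):
            counts["FULL"] += 1
        elif width == "A":
            counts["AMBI"] += 1
        else:                       # Na, H and N are all treated as half
            counts["HALF"] += 1
    return dict(counts)


# U+02CA and U+02CB are the two modifier-letter accents in the face.
print(measure("42(\u02CA_>\u02CB) 紅茶"))
```

A terminal that renders ambiguous characters as wide would need FULL * 2 + AMBI * 2 + HALF columns for this string, and one that renders them as narrow would need FULL * 2 + AMBI + HALF.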

Chinese charset encoding detection
https://github.com/buganini/chiconv

  ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

  Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

  ENCODING    SCORE    COUNT    IERR    Score(s)
  UTF-8       19       4        0       4.75
  BIG5        8        3        2       -4.0
  GBK         4        1        4       -36.0
  CCCII       36       9        0       4.0
  UTF-16LE    20       5        2       0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues:
  • Encodings without registered names, bound to fonts
  • Stored in CP1252 or UTF-8

Solution: two-pass detection
  • Detect encoding
  • Detect font family (currently not working) (high coverage in SBCS)

Algorithm ported from Khmer Converter [3]

Khmer Converter:
  • Mapping
  • Reordering: visual order vs. Unicode model
  • Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Terminal transcoding
https://github.com/buganini/bug5

Issues:
  • UAO: non-standard Big5 extension
  • Double color hack: ANSI control sequence in the middle of a DBCS character
  • Ambiguous width characters
  • luit/screen cannot help

Solution (tl;dr):
  Big5 to Unicode:
    ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  Unicode to Big5:
    UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F

Bug5 explained

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

  Entity    Unicode    UTF-8 Hex    Big5 Hex
  ★         U+2605     E29885       A1B9
  驚        U+9A5A     E9A99A       C5E5
  [         U+005B     5B           5B
  1         U+0031     31           31
  m         U+006D     6D           6D
  (NBSP)    U+00A0     C2A0         -

In the traces below, each bracket is one item of internal data and its first byte is the type (01 Unicode, 03 Byte, 1B ANSI control sequence).

(1/6) ANSI-CONTROL,BYTE decoder
  Input (Big5 literal):   ★\xC5\x1B[1m\xE5
  Internal data (out):    [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]

(2/6) BIG5-DEFRAG inter-conversion
  Internal data (in):     [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
  Internal data (out):    [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]

(3/6) BYTE,PASS#MARK&FOR=1B encoder
  Internal data (in):     [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
  Internal data (out):    A1 B9 C5 E5 [1B 5B 31 6D, MARK]

(4/6) PASS#UNMARK,BIG5 decoder
  Internal data (in):     A1 B9 C5 E5 [1B 5B 31 6D, MARK]
  Internal data (out):    [01 26 05] [01 9A 5A] [1B 5B 31 6D]

(5/6) AMBIGUOUS-PAD inter-conversion
  Internal data (in):     [01 26 05] [01 9A 5A] [1B 5B 31 6D]
  Internal data (out):    [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]

(6/6) UTF-8,PASS#FOR=1B encoder
  Internal data (in):     [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
  Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
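
The defragmentation step, in which an ANSI control sequence that splits a double-byte character is moved behind the completed character, can be sketched on raw bytes in Python. This is a simplified stand-in rather than bug5's code: the function name `big5_defrag` is hypothetical, lead bytes are assumed to be 0x81 and above, and only CSI sequences of the form ESC [ parameters final-byte are recognised.

```python
import re

# ESC [ followed by numeric parameters and one final byte, e.g. ESC[1m
CSI = re.compile(rb"\x1b\[[0-9;]*[@-~]")


def big5_defrag(data: bytes) -> bytes:
    """Move control sequences found inside a DBCS character behind it."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x81:                     # ASCII, or part of an escape
            out.append(byte)
            i += 1
            continue
        held = bytearray()                  # sequences between lead and trail
        j = i + 1
        while (match := CSI.match(data, j)):
            held += match.group()
            j = match.end()
        out.append(byte)                    # lead byte
        if j < len(data):
            out.append(data[j])             # trail byte rejoins its lead
            j += 1
        out += held                         # control sequences come after
        i = j
    return bytes(out)


fragmented = b"\xa1\xb9\xc5\x1b[1m\xe5"     # the input from step (1/6)
print(big5_defrag(fragmented))
```

Moving the sequence after the character keeps the byte stream decodable as Big5, at the cost of the colour change applying to the whole character instead of its second half.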

Bsdconv: Bindings

  Python        https://pypi.python.org/pypi/bsdconv
  Ruby          https://rubygems.org/gems/ruby-bsdconv
  Go            https://github.com/buganini/go-bsdconv
  Perl          https://github.com/buganini/perl-bsdconv
  PHP           https://github.com/buganini/php-bsdconv
  PostgreSQL    https://github.com/buganini/postgres-bsdconv
  MySQL         https://github.com/buganini/mysql-udf-bsdconv
  Irssi         https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ:
  • Text
  • File name
  • File content
  • Meta tag

Thanks

  ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other
Page 2: Journey of Bsdconv

Charset amp Encoding

Character SetCollection of charactersEncodingBinary representation

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

GB18030

CNS11643

CP950

Latin1

UTF-32 UCS4

UTF-81 UTF-16

UCS2

GB18030

CNS11643

CP950 (DBCS)

ISO-8859-1 EBCDIC-0372

1Could cover more but restricted by RFC 36292Aka IBM-37 some control characters are different from ISO-8859-1

Encoding UTF-32 UCS4

Fixed Length4 bytesFilesize = 4 for ASCII text fileIncompatible with C-style string conventionEndianness concern

Encoding UCS2

Fixed Length2 bytesFilesize = 2 for ASCII text fileIncompatible with C-style string conventionEndianness concernBMP-only

Encoding UTF-16

Variable Length2 bytes 4 bytes (Surrogate pairs)SurrogatesUsing U+D800U+DFFFIncompatible with C-style string conventionEndianness concern

110110 110111

Table UTF-16 Structure

Encoding UTF-8

Variable Length1~6 bytesCompatible with C-style string conventionSelf-synchronizingEndian-neutralSorting order = Code point order

0 (ASCII)110 101110 10 1011110 10 10 10111110 10 10 10 101111110 10 10 10 10 10

Table UTF-8 Structure

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Encoding CCCII

VariantsVariant glyph at different planeMostly used for library indexing

強 21 3D 48彊 2D 3D 48强 33 3D 48

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Bsdconv Decoding and Encoding

Alternative to iconv ISO-8859-1 UTF-8

from

toFigure Basic two phases conversion

Bsdconv Codecs amp Fallback

Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F

from

toFigure Fallback codec

Transliteration UTF-8 CP936 CP936-TRANS 3F

from

toFigure Multiple fallback codecs

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Big5 5C issue (許功蓋)

BIG5BIG5-5CBIG5 Input Output

Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

BIG5-5CBIG5BIG5 Input Output

Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

TraditionalSimplified Chinese

NOT one-to-one mappingTraditional 乾幹干

vsSimplified 干干干Context dependent之後夜之后入夜之後Variants峰峯

Project Chvar (12)httpsgithubcombuganinichvar

签簽 籖籤

Canonical group

Canonical group

Compatibility group

Figure Two level grouping in Chvar

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Project Chvar (22)httpsgithubcombuganinichvar

NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Bsdconv Phases

Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8

from

inter

toFigure Conversion with inter-mapping phase

Bsdconv Phases

Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8

from

inter

inter

toFigure Conversion with multiple inter-mapping phases

Unicode Casing

IS complicatedLowercase Uppercase

a Ai ITable English

Lowercase Uppercaseı Ii İTable Turkic

Lowercase Uppercasea Aagrave ATable French

Lowercase Uppercaseσ Σς Σ

Table Greek

Default Case Folding

Unicode Normalization Forms (12)UAX15

IndexingIdentification securityUsername Domain name

Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω

Table Canonical Equivalence

Unicode Normalization Forms (22)UAX15

Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1

Width size rotated カ カ︷

Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +

Table Compatibility Equivalence

Normalization for fuzzy matching

UTF-8UPPERUTF-8Input aăⅷDžбᾥ

Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8

Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss

Composition DecompositionCanonical NFC NFD

Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Bsdconv Codec argument

Other than question mark UTF-8 ANY0121 ASCII ANY21

from

toFigure Codec argument

Or more than one character UTF-8 ANY013F0121 ASCII ANY21

from

toFigure Data list separated by dot

Bsdconv Alias

from3FANY013FampERROR

to3FANY3FampERROR

fromUTF-8ASCII_UTF-8

interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER

interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION

interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD

filter01UNICODE

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Bsdconv Types

(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Chinese components compositionhttpsgithubcombuganinichicomp

UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我

Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8

Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8

Input 功夫不好不要艹我Output pu nao yao [uh]2

Bsdconv Flags

FREE - memory managementMARK - identifier

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Look-through (14)

u03B1CEB2Input (UTF-8 literal)

ESCAPE Decoder

01

03

B1

03

CE

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (24)

01

03

B1

03CE

03B2

Internal data

PASSMARKampFOR=1BYTEEncoder

01

03

B1

MARK

CE

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (34)

0103

B1

MARK

CE B2

Internal data

PASSUNMARKUTF-8 Decoder

01

03

B1

01

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (44)

0103

B1

01

03

B2

Internal data

UTF-8

Encoder

CE

B1

rdquoαrdquoCE

B2

rdquoβrdquo

Internal data

αβ

Output (UTF-8 literal)

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

String width measurement

echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2

Chinese charset encoding detectionhttpsgithubcombuganinichiconv

ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001

$COUNT

帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)

UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40

UTF-16LE 20 5 2 00

Khmer legacy font converterhttpsgithubcombuganinikhmerconv

IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8

SolutionTwo pass detection

Detect encodingDetect font family (currently not working)(High converage in SBCS)

Algorithm ported from Khmer Converter3

Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

3httpwwwkhmerosinfoenkhmer-converter

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

Terminal transcodinghttpsgithubcombuganinibug5

IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help

Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained (16)

⋆xC5x1B[1mxE5Input (Big5 literal)

ANSI-CONTROLBYTE Decoder

03

A1

03

B9

03

C5

1B

5B

31

6D

03

E5

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (26)

03

A1 03

B9 03

C5 1B

5B

31

6D

03E5

Internal data

BIG5-DEFRAG Inter-conversion

03

A1

03

B9

03

C5

03

E5

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (36)

03

A1 03

B9 03

C5 03

E5 1B

5B

31

6D

Internal data

BYTEPASSMARKampFOR=1BEncoder

A1

B9

C5

E5

1B

5B

31

6D

MARK

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (46)

A1 B9 C5 E5 1B

5B

31

6D

MARK

Internal data

PASSUNMARKBIG5 Decoder

0126

05

01

9A

5A

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (56)

0126

05

019A

5A

1B5B

31

6D

Internal data

AMBIGUOUS-PAD Inter-conversion

01

26

05

01

A0

01

9A

5A

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (66)

01

26

05

01A0

019A

5A

1B5B

31

6D

Internal data

UTF-8PASSFOR=1BEncoder

⋆ 驚 x1B[1m

Output (UTF-8 literal)

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bsdconv Bindings

PythonRubyGoPerlPHPhttpspypipythonorgpypibsdconvhttpsrubygemsorggemsruby-bsdconvhttpsgithubcombuganinigo-bsdconvhttpsgithubcombuganiniperl-bsdconvhttpsgithubcombuganiniphp-bsdconv

PostgreSQLMySQLhttpsgithubcombuganinipostgres-bsdconvhttpsgithubcombuganinimysql-udf-bsdconv

Irssihttpsgithubcombuganiniirssi-scriptsblobmasterirssi-bsdconvpl

Bsdconv GUIhttpsgithubcombuganinigbsdconv

Alternative to ConvertZTextFile nameFile contentMeta tag

Thanks

ESCAPEUTF-8PASSFOR=UNICODEampMARKBYTE|PASSUNMAR

KUTF-8NFCASCIIES

CAPE|

httpsgithubcombuganinibsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter amp Filter amp Scorer
  • Bindings
  • Other
Page 3: Journey of Bsdconv

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

GB18030

CNS11643

CP950

Latin1

UTF-32 UCS4

UTF-81 UTF-16

UCS2

GB18030

CNS11643

CP950 (DBCS)

ISO-8859-1 EBCDIC-0372

1Could cover more but restricted by RFC 36292Aka IBM-37 some control characters are different from ISO-8859-1

Encoding UTF-32 UCS4

Fixed Length4 bytesFilesize = 4 for ASCII text fileIncompatible with C-style string conventionEndianness concern

Encoding UCS2

Fixed Length2 bytesFilesize = 2 for ASCII text fileIncompatible with C-style string conventionEndianness concernBMP-only

Encoding UTF-16

Variable Length2 bytes 4 bytes (Surrogate pairs)SurrogatesUsing U+D800U+DFFFIncompatible with C-style string conventionEndianness concern

110110 110111

Table UTF-16 Structure

Encoding UTF-8

Variable Length1~6 bytesCompatible with C-style string conventionSelf-synchronizingEndian-neutralSorting order = Code point order

0 (ASCII)110 101110 10 1011110 10 10 10111110 10 10 10 101111110 10 10 10 10 10

Table UTF-8 Structure

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Encoding CCCII

VariantsVariant glyph at different planeMostly used for library indexing

強 21 3D 48彊 2D 3D 48强 33 3D 48

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Bsdconv Decoding and Encoding

Alternative to iconv ISO-8859-1 UTF-8

from

toFigure Basic two phases conversion

Bsdconv Codecs amp Fallback

Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F

from

toFigure Fallback codec

Transliteration UTF-8 CP936 CP936-TRANS 3F

from

toFigure Multiple fallback codecs

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Big5 5C issue (許功蓋)

BIG5BIG5-5CBIG5 Input Output

Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

BIG5-5CBIG5BIG5 Input Output

Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

TraditionalSimplified Chinese

NOT one-to-one mappingTraditional 乾幹干

vsSimplified 干干干Context dependent之後夜之后入夜之後Variants峰峯

Project Chvar (12)httpsgithubcombuganinichvar

签簽 籖籤

Canonical group

Canonical group

Compatibility group

Figure Two level grouping in Chvar

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Project Chvar (22)httpsgithubcombuganinichvar

NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Bsdconv Phases

Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8

from

inter

toFigure Conversion with inter-mapping phase

Bsdconv Phases

Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8

from

inter

inter

toFigure Conversion with multiple inter-mapping phases

Unicode Casing

IS complicatedLowercase Uppercase

a Ai ITable English

Lowercase Uppercaseı Ii İTable Turkic

Lowercase Uppercasea Aagrave ATable French

Lowercase Uppercaseσ Σς Σ

Table Greek

Default Case Folding

Unicode Normalization Forms (12)UAX15

IndexingIdentification securityUsername Domain name

Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω

Table Canonical Equivalence

Unicode Normalization Forms (22)UAX15

Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1

Width size rotated カ カ︷

Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +

Table Compatibility Equivalence

Normalization for fuzzy matching

UTF-8UPPERUTF-8Input aăⅷDžбᾥ

Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8

Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss

Composition DecompositionCanonical NFC NFD

Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode

UTF-8    ISO-8859-1    |    BIG5    UTF-8
from     to                 from    to

Figure: Cascaded conversions

Input: ¥x¥_
Output: 台北
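The same repair can be sketched with Python's built-in codecs, as an analogue of the cascade rather than a use of bsdconv itself. The first step undoes the wrong ISO-8859-1 interpretation to recover the raw bytes; the second decodes those bytes as Big5. The function name and its default arguments are choices made for this example.

```python
def reencode(data: bytes, wrong: str = "iso-8859-1", actual: str = "big5") -> bytes:
    """Repair UTF-8 text that was produced by decoding `actual` bytes as `wrong`."""
    mojibake = data.decode("utf-8")      # from: UTF-8
    raw = mojibake.encode(wrong)         # to: ISO-8859-1 (recovers the original bytes)
    repaired = raw.decode(actual)        # from: BIG5
    return repaired.encode("utf-8")      # to: UTF-8


if __name__ == "__main__":
    garbled = "\u00A5x\u00A5_".encode("utf-8")   # "¥x¥_"
    print(reencode(garbled).decode("utf-8"))
```

This only works when the wrong decoding was lossless, which holds for ISO-8859-1 because every byte value maps to a code point.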

Bsdconv Codec argument

Other than question mark

UTF-8,ANY#0121    ASCII,ANY#21
from              to

Figure: Codec argument

Or more than one character

UTF-8,ANY#013F.0121    ASCII,ANY#21
from                   to

Figure: Data list separated by dot
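Python offers a comparable hook through `codecs.register_error`, which lets an encoder substitute any replacement text for characters it cannot encode. The sketch below is an analogue of the fallback idea, not bsdconv's mechanism; the handler names `replace-bang` and `replace-qbang` are invented for this example.

```python
import codecs


def make_replacer(replacement: str):
    """Build an encode-error handler that emits `replacement` per bad character."""
    def handler(err):
        if not isinstance(err, UnicodeEncodeError):
            raise err
        # err.start..err.end may cover a run of unencodable characters.
        return (replacement * (err.end - err.start), err.end)
    return handler


# One replacement character, then a replacement made of two characters.
codecs.register_error("replace-bang", make_replacer("!"))
codecs.register_error("replace-qbang", make_replacer("?!"))


def to_ascii(text: str, errors: str) -> bytes:
    """Encode to ASCII, delegating unencodable characters to the named handler."""
    return text.encode("ascii", errors=errors)


if __name__ == "__main__":
    print(to_ascii("台北 OK", "replace-bang"))
    print(to_ascii("台北 OK", "replace-qbang"))
```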

Bsdconv Alias

Phase     Alias            Expansion
from      3F               ANY#013F&ERROR
to        3F               ANY#3F&ERROR
from      UTF-8            ASCII,_UTF-8
inter     NFKD             _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter     NFKC             NFKD:_NFC:_NF-HANGUL-COMPOSITION
inter     NFKD-CASEFOLD    NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter    01               UNICODE

Bsdconv Types

(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input:  功夫不好不要艹我
Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input:  功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input:  功夫不好不要艹我
Output: pu nao yao [uh]2

Bsdconv Flags

FREE - memory management
MARK - identifier

Look-through (1/4)

Input (UTF-8 literal): \u03B1 CE B2
ESCAPE Decoder
Internal data: [01 03 B1] [03 CE] [03 B2]

Look-through (2/4)

Internal data: [01 03 B1] [03 CE] [03 B2]
PASS#MARK&FOR=1,BYTE Encoder
Internal data: [01 03 B1 MARK] [CE] [B2]

Look-through (3/4)

Internal data: [01 03 B1 MARK] [CE B2]
PASS#UNMARK,UTF-8 Decoder
Internal data: [01 03 B1] [01 03 B2]

Look-through (4/4)

Internal data: [01 03 B1] [01 03 B2]
UTF-8 Encoder
Internal data: [CE B1] "α" [CE B2] "β"
Output (UTF-8 literal): αβ

Entity    Unicode    UTF-8 Hex
α         U+03B1     CEB1
β         U+03B2     CEB2

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
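The end result of the look-through walk-through, escaped code points and raw UTF-8 bytes arriving in the same stream and leaving as one UTF-8 string, can be reproduced in Python with a different technique. This sketch uses a regular expression over the bytes instead of marked data passing through two conversions, so it illustrates the input and output only. `unescape_mixed` is a name made up here.

```python
import re

# A literal backslash, "u", and exactly four hex digits.
ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def unescape_mixed(data: bytes) -> str:
    """Expand \\uXXXX escapes in place, leaving raw UTF-8 bytes untouched."""
    def expand(match):
        # Surrogate code points (D800..DFFF) are not handled in this sketch
        # and would raise UnicodeEncodeError here.
        return chr(int(match.group(1), 16)).encode("utf-8")
    return ESCAPE.sub(expand, data).decode("utf-8")


if __name__ == "__main__":
    # Escaped alpha followed by the raw UTF-8 bytes of beta (CE B2).
    print(unescape_mixed(b"\\u03B1\xce\xb2"))
```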

Unicode East Asian Width (1/2)
UAX #11

Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral

Figure: Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (2/2)
UAX #11

Resolved width    Character (class)
Narrow            ऊ (N), A (Na), ｶ (H)
Ambiguous         Я (A)
Wide              Ａ (F), カ (W), 咦 (W)

Table: Examples for Each Character Class and Their Resolved Widths

Na    Narrow
N     Neutral (usually treated as Narrow)
W     Wide
F     Fullwidth
H     Halfwidth

Table: Width attributes

String width measurement

echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
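A similar count can be sketched with `unicodedata.east_asian_width` from Python's standard library. The grouping is an assumption of this sketch: F and W count as FULL, A as AMBI, and everything else (Na, H, N) as HALF. The function names are invented here, and combining or control characters get no special treatment.

```python
import unicodedata


def width_counts(text: str) -> dict:
    """Count characters by East Asian Width group: FULL, HALF, AMBI."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        cls = unicodedata.east_asian_width(ch)
        if cls in ("F", "W"):
            counts["FULL"] += 1
        elif cls == "A":
            counts["AMBI"] += 1
        else:
            counts["HALF"] += 1
    return counts


def display_width(text: str, ambiguous_wide: bool = False) -> int:
    """Columns needed on a terminal; the caller decides how wide AMBI is."""
    counts = width_counts(text)
    ambi = 2 if ambiguous_wide else 1
    return counts["FULL"] * 2 + counts["HALF"] + counts["AMBI"] * ambi


if __name__ == "__main__":
    sample = "42(\u02CA_>\u02CB) 紅茶"
    print(width_counts(sample))
    print(display_width(sample), display_width(sample, ambiguous_wide=True))
```

The ambiguous characters are the reason a single width number is not enough: the same string occupies a different number of columns depending on the terminal's setting.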

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

ENCODING    SCORE    COUNT    IERR    Score(s)
UTF-8       19       4        0       4.75
BIG5        8        3        2       -4.0
GBK         4        1        4       -36.0
CCCII       36       9        0       4.0
UTF-16LE    20       5        2       0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
Encoding without registered name, bound to fonts
Stored in CP1252 or UTF-8

Solution
Two-pass detection
Detect encoding
Detect font family (currently not working) (high coverage in SBCS)

Algorithm ported from Khmer Converter [3]

Khmer Converter
Mapping
Reordering: visual order vs Unicode model
Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Terminal transcoding
https://github.com/buganini/bug5

Issues
UAO: non-standard Big5 extension
Double color hack: ANSI control sequence in the middle of DBCS
Ambiguous width characters
luit/screen cannot help

Solution (tl;dr)
Big5 to Unicode:
ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5:
UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F

Bug5 explained (1/6)

Input (Big5 literal): ★\xC5\x1B[1m\xE5
ANSI-CONTROL,BYTE Decoder
Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]

Bug5 explained (2/6)

Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
BIG5-DEFRAG Inter-conversion
Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]

Bug5 explained (3/6)

Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
BYTE,PASS#MARK&FOR=1B Encoder
Internal data: [A1] [B9] [C5] [E5] [1B 5B 31 6D MARK]

Bug5 explained (4/6)

Internal data: [A1 B9 C5 E5] [1B 5B 31 6D MARK]
PASS#UNMARK,BIG5 Decoder
Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]

Bug5 explained (5/6)

Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
AMBIGUOUS-PAD Inter-conversion
Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]

Bug5 explained (6/6)

Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
UTF-8,PASS#FOR=1B Encoder
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m

Entity    Unicode    UTF-8 Hex    Big5 Hex
★         U+2605     E29885       A1B9
驚        U+9A5A     E9A99A       C5E5
[         U+005B     5B           5B
1         U+0031     31           31
m         U+006D     6D           6D
(NBSP)    U+00A0     C2A0         -

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
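The defragmentation step can be sketched on raw bytes in Python. This is a simplified stand-in, not the BIG5-DEFRAG codec itself: it treats any byte in 0x81..0xFE as a lead byte, recognises only CSI sequences (ESC [ ... final byte), and moves a sequence found between a lead byte and its trail byte to just after the completed character. The names `CSI`, `is_lead` and `defrag` are choices made for this example.

```python
import re

# CSI sequence: ESC [, parameter bytes, intermediate bytes, one final byte.
CSI = re.compile(rb"\x1b\[[0-9;?]*[ -/]*[@-~]")


def is_lead(byte: int) -> bool:
    """Assumed lead-byte range for a double-byte character."""
    return 0x81 <= byte <= 0xFE


def defrag(data: bytes) -> bytes:
    """Move control sequences that split a double-byte character to after it."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        byte = data[i]
        if is_lead(byte):
            # Collect control sequences sitting between lead and trail byte.
            j, pending = i + 1, bytearray()
            while True:
                match = CSI.match(data, j)
                if match is None:
                    break
                pending += match.group(0)
                j = match.end()
            out.append(byte)
            if j < n:
                out.append(data[j])   # trail byte rejoins its lead byte
                j += 1
            out += pending
            i = j
        else:
            match = CSI.match(data, i)
            if match is not None:
                out += match.group(0)
                i = match.end()
            else:
                out.append(byte)
                i += 1
    return bytes(out)


if __name__ == "__main__":
    print(defrag(b"\xa1\xb9\xc5\x1b[1m\xe5").hex(" "))
```

After this step every double-byte character is contiguous, so an ordinary Big5 decoder can handle the text while the control sequences pass through unchanged.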

Bsdconv Bindings

Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv

PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv

Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
Text
File name
File content
Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other
Page 5: Journey of Bsdconv

Encoding UTF-32 UCS4

Fixed Length4 bytesFilesize = 4 for ASCII text fileIncompatible with C-style string conventionEndianness concern

Encoding UCS2

Fixed Length2 bytesFilesize = 2 for ASCII text fileIncompatible with C-style string conventionEndianness concernBMP-only

Encoding UTF-16

Variable Length2 bytes 4 bytes (Surrogate pairs)SurrogatesUsing U+D800U+DFFFIncompatible with C-style string conventionEndianness concern

110110 110111

Table UTF-16 Structure

Encoding UTF-8

Variable Length1~6 bytesCompatible with C-style string conventionSelf-synchronizingEndian-neutralSorting order = Code point order

0 (ASCII)110 101110 10 1011110 10 10 10111110 10 10 10 101111110 10 10 10 10 10

Table UTF-8 Structure

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Encoding CCCII

VariantsVariant glyph at different planeMostly used for library indexing

強 21 3D 48彊 2D 3D 48强 33 3D 48

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Bsdconv Decoding and Encoding

Alternative to iconv ISO-8859-1 UTF-8

from

toFigure Basic two phases conversion

Bsdconv Codecs amp Fallback

Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F

from

toFigure Fallback codec

Transliteration UTF-8 CP936 CP936-TRANS 3F

from

toFigure Multiple fallback codecs

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Big5 5C issue (許功蓋)

BIG5BIG5-5CBIG5 Input Output

Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

BIG5-5CBIG5BIG5 Input Output

Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

TraditionalSimplified Chinese

NOT one-to-one mappingTraditional 乾幹干

vsSimplified 干干干Context dependent之後夜之后入夜之後Variants峰峯

Project Chvar (12)httpsgithubcombuganinichvar

签簽 籖籤

Canonical group

Canonical group

Compatibility group

Figure Two level grouping in Chvar

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Project Chvar (22)httpsgithubcombuganinichvar

NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Bsdconv Phases

Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8

from

inter

toFigure Conversion with inter-mapping phase

Bsdconv Phases

Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8

from

inter

inter

toFigure Conversion with multiple inter-mapping phases

Unicode Casing

IS complicatedLowercase Uppercase

a Ai ITable English

Lowercase Uppercaseı Ii İTable Turkic

Lowercase Uppercasea Aagrave ATable French

Lowercase Uppercaseσ Σς Σ

Table Greek

Default Case Folding

Unicode Normalization Forms (12)UAX15

IndexingIdentification securityUsername Domain name

Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω

Table Canonical Equivalence

Unicode Normalization Forms (22)UAX15

Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1

Width size rotated カ カ︷

Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +

Table Compatibility Equivalence

Normalization for fuzzy matching

UTF-8UPPERUTF-8Input aăⅷDžбᾥ

Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8

Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss

Composition DecompositionCanonical NFC NFD

Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Bsdconv Codec argument

Other than question mark UTF-8 ANY0121 ASCII ANY21

from

toFigure Codec argument

Or more than one character UTF-8 ANY013F0121 ASCII ANY21

from

toFigure Data list separated by dot

Bsdconv Alias

from3FANY013FampERROR

to3FANY3FampERROR

fromUTF-8ASCII_UTF-8

interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER

interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION

interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD

filter01UNICODE

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Bsdconv Types

(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Chinese components compositionhttpsgithubcombuganinichicomp

UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我

Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8

Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8

Input 功夫不好不要艹我Output pu nao yao [uh]2

Bsdconv Flags

FREE - memory managementMARK - identifier

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Look-through (14)

u03B1CEB2Input (UTF-8 literal)

ESCAPE Decoder

01

03

B1

03

CE

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (24)

01

03

B1

03CE

03B2

Internal data

PASSMARKampFOR=1BYTEEncoder

01

03

B1

MARK

CE

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (34)

0103

B1

MARK

CE B2

Internal data

PASSUNMARKUTF-8 Decoder

01

03

B1

01

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (44)

0103

B1

01

03

B2

Internal data

UTF-8

Encoder

CE

B1

rdquoαrdquoCE

B2

rdquoβrdquo

Internal data

αβ

Output (UTF-8 literal)

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

String width measurement

echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2

Chinese charset encoding detectionhttpsgithubcombuganinichiconv

ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001

$COUNT

帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)

UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40

UTF-16LE 20 5 2 00

Khmer legacy font converterhttpsgithubcombuganinikhmerconv

IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8

SolutionTwo pass detection

Detect encodingDetect font family (currently not working)(High converage in SBCS)

Algorithm ported from Khmer Converter3

Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

3httpwwwkhmerosinfoenkhmer-converter

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

Terminal transcodinghttpsgithubcombuganinibug5

IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help

Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained

Conversion: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

Input (Big5 literal): ★\xC5\x1B[1m\xE5

Entity   Unicode   UTF-8 Hex   Big5 Hex
★        U+2605    E29885      A1B9
驚       U+9A5A    E9A99A      C5E5
[        U+005B    5B          5B
1        U+0031    31          31
m        U+006D    6D          6D
(NBSP)   U+00A0    C2A0        -

Bug5 explained (1/6): ANSI-CONTROL,BYTE (decoder)

Internal data out: 03 A1 | 03 B9 | 03 C5 | 1B 5B 31 6D | 03 E5

Bug5 explained (2/6): BIG5-DEFRAG (inter-conversion)

Internal data in:  03 A1 | 03 B9 | 03 C5 | 1B 5B 31 6D | 03 E5
Internal data out: 03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D

Bug5 explained (3/6): BYTE,PASS#MARK&FOR=1B (encoder)

Internal data in:  03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D
Internal data out: A1 | B9 | C5 | E5 | 1B 5B 31 6D (MARK)

Bug5 explained (4/6): PASS#UNMARK,BIG5 (decoder)

Internal data in:  A1 B9 C5 E5 | 1B 5B 31 6D (MARK)
Internal data out: 01 26 05 | 01 9A 5A | 1B 5B 31 6D

Bug5 explained (5/6): AMBIGUOUS-PAD (inter-conversion)

Internal data in:  01 26 05 | 01 9A 5A | 1B 5B 31 6D
Internal data out: 01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D

Bug5 explained (6/6): UTF-8,PASS#FOR=1B (encoder)

Internal data in:  01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D

Output (UTF-8 literal): ★<NBSP>驚\x1B[1m
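The two inter-conversion steps of this walk-through (moving a control sequence out of the middle of a double-byte character, then padding ambiguous-width characters with NBSP) can be imitated in plain Python. This is a rough sketch under stated assumptions, not bsdconv's implementation: the function names are hypothetical, only CSI-style sequences (`ESC [ ... final byte`) are recognised, and Python's built-in `big5` codec stands in for the BIG5 decoder.

```python
import re
import unicodedata

# CSI sequence: ESC [ parameter bytes, intermediate bytes, one final byte.
CSI = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")


def defrag_big5(data: bytes) -> bytes:
    """Move CSI sequences that split a double-byte character to just after it."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        m = CSI.match(data, i)
        if m:  # control sequence sitting between characters: copy as-is
            out += m.group(0)
            i = m.end()
        elif 0x81 <= data[i] <= 0xFE:  # Big5 lead byte
            lead, j, pending = data[i], i + 1, bytearray()
            # Collect any control sequences wedged before the trail byte.
            while (m := CSI.match(data, j)):
                pending += m.group(0)
                j = m.end()
            out.append(lead)
            if j < n:
                out.append(data[j])  # trail byte rejoins its lead byte
                j += 1
            out += pending
            i = j
        else:  # single-byte character
            out.append(data[i])
            i += 1
    return bytes(out)


def pad_ambiguous(text: str) -> str:
    """Append NBSP (U+00A0) after each ambiguous-width character."""
    return "".join(
        ch + "\u00a0" if unicodedata.east_asian_width(ch) == "A" else ch
        for ch in text
    )


raw = b"\xa1\xb9\xc5\x1b[1m\xe5"  # the Big5 input from the walk-through
fixed = defrag_big5(raw)
print(fixed)
print(pad_ambiguous(fixed.decode("big5")).encode("utf-8"))
```

The padding keeps column positions stable on terminals that draw ambiguous-width characters in one cell while the remote side assumes two.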

Bsdconv Bindings

  • Python: https://pypi.python.org/pypi/bsdconv
  • Ruby: https://rubygems.org/gems/ruby-bsdconv
  • Go: https://github.com/buganini/go-bsdconv
  • Perl: https://github.com/buganini/perl-bsdconv
  • PHP: https://github.com/buganini/php-bsdconv
  • PostgreSQL: https://github.com/buganini/postgres-bsdconv
  • MySQL: https://github.com/buganini/mysql-udf-bsdconv
  • Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ:
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other

Encoding: Big5

Many incompatible variations (abusing the PUA); no standard tool can rule them all.
http://moztw.org/docs/big5/

Scenario     Dominating encoding
Microsoft    CP950
Taiwan BBS   UAO (Unicode-at-Once)
gov.tw       Big5-2003
gov.hk       HKSCS (1999/2001/2004)

Special characters conflict: the second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.

Bsdconv Decoding and Encoding

Alternative to iconv:

ISO-8859-1:UTF-8    (from:to)

Figure: Basic two-phase conversion

Bsdconv Codecs & Fallback

Optionally produce a question mark (U+003F) as replacement:

UTF-8,3F:ASCII,3F    (from:to)

Figure: Fallback codec

Transliteration:

UTF-8:CP936,CP936-TRANS,3F    (from:to)

Figure: Multiple fallback codecs
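The fallback idea (try each codec of a phase in order, and emit a replacement only when all of them fail) can be illustrated with Python's own codecs. This is a sketch of the concept only, with a hypothetical helper name `encode_with_fallback`; it does not call bsdconv and does not reproduce its matching rules.

```python
def encode_with_fallback(text: str, codecs: list[str], replacement: bytes = b"?") -> bytes:
    """Encode character by character, trying each codec in order."""
    out = bytearray()
    for ch in text:
        for codec in codecs:
            try:
                out += ch.encode(codec)
                break  # first codec that can represent the character wins
            except UnicodeEncodeError:
                continue
        else:
            out += replacement  # no codec matched: emit the fallback (0x3F)
    return bytes(out)


print(encode_with_fallback("aé你", ["ascii"]))             # b'a??'
print(encode_with_fallback("aé你", ["ascii", "latin-1"]))  # b'a\xe9?'
```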

Big5 5C issue (許功蓋)

BIG5:BIG5-5C:BIG5
               Input               Output
Big5 literal   ”成功”              ”成功\”
ASCII/Hex      ”\xA6\xA8\xA5\”     ”\xA6\xA8\xA5\\”

BIG5-5C,BIG5:BIG5
               Input               Output
Big5 literal   ”成功\”             ”成功”
ASCII/Hex      ”\xA6\xA8\xA5\\”    ”\xA6\xA8\xA5\”
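The 5C problem is easy to reproduce: the second byte of 功 in Big5 is 0x5C, the ASCII backslash. The sketch below doubles that backslash so the bytes survive inside a C-style string literal, and then undoes it. The function names are hypothetical and the logic is an illustration of the idea, not the BIG5-5C codec itself.

```python
def is_lead(b: int) -> bool:
    """True for a Big5 lead byte."""
    return 0x81 <= b <= 0xFE


def escape_5c(data: bytes) -> bytes:
    """Add a backslash after every double-byte character whose trail byte is 0x5C."""
    out = bytearray()
    i = 0
    while i < len(data):
        out.append(data[i])
        if is_lead(data[i]) and i + 1 < len(data):
            out.append(data[i + 1])
            if data[i + 1] == 0x5C:
                out.append(0x5C)
            i += 2
        else:
            i += 1
    return bytes(out)


def unescape_5c(data: bytes) -> bytes:
    """Drop the extra backslash that follows a 0x5C trail byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        out.append(data[i])
        if is_lead(data[i]) and i + 1 < len(data):
            out.append(data[i + 1])
            i += 2
            if out[-1] == 0x5C and i < len(data) and data[i] == 0x5C:
                i += 1  # skip the escaping backslash
        else:
            i += 1
    return bytes(out)


raw = "成功".encode("big5")
print(raw)             # b'\xa6\xa8\xa5\\'
print(escape_5c(raw))  # b'\xa6\xa8\xa5\\\\'
```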

Traditional/Simplified Chinese

  • NOT a one-to-one mapping: Traditional 乾, 幹, 干 vs. Simplified 干, 干, 干
  • Context dependent: 之後, 夜之后, 入夜之後
  • Variants: 峰, 峯

Project Chvar (1/2)
https://github.com/buganini/chvar

Compatibility group: { Canonical group: 签, 簽 } { Canonical group: 籖, 籤 }

Figure: Two-level grouping in Chvar

          签    簽    籖    籤
TW        簽    -     籤    -
CN        -     签    -     籖
CP950     簽    -     籤    -
GB2312    -     签    ×     ×

Table: Canonical Group

          签    簽    籖    籤
TW        簽    -     簽    簽
CN        -     签    签    签
CP950     簽    -     簽    簽
GB2312    -     签    签    签

Table: Compatibility Group

Project Chvar (2/2)
https://github.com/buganini/chvar

  • Normalization: Canonical Equivalence
  • Transliteration: converted, or Canonical Equivalence, or Compatibility Equivalence
  • Fuzzy character matching: Compatibility Equivalence

Bsdconv Phases

Traditional Chinese ⇔ Simplified Chinese:

UTF-8:ZHTW:UTF-8    (from:inter:to)

Figure: Conversion with inter-mapping phase

Bsdconv Phases

Furthermore, phrase mapping:

UTF-8:ZHTW:ZHTW-WORDS:UTF-8    (from:inter:inter:to)

Figure: Conversion with multiple inter-mapping phases

Unicode Casing

IS complicated.

Language   Lowercase   Uppercase
English    a           A
English    i           I
Turkic     ı           I
Turkic     i           İ
French     a           A
French     à           A
Greek      σ           Σ
Greek      ς           Σ

Default Case Folding
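Python's `str` methods implement the locale-independent default case mappings, so they show both what default case folding solves (the two Greek sigmas) and what it cannot (the Turkic dotted and dotless i). A small sketch; the helper name `caseless_equal` is illustrative.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings under Unicode default case folding."""
    return a.casefold() == b.casefold()


# Both lowercase sigmas share one uppercase form, so upper() is not reversible.
print("σ".upper(), "ς".upper())    # Σ Σ
print(caseless_equal("Σ", "ς"))    # True

# Default mappings are not Turkic-aware: I lowers to i, never to dotless ı.
print("I".lower() == "i")          # True
print("ı".upper() == "I")          # True

# İ lowers to "i" plus COMBINING DOT ABOVE, i.e. two code points.
print([hex(ord(c)) for c in "İ".lower()])
```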

Unicode Normalization Forms (1/2), UAX #15

Uses:
  • Indexing
  • Identification, security: username, domain name

Subtype                       Example
Combining sequence            Ç ↔ C + ◌̧
Ordering of combining marks   q + ◌̇ + ◌̣ ↔ q + ◌̣ + ◌̇
Hangul                        가 ↔ ᄀ + ᅡ
Singleton                     Ω (U+2126) ↔ Ω (U+03A9)

Table: Canonical Equivalence

Unicode Normalization Forms (2/2), UAX #15

Subtype                    Example
Font variants              ℌ → H
Breaking differences       NBSP → SP
Cursive forms              ﻥ (U+FEE5) → ن (U+0646)
Circled                    ① → 1
Width, size, rotated       ｶ → カ, ︷ → {
Superscripts/subscripts    ⁹ → 9
Squared characters         ㍿ → 株 + 式 + 会 + 社
Fractions                  ¾ → 3 + ⁄ + 4
Others                     ǆ → d + z + ◌̌

Table: Compatibility Equivalence
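Several rows of the two tables can be checked with `unicodedata.normalize` from the Python standard library. This is only a demonstration of the equivalences themselves, independent of bsdconv's NFC/NFD/NFKC/NFKD codecs.

```python
import unicodedata as ud

# Canonical equivalence: NFC composes, NFD decomposes and reorders marks.
print(ud.normalize("NFC", "C\u0327") == "\u00c7")              # combining sequence
print(ud.normalize("NFD", "q\u0307\u0323") == "q\u0323\u0307")  # mark ordering
print(ud.normalize("NFD", "가") == "\u1100\u1161")              # Hangul
print(ud.normalize("NFC", "\u2126") == "\u03a9")                # singleton (ohm sign)

# Compatibility equivalence: only the K forms apply these mappings.
for src in ["ℌ", "\u00a0", "①", "ｶ", "⁹", "㍿", "¾", "ǆ"]:
    dst = ud.normalize("NFKD", src)
    print(f"U+{ord(src):04X} ->", " ".join(f"U+{ord(c):04X}" for c in dst))
```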

Normalization for fuzzy matching

UTF-8:UPPER:UTF-8
  Input:  aăⅷDžбᾥ
  Output: AĂⅧDŽБᾭ

UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
  Input:  ¼ℌℍăDž⁹灣湾ド195082鬒 æß
  Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss

                Composition   Decomposition
Canonical       NFC           NFD
Compatibility   NFKC          NFKD

Table: The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode:

UTF-8:ISO-8859-1|BIG5:UTF-8    (from:to | from:to)

Figure: Cascaded conversions

Input:  ¥x¥_
Output: 台北
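The cascade repairs text that was decoded with the wrong charset: the mojibake is encoded back to the original bytes as ISO-8859-1, and those bytes are then decoded as Big5. The same round trip in Python, using the built-in `iso-8859-1` and `big5` codecs as stand-ins for the bsdconv codecs:

```python
mojibake = "¥x¥_"                    # Big5 bytes wrongly shown as Latin-1

raw = mojibake.encode("iso-8859-1")  # first conversion: back to the raw bytes
print(raw)                           # b'\xa5x\xa5_'

text = raw.decode("big5")            # second conversion: decode as Big5
print(text)                          # 台北
```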

Bsdconv Codec argument

Other than a question mark:

UTF-8,ANY#0121:ASCII,ANY#21    (from:to)

Figure: Codec argument

Or more than one character:

UTF-8,ANY#013F.0121:ASCII,ANY#21    (from:to)

Figure: Data list separated by dot

Bsdconv Alias

Phase    Alias           Expansion
from     3F              ANY#013F&ERROR
to       3F              ANY#3F&ERROR
from     UTF-8           ASCII,_UTF-8
inter    NFKD            _NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER
inter    NFKC            NFKD_NFC_NF-HANGUL-COMPOSITION
inter    NFKD-CASEFOLD   NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter   01              UNICODE

Bsdconv Types

  • (01) Unicode
  • (02) CNS11643
  • (03) Byte
  • (04) Chinese components
  • (1B) ANSI control sequences
  • (00) Bsdconv special characters
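In the internal-data dumps shown on the walk-through slides, each unit starts with one of these type bytes, for example `01 26 05` for U+2605 and `03 CE` for the raw byte CE. The sketch below builds such units in Python. The layout (type byte followed by the value in big-endian order with leading zero bytes dropped) is inferred from those dumps and is an assumption, not a documented bsdconv format; the function names are hypothetical.

```python
TYPE_UNICODE = 0x01  # type byte for Unicode code points
TYPE_BYTE = 0x03     # type byte for raw bytes


def unicode_unit(code_point: int) -> bytes:
    """Type byte 01 followed by the code point, big-endian, no leading zeros."""
    length = max(1, (code_point.bit_length() + 7) // 8)
    return bytes([TYPE_UNICODE]) + code_point.to_bytes(length, "big")


def byte_unit(value: int) -> bytes:
    """Type byte 03 followed by one raw byte."""
    return bytes([TYPE_BYTE, value])


print(unicode_unit(0x2605).hex(" "))  # 01 26 05
print(unicode_unit(0x00A0).hex(" "))  # 01 a0
print(byte_unit(0xCE).hex(" "))       # 03 ce
```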

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
  Input:  功夫不好不要艹我
  Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
  Input:  功夫不好不要艹我
  Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
  Input:  功夫不好不要艹我
  Output: pu nao yao [uh]2

Bsdconv Flags

  • FREE: memory management
  • MARK: identifier

Look-through

Conversion: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

Input (UTF-8 literal): \u03B1 followed by the bytes CE B2

Entity   Unicode   UTF-8 Hex
α        U+03B1    CEB1
β        U+03B2    CEB2

Look-through (1/4): ESCAPE (decoder)

Internal data out: 01 03 B1 | 03 CE | 03 B2

Look-through (2/4): PASS#MARK&FOR=1,BYTE (encoder)

Internal data in:  01 03 B1 | 03 CE | 03 B2
Internal data out: 01 03 B1 (MARK) | CE | B2

Look-through (3/4): PASS#UNMARK,UTF-8 (decoder)

Internal data in:  01 03 B1 (MARK) | CE B2
Internal data out: 01 03 B1 | 01 03 B2

Look-through (4/4): UTF-8 (encoder)

Internal data in:  01 03 B1 | 01 03 B2
Internal data out: CE B1 (”α”) | CE B2 (”β”)

Output (UTF-8 literal): αβ
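The effect of this conversion (resolve `\uXXXX` escapes while letting bytes that are already UTF-8 pass through untouched) can be approximated in a few lines of Python. A sketch only: the name `look_through` is hypothetical, just four-digit `\u` escapes are handled, and an escaped surrogate such as `\uD800` would raise `UnicodeEncodeError`.

```python
import re

# Matches a literal backslash, "u", and four hex digits in a byte string.
ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def look_through(data: bytes) -> str:
    """Replace \\uXXXX escapes with UTF-8 bytes, then decode everything as UTF-8."""
    utf8 = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)).encode("utf-8"), data)
    return utf8.decode("utf-8")


print(look_through(b"\\u03B1\xce\xb2"))  # αβ
```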

String width measurement

echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
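The same three counters can be reproduced with the East Asian Width property from Python's `unicodedata`. This sketch assumes that the two accent-like characters in the sample are U+02CA and U+02CB and that FULL means classes W and F, AMBI means class A, and HALF means everything else; the function name `measure` is illustrative and is not how bsdconv's WIDTH codec is implemented.

```python
import unicodedata


def measure(text: str) -> dict[str, int]:
    """Count full-width, half-width and ambiguous-width characters."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        else:
            counts["HALF"] += 1
    return counts


sample = "42(\u02ca_>\u02cb) 紅茶"  # assumed code points for ˊ and ˋ
print(measure(sample))  # {'FULL': 2, 'HALF': 7, 'AMBI': 2}
```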

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

ENCODING   SCORE   COUNT   IERR   Score(s)
UTF-8      19      4       0      4.75
BIG5       8       3       2      -4.0
GBK        4       1       4      -36.0
CCCII      36      9       0      4.0
UTF-16LE   20      5       2      0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues:
  • Encodings without a registered name, bound to fonts
  • Stored in CP1252 or UTF-8

Solution: two-pass detection
  • Detect encoding
  • Detect font family (currently not working) (high coverage in SBCS)

Algorithm ported from Khmer Converter [3]

Khmer Converter:
  • Mapping
  • Reordering: visual order vs. Unicode model
  • Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Page 8: Journey of Bsdconv

Encoding UTF-8

Variable Length1~6 bytesCompatible with C-style string conventionSelf-synchronizingEndian-neutralSorting order = Code point order

0 (ASCII)110 101110 10 1011110 10 10 10111110 10 10 10 101111110 10 10 10 10 10

Table UTF-8 Structure

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Encoding CCCII

VariantsVariant glyph at different planeMostly used for library indexing

強 21 3D 48彊 2D 3D 48强 33 3D 48

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Bsdconv Decoding and Encoding

Alternative to iconv ISO-8859-1 UTF-8

from

toFigure Basic two phases conversion

Bsdconv Codecs amp Fallback

Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F

from

toFigure Fallback codec

Transliteration UTF-8 CP936 CP936-TRANS 3F

from

toFigure Multiple fallback codecs

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Big5 5C issue (許功蓋)

BIG5BIG5-5CBIG5 Input Output

Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

BIG5-5CBIG5BIG5 Input Output

Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

TraditionalSimplified Chinese

NOT one-to-one mappingTraditional 乾幹干

vsSimplified 干干干Context dependent之後夜之后入夜之後Variants峰峯

Project Chvar (12)httpsgithubcombuganinichvar

签簽 籖籤

Canonical group

Canonical group

Compatibility group

Figure Two level grouping in Chvar

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Project Chvar (22)httpsgithubcombuganinichvar

NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Bsdconv Phases

Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8

from

inter

toFigure Conversion with inter-mapping phase

Bsdconv Phases

Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8

from

inter

inter

toFigure Conversion with multiple inter-mapping phases

Unicode Casing

IS complicatedLowercase Uppercase

a Ai ITable English

Lowercase Uppercaseı Ii İTable Turkic

Lowercase Uppercasea Aagrave ATable French

Lowercase Uppercaseσ Σς Σ

Table Greek

Default Case Folding

Unicode Normalization Forms (12)UAX15

IndexingIdentification securityUsername Domain name

Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω

Table Canonical Equivalence

Unicode Normalization Forms (22)UAX15

Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1

Width size rotated カ カ︷

Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +

Table Compatibility Equivalence

Normalization for fuzzy matching

UTF-8UPPERUTF-8Input aăⅷDžбᾥ

Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8

Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss

Composition DecompositionCanonical NFC NFD

Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Bsdconv Codec argument

Other than question mark UTF-8 ANY0121 ASCII ANY21

from

toFigure Codec argument

Or more than one character UTF-8 ANY013F0121 ASCII ANY21

from

toFigure Data list separated by dot

Bsdconv Alias

from3FANY013FampERROR

to3FANY3FampERROR

fromUTF-8ASCII_UTF-8

interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER

interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION

interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD

filter01UNICODE

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Bsdconv Types

(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters

Encoding: CNS11643 (全字庫) issue
http://www.cns11643.gov.tw

  • Only used by Taiwan government
  • NOT a subset of Unicode
  • Not just a charset/encoding

Font
Pronunciation      ㄇㄥˊ, méng
Radical            艸
Component          艹日月
Stroke
Tra/Sim mapping    萌 蕄

Table: Examples of information provided by 全字庫 for「萌」

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input:  功夫不好不要艹我
Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input:  功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input:  功夫不好不要艹我
Output: pu nao yao [uh]2

Bsdconv Flags

FREE - memory management
MARK - identifier

Look-through (1/4)

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

Entity  Unicode  UTF-8 Hex
α       U+03B1   CEB1
β       U+03B2   CEB2

Input (UTF-8 literal): \u03B1 followed by the bytes CE B2
Phase: ESCAPE (decoder)
Internal data: [01 03 B1] [03 CE] [03 B2]

Look-through (2/4)

Internal data: [01 03 B1] [03 CE] [03 B2]
Phase: PASS#MARK&FOR=1,BYTE (encoder)
Internal data: [01 03 B1 MARK] CE B2

Look-through (3/4)

Internal data: [01 03 B1 MARK] CE B2
Phase: PASS#UNMARK,UTF-8 (decoder)
Internal data: [01 03 B1] [01 03 B2]

Look-through (4/4)

Internal data: [01 03 B1] [01 03 B2]
Phase: UTF-8 (encoder)
Internal data: CE B1 ("α") CE B2 ("β")
Output (UTF-8 literal): αβ

Unicode East Asian Width (1/2), UAX #11

Figure: Venn diagram showing the set relations for six categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)

Unicode East Asian Width (2/2), UAX #11

Narrow     Ambiguous   Wide
ऊ  N       Я  A        Ａ  F
A  Na                  カ  W
ｶ  H                   咦  W
Table: Examples for each character class and their resolved widths

Na  Narrow
N   Neutral, usually treated as Narrow
W   Wide
F   Fullwidth
H   Halfwidth
Table: Width attributes
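
The width attribute of each example character can be looked up with Python's `unicodedata.east_asian_width`, which returns the UAX #11 property value. A small sketch, with escapes used for the halfwidth and fullwidth forms so they survive copy and paste:

```python
import unicodedata

examples = {
    "Я": "Cyrillic capital ya",
    "ऊ": "Devanagari uu",
    "A": "ASCII capital A",
    "\uFF21": "fullwidth capital A",
    "\uFF76": "halfwidth katakana ka",
    "カ": "katakana ka",
    "咦": "CJK ideograph",
}

# Map each character to its East Asian Width class (N, Na, H, A, W or F).
classes = {ch: unicodedata.east_asian_width(ch) for ch in examples}

for ch, label in examples.items():
    print(f"{label}: {classes[ch]}")
```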

String width measurement

echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
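
A comparable count can be sketched in Python. The bucket names FULL, HALF and AMBI are borrowed from the output above; the grouping of width classes into those buckets, and the skipping of control characters, are assumptions of this sketch rather than bsdconv's documented behaviour.

```python
import unicodedata

def width_counts(text):
    """Count characters per width bucket using the UAX #11 property."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        if unicodedata.category(ch) == "Cc":
            continue  # assumption: control characters such as newline are ignored
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        else:  # N, Na and H are all treated as narrow here
            counts["HALF"] += 1
    return counts

# Same text as the command above: U+02CA and U+02CB are the two tone marks.
result = width_counts("42(\u02CA_>\u02CB) \u7D05\u8336\n")
print(result)
```

A terminal would then render ambiguous characters as one or two columns depending on its locale settings, which is why they are counted separately.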

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0
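
The idea of trying every candidate encoding and penalising decode errors can be illustrated with a much simpler stand-in. The sketch below only counts replacement characters produced by Python's built-in codecs; it is not chiconv's scorer, it has no CJK or phrase bonus, and the function name and candidate list are hypothetical.

```python
def rank_encodings(data, candidates=("utf-8", "big5", "gbk")):
    """Return (error_count, encoding) pairs, best candidate first."""
    results = []
    for name in candidates:
        # Undecodable input is replaced by U+FFFD, which we count as an error.
        text = data.decode(name, errors="replace")
        results.append((text.count("\ufffd"), name))
    return sorted(results)

ranking = rank_encodings("帥呆了".encode("utf-8"))
print(ranking)
```

Counting errors alone cannot separate encodings that both decode cleanly, which is where a frequency-based score such as the one in the table becomes necessary.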

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  • Encodings without a registered name, bound to fonts
  • Stored in CP1252 or UTF-8

Solution
  • Two-pass detection
      Detect encoding
      Detect font family (currently not working)
      (High coverage in SBCS)
  • Algorithm ported from Khmer Converter [3]

Khmer Converter
  • Mapping
  • Reordering
      Visual order vs. Unicode model
      Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Encoding: Big5

Many incompatible variations (abusing PUA); none of the standard tools can rule them all
http://moztw.org/docs/big5

Scenario     Dominating encoding
Microsoft    CP950
Taiwan BBS   UAO (Unicode-at-Once)
gov.tw       Big5-2003
gov.hk       HKSCS (1999/2001/2004)

Special characters conflict
The second byte could be 0x5C (\), 0x7C (|) or 0x7E (~), which may have special meaning in certain contexts

Terminal transcoding
https://github.com/buganini/bug5

Issues
  • UAO: non-standard Big5 extension
  • Double color hack: ANSI control sequence in the middle of DBCS
  • Ambiguous width characters
  • luit/screen cannot help

Solution (tl;dr)
Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5: UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained (1/6)

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

Entity  Unicode  UTF-8 Hex  Big5 Hex
★       U+2605   E29885     A1B9
驚      U+9A5A   E9A99A     C5E5
[       U+005B   5B         5B
1       U+0031   31         31
m       U+006D   6D         6D
(NBSP)  U+00A0   C2A0       -

Input (Big5 literal): ★\xC5\x1B[1m\xE5
Phase: ANSI-CONTROL,BYTE (decoder)
Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]

Bug5 explained (2/6)

Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Phase: BIG5-DEFRAG (inter-conversion)
Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]

Bug5 explained (3/6)

Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Phase: BYTE,PASS#MARK&FOR=1B (encoder)
Internal data: A1 B9 C5 E5 [1B 5B 31 6D MARK]

Bug5 explained (4/6)

Internal data: A1 B9 C5 E5 [1B 5B 31 6D MARK]
Phase: PASS#UNMARK,BIG5 (decoder)
Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]

Bug5 explained (5/6)

Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Phase: AMBIGUOUS-PAD (inter-conversion)
Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]

Bug5 explained (6/6)

Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Phase: UTF-8,PASS#FOR=1B (encoder)
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m

Bsdconv Bindings

Python      https://pypi.python.org/pypi/bsdconv
Ruby        https://rubygems.org/gems/ruby-bsdconv
Go          https://github.com/buganini/go-bsdconv
Perl        https://github.com/buganini/perl-bsdconv
PHP         https://github.com/buganini/php-bsdconv
PostgreSQL  https://github.com/buganini/postgres-bsdconv
MySQL       https://github.com/buganini/mysql-udf-bsdconv
Irssi       https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other
Encoding: CCCII

  • Variants: variant glyphs at different planes
  • Mostly used for library indexing

強  21 3D 48
彊  2D 3D 48
强  33 3D 48

Bsdconv Decoding and Encoding

Alternative to iconv: ISO-8859-1:UTF-8
  ISO-8859-1 (from) : UTF-8 (to)
Figure: Basic two-phase conversion
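
The two phases correspond to a decode step followed by an encode step. A minimal Python sketch of the same idea, using the built-in codecs rather than bsdconv (the sample bytes are ours):

```python
# "from" phase: bytes in the source encoding become abstract code points.
latin1_bytes = b"caf\xe9"
code_points = latin1_bytes.decode("iso-8859-1")

# "to" phase: code points are serialized in the target encoding.
utf8_bytes = code_points.encode("utf-8")

print(code_points, utf8_bytes)
```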

Bsdconv Codecs & Fallback

Optionally produce question mark (U+003F) as replacement: UTF-8,3F:ASCII,3F
  UTF-8,3F (from) : ASCII,3F (to)
Figure: Fallback codec

Transliteration: UTF-8:CP936,CP936-TRANS,3F
  UTF-8 (from) : CP936,CP936-TRANS,3F (to)
Figure: Multiple fallback codecs
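
The fallback chain can be imitated in Python: try the main codec, then a transliteration table, then a question mark. This is a sketch of the concept only; `encode_with_fallbacks` and the one-entry table are hypothetical and unrelated to the data behind CP936-TRANS.

```python
def encode_with_fallbacks(text, codec, translit):
    """Encode character by character: codec, then translit table, then '?'."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            substitute = translit.get(ch)
            if substitute is not None:
                # Assumes every substitute in the table is encodable.
                out += substitute.encode(codec)
            else:
                out += b"?"
    return bytes(out)

# Single fallback: the built-in "replace" handler emits '?'.
replaced = "abc測試".encode("ascii", errors="replace")

# Multiple fallbacks: 體 is transliterated, the emoji ends up as '?'.
chained = encode_with_fallbacks("體\U0001F600", "gb2312", {"體": "体"})

print(replaced, chained)
```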

Big5 5C issue (許功蓋)

BIG5:BIG5-5C,BIG5
              Input               Output
Big5 literal  "成功"              "成功\"
ASCII/Hex     "\xA6\xA8\xA5\"     "\xA6\xA8\xA5\\"

BIG5-5C,BIG5:BIG5
              Input               Output
Big5 literal  "成功\"             "成功"
ASCII/Hex     "\xA6\xA8\xA5\\"    "\xA6\xA8\xA5\"
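
The problem is easy to see from the raw bytes: the trail byte of 功 is 0x5C, the ASCII backslash. The sketch below shows this with Python's Big5 codec and a hypothetical `escape_5c` helper that doubles such trail bytes, as in the first table; how BIG5-5C treats standalone ASCII backslashes is not covered, and the helper leaves them alone.

```python
encoded = "成功".encode("big5")  # the last byte is 0x5C, i.e. '\'

def escape_5c(big5_bytes):
    """Append an extra 0x5C after every double-byte character ending in 0x5C."""
    out = bytearray()
    i = 0
    while i < len(big5_bytes):
        lead = big5_bytes[i]
        if lead >= 0x81 and i + 1 < len(big5_bytes):
            trail = big5_bytes[i + 1]
            out += bytes([lead, trail])
            if trail == 0x5C:
                out.append(0x5C)  # protect the trail byte from string unescaping
            i += 2
        else:
            out.append(lead)  # single-byte character, copied unchanged
            i += 1
    return bytes(out)

escaped = escape_5c(encoded)
print(encoded, escaped)
```

With the doubled byte, a C-style string literal or shell that consumes one backslash leaves a valid Big5 sequence behind.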

Traditional/Simplified Chinese

  • NOT one-to-one mapping
      Traditional  乾 幹 干
        vs
      Simplified   干 干 干
  • Context dependent
      之後, 夜之后, 入夜之後
  • Variants
      峰 峯
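
A tiny Python sketch makes the many-to-one problem concrete. The table holds only the three characters from the slide and is not a real conversion table.

```python
# Traditional to Simplified: three different characters collapse into one.
TRAD_TO_SIMP = {"乾": "干", "幹": "干", "干": "干"}

# Inverting the table shows why the reverse direction needs context.
simp_to_trad = {}
for trad, simp in TRAD_TO_SIMP.items():
    simp_to_trad.setdefault(simp, []).append(trad)

simplified = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in "乾幹干")
print(simplified, simp_to_trad)
```

A character-level table can therefore only pick one candidate for 干; choosing the right one requires phrase-level mapping.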

Project Chvar (1/2)
https://github.com/buganini/chvar

Figure: Two-level grouping in Chvar (签 簽 and 籖 籤 each form a canonical group; together they form one compatibility group)

        签   簽   籖   籤
TW      簽   -    籤   -
CN      -    签   -    籖
CP950   簽   -    籤   -
GB2312  -    签   ×    ×
Table: Canonical Group

        签   簽   籖   籤
TW      簽   -    簽   簽
CN      -    签   签   签
CP950   簽   -    簽   簽
GB2312  -    签   签   签
Table: Compatibility Group

Project Chvar (2/2)
https://github.com/buganini/chvar

  • Normalization: Canonical Equivalence
  • Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
  • Fuzzy character matching: Compatibility Equivalence

Bsdconv Phases

Traditional Chinese ⇔ Simplified Chinese: UTF-8:ZHTW:UTF-8
  UTF-8 (from) : ZHTW (inter) : UTF-8 (to)
Figure: Conversion with inter-mapping phase

Bsdconv Phases

Furthermore, phrase mapping: UTF-8:ZHTW:ZHTW-WORDS:UTF-8
  UTF-8 (from) : ZHTW (inter) : ZHTW-WORDS (inter) : UTF-8 (to)
Figure: Conversion with multiple inter-mapping phases

Unicode Casing

IS complicated

Lowercase  Uppercase
a          A
i          I
Table: English

Page 11: Journey of Bsdconv

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Bsdconv Decoding and Encoding

Alternative to iconv ISO-8859-1 UTF-8

from

toFigure Basic two phases conversion

Bsdconv Codecs amp Fallback

Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F

from

toFigure Fallback codec

Transliteration UTF-8 CP936 CP936-TRANS 3F

from

toFigure Multiple fallback codecs

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Big5 5C issue (許功蓋)

BIG5BIG5-5CBIG5 Input Output

Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

BIG5-5CBIG5BIG5 Input Output

Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo

Traditional/Simplified Chinese

NOT a one-to-one mapping:
  Traditional: 乾 幹 干
  Simplified:  干 干 干
Context dependent: 之後, 夜之后, 入夜之後
Variants: 峰 峯

Project Chvar (1/2)
https://github.com/buganini/chvar

Figure: Two-level grouping in Chvar. 签 and 簽 form one canonical group, 籖 and 籤 form another, and both canonical groups belong to one compatibility group.

          签   簽   籖   籤
TW        簽   -    籤   -
CN        -    签   -    籖
CP950     簽   -    籤   -
GB2312    -    签   ×    ×
Table: Canonical Group

          签   簽   籖   籤
TW        簽   -    簽   簽
CN        -    签   签   签
CP950     簽   -    簽   簽
GB2312    -    签   签   签
Table: Compatibility Group

Project Chvar (2/2)
https://github.com/buganini/chvar

  • Normalization: Canonical Equivalence
  • Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
  • Fuzzy character matching: Compatibility Equivalence

Bsdconv Phases

Traditional Chinese ⇔ Simplified Chinese: UTF-8:ZHTW:UTF-8
(from: UTF-8; inter: ZHTW; to: UTF-8)
Figure: Conversion with inter-mapping phase

Furthermore, phrase mapping: UTF-8:ZHTW:ZHTW-WORDS:UTF-8
(from: UTF-8; inter: ZHTW, then ZHTW-WORDS; to: UTF-8)
Figure: Conversion with multiple inter-mapping phases

Unicode Casing

IS complicated.

Lowercase  Uppercase
a          A
i          I
Table: English

Lowercase  Uppercase
ı          I
i          İ
Table: Turkic

Lowercase  Uppercase
a          A
à          A
Table: French

Lowercase  Uppercase
σ          Σ
ς          Σ
Table: Greek

Default Case Folding
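Some of these rules can be observed with Python's built-in string methods. This is a sketch of the locale-independent default mappings only; the dictionary keys are illustrative names, and Turkic tailoring (ı/I, i/İ) is not applied by these methods.

```python
# Default (untailored) Unicode case mappings as exposed by Python's str type.
samples = {
    "sigma_upper": "σ".upper(),            # both sigmas share one uppercase
    "final_sigma_upper": "ς".upper(),
    "final_sigma_folded": "ς".casefold(),  # case folding merges ς into σ
    "sharp_s_folded": "ß".casefold(),      # folding may change the length
    # Lowercasing İ yields i followed by COMBINING DOT ABOVE, not a plain i.
    "dotted_I_lower": [f"U+{ord(c):04X}" for c in "İ".lower()],
}
for name, value in samples.items():
    print(name, value)
```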

Unicode Normalization Forms (1/2)
UAX #15

  • Indexing
  • Identification, security: username, domain name

Combining sequence             Ç ↔ C + U+0327 (combining cedilla)
Ordering of combining marks    q + two combining marks ↔ q + the same marks in the other order
Hangul                         가 ↔ ᄀ + ᅡ
Singleton                      Ω (U+2126) ↔ Ω (U+03A9)
Table: Canonical Equivalence

Unicode Normalization Forms (2/2)
UAX #15

Font variants               ℌ → H
Breaking differences        NBSP → SP
Cursive forms               ن (presentation form) → ن
Circled                     ① → 1
Width, size, rotated        ｶ → カ, ︷ → {
Superscripts/subscripts     ⁹ → 9
Squared characters          ㍿ → 株 + 式 + 会 + 社
Fractions                   ¾ → 3 + ⁄ + 4
Others                      ǆ → d + z + U+030C (combining caron)
Table: Compatibility Equivalence

Normalization for fuzzy matching

UTF-8:UPPER:UTF-8
  Input:  aăⅷǅбᾥ
  Output: AĂⅧǄБᾭ

UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
  Input:  ¼ℌℍăǅ⁹灣湾ド195082鬒 æß
  Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss

                 Composition   Decomposition
Canonical        NFC           NFD
Compatibility    NFKC          NFKD
Table: The four Unicode normalization forms and the transformations
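The four forms can be tried out with Python's standard unicodedata module. This is a sketch under the assumption that the interpreter's Unicode database is reasonably recent; the helper name forms is illustrative.

```python
import unicodedata as ud

def forms(s: str) -> dict:
    """Return the four normalization forms of s, keyed by form name."""
    return {f: ud.normalize(f, s) for f in ("NFC", "NFD", "NFKC", "NFKD")}

# Canonical equivalence: composition, Hangul decomposition, singleton.
composed = forms("C\u0327")["NFC"]   # C + combining cedilla
jamo = forms("가")["NFD"]            # Hangul syllable to conjoining jamo
ohm = forms("\u2126")["NFC"]         # OHM SIGN maps to GREEK CAPITAL OMEGA

# Compatibility equivalence only applies in the K forms.
circled = forms("①")
squared = forms("\u337F")["NFKD"]    # SQUARE CORPORATION
fraction = forms("¾")["NFKD"]

print(composed, [hex(ord(c)) for c in jamo], hex(ord(ohm)))
print(circled["NFC"], circled["NFKC"], squared, fraction)
```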

Bsdconv Cascade

Re-encode: UTF-8:ISO-8859-1|BIG5:UTF-8
(first conversion: from UTF-8, to ISO-8859-1; second conversion: from BIG5, to UTF-8)
Figure: Cascaded conversions

Input    Output
¥x¥_     台北
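The same repair can be sketched with Python's standard codecs, as an analogy to the cascade rather than a use of bsdconv. The function name reencode is illustrative, and the UTF-8 decode and encode at both ends are implicit in Python's str type.

```python
def reencode(mojibake: str) -> str:
    """Undo Big5 text that was wrongly decoded as ISO-8859-1."""
    raw = mojibake.encode("iso-8859-1")   # first conversion: back to raw bytes
    return raw.decode("big5")             # second conversion: decode as Big5

print(reencode("\u00A5x\u00A5_"))         # the input reads ¥x¥_
```

The intermediate bytes are A5 78 A5 5F: each yen sign is the Latin-1 view of a Big5 lead byte.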

Bsdconv Codec argument

Other than a question mark: UTF-8,ANY#0121:ASCII,ANY#21
Figure: Codec argument

Or more than one character: UTF-8,ANY#013F.0121:ASCII,ANY#21
Figure: Data list separated by dot

Bsdconv Alias

Phase    Alias            Expands to
from     3F               ANY#013F&ERROR
to       3F               ANY#3F&ERROR
from     UTF-8            ASCII  _UTF-8
inter    NFKD             _NFKD  _NF-HANGUL-DECOMPOSITION  _NF-ORDER
inter    NFKC             NFKD  _NFC  _NF-HANGUL-COMPOSITION
inter    NFKD-CASEFOLD    NFD  CASEFOLD  NFKD  CASEFOLD  NFKD
filter   01               UNICODE


Bsdconv Types

  • (01) Unicode
  • (02) CNS11643
  • (03) Byte
  • (04) Chinese components
  • (1B) ANSI control sequences
  • (00) Bsdconv special characters


Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
  Input:  功夫不好不要艹我
  Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
  Input:  功夫不好不要艹我
  Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
  Input:  功夫不好不要艹我
  Output: pu nao yao [uh]2

Bsdconv Flags

  • FREE - memory management
  • MARK - identifier


Look-through

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

Entity   Unicode   UTF-8 Hex
α        U+03B1    CEB1
β        U+03B2    CEB2

Internal data is shown as groups; the leading byte of a group is its type (01 Unicode, 03 byte).

Look-through (1/4): ESCAPE decoder
  Input (UTF-8 literal): the escape \u03B1 followed by the escaped bytes CE B2
  Internal data: [01 03 B1] [03 CE] [03 B2]

Look-through (2/4): PASS#MARK&FOR=1,BYTE encoder
  Internal data in:  [01 03 B1] [03 CE] [03 B2]
  Internal data out: [01 03 B1](MARK) [CE] [B2]

Look-through (3/4): PASS#UNMARK,UTF-8 decoder
  Internal data in:  [01 03 B1](MARK) [CE B2]
  Internal data out: [01 03 B1] [01 03 B2]

Look-through (4/4): UTF-8 encoder
  Internal data in:  [01 03 B1] [01 03 B2]
  Internal data out: CE B1 ("α") CE B2 ("β")
  Output (UTF-8 literal): αβ

Unicode East Asian Width (1/2)
UAX #11

Figure: Venn diagram showing the set relations for six categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)

Unicode East Asian Width (2/2)
UAX #11

Class   Example   Resolved width
N       ऊ         Narrow
Na      A         Narrow
H       ｶ         Narrow
A       Я         Ambiguous
F       Ａ        Wide
W       カ, 咦    Wide
Table: Examples for Each Character Class and Their Resolved Widths

Na   Narrow
N    Neutral, usually treated as Narrow
W    Wide
F    Fullwidth
H    Halfwidth
Table: Width attributes

String width measurement

echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
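A comparable count can be sketched with Python's unicodedata.east_asian_width. Folding Neutral, Narrow and Halfwidth into HALF and skipping control characters are assumptions of this sketch, not a description of how the WIDTH codec is implemented.

```python
import unicodedata

def width_counts(text: str) -> dict:
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        if unicodedata.category(ch) == "Cc":   # skip controls such as "\n"
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        else:                                   # Na, H and N
            counts["HALF"] += 1
    return counts

# Same sample as above; U+02CA and U+02CB are the two Ambiguous characters.
print(width_counts("42(\u02CA_>\u02CB) 紅茶\n"))
```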

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

ENCODING    SCORE   COUNT   IERR   Score(s)
UTF-8       19      4       0      4.75
BIG5        8       3       2      -4.0
GBK         4       1       4      -36.0
CCCII       36      9       0      4.0
UTF-16LE    20      5       2      0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  • Encoding without registered name, bound to fonts
  • Stored in CP1252 or UTF-8

Solution: two-pass detection
  • Detect encoding
  • Detect font family (currently not working) (high coverage in SBCS)

Algorithm ported from Khmer Converter3

Khmer Converter
  • Mapping
  • Reordering: visual order vs. Unicode model
  • Unicode model: BaseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

3 http://www.khmeros.info/en/khmer-converter


Terminal transcoding
https://github.com/buganini/bug5

Issues
  • UAO: non-standard Big5 extension
  • Double color hack: ANSI control sequence in the middle of DBCS
  • Ambiguous width characters
  • luit/screen cannot help

Solution (tl;dr)
  • Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • Unicode to Big5: UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F

Bug5 explained

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

Entity   Unicode   UTF-8 Hex   Big5 Hex
★        U+2605    E29885      A1B9
驚       U+9A5A    E9A99A      C5E5
[        U+005B    5B          5B
1        U+0031    31          31
m        U+006D    6D          6D
(NBSP)   U+00A0    C2A0        -

Internal data is shown as groups; the leading byte of a group is its type (01 Unicode, 03 byte, 1B ANSI control sequence).

Bug5 explained (1/6): ANSI-CONTROL,BYTE decoder
  Input (Big5 literal): ★\xC5\x1B[1m\xE5
  Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]

Bug5 explained (2/6): BIG5-DEFRAG inter-conversion
  Internal data in:  [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
  Internal data out: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]

Bug5 explained (3/6): BYTE,PASS#MARK&FOR=1B encoder
  Internal data in:  [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
  Internal data out: [A1] [B9] [C5] [E5] [1B 5B 31 6D](MARK)

Bug5 explained (4/6): PASS#UNMARK,BIG5 decoder
  Internal data in:  [A1 B9 C5 E5] [1B 5B 31 6D](MARK)
  Internal data out: [01 26 05] [01 9A 5A] [1B 5B 31 6D]

Bug5 explained (5/6): AMBIGUOUS-PAD inter-conversion
  Internal data in:  [01 26 05] [01 9A 5A] [1B 5B 31 6D]
  Internal data out: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]

Bug5 explained (6/6): UTF-8,PASS#FOR=1B encoder
  Internal data in:  [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
  Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
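The defragmentation step (2/6) can be illustrated with a simplified, stand-alone sketch. The function name defrag_big5, the lead-byte range 0x81-0xFE and the restriction to CSI sequences (ESC [ ... final byte) are assumptions made for the example; this is not the BIG5-DEFRAG codec itself.

```python
import re

# CSI sequence: ESC [ parameter bytes, intermediate bytes, one final byte.
CSI = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")

def defrag_big5(data: bytes) -> bytes:
    """Move CSI sequences that split a double-byte character to after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE:                 # assumed Big5 lead-byte range
            j = i + 1
            seqs = bytearray()
            m = CSI.match(data, j)
            while m:                          # collect sequences after the lead
                seqs += m.group()
                j = m.end()
                m = CSI.match(data, j)
            if j < len(data):                 # rejoin lead and trail byte
                out += bytes([b, data[j]]) + seqs
                i = j + 1
            else:                             # truncated character: keep as is
                out += bytes([b]) + seqs
                i = j
        else:
            m = CSI.match(data, i)
            if m:                             # copy a sequence between characters
                out += m.group()
                i = m.end()
            else:
                out.append(b)
                i += 1
    return bytes(out)

fixed = defrag_big5(b"\xa1\xb9\xc5\x1b[1m\xe5")
print(fixed.hex(" ").upper())
```

For the sample input from the walkthrough the sketch yields A1 B9 C5 E5 followed by the control sequence, matching the byte order after step 2/6.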

Bsdconv Bindings

  • Python: https://pypi.python.org/pypi/bsdconv
  • Ruby: https://rubygems.org/gems/ruby-bsdconv
  • Go: https://github.com/buganini/go-bsdconv
  • Perl: https://github.com/buganini/perl-bsdconv
  • PHP: https://github.com/buganini/php-bsdconv
  • PostgreSQL: https://github.com/buganini/postgres-bsdconv
  • MySQL: https://github.com/buganini/mysql-udf-bsdconv
  • Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv



Encoding: Big5

Many incompatible variations (abusing the PUA); none of the standard tools can rule them all.
http://moztw.org/docs/big5

Scenario      Dominating encoding
Microsoft     CP950
Taiwan BBS    UAO (Unicode-at-Once)
gov.tw        Big5-2003
gov.hk        HKSCS (1999/2001/2004)

Special characters conflict: the second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.

Big5 5C issue (許功蓋)

BIG5:BIG5-5C,BIG5    Input                Output
Big5 literal         "成功"               "成功\"
ASCII/hex            "\xA6\xA8\xA5\"      "\xA6\xA8\xA5\\"

BIG5-5C,BIG5:BIG5    Input                Output
Big5 literal         "成功\"              "成功"
ASCII/hex            "\xA6\xA8\xA5\\"     "\xA6\xA8\xA5\"
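
The 5C problem can be reproduced with nothing but the Python standard library. The sketch below is not bsdconv code: `escape_5c` is a hypothetical helper that assumes the BIG5-5C behaviour is "double every 0x5C that is the trail byte of a two-byte character", which is what the slide suggests.

```python
def escape_5c(big5_bytes: bytes) -> bytes:
    """Double each 0x5C that is the trail byte of a two-byte Big5 character."""
    out = bytearray()
    i = 0
    while i < len(big5_bytes):
        lead = big5_bytes[i]
        if 0x81 <= lead <= 0xFE and i + 1 < len(big5_bytes):
            trail = big5_bytes[i + 1]
            out += bytes([lead, trail])
            if trail == 0x5C:          # trail byte collides with ASCII backslash
                out.append(0x5C)       # assumed escaping rule: double it
            i += 2
        else:
            out.append(lead)           # single-byte (ASCII) character
            i += 1
    return bytes(out)


encoded = "成功".encode("big5")        # the last byte is 0x5C
escaped = escape_5c(encoded)
print(encoded.hex(" "), "->", escaped.hex(" "))
```

Any tool that treats a backslash as an escape character will misread the unescaped form, which is why the doubled form survives a round trip through such tools.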

Traditional/Simplified Chinese

NOT a one-to-one mapping:
  Traditional   乾 幹 干
  Simplified    干 干 干

Context dependent: 之後 / 夜之后 / 入夜之後

Variants: 峰 / 峯

Project Chvar (1/2)
https://github.com/buganini/chvar

Figure: Two-level grouping in Chvar. The canonical groups {签, 簽} and {籖, 籤} together form one compatibility group.

         签    簽    籖    籤
TW       簽    -     籤    -
CN       -     签    -     籖
CP950    簽    -     籤    -
GB2312   -     签    ×     ×

Table: Canonical Group

         签    簽    籖    籤
TW       簽    -     簽    簽
CN       -     签    签    签
CP950    簽    -     簽    簽
GB2312   -     签    签    签

Table: Compatibility Group

Project Chvar (2/2)
https://github.com/buganini/chvar

  • Normalization: Canonical Equivalence
  • Transliteration: converted, or Canonical Equivalence, or Compatibility Equivalence
  • Fuzzy character matching: Compatibility Equivalence

Bsdconv Phases

Traditional Chinese ⇔ Simplified Chinese:

  UTF-8:ZHTW:UTF-8
  (from:inter:to)

Figure: Conversion with an inter-mapping phase

Furthermore, phrase mapping:

  UTF-8:ZHTW:ZHTW-WORDS:UTF-8
  (from:inter:inter:to)

Figure: Conversion with multiple inter-mapping phases

Unicode Casing

IS complicated.

Lowercase   Uppercase
a           A
i           I
Table: English

Lowercase   Uppercase
ı           I
i           İ
Table: Turkic

Lowercase   Uppercase
a           A
à           A
Table: French

Lowercase   Uppercase
σ           Σ
ς           Σ
Table: Greek

  • Default Case Folding
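
The tables above can be checked against Python's built-in string methods, which apply Unicode's default (locale-independent) case mappings. This is only an illustration with the standard library, not bsdconv's UPPER or CASEFOLD codecs; note that it applies no Turkic tailoring, and that Python uppercases à to À rather than following the French typographic convention in the table.

```python
samples = ["a", "i", "ı", "à", "σ", "ς", "ß"]
for ch in samples:
    print(f"{ch!r}: upper={ch.upper()!r} casefold={ch.casefold()!r}")

# Final sigma (ς) and ordinary sigma (σ) only compare equal after case folding.
same_word = "ΟΔΟΣ".casefold() == "οδος".casefold()

# Lowercasing dotted capital I yields two code points: "i" + U+0307.
dotted_lower = "İ".lower()
print(same_word, [hex(ord(c)) for c in dotted_lower])
```

Case folding is meant for comparison, not display, which is why the folded forms are never shown to the user.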

Unicode Normalization Forms (1/2)
UAX #15

  • Indexing
  • Identification, security: username, domain name

Combining sequence            Ç ⇔ C + U+0327 (combining cedilla)
Ordering of combining marks   q + U+0307 + U+0323 ⇔ q + U+0323 + U+0307
Hangul                        가 ⇔ ᄀ + ᅡ
Singleton                     Ω (U+2126, ohm sign) ⇔ Ω (U+03A9)

Table: Canonical Equivalence

Unicode Normalization Forms (2/2)
UAX #15

Font variants               ℌ → H
Breaking differences        NBSP → SP
Cursive forms               positional forms of ن → ن
Circled                     ① → 1
Width, size, rotated        ｶ → カ, ︷ → {
Superscripts/subscripts     ⁹ → 9
Squared characters          ㍿ → 株 + 式 + 会 + 社
Fractions                   ¾ → 3 + ⁄ + 4
Others                      ǆ → d + z + U+030C (combining caron)

Table: Compatibility Equivalence
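
A few rows from the two equivalence tables, checked with Python's `unicodedata` module. This is a standard-library illustration of UAX #15 behaviour and says nothing about how bsdconv implements its own NFC/NFD/NFKC/NFKD codecs.

```python
import unicodedata as ud

# Canonical equivalence: NFC/NFD never change the meaning of the text.
canonical = {
    "cedilla":   ud.normalize("NFD", "\u00C7"),          # Ç -> C + U+0327
    "ordering":  ud.normalize("NFD", "q\u0307\u0323"),   # marks get reordered
    "hangul":    ud.normalize("NFC", "\u1100\u1161"),    # ᄀ + ᅡ -> 가
    "singleton": ud.normalize("NFC", "\u2126"),          # ohm sign -> Greek omega
}

# Compatibility equivalence: NFKC/NFKD may drop formatting distinctions.
compatibility = {
    "font":     ud.normalize("NFKC", "\u210C"),          # ℌ -> H
    "nbsp":     ud.normalize("NFKC", "\u00A0"),          # NBSP -> space
    "circled":  ud.normalize("NFKC", "\u2460"),          # ① -> 1
    "width":    ud.normalize("NFKC", "\uFF76"),          # halfwidth ｶ -> カ
    "super":    ud.normalize("NFKC", "\u2079"),          # ⁹ -> 9
    "squared":  ud.normalize("NFKC", "\u337F"),          # ㍿ -> 株式会社
    "fraction": ud.normalize("NFKD", "\u00BE"),          # ¾ -> 3 ⁄ 4
    "digraph":  ud.normalize("NFKD", "\u01C6"),          # ǆ -> d z U+030C
}

for name, value in {**canonical, **compatibility}.items():
    print(f"{name:10} {[hex(ord(c)) for c in value]}")
```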

Normalization for fuzzy matching

UTF-8:UPPER:UTF-8
  Input    aăⅷǅбᾥ
  Output   AĂⅧǄБᾭ

UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
  Input    ¼ℌℍăǅ⁹灣湾ド195082鬒 æß
  Output   1⁄4hhadza9灣灣 do鬒鬒正 æss

                Composition   Decomposition
Canonical       NFC           NFD
Compatibility   NFKC          NFKD

Table: The four Unicode normalization forms and the transformations
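
The NFKD-plus-case-folding part of this fuzzy matching can be sketched in Python. The function below follows the Unicode definition of a compatibility caseless match (NFD, fold, NFKD, fold, NFKD); `fuzzy_key` is a hypothetical name, and the Chinese and kana steps of the slide's chain are not covered.

```python
import unicodedata


def fuzzy_key(text: str) -> str:
    """Return a key under which compatibility-caseless-equal strings collide."""
    step = unicodedata.normalize("NFD", text).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


print(fuzzy_key("ℌℍ⁹æß"))
print(fuzzy_key("Ǆ") == fuzzy_key("ǆ"))
```

Unlike the slide's output, this key keeps combining marks (ă stays a + U+0306); dropping them would be a further, lossy step.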

Bsdconv Cascade

Re-encode:

  UTF-8:ISO-8859-1|BIG5:UTF-8
  (from:to | from:to)

Figure: Cascaded conversions

Input    Output
¥x¥_     台北
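
The same repair can be written as two steps in Python, which may make the cascade easier to follow: the first step turns the mis-decoded text back into its original bytes, the second decodes those bytes with the right charset. This uses Python's own `iso-8859-1` and `big5` codecs, not bsdconv.

```python
mojibake = "¥x¥_"                      # Big5 bytes that were wrongly read as Latin-1

raw = mojibake.encode("iso-8859-1")    # first conversion: text -> original bytes
repaired = raw.decode("big5")          # second conversion: bytes -> intended text

print(raw.hex(" "), "->", repaired)
```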

Bsdconv Codec argument

Other than a question mark:

  UTF-8,ANY#0121:ASCII,ANY#21
  (from:to)

Figure: Codec argument

Or more than one character:

  UTF-8,ANY#013F.0121:ASCII,ANY#21
  (from:to)

Figure: Data list separated by dot

Bsdconv Alias

Phase    Alias            Expansion
from     3F               ANY#013F&ERROR
to       3F               ANY#3F&ERROR
from     UTF-8            ASCII,_UTF-8
inter    NFKD             _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter    NFKC             NFKD:_NFC:_NF-HANGUL-COMPOSITION
inter    NFKD-CASEFOLD    NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter   01               UNICODE

Bsdconv Types

  • (01) Unicode
  • (02) CNS11643
  • (03) Byte
  • (04) Chinese components
  • (1B) ANSI control sequences
  • (00) Bsdconv special characters

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
  Input    功夫不好不要艹我
  Output   巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
  Input    功夫不好不要艹我
  Output   ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
  Input    功夫不好不要艹我
  Output   pu nao yao [uh]2

Bsdconv Flags

  • FREE - memory management
  • MARK - identifier

Look-through (1/4)

Input (UTF-8 literal): \u03B1 followed by the raw bytes CE B2

ESCAPE decoder → internal data:
  01 03 B1 | 03 CE | 03 B2

Entity   Unicode   UTF-8 hex
α        U+03B1    CE B1
β        U+03B2    CE B2

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

Look-through (2/4)

Internal data: 01 03 B1 | 03 CE | 03 B2

PASS#MARK&FOR=1,BYTE encoder → internal data:
  01 03 B1 (MARK) | CE | B2

Look-through (3/4)

Internal data: 01 03 B1 (MARK) | CE B2

PASS#UNMARK,UTF-8 decoder → internal data:
  01 03 B1 | 01 03 B2

Look-through (4/4)

Internal data: 01 03 B1 | 01 03 B2

UTF-8 encoder → CE B1 ("α"), CE B2 ("β")

Output (UTF-8 literal): αβ
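
The effect of this look-through conversion (decode `\uXXXX` escapes while leaving bytes that are already UTF-8 untouched) can be approximated in Python. The sketch swaps in a different technique, a regular-expression substitution on bytes, rather than bsdconv's mark-and-pass mechanism; `look_through` is a hypothetical name.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def look_through(data: bytes) -> str:
    """Expand \\uXXXX escapes in place, then decode everything as UTF-8."""
    expanded = ESCAPE.sub(
        # Replace each escape with the UTF-8 bytes of its code point
        # (surrogate code points would raise; they are not handled here).
        lambda m: chr(int(m.group(1), 16)).encode("utf-8"),
        data,
    )
    return expanded.decode("utf-8")


print(look_through(b"\\u03B1\xce\xb2"))
```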

Unicode East Asian Width (1/2)
UAX #11

Figure: Venn diagram showing the set relations for the six categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)

Unicode East Asian Width (2/2)
UAX #11

Class   Example   Resolved width
N       ऊ         Narrow
Na      A         Narrow
H       ｶ         Narrow
A       Я         Ambiguous
F       Ａ        Wide
W       カ        Wide
W       咦        Wide

Table: Examples for each character class and their resolved widths

Na   Narrow
N    Neutral (usually treated as Narrow)
W    Wide
F    Fullwidth
H    Halfwidth

Table: Width attributes

String width measurement

  echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
  FULL 2
  HALF 7
  AMBI 2
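
A comparable tally can be made with Python's `unicodedata.east_asian_width`, which returns the raw UAX #11 class of each character. This is a standard-library sketch, not the bsdconv WIDTH counter, so its class names (Na, A, W) differ from the FULL/HALF/AMBI labels above; `width_classes` is a hypothetical helper.

```python
import unicodedata
from collections import Counter


def width_classes(text: str) -> Counter:
    """Count characters per East Asian Width class (N, Na, H, A, F, W)."""
    return Counter(unicodedata.east_asian_width(ch) for ch in text)


# Same sample string as the slide; U+02CA and U+02CB are the two accents.
counts = width_classes("42(\u02CA_>\u02CB) 紅茶")
print(dict(counts))
```

A terminal still has to decide whether each Ambiguous character occupies one or two columns, which is the part the width classes alone cannot settle.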

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

ENCODING    SCORE   COUNT   IERR   Score(s)
UTF-8       19      4       0      4.75
BIG5        8       3       2      -4.0
GBK         4       1       4      -36.0
CCCII       36      9       0      4.0
UTF-16LE    20      5       2      0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  • Encodings without a registered name, bound to fonts
  • Stored in CP1252 or UTF-8

Solution: two-pass detection
  • Detect encoding
  • Detect font family (currently not working; high coverage in SBCS)

Algorithm ported from Khmer Converter [3]
  • Mapping
  • Reordering: visual order vs. Unicode model

Unicode model: BaseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Terminal transcoding
https://github.com/buganini/bug5

Issues
  • UAO: non-standard Big5 extension
  • Double color hack: ANSI control sequence in the middle of a DBCS character
  • Ambiguous-width characters
  • luit/screen cannot help

Solution (tl;dr)
  Big5 to Unicode:
    ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  Unicode to Big5:
    UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F

Bug5 explained (1/6)

Input (Big5 literal): ★ \xC5 \x1B[1m \xE5

ANSI-CONTROL,BYTE decoder → internal data:
  03 A1 | 03 B9 | 03 C5 | 1B 5B 31 6D | 03 E5

Entity   Unicode   UTF-8 hex   Big5 hex
★        U+2605    E2 98 85    A1 B9
驚       U+9A5A    E9 A9 9A    C5 E5
[        U+005B    5B          5B
1        U+0031    31          31
m        U+006D    6D          6D
(NBSP)   U+00A0    C2 A0       -

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

Bug5 explained (2/6)

BIG5-DEFRAG inter-conversion → internal data:
  03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D

Bug5 explained (3/6)

BYTE,PASS#MARK&FOR=1B encoder → internal data:
  A1 B9 C5 E5 | 1B 5B 31 6D (MARK)

Bug5 explained (4/6)

PASS#UNMARK,BIG5 decoder → internal data:
  01 26 05 | 01 9A 5A | 1B 5B 31 6D

Bug5 explained (5/6)

AMBIGUOUS-PAD inter-conversion → internal data:
  01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D

Bug5 explained (6/6)

UTF-8,PASS#FOR=1B encoder →

Output (UTF-8 literal): ★ (NBSP) 驚 \x1B[1m
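
The defragmentation step can be sketched in plain Python: when an ANSI control sequence sits between the lead and trail byte of a two-byte character, the sequence is moved to just after the character. `big5_defrag` is a hypothetical function written for illustration; it handles only CSI sequences (`ESC [` ... final byte) and is not taken from bug5 or bsdconv.

```python
def big5_defrag(data: bytes) -> bytes:
    """Move CSI sequences that split a double-byte character to after it."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        i += 1
        if not 0x81 <= lead <= 0xFE:
            out.append(lead)               # ASCII or control byte: copy through
            continue
        pending = bytearray()              # sequences found inside the character
        while data[i:i + 2] == b"\x1b[":
            end = i + 2
            while end < n and not 0x40 <= data[end] <= 0x7E:
                end += 1                   # skip parameter bytes up to the final byte
            if end == n:                   # unterminated sequence: leave it alone
                break
            pending += data[i:end + 1]
            i = end + 1
        out.append(lead)
        if i < n and data[i:i + 2] != b"\x1b[":
            out.append(data[i])            # trail byte rejoins its lead byte
            i += 1
        out += pending                     # control sequences follow the character
    return bytes(out)


fixed = big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")
print(fixed.hex(" "))
```

After this step the byte stream is valid Big5 again, so an ordinary Big5 decoder can process it while the control sequences pass through unchanged.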

Bsdconv Bindings

  • Python: https://pypi.python.org/pypi/bsdconv
  • Ruby: https://rubygems.org/gems/ruby-bsdconv
  • Go: https://github.com/buganini/go-bsdconv
  • Perl: https://github.com/buganini/perl-bsdconv
  • PHP: https://github.com/buganini/php-bsdconv
  • PostgreSQL: https://github.com/buganini/postgres-bsdconv
  • MySQL: https://github.com/buganini/mysql-udf-bsdconv
  • Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other
Page 17: Journey of Bsdconv

Project Chvar (1/2)
https://github.com/buganini/chvar

Figure: Two-level grouping in Chvar. 签/簽 and 籖/籤 each form a canonical group; the two canonical groups together form one compatibility group.

        签   簽   籖   籤
TW      簽   -    籤   -
CN      -    签   -    籖
CP950   簽   -    籤   -
GB2312  -    签   ×    ×

Table: Canonical Group

        签   簽   籖   籤
TW      簽   -    簽   簽
CN      -    签   签   签
CP950   簽   -    簽   簽
GB2312  -    签   签   签

Table: Compatibility Group
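The two tables can be read as per-target character mappings. The following is a minimal sketch of that reading, not Chvar's actual data format: the dictionary names and the `convert` helper are hypothetical, and only the TW and CN rows of the tables are transcribed.

```python
# Hypothetical transcription of the TW and CN rows of the two tables.
# "-" cells (character already in target form) are simply omitted.
CANONICAL = {
    "TW": {"签": "簽", "籖": "籤"},
    "CN": {"簽": "签", "籤": "籖"},
}
COMPATIBILITY = {
    "TW": {"签": "簽", "籖": "簽", "籤": "簽"},
    "CN": {"簽": "签", "籖": "签", "籤": "签"},
}

def convert(text: str, table: dict) -> str:
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

# Canonical mapping keeps 籤 distinct from 簽; compatibility mapping merges them.
print(convert("签籖", CANONICAL["TW"]))        # 簽籤
print(convert("签籖籤", COMPATIBILITY["TW"]))  # 簽簽簽
```

The canonical level is the one suited to lossless conversion, while the compatibility level trades distinctions for broader matching.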

Project Chvar (2/2)
https://github.com/buganini/chvar

  • Normalization: Canonical Equivalence
  • Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
  • Fuzzy character matching: Compatibility Equivalence

Bsdconv Phases

Traditional Chinese ⇔ Simplified Chinese: UTF-8:ZHTW:UTF-8

  • from: UTF-8
  • inter: ZHTW
  • to: UTF-8

Figure: Conversion with inter-mapping phase

Bsdconv Phases

Furthermore, phrases mapping: UTF-8:ZHTW:ZHTW-WORDS:UTF-8

  • from: UTF-8
  • inter: ZHTW
  • inter: ZHTW-WORDS
  • to: UTF-8

Figure: Conversion with multiple inter-mapping phases

Unicode Casing

IS complicated.

Lowercase  Uppercase
a          A
i          I

Table: English

Lowercase  Uppercase
ı          I
i          İ

Table: Turkic

Lowercase  Uppercase
a          A
à          A

Table: French

Lowercase  Uppercase
σ          Σ
ς          Σ

Table: Greek

Default Case Folding
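A short illustration of why casing is not a simple one-to-one mapping, using Python's built-in, locale-independent string methods. This is only a sketch of the default Unicode behaviour; it does not reproduce the Turkic or French tailorings from the tables.

```python
# Both Greek lowercase sigmas share one uppercase form.
print("σ".upper(), "ς".upper())   # Σ Σ

# Default (non-Turkic) mapping: dotless ı and dotted i both uppercase to I.
print("ı".upper(), "i".upper())   # I I

# Lowercasing İ yields two code points: i + COMBINING DOT ABOVE.
print([hex(ord(c)) for c in "İ".lower()])

# Default case folding is meant for caseless comparison, not for display.
print("Straße".casefold() == "STRASSE".casefold())  # True
```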

Unicode Normalization Forms (1/2)
UAX #15

  • Indexing
  • Identification / security (username, domain name)

Combining sequence            Ç ↔ C + ◌̧
Ordering of combining marks   q + ◌̇ + ◌̣ ↔ q + ◌̣ + ◌̇
Hangul                        가 ↔ ᄀ + ᅡ
Singleton                     Ω (U+2126) ↔ Ω (U+03A9)

Table: Canonical Equivalence

Unicode Normalization Forms (2/2)
UAX #15

Font variants             ℌ → H
Breaking differences      NBSP → SP
Cursive forms             ن (presentation form) → ن
Circled                   ① → 1
Width, size, rotated      ｶ → カ, ︷ → {
Superscripts/subscripts   ⁹ → 9
Squared characters        ㍿ → 株 + 式 + 会 + 社
Fractions                 ¾ → 3 + ⁄ + 4
Others                    ǆ → d + z + ◌̌

Table: Compatibility Equivalence

Normalization for fuzzy matching

UTF-8:UPPER:UTF-8
Input:  aăⅷDžбᾥ
Output: AĂⅧDŽБᾭ

UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input:  ¼ℌℍăDž⁹灣湾ド鬒鬒 æß
Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss

               Composition  Decomposition
Canonical      NFC          NFD
Compatibility  NFKC         NFKD

Table: The four Unicode normalization forms and the transformations
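The four forms can be tried out with Python's standard `unicodedata` module. The `nfkd_casefold` helper below is a hypothetical name; it is a sketch that chains NFD, case folding, NFKD, case folding, and NFKD in the order listed for the NFKD-CASEFOLD alias on the Alias slide, and it is not bsdconv's implementation.

```python
import unicodedata as ud

# Canonical equivalence: composition and decomposition round-trip.
print(ud.normalize("NFC", "C\u0327") == "\u00c7")        # combining sequence
print(ud.normalize("NFD", "\uac00") == "\u1100\u1161")   # Hangul
print(ud.normalize("NFC", "\u2126") == "\u03a9")         # singleton (ohm sign)

# Compatibility equivalence: formatting distinctions are dropped.
print(ud.normalize("NFKD", "\u2460"))   # circled one -> 1
print(ud.normalize("NFKD", "\u2079"))   # superscript nine -> 9

def nfkd_casefold(s: str) -> str:
    """NFD, casefold, NFKD, casefold, NFKD (assumed order, see lead-in)."""
    s = ud.normalize("NFD", s).casefold()
    s = ud.normalize("NFKD", s).casefold()
    return ud.normalize("NFKD", s)

print(nfkd_casefold("\u210c\u00df"))  # hss
```

Folding twice matters because compatibility decomposition can expose new cased letters (ℌ becomes H only after NFKD).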

Bsdconv Cascade

Re-encode: UTF-8:ISO-8859-1|BIG5:UTF-8

  • first conversion: from UTF-8, to ISO-8859-1
  • second conversion: from BIG5, to UTF-8

Figure: Cascaded conversions

Input   Output
¥x¥_    台北
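The same repair can be sketched with Python's built-in codecs: the mojibake text is first encoded back to the Latin-1 bytes it was wrongly decoded from, and those bytes are then decoded as Big5. The function name `repair_mojibake` is hypothetical, and Python's `big5` codec stands in for bsdconv's BIG5 codec as an assumption.

```python
def repair_mojibake(text: str) -> str:
    """Undo a Big5 string that was mistakenly decoded as ISO-8859-1."""
    raw = text.encode("iso-8859-1")   # first conversion: back to the original bytes
    return raw.decode("big5")         # second conversion: decode them as Big5

# "¥x¥_" is the byte sequence A5 78 A5 5F shown as Latin-1 characters.
print(repair_mojibake("\u00a5x\u00a5_"))
```

The two steps correspond to the two halves of the cascade; the intermediate value is plain bytes, which is why a single from/to conversion cannot express it.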

Bsdconv Codec argument

Other than question mark: UTF-8,ANY#0121:ASCII,ANY#21

  • from: UTF-8,ANY#0121
  • to: ASCII,ANY#21

Figure: Codec argument

Or more than one character: UTF-8,ANY#013F.0121:ASCII,ANY#21

  • from: UTF-8,ANY#013F.0121
  • to: ASCII,ANY#21

Figure: Data list separated by dot
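For comparison, Python's codec machinery offers a similar "substitute something other than a question mark" hook through custom error handlers. This is an analogy only, not bsdconv's API; the handler name `bang` and the helper `to_ascii` are hypothetical.

```python
import codecs

def bang(err):
    """Replace each unencodable character with '!' (0x21) and resume after it."""
    if isinstance(err, UnicodeEncodeError):
        return ("!" * (err.end - err.start), err.end)
    raise err

codecs.register_error("bang", bang)

def to_ascii(text: str) -> bytes:
    return text.encode("ascii", errors="bang")

print(to_ascii("caf\u00e9"))                          # b'caf!'
print("caf\u00e9".encode("ascii", errors="replace"))  # b'caf?' (the default substitute)
```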

Bsdconv Alias

Phase   Alias          Expands to (components in order)
from    3F             ANY#013F&ERROR
to      3F             ANY#3F&ERROR
from    UTF-8          ASCII, _UTF-8
inter   NFKD           _NFKD, _NF-HANGUL-DECOMPOSITION, _NF-ORDER
inter   NFKC           NFKD, _NFC, _NF-HANGUL-COMPOSITION
inter   NFKD-CASEFOLD  NFD, CASEFOLD, NFKD, CASEFOLD, NFKD
filter  01             UNICODE

Bsdconv Types

  • (01) Unicode
  • (02) CNS11643
  • (03) Byte
  • (04) Chinese components
  • (1B) ANSI control sequences
  • (00) Bsdconv special characters

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input:  功夫不好不要艹我
Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input:  功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input:  功夫不好不要艹我
Output: pu nao yao [uh]2

Bsdconv Flags

  • FREE - memory management
  • MARK - identifier

Look-through (1/4 to 4/4)

Input (UTF-8 literal): \u03B1 followed by the bytes CE B2

Internal data after each step (type prefix 01 = Unicode, 03 = Byte):

Step  Codec                            Internal data
1/4   ESCAPE (decoder)                 01 03B1 | 03 CE | 03 B2
2/4   PASS#MARK&FOR=1,BYTE (encoder)   01 03B1 [MARK] | CE | B2
3/4   PASS#UNMARK,UTF-8 (decoder)      01 03B1 | 01 03B2
4/4   UTF-8 (encoder)                  CE B1 ("α") | CE B2 ("β")

Output (UTF-8 literal): αβ

Entity  Unicode  UTF-8 Hex
α       U+03B1   CEB1
β       U+03B2   CEB2

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
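The idea of the four steps can be sketched in Python: items that the first decoder already turned into code points are carried through untouched, while the remaining raw bytes are handed to a second decoder. All names here are hypothetical, the list of ints and bytes only imitates bsdconv's typed internal data, and the sketch assumes escapes never split a multi-byte UTF-8 sequence.

```python
import re

ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")

def escape_decode(data: bytes) -> list:
    """First decoder: \\uXXXX becomes a code point (int); other bytes stay bytes."""
    items, pos = [], 0
    for m in ESCAPE.finditer(data):
        if m.start() > pos:
            items.append(data[pos:m.start()])
        items.append(int(m.group(1), 16))
        pos = m.end()
    if pos < len(data):
        items.append(data[pos:])
    return items

def utf8_decode_rest(items: list) -> str:
    """Second decoder: code points pass through, raw bytes are decoded as UTF-8."""
    return "".join(chr(x) if isinstance(x, int) else x.decode("utf-8") for x in items)

def look_through(data: bytes) -> str:
    return utf8_decode_rest(escape_decode(data))

print(look_through(b"\\u03B1\xce\xb2"))  # αβ
```

Keeping the already-decoded items distinguishable from raw bytes plays the role that MARK and UNMARK play in the cascade.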

Unicode East Asian Width (1/2)
UAX #11

Figure: Venn diagram showing the set relations for the six categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)

Unicode East Asian Width (2/2)
UAX #11

Resolved width  Examples (class)
Narrow          ऊ (N), A (Na), ｶ (H)
Ambiguous       Я (A)
Wide            Ａ (F), カ (W), 咦 (W)

Table: Examples for each character class and their resolved widths

Na  Narrow
N   Neutral, usually treated as Narrow
W   Wide
F   Fullwidth
H   Halfwidth

Table: Width attributes

String width measurement

echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL

FULL: 2
HALF: 7
AMBI: 2
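The same tally can be approximated with Python's standard `unicodedata.east_asian_width`. This is a sketch under the assumption that FULL covers the W and F classes, AMBI covers A, and HALF covers everything else (Na, H, N); the function name is hypothetical and the trailing newline that `echo` adds is left out.

```python
import unicodedata

def width_classes(text: str) -> dict:
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        w = unicodedata.east_asian_width(ch)
        if w in ("W", "F"):
            counts["FULL"] += 1
        elif w == "A":
            counts["AMBI"] += 1
        else:  # Na, H, N
            counts["HALF"] += 1
    return counts

# U+02CA and U+02CB are the two Ambiguous characters in the sample string.
print(width_classes("42(\u02ca_>\u02cb) 紅茶"))
```

A terminal that renders Ambiguous characters as wide would need 2*FULL + HALF + 2*AMBI columns; one that renders them narrow needs 2*FULL + HALF + AMBI.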

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK……

ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  • Encoding without registered name, bound to fonts
  • Stored in CP1252 or UTF-8

Solution
  • Two-pass detection
      Detect encoding
      Detect font family (currently not working) (high coverage in SBCS)
  • Algorithm ported from Khmer Converter [3]

Khmer Converter
  • Mapping
  • Reordering: visual order vs. Unicode model
  • Unicode model: Base character [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Encoding Big5

Many incompatible variations (abusing PUA); none of the standard tools can rule them all.
http://moztw.org/docs/big5

Scenario     Dominating encoding
Microsoft    CP950
Taiwan BBS   UAO (Unicode-at-Once)
gov.tw       Big5-2003
gov.hk       HKSCS (1999/2001/2004)

Special characters conflict: the second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.

Terminal transcoding
https://github.com/buganini/bug5

Issues
  • UAO: non-standard Big5 extension
  • Double color hack: ANSI control sequence in the middle of a DBCS character
  • Ambiguous-width characters
  • luit/screen cannot help

Solution (tl;dr)
  • Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • Unicode to Big5: UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F

Bug5 explained (1/6 to 6/6)

Input (Big5 literal): ★\xC5\x1B[1m\xE5

Internal data after each step (type prefix 01 = Unicode, 03 = Byte, 1B = ANSI control sequence):

Step  Codec                             Internal data
1/6   ANSI-CONTROL,BYTE (decoder)       03 A1 | 03 B9 | 03 C5 | 1B 5B 31 6D | 03 E5
2/6   BIG5-DEFRAG (inter-conversion)    03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D
3/6   BYTE,PASS#MARK&FOR=1B (encoder)   A1 B9 C5 E5 | 1B 5B 31 6D [MARK]
4/6   PASS#UNMARK,BIG5 (decoder)        01 2605 | 01 9A5A | 1B 5B 31 6D
5/6   AMBIGUOUS-PAD (inter-conversion)  01 2605 | 01 A0 | 01 9A5A | 1B 5B 31 6D
6/6   UTF-8,PASS#FOR=1B (encoder)       ★ (NBSP) 驚 \x1B[1m

Output (UTF-8 literal): ★ 驚\x1B[1m

Entity  Unicode  UTF-8 Hex  Big5 Hex
★       U+2605   E29885     A1B9
驚      U+9A5A   E9A99A     C5E5
[       U+005B   5B         5B
1       U+0031   31         31
m       U+006D   6D         6D
(NBSP)  U+00A0   C2A0       -

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
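The defragmentation step can be sketched on raw bytes: when an ANSI control sequence sits between the lead byte and the trail byte of a double-byte character, the two halves are rejoined and the control sequence is moved after them. This is a minimal illustration, not bug5's code; the function name, the lead-byte range 0x81 to 0xFE, and the restriction to CSI sequences of the form ESC [ ... letter are assumptions.

```python
import re

# Assumed shape of a control sequence: ESC [ parameters final-letter
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")

def big5_defrag(data: bytes) -> bytes:
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if 0x81 <= lead <= 0xFE:              # assumed Big5 lead-byte range
            m = CSI.match(data, i + 1)
            if m and m.end() < n:
                # lead byte, control sequence, trail byte: rejoin the character
                out += bytes((lead, data[m.end()]))
                out += m.group()              # control sequence now follows it
                i = m.end() + 1
                continue
            out += data[i:i + 2]              # ordinary double-byte character
            i += 2
            continue
        out.append(lead)                      # single-byte data, copied as-is
        i += 1
    return bytes(out)

fixed = big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")
print(fixed)  # b'\xa1\xb9\xc5\xe5\x1b[1m'
```

After this step the byte stream is valid Big5 again, so an ordinary decoder can take over.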

Bsdconv Bindings

  • Python: https://pypi.python.org/pypi/bsdconv
  • Ruby: https://rubygems.org/gems/ruby-bsdconv
  • Go: https://github.com/buganini/go-bsdconv
  • Perl: https://github.com/buganini/perl-bsdconv
  • PHP: https://github.com/buganini/php-bsdconv
  • PostgreSQL: https://github.com/buganini/postgres-bsdconv
  • MySQL: https://github.com/buganini/mysql-udf-bsdconv
  • Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

  • Alternative to ConvertZ
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPE UTF-8 PASS#FOR=UNICODE&MARK BYTE | PASS#UNMARK UTF-8 NFC ASCII ESCAPE |

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other
Page 18: Journey of Bsdconv

Project Chvar (22)httpsgithubcombuganinichvar

NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence

签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖

CP950 簽 - 籤 -GB2312 - 签 times times

Table Canonical Group

签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签

CP950 簽 - 簽 簽GB2312 - 签 签 签

Table Compatibility Group

Bsdconv Phases

Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8

from

inter

toFigure Conversion with inter-mapping phase

Bsdconv Phases

Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8

from

inter

inter

toFigure Conversion with multiple inter-mapping phases

Unicode Casing

IS complicatedLowercase Uppercase

a Ai ITable English

Lowercase Uppercaseı Ii İTable Turkic

Lowercase Uppercasea Aagrave ATable French

Lowercase Uppercaseσ Σς Σ

Table Greek

Default Case Folding

Unicode Normalization Forms (12)UAX15

IndexingIdentification securityUsername Domain name

Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω

Table Canonical Equivalence

Unicode Normalization Forms (22)UAX15

Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1

Width size rotated カ カ︷

Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +

Table Compatibility Equivalence

Normalization for fuzzy matching

UTF-8UPPERUTF-8Input aăⅷDžбᾥ

Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8

Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss

Composition DecompositionCanonical NFC NFD

Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Bsdconv Codec argument

Other than question mark UTF-8 ANY0121 ASCII ANY21

from

toFigure Codec argument

Or more than one character UTF-8 ANY013F0121 ASCII ANY21

from

toFigure Data list separated by dot

Bsdconv Alias

from3FANY013FampERROR

to3FANY3FampERROR

fromUTF-8ASCII_UTF-8

interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER

interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION

interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD

filter01UNICODE

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Bsdconv Types

(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Chinese components compositionhttpsgithubcombuganinichicomp

UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我

Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8

Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8

Input 功夫不好不要艹我Output pu nao yao [uh]2

Bsdconv Flags

FREE - memory managementMARK - identifier

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Look-through (14)

u03B1CEB2Input (UTF-8 literal)

ESCAPE Decoder

01

03

B1

03

CE

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (24)

01

03

B1

03CE

03B2

Internal data

PASSMARKampFOR=1BYTEEncoder

01

03

B1

MARK

CE

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (34)

0103

B1

MARK

CE B2

Internal data

PASSUNMARKUTF-8 Decoder

01

03

B1

01

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (44)

0103

B1

01

03

B2

Internal data

UTF-8

Encoder

CE

B1

rdquoαrdquoCE

B2

rdquoβrdquo

Internal data

αβ

Output (UTF-8 literal)

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

String width measurement

echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2

Chinese charset encoding detectionhttpsgithubcombuganinichiconv

ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001

$COUNT

帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)

UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40

UTF-16LE 20 5 2 00

Khmer legacy font converterhttpsgithubcombuganinikhmerconv

IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8

SolutionTwo pass detection

Detect encodingDetect font family (currently not working)(High converage in SBCS)

Algorithm ported from Khmer Converter3

Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

3httpwwwkhmerosinfoenkhmer-converter

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

Terminal transcodinghttpsgithubcombuganinibug5

IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help

Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained (16)

⋆xC5x1B[1mxE5Input (Big5 literal)

ANSI-CONTROLBYTE Decoder

03

A1

03

B9

03

C5

1B

5B

31

6D

03

E5

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (26)

03

A1 03

B9 03

C5 1B

5B

31

6D

03E5

Internal data

BIG5-DEFRAG Inter-conversion

03

A1

03

B9

03

C5

03

E5

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (36)

03

A1 03

B9 03

C5 03

E5 1B

5B

31

6D

Internal data

BYTEPASSMARKampFOR=1BEncoder

A1

B9

C5

E5

1B

5B

31

6D

MARK

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (46)

A1 B9 C5 E5 1B

5B

31

6D

MARK

Internal data

PASSUNMARKBIG5 Decoder

0126

05

01

9A

5A

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (56)

0126

05

019A

5A

1B5B

31

6D

Internal data

AMBIGUOUS-PAD Inter-conversion

01

26

05

01

A0

01

9A

5A

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (66)

01

26

05

01A0

019A

5A

1B5B

31

6D

Internal data

UTF-8PASSFOR=1BEncoder

⋆ 驚 x1B[1m

Output (UTF-8 literal)

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bsdconv Bindings

PythonRubyGoPerlPHPhttpspypipythonorgpypibsdconvhttpsrubygemsorggemsruby-bsdconvhttpsgithubcombuganinigo-bsdconvhttpsgithubcombuganiniperl-bsdconvhttpsgithubcombuganiniphp-bsdconv

PostgreSQLMySQLhttpsgithubcombuganinipostgres-bsdconvhttpsgithubcombuganinimysql-udf-bsdconv

Irssihttpsgithubcombuganiniirssi-scriptsblobmasterirssi-bsdconvpl

Bsdconv GUIhttpsgithubcombuganinigbsdconv

Alternative to ConvertZTextFile nameFile contentMeta tag

Thanks

ESCAPEUTF-8PASSFOR=UNICODEampMARKBYTE|PASSUNMAR

KUTF-8NFCASCIIES

CAPE|

httpsgithubcombuganinibsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter amp Filter amp Scorer
  • Bindings
  • Other
Page 19: Journey of Bsdconv

Bsdconv Phases

Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8

from

inter

toFigure Conversion with inter-mapping phase

Bsdconv Phases

Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8

from

inter

inter

toFigure Conversion with multiple inter-mapping phases

Unicode Casing

IS complicatedLowercase Uppercase

a Ai ITable English

Lowercase Uppercaseı Ii İTable Turkic

Lowercase Uppercasea Aagrave ATable French

Lowercase Uppercaseσ Σς Σ

Table Greek

Default Case Folding

Unicode Normalization Forms (12)UAX15

IndexingIdentification securityUsername Domain name

Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω

Table Canonical Equivalence

Unicode Normalization Forms (22)UAX15

Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1

Width size rotated カ カ︷

Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +

Table Compatibility Equivalence

Normalization for fuzzy matching

UTF-8UPPERUTF-8Input aăⅷDžбᾥ

Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8

Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss

Composition DecompositionCanonical NFC NFD

Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Bsdconv Codec argument

Other than question mark UTF-8 ANY0121 ASCII ANY21

from

toFigure Codec argument

Or more than one character UTF-8 ANY013F0121 ASCII ANY21

from

toFigure Data list separated by dot

Bsdconv Alias

from3FANY013FampERROR

to3FANY3FampERROR

fromUTF-8ASCII_UTF-8

interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER

interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION

interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD

filter01UNICODE

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Bsdconv Types

(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Chinese components compositionhttpsgithubcombuganinichicomp

UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我

Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8

Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8

Input 功夫不好不要艹我Output pu nao yao [uh]2

Bsdconv Flags

FREE - memory managementMARK - identifier

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Look-through (14)

u03B1CEB2Input (UTF-8 literal)

ESCAPE Decoder

01

03

B1

03

CE

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (24)

01

03

B1

03CE

03B2

Internal data

PASSMARKampFOR=1BYTEEncoder

01

03

B1

MARK

CE

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (34)

0103

B1

MARK

CE B2

Internal data

PASSUNMARKUTF-8 Decoder

01

03

B1

01

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (44)

0103

B1

01

03

B2

Internal data

UTF-8

Encoder

CE

B1

rdquoαrdquoCE

B2

rdquoβrdquo

Internal data

αβ

Output (UTF-8 literal)

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

String width measurement

echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2

Chinese charset encoding detectionhttpsgithubcombuganinichiconv

ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001

$COUNT

帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)

UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40

UTF-16LE 20 5 2 00

Khmer legacy font converterhttpsgithubcombuganinikhmerconv

IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8

SolutionTwo pass detection

Detect encodingDetect font family (currently not working)(High converage in SBCS)

Algorithm ported from Khmer Converter3

Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

3httpwwwkhmerosinfoenkhmer-converter

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

Terminal transcodinghttpsgithubcombuganinibug5

IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help

Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained (16)

⋆xC5x1B[1mxE5Input (Big5 literal)

ANSI-CONTROLBYTE Decoder

03

A1

03

B9

03

C5

1B

5B

31

6D

03

E5

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (26)

03

A1 03

B9 03

C5 1B

5B

31

6D

03E5

Internal data

BIG5-DEFRAG Inter-conversion

03

A1

03

B9

03

C5

03

E5

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (36)

03

A1 03

B9 03

C5 03

E5 1B

5B

31

6D

Internal data

BYTEPASSMARKampFOR=1BEncoder

A1

B9

C5

E5

1B

5B

31

6D

MARK

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (46)

A1 B9 C5 E5 1B

5B

31

6D

MARK

Internal data

PASSUNMARKBIG5 Decoder

0126

05

01

9A

5A

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (56)

0126

05

019A

5A

1B5B

31

6D

Internal data

AMBIGUOUS-PAD Inter-conversion

01

26

05

01

A0

01

9A

5A

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (66)

01

26

05

01A0

019A

5A

1B5B

31

6D

Internal data

UTF-8PASSFOR=1BEncoder

⋆ 驚 x1B[1m

Output (UTF-8 literal)

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bsdconv Bindings

PythonRubyGoPerlPHPhttpspypipythonorgpypibsdconvhttpsrubygemsorggemsruby-bsdconvhttpsgithubcombuganinigo-bsdconvhttpsgithubcombuganiniperl-bsdconvhttpsgithubcombuganiniphp-bsdconv

PostgreSQLMySQLhttpsgithubcombuganinipostgres-bsdconvhttpsgithubcombuganinimysql-udf-bsdconv

Irssihttpsgithubcombuganiniirssi-scriptsblobmasterirssi-bsdconvpl

Bsdconv GUIhttpsgithubcombuganinigbsdconv

Alternative to ConvertZTextFile nameFile contentMeta tag

Thanks

ESCAPEUTF-8PASSFOR=UNICODEampMARKBYTE|PASSUNMAR

KUTF-8NFCASCIIES

CAPE|

httpsgithubcombuganinibsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter amp Filter amp Scorer
  • Bindings
  • Other
Page 20: Journey of Bsdconv

Bsdconv Phases

Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8

from

inter

inter

toFigure Conversion with multiple inter-mapping phases

Unicode Casing

IS complicatedLowercase Uppercase

a Ai ITable English

Lowercase Uppercaseı Ii İTable Turkic

Lowercase Uppercasea Aagrave ATable French

Lowercase Uppercaseσ Σς Σ

Table Greek

Default Case Folding

Unicode Normalization Forms (12)UAX15

IndexingIdentification securityUsername Domain name

Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω

Table Canonical Equivalence

Unicode Normalization Forms (22)UAX15

Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1

Width size rotated カ カ︷

Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +

Table Compatibility Equivalence

Normalization for fuzzy matching

UTF-8UPPERUTF-8Input aăⅷDžбᾥ

Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8

Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss

Composition DecompositionCanonical NFC NFD

Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Bsdconv Codec argument

Other than question mark UTF-8 ANY0121 ASCII ANY21

from

toFigure Codec argument

Or more than one character UTF-8 ANY013F0121 ASCII ANY21

from

toFigure Data list separated by dot

Bsdconv Alias

from3FANY013FampERROR

to3FANY3FampERROR

fromUTF-8ASCII_UTF-8

interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER

interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION

interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD

filter01UNICODE

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Bsdconv Types

(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input:  功夫不好不要艹我
Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input:  功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input:  功夫不好不要艹我
Output: pu nao yao [uh]2

Bsdconv Flags

FREE - memory management
MARK - identifier

Look-through (1/4)

Input (UTF-8 literal): \u03B1%CE%B2
ESCAPE decoder
Internal data: 0103B1  03CE  03B2

Look-through (2/4)

Internal data: 0103B1  03CE  03B2
PASS#MARK&FOR=1,BYTE encoder
Internal data: 0103B1 (MARK)  CE  B2

Look-through (3/4)

Internal data: 0103B1 (MARK)  CE B2
PASS#UNMARK,UTF-8 decoder
Internal data: 0103B1  0103B2

Look-through (4/4)

Internal data: 0103B1  0103B2
UTF-8 encoder
Internal data: CE B1 ("α")  CE B2 ("β")
Output (UTF-8 literal): αβ

Entity  Unicode  UTF-8 Hex
α       U+03B1   CEB1
β       U+03B2   CEB2

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
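The look-through steps can be imitated in a few lines of Python. This is a sketch under stated assumptions and not bsdconv's implementation: the function names are hypothetical, only `\uXXXX` and `%XX` escapes are recognized, and the MARK flag is modeled implicitly by letting units that are already Unicode skip the UTF-8 decoding step.

```python
import re

UNICODE, BYTE = 0x01, 0x03  # type prefixes as listed on the Types slide

TOKEN = re.compile(r"\\u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def escape_decode(literal: str) -> list:
    """Split an escaped literal into (type, value) units; other text is ignored."""
    units = []
    for match in TOKEN.finditer(literal):
        if match.group(1) is not None:
            units.append((UNICODE, int(match.group(1), 16)))
        else:
            units.append((BYTE, int(match.group(2), 16)))
    return units


def look_through(units: list) -> str:
    """Pass Unicode units through untouched; decode runs of byte units as UTF-8."""
    out = []
    pending = bytearray()
    for kind, value in units:
        if kind == BYTE:
            pending.append(value)
        else:
            out.append(pending.decode("utf-8"))  # flush bytes seen so far
            pending.clear()
            out.append(chr(value))
    out.append(pending.decode("utf-8"))
    return "".join(out)
```

With the slide's input, `escape_decode(r"\u03B1%CE%B2")` yields one Unicode unit and two byte units, and `look_through` turns them into the two Greek letters of the final step.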

Unicode East Asian Width (1/2) UAX #11

Figure: Venn Diagram Showing the Set Relations for Six Categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)

Unicode East Asian Width (2/2) UAX #11

Narrow:    ऊ (N), A (Na), ｶ (H)
Ambiguous: Я (A)
Wide:      Ａ (F), カ (W), 咦 (W)

Table: Examples for Each Character Class and Their Resolved Widths

Na  Narrow
N   Neutral (usually treated as Narrow)
W   Wide
F   Fullwidth
H   Halfwidth

Table: Width attributes

String width measurement

echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
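A comparable count can be obtained from Python's `unicodedata.east_asian_width`. The sketch below is an approximation, not the WIDTH counter itself: `measure_width` is a hypothetical name, Wide and Fullwidth are counted as FULL, Ambiguous as AMBI, and everything else (Narrow, Halfwidth, Neutral, including control and combining characters) as HALF.

```python
import unicodedata


def measure_width(text: str) -> dict:
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        else:  # "Na", "H", or "N"
            counts["HALF"] += 1
    return counts
```

Applied to the string from the slide, the two tone marks fall into the Ambiguous class, the two ideographs are Wide, and the remaining seven characters (including the space) are Narrow. A terminal still has to decide whether each ambiguous character takes one column or two.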

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
Encoding without registered name, bound on fonts
Stored in CP1252 or UTF-8

Solution
Two-pass detection:
Detect encoding
Detect font family (currently not working)
(High coverage in SBCS)

Algorithm ported from Khmer Converter [3]

Khmer Converter
Mapping
Reordering
Visual order vs. Unicode model
Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Encoding Big5

Many incompatible variations (abusing PUA); none of the standard tools can rule them all.
http://moztw.org/docs/big5/

Scenario     Dominating encoding
Microsoft    CP950
Taiwan BBS   UAO (Unicode-at-Once)
gov.tw       Big5-2003
gov.hk       HKSCS (1999/2001/2004)

Special characters conflict
The second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.
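The trail-byte conflict is easy to observe with Python's built-in `big5` codec. The sketch below uses a hypothetical helper, `risky_characters`, and assumes every character in the input is encodable in that codec (an unencodable character raises `UnicodeEncodeError`). It lists the characters whose second byte coincides with one of the three ASCII characters named above.

```python
RISKY_TRAIL_BYTES = {0x5C: "\\", 0x7C: "|", 0x7E: "~"}


def risky_characters(text: str, encoding: str = "big5") -> list:
    """Return (character, ASCII look-alike) pairs for double-byte characters
    whose trail byte is a backslash, a pipe, or a tilde."""
    found = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) == 2 and raw[1] in RISKY_TRAIL_BYTES:
            found.append((ch, RISKY_TRAIL_BYTES[raw[1]]))
    return found
```

The well-known trio 許功蓋 all end in 0x5C, so a byte-oriented tool that treats a backslash as an escape character corrupts them, whereas 台北 (A5 78 A5 5F) passes unharmed.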

Terminal transcoding
https://github.com/buganini/bug5

Issues
UAO: non-standard Big5 extension
Double color hack: ANSI control sequence in the middle of DBCS
Ambiguous width characters
luit/screen cannot help

Solution (tl;dr)
Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5: UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F

Bug5 explained (1/6)

Input (Big5 literal): ★\xC5\x1B[1m\xE5
ANSI-CONTROL,BYTE decoder
Internal data: 03A1  03B9  03C5  1B5B316D  03E5

Bug5 explained (2/6)

Internal data: 03A1  03B9  03C5  1B5B316D  03E5
BIG5-DEFRAG inter-conversion
Internal data: 03A1  03B9  03C5  03E5  1B5B316D

Bug5 explained (3/6)

Internal data: 03A1  03B9  03C5  03E5  1B5B316D
BYTE,PASS#MARK&FOR=1B encoder
Internal data: A1  B9  C5  E5  1B5B316D (MARK)

Bug5 explained (4/6)

Internal data: A1 B9  C5 E5  1B5B316D (MARK)
PASS#UNMARK,BIG5 decoder
Internal data: 012605  019A5A  1B5B316D

Bug5 explained (5/6)

Internal data: 012605  019A5A  1B5B316D
AMBIGUOUS-PAD inter-conversion
Internal data: 012605  01A0  019A5A  1B5B316D

Bug5 explained (6/6)

Internal data: 012605  01A0  019A5A  1B5B316D
UTF-8,PASS#FOR=1B encoder
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m

Entity  Unicode  UTF-8 Hex  Big5 Hex
★       U+2605   E29885     A1B9
驚      U+9A5A   E9A99A     C5E5
[       U+005B   5B         5B
1       U+0031   31         31
m       U+006D   6D         6D
(NBSP)  U+00A0   C2A0       -

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
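The defragmentation step is the heart of this pipeline, and its idea can be sketched in Python. The code below is an illustration only, not bug5's or bsdconv's implementation: the function name `big5_defrag`, the lead-byte range 0x81 to 0xFE, and the simplified pattern for ANSI control sequences (`ESC [ parameters letter`) are all assumptions. Whenever a control sequence sits between the lead byte and the trail byte of a double-byte character, the sequence is moved to just after the completed character.

```python
import re

# Simplified CSI pattern: ESC [ digits/semicolons, then one final letter.
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def big5_defrag(data: bytes) -> bytes:
    """Rejoin double-byte characters that were split by ANSI control sequences."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0xFE:            # assumed Big5 lead-byte range
            j = i + 1
            moved = b""
            match = CSI.match(data, j)
            while match:                     # collect sequences splitting the pair
                moved += match.group(0)
                j = match.end()
                match = CSI.match(data, j)
            if j < len(data):
                out += bytes([lead, data[j]]) + moved
                i = j + 1
            else:                            # dangling lead byte at end of input
                out += bytes([lead]) + moved
                i = j
        else:
            match = CSI.match(data, i)
            if match:                        # control sequence between characters
                out += match.group(0)
                i = match.end()
            else:
                out.append(lead)
                i += 1
    return bytes(out)
```

For the slide's input, `big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")` returns the bytes A1 B9 C5 E5 followed by the untouched `ESC[1m`, after which the text part can be decoded as ordinary Big5. Moving the sequence changes where the color switch takes effect, which is the price of making the character decodable.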

Bsdconv Bindings

Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv

PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv

Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
Text
File name
File content
Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other
Page 23: Journey of Bsdconv

Unicode Normalization Forms (22)UAX15

Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1

Width size rotated カ カ︷

Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +

Table Compatibility Equivalence

Normalization for fuzzy matching

UTF-8UPPERUTF-8Input aăⅷDžбᾥ

Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8

Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss

Composition DecompositionCanonical NFC NFD

Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Bsdconv Codec argument

Other than question mark UTF-8 ANY0121 ASCII ANY21

from

toFigure Codec argument

Or more than one character UTF-8 ANY013F0121 ASCII ANY21

from

toFigure Data list separated by dot

Bsdconv Alias

from3FANY013FampERROR

to3FANY3FampERROR

fromUTF-8ASCII_UTF-8

interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER

interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION

interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD

filter01UNICODE

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Bsdconv Types

(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Chinese components compositionhttpsgithubcombuganinichicomp

UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我

Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8

Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8

Input 功夫不好不要艹我Output pu nao yao [uh]2

Bsdconv Flags

FREE - memory managementMARK - identifier

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Look-through (14)

u03B1CEB2Input (UTF-8 literal)

ESCAPE Decoder

01

03

B1

03

CE

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (24)

01

03

B1

03CE

03B2

Internal data

PASSMARKampFOR=1BYTEEncoder

01

03

B1

MARK

CE

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (34)

0103

B1

MARK

CE B2

Internal data

PASSUNMARKUTF-8 Decoder

01

03

B1

01

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (44)

0103

B1

01

03

B2

Internal data

UTF-8

Encoder

CE

B1

rdquoαrdquoCE

B2

rdquoβrdquo

Internal data

αβ

Output (UTF-8 literal)

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

String width measurement

echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2

Chinese charset encoding detectionhttpsgithubcombuganinichiconv

ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001

$COUNT

帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)

UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40

UTF-16LE 20 5 2 00

Khmer legacy font converterhttpsgithubcombuganinikhmerconv

IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8

SolutionTwo pass detection

Detect encodingDetect font family (currently not working)(High converage in SBCS)

Algorithm ported from Khmer Converter3

Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

3httpwwwkhmerosinfoenkhmer-converter

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

Terminal transcodinghttpsgithubcombuganinibug5

IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help

Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained (16)

⋆xC5x1B[1mxE5Input (Big5 literal)

ANSI-CONTROLBYTE Decoder

03

A1

03

B9

03

C5

1B

5B

31

6D

03

E5

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (26)

03

A1 03

B9 03

C5 1B

5B

31

6D

03E5

Internal data

BIG5-DEFRAG Inter-conversion

03

A1

03

B9

03

C5

03

E5

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (36)

03

A1 03

B9 03

C5 03

E5 1B

5B

31

6D

Internal data

BYTEPASSMARKampFOR=1BEncoder

A1

B9

C5

E5

1B

5B

31

6D

MARK

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (46)

A1 B9 C5 E5 1B

5B

31

6D

MARK

Internal data

PASSUNMARKBIG5 Decoder

0126

05

01

9A

5A

1B

5B

31

6D

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (5/6)

Internal data (input): 012605 019A5A 1B5B316D
Stage: AMBIGUOUS-PAD (inter-conversion)
Internal data (output): 012605 01A0 019A5A 1B5B316D
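
The padding done by AMBIGUOUS-PAD can be approximated in Python with the standard `unicodedata` module. This is a sketch under assumptions, not bsdconv's code: the function name `ambiguous_pad` is made up, and the rule "append U+00A0 after every character whose East Asian Width is A" is inferred from the internal data shown on the slide.

```python
import unicodedata

NBSP = "\u00a0"

def ambiguous_pad(text: str) -> str:
    """Append a no-break space after each ambiguous-width character."""
    out = []
    for ch in text:
        out.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            out.append(NBSP)
    return "".join(out)

padded = ambiguous_pad("\u2605\u9a5a")  # U+2605 and U+9A5A, the two characters from the slides
print([f"U+{ord(c):04X}" for c in padded])
```

The padding lets a terminal that renders ambiguous characters as narrow still reserve two columns for them.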

Bug5 explained (6/6)

Internal data (input): 012605 01A0 019A5A 1B5B316D
Stage: UTF-8,PASS#FOR=1B (encoder)
Output (UTF-8 literal): ★ 驚\x1B[1m (the space after ★ is the padded NBSP)

Bsdconv Bindings

  • Python: https://pypi.python.org/pypi/bsdconv
  • Ruby: https://rubygems.org/gems/ruby-bsdconv
  • Go: https://github.com/buganini/go-bsdconv
  • Perl: https://github.com/buganini/perl-bsdconv
  • PHP: https://github.com/buganini/php-bsdconv
  • PostgreSQL: https://github.com/buganini/postgres-bsdconv
  • MySQL: https://github.com/buganini/mysql-udf-bsdconv
  • Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPEUTF-8PASSFOR=UNICODE&MARKBYTE|PASSUNMARKUTF-8NFCASCIIESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other
Page 24: Journey of Bsdconv

Normalization for fuzzy matching

UTF-8:UPPER:UTF-8
Input: aăⅷDžбᾥ
Output: AĂⅧDŽБᾭ

UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăDž⁹灣湾ド[U+2FA0A]鬒 æß
Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss

               Composition  Decomposition
Canonical      NFC          NFD
Compatibility  NFKC         NFKD
Table: The four Unicode normalization forms and the transformations
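
The compatibility-plus-case-folding part of this pipeline can be tried with Python's standard library. This is a sketch only: it uses `unicodedata.normalize` and `str.casefold` rather than bsdconv, the function name `nfkd_casefold` is made up, and it follows the order NFD, case fold, NFKD, case fold, NFKD that the deck lists for the NFKD-CASEFOLD alias. The ZH-FUZZY-TW and KANA-PHONETIC steps are not reproduced.

```python
import unicodedata

def nfkd_casefold(text: str) -> str:
    """NFD -> case fold -> NFKD -> case fold -> NFKD."""
    step = unicodedata.normalize("NFD", text)
    step = unicodedata.normalize("NFKD", step.casefold())
    return unicodedata.normalize("NFKD", step.casefold())

for sample in ["\u210c\u210d", "\u2079", "\u00e6\u00df", "\u00bc"]:
    print(repr(sample), "->", repr(nfkd_casefold(sample)))
```

Two strings can then be compared for fuzzy equality by comparing their folded forms.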

Bsdconv Cascade

Re-encode: UTF-8:ISO-8859-1|BIG5:UTF-8
(from UTF-8, to ISO-8859-1 | from BIG5, to UTF-8)
Figure: Cascaded conversions

Input: ¥x¥_
Output: 台北
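
The same two-step repair can be reproduced with Python's built-in codecs, as a rough analogue of the cascade rather than a use of bsdconv itself. The variable names are illustrative.

```python
# Text that was Big5 but got decoded as ISO-8859-1 somewhere along the way.
mojibake = "\u00a5x\u00a5_"

# Step 1: go back to the original bytes (UTF-8 text -> ISO-8859-1 bytes).
raw = mojibake.encode("iso-8859-1")

# Step 2: decode those bytes with the encoding they were really in (Big5).
fixed = raw.decode("big5")

print(raw.hex().upper(), fixed)
```

The intermediate bytes are A5 78 A5 5F, which is the Big5 encoding of the two characters in the slide's Output column.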

Bsdconv Codec argument

Other than question mark: UTF-8,ANY#0121:ASCII,ANY#21
(from: UTF-8,ANY#0121; to: ASCII,ANY#21)
Figure: Codec argument

Or more than one character: UTF-8,ANY#013F.0121:ASCII,ANY#21
(from: UTF-8,ANY#013F.0121; to: ASCII,ANY#21)
Figure: Data list separated by dot
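
For readers more familiar with Python, a comparable "substitute something other than a question mark" behaviour can be sketched with a custom codec error handler. This is only an analogy to the to-side ANY fallback: the handler name `bang` is made up, and nothing here calls bsdconv.

```python
import codecs

def bang(error):
    """Error handler: emit one "!" per character that cannot be encoded."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    return ("!" * (error.end - error.start), error.end)

codecs.register_error("bang", bang)

encoded = "na\u00efve \u53f0".encode("ascii", errors="bang")
print(encoded)
```

Python's built-in `errors="replace"` would produce "?" instead, which corresponds to the default fallback the slide is moving away from.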

Bsdconv Alias

Phase   Alias          Expansion
from    3F             ANY#013F&ERROR
to      3F             ANY#3F&ERROR
from    UTF-8          ASCII,_UTF-8
inter   NFKD           _NFKD,_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter   NFKC           NFKD:_NFC,_NF-HANGUL-COMPOSITION
inter   NFKD-CASEFOLD  NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter  01             UNICODE

Bsdconv Types

  • (01) Unicode
  • (02) CNS11643
  • (03) Byte
  • (04) Chinese components
  • (1B) ANSI control sequences
  • (00) Bsdconv special characters
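
The type codes make the "Internal data" hex strings in the walkthrough slides easier to read. The sketch below assumes, based only on those examples, that each unit is one type byte followed by its payload; the helper name `describe` and the returned tuple shape are illustrative, not part of bsdconv.

```python
TYPES = {
    0x00: "Bsdconv special characters",
    0x01: "Unicode",
    0x02: "CNS11643",
    0x03: "Byte",
    0x04: "Chinese components",
    0x1B: "ANSI control sequences",
}

def describe(unit_hex: str):
    """Split a hex-encoded unit into (type name, payload description)."""
    raw = bytes.fromhex(unit_hex)
    kind = TYPES.get(raw[0], "unknown")
    payload = raw[1:]
    if raw[0] == 0x01:
        # Unicode payload: show it as a code point.
        return kind, "U+%04X" % int.from_bytes(payload, "big")
    return kind, payload.hex().upper()

for unit in ["012605", "03A1", "1B5B316D"]:
    print(unit, describe(unit))
```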

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input: 功夫不好不要艹我
Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input: 功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input: 功夫不好不要艹我
Output: pu nao yao [uh]2

Bsdconv Flags

  • FREE: memory management
  • MARK: identifier

Look-through (1/4)

Input (UTF-8 literal): \u03B1 followed by the bytes CE B2
Stage: ESCAPE (decoder)
Internal data (output): 0103B1 03CE 03B2

Entity  Unicode  UTF-8 Hex
α       U+03B1   CEB1
β       U+03B2   CEB2

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

Look-through (2/4)

Internal data (input): 0103B1 03CE 03B2
Stage: PASS#MARK&FOR=1,BYTE (encoder)
Internal data (output): 0103B1 (MARK) CE B2

Look-through (3/4)

Internal data (input): 0103B1 (MARK) CE B2
Stage: PASS#UNMARK,UTF-8 (decoder)
Internal data (output): 0103B1 0103B2

Look-through (4/4)

Internal data (input): 0103B1 0103B2
Stage: UTF-8 (encoder)
Internal data (output): CE B1 ("α") CE B2 ("β")
Output (UTF-8 literal): αβ
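
The effect of the look-through walkthrough (escape sequences and raw UTF-8 bytes mixed in one input, both ending up as text) can be sketched in plain Python. The function name `look_through` and the restriction to four-digit `\uXXXX` escapes are assumptions for illustration; bsdconv's ESCAPE codec is not used here.

```python
import re

ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")

def look_through(raw: bytes) -> str:
    """Decode \\uXXXX escapes and let everything else through as UTF-8."""
    out = []
    pos = 0
    for m in ESCAPE.finditer(raw):
        out.append(raw[pos:m.start()].decode("utf-8"))  # bytes between escapes
        out.append(chr(int(m.group(1), 16)))            # the escaped code point
        pos = m.end()
    out.append(raw[pos:].decode("utf-8"))               # trailing bytes
    return "".join(out)

print(look_through(b"\\u03B1\xce\xb2"))
```

As in the slides, the escaped code point bypasses the byte-level UTF-8 decoding while the raw bytes CE B2 are decoded normally.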

Unicode East Asian Width (1/2)
UAX #11

Figure: Venn Diagram Showing the Set Relations for Six Categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)

Unicode East Asian Width (2/2)
UAX #11

Character  Class  Resolved width
ऊ          N      Narrow
A          Na     Narrow
ｶ          H      Narrow
Я          A      Ambiguous
Ａ         F      Wide
カ         W      Wide
咦         W      Wide
Table: Examples for Each Character Class and Their Resolved Widths

  • Na: Narrow
  • N: Neutral (usually treated as Narrow)
  • W: Wide
  • F: Fullwidth
  • H: Halfwidth
Table: Width attributes

String width measurement

echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
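
The three counters can be approximated with Python's `unicodedata.east_asian_width`. This is a sketch, not bsdconv's WIDTH codec: the function name `width_counts` and the bucketing (W and F as FULL, A as AMBI, everything else as HALF) are assumptions chosen to match the categories named on the slide.

```python
import unicodedata
from collections import Counter

def width_counts(text: str) -> Counter:
    counts = Counter(FULL=0, HALF=0, AMBI=0)
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        else:  # Na, H, N
            counts["HALF"] += 1
    return counts

# Same string as the slide; U+02CA and U+02CB are the two tone marks.
counts = width_counts("42(\u02ca_>\u02cb) \u7d05\u8336")
print(dict(counts))
```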

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR × $COUNT × 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0
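
One ingredient of this kind of detection, counting invalid sequences per candidate encoding (the IERR column), can be sketched with Python's built-in codecs. This is not chiconv's scorer: it ignores SCORE, COUNT and the bonus codecs, the function name `invalid_counts` is made up, and Python's codecs may count errors differently from bsdconv.

```python
def invalid_counts(data: bytes, encodings):
    """Decode with each candidate and count the replacement characters produced."""
    result = {}
    for name in encodings:
        decoded = data.decode(name, errors="replace")
        result[name] = decoded.count("\ufffd")
    return result

sample = "\u5e25\u5446\u4e86".encode("utf-8")  # the three characters from the slide
errors = invalid_counts(sample, ["utf-8", "big5", "gbk", "utf-16-le"])
best = min(errors, key=errors.get)
print(errors, best)
```

A real detector needs more than the error count, since many byte strings decode without errors in several encodings; that is what the score and bonus columns address.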

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  • Encoding without a registered name, bound to fonts
  • Stored in CP1252 or UTF-8

Solution: two-pass detection
  • Detect encoding
  • Detect font family (currently not working) (high coverage in SBCS)

Algorithm ported from Khmer Converter [3]

Khmer Converter
  • Mapping
  • Reordering: visual order vs Unicode model
  • Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Encoding Big5

Many incompatible variations (abusing PUA); no standard tool can rule them all.
http://moztw.org/docs/big5

Scenario    Dominating encoding
Microsoft   CP950
Taiwan BBS  UAO (Unicode-at-Once)
gov.tw      Big5-2003
gov.hk      HKSCS (1999/2001/2004)

Special characters conflict: the second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.
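
The 0x5C case is easy to observe with Python's built-in `big5` codec. The three sample characters are a commonly cited example of this problem and were picked for illustration; they are not taken from the slides.

```python
for ch in "\u8a31\u529f\u84cb":
    raw = ch.encode("big5")
    # The trail byte of each character is 0x5C, the ASCII backslash.
    print(f"U+{ord(ch):04X}", raw.hex().upper(), raw[1] == 0x5C)

# Byte-oriented code that treats every 0x5C as an escape character will
# therefore misread the second half of these characters.
has_backslash = b"\\" in "\u529f".encode("big5")
print(has_backslash)
```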

Bsdconv Alias

from3FANY013FampERROR

to3FANY3FampERROR

fromUTF-8ASCII_UTF-8

interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER

interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION

interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD

filter01UNICODE

Charset amp Encoding

Unicode (32bits addr space)

Unicode up to U+10FFFF

Unicode BMP (up to U+FFFF)

(Basic Multilingual Plane)

GB18030

CNS11643

CP950

Latin1Figure Character Sets

Bsdconv Types

(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters

Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw

Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding

FontPronunciation ㄇㄥ ˊ meacuteng

Radical 艸Component 艹日月

StrokeTraSim mapping 萌蕄

Table Examples for some information provided by 全字庫 for「萌」

Chinese components compositionhttpsgithubcombuganinichicomp

UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我

Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8

Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ

UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8

Input 功夫不好不要艹我Output pu nao yao [uh]2

Bsdconv Flags

FREE - memory managementMARK - identifier

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Look-through (14)

u03B1CEB2Input (UTF-8 literal)

ESCAPE Decoder

01

03

B1

03

CE

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (24)

01

03

B1

03CE

03B2

Internal data

PASSMARKampFOR=1BYTEEncoder

01

03

B1

MARK

CE

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (34)

0103

B1

MARK

CE B2

Internal data

PASSUNMARKUTF-8 Decoder

01

03

B1

01

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (44)

0103

B1

01

03

B2

Internal data

UTF-8

Encoder

CE

B1

rdquoαrdquoCE

B2

rdquoβrdquo

Internal data

αβ

Output (UTF-8 literal)

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

String width measurement

echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2

Chinese charset encoding detectionhttpsgithubcombuganinichiconv

ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001

$COUNT

帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)

UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40

UTF-16LE 20 5 2 00

Khmer legacy font converterhttpsgithubcombuganinikhmerconv

IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8

SolutionTwo pass detection

Detect encodingDetect font family (currently not working)(High converage in SBCS)

Algorithm ported from Khmer Converter3

Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]

3httpwwwkhmerosinfoenkhmer-converter

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

Terminal transcodinghttpsgithubcombuganinibug5

IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help

Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained (1/6 to 6/6)

Conversion: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

Internal data is shown in hex, one bracketed unit per datum; the first byte of each unit is its type (01 Unicode, 03 Byte, 1B ANSI control sequence).

(1/6) ANSI-CONTROL,BYTE (decoder)
  Input (Big5 literal): ⋆\xC5\x1B[1m\xE5
  Internal data:        [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]

(2/6) BIG5-DEFRAG (inter-conversion)
  Internal data:        [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]

(3/6) BYTE,PASS#MARK&FOR=1B (encoder)
  Internal data:        A1 B9 C5 E5 [1B 5B 31 6D MARK]

(4/6) PASS#UNMARK,BIG5 (decoder)
  Internal data:        [01 26 05] [01 9A 5A] [1B 5B 31 6D]

(5/6) AMBIGUOUS-PAD (inter-conversion)
  Internal data:        [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]

(6/6) UTF-8,PASS#FOR=1B (encoder)
  Output (UTF-8 literal): ⋆ 驚 \x1B[1m

  Entity  Unicode  UTF-8 Hex  Big5 Hex
  ⋆       U+2605   E29885     A1B9
  驚      U+9A5A   E9A99A     C5E5
  [       U+005B   5B         5B
  1       U+0031   31         31
  m       U+006D   6D         6D
  (NBSP)  U+00A0   C2A0       -
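The defragmentation step (2/6) can be illustrated outside bsdconv. The sketch below is not bsdconv's implementation: `defrag_big5` is a hypothetical helper, it assumes Big5 lead bytes fall in 0x81 to 0xFE, and it only recognises CSI sequences of the form ESC [ ... final byte. It moves any control sequence found between a lead byte and its trail byte to just after the rejoined character.

```python
import re

# CSI sequence: ESC [ followed by parameter/intermediate bytes and one
# final byte in the range 0x40-0x7E (assumption: no other escape types).
CSI = re.compile(rb"\x1b\[[\x20-\x3f]*[\x40-\x7e]")


def defrag_big5(data: bytes) -> bytes:
    """Rejoin Big5 byte pairs that were split by ANSI control sequences."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0xFE:          # looks like a Big5 lead byte
            j = i + 1
            held = bytearray()            # sequences found inside the pair
            while True:
                m = CSI.match(data, j)
                if m is None:
                    break
                held += m.group()
                j = m.end()
            out.append(lead)
            if j < len(data):
                out.append(data[j])       # trail byte, now adjacent again
                j += 1
            out += held                   # control sequences go after it
            i = j
        else:
            out.append(lead)              # ASCII and stray bytes pass through
            i += 1
    return bytes(out)


if __name__ == "__main__":
    raw = b"\xa1\xb9\xc5\x1b[1m\xe5"
    print(defrag_big5(raw).hex(" "))
```

With the input from step 1/6 the bytes come out in the same order as the internal data after step 2/6, so an ordinary Big5 decoder can then handle the character bytes.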

Bsdconv Bindings

  • Python: https://pypi.python.org/pypi/bsdconv
  • Ruby: https://rubygems.org/gems/ruby-bsdconv
  • Go: https://github.com/buganini/go-bsdconv
  • Perl: https://github.com/buganini/perl-bsdconv
  • PHP: https://github.com/buganini/php-bsdconv
  • PostgreSQL: https://github.com/buganini/postgres-bsdconv
  • MySQL: https://github.com/buganini/mysql-udf-bsdconv
  • Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other

Bsdconv Types

  • (01) Unicode
  • (02) CNS11643
  • (03) Byte
  • (04) Chinese components
  • (1B) ANSI control sequences
  • (00) Bsdconv special characters

Encoding CNS11643 (全字庫) issue
http://www.cns11643.gov.tw

  • Only used by Taiwan government
  • NOT a subset of Unicode
  • Not just a charset/encoding

  Font
  Pronunciation     ㄇㄥˊ méng
  Radical           艸
  Component         艹日月
  Stroke
  Tra/Sim mapping   萌蕄

Table: Examples of information provided by 全字庫 for「萌」

Chinese components composition
https://github.com/buganini/chicomp

UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
  Input:  功夫不好不要艹我
  Output: 巭孬嫑莪

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
  Input:  功夫不好不要艹我
  Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ

UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
  Input:  功夫不好不要艹我
  Output: pu nao yao [uh]2

Bsdconv Flags

  • FREE - memory management
  • MARK - identifier

Bsdconv Cascade

Re-encode: UTF-8:ISO-8859-1 | BIG5:UTF-8

Figure: Cascaded conversions (each side of "|" is one from:to conversion)

  Input:  ¥x¥_
  Output: 台北
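The same two-stage repair can be reproduced with Python's built-in codecs. This is only an illustration of what the cascade does to the data, not a use of bsdconv itself; `reencode` is a hypothetical helper name.

```python
def reencode(text: str, wrong: str = "iso-8859-1", right: str = "big5") -> str:
    """Undo mojibake: recover the original bytes, then decode them correctly.

    Stage 1 mirrors UTF-8:ISO-8859-1 (text back to the bytes it was read from),
    stage 2 mirrors BIG5:UTF-8 (those bytes interpreted as Big5).
    """
    raw = text.encode(wrong)      # "¥x¥_" -> A5 78 A5 5F
    return raw.decode(right)      # A5 78 A5 5F -> two Big5 characters


if __name__ == "__main__":
    print(reencode("\u00a5x\u00a5_"))
```

The input looks like yen signs because 0xA5, the lead byte of both characters, is "¥" in ISO-8859-1, while the trail bytes 0x78 and 0x5F are plain ASCII "x" and "_".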

Look-through (1/4 to 4/4)

Conversion: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

Internal data is shown in hex, one bracketed unit per datum; the first byte of each unit is its type (01 Unicode, 03 Byte).

(1/4) ESCAPE (decoder)
  Input (UTF-8 literal): \u03B1 followed by the bytes CE B2
  Internal data:         [01 03 B1] [03 CE] [03 B2]

(2/4) PASS#MARK&FOR=1,BYTE (encoder)
  Internal data:         [01 03 B1 MARK] CE B2

(3/4) PASS#UNMARK,UTF-8 (decoder)
  Internal data:         [01 03 B1] [01 03 B2]

(4/4) UTF-8 (encoder)
  Internal data:         CE B1 ("α") CE B2 ("β")
  Output (UTF-8 literal): αβ

  Entity  Unicode  UTF-8 Hex
  α       U+03B1   CEB1
  β       U+03B2   CEB2
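The look-through idea, decoding escapes first while letting everything else fall through to a second decoder, can be sketched in a few lines. This is an approximation written for illustration, not bsdconv's code: the type bytes 0x01 and 0x03 follow the Bsdconv Types list, while `escape_decode`, `look_through`, and the restriction to four-digit `\uXXXX` escapes are assumptions.

```python
import re

TYPE_UNICODE = 0x01   # already-decoded code point
TYPE_BYTE = 0x03      # raw byte that still needs decoding

ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def escape_decode(data: bytes):
    """Stage 1: turn \\uXXXX escapes into Unicode data, pass other bytes on."""
    pos = 0
    for m in ESCAPE.finditer(data):
        for b in data[pos:m.start()]:
            yield (TYPE_BYTE, b)
        yield (TYPE_UNICODE, int(m.group(1), 16))
        pos = m.end()
    for b in data[pos:]:
        yield (TYPE_BYTE, b)


def look_through(data: bytes) -> str:
    """Stage 2: decode the remaining bytes as UTF-8, keep Unicode data as is."""
    out = []
    pending = bytearray()
    for kind, value in escape_decode(data):
        if kind == TYPE_UNICODE:
            out.append(pending.decode("utf-8"))   # flush bytes seen so far
            pending.clear()
            out.append(chr(value))
        else:
            pending.append(value)
    out.append(pending.decode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(look_through(b"\\u03B1\xce\xb2"))
```

The marked pass-through in the slides plays the role of the `TYPE_UNICODE` branch here: data that is already Unicode skips the second decoder untouched.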

Unicode East Asian Width (1/2)
UAX #11

Figure: Venn Diagram Showing the Set Relations for Six Categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)


String width measurement

echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
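The three counters can be approximated with the Unicode database shipped with Python. This sketch is an assumption about how the classes are grouped (W and F as FULL, A as AMBI, everything else as HALF) and does not claim to match bsdconv's WIDTH codec for control or combining characters; `measure` is a hypothetical helper.

```python
import unicodedata


def measure(text: str) -> dict:
    """Count characters by resolved width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        cls = unicodedata.east_asian_width(ch)
        if cls in ("W", "F"):
            counts["FULL"] += 1
        elif cls == "A":
            counts["AMBI"] += 1
        else:                      # Na, H, N
            counts["HALF"] += 1
    return counts


if __name__ == "__main__":
    # Same string as the command above, with the non-ASCII characters escaped.
    print(measure("42(\u02ca_>\u02cb) \u7d05\u8336"))
```

A terminal that renders Ambiguous characters as wide would need 2 × (FULL + AMBI) + HALF columns for this string, and 2 × FULL + HALF + AMBI columns otherwise.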

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR × $COUNT × 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

  ENCODING  SCORE  COUNT  IERR  Score(s)
  UTF-8     19     4      0     4.75
  BIG5      8      3      2     -4.0
  GBK       4      1      4     -36.0
  CCCII     36     9      0     4.0
  UTF-16LE  20     5      2     0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  • Encodings without a registered name, bound to fonts
  • Stored in CP1252 or UTF-8

Solution
  • Two-pass detection
      1. Detect encoding
      2. Detect font family (currently not working) (high coverage in SBCS)
  • Algorithm ported from Khmer Converter [3]

Khmer Converter
  • Mapping
  • Reordering: visual order vs. Unicode model
  • Unicode model: base character [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Encoding Big5

Many incompatible variations (abusing PUA); none of the standard tools can rule them all.
http://moztw.org/docs/big5

  Scenario     Dominating encoding
  Microsoft    CP950
  Taiwan BBS   UAO (Unicode-at-Once)
  gov.tw       Big5-2003
  gov.hk       HKSCS (1999/2001/2004)

Special characters conflict: the second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.
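The conflict is easy to demonstrate with Python's built-in `big5` codec. The helper below is a hypothetical illustration (the name `risky_trail_bytes` is made up here): it lists characters whose second byte collides with one of the three ASCII characters named above.

```python
SPECIAL = {0x5C: "\\", 0x7C: "|", 0x7E: "~"}


def risky_trail_bytes(text: str, encoding: str = "big5"):
    """Return (char, hex bytes, colliding ASCII char) for risky characters."""
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) == 2 and raw[1] in SPECIAL:
            hits.append((ch, raw.hex().upper(), SPECIAL[raw[1]]))
    return hits


if __name__ == "__main__":
    # U+529F encodes to A5 5C in Big5, so its trail byte is a backslash.
    print(risky_trail_bytes("\u529f"))
```

Any byte-oriented tool that treats the backslash as an escape character (a C string literal, a shell, a SQL parser) can split such a character in half unless it is aware of the encoding.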

Page 32: Journey of Bsdconv

Bsdconv Flags

FREE - memory managementMARK - identifier

Bsdconv Cascade

Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8

from

to

from

toFigure Cascaded conversions

Input Outputyenxyen_ 台北

Look-through (14)

u03B1CEB2Input (UTF-8 literal)

ESCAPE Decoder

01

03

B1

03

CE

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (24)

01

03

B1

03CE

03B2

Internal data

PASSMARKampFOR=1BYTEEncoder

01

03

B1

MARK

CE

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (34)

0103

B1

MARK

CE B2

Internal data

PASSUNMARKUTF-8 Decoder

01

03

B1

01

03

B2

Internal data

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Look-through (44)

0103

B1

01

03

B2

Internal data

UTF-8

Encoder

CE

B1

rdquoαrdquoCE

B2

rdquoβrdquo

Internal data

αβ

Output (UTF-8 literal)

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Unicode East Asian Width (1/2), UAX #11

Figure: Venn Diagram Showing the Set Relations for Six Categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)

Unicode East Asian Width (2/2), UAX #11

Class  Example  Resolved width
N      ऊ        Narrow
Na     A        Narrow
H      ｶ        Narrow
A      Я        Ambiguous
F      Ａ       Wide
W      カ, 咦   Wide

Table: Examples for Each Character Class and Their Resolved Widths

Na  Narrow
N   Neutral (usually treated as Narrow)
W   Wide
F   Fullwidth
H   Halfwidth

Table: Width attributes

String width measurement

echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
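A comparable count can be sketched with Python's `unicodedata` module. This is an approximation under stated assumptions, not bsdconv's WIDTH counter: `measure_width` is a hypothetical name, Wide and Fullwidth are counted as FULL, Ambiguous as AMBI, and everything else (Narrow, Neutral, Halfwidth) as HALF.

```python
import unicodedata


def measure_width(text: str) -> dict:
    """Count characters per resolved East Asian Width bucket."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        else:  # "N", "Na" and "H" are all treated as narrow here
            counts["HALF"] += 1
    return counts


if __name__ == "__main__":
    # Same sample string as above; U+02CA and U+02CB are the two ambiguous characters.
    print(measure_width("42(\u02ca_>\u02cb) \u7d05\u8336"))
```

A terminal would typically budget two columns for each FULL character, one for each HALF character, and one or two for each AMBI character depending on its configuration.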

Chinese charset encoding detection
https://github.com/buganini/chiconv

ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……

ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  • Encodings have no registered names and are bound to fonts
  • Stored in CP1252 or UTF-8

Solution: two-pass detection
  • Detect encoding
  • Detect font family (currently not working) (high coverage in SBCS)

Algorithm ported from Khmer Converter [3]

Khmer Converter
  • Mapping
  • Reordering: visual order vs. Unicode model
  • Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Encoding: Big5

Many incompatible variations (abusing PUA); none of the standard tools can rule them all.
http://moztw.org/docs/big5/

Scenario     Dominating encoding
Microsoft    CP950
Taiwan BBS   UAO (Unicode-at-Once)
gov.tw       Big5-2003
gov.hk       HKSCS (1999/2001/2004)

Special characters conflict: the second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.
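The conflict is easy to observe with Python's built-in `big5` codec. The sketch below is illustrative only; `special_trail_bytes` is a hypothetical helper, and the sample characters are a commonly cited example of Big5 characters whose second byte is 0x5C.

```python
# Trail bytes that collide with ASCII characters having special meaning.
SPECIAL = {0x5C: "\\", 0x7C: "|", 0x7E: "~"}


def special_trail_bytes(text: str):
    """Return (char, Big5 hex, colliding ASCII char) for risky characters."""
    hits = []
    for ch in text:
        encoded = ch.encode("big5")
        if len(encoded) == 2 and encoded[1] in SPECIAL:
            hits.append((ch, encoded.hex().upper(), SPECIAL[encoded[1]]))
    return hits


if __name__ == "__main__":
    for ch, hexcode, ascii_ch in special_trail_bytes("\u8a31\u529f\u84cb"):
        print(ch, hexcode, "second byte looks like", ascii_ch)
```

Any byte-oriented tool that treats a backslash as an escape character (a shell, a C string literal, an SQL layer) can corrupt such characters unless it is aware of the encoding.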


Terminal transcoding
https://github.com/buganini/bug5

Issues
  • UAO: non-standard Big5 extension
  • Double color hack: ANSI control sequence in the middle of a DBCS character
  • Ambiguous-width characters
  • luit/screen cannot help

Solution (tl;dr)

Big5 to Unicode:
ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

Unicode to Big5:
UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F

Bug5 explained (1/6)

Input (Big5 literal): ★\xC5\x1B[1m\xE5

ANSI-CONTROL,BYTE Decoder

Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]

Bug5 explained (2/6)

Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]

BIG5-DEFRAG Inter-conversion

Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]

Bug5 explained (3/6)

Internal data: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]

BYTE,PASS#MARK&FOR=1B Encoder

Internal data: [A1] [B9] [C5] [E5] [1B 5B 31 6D MARK]

Bug5 explained (4/6)

Internal data: [A1 B9 C5 E5] [1B 5B 31 6D MARK]

PASS#UNMARK,BIG5 Decoder

Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]

Bug5 explained (5/6)

Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]

AMBIGUOUS-PAD Inter-conversion

Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]

Bug5 explained (6/6)

Internal data: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]

UTF-8,PASS#FOR=1B Encoder

Output (UTF-8 literal): ★(NBSP)驚\x1B[1m

Entity  Unicode  UTF-8 Hex  Big5 Hex
★       U+2605   E29885     A1B9
驚      U+9A5A   E9A99A     C5E5
[       U+005B   5B         5B
1       U+0031   31         31
m       U+006D   6D         6D
(NBSP)  U+00A0   C2A0       -

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
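The Big5-to-Unicode direction can be approximated in a few lines of Python. This is a rough sketch of the idea rather than bug5's code: all function names are hypothetical, only CSI-style sequences (ESC [ ... final byte) are recognised, and Python's plain `big5` codec is used, so UAO extensions are not covered.

```python
import re
import unicodedata

# A CSI control sequence: ESC [ parameters, then one final byte in '@'..'~'.
CSI = re.compile(rb"\x1b\[[0-9;]*[@-~]")


def big5_defrag(data: bytes) -> bytes:
    """Move control sequences that split a two-byte character to after it."""
    out, i = bytearray(), 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0xFE:  # Big5 lead byte
            j, held = i + 1, bytearray()
            while True:
                m = CSI.match(data, j)
                if m is None:
                    break
                held += m.group()  # sequence sitting between lead and trail byte
                j = m.end()
            if j < len(data):
                out += bytes([lead, data[j]])  # rejoin lead and trail byte
                j += 1
            else:
                out.append(lead)  # truncated input: keep the lone lead byte
            out += held
            i = j
        else:
            out.append(lead)
            i += 1
    return bytes(out)


def ambiguous_pad(text: str) -> str:
    """Append NBSP after every ambiguous-width character."""
    out = []
    for ch in text:
        out.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            out.append("\u00a0")
    return "".join(out)


def big5_to_unicode(data: bytes) -> str:
    return ambiguous_pad(big5_defrag(data).decode("big5"))


if __name__ == "__main__":
    # Same bytes as the slides: A1 B9, then C5, ESC [ 1 m, E5.
    print(big5_to_unicode(b"\xa1\xb9\xc5\x1b[1m\xe5").encode("utf-8").hex())
```

Padding ambiguous-width characters with NBSP lets a terminal that draws them one column wide keep the two-column layout a Big5 application expects.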

Bsdconv Bindings

  • Python: https://pypi.python.org/pypi/bsdconv
  • Ruby: https://rubygems.org/gems/ruby-bsdconv
  • Go: https://github.com/buganini/go-bsdconv
  • Perl: https://github.com/buganini/perl-bsdconv
  • PHP: https://github.com/buganini/php-bsdconv
  • PostgreSQL: https://github.com/buganini/postgres-bsdconv
  • MySQL: https://github.com/buganini/mysql-udf-bsdconv
  • Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other
Page 37: Journey of Bsdconv

Look-through (44)

0103

B1

01

03

B2

Internal data

UTF-8

Encoder

CE

B1

rdquoαrdquoCE

B2

rdquoβrdquo

Internal data

αβ

Output (UTF-8 literal)

Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2

Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

String width measurement

echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2

Chinese charset encoding detectionhttpsgithubcombuganinichiconv

ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001

$COUNT

帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)

UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40

UTF-16LE 20 5 2 00

Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  • Encoding without registered name, bound on fonts
  • Stored in CP1252 or UTF-8

Solution
  • Two-pass detection
  • Detect encoding
  • Detect font family (currently not working) (high coverage in SBCS)

Algorithm ported from Khmer Converter [3]

Khmer Converter
  • Mapping
  • Reordering: visual order vs. Unicode model
  • Unicode model: base character [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter

Encoding Big5

Many incompatible variations (abusing PUA); none of the standard tools can rule them all.
http://moztw.org/docs/big5/

Scenario     Dominating encoding
Microsoft    CP950
Taiwan BBS   UAO (Unicode-at-Once)
gov.tw       Big5-2003
gov.hk       HKSCS (1999/2001/2004)

Special characters conflict
The second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts.
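
The conflict can be seen with Python's built-in `big5` codec, which stands in here for whichever Big5 variant is in use. The sketch lists characters whose second byte collides with one of the three ASCII characters named above; the sample string and the helper name `risky_chars` are illustrative.

```python
# Trail bytes that collide with ASCII characters, as listed on the slide.
RISKY_TRAIL_BYTES = {0x5C: "\\", 0x7C: "|", 0x7E: "~"}


def risky_chars(text: str, encoding: str = "big5") -> list:
    """Return (char, hex bytes, colliding ASCII char) for risky characters."""
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) == 2 and raw[1] in RISKY_TRAIL_BYTES:
            hits.append((ch, raw.hex().upper(), RISKY_TRAIL_BYTES[raw[1]]))
    return hits


# Three well-known characters whose Big5 trail byte is 0x5C.
for ch, hexbytes, ascii_char in risky_chars("\u8a31\u529f\u84cb"):
    print(ch, hexbytes, ascii_char)
```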


Terminal transcoding
https://github.com/buganini/bug5

Issues
  • UAO: non-standard Big5 extension
  • Double color hack: ANSI control sequence in the middle of DBCS
  • Ambiguous width characters
  • luit/screen cannot help

Solution (tl;dr)
Big5 to Unicode:
ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5:
UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained (1/6)

Input (Big5 literal): ★\xC5\x1B[1m\xE5
ANSI-CONTROL,BYTE decoder
Internal data: 03 A1, 03 B9, 03 C5, 1B 5B 31 6D, 03 E5

Entity  Unicode  UTF-8 Hex  Big5 Hex
★       U+2605   E29885     A1B9
驚      U+9A5A   E9A99A     C5E5
[       U+005B   5B         5B
1       U+0031   31         31
m       U+006D   6D         6D
(NBSP)  U+00A0   C2A0       -

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
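
The decoder step on this slide can be imitated in a few lines of Python. This is a sketch under stated assumptions, not bsdconv's implementation: the type tags 03 (raw byte) and 1B (ANSI control sequence) are taken from the slide, while the regular expression is a simplified CSI matcher and `tag_chunks` is a hypothetical name.

```python
import re

# Simplified matcher for ANSI CSI sequences such as ESC [ 1 m (assumption).
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def tag_chunks(data: bytes) -> list:
    """Split raw input into typed chunks shown as hex strings.

    A control sequence becomes "1B" + the bytes after ESC;
    any other byte becomes "03" + that byte.
    """
    chunks, i = [], 0
    while i < len(data):
        m = CSI.match(data, i)
        if m:
            chunks.append("1B" + m.group(0)[1:].hex().upper())
            i = m.end()
        else:
            chunks.append("03" + data[i:i + 1].hex().upper())
            i += 1
    return chunks


# A1 B9 is the star; C5 ... E5 is one Big5 character split by ESC [ 1 m.
raw = bytes.fromhex("A1B9C5") + b"\x1b[1m" + bytes.fromhex("E5")
print(tag_chunks(raw))
```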

Bug5 explained (2/6)

Internal data: 03 A1, 03 B9, 03 C5, 1B 5B 31 6D, 03 E5
BIG5-DEFRAG inter-conversion
Internal data: 03 A1, 03 B9, 03 C5, 03 E5, 1B 5B 31 6D
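
The reordering done by this step can be sketched on raw bytes. The code below is an illustration only, not the BIG5-DEFRAG codec: it assumes lead bytes lie in 0x81 to 0xFE, uses a simplified CSI matcher, and moves any control sequence found between a lead byte and its trail byte to just after the pair. The name `big5_defrag` is made up.

```python
import re

# Simplified matcher for ANSI CSI sequences (assumption).
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def big5_defrag(data: bytes) -> bytes:
    """Move control sequences that split a double-byte character after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE:  # assumed Big5 lead-byte range
            j = i + 1
            held = bytearray()
            # Collect every control sequence sitting before the trail byte.
            while True:
                m = CSI.match(data, j)
                if not m:
                    break
                held += m.group(0)
                j = m.end()
            if j < len(data):
                out += bytes([b, data[j]])  # rejoin lead and trail byte
                out += held                 # then emit the held sequences
                i = j + 1
            else:  # truncated input: keep what is there, in order
                out.append(b)
                out += held
                i = j
        else:
            m = CSI.match(data, i)
            if m:
                out += m.group(0)
                i = m.end()
            else:
                out.append(b)
                i += 1
    return bytes(out)


fragmented = bytes.fromhex("A1B9C51B5B316DE5")
print(big5_defrag(fragmented).hex(" ").upper())
```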


Bug5 explained (3/6)

Internal data: 03 A1, 03 B9, 03 C5, 03 E5, 1B 5B 31 6D
BYTE,PASS#MARK&FOR=1B encoder
Internal data: A1 B9 C5 E5, 1B 5B 31 6D (MARK)


Bug5 explained (4/6)

Internal data: A1 B9 C5 E5, 1B 5B 31 6D (MARK)
PASS#UNMARK,BIG5 decoder
Internal data: 01 26 05, 01 9A 5A, 1B 5B 31 6D


Bug5 explained (5/6)

Internal data: 01 26 05, 01 9A 5A, 1B 5B 31 6D
AMBIGUOUS-PAD inter-conversion
Internal data: 01 26 05, 01 A0, 01 9A 5A, 1B 5B 31 6D
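
The padding step can be approximated with the standard `unicodedata` module. This sketch assumes the rule is "append U+00A0 after every character whose East Asian Width is Ambiguous", which matches the slide's example but is not taken from bsdconv's source; `ambiguous_pad` is a hypothetical name.

```python
import unicodedata

NBSP = "\u00a0"  # the padding character shown on the slide


def ambiguous_pad(text: str) -> str:
    """Append NBSP after each ambiguous-width character (assumed rule)."""
    out = []
    for ch in text:
        out.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            out.append(NBSP)
    return "".join(out)


# U+2605 (star, ambiguous width) followed by U+9A5A (wide).
padded = ambiguous_pad("\u2605\u9a5a")
print([f"U+{ord(c):04X}" for c in padded])
```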


Bug5 explained (6/6)

Internal data: 01 26 05, 01 A0, 01 9A 5A, 1B 5B 31 6D
UTF-8,PASS#FOR=1B encoder
Output (UTF-8 literal): ★ 驚 \x1B[1m


Page 43: Journey of Bsdconv

Encoding Big5

Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5

Scenario Dominating encodingMicrosoft CP950

Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)

Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context

Unicode East Asian Width (12)UAX11

Narrow

Halfwidth

Wide

Fullwidth

Ambiguous

Neutral

Figure Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (22)UAX11

Narrow Ambiguous WideЯ

N ऊNa A A FH カ カ W

咦 WTable Examples for Each Character Class and Their Resolved Widths

Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth

Table Width attributes

Terminal transcodinghttpsgithubcombuganinibug5

IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help

Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F

Bug5 explained (16)

⋆xC5x1B[1mxE5Input (Big5 literal)

ANSI-CONTROLBYTE Decoder

03

A1

03

B9

03

C5

1B

5B

31

6D

03

E5

Internal data

Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D

(NBSP) U+00A0 C2A0 -

Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B

Bug5 explained (26)

03

A1 03

B9 03

C5 1B

5B

31

6D

03E5

Internal data

BIG5-DEFRAG Inter-conversion

03

A1

03

B9

03

C5

03

E5

1B

5B

31

6D

Internal data

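The reordering done in the BIG5-DEFRAG step can be sketched on raw bytes: when an escape sequence sits between the lead and trail byte of a double-byte character, it is moved to just after the completed character. This is not bsdconv's implementation; the function name `big5_defrag`, the simplified CSI pattern, and the lead-byte range 0x81..0xFE are assumptions made for illustration.

```python
import re

# Simplified CSI escape sequence: ESC [ parameters final-letter.
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def big5_defrag(data: bytes) -> bytes:
    """Move escape sequences that split a double-byte character to after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        m = CSI.match(data, i)
        if m:  # escape sequence between characters: copy unchanged
            out += m.group()
            i = m.end()
            continue
        lead = data[i]
        i += 1
        if 0x81 <= lead <= 0xFE and i < len(data):
            held = bytearray()  # escapes found between lead and trail byte
            while True:
                m = CSI.match(data, i)
                if not m:
                    break
                held += m.group()
                i = m.end()
            if i < len(data):
                out += bytes([lead, data[i]])  # rejoin lead and trail
                i += 1
            else:
                out.append(lead)  # dangling lead byte at end of input
            out += held
        else:
            out.append(lead)  # single-byte character
    return bytes(out)


if __name__ == "__main__":
    print(big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5").hex(" "))
```

After this step the double-byte character is contiguous again, so an ordinary Big5 decoder can handle it while the escape sequence is passed through untouched.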

Bug5 explained (3/6)

Internal data: 03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D

BYTE,PASS#MARK&FOR=1B Encoder

Internal data: A1 B9 C5 E5 | 1B 5B 31 6D (MARK)


Bug5 explained (4/6)

Internal data: A1 B9 C5 E5 | 1B 5B 31 6D (MARK)

PASS#UNMARK,BIG5 Decoder

Internal data: 01 26 05 | 01 9A 5A | 1B 5B 31 6D


Bug5 explained (5/6)

Internal data: 01 26 05 | 01 9A 5A | 1B 5B 31 6D

AMBIGUOUS-PAD Inter-conversion

Internal data: 01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D

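The padding idea in the AMBIGUOUS-PAD step can be sketched as follows: append a no-break space (U+00A0) after every character whose East Asian Width class is Ambiguous, so that a terminal drawing it in one column still advances two. The function names are hypothetical, and using `unicodedata` as the width table is an assumption; bsdconv's own tables may classify some characters differently.

```python
import unicodedata

NBSP = "\u00a0"


def ambiguous_pad(text):
    """Insert NBSP after every East Asian Ambiguous-width character."""
    out = []
    for ch in text:
        out.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            out.append(NBSP)
    return "".join(out)


def ambiguous_unpad(text):
    """Reverse of ambiguous_pad: drop an NBSP that directly follows an Ambiguous character."""
    out = []
    prev_ambiguous = False
    for ch in text:
        if ch == NBSP and prev_ambiguous:
            prev_ambiguous = False
            continue
        out.append(ch)
        prev_ambiguous = unicodedata.east_asian_width(ch) == "A"
    return "".join(out)


if __name__ == "__main__":
    print(repr(ambiguous_pad("★驚\x1b[1m")))
```

The unpad direction corresponds to the AMBIGUOUS-UNPAD step on the Unicode-to-Big5 side, so text typed back into the terminal does not carry the padding into the Big5 stream.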

Bug5 explained (6/6)

Internal data: 01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D

UTF-8,PASS#FOR=1B Encoder

Output (UTF-8 literal): ★ 驚\x1B[1m (the space after ★ is the padded NBSP, U+00A0)


Bsdconv Bindings

  • Python: https://pypi.python.org/pypi/bsdconv
  • Ruby: https://rubygems.org/gems/ruby-bsdconv
  • Go: https://github.com/buganini/go-bsdconv
  • Perl: https://github.com/buganini/perl-bsdconv
  • PHP: https://github.com/buganini/php-bsdconv
  • PostgreSQL: https://github.com/buganini/postgres-bsdconv
  • MySQL: https://github.com/buganini/mysql-udf-bsdconv
  • Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
  • Text
  • File name
  • File content
  • Meta tag

Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|

https://github.com/buganini/bsdconv

  • Syntax
  • Alias
  • Types
  • Flags
  • Counter & Filter & Scorer
  • Bindings
  • Other