BSDCONV
Buganini Q
Since 2009
Charset & Encoding
Character Set: collection of characters
Encoding: binary representation
Charset & Encoding
Unicode (32-bit address space)
Unicode up to U+10FFFF
Unicode BMP (Basic Multilingual Plane, up to U+FFFF)
GB18030
CNS11643
CP950
Latin1
Figure: Character Sets
Charset & Encoding
Character set                     Encodings
Unicode (32-bit address space)    UTF-32, UCS4
Unicode up to U+10FFFF            UTF-8 [1], UTF-16
Unicode BMP (up to U+FFFF)        UCS2
GB18030                           GB18030
CNS11643                          CNS11643
CP950                             CP950 (DBCS)
Latin1                            ISO-8859-1, EBCDIC-037 [2]
[1] Could cover more, but restricted by RFC 3629.
[2] Aka IBM-37; some control characters differ from ISO-8859-1.
Encoding: UTF-32 / UCS4
Fixed length: 4 bytes
File size x4 for an ASCII text file
Incompatible with the C-style string convention (contains NUL bytes)
Endianness concern
Encoding: UCS2
Fixed length: 2 bytes
File size x2 for an ASCII text file
Incompatible with the C-style string convention (contains NUL bytes)
Endianness concern
BMP-only
Encoding: UTF-16
Variable length: 2 bytes or 4 bytes (surrogate pairs)
Surrogates use U+D800..U+DFFF
Incompatible with the C-style string convention
Endianness concern
High surrogate    110110xx xxxxxxxx
Low surrogate     110111xx xxxxxxxx
Table: UTF-16 Structure
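The surrogate layout in the table above can be sketched in a few lines of Python. This is only an illustration of the bit arithmetic; the function name `to_surrogate_pair` is made up for this example and is not part of bsdconv.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need surrogates")
    value = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 | (value >> 10)         # 110110 + upper 10 bits
    low = 0xDC00 | (value & 0x3FF)        # 110111 + lower 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
```

Both halves fall inside U+D800..U+DFFF, which is why that range is reserved and never assigned to characters.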
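Encoding: UTF-8
Variable length: 1~6 bytes
Compatible with the C-style string convention
Self-synchronizing
Endian-neutral
Sorting order = code point order
0xxxxxxx (ASCII)
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Table: UTF-8 Structure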
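A hand-rolled encoder makes the prefix structure in the table concrete. This is a sketch for illustration only: it follows the original 1~6 byte layout shown above, whereas RFC 3629 (and Python's own codec) stop at 4 bytes and U+10FFFF. The name `utf8_encode` is hypothetical.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the 1-6 byte UTF-8 bit layout."""
    if code_point < 0:
        raise ValueError("negative code point")
    if code_point < 0x80:
        return bytes([code_point])        # 0xxxxxxx (ASCII)
    # (exclusive upper bound, total length, leading-byte prefix)
    layouts = (
        (0x800, 2, 0xC0),                 # 110xxxxx
        (0x10000, 3, 0xE0),               # 1110xxxx
        (0x200000, 4, 0xF0),              # 11110xxx
        (0x4000000, 5, 0xF8),             # 111110xx
        (0x80000000, 6, 0xFC),            # 1111110x
    )
    for limit, length, lead in layouts:
        if code_point < limit:
            break
    else:
        raise ValueError("code point does not fit in 6 bytes")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))   # 10xxxxxx continuation
        code_point >>= 6
    tail.append(lead | code_point)                # leading byte
    return bytes(reversed(tail))


if __name__ == "__main__":
    print(utf8_encode(0x03B1).hex())      # bytes for U+03B1
```

Because continuation bytes always start with `10`, a decoder dropped into the middle of a stream can find the next character boundary, which is the self-synchronizing property listed above.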
Encoding: CNS11643 (全字庫) issue
http://www.cns11643.gov.tw
Only used by the Taiwan government
NOT a subset of Unicode
Not just a charset/encoding
Font
Pronunciation      ㄇㄥˊ méng
Radical            艸
Component          艹日月
Stroke
Tra/Sim mapping    萌 蕄
Table: Examples of some information provided by 全字庫 for「萌」
Encoding: CCCII
Variants: variant glyphs at different planes
Mostly used for library indexing
強    21 3D 48
彊    2D 3D 48
强    33 3D 48
Encoding: Big5
Many incompatible variations (abusing PUA); none of the standard tools can rule them all
http://moztw.org/docs/big5/
Scenario      Dominating encoding
Microsoft     CP950
Taiwan BBS    UAO (Unicode-at-Once)
gov.tw        Big5-2003
gov.hk        HKSCS (1999/2001/2004)
Special characters conflict: the second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts
Bsdconv: Decoding and Encoding
Alternative to iconv
ISO-8859-1:UTF-8
from: ISO-8859-1
to: UTF-8
Figure: Basic two-phase conversion
Bsdconv: Codecs & Fallback
Optionally produce a question mark (U+003F) as replacement
UTF-8,3F:ASCII,3F
from: UTF-8, 3F
to: ASCII, 3F
Figure: Fallback codec
Transliteration
UTF-8:CP936,CP936-TRANS,3F
from: UTF-8
to: CP936, CP936-TRANS, 3F
Figure: Multiple fallback codecs
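The idea of a fallback chain can be mimicked with Python's standard codecs. This is only an analogy, not bsdconv's implementation or API: each character is offered to a list of encoders in order, and a replacement byte is emitted when none of them accepts it. The helper name `encode_with_fallback` is made up for this sketch.

```python
def encode_with_fallback(text: str, codec_names, replacement: bytes = b"?") -> bytes:
    """Encode text character by character, trying each codec in order."""
    out = bytearray()
    for ch in text:
        for name in codec_names:
            try:
                out += ch.encode(name)
                break                      # first codec that succeeds wins
            except UnicodeEncodeError:
                continue                   # fall through to the next codec
        else:
            out += replacement             # nothing matched: emit "?"
    return bytes(out)


if __name__ == "__main__":
    print(encode_with_fallback("a\u00e9", ["ascii"]))
    print(encode_with_fallback("a\u00e9", ["ascii", "latin-1"]))
```

With only `ascii` in the chain the accented letter degrades to `?`; adding a second codec behind it rescues the character, which is the role the extra codecs play in the figure above.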
Big5 5C issue (許功蓋: characters whose second byte is 0x5C)
BIG5:BIG5-5C,BIG5
                Input                 Output
Big5 literal    "成功"                "成功\"
ASCII/hex       "\xA6\xA8\xA5\\"      "\xA6\xA8\xA5\\\\"
BIG5-5C,BIG5:BIG5
                Input                 Output
Big5 literal    "成功\"               "成功"
ASCII/hex       "\xA6\xA8\xA5\\\\"    "\xA6\xA8\xA5\\"
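The 0x5C problem is easy to reproduce with Python's built-in `big5` codec. The sketch below finds trail bytes equal to 0x5C and doubles them, which is the effect the tables above show; it is an illustration written for this example (the function name and the simplified lead-byte range 0x81..0xFE are assumptions), not bsdconv's BIG5-5C codec.

```python
BACKSLASH = 0x5C


def escape_trail_5c(data: bytes) -> bytes:
    """Double every 0x5C that appears as the second byte of a Big5 pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if 0x81 <= byte <= 0xFE and i + 1 < len(data):
            trail = data[i + 1]
            out += bytes([byte, trail])
            if trail == BACKSLASH:
                out.append(BACKSLASH)      # protect the trail byte from escape processing
            i += 2
        else:
            out.append(byte)               # single-byte (ASCII) character
            i += 1
    return bytes(out)


if __name__ == "__main__":
    raw = "\u6210\u529f".encode("big5")    # the two characters from the table
    print(raw.hex(" "), "->", escape_trail_5c(raw).hex(" "))
```

A plain ASCII backslash is left alone; only a 0x5C that sits in the trail position is doubled, so the result survives tools that treat backslash as an escape character.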
Traditional/Simplified Chinese
NOT a one-to-one mapping
Traditional:  乾 幹 干
vs
Simplified:   干 干 干
Context dependent: 之後 (after), 夜之后 (Queen of the Night), 入夜之後 (after nightfall)
Variants: 峰 峯
Project: Chvar (1/2)
https://github.com/buganini/chvar
Canonical group: 签 簽
Canonical group: 籖 籤
Compatibility group: 签 簽 籖 籤
Figure: Two-level grouping in Chvar
          签   簽   籖   籤
TW        簽   -    籤   -
CN        -    签   -    籖
CP950     簽   -    籤   -
GB2312    -    签   ×    ×
Table: Canonical Group
          签   簽   籖   籤
TW        簽   -    簽   簽
CN        -    签   签   签
CP950     簽   -    簽   簽
GB2312    -    签   签   签
Table: Compatibility Group
Project: Chvar (2/2)
https://github.com/buganini/chvar
Normalization: Canonical Equivalence
Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
Fuzzy character matching: Compatibility Equivalence
Bsdconv: Phases
Traditional Chinese ⇔ Simplified Chinese
UTF-8:ZHTW:UTF-8
from: UTF-8
inter: ZHTW
to: UTF-8
Figure: Conversion with inter-mapping phase
Bsdconv: Phases
Furthermore, phrase mapping
UTF-8:ZHTW:ZHTW-WORDS:UTF-8
from: UTF-8
inter: ZHTW
inter: ZHTW-WORDS
to: UTF-8
Figure: Conversion with multiple inter-mapping phases
Unicode: Casing
IS complicated
Lowercase    Uppercase
a            A
i            I
Table: English
Lowercase    Uppercase
ı            I
i            İ
Table: Turkic
Lowercase    Uppercase
a            A
à            A
Table: French
Lowercase    Uppercase
σ            Σ
ς            Σ
Table: Greek
Default Case Folding
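Python's `str` methods implement the locale-independent default mappings, so they give a quick way to see the cases in the tables above. This is a small sketch using only the standard library; `fold_equal` is a helper name chosen for this example.

```python
def fold_equal(a: str, b: str) -> bool:
    """Caseless comparison using Unicode default case folding."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # Greek: both sigma forms share one uppercase letter.
    print("\u03c3".upper() == "\u03c2".upper())
    # Default (non-Turkic) casing: dotted and dotless i both map to "I".
    print("i".upper(), "\u0131".upper())
    # Case folding can change length: sharp s folds to "ss".
    print(fold_equal("Stra\u00dfe", "STRASSE"))
```

Because these mappings ignore locale, the Turkic pairs from the table (ı/I and i/İ) are not reproduced; handling them requires locale-specific tailoring on top of the defaults.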
Unicode: Normalization Forms (1/2)
UAX #15
Indexing
Identification, security: username, domain name
Combining sequence             Ç    C + U+0327
Ordering of combining marks    q + U+0307 + U+0323    q + U+0323 + U+0307
Hangul                         가    ᄀ + ᅡ
Singleton                      Ω (U+2126)    Ω (U+03A9)
Table: Canonical Equivalence
Unicode: Normalization Forms (2/2)
UAX #15
Font variants              ℌ    H
Breaking differences       NBSP    SP
Cursive forms              ن    ن
Circled                    ①    1
Width, size, rotated       ｶ    カ;  ︷    {
Superscripts/subscripts    ⁹    9
Squared characters         ㍿    株 + 式 + 会 + 社
Fractions                  ¾    3 + ⁄ + 4
Others                     ǆ    d + z + U+030C
Table: Compatibility Equivalence
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input:  aăⅷDžбᾥ
Output: AĂⅧDŽБᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input:  ¼ℌℍăDž⁹灣湾ド鬒鬒 æß
Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss
                 Composition    Decomposition
Canonical        NFC            NFD
Compatibility    NFKC           NFKD
Table: The four Unicode normalization forms and the transformations
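The four forms in the table map directly onto `unicodedata.normalize` in the Python standard library. A minimal sketch follows; `all_forms` is a convenience name invented for this example.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(text: str) -> dict:
    """Return the text normalized to each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def show(text: str) -> None:
    """Print the code points produced by each normalization form."""
    for form, value in all_forms(text).items():
        points = " ".join(f"U+{ord(ch):04X}" for ch in value)
        print(f"{form:5} {points}")


if __name__ == "__main__":
    show("\u00c7")      # precomposed C with cedilla: canonical forms differ
    show("\u2460")      # circled digit one: only the K forms change it
```

Canonical forms (NFC/NFD) only change how a character is spelled; the compatibility forms (NFKC/NFKD) also drop formatting distinctions such as circles or width, which is why they suit fuzzy matching but are lossy.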
Bsdconv: Cascade
Re-encode
UTF-8:ISO-8859-1|BIG5:UTF-8
from: UTF-8
to: ISO-8859-1
from: BIG5
to: UTF-8
Figure: Cascaded conversions
Input    Output
¥x¥_     台北
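The same repair can be reproduced with Python's standard codecs, which shows what the two cascaded conversions do: first recover the original bytes by encoding as ISO-8859-1, then decode those bytes with the encoding they were really in. This is an equivalent sketch, not a bsdconv call; `reinterpret` is a name chosen for this example.

```python
def reinterpret(text: str, wrong: str = "iso-8859-1", actual: str = "big5") -> str:
    """Undo mojibake: text was decoded as `wrong` but the bytes were `actual`."""
    raw = text.encode(wrong)     # step 1: back to the original byte sequence
    return raw.decode(actual)    # step 2: decode with the correct encoding


if __name__ == "__main__":
    garbled = "\u00a5x\u00a5_"   # the input shown on the slide
    print(reinterpret(garbled))
```

ISO-8859-1 works as the intermediate step because it maps every byte value 0x00..0xFF to exactly one character, so step 1 loses nothing.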
Bsdconv: Codec argument
Other than a question mark
UTF-8,ANY#0121:ASCII,ANY#21
from: UTF-8, ANY#0121
to: ASCII, ANY#21
Figure: Codec argument
Or more than one character
UTF-8,ANY#013F.0121:ASCII,ANY#21
from: UTF-8, ANY#013F.0121
to: ASCII, ANY#21
Figure: Data list separated by dot
Bsdconv: Alias
Phase     Alias            Expansion
from      3F               ANY#013F&ERROR
to        3F               ANY#3F&ERROR
from      UTF-8            ASCII,_UTF-8
inter     NFKD             _NFKD _NF-HANGUL-DECOMPOSITION _NF-ORDER
inter     NFKC             NFKD _NFC _NF-HANGUL-COMPOSITION
inter     NFKD-CASEFOLD    NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter    01               UNICODE
Bsdconv: Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input:  功夫不好不要艹我
Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input:  功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input:  功夫不好不要艹我
Output: pu nao yao [uh]2
Bsdconv: Flags
FREE - memory management
MARK - identifier
Look-through (1/4)
Input (UTF-8 literal): \u03B1 followed by bytes CE B2
Decoder: ESCAPE
Internal data (output): [01 03 B1] [03 CE] [03 B2]
Entity    Unicode    UTF-8 hex
α         U+03B1     CEB1
β         U+03B2     CEB2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (2/4)
Internal data (input): [01 03 B1] [03 CE] [03 B2]
Encoder: PASS#MARK&FOR=1,BYTE
Internal data (output): [01 03 B1 MARK] [CE] [B2]
Look-through (3/4)
Internal data (input): [01 03 B1 MARK] [CE B2]
Decoder: PASS#UNMARK,UTF-8
Internal data (output): [01 03 B1] [01 03 B2]
Look-through (4/4)
Internal data (input): [01 03 B1] [01 03 B2]
Encoder: UTF-8
Internal data (output): [CE B1] "α" [CE B2] "β"
Output (UTF-8 literal): αβ
Unicode: East Asian Width (1/2)
UAX #11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure: Venn Diagram Showing the Set Relations for Six Categories
Unicode: East Asian Width (2/2)
UAX #11
Character    Class    Resolved width
Я            A        Ambiguous
ऊ            N        Narrow
A            Na       Narrow
Ａ           F        Wide
ｶ            H        Narrow
カ           W        Wide
咦           W        Wide
Table: Examples for Each Character Class and Their Resolved Widths
Na    Narrow
N     Neutral, usually treated as Narrow
W     Wide
F     Fullwidth
H     Halfwidth
Table: Width attributes
String width measurement
echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
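A comparable count can be obtained from `unicodedata.east_asian_width` in the Python standard library. The bucketing below (W and F as FULL, A as AMBI, everything else as HALF) is an assumption modeled on the output shown above, not a description of how bsdconv's WIDTH codec is implemented; `width_counts` is a name chosen for this sketch.

```python
import unicodedata


def width_counts(text: str) -> dict:
    """Count characters by resolved East Asian Width category."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1        # Wide and Fullwidth take two columns
        elif eaw == "A":
            counts["AMBI"] += 1        # width depends on the terminal/locale
        else:
            counts["HALF"] += 1        # Na, N, H (combining marks not special-cased)
    return counts


if __name__ == "__main__":
    sample = "42(\u02ca_>\u02cb) \u7d05\u8336"   # same string as the echo example
    print(width_counts(sample))
```

A display width then follows as `2 * FULL + HALF + k * AMBI`, where `k` is 1 or 2 depending on how the target terminal renders ambiguous-width characters.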
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING    SCORE    COUNT    IERR    Score(s)
UTF-8       19       4        0       4.75
BIG5        8        3        2       -4.0
GBK         4        1        4       -36.0
CCCII       36       9        0       4.0
UTF-16LE    20       5        2       0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
Encoding without a registered name, bound to fonts
Stored in CP1252 or UTF-8
Solution
Two-pass detection
Detect encoding
Detect font family (currently not working) (high coverage in SBCS)
Algorithm ported from Khmer Converter [3]
Khmer Converter
Mapping
Reordering: visual order vs Unicode model
Unicode model: base character [+ [Robat/Shifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
UAO: non-standard Big5 extension
Double color hack: ANSI control sequence in the middle of DBCS
Ambiguous-width characters
luit/screen cannot help
Solution (tl;dr)
Big5 to Unicode:
ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5:
UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained (1/6)
Input (Big5 literal): ★\xC5\x1B[1m\xE5
Decoder: ANSI-CONTROL,BYTE
Internal data (output): [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Entity    Unicode    UTF-8 hex    Big5 hex
★         U+2605     E29885       A1B9
驚        U+9A5A     E9A99A       C5E5
[         U+005B     5B           5B
1         U+0031     31           31
m         U+006D     6D           6D
(NBSP)    U+00A0     C2A0         -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (2/6)
Internal data (input): [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Inter-conversion: BIG5-DEFRAG
Internal data (output): [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Bug5 explained (3/6)
Internal data (input): [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Encoder: BYTE,PASS#MARK&FOR=1B
Internal data (output): [A1] [B9] [C5] [E5] [1B 5B 31 6D MARK]
Bug5 explained (4/6)
Internal data (input): [A1 B9 C5 E5] [1B 5B 31 6D MARK]
Decoder: PASS#UNMARK,BIG5
Internal data (output): [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (5/6)
Internal data (input): [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Inter-conversion: AMBIGUOUS-PAD
Internal data (output): [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (6/6)
Internal data (input): [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Encoder: UTF-8,PASS#FOR=1B
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
Bsdconv: Bindings
Python        https://pypi.python.org/pypi/bsdconv
Ruby          https://rubygems.org/gems/ruby-bsdconv
Go            https://github.com/buganini/go-bsdconv
Perl          https://github.com/buganini/perl-bsdconv
PHP           https://github.com/buganini/php-bsdconv
PostgreSQL    https://github.com/buganini/postgres-bsdconv
MySQL         https://github.com/buganini/mysql-udf-bsdconv
Irssi         https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv: GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
Text
File name
File content
Meta tag
Thanks
ESCAPEUTF-8PASSFOR=UNICODE&MARKBYTE|PASSUNMARKUTF-8NFCASCIIESCAPE|
https://github.com/buganini/bsdconv
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
(Basic Multilingual Plane)
GB18030
CNS11643
CP950
Latin1Figure Character Sets
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
GB18030
CNS11643
CP950
Latin1
UTF-32 UCS4
UTF-81 UTF-16
UCS2
GB18030
CNS11643
CP950 (DBCS)
ISO-8859-1 EBCDIC-0372
1Could cover more but restricted by RFC 36292Aka IBM-37 some control characters are different from ISO-8859-1
Encoding UTF-32 UCS4
Fixed Length4 bytesFilesize = 4 for ASCII text fileIncompatible with C-style string conventionEndianness concern
Encoding UCS2
Fixed Length2 bytesFilesize = 2 for ASCII text fileIncompatible with C-style string conventionEndianness concernBMP-only
Encoding UTF-16
Variable Length2 bytes 4 bytes (Surrogate pairs)SurrogatesUsing U+D800U+DFFFIncompatible with C-style string conventionEndianness concern
110110 110111
Table UTF-16 Structure
Encoding UTF-8
Variable Length1~6 bytesCompatible with C-style string conventionSelf-synchronizingEndian-neutralSorting order = Code point order
0 (ASCII)110 101110 10 1011110 10 10 10111110 10 10 10 101111110 10 10 10 10 10
Table UTF-8 Structure
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Encoding CCCII
VariantsVariant glyph at different planeMostly used for library indexing
強 21 3D 48彊 2D 3D 48强 33 3D 48
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Bsdconv Decoding and Encoding
Alternative to iconv ISO-8859-1 UTF-8
from
toFigure Basic two phases conversion
Bsdconv Codecs amp Fallback
Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F
from
toFigure Fallback codec
Transliteration UTF-8 CP936 CP936-TRANS 3F
from
toFigure Multiple fallback codecs
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Big5 5C issue (許功蓋)
BIG5BIG5-5CBIG5 Input Output
Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo
BIG5-5CBIG5BIG5 Input Output
Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo
TraditionalSimplified Chinese
NOT one-to-one mappingTraditional 乾幹干
vsSimplified 干干干Context dependent之後夜之后入夜之後Variants峰峯
Project Chvar (12)httpsgithubcombuganinichvar
签簽 籖籤
Canonical group
Canonical group
Compatibility group
Figure Two level grouping in Chvar
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Project Chvar (22)httpsgithubcombuganinichvar
NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Bsdconv Phases
Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8
from
inter
toFigure Conversion with inter-mapping phase
Bsdconv Phases
Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8
from
inter
inter
toFigure Conversion with multiple inter-mapping phases
Unicode Casing
IS complicatedLowercase Uppercase
a Ai ITable English
Lowercase Uppercaseı Ii İTable Turkic
Lowercase Uppercasea Aagrave ATable French
Lowercase Uppercaseσ Σς Σ
Table Greek
Default Case Folding
Unicode Normalization Forms (12)UAX15
IndexingIdentification securityUsername Domain name
Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω
Table Canonical Equivalence
Unicode Normalization Forms (22)UAX15
Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1
Width size rotated カ カ︷
Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +
Table Compatibility Equivalence
Normalization for fuzzy matching
UTF-8UPPERUTF-8Input aăⅷDžбᾥ
Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8
Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss
Composition DecompositionCanonical NFC NFD
Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Bsdconv Codec argument
Other than question mark UTF-8 ANY0121 ASCII ANY21
from
toFigure Codec argument
Or more than one character UTF-8 ANY013F0121 ASCII ANY21
from
toFigure Data list separated by dot
Bsdconv Alias
from3FANY013FampERROR
to3FANY3FampERROR
fromUTF-8ASCII_UTF-8
interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER
interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION
interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD
filter01UNICODE
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
(Basic Multilingual Plane)
GB18030
CNS11643
CP950
Latin1Figure Character Sets
Bsdconv Types
(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Look-through (1/4 to 4/4)

Conversion: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8

Entity   Unicode   UTF-8 Hex
α        U+03B1    CE B1
β        U+03B2    CE B2

The first byte of each internal item is its type: 01 for Unicode, 03 for byte.

Input (UTF-8 literal): \u03B1 followed by the escaped bytes CE B2

(1/4) ESCAPE, decoder
  Internal data:  [01 03 B1]  [03 CE]  [03 B2]

(2/4) PASS#MARK&FOR=1,BYTE, encoder
  The Unicode item passes through and is flagged MARK; the byte items are emitted as raw bytes.
  Internal data:  [01 03 B1, MARK]  [CE]  [B2]

(3/4) PASS#UNMARK,UTF-8, decoder
  The marked item passes through with the flag cleared; the raw bytes CE B2 are decoded as UTF-8.
  Internal data:  [01 03 B1]  [01 03 B2]

(4/4) UTF-8, encoder
  Internal data:  [CE B1] "α"  [CE B2] "β"

Output (UTF-8 literal): αβ
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
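The four stages can be imitated in a few lines of Python to make the data flow concrete. This is a toy model, not bsdconv's implementation: the dictionary representation, the function names, and the `%XX` spelling of escaped bytes are all assumptions made for this sketch.

```python
import re

UNICODE, BYTE = 0x01, 0x03  # type tags as listed on the Types slide


def escape_decode(text):
    """Stage 1: \\uXXXX becomes a Unicode item, %XX becomes a byte item."""
    items = []
    for m in re.finditer(r"\\u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})", text):
        if m.group(1):
            items.append({"type": UNICODE, "value": int(m.group(1), 16), "mark": False})
        else:
            items.append({"type": BYTE, "value": int(m.group(2), 16), "mark": False})
    return items


def pass_mark_or_byte(items):
    """Stage 2: Unicode items pass through flagged MARK; byte items become raw ints."""
    out = []
    for item in items:
        if item["type"] == UNICODE:
            out.append({**item, "mark": True})
        else:
            out.append(item["value"])
    return out


def pass_unmark_or_utf8(stream):
    """Stage 3: marked items pass through; runs of raw bytes are decoded as UTF-8."""
    code_points, pending = [], bytearray()

    def flush():
        if pending:
            code_points.extend(ord(c) for c in pending.decode("utf-8"))
            pending.clear()

    for item in stream:
        if isinstance(item, dict):
            flush()
            code_points.append(item["value"])
        else:
            pending.append(item)
    flush()
    return code_points


def look_through(text):
    """Run all stages; stage 4 is a plain UTF-8 encode."""
    code_points = pass_unmark_or_utf8(pass_mark_or_byte(escape_decode(text)))
    return "".join(map(chr, code_points)).encode("utf-8")


if __name__ == "__main__":
    print(look_through(r"\u03B1%CE%B2").decode("utf-8"))
```

The point of the MARK flag is visible in stage 3: data that is already Unicode must skip the UTF-8 decoder, while the escaped bytes still need it.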
Unicode East Asian Width (1/2), UAX #11

Categories: Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral
Figure: Venn Diagram Showing the Set Relations for Six Categories

Unicode East Asian Width (2/2), UAX #11

Character   Class   Resolved width
ऊ           N       Narrow
A           Na      Narrow
ｶ           H       Narrow
Я           A       Ambiguous
Ａ          F       Wide
カ          W       Wide
咦          W       Wide
Table: Examples for Each Character Class and Their Resolved Widths

Na   Narrow
N    Neutral, usually treated as Narrow
W    Wide
F    Fullwidth
H    Halfwidth
Table: Width attributes
String width measurement

echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
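The same counts can be reproduced from the East Asian Width property exposed by Python's `unicodedata` module. This sketch assumes that Wide and Fullwidth count as FULL, Narrow and Halfwidth as HALF, and that Neutral characters (such as the trailing newline from echo) are not counted; `measure_width` is a hypothetical helper, not part of bsdconv.

```python
import unicodedata
from collections import Counter


def measure_width(text: str) -> Counter:
    """Count characters by resolved width class (assumed mapping, see lead-in)."""
    counts = Counter(FULL=0, HALF=0, AMBI=0)
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw in ("Na", "H"):
            counts["HALF"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        # "N" (Neutral) is deliberately left uncounted in this sketch.
    return counts


if __name__ == "__main__":
    sample = "42(\u02ca_>\u02cb) \u7d05\u8336"
    for name, value in measure_width(sample).items():
        print(name, value)
```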
Chinese charset encoding detection
https://github.com/buganini/chiconv

Conversion: ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL

Score(s) = ($SCORE − $IERR × $COUNT × 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK……

ENCODING    SCORE   COUNT   IERR   Score(s)
UTF-8       19      4       0      4.75
BIG5        8       3       2      -4.0
GBK         4       1       4      -36.0
CCCII       36      9       0      4.0
UTF-16LE    20      5       2      0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv

Issues
  Encodings without registered names, bound to fonts
  Stored in CP1252 or UTF-8

Solution
  Two-pass detection
    Detect encoding
    Detect font family (currently not working) (high coverage in SBCS)
  Algorithm ported from Khmer Converter [3]

Khmer Converter
  Mapping
  Reordering: visual order vs. Unicode model
  Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]

[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5

Issues
  UAO: non-standard Big5 extension
  Double color hack: ANSI control sequence in the middle of a DBCS character
  Ambiguous-width characters
  luit/screen cannot help

Solution (tl;dr)
  Big5 to Unicode:
    ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  Unicode to Big5:
    UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
Bug5 explained (1/6 to 6/6)

Conversion: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B

Entity   Unicode   UTF-8 Hex   Big5 Hex
★        U+2605    E29885      A1B9
驚       U+9A5A    E9A99A      C5E5
[        U+005B    5B          5B
1        U+0031    31          31
m        U+006D    6D          6D
(NBSP)   U+00A0    C2A0        -

Input (Big5 literal): ★\xC5\x1B[1m\xE5

(1/6) ANSI-CONTROL,BYTE, decoder
  Internal data:  [03 A1]  [03 B9]  [03 C5]  [1B 5B 31 6D]  [03 E5]

(2/6) BIG5-DEFRAG, inter-conversion
  Internal data:  [03 A1]  [03 B9]  [03 C5]  [03 E5]  [1B 5B 31 6D]

(3/6) BYTE,PASS#MARK&FOR=1B, encoder
  Internal data:  [A1]  [B9]  [C5]  [E5]  [1B 5B 31 6D, MARK]

(4/6) PASS#UNMARK,BIG5, decoder
  Internal data:  [01 26 05]  [01 9A 5A]  [1B 5B 31 6D]

(5/6) AMBIGUOUS-PAD, inter-conversion
  Internal data:  [01 26 05]  [01 A0]  [01 9A 5A]  [1B 5B 31 6D]

(6/6) UTF-8,PASS#FOR=1B, encoder
  Output (UTF-8 literal): ★(NBSP)驚\x1B[1m

Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
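The defragmentation step is the heart of the pipeline: a control sequence wedged between the lead and trail byte of a double-byte character is moved to just after that character. A byte-level sketch in Python is shown below. It is not bsdconv's BIG5-DEFRAG code; the function name, the simplified CSI pattern, and the lead-byte test (any byte from 0x81 up) are assumptions for illustration.

```python
import re

# Simplified pattern for ANSI CSI sequences such as ESC [ 1 m (assumption).
ANSI_CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def big5_defrag(data: bytes) -> bytes:
    """Move control sequences out of the middle of double-byte characters."""
    out = bytearray()
    i = 0
    while i < len(data):
        m = ANSI_CSI.match(data, i)
        if m:  # control sequence sitting between characters: copy as-is
            out += m.group(0)
            i = m.end()
        elif data[i] >= 0x81 and i + 1 < len(data):
            # Lead byte: hold back any control sequences found before the trail byte.
            j, held = i + 1, bytearray()
            while True:
                m = ANSI_CSI.match(data, j)
                if not m:
                    break
                held += m.group(0)
                j = m.end()
            if j < len(data):
                out += bytes([data[i], data[j]]) + held
                i = j + 1
            else:  # dangling lead byte at end of input
                out += bytes([data[i]]) + held
                i = j
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


if __name__ == "__main__":
    fixed = big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")
    print(fixed.hex(" "))
```

After this step the bytes can be split into text and control sequences and the text handed to an ordinary Big5 decoder.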
Bsdconv Bindings

Python       https://pypi.python.org/pypi/bsdconv
Ruby         https://rubygems.org/gems/ruby-bsdconv
Go           https://github.com/buganini/go-bsdconv
Perl         https://github.com/buganini/perl-bsdconv
PHP          https://github.com/buganini/php-bsdconv
PostgreSQL   https://github.com/buganini/postgres-bsdconv
MySQL        https://github.com/buganini/mysql-udf-bsdconv
Irssi        https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl

Bsdconv GUI
https://github.com/buganini/gbsdconv

Alternative to ConvertZ
  Text
  File name
  File content
  Meta tag
Thanks

ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
Bsdconv Decoding and Encoding

Alternative to iconv:
  ISO-8859-1:UTF-8
  (from:to)
Figure: Basic two-phase conversion
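The two phases correspond to a decode followed by an encode. A minimal sketch with Python's built-in codecs (an analogy, not bsdconv itself; `convert` is a made-up name):

```python
def convert(data: bytes, from_codec: str, to_codec: str) -> bytes:
    """Two-phase conversion: bytes -> Unicode -> bytes."""
    text = data.decode(from_codec)  # "from" phase
    return text.encode(to_codec)    # "to" phase


if __name__ == "__main__":
    print(convert(b"caf\xe9", "iso-8859-1", "utf-8"))
```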
Bsdconv Codecs & Fallback

Optionally produce question mark (U+003F) as replacement:
  UTF-8,3F:ASCII,3F
  (from:to)
Figure: Fallback codec

Transliteration:
  UTF-8:CP936,CP936-TRANS,3F
  (from:to)
Figure: Multiple fallback codecs
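The fallback chain can be pictured as "try each codec in order, per character, and stop at the first one that succeeds". The sketch below models that idea in Python. ASCII stands in for the primary codec, and the transliteration table is a toy with made-up entries; none of the names come from bsdconv.

```python
def ascii_codec(ch):
    """Primary codec: succeeds only for ASCII characters."""
    try:
        return ch.encode("ascii")
    except UnicodeEncodeError:
        return None


# Toy transliteration table (illustrative entries only).
TRANSLIT = {"\u00e9": b"e", "\u00fc": b"u"}


def translit_codec(ch):
    """Second choice: look the character up in the transliteration table."""
    return TRANSLIT.get(ch)


def question_mark(ch):
    """Last resort: always succeeds, like the 3F fallback."""
    return b"?"


def encode_with_fallbacks(text, chain):
    """Encode each character with the first codec in the chain that accepts it."""
    out = bytearray()
    for ch in text:
        for codec in chain:
            piece = codec(ch)
            if piece is not None:
                out += piece
                break
    return bytes(out)


if __name__ == "__main__":
    chain = [ascii_codec, translit_codec, question_mark]
    print(encode_with_fallbacks("A\u00e9\u2603", chain))
```

Dropping `question_mark` from the chain makes unconvertible characters disappear silently in this sketch, which is one reason an explicit last-resort codec is useful.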
Big5 5C issue (許功蓋)

BIG5:BIG5-5C,BIG5
                 Input               Output
  Big5 literal   "成功"              "成功\"
  ASCII/Hex      "\xA6\xA8\xA5\"     "\xA6\xA8\xA5\\"

BIG5-5C,BIG5:BIG5
                 Input               Output
  Big5 literal   "成功\"             "成功"
  ASCII/Hex      "\xA6\xA8\xA5\\"    "\xA6\xA8\xA5\"
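The problem is easy to see at the byte level: the trail byte of 功 is 0x5C, the ASCII backslash. The sketch below shows this with Python's `big5` codec and implements a simple escaping pass in the spirit of the first table. The function name is hypothetical, and the lead-byte test is a simplification rather than bsdconv's actual BIG5-5C logic.

```python
def escape_5c_trail_bytes(big5: bytes) -> bytes:
    """Double every 0x5C that is the trail byte of a double-byte character."""
    out = bytearray()
    i = 0
    while i < len(big5):
        if big5[i] >= 0x81 and i + 1 < len(big5):  # lead byte of a DBCS pair
            out += big5[i:i + 2]
            if big5[i + 1] == 0x5C:
                out.append(0x5C)                   # protective extra backslash
            i += 2
        else:
            out.append(big5[i])
            i += 1
    return bytes(out)


if __name__ == "__main__":
    raw = "\u6210\u529f".encode("big5")   # the two characters of the example
    print(raw.hex(" "))
    print(escape_5c_trail_bytes(raw).hex(" "))
```

With the extra backslash in place, a program that treats `\` as an escape character consumes the pair and leaves the original trail byte intact.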
Traditional/Simplified Chinese

NOT a one-to-one mapping
  Traditional:  乾  幹  干
  Simplified:   干  干  干
Context dependent
  之後
  夜之后
  入夜之後
Variants
  峰  峯
Project Chvar (1/2)
https://github.com/buganini/chvar

Compatibility group
  Canonical group:  签 簽
  Canonical group:  籖 籤
Figure: Two-level grouping in Chvar

          签    簽    籖    籤
TW        簽    -     籤    -
CN        -     签    -     籖
CP950     簽    -     籤    -
GB2312    -     签    ×     ×
Table: Canonical Group

          签    簽    籖    籤
TW        簽    -     簽    簽
CN        -     签    签    签
CP950     簽    -     簽    簽
GB2312    -     签    签    签
Table: Compatibility Group

Project Chvar (2/2)
https://github.com/buganini/chvar

Normalization
  Canonical Equivalence
Transliteration
  Converted, or Canonical Equivalence, or Compatibility Equivalence
Fuzzy character matching
  Compatibility Equivalence
Bsdconv Phases

Traditional Chinese ⇔ Simplified Chinese:
  UTF-8:ZHTW:UTF-8
  (from:inter:to)
Figure: Conversion with inter-mapping phase

Furthermore, phrase mapping:
  UTF-8:ZHTW:ZHTW-WORDS:UTF-8
  (from:inter:inter:to)
Figure: Conversion with multiple inter-mapping phases
Unicode Casing

IS complicated

           Lowercase   Uppercase
English    a           A
           i           I
Turkic     ı           I
           i           İ
French     a           A
           à           A
Greek      σ           Σ
           ς           Σ

Default Case Folding
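Python's string methods implement the default (locale-independent) Unicode case mappings, which makes the pitfalls in the tables easy to observe. The helper `same_caseless` below is a made-up name for a comparison based on default case folding.

```python
def same_caseless(a: str, b: str) -> bool:
    """Compare two strings using Unicode default case folding."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # Greek: both sigmas share one uppercase form.
    print("σ".upper(), "ς".upper())
    # Turkic: the default mappings are not locale-aware.
    print("I".lower(), "\u0131".upper())
    # Case folding makes the two spellings compare equal.
    print(same_caseless("ΟΔΟΣ", "οδος"))
```

Because the default mappings ignore locale, Turkic text needs language-specific handling on top of this.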
Unicode Normalization Forms (1/2), UAX #15

Indexing
Identification, security
  Username, domain name

Combining sequence            Ç ↔ C + ◌̧
Ordering of combining marks   q + ◌̇ + ◌̣ ↔ q + ◌̣ + ◌̇
Hangul                        가 ↔ ᄀ + ᅡ
Singleton                     Ω (U+2126) ↔ Ω (U+03A9)
Table: Canonical Equivalence
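Each row of the table can be checked with Python's `unicodedata.normalize`: two strings are canonically equivalent when their NFD (or NFC) forms are identical. The helper name `canonically_equal` is an illustrative choice.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """True when both strings have the same canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    pairs = {
        "combining sequence": ("\u00c7", "C\u0327"),
        "mark ordering": ("q\u0307\u0323", "q\u0323\u0307"),
        "Hangul": ("\uac00", "\u1100\u1161"),
        "singleton": ("\u2126", "\u03a9"),
        "font variant (compatibility only)": ("\u210c", "H"),
    }
    for label, (left, right) in pairs.items():
        print(label, canonically_equal(left, right))
```

The last pair is only compatibility-equivalent, so it needs NFKC or NFKD to compare equal.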
Unicode Normalization Forms (22)UAX15
Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1
Width size rotated カ カ︷
Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +
Table Compatibility Equivalence
Normalization for fuzzy matching
UTF-8UPPERUTF-8Input aăⅷDžбᾥ
Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8
Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss
Composition DecompositionCanonical NFC NFD
Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Bsdconv Codec argument
Other than question mark UTF-8 ANY0121 ASCII ANY21
from
toFigure Codec argument
Or more than one character UTF-8 ANY013F0121 ASCII ANY21
from
toFigure Data list separated by dot
Bsdconv Alias
from3FANY013FampERROR
to3FANY3FampERROR
fromUTF-8ASCII_UTF-8
interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER
interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION
interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD
filter01UNICODE
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
(Basic Multilingual Plane)
GB18030
CNS11643
CP950
Latin1Figure Character Sets
Bsdconv Types
(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Look-through (14)
u03B1CEB2Input (UTF-8 literal)
ESCAPE Decoder
01
03
B1
03
CE
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (24)
01
03
B1
03CE
03B2
Internal data
PASSMARKampFOR=1BYTEEncoder
01
03
B1
MARK
CE
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (34)
0103
B1
MARK
CE B2
Internal data
PASSUNMARKUTF-8 Decoder
01
03
B1
01
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (44)
0103
B1
01
03
B2
Internal data
UTF-8
Encoder
CE
B1
rdquoαrdquoCE
B2
rdquoβrdquo
Internal data
αβ
Output (UTF-8 literal)
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
String width measurement
echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2
Chinese charset encoding detectionhttpsgithubcombuganinichiconv
ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001
$COUNT
帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40
UTF-16LE 20 5 2 00
Khmer legacy font converterhttpsgithubcombuganinikhmerconv
IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8
SolutionTwo pass detection
Detect encodingDetect font family (currently not working)(High converage in SBCS)
Algorithm ported from Khmer Converter3
Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]
3httpwwwkhmerosinfoenkhmer-converter
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
Terminal transcodinghttpsgithubcombuganinibug5
IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help
Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained (16)
⋆xC5x1B[1mxE5Input (Big5 literal)
ANSI-CONTROLBYTE Decoder
03
A1
03
B9
03
C5
1B
5B
31
6D
03
E5
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (26)
03
A1 03
B9 03
C5 1B
5B
31
6D
03E5
Internal data
BIG5-DEFRAG Inter-conversion
03
A1
03
B9
03
C5
03
E5
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (36)
03
A1 03
B9 03
C5 03
E5 1B
5B
31
6D
Internal data
BYTEPASSMARKampFOR=1BEncoder
A1
B9
C5
E5
1B
5B
31
6D
MARK
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (46)
A1 B9 C5 E5 1B
5B
31
6D
MARK
Internal data
PASSUNMARKBIG5 Decoder
0126
05
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (56)
0126
05
019A
5A
1B5B
31
6D
Internal data
AMBIGUOUS-PAD Inter-conversion
01
26
05
01
A0
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (66)
01
26
05
01A0
019A
5A
1B5B
31
6D
Internal data
UTF-8PASSFOR=1BEncoder
⋆ 驚 x1B[1m
Output (UTF-8 literal)
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bsdconv Bindings
PythonRubyGoPerlPHPhttpspypipythonorgpypibsdconvhttpsrubygemsorggemsruby-bsdconvhttpsgithubcombuganinigo-bsdconvhttpsgithubcombuganiniperl-bsdconvhttpsgithubcombuganiniphp-bsdconv
PostgreSQLMySQLhttpsgithubcombuganinipostgres-bsdconvhttpsgithubcombuganinimysql-udf-bsdconv
Irssihttpsgithubcombuganiniirssi-scriptsblobmasterirssi-bsdconvpl
Bsdconv GUIhttpsgithubcombuganinigbsdconv
Alternative to ConvertZTextFile nameFile contentMeta tag
Thanks
ESCAPEUTF-8PASSFOR=UNICODEampMARKBYTE|PASSUNMAR
KUTF-8NFCASCIIES
CAPE|
httpsgithubcombuganinibsdconv
Encoding UTF-32 UCS4
Fixed Length4 bytesFilesize = 4 for ASCII text fileIncompatible with C-style string conventionEndianness concern
Encoding UCS2
Fixed Length2 bytesFilesize = 2 for ASCII text fileIncompatible with C-style string conventionEndianness concernBMP-only
Encoding UTF-16
Variable Length2 bytes 4 bytes (Surrogate pairs)SurrogatesUsing U+D800U+DFFFIncompatible with C-style string conventionEndianness concern
110110 110111
Table UTF-16 Structure
Encoding UTF-8
Variable Length1~6 bytesCompatible with C-style string conventionSelf-synchronizingEndian-neutralSorting order = Code point order
0 (ASCII)110 101110 10 1011110 10 10 10111110 10 10 10 101111110 10 10 10 10 10
Table UTF-8 Structure
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Encoding CCCII
VariantsVariant glyph at different planeMostly used for library indexing
強 21 3D 48彊 2D 3D 48强 33 3D 48
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Bsdconv Decoding and Encoding
Alternative to iconv ISO-8859-1 UTF-8
from
toFigure Basic two phases conversion
Bsdconv Codecs amp Fallback
Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F
from
toFigure Fallback codec
Transliteration UTF-8 CP936 CP936-TRANS 3F
from
toFigure Multiple fallback codecs
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Big5 5C issue (許功蓋)
BIG5BIG5-5CBIG5 Input Output
Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo
BIG5-5CBIG5BIG5 Input Output
Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo
TraditionalSimplified Chinese
NOT one-to-one mappingTraditional 乾幹干
vsSimplified 干干干Context dependent之後夜之后入夜之後Variants峰峯
Project Chvar (12)httpsgithubcombuganinichvar
签簽 籖籤
Canonical group
Canonical group
Compatibility group
Figure Two level grouping in Chvar
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Project Chvar (22)httpsgithubcombuganinichvar
NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Bsdconv Phases
Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8
from
inter
toFigure Conversion with inter-mapping phase
Bsdconv Phases
Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8
from
inter
inter
toFigure Conversion with multiple inter-mapping phases
Unicode Casing
IS complicatedLowercase Uppercase
a Ai ITable English
Lowercase Uppercaseı Ii İTable Turkic
Lowercase Uppercasea Aagrave ATable French
Lowercase Uppercaseσ Σς Σ
Table Greek
Default Case Folding
Unicode Normalization Forms (12)UAX15
IndexingIdentification securityUsername Domain name
Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω
Table Canonical Equivalence
Unicode Normalization Forms (22)UAX15
Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1
Width size rotated カ カ︷
Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +
Table Compatibility Equivalence
Normalization for fuzzy matching
UTF-8UPPERUTF-8Input aăⅷDžбᾥ
Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8
Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss
Composition DecompositionCanonical NFC NFD
Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Bsdconv Codec argument
Other than question mark UTF-8 ANY0121 ASCII ANY21
from
toFigure Codec argument
Or more than one character UTF-8 ANY013F0121 ASCII ANY21
from
toFigure Data list separated by dot
Bsdconv Alias
from3FANY013FampERROR
to3FANY3FampERROR
fromUTF-8ASCII_UTF-8
interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER
interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION
interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD
filter01UNICODE
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
(Basic Multilingual Plane)
GB18030
CNS11643
CP950
Latin1Figure Character Sets
Bsdconv Types
(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Look-through (14)
u03B1CEB2Input (UTF-8 literal)
ESCAPE Decoder
01
03
B1
03
CE
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (24)
01
03
B1
03CE
03B2
Internal data
PASSMARKampFOR=1BYTEEncoder
01
03
B1
MARK
CE
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (34)
0103
B1
MARK
CE B2
Internal data
PASSUNMARKUTF-8 Decoder
01
03
B1
01
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (44)
0103
B1
01
03
B2
Internal data
UTF-8
Encoder
CE
B1
rdquoαrdquoCE
B2
rdquoβrdquo
Internal data
αβ
Output (UTF-8 literal)
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
String width measurement
echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2
Chinese charset encoding detectionhttpsgithubcombuganinichiconv
ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001
$COUNT
帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40
UTF-16LE 20 5 2 00
Khmer legacy font converterhttpsgithubcombuganinikhmerconv
IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8
SolutionTwo pass detection
Detect encodingDetect font family (currently not working)(High converage in SBCS)
Algorithm ported from Khmer Converter3
Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]
3httpwwwkhmerosinfoenkhmer-converter
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
Terminal transcodinghttpsgithubcombuganinibug5
IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help
Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained (16)
⋆xC5x1B[1mxE5Input (Big5 literal)
ANSI-CONTROLBYTE Decoder
03
A1
03
B9
03
C5
1B
5B
31
6D
03
E5
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (26)
03
A1 03
B9 03
C5 1B
5B
31
6D
03E5
Internal data
BIG5-DEFRAG Inter-conversion
03
A1
03
B9
03
C5
03
E5
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (36)
03
A1 03
B9 03
C5 03
E5 1B
5B
31
6D
Internal data
BYTEPASSMARKampFOR=1BEncoder
A1
B9
C5
E5
1B
5B
31
6D
MARK
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (46)
A1 B9 C5 E5 1B
5B
31
6D
MARK
Internal data
PASSUNMARKBIG5 Decoder
0126
05
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (56)
0126
05
019A
5A
1B5B
31
6D
Internal data
AMBIGUOUS-PAD Inter-conversion
01
26
05
01
A0
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (66)
01
26
05
01A0
019A
5A
1B5B
31
6D
Internal data
UTF-8PASSFOR=1BEncoder
⋆ 驚 x1B[1m
Output (UTF-8 literal)
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bsdconv Bindings
PythonRubyGoPerlPHPhttpspypipythonorgpypibsdconvhttpsrubygemsorggemsruby-bsdconvhttpsgithubcombuganinigo-bsdconvhttpsgithubcombuganiniperl-bsdconvhttpsgithubcombuganiniphp-bsdconv
PostgreSQLMySQLhttpsgithubcombuganinipostgres-bsdconvhttpsgithubcombuganinimysql-udf-bsdconv
Irssihttpsgithubcombuganiniirssi-scriptsblobmasterirssi-bsdconvpl
Bsdconv GUIhttpsgithubcombuganinigbsdconv
Alternative to ConvertZTextFile nameFile contentMeta tag
Thanks
ESCAPEUTF-8PASSFOR=UNICODEampMARKBYTE|PASSUNMAR
KUTF-8NFCASCIIES
CAPE|
httpsgithubcombuganinibsdconv
Encoding UCS2
Fixed Length2 bytesFilesize = 2 for ASCII text fileIncompatible with C-style string conventionEndianness concernBMP-only
Encoding UTF-16
Variable Length2 bytes 4 bytes (Surrogate pairs)SurrogatesUsing U+D800U+DFFFIncompatible with C-style string conventionEndianness concern
110110 110111
Table UTF-16 Structure
Encoding UTF-8
Variable Length1~6 bytesCompatible with C-style string conventionSelf-synchronizingEndian-neutralSorting order = Code point order
0 (ASCII)110 101110 10 1011110 10 10 10111110 10 10 10 101111110 10 10 10 10 10
Table UTF-8 Structure
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Encoding CCCII
VariantsVariant glyph at different planeMostly used for library indexing
強 21 3D 48彊 2D 3D 48强 33 3D 48
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Bsdconv Decoding and Encoding
Alternative to iconv
    ISO-8859-1:UTF-8
    from: ISO-8859-1
    to:   UTF-8
Figure: Basic two-phase conversion
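The two phases can be imitated with Python's built-in codecs. This is only a sketch of the idea (decode into Unicode, then encode out of it); the function name `two_phase` is hypothetical and nothing here calls bsdconv itself.

```python
def two_phase(data: bytes, src: str, dst: str) -> bytes:
    """Decode with the 'from' codec, then encode with the 'to' codec."""
    text = data.decode(src)   # from phase: bytes -> Unicode
    return text.encode(dst)   # to phase: Unicode -> bytes


latin1_bytes = b"caf\xe9"     # "café" in ISO-8859-1
print(two_phase(latin1_bytes, "iso-8859-1", "utf-8"))
```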
Bsdconv Codecs & Fallback
Optionally produce a question mark (U+003F) as replacement
    UTF-8,3F:ASCII,3F
    from: UTF-8, 3F
    to:   ASCII, 3F
Figure: Fallback codec
Transliteration
    UTF-8:CP936,CP936-TRANS,3F
    from: UTF-8
    to:   CP936, CP936-TRANS, 3F
Figure: Multiple fallback codecs
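A fallback chain of this kind can be sketched in Python by trying each codec in turn for every character and emitting a replacement byte only when all of them fail. The helper `encode_with_fallbacks` is hypothetical, Python codec names stand in for bsdconv codecs, and transliteration codecs such as CP936-TRANS have no standard-library counterpart.

```python
def encode_with_fallbacks(text, codecs, replacement=b"?"):
    """Encode character by character, trying each codec in order."""
    out = bytearray()
    for ch in text:
        for name in codecs:
            try:
                out += ch.encode(name)
                break                     # first codec that succeeds wins
            except UnicodeEncodeError:
                continue                  # fall through to the next codec
        else:
            out += replacement            # every codec failed: emit 0x3F
    return bytes(out)


print(encode_with_fallbacks("a\u20acb", ["ascii"]))
```

Working per character keeps the sketch short; it would not be appropriate for stateful encodings.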
Big5 5C issue (許功蓋)
BIG5:BIG5-5C,BIG5
                    Input               Output
    Big5 literal    "成功"              "成功\"
    ASCII/Hex       "\xA6\xA8\xA5\"     "\xA6\xA8\xA5\\"
BIG5-5C,BIG5:BIG5
                    Input               Output
    Big5 literal    "成功\"             "成功"
    ASCII/Hex       "\xA6\xA8\xA5\\"    "\xA6\xA8\xA5\"
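The problem is easy to reproduce with Python's `big5` codec: the second byte of 功 is 0x5C, the ASCII backslash. The helpers below are hypothetical and only mirror the table above (doubling the backslash after such a character); they are not the BIG5-5C codec itself.

```python
def has_5c_trail(ch: str) -> bool:
    """True if the Big5 encoding of ch is two bytes ending in 0x5C."""
    b = ch.encode("big5")
    return len(b) == 2 and b[1] == 0x5C


def escape_5c(text: str) -> bytes:
    """Encode to Big5, adding an extra backslash after each 0x5C trail byte."""
    out = bytearray()
    for ch in text:
        out += ch.encode("big5")
        if has_5c_trail(ch):
            out += b"\\"
    return bytes(out)


print("成功".encode("big5"))   # the last byte is 0x5C
print(escape_5c("成功"))
```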
Traditional/Simplified Chinese
NOT a one-to-one mapping
    Traditional: 乾 幹 干
    vs.
    Simplified:  干 干 干
Context dependent
    之後, 夜之后, 入夜之後
Variants
    峰 峯
Project Chvar (1/2)
https://github.com/buganini/chvar
    Compatibility group
        Canonical group: 签 簽
        Canonical group: 籖 籤
Figure: Two-level grouping in Chvar
              签    簽    籖    籤
    TW        簽    -     籤    -
    CN        -     签    -     籖
    CP950     簽    -     籤    -
    GB2312    -     签    ×     ×
Table: Canonical Group
              签    簽    籖    籤
    TW        簽    -     簽    簽
    CN        -     签    签    签
    CP950     簽    -     簽    簽
    GB2312    -     签    签    签
Table: Compatibility Group
Project Chvar (2/2)
https://github.com/buganini/chvar
    Normalization: Canonical Equivalence
    Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
    Fuzzy character matching: Compatibility Equivalence
Bsdconv Phases
Traditional Chinese ⇔ Simplified Chinese
    UTF-8:ZHTW:UTF-8
    from:  UTF-8
    inter: ZHTW
    to:    UTF-8
Figure: Conversion with inter-mapping phase
Bsdconv Phases
Furthermore, phrase mapping
    UTF-8:ZHTW:ZHTW-WORDS:UTF-8
    from:  UTF-8
    inter: ZHTW
    inter: ZHTW-WORDS
    to:    UTF-8
Figure: Conversion with multiple inter-mapping phases
Unicode Casing
IS complicated
    Lowercase  Uppercase
    a          A
    i          I
Table: English
    Lowercase  Uppercase
    ı          I
    i          İ
Table: Turkic
    Lowercase  Uppercase
    a          A
    à          A
Table: French
    Lowercase  Uppercase
    σ          Σ
    ς          Σ
Table: Greek
Default Case Folding
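Some of these cases can be observed with Python's built-in string methods, which apply the default (locale-independent) Unicode mappings. That means the Turkic and French conventions in the tables are not reproduced; the sketch only shows the default behaviour.

```python
# Both Greek sigmas share one uppercase form.
print("σ".upper(), "ς".upper())

# Default case folding maps final sigma to σ and expands ß.
print("ς".casefold(), "ß".casefold())

# Without Turkic tailoring, dotless ı uppercases to plain I,
# and İ lowercases to "i" followed by U+0307 (combining dot above).
print("ı".upper(), [hex(ord(c)) for c in "İ".lower()])
```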
Unicode Normalization Forms (1/2)
UAX #15
Indexing
Identification / security
    Username
    Domain name
    Combining sequence             Ç  ↔  C + U+0327
    Ordering of combining marks    q + U+0307 + U+0323  ↔  q + U+0323 + U+0307
    Hangul                         가  ↔  ᄀ + ᅡ
    Singleton                      Ω (U+2126)  ↔  Ω (U+03A9)
Table: Canonical Equivalence
Unicode Normalization Forms (2/2)
UAX #15
    Font variants              ℌ → H
    Breaking differences       NBSP → SP
    Cursive forms              ن (presentation forms) → ن
    Circled                    ① → 1
    Width, size, rotated       ｶ → カ, ︷ → {
    Superscripts/subscripts    ⁹ → 9
    Squared characters         ㍿ → 株 + 式 + 会 + 社
    Fractions                  ¾ → 3 + ⁄ + 4
    Others                     dž → d + z + U+030C
Table: Compatibility Equivalence
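Several rows of the two equivalence tables can be checked with Python's `unicodedata.normalize`. This is a small illustration using the standard library, independent of bsdconv.

```python
import unicodedata as ud

# Canonical equivalence
print(ud.normalize("NFC", "C\u0327") == "\u00c7")        # combining sequence
print(ud.normalize("NFD", "\uac00") == "\u1100\u1161")   # Hangul syllable
print(ud.normalize("NFC", "\u2126") == "\u03a9")         # singleton (ohm sign)
# Combining marks are put into a fixed order.
print(ud.normalize("NFD", "q\u0307\u0323") == ud.normalize("NFD", "q\u0323\u0307"))

# Compatibility equivalence: only the K forms apply these mappings.
print(ud.normalize("NFC", "\u2460"), ud.normalize("NFKC", "\u2460"))
print(ud.normalize("NFKD", "\u00be"))                     # 3, fraction slash, 4
print(ud.normalize("NFKC", "\u337f"))                     # squared 株式会社
```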
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
    Input:  aăⅷDžбᾥ
    Output: AĂⅧDŽБᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
    Input:  ¼ℌℍăDž⁹灣湾ド[U+2FA0A]鬒 æß
    Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss
                      Composition    Decomposition
    Canonical         NFC            NFD
    Compatibility     NFKC           NFKD
Table: The four Unicode normalization forms and the transformations
Bsdconv Cascade
Re-encode
    UTF-8:ISO-8859-1|BIG5:UTF-8
    from: UTF-8
    to:   ISO-8859-1
    from: BIG5
    to:   UTF-8
Figure: Cascaded conversions
    Input: ¥x¥_    Output: 台北
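The same repair can be sketched in Python: text that was wrongly decoded as ISO-8859-1 is encoded back to the original bytes and then decoded with the codec that actually applies. The function name `reinterpret` is hypothetical; it mirrors the two cascaded conversions rather than calling bsdconv.

```python
def reinterpret(text: str, wrong: str, actual: str) -> str:
    """Undo a wrong decode: recover the raw bytes, then decode them properly."""
    raw = text.encode(wrong)      # first conversion: back to the original bytes
    return raw.decode(actual)     # second conversion: decode with the real charset


print(reinterpret("¥x¥_", "iso-8859-1", "big5"))
```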
Bsdconv Codec argument
Other than question mark
    UTF-8,ANY#0121:ASCII,ANY#21
    from: UTF-8, ANY#0121
    to:   ASCII, ANY#21
Figure: Codec argument
Or more than one character
    UTF-8,ANY#013F.0121:ASCII,ANY#21
    from: UTF-8, ANY#013F.0121
    to:   ASCII, ANY#21
Figure: Data list separated by dot
Bsdconv Alias
    Phase     Alias            Expansion
    from      3F               ANY#013F&ERROR
    to        3F               ANY#3F&ERROR
    from      UTF-8            ASCII,_UTF-8
    inter     NFKD             _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
    inter     NFKC             NFKD:_NFC:_NF-HANGUL-COMPOSITION
    inter     NFKD-CASEFOLD    NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
    filter    01               UNICODE
Bsdconv Types
    (01) Unicode
    (02) CNS11643
    (03) Byte
    (04) Chinese components
    (1B) ANSI control sequences
    (00) Bsdconv special characters
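Judging from the hex dumps in the walkthrough slides (for example 01 03 B1 for U+03B1 and 03 CE for the byte CE), each internal datum appears to be a type byte followed by the value in as few big-endian bytes as needed. The sketch below encodes that reading; the function names are hypothetical and the layout is inferred from the slides, not taken from the bsdconv source.

```python
TYPE_UNICODE = 0x01
TYPE_BYTE = 0x03


def unicode_datum(code_point: int) -> bytes:
    """Type byte 01 followed by the code point, big-endian, minimal length."""
    length = max(1, (code_point.bit_length() + 7) // 8)
    return bytes([TYPE_UNICODE]) + code_point.to_bytes(length, "big")


def byte_datum(value: int) -> bytes:
    """Type byte 03 followed by one raw byte."""
    return bytes([TYPE_BYTE, value])


print(unicode_datum(0x03B1).hex(" "), "|", byte_datum(0xCE).hex(" "))
```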
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
    Input:  功夫不好不要艹我
    Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
    Input:  功夫不好不要艹我
    Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
    Input:  功夫不好不要艹我
    Output: pu nao yao [uh]2
Bsdconv Flags
    FREE - memory management
    MARK - identifier
Look-through (1/4)
Input (UTF-8 literal): \u03B1 followed by the bytes CE B2
ESCAPE decoder
    Internal data out: 01 03 B1 | 03 CE | 03 B2
Look-through (2/4)
PASS#MARK&FOR=1,BYTE encoder
    Internal data in:  01 03 B1 | 03 CE | 03 B2
    Internal data out: 01 03 B1 (MARK) | CE | B2
Look-through (3/4)
PASS#UNMARK,UTF-8 decoder
    Internal data in:  01 03 B1 (MARK) | CE B2
    Internal data out: 01 03 B1 | 01 03 B2
Look-through (4/4)
UTF-8 encoder
    Internal data in:  01 03 B1 | 01 03 B2
    Internal data out: CE B1 ("α") | CE B2 ("β")
Output (UTF-8 literal): αβ
    Entity    Unicode    UTF-8 Hex
    α         U+03B1     CEB1
    β         U+03B2     CEB2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
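The net effect of the four steps (expand `\uXXXX` escapes, let bytes that are already UTF-8 pass through untouched) can be sketched with a regular expression. The function name `unescape_passthrough` is hypothetical, and the sketch does not model the MARK flag or the typed internal data; escapes naming surrogate code points are not handled.

```python
import re

_ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def unescape_passthrough(data: bytes) -> str:
    """Replace \\uXXXX escapes with UTF-8 bytes; leave all other bytes as they are."""
    utf8 = _ESCAPE.sub(
        lambda m: chr(int(m.group(1), 16)).encode("utf-8"),
        data,
    )
    return utf8.decode("utf-8")


# The escape for alpha followed by the raw UTF-8 bytes of beta.
print(unescape_passthrough(b"\\u03B1\xce\xb2"))
```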
Unicode East Asian Width (1/2)
UAX #11
    Categories: Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral
Figure: Venn diagram showing the set relations for the six categories
Unicode East Asian Width (2/2)
UAX #11
    Character    Class    Resolved width
    ऊ            N        Narrow
    A            Na       Narrow
    ｶ            H        Narrow
    Я            A        Ambiguous
    Ａ           F        Wide
    カ           W        Wide
    咦           W        Wide
Table: Examples for each character class and their resolved widths
    Na    Narrow
    N     Neutral (usually treated as Narrow)
    W     Wide
    F     Fullwidth
    H     Halfwidth
Table: Width attributes
String width measurement
    $ echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
    FULL 2
    HALF 7
    AMBI 2
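A comparable count can be obtained in Python from `unicodedata.east_asian_width`. Grouping W and F as FULL, Na and H as HALF, and A as AMBI is an assumption chosen to match the three labels in the output above; the function name `width_classes` is hypothetical.

```python
import unicodedata
from collections import Counter


def width_classes(text: str) -> Counter:
    """Count characters by (assumed) width group."""
    counts = Counter()
    for ch in text:
        w = unicodedata.east_asian_width(ch)
        if w in ("W", "F"):
            counts["FULL"] += 1
        elif w in ("Na", "H"):
            counts["HALF"] += 1
        elif w == "A":
            counts["AMBI"] += 1
        else:                      # "N" (neutral), e.g. control characters
            counts["NEUTRAL"] += 1
    return counts


print(width_classes("42(ˊ_>ˋ) 紅茶"))
```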
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE - $IERR * $COUNT * 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
    ENCODING    SCORE    COUNT    IERR    Score(s)
    UTF-8       19       4        0       4.75
    BIG5        8        3        2       -4.0
    GBK         4        1        4       -36.0
    CCCII       36       9        0       4.0
    UTF-16LE    20       5        2       0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
    Encoding without a registered name, bound to fonts
    Stored in CP1252 or UTF-8
Solution
    Two-pass detection
        Detect encoding
        Detect font family (currently not working)
        (High coverage in SBCS)
    Algorithm ported from Khmer Converter [3]
Khmer Converter
    Mapping
    Reordering
        Visual order vs. Unicode model
        Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
    UAO: non-standard Big5 extension
    Double color hack
    ANSI control sequence in the middle of DBCS
    Ambiguous-width characters
    luit/screen cannot help
Solution (tl;dr)
    Big5 to Unicode
        ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
    Unicode to Big5
        UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
Bug5 explained (1/6)
Input (Big5 literal): ★ \xC5 \x1B[1m \xE5
ANSI-CONTROL,BYTE decoder
    Internal data out: 03 A1 | 03 B9 | 03 C5 | 1B 5B 31 6D | 03 E5
Bug5 explained (2/6)
BIG5-DEFRAG inter-conversion
    Internal data in:  03 A1 | 03 B9 | 03 C5 | 1B 5B 31 6D | 03 E5
    Internal data out: 03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D
Bug5 explained (3/6)
BYTE,PASS#MARK&FOR=1B encoder
    Internal data in:  03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D
    Internal data out: A1 | B9 | C5 | E5 | 1B 5B 31 6D (MARK)
Bug5 explained (4/6)
PASS#UNMARK,BIG5 decoder
    Internal data in:  A1 B9 C5 E5 | 1B 5B 31 6D (MARK)
    Internal data out: 01 26 05 | 01 9A 5A | 1B 5B 31 6D
Bug5 explained (5/6)
AMBIGUOUS-PAD inter-conversion
    Internal data in:  01 26 05 | 01 9A 5A | 1B 5B 31 6D
    Internal data out: 01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D
Bug5 explained (6/6)
UTF-8,PASS#FOR=1B encoder
    Internal data in:  01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D
Output (UTF-8 literal): ★ (NBSP) 驚 \x1B[1m
    Entity    Unicode    UTF-8 Hex    Big5 Hex
    ★         U+2605     E29885       A1B9
    驚        U+9A5A     E9A99A       C5E5
    [         U+005B     5B           5B
    1         U+0031     31           31
    m         U+006D     6D           6D
    (NBSP)    U+00A0     C2A0         -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
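The walkthrough can be approximated in plain Python: move any ANSI CSI sequence that splits a double-byte character to just after that character, decode as Big5, then follow each East Asian Ambiguous character with a no-break space. All function names are hypothetical, Python's `big5` codec stands in for bsdconv's tables (so UAO extensions are not covered), and only CSI sequences of the form ESC [ ... are recognised.

```python
import re
import unicodedata

# ESC [ parameters, intermediates, final byte
CSI = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")


def big5_defrag(data: bytes) -> bytes:
    """Move CSI sequences found between a lead byte and its trail byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE:                 # Big5 lead byte
            j = i + 1
            pending = bytearray()
            m = CSI.match(data, j)
            while m:                          # collect interrupting sequences
                pending += m.group(0)
                j = m.end()
                m = CSI.match(data, j)
            if j < len(data):
                out += bytes([b, data[j]])    # rejoin lead and trail byte
                j += 1
            else:
                out.append(b)                 # truncated character: keep as is
            out += pending                    # sequences go after the character
            i = j
        else:
            m = CSI.match(data, i)
            if m:
                out += m.group(0)
                i = m.end()
            else:
                out.append(b)
                i += 1
    return bytes(out)


def pad_ambiguous(text: str) -> str:
    """Append NBSP after every East Asian Ambiguous character."""
    out = []
    for ch in text:
        out.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            out.append("\u00a0")
    return "".join(out)


def big5_terminal_to_utf8(data: bytes) -> bytes:
    return pad_ambiguous(big5_defrag(data).decode("big5")).encode("utf-8")


sample = b"\xa1\xb9\xc5\x1b[1m\xe5"           # the input from slide 1/6
print(big5_terminal_to_utf8(sample).hex(" "))
```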
Encoding UTF-8
Variable Length1~6 bytesCompatible with C-style string conventionSelf-synchronizingEndian-neutralSorting order = Code point order
0 (ASCII)110 101110 10 1011110 10 10 10111110 10 10 10 101111110 10 10 10 10 10
Table UTF-8 Structure
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Encoding CCCII
VariantsVariant glyph at different planeMostly used for library indexing
強 21 3D 48彊 2D 3D 48强 33 3D 48
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Bsdconv Decoding and Encoding
Alternative to iconv ISO-8859-1 UTF-8
from
toFigure Basic two phases conversion
Bsdconv Codecs amp Fallback
Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F
from
toFigure Fallback codec
Transliteration UTF-8 CP936 CP936-TRANS 3F
from
toFigure Multiple fallback codecs
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Big5 5C issue (許功蓋)
BIG5BIG5-5CBIG5 Input Output
Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo
BIG5-5CBIG5BIG5 Input Output
Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo
TraditionalSimplified Chinese
NOT one-to-one mappingTraditional 乾幹干
vsSimplified 干干干Context dependent之後夜之后入夜之後Variants峰峯
Project Chvar (12)httpsgithubcombuganinichvar
签簽 籖籤
Canonical group
Canonical group
Compatibility group
Figure Two level grouping in Chvar
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Project Chvar (22)httpsgithubcombuganinichvar
NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Bsdconv Phases
Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8
from
inter
toFigure Conversion with inter-mapping phase
Bsdconv Phases
Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8
from
inter
inter
toFigure Conversion with multiple inter-mapping phases
Unicode Casing
IS complicatedLowercase Uppercase
a Ai ITable English
Lowercase Uppercaseı Ii İTable Turkic
Lowercase Uppercasea Aagrave ATable French
Lowercase Uppercaseσ Σς Σ
Table Greek
Default Case Folding
Unicode Normalization Forms (12)UAX15
IndexingIdentification securityUsername Domain name
Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω
Table Canonical Equivalence
Unicode Normalization Forms (22)UAX15
Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1
Width size rotated カ カ︷
Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +
Table Compatibility Equivalence
Normalization for fuzzy matching
UTF-8UPPERUTF-8Input aăⅷDžбᾥ
Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8
Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss
Composition DecompositionCanonical NFC NFD
Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Bsdconv Codec argument
Other than question mark UTF-8 ANY0121 ASCII ANY21
from
toFigure Codec argument
Or more than one character UTF-8 ANY013F0121 ASCII ANY21
from
toFigure Data list separated by dot
Bsdconv Alias
from3FANY013FampERROR
to3FANY3FampERROR
fromUTF-8ASCII_UTF-8
interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER
interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION
interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD
filter01UNICODE
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
(Basic Multilingual Plane)
GB18030
CNS11643
CP950
Latin1Figure Character Sets
Bsdconv Types
(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Look-through (14)
u03B1CEB2Input (UTF-8 literal)
ESCAPE Decoder
01
03
B1
03
CE
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (24)
01
03
B1
03CE
03B2
Internal data
PASSMARKampFOR=1BYTEEncoder
01
03
B1
MARK
CE
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (34)
0103
B1
MARK
CE B2
Internal data
PASSUNMARKUTF-8 Decoder
01
03
B1
01
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (44)
0103
B1
01
03
B2
Internal data
UTF-8
Encoder
CE
B1
rdquoαrdquoCE
B2
rdquoβrdquo
Internal data
αβ
Output (UTF-8 literal)
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
String width measurement
echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2
Chinese charset encoding detectionhttpsgithubcombuganinichiconv
ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001
$COUNT
帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40
UTF-16LE 20 5 2 00
Khmer legacy font converterhttpsgithubcombuganinikhmerconv
IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8
SolutionTwo pass detection
Detect encodingDetect font family (currently not working)(High converage in SBCS)
Algorithm ported from Khmer Converter3
Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]
3httpwwwkhmerosinfoenkhmer-converter
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
Terminal transcodinghttpsgithubcombuganinibug5
IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help
Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Input (Big5 literal): ★ \xC5 \x1B[1m \xE5
Internal data after each step (the leading byte of each item is its type: 01 Unicode, 03 byte, 1B ANSI control sequence)
  1/6  ANSI-CONTROL,BYTE (decoder)        03A1 03B9 03C5 1B5B316D 03E5
  2/6  BIG5-DEFRAG (inter-conversion)     03A1 03B9 03C5 03E5 1B5B316D
  3/6  BYTE,PASS#MARK&FOR=1B (encoder)    A1 B9 C5 E5 1B5B316D (MARK)
  4/6  PASS#UNMARK,BIG5 (decoder)         012605 019A5A 1B5B316D
  5/6  AMBIGUOUS-PAD (inter-conversion)   012605 01A0 019A5A 1B5B316D
  6/6  UTF-8,PASS#FOR=1B (encoder)        see output
Output (UTF-8 literal): ★ (NBSP) 驚 \x1B[1m
Table: Entities used in the example
  Entity  Unicode  UTF-8 hex  Big5 hex
  ★       U+2605   E29885     A1B9
  驚      U+9A5A   E9A99A     C5E5
  [       U+005B   5B         5B
  1       U+0031   31         31
  m       U+006D   6D         6D
  (NBSP)  U+00A0   C2A0       -
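The defragmentation and padding steps of the walkthrough can be imitated outside bsdconv. The following is a minimal Python sketch, not the bug5 implementation: `defrag`, `read_csi` and `pad_ambiguous` are hypothetical helper names, the lead-byte range 0x81..0xFE is an assumption, and only CSI sequences (ESC [ ... final byte) are recognised.

```python
import unicodedata

ESC = 0x1B


def is_lead(b: int) -> bool:
    # Assumed Big5 lead-byte range; real variants may differ slightly.
    return 0x81 <= b <= 0xFE


def read_csi(data: bytes, i: int) -> int:
    """Return the index just past a CSI sequence starting at i, or i if there is none."""
    if i + 1 < len(data) and data[i] == ESC and data[i + 1] == 0x5B:
        j = i + 2
        while j < len(data) and not (0x40 <= data[j] <= 0x7E):
            j += 1
        if j < len(data):
            return j + 1
    return i


def defrag(data: bytes) -> bytes:
    """Move control sequences found between a lead and trail byte to after the pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead(b):
            j = i + 1
            pending = bytearray()
            while True:
                k = read_csi(data, j)
                if k == j:
                    break
                pending += data[j:k]
                j = k
            if j < len(data):
                out += bytes([b, data[j]])  # rejoin lead and trail byte
                out += pending              # control sequences follow the character
                i = j + 1
            else:
                out.append(b)
                out += pending
                i = j
        else:
            k = read_csi(data, i)
            if k > i:
                out += data[i:k]
                i = k
            else:
                out.append(b)
                i += 1
    return bytes(out)


def pad_ambiguous(text: str) -> str:
    """Append NBSP after every East Asian Ambiguous character."""
    out = []
    for ch in text:
        out.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            out.append("\u00a0")
    return "".join(out)


raw = b"\xa1\xb9\xc5\x1b[1m\xe5"
text = pad_ambiguous(defrag(raw).decode("big5"))
print(repr(text))
```

Unlike the pipeline above, this sketch decodes the control sequence together with the text instead of marking it and passing it through untouched; that works here only because the sequence is plain ASCII.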
Bsdconv Bindings
  Python: https://pypi.python.org/pypi/bsdconv
  Ruby: https://rubygems.org/gems/ruby-bsdconv
  Go: https://github.com/buganini/go-bsdconv
  Perl: https://github.com/buganini/perl-bsdconv
  PHP: https://github.com/buganini/php-bsdconv
  PostgreSQL: https://github.com/buganini/postgres-bsdconv
  MySQL: https://github.com/buganini/mysql-udf-bsdconv
  Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
  Text
  File name
  File content
  Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
Bsdconv Decoding and Encoding
Alternative to iconv
  ISO-8859-1:UTF-8
  (from : to)
Figure: Basic two phases conversion
Bsdconv Codecs & Fallback
Optionally produce question mark (U+003F) as replacement
  UTF-8,3F:ASCII,3F
  (from : to)
Figure: Fallback codec
Transliteration
  UTF-8:CP936,CP936-TRANS,3F
  (from : to)
Figure: Multiple fallback codecs
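The fallback idea (try each codec of a phase in order, and end with a constant replacement) can be sketched in plain Python. This is only an analogy using Python's built-in codecs, not bsdconv's API; `encode_with_fallbacks` is a hypothetical helper.

```python
def encode_with_fallbacks(text, encodings, replacement=b"?"):
    """Encode each character with the first codec that accepts it.

    Characters rejected by every codec become `replacement`, which plays
    the role of the trailing 3F codec in the conversion strings above.
    """
    out = bytearray()
    for ch in text:
        for enc in encodings:
            try:
                out += ch.encode(enc)
                break
            except UnicodeEncodeError:
                continue
        else:
            out += replacement
    return bytes(out)


print(encode_with_fallbacks("a€b", ["ascii"]))             # only ASCII, then '?'
print(encode_with_fallbacks("a€b", ["ascii", "cp1252"]))   # second codec catches the euro sign
print(encode_with_fallbacks("a€b", ["ascii"], b"!"))       # a different replacement byte
```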
Big5 5C issue (許功蓋)
BIG5:BIG5-5C,BIG5
                Input              Output
  Big5 literal  "成功"             "成功\"
  ASCII/hex     "\xA6\xA8\xA5\"    "\xA6\xA8\xA5\\"
BIG5-5C,BIG5:BIG5
                Input              Output
  Big5 literal  "成功\"            "成功"
  ASCII/hex     "\xA6\xA8\xA5\\"   "\xA6\xA8\xA5\"
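The root of the problem is that the second byte of 功 is 0x5C, the ASCII backslash. A small Python sketch shows this and imitates the doubling and undoubling of that byte; `escape_5c` and `unescape_5c` are hypothetical helpers written for illustration, and the lead-byte range is an assumption.

```python
def is_big5_lead(b: int) -> bool:
    # Assumed lead-byte range for Big5 and its common extensions.
    return 0x81 <= b <= 0xFE


def escape_5c(data: bytes) -> bytes:
    """Double a 0x5C that appears as the trail byte of a double-byte character."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if is_big5_lead(b) and i + 1 < len(data):
            trail = data[i + 1]
            out += bytes([b, trail])
            if trail == 0x5C:
                out.append(0x5C)
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


def unescape_5c(data: bytes) -> bytes:
    """Undo escape_5c: drop one 0x5C that follows a 0x5C trail byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if is_big5_lead(b) and i + 1 < len(data):
            trail = data[i + 1]
            out += bytes([b, trail])
            i += 2
            if trail == 0x5C and i < len(data) and data[i] == 0x5C:
                i += 1
        else:
            out.append(b)
            i += 1
    return bytes(out)


raw = "成功".encode("big5")
print(raw)              # the last byte is 0x5C, i.e. a backslash
print(escape_5c(raw))   # the backslash is doubled
```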
Traditional/Simplified Chinese
NOT a one-to-one mapping
  Traditional: 乾 幹 干
  Simplified:  干 干 干
Context dependent
  之後, 夜之后, 入夜之後
Variants
  峰 峯
Project Chvar (1/2)
https://github.com/buganini/chvar
Figure: Two-level grouping in Chvar (签 簽 form one canonical group, 籖 籤 form another; both canonical groups belong to one compatibility group)
Table: Canonical Group
          签   簽   籖   籤
  TW      簽   -    籤   -
  CN      -    签   -    籖
  CP950   簽   -    籤   -
  GB2312  -    签   ×    ×
Table: Compatibility Group
          签   簽   籖   籤
  TW      簽   -    簽   簽
  CN      -    签   签   签
  CP950   簽   -    簽   簽
  GB2312  -    签   签   签
Project Chvar (2/2)
https://github.com/buganini/chvar
Normalization: Canonical Equivalence
Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
Fuzzy character matching: Compatibility Equivalence
Bsdconv Phases
Traditional Chinese ⇔ Simplified Chinese
  UTF-8:ZHTW:UTF-8
  (from : inter : to)
Figure: Conversion with inter-mapping phase
Furthermore, phrases mapping
  UTF-8:ZHTW:ZHTW-WORDS:UTF-8
  (from : inter : inter : to)
Figure: Conversion with multiple inter-mapping phases
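Why a second, phrase-level phase helps can be shown with a toy two-pass converter in Python. The tables below are tiny hypothetical examples built around the 之後 / 夜之后 / 入夜之後 case, not bsdconv's ZHTW or ZHTW-WORDS data, and the longest-match strategy is an assumption made for this sketch.

```python
# Hypothetical single-character mapping (Simplified -> Traditional).
CHAR_MAP = {"后": "後"}

# Hypothetical phrase-level corrections applied after the character pass.
# The longest match wins, so "入夜之後" is protected from the "夜之後" rule.
PHRASE_MAP = {
    "夜之後": "夜之后",
    "入夜之後": "入夜之後",
}


def map_chars(text: str) -> str:
    """First inter phase: character-by-character mapping."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def map_phrases(text: str) -> str:
    """Second inter phase: longest-match phrase replacement."""
    keys = sorted(PHRASE_MAP, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(PHRASE_MAP[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def to_traditional(text: str) -> str:
    return map_phrases(map_chars(text))


for sample in ("之后", "夜之后", "入夜之后"):
    print(sample, "->", to_traditional(sample))
```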
Unicode Casing
IS complicated
Table: English
  Lowercase  Uppercase
  a          A
  i          I
Table: Turkic
  Lowercase  Uppercase
  ı          I
  i          İ
Table: French
  Lowercase  Uppercase
  a          A
  à          A
Table: Greek
  Lowercase  Uppercase
  σ          Σ
  ς          Σ
Default Case Folding
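Some of these cases can be observed directly with Python's built-in string methods, which implement the locale-independent Unicode default mappings (so the Turkic rules are not applied). `fold_equal` is a hypothetical helper for caseless comparison.

```python
def fold_equal(a: str, b: str) -> bool:
    """Caseless comparison using Unicode default case folding."""
    return a.casefold() == b.casefold()


# Greek: both sigma forms share one uppercase letter.
print("σ".upper(), "ς".upper())

# Default folding treats the final sigma and the regular sigma as equal.
print(fold_equal("ς", "σ"))

# Turkic dotless i uppercases to plain I, but default folding does not
# map I back to ı, so the pair does not compare equal.
print("ı".upper(), fold_equal("ı", "I"))

# Lowercasing İ yields i followed by a combining dot above (two code points).
print([hex(ord(c)) for c in "İ".lower()])
```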
Unicode Normalization Forms (1/2) [UAX #15]
Indexing
Identification, security: username, domain name
Table: Canonical Equivalence
  Combining sequence            Ç ↔ C + ◌̧
  Ordering of combining marks   q + ◌̇ + ◌̣ ↔ q + ◌̣ + ◌̇
  Hangul                        가 ↔ ᄀ + ᅡ
  Singleton                     Ω (U+2126) ↔ Ω (U+03A9)
Unicode Normalization Forms (2/2) [UAX #15]
Table: Compatibility Equivalence
  Font variants              ℌ → H
  Breaking differences       NBSP → SP
  Cursive forms              ﻥ → ن
  Circled                    ① → 1
  Width, size, rotated       ｶ → カ, ︷ → {
  Superscripts/subscripts    ⁹ → 9
  Squared characters         ㍿ → 株 + 式 + 会 + 社
  Fractions                  ¾ → 3 + ⁄ + 4
  Others                     ǆ → d + z + ◌̌
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
  Input:  aăⅷǅбᾥ
  Output: AĂⅧǄБᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
  Input:  ¼ℌℍăǅ⁹灣湾ド195082鬒 æß
  Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss
Table: The four Unicode normalization forms and the transformations
                 Composition  Decomposition
  Canonical      NFC          NFD
  Compatibility  NFKC         NFKD
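The four forms, and a fuzzy-matching key built from them, can be tried with Python's standard `unicodedata` module. The chain used in `nfkd_casefold` (NFD, case fold, NFKD, case fold, NFKD) is an assumption made for this sketch; it is not taken from bsdconv's source, and it covers none of the Chinese or kana handling shown above.

```python
import unicodedata


def nfkd_casefold(s: str) -> str:
    """Fuzzy-matching key: decompose, fold case, and repeat until stable (assumed chain)."""
    s = unicodedata.normalize("NFD", s)
    s = s.casefold()
    s = unicodedata.normalize("NFKD", s)
    s = s.casefold()
    return unicodedata.normalize("NFKD", s)


# Canonical equivalence
print(unicodedata.normalize("NFC", "C\u0327") == "\u00c7")        # combining sequence
print(unicodedata.normalize("NFD", "가") == "\u1100\u1161")        # Hangul
print(unicodedata.normalize("NFC", "\u2126") == "\u03a9")         # singleton (ohm sign)

# Compatibility equivalence
print(unicodedata.normalize("NFKC", "①"))

# Fuzzy key
print(nfkd_casefold("ℌℍ⁹"), nfkd_casefold("ß"))
```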
Bsdconv Cascade
Re-encode
  UTF-8:ISO-8859-1|BIG5:UTF-8
  (from : to | from : to)
Figure: Cascaded conversions
  Input:  ¥x¥_
  Output: 台北
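The same repair can be reproduced with Python's built-in codecs, which makes the two stages explicit. This is an illustration of the cascade idea rather than a use of bsdconv itself; `cascade` is a hypothetical helper that chains (from, to) pairs.

```python
def cascade(data: bytes, stages) -> bytes:
    """Run decode/encode pairs in sequence, feeding each output to the next stage."""
    for src, dst in stages:
        data = data.decode(src).encode(dst)
    return data


# Big5 bytes that were mistakenly read as ISO-8859-1 and saved as UTF-8.
garbled = "¥x¥_".encode("utf-8")

# Stage 1 undoes the wrong ISO-8859-1 reading; stage 2 reads the bytes as Big5.
fixed = cascade(garbled, [("utf-8", "iso-8859-1"), ("big5", "utf-8")])
print(fixed.decode("utf-8"))
```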
Bsdconv Codec argument
Other than question mark
  UTF-8,ANY#0121:ASCII,ANY#21
  (from : to)
Figure: Codec argument
Or more than one character
  UTF-8,ANY#013F.0121:ASCII,ANY#21
  (from : to)
Figure: Data list separated by dot
Bsdconv Alias
  Phase   Alias          Expansion
  from    3F             ANY#013F&ERROR
  to      3F             ANY#3F&ERROR
  from    UTF-8          ASCII,_UTF-8
  inter   NFKD           _NFKD,_NF-HANGUL-DECOMPOSITION:_NF-ORDER
  inter   NFKC           NFKD:_NFC,_NF-HANGUL-COMPOSITION
  inter   NFKD-CASEFOLD  NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
  filter  01             UNICODE
Bsdconv Types
  (01) Unicode
  (02) CNS11643
  (03) Byte
  (04) Chinese components
  (1B) ANSI control sequences
  (00) Bsdconv special characters
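The internal-data dumps in this deck suggest that every item is a type byte followed by its payload, for example 01 26 05 for U+2605 and 03 CE for the raw byte CE. A minimal Python sketch of that tagging follows; the layout is inferred from the dumps only, the helper names are hypothetical, and bsdconv's real in-memory representation may differ.

```python
TYPE_UNICODE = 0x01
TYPE_BYTE = 0x03


def unicode_item(cp: int) -> bytes:
    """Tag a code point: type 01 followed by its big-endian bytes without leading zeros."""
    n = max(1, (cp.bit_length() + 7) // 8)
    return bytes([TYPE_UNICODE]) + cp.to_bytes(n, "big")


def byte_item(b: int) -> bytes:
    """Tag an undecoded byte: type 03 followed by the byte itself."""
    return bytes([TYPE_BYTE, b])


def describe(item: bytes) -> str:
    """Render a tagged item in a human-readable form."""
    kind, payload = item[0], item[1:]
    if kind == TYPE_UNICODE:
        return "U+%04X" % int.from_bytes(payload, "big")
    if kind == TYPE_BYTE:
        return "byte %02X" % payload[0]
    return "type %02X: %s" % (kind, payload.hex().upper())


print(unicode_item(0x2605).hex().upper())   # 012605
print(describe(b"\x01\x9a\x5a"))            # U+9A5A
print(describe(byte_item(0xCE)))            # byte CE
```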
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
  Input:  功夫不好不要艹我
  Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
  Input:  功夫不好不要艹我
  Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
  Input:  功夫不好不要艹我
  Output: pu nao yao [uh]2
Bsdconv Flags
  FREE - memory management
  MARK - identifier
Look-through
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Input (UTF-8 literal): the escape for U+03B1 followed by the escapes for the bytes CE and B2
Internal data after each step (the leading byte of each item is its type: 01 Unicode, 03 byte)
  1/4  ESCAPE (decoder)                 0103B1 03CE 03B2
  2/4  PASS#MARK&FOR=1,BYTE (encoder)   0103B1 (MARK) CE B2
  3/4  PASS#UNMARK,UTF-8 (decoder)      0103B1 0103B2
  4/4  UTF-8 (encoder)                  CE B1 ("α") CE B2 ("β")
Output (UTF-8 literal): αβ
Table: Entities used in the example
  Entity  Unicode  UTF-8 hex
  α       U+03B1   CEB1
  β       U+03B2   CEB2
String width measurement
  echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
  FULL 2
  HALF 7
  AMBI 2
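The same three counters can be approximated with Python's `unicodedata.east_asian_width`. This is a sketch of the classification only: `width_counts` is a hypothetical helper, and folding Neutral, Narrow and Halfwidth into HALF is an assumption based on the width table above.

```python
import unicodedata


def width_counts(text: str) -> dict:
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        w = unicodedata.east_asian_width(ch)
        if w in ("F", "W"):
            counts["FULL"] += 1
        elif w == "A":
            counts["AMBI"] += 1
        else:  # N, Na, H
            counts["HALF"] += 1
    return counts


# U+02CA and U+02CB are the two ambiguous-width tone marks in the sample string.
print(width_counts("42(\u02ca_>\u02cb) 紅茶"))
```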
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR × $COUNT × 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
  ENCODING  SCORE  COUNT  IERR  Score(s)
  UTF-8     19     4      0     4.75
  BIG5      8      3      2     -4.0
  GBK       4      1      4     -36.0
  CCCII     36     9      0     4.0
  UTF-16LE  20     5      2     0.0
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Bsdconv Decoding and Encoding
Alternative to iconv ISO-8859-1 UTF-8
from
toFigure Basic two phases conversion
Bsdconv Codecs amp Fallback
Optionally produce question mark (U+003F) as replacement UTF-8 3F ASCII 3F
from
toFigure Fallback codec
Transliteration UTF-8 CP936 CP936-TRANS 3F
from
toFigure Multiple fallback codecs
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Big5 5C issue (許功蓋)
BIG5BIG5-5CBIG5 Input Output
Big5 Literal rdquo 成功rdquo rdquo 成功 rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo
BIG5-5CBIG5BIG5 Input Output
Big5 Literal rdquo 成功 rdquo rdquo 成功rdquoASCIIHex rdquoxA6xA8xA5rdquo rdquoxA6xA8xA5rdquo
TraditionalSimplified Chinese
NOT one-to-one mappingTraditional 乾幹干
vsSimplified 干干干Context dependent之後夜之后入夜之後Variants峰峯
Project Chvar (12)httpsgithubcombuganinichvar
签簽 籖籤
Canonical group
Canonical group
Compatibility group
Figure Two level grouping in Chvar
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Project Chvar (22)httpsgithubcombuganinichvar
NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Bsdconv Phases
Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8
from
inter
toFigure Conversion with inter-mapping phase
Bsdconv Phases
Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8
from
inter
inter
toFigure Conversion with multiple inter-mapping phases
Unicode Casing
IS complicatedLowercase Uppercase
a Ai ITable English
Lowercase Uppercaseı Ii İTable Turkic
Lowercase Uppercasea Aagrave ATable French
Lowercase Uppercaseσ Σς Σ
Table Greek
Default Case Folding
Unicode Normalization Forms (12)UAX15
IndexingIdentification securityUsername Domain name
Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω
Table Canonical Equivalence
Unicode Normalization Forms (22)UAX15
Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1
Width size rotated カ カ︷
Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +
Table Compatibility Equivalence
Normalization for fuzzy matching
UTF-8UPPERUTF-8Input aăⅷDžбᾥ
Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8
Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss
Composition DecompositionCanonical NFC NFD
Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Bsdconv Codec argument
Other than question mark UTF-8 ANY0121 ASCII ANY21
from
toFigure Codec argument
Or more than one character UTF-8 ANY013F0121 ASCII ANY21
from
toFigure Data list separated by dot
Bsdconv Alias
from3FANY013FampERROR
to3FANY3FampERROR
fromUTF-8ASCII_UTF-8
interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER
interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION
interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD
filter01UNICODE
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
(Basic Multilingual Plane)
GB18030
CNS11643
CP950
Latin1Figure Character Sets
Bsdconv Types
(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Look-through (14)
u03B1CEB2Input (UTF-8 literal)
ESCAPE Decoder
01
03
B1
03
CE
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Bsdconv Decoding and Encoding
Alternative to iconv
Conversion: ISO-8859-1:UTF-8
  from: ISO-8859-1
  to:   UTF-8
Figure: Basic two-phase conversion
Bsdconv Codecs & Fallback
Optionally produce a question mark (U+003F) as replacement
Conversion: UTF-8,3F:ASCII,3F
  from: UTF-8, 3F
  to:   ASCII, 3F
Figure: Fallback codec
Transliteration
Conversion: UTF-8:CP936,CP936-TRANS,3F
  from: UTF-8
  to:   CP936, CP936-TRANS, 3F
Figure: Multiple fallback codecs
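The fallback idea can be sketched in plain Python, without bsdconv itself: each character is offered to a list of codecs in order, and the first one that accepts it wins. The function names and the tiny `TRANSLIT` table are hypothetical stand-ins for illustration, not bsdconv's API or its real transliteration data.

```python
def ascii_codec(ch: str) -> bytes:
    # Primary codec: succeeds only for ASCII characters.
    return ch.encode("ascii")


# Hypothetical transliteration table standing in for a "-TRANS" codec.
TRANSLIT = {"é": b"e", "ü": b"u"}


def translit_codec(ch: str) -> bytes:
    # Raises KeyError when the table has no entry for the character.
    return TRANSLIT[ch]


def question_mark(ch: str) -> bytes:
    # Last resort: always succeeds and emits 0x3F, like the 3F codec.
    return b"\x3f"


def encode_with_fallbacks(text: str, codecs) -> bytes:
    out = bytearray()
    for ch in text:
        for codec in codecs:
            try:
                out += codec(ch)
                break
            except (UnicodeEncodeError, KeyError):
                continue  # this codec rejected the character, try the next
        else:
            raise ValueError(f"no codec accepted {ch!r}")
    return bytes(out)


print(encode_with_fallbacks("café 茶", [ascii_codec, translit_codec, question_mark]))
```

Dropping `question_mark` from the list turns an unmappable character into an error instead of a replacement, which mirrors the "optionally" on the slide.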
Encoding Big5
Many incompatible variations (abusing PUA); none of the standard tools can rule them all
http://moztw.org/docs/big5
Table: Scenario and dominating encoding
  Scenario     Dominating encoding
  Microsoft    CP950
  Taiwan BBS   UAO (Unicode-at-Once)
  gov.tw       Big5-2003
  gov.hk       HKSCS (1999/2001/2004)
Special characters conflict
  The second byte could be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts
Big5 5C issue (許功蓋)
Conversion: BIG5:BIG5-5C,BIG5
                Input               Output
  Big5 literal  "成功"              "成功\"
  ASCII/Hex     "\xA6\xA8\xA5\"     "\xA6\xA8\xA5\\"
Conversion: BIG5-5C,BIG5:BIG5
                Input               Output
  Big5 literal  "成功\"             "成功"
  ASCII/Hex     "\xA6\xA8\xA5\\"    "\xA6\xA8\xA5\"
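A minimal Python sketch of the 5C problem, using the standard library's `big5` codec rather than bsdconv. It shows that 功 ends in the byte 0x5C and escapes such trail bytes by doubling the backslash. The helper names are hypothetical, and leaving stand-alone ASCII backslashes untouched is an assumption of this sketch, not a documented property of the BIG5-5C codec.

```python
def escape_5c(data: bytes) -> bytes:
    """Append an extra 0x5C after every double-byte character whose trail byte is 0x5C."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE and i + 1 < len(data):  # Big5 lead byte
            out += data[i:i + 2]
            if data[i + 1] == 0x5C:
                out.append(0x5C)
            i += 2
        else:  # single-byte (ASCII) character, copied as-is
            out.append(b)
            i += 1
    return bytes(out)


def unescape_5c(data: bytes) -> bytes:
    """Inverse of escape_5c: drop the extra 0x5C that follows a 0x5C trail byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE and i + 1 < len(data):
            out += data[i:i + 2]
            i += 2
            if data[i - 1] == 0x5C and i < len(data) and data[i] == 0x5C:
                i += 1  # skip the doubled backslash
        else:
            out.append(b)
            i += 1
    return bytes(out)


raw = "成功".encode("big5")
print(raw)             # the last byte is the ASCII backslash
print(escape_5c(raw))  # backslash doubled, so a C-style parser keeps it
```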
Traditional/Simplified Chinese
NOT a one-to-one mapping
  Traditional: 乾 幹 干
  Simplified:  干 干 干
Context dependent: 之後, 夜之后, 入夜之後
Variants: 峰, 峯
Project Chvar (1/2)
https://github.com/buganini/chvar
Figure: Two-level grouping in Chvar. 签 and 簽 form one canonical group, 籖 and 籤 form another, and both canonical groups belong to one compatibility group.
Table: Canonical Group
          签   簽   籖   籤
  TW      簽   -    籤   -
  CN      -    签   -    籖
  CP950   簽   -    籤   -
  GB2312  -    签   ×    ×
Table: Compatibility Group
          签   簽   籖   籤
  TW      簽   -    簽   簽
  CN      -    签   签   签
  CP950   簽   -    簽   簽
  GB2312  -    签   签   签
Project Chvar (2/2)
https://github.com/buganini/chvar
Normalization: Canonical Equivalence
Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
Fuzzy character matching: Compatibility Equivalence
Bsdconv Phases
Traditional Chinese ⇔ Simplified Chinese
Conversion: UTF-8:ZHTW:UTF-8
  from:  UTF-8
  inter: ZHTW
  to:    UTF-8
Figure: Conversion with inter-mapping phase
Bsdconv Phases
Furthermore, phrase mapping
Conversion: UTF-8:ZHTW:ZHTW-WORDS:UTF-8
  from:  UTF-8
  inter: ZHTW
  inter: ZHTW-WORDS
  to:    UTF-8
Figure: Conversion with multiple inter-mapping phases
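The from / inter / inter / to structure can be sketched as a small Python pipeline. The two mapping tables below are tiny hypothetical examples standing in for the ZHTW and ZHTW-WORDS data; they are assumptions for illustration and say nothing about what the real tables contain.

```python
# Hypothetical one-entry tables; the real codecs ship far larger data.
CHAR_MAP = {"软": "軟"}        # stands in for a character-level inter phase
PHRASE_MAP = {"軟件": "軟體"}  # stands in for a phrase-level inter phase


def convert(data: bytes) -> bytes:
    text = data.decode("utf-8")                        # from: bytes to characters
    text = "".join(CHAR_MAP.get(c, c) for c in text)   # inter 1: per-character mapping
    for src, dst in PHRASE_MAP.items():                # inter 2: phrase mapping
        text = text.replace(src, dst)
    return text.encode("utf-8")                        # to: characters to bytes


print(convert("软件".encode("utf-8")).decode("utf-8"))
```

The order matters in this sketch: the phrase table is written in terms of the output of the character phase, which is why the phrase phase runs second.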
Unicode Casing
IS complicated
Table: English
  Lowercase  Uppercase
  a          A
  i          I
Table: Turkic
  Lowercase  Uppercase
  ı          I
  i          İ
Table: French
  Lowercase  Uppercase
  a          A
  à          A
Table: Greek
  Lowercase  Uppercase
  σ          Σ
  ς          Σ
Default Case Folding
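Python's built-in string methods follow the Unicode default (locale-independent) case mappings, so they give a quick way to see some of these cases. This is only an illustration of default case folding; Turkic and French conventions need locale-aware tailoring that the snippet does not attempt, and `same_ignoring_case` is a name made up for this sketch.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    # Caseless comparison via Unicode default case folding.
    return a.casefold() == b.casefold()


# Both Greek lowercase sigmas share one uppercase form.
print("σ".upper(), "ς".upper())

# Dotless ı uppercases to plain I under the default mapping.
print("ı".upper())

# Case folding maps final sigma to σ and ß to "ss".
print("ς".casefold(), "ß".casefold())

print(same_ignoring_case("ΣΑΣ", "σας"))
print(same_ignoring_case("STRASSE", "straße"))
```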
Unicode Normalization Forms (1/2)
UAX #15
Indexing
Identification, security
  Username, domain name
Table: Canonical Equivalence
  Combining sequence            Ç  ↔  C + U+0327 (combining cedilla)
  Ordering of combining marks   q + U+0307 + U+0323  ↔  q + U+0323 + U+0307
  Hangul                        가  ↔  ᄀ + ᅡ
  Singleton                     Ω (U+2126, ohm sign)  ↔  Ω (U+03A9)
Unicode Normalization Forms (2/2)
UAX #15
Table: Compatibility Equivalence
  Font variants              ℌ → H
  Breaking differences       NBSP → SP
  Cursive forms              ن (cursive presentation form) → ن
  Circled                    ① → 1
  Width, size, rotated       カ → カ, ︷ → {
  Superscripts/subscripts    ⁹ → 9
  Squared characters         ㍿ → 株 + 式 + 会 + 社
  Fractions                  ¾ → 3 + ⁄ + 4
  Others                     dž → d + z + U+030C (combining caron)
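The rows of the two equivalence tables can be checked with Python's standard `unicodedata` module. The helper `nf` is just a shorthand introduced here; the code points are written as escapes so that the source file itself is not affected by normalization.

```python
import unicodedata


def nf(form: str, s: str) -> str:
    # Thin wrapper around the standard library normalizer.
    return unicodedata.normalize(form, s)


def codepoints(s: str) -> str:
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Canonical equivalence (NFC / NFD)
print(codepoints(nf("NFD", "\u00C7")))         # Ç decomposes to C + combining cedilla
print(codepoints(nf("NFD", "q\u0307\u0323")))  # combining marks are reordered
print(codepoints(nf("NFD", "\uAC00")))         # Hangul syllable decomposes to jamo
print(codepoints(nf("NFC", "\u2126")))         # ohm sign is replaced by Greek omega

# Compatibility equivalence (NFKC / NFKD)
print(nf("NFKC", "\u210C"), nf("NFKC", "\u2460"), nf("NFKC", "\u337F"))
print(codepoints(nf("NFKD", "\u00BE")))        # ¾ becomes 3, fraction slash, 4

# Canonical forms leave compatibility characters alone.
print(nf("NFC", "\u210C") == "\u210C")
```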
Normalization for fuzzy matching
Conversion: UTF-8:UPPER:UTF-8
  Input:  aăⅷDžбᾥ
  Output: AĂⅧDŽБᾭ
Conversion with UTF-8 (from), ZH-FUZZY-TW, KANA-PHONETIC, NFKD-CASEFOLD (inter), UTF-8 (to)
  Input:  ¼ℌℍăDž⁹灣湾ド195082鬒 æß
  Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss
Table: The four Unicode normalization forms and the transformations
                 Composition  Decomposition
  Canonical      NFC          NFD
  Compatibility  NFKC         NFKD
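A rough Python counterpart of compatibility case folding, built from the standard library only. It chains NFD, case folding, NFKD, case folding, and NFKD, which is the sequence the deck's alias slide lists for NFKD-CASEFOLD; the function name is made up for this sketch and the Chinese and kana mappings are not covered.

```python
import unicodedata


def nfkd_casefold(s: str) -> str:
    # NFD -> casefold -> NFKD -> casefold -> NFKD
    s = unicodedata.normalize("NFD", s)
    s = s.casefold()
    s = unicodedata.normalize("NFKD", s)
    s = s.casefold()
    return unicodedata.normalize("NFKD", s)


for sample in ["ℌℍ", "ß", "⁹", "Dž", "¼"]:
    folded = nfkd_casefold(sample)
    print(sample, "->", " ".join(f"U+{ord(c):04X}" for c in folded))
```

Two strings can then be compared for a fuzzy match by folding both sides first and comparing the results.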
Bsdconv Cascade
Re-encode
Conversion: UTF-8:ISO-8859-1|BIG5:UTF-8
  First conversion:   from UTF-8, to ISO-8859-1
  Second conversion:  from BIG5, to UTF-8
Figure: Cascaded conversions
  Input:  ¥x¥_
  Output: 台北
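The same repair can be reproduced with Python's built-in codecs, as a sketch of what the two cascaded conversions do: the first turns the mojibake back into the original bytes, the second decodes those bytes with the encoding they were really in. The function name is an assumption made for this example.

```python
def reencode(data: bytes) -> bytes:
    # First conversion (UTF-8 to ISO-8859-1): recover the original raw bytes.
    raw = data.decode("utf-8").encode("iso-8859-1")
    # Second conversion (BIG5 to UTF-8): interpret those bytes as Big5.
    return raw.decode("big5").encode("utf-8")


mojibake = "¥x¥_".encode("utf-8")
print(reencode(mojibake).decode("utf-8"))
```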
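Bsdconv Codec argument
Other than a question mark
Conversion: UTF-8,ANY#0121:ASCII,ANY#21
  from: UTF-8, ANY#0121
  to:   ASCII, ANY#21
Figure: Codec argument
Or more than one character
Conversion: UTF-8,ANY#013F.0121:ASCII,ANY#21
  from: UTF-8, ANY#013F.0121
  to:   ASCII, ANY#21
Figure: Data list separated by dot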
Bsdconv Alias
  Phase   Alias          Expansion (codecs in order)
  from    3F             ANY#013F&ERROR
  to      3F             ANY#3F&ERROR
  from    UTF-8          ASCII, _UTF-8
  inter   NFKD           _NFKD, _NF-HANGUL-DECOMPOSITION, _NF-ORDER
  inter   NFKC           NFKD, _NFC, _NF-HANGUL-COMPOSITION
  inter   NFKD-CASEFOLD  NFD, CASEFOLD, NFKD, CASEFOLD, NFKD
  filter  01             UNICODE
Bsdconv Types
  (01) Unicode
  (02) CNS11643
  (03) Byte
  (04) Chinese components
  (1B) ANSI control sequences
  (00) Bsdconv special characters
Chinese components composition
https://github.com/buganini/chicomp
Conversion: UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
  Input:  功夫不好不要艹我
  Output: 巭孬嫑莪
Conversion: UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
  Input:  功夫不好不要艹我
  Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ
Conversion: UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
  Input:  功夫不好不要艹我
  Output: pu nao yao [uh]2
Bsdconv Flags
  FREE - memory management
  MARK - identifier
Look-through (1/4)
Input (UTF-8 literal): \u03B1 followed by the bytes CE B2
Stage: ESCAPE (decoder)
Internal data after this stage: [01 03 B1] [03 CE] [03 B2]
Table: Entities
  Entity  Unicode  UTF-8 hex
  α       U+03B1   CE B1
  β       U+03B2   CE B2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (2/4)
Internal data before this stage: [01 03 B1] [03 CE] [03 B2]
Stage: PASS#MARK&FOR=1,BYTE (encoder)
Internal data after this stage: [01 03 B1, flagged MARK] CE B2
Look-through (3/4)
Internal data before this stage: [01 03 B1, flagged MARK] CE B2
Stage: PASS#UNMARK,UTF-8 (decoder)
Internal data after this stage: [01 03 B1] [01 03 B2]
Look-through (4/4)
Internal data before this stage: [01 03 B1] [01 03 B2]
Stage: UTF-8 (encoder)
Internal data after this stage: CE B1 ("α"), CE B2 ("β")
Output (UTF-8 literal): αβ
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
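The four look-through steps can be modeled in a few lines of Python. This is a simplified model of the data flow only: items are `(type, value)` tuples using the type tags 01 (Unicode) and 03 (byte), and the MARK flag is replaced by simply passing Unicode items through untouched. All function names are assumptions of this sketch, not bsdconv internals.

```python
import re

UNICODE, BYTE = 0x01, 0x03  # type tags: 01 = Unicode, 03 = byte
ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def escape_decode(data: bytes):
    """Stage 1: \\uXXXX escapes become Unicode items, everything else byte items."""
    items, i = [], 0
    while i < len(data):
        m = ESCAPE.match(data, i)
        if m:
            items.append((UNICODE, int(m.group(1), 16)))
            i = m.end()
        else:
            items.append((BYTE, data[i]))
            i += 1
    return items


def utf8_decode_passing_unicode(items):
    """Stages 2-3: pass Unicode items through, decode runs of byte items as UTF-8."""
    out, pending = [], bytearray()

    def flush():
        if pending:
            out.extend((UNICODE, ord(ch)) for ch in pending.decode("utf-8"))
            pending.clear()

    for kind, value in items:
        if kind == BYTE:
            pending.append(value)
        else:
            flush()
            out.append((kind, value))
    flush()
    return out


def look_through(data: bytes) -> bytes:
    """Stage 4: encode every Unicode item as UTF-8."""
    items = utf8_decode_passing_unicode(escape_decode(data))
    return "".join(chr(value) for _, value in items).encode("utf-8")


print(look_through(b"\\u03B1\xce\xb2").decode("utf-8"))
```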
Unicode East Asian Width (1/2)
UAX #11
Figure: Venn diagram showing the set relations for the six categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)
Unicode East Asian Width (2/2)
UAX #11
Table: Examples for each character class and their resolved widths
  Resolved width  Class  Example
  Narrow          N      ऊ
  Narrow          Na     A
  Narrow          H      カ
  Ambiguous       A      Я
  Wide            F      A
  Wide            W      カ
  Wide            W      咦
Table: Width attributes
  Na  Narrow
  N   Neutral, usually treated as Narrow
  W   Wide
  F   Fullwidth
  H   Halfwidth
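The East Asian Width property is exposed by Python's standard `unicodedata` module, so the classes above can be looked up directly. The `RESOLVED` mapping mirrors the resolved-width grouping shown in the examples table; the names `RESOLVED` and `width_class` are introduced here for illustration only.

```python
import unicodedata

# Resolved width per East Asian Width class, following the examples table.
RESOLVED = {
    "N": "narrow", "Na": "narrow", "H": "narrow",
    "A": "ambiguous",
    "F": "wide", "W": "wide",
}


def width_class(ch: str):
    prop = unicodedata.east_asian_width(ch)
    return prop, RESOLVED[prop]


# Escapes keep the halfwidth and fullwidth forms unambiguous in source code.
for ch in ["\u090A", "A", "\uFF76", "\u042F", "\uFF21", "\u30AB", "\u54A6"]:
    print(f"U+{ord(ch):04X}", ch, *width_class(ch))
```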
String width measurement
echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
  FULL 2
  HALF 7
  AMBI 2
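A comparable count can be sketched with the standard library. The bucket names follow the FULL, HALF, and AMBI labels of the command output; skipping control characters such as the trailing newline from `echo` is an assumption of this sketch, chosen so the totals line up with the slide.

```python
import unicodedata


def measure_width(text: str) -> dict:
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        if unicodedata.category(ch) == "Cc":
            continue  # assumption: control characters (e.g. newline) are not counted
        prop = unicodedata.east_asian_width(ch)
        if prop in ("W", "F"):
            counts["FULL"] += 1
        elif prop == "A":
            counts["AMBI"] += 1
        else:  # Na, H, N
            counts["HALF"] += 1
    return counts


# U+02CA and U+02CB are the two ambiguous-width tone marks in the sample string.
print(measure_width("42(\u02CA_>\u02CB) 紅茶\n"))
```

A terminal that renders ambiguous characters as wide would need FULL * 2 + AMBI * 2 + HALF columns for this string, and FULL * 2 + AMBI + HALF columns otherwise.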
Chinese charset encoding detection
https://github.com/buganini/chiconv
Conversion: ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR × $COUNT × 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
  ENCODING  SCORE  COUNT  IERR  Score(s)
  UTF-8     19     4      0     4.75
  BIG5      8      3      2     -4.0
  GBK       4      1      4     -36.0
  CCCII     36     9      0     4.0
  UTF-16LE  20     5      2     0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
  Encodings without a registered name, bound to fonts
  Stored in CP1252 or UTF-8
Solution
  Two-pass detection
    Detect encoding
    Detect font family (currently not working) (high coverage in SBCS)
  Algorithm ported from Khmer Converter [3]
Khmer Converter
  Mapping
  Reordering
    Visual order vs. Unicode model
    Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
  UAO: non-standard Big5 extension
  Double color hack: ANSI control sequence in the middle of a DBCS character
  Ambiguous-width characters
  luit/screen cannot help
Solution (tl;dr)
  Big5 to Unicode:
    ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  Unicode to Big5:
    UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
Bug5 explained (1/6)
Input (Big5 literal): ★\xC5\x1B[1m\xE5
Stage: ANSI-CONTROL,BYTE (decoder)
Internal data after this stage: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Table: Entities
  Entity  Unicode  UTF-8 hex  Big5 hex
  ★       U+2605   E2 98 85   A1 B9
  驚      U+9A5A   E9 A9 9A   C5 E5
  [       U+005B   5B         5B
  1       U+0031   31         31
  m       U+006D   6D         6D
  (NBSP)  U+00A0   C2 A0      -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (2/6)
Internal data before this stage: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Stage: BIG5-DEFRAG (inter-conversion)
Internal data after this stage: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Bug5 explained (3/6)
Internal data before this stage: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Stage: BYTE,PASS#MARK&FOR=1B (encoder)
Internal data after this stage: A1 B9 C5 E5 [1B 5B 31 6D, flagged MARK]
Bug5 explained (4/6)
Internal data before this stage: A1 B9 C5 E5 [1B 5B 31 6D, flagged MARK]
Stage: PASS#UNMARK,BIG5 (decoder)
Internal data after this stage: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (5/6)
Internal data before this stage: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Stage: AMBIGUOUS-PAD (inter-conversion)
Internal data after this stage: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (6/6)
Internal data before this stage: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Stage: UTF-8,PASS#FOR=1B (encoder)
Output (UTF-8 literal): ★, NBSP, 驚, then \x1B[1m
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
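The defragmentation step is the least obvious part of the pipeline, so here is a standalone Python sketch of the idea: when an ANSI control sequence sits between the lead and trail byte of a double-byte character, the sequence is moved to after the completed character. The regular expression covers only simple CSI sequences such as `\x1b[1m`, and the function name is hypothetical; this is not bug5's or bsdconv's actual code.

```python
import re

# Simple CSI sequences only (ESC [ parameters final-letter); an assumption of this sketch.
ANSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def big5_defrag(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE:  # Big5 lead byte
            j = i + 1
            held = bytearray()
            while True:  # collect control sequences that split the character
                m = ANSI.match(data, j)
                if not m:
                    break
                held += m.group(0)
                j = m.end()
            if j < len(data):
                out += bytes((b, data[j]))  # rejoin lead and trail byte
                out += held                 # then emit the held sequences
                i = j + 1
            else:  # truncated input: nothing to rejoin
                out.append(b)
                out += held
                i = j
        else:
            m = ANSI.match(data, i)
            if m:
                out += m.group(0)
                i = m.end()
            else:
                out.append(b)
                i += 1
    return bytes(out)


fixed = big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")
print(fixed.hex(" "))
```

After this step the text bytes form valid Big5 again, so a regular Big5 decoder can handle them while the control sequence is passed through unchanged.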
Bsdconv Bindings
  Python:      https://pypi.python.org/pypi/bsdconv
  Ruby:        https://rubygems.org/gems/ruby-bsdconv
  Go:          https://github.com/buganini/go-bsdconv
  Perl:        https://github.com/buganini/perl-bsdconv
  PHP:         https://github.com/buganini/php-bsdconv
  PostgreSQL:  https://github.com/buganini/postgres-bsdconv
  MySQL:       https://github.com/buganini/mysql-udf-bsdconv
  Irssi:       https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
  Text
  File name
  File content
  Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
Look-through (24)
01
03
B1
03CE
03B2
Internal data
PASSMARKampFOR=1BYTEEncoder
01
03
B1
MARK
CE
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (34)
0103
B1
MARK
CE B2
Internal data
PASSUNMARKUTF-8 Decoder
01
03
B1
01
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (44)
0103
B1
01
03
B2
Internal data
UTF-8
Encoder
CE
B1
rdquoαrdquoCE
B2
rdquoβrdquo
Internal data
αβ
Output (UTF-8 literal)
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
String width measurement
echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2
Chinese charset encoding detectionhttpsgithubcombuganinichiconv
ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001
$COUNT
帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40
UTF-16LE 20 5 2 00
Khmer legacy font converterhttpsgithubcombuganinikhmerconv
IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8
SolutionTwo pass detection
Detect encodingDetect font family (currently not working)(High converage in SBCS)
Algorithm ported from Khmer Converter3
Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]
3httpwwwkhmerosinfoenkhmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues:
  UAO: non-standard Big5 extension
  Double-color hack: ANSI control sequence in the middle of a DBCS character
  Ambiguous-width characters
  luit/screen cannot help
Solution (tl;dr):
  Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  Unicode to Big5: UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
Bug5 explained (1/6)
Input (Big5 literal): ★\xC5\x1B[1m\xE5
Decoder: ANSI-CONTROL,BYTE
Internal data: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Entity | Unicode | UTF-8 Hex | Big5 Hex
★ | U+2605 | E29885 | A1B9
驚 | U+9A5A | E9A99A | C5E5
[ | U+005B | 5B | 5B
1 | U+0031 | 31 | 31
m | U+006D | 6D | 6D
(NBSP) | U+00A0 | C2A0 | -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (2/6)
Internal data (in): [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Inter-conversion: BIG5-DEFRAG
Internal data (out): [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Bug5 explained (3/6)
Internal data (in): [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Encoder: BYTE,PASS#MARK&FOR=1B
Output: A1 B9 C5 E5 [1B 5B 31 6D, MARK]
Bug5 explained (4/6)
Input: A1 B9 C5 E5 [1B 5B 31 6D, MARK]
Decoder: PASS#UNMARK,BIG5
Internal data: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (5/6)
Internal data (in): [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Inter-conversion: AMBIGUOUS-PAD
Internal data (out): [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (6/6)
Internal data (in): [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Encoder: UTF-8,PASS#FOR=1B
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
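The two inter-conversion steps of this walk-through can be sketched in Python. This is an illustration under stated assumptions, not bug5's code: the names `big5_defrag`, `ambiguous_pad` and `big5_terminal_to_unicode` are hypothetical, the control-sequence pattern only covers simple `ESC [ ... letter` sequences, and padding every Ambiguous-width character with one NBSP is inferred from the slides.

```python
import re
import unicodedata

# Simplified ANSI control sequence: ESC [ params final-letter (assumption).
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")

def big5_defrag(data: bytes) -> bytes:
    """Move control sequences that split a DBCS character to after it."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        m = CSI.match(data, i)
        if m:                               # control sequence between characters
            out += m.group(0)
            i = m.end()
        elif 0x81 <= b <= 0xFE:             # Big5 lead byte
            j, pending = i + 1, bytearray()
            while (m := CSI.match(data, j)):  # sequences before the trail byte
                pending += m.group(0)
                j = m.end()
            out.append(b)
            if j < n:
                out.append(data[j])         # rejoin lead and trail byte
                j += 1
            out += pending
            i = j
        else:
            out.append(b)
            i += 1
    return bytes(out)

def ambiguous_pad(text: str) -> str:
    """Append NBSP after each East Asian Ambiguous-width character."""
    return "".join(
        ch + "\u00a0" if unicodedata.east_asian_width(ch) == "A" else ch
        for ch in text
    )

def big5_terminal_to_unicode(data: bytes) -> str:
    """Defragment, decode Big5 text, pad ambiguous characters, keep controls."""
    data = big5_defrag(data)
    out, pos = [], 0
    for m in CSI.finditer(data):
        out.append(ambiguous_pad(data[pos:m.start()].decode("big5")))
        out.append(m.group(0).decode("ascii"))
        pos = m.end()
    out.append(ambiguous_pad(data[pos:].decode("big5")))
    return "".join(out)

print(big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5").hex(" "))
```

Python's `big5` codec does not cover UAO or other extensions, so the last function only handles the standard repertoire.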
Bsdconv Bindings
Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv
PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv
Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ: text, file name, file content, meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
Big5 5C issue (許功蓋)
BIG5:BIG5-5C,BIG5
  | Input | Output
Big5 Literal | "成功" | "成功\"
ASCII/Hex | "\xA6\xA8\xA5\" | "\xA6\xA8\xA5\\"
BIG5-5C,BIG5:BIG5
  | Input | Output
Big5 Literal | "成功\" | "成功"
ASCII/Hex | "\xA6\xA8\xA5\\" | "\xA6\xA8\xA5\"
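The problem is easy to reproduce with Python's built-in `big5` codec. The sketch below assumes that the escaping shown on the slide amounts to doubling a 0x5C trail byte; `trail_is_backslash` and `escape_5c` are hypothetical helpers, not bsdconv functions.

```python
def trail_is_backslash(ch: str) -> bool:
    """True if the character's Big5 trail byte is 0x5C (ASCII backslash)."""
    encoded = ch.encode("big5")
    return len(encoded) == 2 and encoded[1] == 0x5C

def escape_5c(data: bytes) -> bytes:
    """Double every 0x5C that is the trail byte of a Big5 character."""
    out = bytearray()
    i = 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE and i + 1 < len(data):
            out += data[i:i + 2]            # copy the two-byte character
            if data[i + 1] == 0x5C:
                out.append(0x5C)            # protect it from unescaping
            i += 2
        else:
            out.append(data[i])             # single-byte (ASCII) character
            i += 1
    return bytes(out)

raw = "成功".encode("big5")
print(raw, escape_5c(raw))
print([trail_is_backslash(c) for c in "許功蓋"])
```

Any tool that treats the byte 0x5C as an escape character, such as a C string literal or a shell, will swallow the trail byte unless it is protected this way.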
Traditional/Simplified Chinese
NOT a one-to-one mapping: Traditional 乾 幹 干 vs. Simplified 干 干 干
Context dependent: 之後, 夜之后, 入夜之後
Variants: 峰 峯
Project Chvar (1/2)
https://github.com/buganini/chvar
Figure: Two-level grouping in Chvar (canonical groups 签簽 and 籖籤 inside one compatibility group)
  | 签 | 簽 | 籖 | 籤
TW | 簽 | - | 籤 | -
CN | - | 签 | - | 籖
CP950 | 簽 | - | 籤 | -
GB2312 | - | 签 | × | ×
Table: Canonical Group
  | 签 | 簽 | 籖 | 籤
TW | 簽 | - | 簽 | 簽
CN | - | 签 | 签 | 签
CP950 | 簽 | - | 簽 | 簽
GB2312 | - | 签 | 签 | 签
Table: Compatibility Group
Project Chvar (2/2)
https://github.com/buganini/chvar
Normalization: Canonical Equivalence
Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
Fuzzy character matching: Compatibility Equivalence
Bsdconv Phases
Traditional Chinese ⇔ Simplified Chinese: UTF-8:ZHTW:UTF-8 (from:inter:to)
Figure: Conversion with inter-mapping phase
Furthermore, phrase mapping: UTF-8:ZHTW:ZHTW-WORDS:UTF-8 (from:inter:inter:to)
Figure: Conversion with multiple inter-mapping phases
Unicode Casing
IS complicated
Lowercase | Uppercase
a | A
i | I
Table: English
Lowercase | Uppercase
ı | I
i | İ
Table: Turkic
Lowercase | Uppercase
a | A
à | A
Table: French
Lowercase | Uppercase
σ | Σ
ς | Σ
Table: Greek
Default Case Folding
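A few lines of Python show why upper/lower mappings are not reversible. The sketch uses only `str.upper` and `str.lower`, which apply the locale-independent default mappings, so the Turkic and French tailorings from the tables are not reproduced; `roundtrip_ok` is a hypothetical helper.

```python
def roundtrip_ok(text: str) -> bool:
    """True if upper-casing and then lower-casing restores the original."""
    return text.upper().lower() == text

# Default (untailored) uppercase mappings for the letters in the tables.
for lower in ["a", "i", "ı", "à", "σ", "ς", "ß"]:
    print(repr(lower), "->", repr(lower.upper()), "round trip:", roundtrip_ok(lower))
```

Because several lowercase letters share one uppercase form, caseless comparison is usually built on case folding rather than on `upper()` or `lower()`.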
Unicode Normalization Forms (1/2) [UAX #15]
Uses: indexing; identification and security (username, domain name)
Combining sequence | Ç ↔ C + ◌̧
Ordering of combining marks | q + ◌̇ + ◌̣ ↔ q + ◌̣ + ◌̇
Hangul | 가 ↔ ᄀ + ᅡ
Singleton | Ω (U+2126) ↔ Ω (U+03A9)
Table: Canonical Equivalence
Unicode Normalization Forms (2/2) [UAX #15]
Font variants | ℌ → H
Breaking differences | NBSP → SP
Cursive forms | ﻦ → ن
Circled | ① → 1
Width, size, rotated | ｶ → カ, ︷ → {
Superscripts/subscripts | ⁹ → 9
Squared characters | ㍿ → 株 + 式 + 会 + 社
Fractions | ¾ → 3 + ⁄ + 4
Others | ǆ → d + z + ◌̌
Table: Compatibility Equivalence
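The rows of both tables can be checked with Python's `unicodedata` module. This is a minimal sketch; `equivalent` is a hypothetical helper that compares two strings after normalizing them to the same form.

```python
import unicodedata

def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Canonical equivalence: already equal under NFC/NFD.
print(equivalent("\u00c7", "C\u0327"))               # combining sequence
print(equivalent("q\u0307\u0323", "q\u0323\u0307"))  # ordering of marks
print(equivalent("\uac00", "\u1100\u1161"))          # Hangul syllable vs. jamo
print(equivalent("\u2126", "\u03a9"))                # singleton (ohm sign)

# Compatibility equivalence: only equal under NFKC/NFKD.
print(equivalent("\u2460", "1"), equivalent("\u2460", "1", "NFKC"))
print(unicodedata.normalize("NFKC", "\u337f"))       # squared 株式会社
```

Canonical forms are safe for storage and interchange; compatibility forms discard formatting distinctions and are better reserved for matching and search keys.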
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input: aăⅷDžбᾥ
Output: AĂⅧDŽБᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăDž⁹灣湾ド鬒鬒 æß
Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss
  | Composition | Decomposition
Canonical | NFC | NFD
Compatibility | NFKC | NFKD
Table: The four Unicode normalization forms and the transformations
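A comparable fuzzy-matching key can be built from the standard library alone. The sketch follows the Unicode definition of compatibility caseless matching, NFKD(casefold(NFKD(casefold(NFD(x))))); it does not include the ZH-FUZZY-TW or KANA-PHONETIC steps, and `fuzzy_key` is a hypothetical name.

```python
import unicodedata

def fuzzy_key(text: str) -> str:
    """Compatibility caseless key: NFD, casefold, NFKD, casefold, NFKD."""
    step = unicodedata.normalize("NFD", text).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)

for sample in ["¼", "ℌℍ", "ǅ", "⁹", "Straße"]:
    print(repr(sample), "->", repr(fuzzy_key(sample)))
```

Two strings match fuzzily when their keys are equal, for example `fuzzy_key("STRASSE") == fuzzy_key("Straße")`.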
Bsdconv Cascade
Re-encode: UTF-8:ISO-8859-1|BIG5:UTF-8 (from:to | from:to)
Figure: Cascaded conversions
Input: ¥x¥_
Output: 台北
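The same repair can be expressed with Python's codecs: the first conversion recovers the original bytes, the second decodes them with the intended charset. `reinterpret` is a hypothetical helper, and the default codec names are the ones from the slide.

```python
def reinterpret(text: str, wrong: str = "iso-8859-1", right: str = "big5") -> str:
    """Undo a wrong decode: get the raw bytes back, then decode them properly."""
    raw = text.encode(wrong)     # first conversion: text -> original bytes
    return raw.decode(right)     # second conversion: bytes -> intended text

print(reinterpret("¥x¥_"))
```

This only works when the wrong decoding was lossless, which holds for ISO-8859-1 because every byte value maps to a character.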
Bsdconv Codec argument
Other than question mark: UTF-8,ANY#0121:ASCII,ANY#21 (from:to)
Figure: Codec argument
Or more than one character: UTF-8,ANY#013F.0121:ASCII,ANY#21 (from:to)
Figure: Data list separated by dot
Bsdconv Alias
Phase | Alias | Expansion
from | 3F | ANY#013F&ERROR
to | 3F | ANY#3F&ERROR
from | UTF-8 | ASCII,_UTF-8
inter | NFKD | _NFKD,_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter | NFKC | NFKD:_NFC,_NF-HANGUL-COMPOSITION
inter | NFKD-CASEFOLD | NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter | 01 | UNICODE
Bsdconv Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters
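The "Internal data" boxes in the walk-through slides are units made of one type byte followed by a payload. The sketch below rebuilds the units seen in those slides; the helper names and the minimal big-endian payload length are assumptions inferred from examples such as 01 03 B1 and 01 A0, not taken from bsdconv's source.

```python
TYPE_UNICODE = 0x01   # (01) Unicode
TYPE_BYTE = 0x03      # (03) Byte

def unicode_unit(code_point: int) -> bytes:
    """Type byte 01 followed by the code point in minimal big-endian form."""
    length = max(1, (code_point.bit_length() + 7) // 8)
    return bytes([TYPE_UNICODE]) + code_point.to_bytes(length, "big")

def byte_unit(value: int) -> bytes:
    """Type byte 03 followed by one raw, not yet decoded byte."""
    return bytes([TYPE_BYTE, value])

print(unicode_unit(0x03B1).hex(" "))   # α
print(unicode_unit(0x2605).hex(" "))   # ★
print(byte_unit(0xCE).hex(" "))        # first UTF-8 byte of β
```

Tagging every unit with its type is what lets a later phase pick out Unicode data, raw bytes, or control sequences and treat each differently.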
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input: 功夫不好不要艹我
Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input: 功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input: 功夫不好不要艹我
Output: pu nao yao [uh]2
Bsdconv Flags
FREE - memory management
MARK - identifier
Look-through (1/4)
Input (UTF-8 literal): \u03B1 followed by the bytes CE B2
Decoder: ESCAPE
Internal data: [01 03 B1] [03 CE] [03 B2]
Entity | Unicode | UTF-8 Hex
α | U+03B1 | CEB1
β | U+03B2 | CEB2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (2/4)
Internal data (in): [01 03 B1] [03 CE] [03 B2]
Encoder: PASS#MARK&FOR=1,BYTE
Output: [01 03 B1, MARK] CE B2
Look-through (3/4)
Input: [01 03 B1, MARK] CE B2
Decoder: PASS#UNMARK,UTF-8
Internal data: [01 03 B1] [01 03 B2]
Look-through (4/4)
Internal data (in): [01 03 B1] [01 03 B2]
Encoder: UTF-8
Output bytes: CE B1 ("α") CE B2 ("β")
Output (UTF-8 literal): αβ
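The effect of the four steps can be imitated in Python: escapes are resolved first, and everything that is not an escape is looked through and decoded as UTF-8 afterwards. This is a sketch of the idea only; `look_through` is a hypothetical function that handles just four-digit `\uXXXX` escapes and does not model bsdconv's MARK flag.

```python
import re

ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")   # literal backslash-u escapes

def look_through(data: bytes) -> str:
    """Resolve \\uXXXX escapes; decode all other byte runs as UTF-8."""
    out = []
    pos = 0
    for m in ESCAPE.finditer(data):
        out.append(data[pos:m.start()].decode("utf-8"))  # raw bytes in between
        out.append(chr(int(m.group(1), 16)))             # already Unicode
        pos = m.end()
    out.append(data[pos:].decode("utf-8"))
    return "".join(out)

result = look_through(b"\\u03B1\xce\xb2")
print(result, result.encode("utf-8").hex(" "))
```

Splitting at the escapes is safe because a backslash (0x5C) never appears inside a multi-byte UTF-8 sequence.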
Project Chvar (12)httpsgithubcombuganinichvar
签簽 籖籤
Canonical group
Canonical group
Compatibility group
Figure Two level grouping in Chvar
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Project Chvar (22)httpsgithubcombuganinichvar
NormalizationCanonical EquivalenceTransliterationConvertedor Canonical Equivalenceor Compatibility EquivalenceFuzzy character matchingCompatibility Equivalence
签 簽 籖 籤TW 簽 - 籤 -CN - 签 - 籖
CP950 簽 - 籤 -GB2312 - 签 times times
Table Canonical Group
签 簽 籖 籤TW 簽 - 簽 簽CN - 签 签 签
CP950 簽 - 簽 簽GB2312 - 签 签 签
Table Compatibility Group
Bsdconv Phases
Traditional Chinese hArr Simplified Chinese UTF-8 ZHTW UTF-8
from
inter
toFigure Conversion with inter-mapping phase
Bsdconv Phases
Furthermore phrases mapping UTF-8 ZHTW ZHTW-WORDS UTF-8
from
inter
inter
toFigure Conversion with multiple inter-mapping phases
Unicode Casing
IS complicatedLowercase Uppercase
a Ai ITable English
Lowercase Uppercaseı Ii İTable Turkic
Lowercase Uppercasea Aagrave ATable French
Lowercase Uppercaseσ Σς Σ
Table Greek
Default Case Folding
Unicode Normalization Forms (12)UAX15
IndexingIdentification securityUsername Domain name
Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω
Table Canonical Equivalence
Unicode Normalization Forms (22)UAX15
Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1
Width size rotated カ カ︷
Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +
Table Compatibility Equivalence
Normalization for fuzzy matching
UTF-8UPPERUTF-8Input aăⅷDžбᾥ
Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8
Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss
Composition DecompositionCanonical NFC NFD
Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Bsdconv Codec argument
Other than question mark UTF-8 ANY0121 ASCII ANY21
from
toFigure Codec argument
Or more than one character UTF-8 ANY013F0121 ASCII ANY21
from
toFigure Data list separated by dot
Bsdconv Alias
from3FANY013FampERROR
to3FANY3FampERROR
fromUTF-8ASCII_UTF-8
interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER
interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION
interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD
filter01UNICODE
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
(Basic Multilingual Plane)
GB18030
CNS11643
CP950
Latin1Figure Character Sets
Bsdconv Types
(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Look-through (14)
u03B1CEB2Input (UTF-8 literal)
ESCAPE Decoder
01
03
B1
03
CE
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (24)
01
03
B1
03CE
03B2
Internal data
PASSMARKampFOR=1BYTEEncoder
01
03
B1
MARK
CE
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (34)
0103
B1
MARK
CE B2
Internal data
PASSUNMARKUTF-8 Decoder
01
03
B1
01
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (44)
0103
B1
01
03
B2
Internal data
UTF-8
Encoder
CE
B1
rdquoαrdquoCE
B2
rdquoβrdquo
Internal data
αβ
Output (UTF-8 literal)
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
String width measurement
echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2
Chinese charset encoding detectionhttpsgithubcombuganinichiconv
ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001
$COUNT
帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40
UTF-16LE 20 5 2 00
Khmer legacy font converterhttpsgithubcombuganinikhmerconv
IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8
SolutionTwo pass detection
Detect encodingDetect font family (currently not working)(High converage in SBCS)
Algorithm ported from Khmer Converter3
Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]
3httpwwwkhmerosinfoenkhmer-converter
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
Terminal transcodinghttpsgithubcombuganinibug5
IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help
Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained (16)
⋆xC5x1B[1mxE5Input (Big5 literal)
ANSI-CONTROLBYTE Decoder
03
A1
03
B9
03
C5
1B
5B
31
6D
03
E5
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (26)
03
A1 03
B9 03
C5 1B
5B
31
6D
03E5
Internal data
BIG5-DEFRAG Inter-conversion
03
A1
03
B9
03
C5
03
E5
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (36)
03
A1 03
B9 03
C5 03
E5 1B
5B
31
6D
Internal data
BYTEPASSMARKampFOR=1BEncoder
A1
B9
C5
E5
1B
5B
31
6D
MARK
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (46)
A1 B9 C5 E5 1B
5B
31
6D
MARK
Internal data
PASSUNMARKBIG5 Decoder
0126
05
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (56)
0126
05
019A
5A
1B5B
31
6D
Internal data
AMBIGUOUS-PAD Inter-conversion
01
26
05
01
A0
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (6/6)
Internal data (in): 012605 01A0 019A5A 1B5B316D
Encoder: UTF-8,PASS#FOR=1B
Output (UTF-8 literal): ★ 驚\x1B[1m
The character between ★ and 驚 is the NBSP (U+00A0) inserted by AMBIGUOUS-PAD.
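The six steps above can be approximated outside bsdconv. The following is a minimal Python sketch, not bug5's actual implementation: the function names are hypothetical, only "ESC [ ... letter" sequences are recognised, and Python's built-in big5 codec stands in for bsdconv's BIG5 table (so UAO extensions are not covered).

```python
import re
import unicodedata

# Assumption: only CSI-style sequences (ESC [ params letter) need to be moved.
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def defrag_big5(data: bytes) -> bytes:
    """Move escape sequences that split a double-byte character to after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if not 0x81 <= lead <= 0xFE:       # single-byte data, including ESC itself
            out.append(lead)
            i += 1
            continue
        j = i + 1
        held = bytearray()                  # sequences found between lead and trail
        m = CSI.match(data, j)
        while m:
            held += m.group(0)
            j = m.end()
            m = CSI.match(data, j)
        out.append(lead)
        if j < len(data):                   # a trail byte exists: rejoin the pair
            out.append(data[j])
            j += 1
        out += held
        i = j
    return bytes(out)


def pad_ambiguous(text: str) -> str:
    """Append NBSP after every East Asian Ambiguous (class A) character."""
    return "".join(
        ch + "\u00a0" if unicodedata.east_asian_width(ch) == "A" else ch
        for ch in text
    )


def big5_to_unicode(data: bytes) -> str:
    return pad_ambiguous(defrag_big5(data).decode("big5"))


raw = b"\xa1\xb9\xc5\x1b[1m\xe5"            # the input from step 1/6
print(defrag_big5(raw).hex(" "))
print(big5_to_unicode(raw).encode("utf-8").hex(" "))
```

Unlike the bsdconv pipeline, this sketch works on whole buffers and does not carry type tags or MARK flags; it only mirrors the visible effect of the defragment and padding steps.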
Bsdconv Bindings
Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv
PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv
Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
Converts: text, file name, file content, meta tag
Thanks
ESCAPEUTF-8PASSFOR=UNICODEampMARKBYTE|PASSUNMAR
KUTF-8NFCASCIIES
CAPE|
https://github.com/buganini/bsdconv
Project Chvar (2/2)
https://github.com/buganini/chvar
Normalization: Canonical Equivalence
Transliteration: converted, or Canonical Equivalence, or Compatibility Equivalence
Fuzzy character matching: Compatibility Equivalence

          签   簽   籖   籤
TW        簽   -    籤   -
CN        -    签   -    籖
CP950     簽   -    籤   -
GB2312    -    签   ×    ×
Table: Canonical Group

          签   簽   籖   籤
TW        簽   -    簽   簽
CN        -    签   签   签
CP950     簽   -    簽   簽
GB2312    -    签   签   签
Table: Compatibility Group
Bsdconv Phases
Traditional Chinese ⇔ Simplified Chinese: UTF-8:ZHTW:UTF-8
Phases: from : inter : to
Figure: Conversion with inter-mapping phase
Bsdconv Phases
Furthermore, phrase mapping: UTF-8:ZHTW:ZHTW-WORDS:UTF-8
Phases: from : inter : inter : to
Figure: Conversion with multiple inter-mapping phases
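To make the phase structure concrete, here is a small Python sketch that splits a conversion string into cascades, phases, and codecs. It assumes the usual bsdconv notation ("|" between cascades, ":" between phases, "," between codecs, "#" before arguments, "&" between arguments); the function names are hypothetical and nothing here calls bsdconv itself.

```python
def parse_conversion(spec: str):
    """Split a conversion string into cascades -> phases -> (codec, args)."""
    cascades = []
    for cascade in spec.split("|"):
        phases = []
        for phase in cascade.split(":"):
            codecs = []
            for codec in phase.split(","):
                name, _, args = codec.partition("#")
                codecs.append((name, args.split("&") if args else []))
            phases.append(codecs)
        cascades.append(phases)
    return cascades


def phase_roles(phases):
    """Label the first phase 'from', the last 'to', the rest 'inter'.

    Assumes at least two phases, as in every example in this deck.
    """
    return ["from"] + ["inter"] * (len(phases) - 2) + ["to"]


parsed = parse_conversion("UTF-8:ZHTW:ZHTW-WORDS:UTF-8")
for role, codecs in zip(phase_roles(parsed[0]), parsed[0]):
    print(role, codecs)
```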
Unicode Casing
It IS complicated.

Lowercase   Uppercase
a           A
i           I
Table: English

Lowercase   Uppercase
ı           I
i           İ
Table: Turkic

Lowercase   Uppercase
a           A
à           A
Table: French

Lowercase   Uppercase
σ           Σ
ς           Σ
Table: Greek

Default Case Folding
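The tables above can be checked against the default, locale-independent mappings that Python's built-in string methods apply. This is only an illustration of default casing and case folding; it does not reproduce the Turkic or French conventions, which need locale-aware tailoring.

```python
# Python's str methods use the default Unicode case mappings (no locale tailoring).
print("i".upper())                  # default mapping, not the Turkic dotted capital
print("ı".upper())                  # dotless i shares the uppercase of plain i
print(ascii("İ".lower()))           # lowercases to "i" followed by U+0307
print("σ".upper(), "ς".upper())     # both sigmas have the same uppercase
print("ΟΔΟΣ".lower())               # word-final sigma depends on context
print("Straße".casefold() == "STRASSE".casefold())
```

Case folding, rather than lowercasing, is the operation intended for caseless comparison, which is why the last line compares folded strings.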
Unicode Normalization Forms (1/2)
UAX #15
Indexing
Identification, security: username, domain name

Combining sequence            Ç ↔ C + U+0327 (combining cedilla)
Ordering of combining marks   q + U+0307 + U+0323 ↔ q + U+0323 + U+0307
Hangul                        가 ↔ ᄀ + ᅡ
Singleton                     Ω (U+2126, ohm sign) ↔ Ω (U+03A9)
Table: Canonical Equivalence

Unicode Normalization Forms (2/2)
UAX #15

Font variants              ℌ → H
Breaking differences       NBSP → SP
Cursive forms              presentation forms of ن → ن
Circled                    ① → 1
Width, size, rotated       ｶ → カ, ︷ → {
Superscripts/subscripts    ⁹ → 9
Squared characters         ㍿ → 株 + 式 + 会 + 社
Fractions                  ¾ → 3 + ⁄ + 4
Others                     ǆ → d + z + U+030C (combining caron)
Table: Compatibility Equivalence
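The rows of both tables can be reproduced with Python's standard unicodedata module. This is a small sketch for inspection only; the helper name forms is hypothetical.

```python
import unicodedata as ud


def forms(s: str) -> dict:
    """Return the four normalization forms of s as lists of code points."""
    return {
        form: [f"U+{ord(c):04X}" for c in ud.normalize(form, s)]
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


# Ç, 가, ohm sign: canonical equivalence. ℌ, ①, ¾, ㍿: compatibility equivalence.
for sample in ("\u00c7", "\uac00", "\u2126", "\u210c", "\u2460", "\u00be", "\u337f"):
    print(sample, forms(sample))
```

Canonical forms (NFC, NFD) leave compatibility characters such as ℌ untouched; only the K forms fold them into their plain counterparts.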
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input: aăⅷDžбᾥ
Output: AĂⅧDŽБᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăDž⁹灣湾ド(U+2FA0A)鬒 æß
Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss

                Composition   Decomposition
Canonical       NFC           NFD
Compatibility   NFKC          NFKD
Table: The four Unicode normalization forms and the transformations
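A rough stdlib equivalent of the NFKD plus case-folding step can be written as below. It applies the operations in the order the deck's alias list gives for NFKD-CASEFOLD (NFD, CASEFOLD, NFKD, CASEFOLD, NFKD). The function names are hypothetical, and dropping combining marks in fuzzy_key is an extra assumption of this sketch, not something NFKD does by itself; the Chinese and kana steps are not covered.

```python
import unicodedata as ud


def nfkd_casefold(s: str) -> str:
    """Apply NFD, casefold, NFKD, casefold, NFKD in that order."""
    s = ud.normalize("NFD", s)
    s = ud.normalize("NFKD", s.casefold())
    return ud.normalize("NFKD", s.casefold())


def fuzzy_key(s: str) -> str:
    """nfkd_casefold plus removal of combining marks (assumption of this sketch)."""
    return "".join(c for c in nfkd_casefold(s) if not ud.combining(c))


print(fuzzy_key("¼ℌℍă⁹æß"))
```

The second folding pass matters: ℌ has no case mapping of its own, so it only becomes "h" after NFKD has turned it into "H".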
Bsdconv Cascade
Re-encode: UTF-8:ISO-8859-1|BIG5:UTF-8
Phases: from : to | from : to
Figure: Cascaded conversions

Input   Output
¥x¥_    台北
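The same two-stage repair can be sketched with Python's built-in codecs: encode the mis-decoded text back to bytes with the wrong charset, then decode those bytes with the right one. The function name is hypothetical, and Python's big5 codec is used in place of bsdconv's BIG5 table.

```python
def reencode(text: str, wrong: str = "iso-8859-1", right: str = "big5") -> str:
    """Undo a mis-decoding: encode with the wrong codec, decode with the right one."""
    return text.encode(wrong).decode(right)


print(reencode("¥x¥_"))
```

The bytes never change in this process (A5 78 A5 5F); only their interpretation does.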
Bsdconv Codec argument
Other than question mark: UTF-8,ANY#0121:ASCII,ANY#21
Phases: from : to
Figure: Codec argument
Or more than one character: UTF-8,ANY#013F.0121:ASCII,ANY#21
Phases: from : to
Figure: Data list separated by dot
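For comparison, Python's codec machinery allows a similar choice of replacement character through a custom error handler. This is an analogy to the encoder-side ANY#21 above, not bsdconv's mechanism; the handler name "bang" is made up for the example.

```python
import codecs


def replace_with_bang(err):
    """Encode-error handler: emit '!' for every character that cannot be encoded."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    return ("!" * (err.end - err.start), err.end)


codecs.register_error("bang", replace_with_bang)

print("naïve 台北".encode("ascii", "replace"))   # built-in handler uses '?'
print("naïve 台北".encode("ascii", "bang"))      # custom handler uses '!'
```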
Bsdconv Alias
Phase    Alias           Expansion
from     3F              ANY#013F&ERROR
to       3F              ANY#3F&ERROR
from     UTF-8           ASCII,_UTF-8
inter    NFKD            _NFKD,_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter    NFKC            NFKD:_NFC,_NF-HANGUL-COMPOSITION
inter    NFKD-CASEFOLD   NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter   01              UNICODE
Bsdconv Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters
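The type list above explains the hex strings shown in the step-by-step slides: the first byte of each internal item is its type. A small Python sketch for reading such items follows. The function name is hypothetical, and treating the 1B type byte as part of the escape sequence itself is an interpretation of the slides, not a documented rule.

```python
TYPES = {
    0x00: "Bsdconv special character",
    0x01: "Unicode",
    0x02: "CNS11643",
    0x03: "Byte",
    0x04: "Chinese component",
    0x1B: "ANSI control sequence",
}


def describe(item_hex: str):
    """Split an internal item into (type name, payload)."""
    raw = bytes.fromhex(item_hex)
    kind = TYPES.get(raw[0], "unknown")
    if raw[0] == 0x01:                       # Unicode: payload is the code point
        return kind, chr(int.from_bytes(raw[1:], "big"))
    if raw[0] == 0x1B:                       # assumption: type byte doubles as ESC
        return kind, raw
    return kind, raw[1:]


for item in ("012605", "03A1", "1B5B316D"):
    print(item, describe(item))
```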
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input: 功夫不好不要艹我
Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input: 功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input: 功夫不好不要艹我
Output: pu nao yao [uh]2
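The composition step can be pictured as a table lookup over adjacent components. The toy Python sketch below uses only the four pairs from the example above; the table and function name are illustrative assumptions, and a real component table (as in chicomp) would be far larger and would also handle decomposition.

```python
# Toy table built from the example above; not chicomp's data.
COMPOSE = {"功夫": "巭", "不好": "孬", "不要": "嫑", "艹我": "莪"}


def compose(text: str) -> str:
    """Greedily replace known component pairs with the composed character."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in COMPOSE:
            out.append(COMPOSE[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(compose("功夫不好不要艹我"))
```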
Bsdconv Flags
FREE: memory management
MARK: identifier
Look-through (1/4)
Input (UTF-8 literal): \u03B1 followed by the raw bytes CE B2
Decoder: ESCAPE
Internal data: 0103B1 03CE 03B2

Entity   Unicode   UTF-8 Hex
α        U+03B1    CEB1
β        U+03B2    CEB2

Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (2/4)
Internal data (in): 0103B1 03CE 03B2
Encoder: PASS#MARK&FOR=1,BYTE
Internal data (out): 0103B1 (MARK) CE B2
Look-through (3/4)
Internal data (in): 0103B1 (MARK) CE B2
Decoder: PASS#UNMARK,UTF-8
Internal data (out): 0103B1 0103B2
Look-through (4/4)
Internal data (in): 0103B1 0103B2
Encoder: UTF-8
Encoded bytes: CE B1 ("α"), CE B2 ("β")
Output (UTF-8 literal): αβ
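The effect of the four steps, decoding \uXXXX escapes while letting the surrounding UTF-8 bytes pass through untouched, can be imitated in plain Python. This sketch is an approximation with a hypothetical function name: it decodes escapes and raw byte runs separately instead of carrying typed, marked items through a cascade.

```python
import re

ESC = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def look_through(data: bytes) -> str:
    """Decode \\uXXXX escapes; treat everything between them as UTF-8."""
    parts = []
    pos = 0
    for m in ESC.finditer(data):
        parts.append(data[pos:m.start()].decode("utf-8"))   # raw bytes: UTF-8 decoder
        parts.append(chr(int(m.group(1), 16)))              # escape: already Unicode
        pos = m.end()
    parts.append(data[pos:].decode("utf-8"))
    return "".join(parts)


print(look_through(b"\\u03B1\xce\xb2"))
```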
Unicode East Asian Width (1/2)
UAX #11
Categories: Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral
Figure: Venn diagram showing the set relations for the six categories

Unicode East Asian Width (2/2)
UAX #11

Character   Class   Resolved width
ऊ           N       Narrow
A           Na      Narrow
ｶ           H       Narrow
Я           A       Ambiguous
Ａ          F       Wide
カ          W       Wide
咦          W       Wide
Table: Examples for each character class and their resolved widths

Na   Narrow
N    Neutral (usually treated as Narrow)
W    Wide
F    Fullwidth
H    Halfwidth
Table: Width attributes
String width measurement
echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
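The same three counts can be derived from the East Asian Width property in Python's unicodedata module. This is a sketch of the classification, not of bsdconv's WIDTH module: the function name is hypothetical, control characters (such as the newline echo appends) are skipped by assumption, and Neutral is counted as half width.

```python
import unicodedata


def measure(text: str) -> dict:
    """Count full-width, half-width and ambiguous-width characters."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        if unicodedata.category(ch) == "Cc":     # skip control characters
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        else:                                    # Na, H, N
            counts["HALF"] += 1
    return counts


print(measure("42(ˊ_>ˋ) 紅茶\n"))
```

A terminal would then resolve the AMBI count to one or two columns each, depending on whether it runs in an East Asian context.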
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR * $COUNT * 0.01) / $COUNT

帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING   SCORE   COUNT   IERR   Score(s)
UTF-8      19      4       0      4.75
BIG5       8       3       2      -4.0
GBK        4       1       4      -36.0
CCCII      36      9       0      4.0
UTF-16LE   20      5       2      0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
  Encoding without a registered name, bound to fonts
  Stored in CP1252 or UTF-8
Solution
  Two-pass detection
    Detect encoding
    Detect font family (currently not working) (high coverage in SBCS)
  Algorithm ported from Khmer Converter [3]
Khmer Converter
  Mapping
  Reordering: visual order vs. Unicode model
  Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
  UAO: non-standard Big5 extension
  Double color hack: ANSI control sequence in the middle of DBCS
  Ambiguous width characters
  luit/screen cannot help
Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained (16)
⋆xC5x1B[1mxE5Input (Big5 literal)
ANSI-CONTROLBYTE Decoder
03
A1
03
B9
03
C5
1B
5B
31
6D
03
E5
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (26)
03
A1 03
B9 03
C5 1B
5B
31
6D
03E5
Internal data
BIG5-DEFRAG Inter-conversion
03
A1
03
B9
03
C5
03
E5
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (36)
03
A1 03
B9 03
C5 03
E5 1B
5B
31
6D
Internal data
BYTEPASSMARKampFOR=1BEncoder
A1
B9
C5
E5
1B
5B
31
6D
MARK
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (46)
A1 B9 C5 E5 1B
5B
31
6D
MARK
Internal data
PASSUNMARKBIG5 Decoder
0126
05
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (56)
0126
05
019A
5A
1B5B
31
6D
Internal data
AMBIGUOUS-PAD Inter-conversion
01
26
05
01
A0
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (66)
01
26
05
01A0
019A
5A
1B5B
31
6D
Internal data
UTF-8PASSFOR=1BEncoder
⋆ 驚 x1B[1m
Output (UTF-8 literal)
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bsdconv Bindings
PythonRubyGoPerlPHPhttpspypipythonorgpypibsdconvhttpsrubygemsorggemsruby-bsdconvhttpsgithubcombuganinigo-bsdconvhttpsgithubcombuganiniperl-bsdconvhttpsgithubcombuganiniphp-bsdconv
PostgreSQLMySQLhttpsgithubcombuganinipostgres-bsdconvhttpsgithubcombuganinimysql-udf-bsdconv
Irssihttpsgithubcombuganiniirssi-scriptsblobmasterirssi-bsdconvpl
Bsdconv GUIhttpsgithubcombuganinigbsdconv
Alternative to ConvertZTextFile nameFile contentMeta tag
Thanks
ESCAPEUTF-8PASSFOR=UNICODEampMARKBYTE|PASSUNMAR
KUTF-8NFCASCIIES
CAPE|
httpsgithubcombuganinibsdconv
Unicode Casing
IS complicatedLowercase Uppercase
a Ai ITable English
Lowercase Uppercaseı Ii İTable Turkic
Lowercase Uppercasea Aagrave ATable French
Lowercase Uppercaseσ Σς Σ
Table Greek
Default Case Folding
Unicode Normalization Forms (12)UAX15
IndexingIdentification securityUsername Domain name
Combining sequence Ccedil C + Ordering of combining marks q++ q++Hangul 가 ᄀ + ᅡSingleton Ω Ω
Table Canonical Equivalence
Unicode Normalization Forms (22)UAX15
Font variants ℌ HBreaking differences NBSP SPCursive forms ن نCircled ① 1
Width size rotated カ カ︷
Superscriptssubscripts ⁹ 9Squared characters 株 + 式 + 会 + 社Fractions frac34 3 + + 4Others dž d + z +
Table Compatibility Equivalence
Normalization for fuzzy matching
UTF-8UPPERUTF-8Input aăⅷDžбᾥ
Output AĂⅧDŽБᾭUTF-8ZH-FUZZY-TWKANA-PHONETICNFKD-CASEFOLDUTF-8
Input frac14ℌℍăDž⁹灣湾ド195082鬒 aeligszligOutput 1frasl4hhadza9灣灣 do鬒鬒正 aeligss
Composition DecompositionCanonical NFC NFD
Compatibility NFKC NFKDTable The four Unicode normalization forms and the transformations
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Bsdconv Codec argument
Other than question mark UTF-8 ANY0121 ASCII ANY21
from
toFigure Codec argument
Or more than one character UTF-8 ANY013F0121 ASCII ANY21
from
toFigure Data list separated by dot
Bsdconv Alias
from3FANY013FampERROR
to3FANY3FampERROR
fromUTF-8ASCII_UTF-8
interNFKD_NFKD_NF-HANGUL-DECOMPOSITION_NF-ORDER
interNFKCNFKD_NFC_NF-HANGUL-COMPOSITION
interNFKD-CASEFOLDNFDCASEFOLDNFKDCASEFOLDNFKD
filter01UNICODE
Charset amp Encoding
Unicode (32bits addr space)
Unicode up to U+10FFFF
Unicode BMP (up to U+FFFF)
(Basic Multilingual Plane)
GB18030
CNS11643
CP950
Latin1Figure Character Sets
Bsdconv Types
(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Look-through (14)
u03B1CEB2Input (UTF-8 literal)
ESCAPE Decoder
01
03
B1
03
CE
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (24)
01
03
B1
03CE
03B2
Internal data
PASSMARKampFOR=1BYTEEncoder
01
03
B1
MARK
CE
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (34)
0103
B1
MARK
CE B2
Internal data
PASSUNMARKUTF-8 Decoder
01
03
B1
01
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (44)
0103
B1
01
03
B2
Internal data
UTF-8
Encoder
CE
B1
rdquoαrdquoCE
B2
rdquoβrdquo
Internal data
αβ
Output (UTF-8 literal)
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
String width measurement
echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR × $COUNT × 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0
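The general approach (decode with each candidate, reward plausible CJK text, punish decoding errors) can be sketched with Python's own codecs. This is a simplified stand-in, not chiconv's algorithm: the scoring (one point per CJK ideograph, minus an arbitrary penalty of 10 per undecodable byte), the candidate list and the function names are all assumptions made for illustration.

```python
def cjk_count(text: str) -> int:
    # Count characters in the main CJK Unified Ideographs block.
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def guess_encoding(data: bytes, candidates=("utf-8", "big5", "gbk")) -> str:
    best, best_score = None, float("-inf")
    for enc in candidates:
        # Undecodable input becomes U+FFFD, which we count as errors.
        text = data.decode(enc, errors="replace")
        errors = text.count("\ufffd")
        score = cjk_count(text) - 10 * errors
        if score > best_score:     # ties keep the earlier candidate
            best, best_score = enc, score
    return best


if __name__ == "__main__":
    print(guess_encoding("帥呆了".encode("utf-8")))
```

Short inputs can decode cleanly under several encodings, which is why real detectors add language-specific bonuses such as the phrase tables named on the slide.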
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
  Encoding without registered name, bound to fonts
  Stored in CP1252 or UTF-8
Solution
  Two pass detection
    Detect encoding
    Detect font family (currently not working) (high coverage in SBCS)
  Algorithm ported from Khmer Converter [3]
Khmer Converter
  Mapping
  Reordering: visual order vs Unicode model
  Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
  UAO: non-standard Big5 extension
  Double color hack: ANSI control sequence in the middle of DBCS
  Ambiguous width characters
  luit/screen cannot help
Solution (tl;dr)
Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5: UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
Bug5 explained (1/6)
Input (Big5 literal): ★\xC5\x1B[1m\xE5
ANSI-CONTROL,BYTE decoder
Internal data: 03 A1, 03 B9, 03 C5, 1B 5B 31 6D, 03 E5
Entity  Unicode  UTF-8 hex  Big5 hex
★       U+2605   E29885     A1B9
驚      U+9A5A   E9A99A     C5E5
[       U+005B   5B         5B
1       U+0031   31         31
m       U+006D   6D         6D
(NBSP)  U+00A0   C2A0       -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (2/6)
Internal data in: 03 A1, 03 B9, 03 C5, 1B 5B 31 6D, 03 E5
BIG5-DEFRAG inter-conversion
Internal data out: 03 A1, 03 B9, 03 C5, 03 E5, 1B 5B 31 6D
Bug5 explained (3/6)
Internal data in: 03 A1, 03 B9, 03 C5, 03 E5, 1B 5B 31 6D
BYTE,PASS#MARK&FOR=1B encoder
Internal data out: A1 B9 C5 E5, 1B 5B 31 6D (MARK)
Bug5 explained (4/6)
Internal data in: A1 B9 C5 E5, 1B 5B 31 6D (MARK)
PASS#UNMARK,BIG5 decoder
Internal data out: 01 26 05, 01 9A 5A, 1B 5B 31 6D
Bug5 explained (5/6)
Internal data in: 01 26 05, 01 9A 5A, 1B 5B 31 6D
AMBIGUOUS-PAD inter-conversion
Internal data out: 01 26 05, 01 A0, 01 9A 5A, 1B 5B 31 6D
Bug5 explained (6/6)
Internal data in: 01 26 05, 01 A0, 01 9A 5A, 1B 5B 31 6D
UTF-8,PASS#FOR=1B encoder
Output (UTF-8 literal): ★ 驚\x1B[1m (the space after ★ is the NBSP pad)
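The defragmentation step is the heart of this pipeline: an ANSI control sequence wedged between the two bytes of a Big5 character is moved behind that character. The sketch below shows the idea on raw bytes in Python. It is not bug5's or bsdconv's code; the function name `big5_defrag`, the lead-byte range 0x81 to 0xFE and the restriction to CSI sequences (`ESC [ ... final`) are assumptions chosen to keep the example short.

```python
import re

# CSI sequence: ESC [ parameters final-byte, e.g. ESC [ 1 m
CSI = re.compile(rb"\x1b\[[0-9;]*[@-~]")


def big5_defrag(data: bytes) -> bytes:
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if 0x81 <= b <= 0xFE:                # possible Big5 lead byte
            j, pending = i + 1, bytearray()
            while True:                      # collect sequences after the lead byte
                m = CSI.match(data, j)
                if not m:
                    break
                pending += m.group(0)
                j = m.end()
            if j < n:                        # rejoin lead and trail, then the sequences
                out += bytes((b, data[j]))
                out += pending
                i = j + 1
            else:                            # truncated input: keep what we have
                out.append(b)
                out += pending
                i = j
        else:                                # ASCII and control bytes pass through
            out.append(b)
            i += 1
    return bytes(out)


if __name__ == "__main__":
    fixed = big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")
    print(fixed.hex(" "))
```

After this step the byte stream is ordinary Big5 with intact double-byte characters, so any Big5 decoder can take over.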
Bsdconv Bindings
Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv
PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv
Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
Text, File name, File content, Meta tag
Thanks
ESCAPE UTF-8 PASS#FOR=UNICODE&MARK BYTE | PASS#UNMARK UTF-8 NFC ASCII ESCAPE |
https://github.com/buganini/bsdconv
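Unicode Normalization Forms (1/2)
UAX #15
Indexing
Identification, security: username, domain name
Combining sequence: Ç (U+00C7) = C + U+0327
Ordering of combining marks: q + U+0307 + U+0323 = q + U+0323 + U+0307
Hangul: 가 (U+AC00) = ᄀ (U+1100) + ᅡ (U+1161)
Singleton: Ω (U+2126) = Ω (U+03A9)
Table: Canonical Equivalence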
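Each row of the canonical equivalence table can be checked with Python's `unicodedata.normalize`: two strings are canonically equivalent when their NFD forms are identical. The helper name `canonically_equal` is invented for this sketch.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    # Canonical equivalence: compare the fully decomposed, reordered forms.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    rows = {
        "combining sequence": ("\u00c7", "C\u0327"),
        "mark ordering": ("q\u0307\u0323", "q\u0323\u0307"),
        "hangul": ("\uac00", "\u1100\u1161"),
        "singleton": ("\u2126", "\u03a9"),
    }
    for name, (left, right) in rows.items():
        print(name, canonically_equal(left, right))
```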
Unicode Normalization Forms (2/2)
UAX #15
Font variants: ℌ → H
Breaking differences: NBSP → SP
Cursive forms: ن → ن
Circled: ① → 1
Width, size, rotated: ｶ → カ, ︷ → {
Superscripts/subscripts: ⁹ → 9
Squared characters: ㍿ → 株 + 式 + 会 + 社
Fractions: ¾ → 3 + ⁄ + 4
Others: ǆ → d + z + U+030C
Table: Compatibility Equivalence
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input: aăⅷǅбᾥ
Output: AĂⅧǄБᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăǅ⁹灣湾ド[U+2FA0A]鬒 æß
Output: 1⁄4hhadza9灣灣 do鬒鬒正 æss
               Composition  Decomposition
Canonical      NFC          NFD
Compatibility  NFKC         NFKD
Table: The four Unicode normalization forms and the transformations
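The four forms and a case-insensitive compatibility key are available in Python's standard library, which makes it easy to try the non-Chinese parts of the examples above. This sketch does not reproduce bsdconv's ZH-FUZZY-TW or KANA-PHONETIC steps; the chain NFD, casefold, NFKD, casefold, NFKD follows Unicode's definition of compatibility caseless matching, and the function name `nfkd_casefold` is just a label chosen here.

```python
import unicodedata


def nfkd_casefold(text: str) -> str:
    """Key for compatibility caseless matching."""
    s = unicodedata.normalize("NFD", text)
    s = s.casefold()
    s = unicodedata.normalize("NFKD", s)
    s = s.casefold()
    return unicodedata.normalize("NFKD", s)


if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        # Show how each form treats a precomposed letter and a circled digit.
        out = unicodedata.normalize(form, "\u00c7\u2460")
        print(form, [f"U+{ord(c):04X}" for c in out])
    print(nfkd_casefold("\u210c\u210d\u2079 \u00e6\u00df"))
```

Two strings match "fuzzily" under this scheme when their keys are equal, so the key is what should be stored in an index.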
Bsdconv Codec argument
Other than question mark: UTF-8,ANY#0121:ASCII,ANY#21
from: UTF-8,ANY#0121
to: ASCII,ANY#21
Figure: Codec argument
Or more than one character: UTF-8,ANY#013F.0121:ASCII,ANY#21
from: UTF-8,ANY#013F.0121
to: ASCII,ANY#21
Figure: Data list separated by dot
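Python's codec machinery has a comparable hook: a custom error handler decides what is emitted for characters the target encoding cannot represent. The sketch below is an analogy to the ANY fallback on the encoder side, not a bsdconv API; `make_replacer` and the handler names are invented for the example.

```python
import codecs


def make_replacer(name: str, replacement: str) -> str:
    """Register an encode error handler that substitutes `replacement`."""
    def handler(err):
        if not isinstance(err, UnicodeEncodeError):
            raise err
        # One replacement per unencodable character, then resume after them.
        count = err.end - err.start
        return replacement * count, err.end

    codecs.register_error(name, handler)
    return name


if __name__ == "__main__":
    bang = make_replacer("bang", "!")        # like a single-character fallback
    qbang = make_replacer("qbang", "?!")     # like a multi-character data list
    print("a測試b".encode("ascii", errors=bang))
    print("a測試b".encode("ascii", errors=qbang))
```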
Bsdconv Alias
from: 3F → ANY#013F&ERROR
to: 3F → ANY#3F&ERROR
from: UTF-8 → ASCII,_UTF-8
inter: NFKD → _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter: NFKC → NFKD:_NFC:_NF-HANGUL-COMPOSITION
inter: NFKD-CASEFOLD → NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter: 01 → UNICODE
Bsdconv Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters
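The walkthrough slides in this deck print internal data as a type byte followed by a payload, for example 01 03 B1 for U+03B1 and 03 CE for the raw byte CE. A small Python sketch of that layout is shown below. It is inferred from those examples only: the function names are hypothetical, and the rule that leading zero bytes of a code point are dropped (with at least one payload byte kept) is an assumption.

```python
TYPE_UNICODE = 0x01
TYPE_BYTE = 0x03


def unicode_item(code_point: int) -> bytes:
    # Type byte 01, then the code point big-endian without leading zeros.
    length = max(1, (code_point.bit_length() + 7) // 8)
    return bytes([TYPE_UNICODE]) + code_point.to_bytes(length, "big")


def byte_item(value: int) -> bytes:
    # Type byte 03, then the raw byte itself.
    return bytes([TYPE_BYTE, value])


if __name__ == "__main__":
    for item in (unicode_item(0x03B1), unicode_item(0x2605),
                 unicode_item(0x00A0), byte_item(0xCE)):
        print(item.hex(" ").upper())
```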
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Look-through (14)
u03B1CEB2Input (UTF-8 literal)
ESCAPE Decoder
01
03
B1
03
CE
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (24)
01
03
B1
03CE
03B2
Internal data
PASSMARKampFOR=1BYTEEncoder
01
03
B1
MARK
CE
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (34)
0103
B1
MARK
CE B2
Internal data
PASSUNMARKUTF-8 Decoder
01
03
B1
01
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (44)
0103
B1
01
03
B2
Internal data
UTF-8
Encoder
CE
B1
rdquoαrdquoCE
B2
rdquoβrdquo
Internal data
αβ
Output (UTF-8 literal)
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Unicode East Asian Width (1/2)
UAX #11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure: Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (2/2)
UAX #11
Resolved width  Character  Class
Narrow          ऊ          N
Narrow          A          Na
Narrow          ｶ          H
Ambiguous       Я          A
Wide            Ａ         F
Wide            カ         W
Wide            咦         W
Table: Examples for Each Character Class and Their Resolved Widths
Na  Narrow
N   Neutral (usually treated as Narrow)
W   Wide
F   Fullwidth
H   Halfwidth
Table: Width attributes
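The width class of a character can be looked up with Python's standard unicodedata module, which exposes the UAX #11 property. The helper name width_classes is made up for this sketch, and the sample characters are one example per class.

```python
import unicodedata

def width_classes(text: str) -> list:
    """Return the East Asian Width class (N, Na, H, A, F, W) of each character."""
    return [unicodedata.east_asian_width(ch) for ch in text]

# Devanagari letter, ASCII letter, halfwidth katakana, Cyrillic letter,
# fullwidth Latin letter, katakana, CJK ideograph.
samples = "\u090a" + "A" + "\uff76" + "\u042f" + "\uff21" + "\u30ab" + "\u54a6"
for ch, cls in zip(samples, width_classes(samples)):
    print(f"U+{ord(ch):04X} {cls}")
```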
String width measurement
echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
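A comparable count can be sketched with Python's unicodedata module. How bsdconv's WIDTH codec groups the classes is an assumption here: W and F are counted as FULL, A as AMBI, everything else as HALF, and control characters such as the trailing newline from echo are skipped. The function name measure is made up.

```python
import unicodedata

def measure(text: str) -> dict:
    """Count characters by resolved width class (assumed grouping, see above)."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        if unicodedata.category(ch) == "Cc":
            continue  # control characters take no column
        cls = unicodedata.east_asian_width(ch)
        if cls in ("W", "F"):
            counts["FULL"] += 1
        elif cls == "A":
            counts["AMBI"] += 1
        else:  # Na, N, H
            counts["HALF"] += 1
    return counts

print(measure("42(\u02ca_>\u02cb) \u7d05\u8336\n"))
```

For the string on the slide this grouping gives the same three numbers as shown there. A terminal would then display the text in 2 x FULL + HALF columns, plus one or two columns per AMBI character depending on its configuration.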
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
Encoding without registered name, bound to fonts
Stored in CP1252 or UTF-8
Solution
Two-pass detection
Detect encoding
Detect font family (currently not working) (high coverage in SBCS)
Algorithm ported from Khmer Converter [3]
Khmer Converter
Mapping
Reordering
Visual order vs. Unicode model
Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
UAO: non-standard Big5 extension
Double color hack (ANSI control sequence in the middle of a DBCS character)
Ambiguous-width characters
luit/screen cannot help
Solution (tl;dr)
Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5: UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained (1/6)
Input (Big5 literal): ★\xC5\x1B[1m\xE5
ANSI-CONTROL,BYTE Decoder
Internal data: 03 A1, 03 B9, 03 C5, 1B 5B 31 6D, 03 E5
Bug5 explained (2/6)
BIG5-DEFRAG Inter-conversion
Internal data: 03 A1, 03 B9, 03 C5, 03 E5, 1B 5B 31 6D
Bug5 explained (3/6)
BYTE,PASS#MARK&FOR=1B Encoder
Internal data: A1 B9 C5 E5, 1B 5B 31 6D (MARK)
Bug5 explained (4/6)
PASS#UNMARK,BIG5 Decoder
Internal data: 01 26 05, 01 9A 5A, 1B 5B 31 6D
Bug5 explained (5/6)
AMBIGUOUS-PAD Inter-conversion
Internal data: 01 26 05, 01 A0, 01 9A 5A, 1B 5B 31 6D
Bug5 explained (6/6)
UTF-8,PASS#FOR=1B Encoder
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
Entity  Unicode  UTF-8 Hex  Big5 Hex
★       U+2605   E29885     A1B9
驚      U+9A5A   E9A99A     C5E5
[       U+005B   5B         5B
1       U+0031   31         31
m       U+006D   6D         6D
(NBSP)  U+00A0   C2A0       -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
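The defragmentation step is the heart of the pipeline: a control sequence wedged between the two bytes of a double-byte character is moved behind that character, so the Big5 decoder sees an intact pair. Below is a rough stand-alone sketch in Python. It is not the BIG5-DEFRAG codec itself. The lead-byte range, the CSI pattern, and the name big5_defrag are simplifying assumptions.

```python
import re

# CSI sequence: ESC [ parameters, optional intermediates, one final byte.
CSI = re.compile(rb"\x1b\[[0-9;?]*[ -/]*[@-~]")

def is_lead(byte: int) -> bool:
    """Assumed Big5 lead-byte range; real variants differ slightly."""
    return 0x81 <= byte <= 0xFE

def big5_defrag(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        byte = data[i]
        if is_lead(byte):
            # Hold back control sequences found between lead and trail byte.
            j, held = i + 1, bytearray()
            m = CSI.match(data, j)
            while m:
                held += m.group(0)
                j = m.end()
                m = CSI.match(data, j)
            if j < len(data):
                out += bytes([byte, data[j]]) + held  # pair first, then controls
                i = j + 1
            else:
                out += bytes([byte]) + held  # truncated input: keep as-is
                i = j
            continue
        m = CSI.match(data, i)
        if m:
            out += m.group(0)
            i = m.end()
        else:
            out.append(byte)
            i += 1
    return bytes(out)

fixed = big5_defrag(b"\xa1\xb9\xc5\x1b[1m\xe5")
print(fixed.hex(" ").upper())  # A1 B9 C5 E5 1B 5B 31 6D
```

After this step the bytes decode cleanly with an ordinary Big5 decoder, for example fixed.decode("big5") in Python. The ambiguous-width padding of step 5 is a separate concern and is not covered here.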
Bsdconv Bindings
Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv
PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv
Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
Text
File name
File content
Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
Bsdconv Codec argument
Other than question mark: UTF-8,ANY#0121:ASCII,ANY#21
(from: UTF-8,ANY#0121; to: ASCII,ANY#21)
Figure: Codec argument
Or more than one character: UTF-8,ANY#013F.0121:ASCII,ANY#21
(from: UTF-8,ANY#013F.0121; to: ASCII,ANY#21)
Figure: Data list separated by dot
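For comparison, Python's codecs module offers a similar knob on the encoding side: a custom error handler decides what is written for characters the target encoding cannot represent. This is an analogy to the ANY argument, not bsdconv's mechanism. The handler name "bang" and the helper make_replacer are made up for the sketch.

```python
import codecs

def make_replacer(replacement: str):
    """Build an error handler that writes `replacement` for each bad character."""
    def handler(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        # One replacement per unencodable character, then resume after them.
        return (replacement * (exc.end - exc.start), exc.end)
    return handler

# "!" instead of the usual "?", like the first example above.
codecs.register_error("bang", make_replacer("!"))
print("a\u03b2c".encode("ascii", errors="bang"))  # b'a!c'
```

Passing a longer string such as "?!" to make_replacer corresponds to the second example, where more than one character is emitted per failure.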
Bsdconv Alias
from/3F: ANY#013F&ERROR
to/3F: ANY#3F&ERROR
from/UTF-8: ASCII,_UTF-8
inter/NFKD: _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter/NFKC: NFKD:_NFC:_NF-HANGUL-COMPOSITION
inter/NFKD-CASEFOLD: NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter/01: UNICODE
Bsdconv Types
(01) Unicode(02) CNS11643(03) Byte(04) Chinese components(1B) ANSI control sequences(00) Bsdconv special characters
Encoding CNS11643 (全字庫) issuehttpwwwcns11643govtw
Only used by Taiwan governmentNOT a subset of UnicodeNot just an charsetencoding
FontPronunciation ㄇㄥ ˊ meacuteng
Radical 艸Component 艹日月
StrokeTraSim mapping 萌蕄
Table Examples for some information provided by 全字庫 for「萌」
Chinese components compositionhttpsgithubcombuganinichicomp
UTF-8ZH-DECOMPZH-COMPUTF-8Input 功夫不好不要艹我
Output 巭孬嫑莪UTF-8ZH-DECOMPZH-COMPCHEWINGUTF-8
Input 功夫不好不要艹我Output ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8ZH-DECOMPZH-COMPCHEWINGHAN-PINYINUTF-8
Input 功夫不好不要艹我Output pu nao yao [uh]2
Bsdconv Flags
FREE - memory managementMARK - identifier
Bsdconv Cascade
Re-encode UTF-8 ISO-8859-1 | BIG5 UTF-8
from
to
from
toFigure Cascaded conversions
Input Outputyenxyen_ 台北
Look-through (14)
u03B1CEB2Input (UTF-8 literal)
ESCAPE Decoder
01
03
B1
03
CE
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (24)
01
03
B1
03CE
03B2
Internal data
PASSMARKampFOR=1BYTEEncoder
01
03
B1
MARK
CE
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (34)
0103
B1
MARK
CE B2
Internal data
PASSUNMARKUTF-8 Decoder
01
03
B1
01
03
B2
Internal data
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Look-through (44)
0103
B1
01
03
B2
Internal data
UTF-8
Encoder
CE
B1
rdquoαrdquoCE
B2
rdquoβrdquo
Internal data
αβ
Output (UTF-8 literal)
Entity Unicode UTF-8 Hexα U+03B1 CEB1β U+03B2 CEB2
Figure ESCAPEPASSMARKampFOR=1BYTE|PASSUNMARKUTF-8UTF-8
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
String width measurement
echo 42(ˊ_gtˋ) 紅茶 | bsdconv UTF-8WIDTHNULLFULL 2HALF 7AMBI 2
Chinese charset encoding detectionhttpsgithubcombuganinichiconv
ENCODINGSCOREWITH=CJKCOUNTZH-BONUSZHTWZH-BONUS-PHRASENULLScore(s) = $SCOREminus$IERRlowast$COUNTlowast001
$COUNT
帥呆了 rArr UTF-8SCOREWITH=CJKhelliphellipENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 475BIG5 8 3 2 -40GBK 4 1 4 -360CCCII 36 9 0 40
UTF-16LE 20 5 2 00
Khmer legacy font converterhttpsgithubcombuganinikhmerconv
IssuesEncoding without registerd name bound on fontsStored in CP1252 or UTF-8
SolutionTwo pass detection
Detect encodingDetect font family (currently not working)(High converage in SBCS)
Algorithm ported from Khmer Converter3
Khmer ConverterMappingReorderingVisual order vs Unicode modelUnicode Model baseCharacter [+ [RobatShifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]
3httpwwwkhmerosinfoenkhmer-converter
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
Terminal transcodinghttpsgithubcombuganinibug5
IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help
Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Entity   Unicode   UTF-8 Hex   Big5 Hex
★        U+2605    E29885      A1B9
驚       U+9A5A    E9A99A      C5E5
[        U+005B    5B          5B
1        U+0031    31          31
m        U+006D    6D          6D
(NBSP)   U+00A0    C2A0        -
Bug5 explained (1/6)
Input (Big5 literal): ★\xC5\x1B[1m\xE5
Module: ANSI-CONTROL,BYTE (Decoder)
Internal data out: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Bug5 explained (2/6)
Internal data in: [03 A1] [03 B9] [03 C5] [1B 5B 31 6D] [03 E5]
Module: BIG5-DEFRAG (Inter-conversion)
Internal data out: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Bug5 explained (3/6)
Internal data in: [03 A1] [03 B9] [03 C5] [03 E5] [1B 5B 31 6D]
Module: BYTE,PASS#MARK&FOR=1B (Encoder)
Internal data out: A1 B9 C5 E5 [1B 5B 31 6D] (MARK)
Bug5 explained (4/6)
Internal data in: A1 B9 C5 E5 [1B 5B 31 6D] (MARK)
Module: PASS#UNMARK,BIG5 (Decoder)
Internal data out: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (5/6)
Internal data in: [01 26 05] [01 9A 5A] [1B 5B 31 6D]
Module: AMBIGUOUS-PAD (Inter-conversion)
Internal data out: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Bug5 explained (6/6)
Internal data in: [01 26 05] [01 A0] [01 9A 5A] [1B 5B 31 6D]
Module: UTF-8,PASS#FOR=1B (Encoder)
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
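The six steps above can be approximated in plain Python. This is a rough sketch under stated assumptions, not bug5's or bsdconv's implementation: it only recognizes CSI sequences of the form ESC [ ... letter, treats any byte in 0x81..0xFE as a Big5 lead byte, uses Python's standard big5 codec, and the function names defrag, pad_ambiguous and big5_to_text are hypothetical.

```python
import re
import unicodedata

CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")  # e.g. ESC [ 1 m


def is_lead(byte):
    """Assumed Big5 lead-byte range."""
    return 0x81 <= byte <= 0xFE


def defrag(data):
    """Move an escape sequence that splits a double-byte character behind it."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if is_lead(byte):
            m = CSI.match(data, i + 1)
            if m and m.end() < len(data):
                # lead, escape, trail  ->  lead, trail, escape
                out += bytes([byte, data[m.end()]]) + m.group()
                i = m.end() + 1
                continue
            if m is None and i + 1 < len(data):
                out += data[i:i + 2]  # ordinary double-byte character
                i += 2
                continue
        out.append(byte)
        i += 1
    return bytes(out)


def pad_ambiguous(text):
    """Append NBSP after every ambiguous-width character."""
    return "".join(
        ch + "\u00a0" if unicodedata.east_asian_width(ch) == "A" else ch
        for ch in text
    )


def big5_to_text(data):
    """Decode Big5 text while passing escape sequences through untouched."""
    fixed = defrag(data)
    out, pos = [], 0
    for m in CSI.finditer(fixed):
        out.append(pad_ambiguous(fixed[pos:m.start()].decode("big5", errors="replace")))
        out.append(m.group().decode("ascii"))
        pos = m.end()
    out.append(pad_ambiguous(fixed[pos:].decode("big5", errors="replace")))
    return "".join(out)


if __name__ == "__main__":
    print(repr(big5_to_text(b"\xa1\xb9\xc5\x1b[1m\xe5")))
```

The padding step matters because a terminal in a UTF-8 locale may draw ★ one cell wide, while the Big5 application that produced the screen assumed two cells.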
Bsdconv Bindings
Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv
PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv
Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ:
- Text
- File name
- File content
- Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input: 功夫不好不要艹我
Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input: 功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input: 功夫不好不要艹我
Output: pu nao yao [uh]2
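The composition step can be illustrated with a toy pair table. This sketch is only an illustration: the COMPOSE table is hand-written for the slide's example and the greedy pairwise matching is an assumption, whereas the real ZH-DECOMP and ZH-COMP modules work from full component tables.

```python
# Hypothetical composition table covering only the slide's example.
COMPOSE = {
    ("功", "夫"): "巭",
    ("不", "好"): "孬",
    ("不", "要"): "嫑",
    ("艹", "我"): "莪",
}


def compose(text):
    """Greedily merge adjacent component pairs found in COMPOSE."""
    out = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in COMPOSE:
            out.append(COMPOSE[pair])
            i += 2
        else:
            out.append(text[i])  # no known composition, keep the character
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(compose("功夫不好不要艹我"))
```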
Bsdconv Flags
FREE - memory management
MARK - identifier
Bsdconv Cascade
Re-encode: UTF-8:ISO-8859-1|BIG5:UTF-8
Figure: Cascaded conversions (from UTF-8 to ISO-8859-1, then from BIG5 to UTF-8)
Input    Output
¥x¥_     台北
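The same two-stage re-encoding can be written with Python's built-in codecs. A minimal sketch: the function name reinterpret and its parameter names are hypothetical, and Python's iso-8859-1 and big5 codecs stand in for the bsdconv modules.

```python
def reinterpret(text, wrong="iso-8859-1", actual="big5"):
    """Undo a mis-decoding: recover the original bytes, then decode them properly."""
    raw = text.encode(wrong)   # stage 1: UTF-8 text back to the original bytes
    return raw.decode(actual)  # stage 2: decode those bytes as Big5


if __name__ == "__main__":
    # "¥x¥_" is what the Big5 bytes A5 78 A5 5F look like when read as Latin-1.
    print(reinterpret("\u00a5x\u00a5_"))
```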
Look-through
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Entity   Unicode   UTF-8 Hex
α        U+03B1    CEB1
β        U+03B2    CEB2
Look-through (1/4)
Input (UTF-8 literal): \u03B1 followed by the raw bytes CE B2
Module: ESCAPE (Decoder)
Internal data out: [01 03 B1] [03 CE] [03 B2]
Look-through (2/4)
Internal data in: [01 03 B1] [03 CE] [03 B2]
Module: PASS#MARK&FOR=1,BYTE (Encoder)
Internal data out: [01 03 B1] (MARK) CE B2
Look-through (3/4)
Internal data in: [01 03 B1] (MARK) CE B2
Module: PASS#UNMARK,UTF-8 (Decoder)
Internal data out: [01 03 B1] [01 03 B2]
Look-through (4/4)
Internal data in: [01 03 B1] [01 03 B2]
Module: UTF-8 (Encoder)
Internal data out: CE B1 ("α") CE B2 ("β")
Output (UTF-8 literal): αβ
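The effect of the look-through cascade, decoding \uXXXX escapes while letting the remaining UTF-8 bytes pass to the next stage, can be sketched as follows. This is an approximation in plain Python rather than bsdconv's mechanism: the name look_through is hypothetical, only four-digit \u escapes are handled, and surrogate pairs are not combined.

```python
import re

ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})")


def look_through(data):
    """Decode \\uXXXX escapes; treat every other byte run as UTF-8."""
    out = []
    pos = 0
    for m in ESCAPE.finditer(data):
        # Bytes between escapes were "passed through" by the first stage
        # and are decoded as UTF-8 here, like the second stage does.
        out.append(data[pos:m.start()].decode("utf-8"))
        # The escape itself is already a code point and is kept as is.
        out.append(chr(int(m.group(1).decode("ascii"), 16)))
        pos = m.end()
    out.append(data[pos:].decode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(look_through(b"\\u03B1\xce\xb2"))
```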
String width measurement
echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
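The counts above can be reproduced with Python's unicodedata module. A rough sketch: width_stats is a hypothetical helper, it classifies every character (including control characters such as a trailing newline, which it would count as HALF), and it does not claim to match the WIDTH module's exact rules.

```python
import unicodedata


def width_stats(text):
    """Count characters by resolved East Asian Width."""
    stats = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        cls = unicodedata.east_asian_width(ch)
        if cls in ("F", "W"):
            stats["FULL"] += 1
        elif cls == "A":
            stats["AMBI"] += 1
        else:  # N, Na and H are all treated as half width
            stats["HALF"] += 1
    return stats


if __name__ == "__main__":
    print(width_stats("42(ˊ_>ˋ) 紅茶"))
```

Here ˊ and ˋ are the two ambiguous characters, 紅 and 茶 are the two full-width ones, and the remaining seven (including the space) are half width.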
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR × $COUNT × 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING   SCORE   COUNT   IERR   Score(s)
UTF-8      19      4       0      4.75
BIG5       8       3       2      -4.0
GBK        4       1       4      -36.0
CCCII      36      9       0      4.0
UTF-16LE   20      5       2      0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues:
- Encodings without a registered name, bound to fonts
- Stored in CP1252 or UTF-8
Solution: two-pass detection
- Detect encoding
- Detect font family (currently not working) (high coverage in SBCS)
Algorithm ported from Khmer Converter (3)
Khmer Converter:
- Mapping
- Reordering: visual order vs. Unicode model
- Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]
(3) http://www.khmeros.info/en/khmer-converter
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
Terminal transcodinghttpsgithubcombuganinibug5
IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help
Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained (16)
⋆xC5x1B[1mxE5Input (Big5 literal)
ANSI-CONTROLBYTE Decoder
03
A1
03
B9
03
C5
1B
5B
31
6D
03
E5
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (26)
03
A1 03
B9 03
C5 1B
5B
31
6D
03E5
Internal data
BIG5-DEFRAG Inter-conversion
03
A1
03
B9
03
C5
03
E5
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (3/6)
Internal data (in):  03 A1 03 B9 03 C5 03 E5 1B 5B 31 6D
Stage: BYTE,PASS#MARK&FOR=1B (Encoder)
Internal data (out): A1 B9 C5 E5 1B 5B 31 6D MARK
Bug5 explained (4/6)
Internal data (in):  A1 B9 C5 E5 1B 5B 31 6D MARK
Stage: PASS#UNMARK,BIG5 (Decoder)
Internal data (out): 01 26 05 01 9A 5A 1B 5B 31 6D
Bug5 explained (5/6)
Internal data (in):  01 26 05 01 9A 5A 1B 5B 31 6D
Stage: AMBIGUOUS-PAD (Inter-conversion)
Internal data (out): 01 26 05 01 A0 01 9A 5A 1B 5B 31 6D
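The effect of the padding step can be approximated in a few lines of Python. This is a sketch under assumptions: the slide only shows an NBSP inserted after ★, so padding every Ambiguous character is a guess at the rule, and `ambiguous_pad` is a hypothetical name, not a bsdconv API.

```python
import unicodedata

NBSP = "\u00a0"


def ambiguous_pad(text):
    """Append an NBSP after every East Asian Ambiguous character."""
    out = []
    for ch in text:
        out.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            out.append(NBSP)   # narrow rendering plus NBSP fills two columns
    return "".join(out)


if __name__ == "__main__":
    padded = ambiguous_pad("\u2605\u9a5a")   # the star and 驚 from the slides
    print(padded.encode("utf-8").hex(" "))
```

With the padding in place, a terminal that draws ★ one column wide still advances two columns in total, which keeps the Big5-style layout aligned.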
Bug5 explained (6/6)
Internal data (in): 01 26 05 01 A0 01 9A 5A 1B 5B 31 6D
Stage: UTF-8,PASS#FOR=1B (Encoder)
Output (UTF-8 literal): ★ 驚\x1B[1m (the space after ★ is the NBSP, U+00A0)
Bsdconv Bindings
Python      https://pypi.python.org/pypi/bsdconv
Ruby        https://rubygems.org/gems/ruby-bsdconv
Go          https://github.com/buganini/go-bsdconv
Perl        https://github.com/buganini/perl-bsdconv
PHP         https://github.com/buganini/php-bsdconv
PostgreSQL  https://github.com/buganini/postgres-bsdconv
MySQL       https://github.com/buganini/mysql-udf-bsdconv
Irssi       https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
  Text
  File name
  File content
  Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
Look-through (2/4)
Internal data (in):  01 03 B1 03 CE 03 B2
Stage: PASS#MARK&FOR=1,BYTE (Encoder)
Internal data (out): 01 03 B1 MARK CE B2
Entity  Unicode  UTF-8 Hex
α       U+03B1   CEB1
β       U+03B2   CEB2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (3/4)
Internal data (in):  01 03 B1 MARK CE B2
Stage: PASS#UNMARK,UTF-8 (Decoder)
Internal data (out): 01 03 B1 01 03 B2
Look-through (4/4)
Internal data (in):  01 03 B1 01 03 B2
Stage: UTF-8 (Encoder)
Internal data (out): CE B1 ("α") CE B2 ("β")
Output (UTF-8 literal): αβ
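The idea behind these look-through slides, as far as they show it, is that data already decoded to Unicode is passed through untouched while the remaining raw bytes are decoded as UTF-8. The Python sketch below imitates that behaviour on a list of tagged items. The tags `"unicode"` and `"byte"` and the function `look_through` are invented for the example and do not correspond to bsdconv's internal format or API.

```python
def look_through(items):
    """Pass code points through and decode runs of raw bytes as UTF-8.

    items is a list of ("unicode", code_point) or ("byte", value) pairs.
    """
    out = []
    pending = bytearray()          # raw bytes waiting to be decoded
    for kind, value in items:
        if kind == "unicode":
            if pending:            # flush bytes collected so far
                out.append(pending.decode("utf-8"))
                pending.clear()
            out.append(chr(value)) # already Unicode: pass through
        else:
            pending.append(value)
    if pending:
        out.append(pending.decode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    # α arrives as a code point, β as its two UTF-8 bytes.
    print(look_through([("unicode", 0x03B1), ("byte", 0xCE), ("byte", 0xB2)]))
```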
String width measurement
echo '42(ˊ_>ˋ) 紅茶' | bsdconv UTF-8:WIDTH:NULL
FULL 2
HALF 7
AMBI 2
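The counts above can be reproduced approximately with Python's `unicodedata` module. This is a sketch, not bsdconv's WIDTH codec: the function name `width_counts` is hypothetical, and leaving Neutral characters (such as the newline that `echo` appends) uncounted is an assumption chosen because it agrees with the output shown.

```python
import unicodedata


def width_counts(text):
    """Count characters by East Asian Width group."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            counts["FULL"] += 1
        elif eaw in ("Na", "H"):
            counts["HALF"] += 1
        elif eaw == "A":
            counts["AMBI"] += 1
        # Neutral (N) characters are not counted in this sketch.
    return counts


if __name__ == "__main__":
    print(width_counts("42(ˊ_>ˋ) 紅茶\n"))
```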
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE - $IERR * $COUNT * 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
  Encodings have no registered name and are bound to fonts
  Stored in CP1252 or UTF-8
Solution
  Two-pass detection
    Detect encoding
    Detect font family (currently not working) (high coverage in SBCS)
  Algorithm ported from Khmer Converter [3]
Khmer Converter
  Mapping
  Reordering: visual order vs. Unicode model
  Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng]+ [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Encoding Big5
Many incompatible variations (abusing PUA) none ofstandard tools can rule them allhttpmoztworgdocsbig5
Scenario Dominating encodingMicrosoft CP950
Taiwan BBS UAO (Unicode-at-Once)govtw Big5-2003govhk HKSCS (199920012004)
Special characters conflictThe second byte could be 0x5C () 0x7C (|) 0x7E (~) whichmay have special meaning in certain context
Unicode East Asian Width (12)UAX11
Narrow
Halfwidth
Wide
Fullwidth
Ambiguous
Neutral
Figure Venn Diagram Showing the Set Relations for Six Categories
Unicode East Asian Width (22)UAX11
Narrow Ambiguous WideЯ
N ऊNa A A FH カ カ W
咦 WTable Examples for Each Character Class and Their Resolved Widths
Na NarrowN Neural usually treated as NarrowW WideF FullwidthH Halfwidth
Table Width attributes
Terminal transcodinghttpsgithubcombuganinibug5
IssuesUAO Non-standard big5 extensionDouble color hackANSI control sequence in the middle of DBCSAmbiguous width charactersluitscreen cannot help
Solution (tldr)Big5 to UnicodeANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1BUnicode to Big5UTF-800BYTEZHTWAMBIGUOUS-UNPADBIG5CP950-TRANSUAO00ANY3F
Bug5 explained (16)
⋆xC5x1B[1mxE5Input (Big5 literal)
ANSI-CONTROLBYTE Decoder
03
A1
03
B9
03
C5
1B
5B
31
6D
03
E5
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (26)
03
A1 03
B9 03
C5 1B
5B
31
6D
03E5
Internal data
BIG5-DEFRAG Inter-conversion
03
A1
03
B9
03
C5
03
E5
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (36)
03
A1 03
B9 03
C5 03
E5 1B
5B
31
6D
Internal data
BYTEPASSMARKampFOR=1BEncoder
A1
B9
C5
E5
1B
5B
31
6D
MARK
Internal data
Bug5 explained (4/6)
Internal data (in):
  A1
  B9
  C5
  E5
  1B 5B 31 6D (MARK)
PASS#UNMARK,BIG5 (decoder)
Internal data (out):
  01 26 05
  01 9A 5A
  1B 5B 31 6D
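The decoded units in this step can be reproduced with Python's built-in big5 codec. The sketch assumes, based only on the hex shown in these slides, that a Unicode unit is written as the prefix 01 followed by the code point in as few big-endian bytes as possible; to_typed_hex is a hypothetical helper.

```python
def to_typed_hex(text):
    """Render each character as '01' + minimal big-endian code point bytes (assumed layout)."""
    units = []
    for ch in text:
        cp = ord(ch)
        length = (cp.bit_length() + 7) // 8 or 1  # at least one byte, even for U+0000
        units.append("01" + cp.to_bytes(length, "big").hex().upper())
    return units

# The four defragmented bytes from the previous step, decoded as Big5.
decoded = b"\xa1\xb9\xc5\xe5".decode("big5")
print(decoded, to_typed_hex(decoded))
```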
Bug5 explained (5/6)
Internal data (in):
  01 26 05
  01 9A 5A
  1B 5B 31 6D
AMBIGUOUS-PAD (inter-conversion)
Internal data (out):
  01 26 05
  01 A0
  01 9A 5A
  1B 5B 31 6D
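The padding step can be approximated with the standard unicodedata module. This is a sketch of the idea rather than bsdconv's AMBIGUOUS-PAD module; the function name ambiguous_pad is hypothetical.

```python
import unicodedata

NBSP = "\u00a0"

def ambiguous_pad(text):
    """Append a no-break space after every character of East Asian Width class A."""
    out = []
    for ch in text:
        out.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            out.append(NBSP)
    return "".join(out)

padded = ambiguous_pad("\u2605\u9a5a")
print([f"U+{ord(ch):04X}" for ch in padded])
```

With the pad in place, an ambiguous character occupies two columns whether the terminal draws it narrow (one column plus the space) or the application assumed it wide.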
Bug5 explained (6/6)
Internal data (in):
  01 26 05
  01 A0
  01 9A 5A
  1B 5B 31 6D
UTF-8,PASS#FOR=1B (encoder)
Output (UTF-8 literal): ★(NBSP)驚\x1B[1m
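The final encoding step can be checked against the byte values in the entity table. In this sketch the unit kinds "UNICODE" and "CTRL" and the helper encode_units are hypothetical; Unicode units are encoded as UTF-8 and control sequences are passed through unchanged.

```python
def encode_units(units):
    """Encode Unicode units as UTF-8 and pass control-sequence units through."""
    out = bytearray()
    for kind, value in units:
        if kind == "UNICODE":
            out += value.encode("utf-8")
        else:  # "CTRL": already raw bytes
            out += value
    return bytes(out)

units = [
    ("UNICODE", "\u2605"),   # black star
    ("UNICODE", "\u00a0"),   # NBSP pad
    ("UNICODE", "\u9a5a"),
    ("CTRL", b"\x1b[1m"),
]
print(encode_units(units).hex().upper())
```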
Bsdconv Bindings
Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv
PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv
Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI: https://github.com/buganini/gbsdconv
Alternative to ConvertZ
  Text
  File name
  File content
  Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv
Chinese charset encoding detection: https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING  SCORE  COUNT  IERR  Score(s)
UTF-8     19     4      0     4.75
BIG5      8      3      2     -4.0
GBK       4      1      4     -36.0
CCCII     36     9      0     4.0
UTF-16LE  20     5      2     0.0
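The idea of scoring every candidate encoding and keeping the best one can be tried with Python's own codecs. This is a stand-in, not chiconv's SCORE module or its formula: the score below is simply CJK ideographs minus a penalty per decoding error, divided by the number of decoded characters, and the penalty weight, the candidate list and both function names are illustrative assumptions.

```python
def score_encoding(data, encoding, penalty=10.0):
    """Score how plausible `encoding` is for `data` (higher is better)."""
    text = data.decode(encoding, errors="replace")
    errors = text.count("\ufffd")  # each undecodable span becomes U+FFFD
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    count = max(len(text), 1)
    return (cjk - penalty * errors) / count

def guess_encoding(data, candidates=("utf-8", "big5", "gbk", "utf-16-le")):
    """Return the candidate with the highest score."""
    return max(candidates, key=lambda enc: score_encoding(data, enc))

sample = "帥呆了".encode("utf-8")
for enc in ("utf-8", "big5", "gbk", "utf-16-le"):
    print(enc, round(score_encoding(sample, enc), 2))
print(guess_encoding(sample))
```

Normalizing by the character count keeps long and short decodings comparable, and a heavy error penalty pushes any candidate that fails to decode cleanly below one that does.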
Khmer legacy font converter: https://github.com/buganini/khmerconv
Issues
  Encodings without a registered name, bound to fonts
  Stored in CP1252 or UTF-8
Solution
  Two-pass detection
    Detect encoding
    Detect font family (currently not working) (high coverage in SBCS)
  Algorithm ported from Khmer Converter [3]
Khmer Converter
  Mapping
  Reordering: visual order vs. Unicode model
  Unicode model: base character [+ [Robat/Shifter] + [Coeng] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Bug5 explained (36)
03
A1 03
B9 03
C5 03
E5 1B
5B
31
6D
Internal data
BYTEPASSMARKampFOR=1BEncoder
A1
B9
C5
E5
1B
5B
31
6D
MARK
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (46)
A1 B9 C5 E5 1B
5B
31
6D
MARK
Internal data
PASSUNMARKBIG5 Decoder
0126
05
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (56)
0126
05
019A
5A
1B5B
31
6D
Internal data
AMBIGUOUS-PAD Inter-conversion
01
26
05
01
A0
01
9A
5A
1B
5B
31
6D
Internal data
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bug5 explained (66)
01
26
05
01A0
019A
5A
1B5B
31
6D
Internal data
UTF-8PASSFOR=1BEncoder
⋆ 驚 x1B[1m
Output (UTF-8 literal)
Entity Unicode UTF-8 Hex Big5 Hex⋆ U+2605 E29885 A1B9驚 U+9A5A E9A99A C5E5[ U+005B 5B 5B1 U+0031 31 31m U+006D 6D 6D
(NBSP) U+00A0 C2A0 -
Figure ANSI-CONTROLBYTEBIG5-DEFRAGBYTEPASSMARKampFOR=1B|PASSUNMARKBIG5AMBIGUOUS-PADUTF-8PASSFOR=1B
Bsdconv Bindings
Python: https://pypi.python.org/pypi/bsdconv
Ruby: https://rubygems.org/gems/ruby-bsdconv
Go: https://github.com/buganini/go-bsdconv
Perl: https://github.com/buganini/perl-bsdconv
PHP: https://github.com/buganini/php-bsdconv
PostgreSQL: https://github.com/buganini/postgres-bsdconv
MySQL: https://github.com/buganini/mysql-udf-bsdconv
Irssi: https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv GUI: https://github.com/buganini/gbsdconv
Alternative to ConvertZ: text, file name, file content, meta tag
Thanks
ESCAPEUTF-8PASSFOR=UNICODE&MARKBYTE|PASSUNMARKUTF-8NFCASCIIESCAPE|
https://github.com/buganini/bsdconv