Code sets, NLS and character conversion vs. DB2
Roland Schock, ARS Computer und Consulting GmbH
© ARS Computer und Consulting GmbH 2014

Overview
- What are character sets, encoding schemes and code pages?
- Where can I define the code page used?
- What is code page conversion and where does it happen?
- What problems can arise and how can I avoid them?
- Performance considerations

Character Sets
A character set is simply a collection of entities or graphical symbols that carry a meaning.
Examples of character sets are the Latin alphabet, digits, naval signal flags or other symbols:
A, B, C, ... a g p x
A b c d
ᇹ ぁ ゆ ㌹ ㌺
亹 怔 떟 떥

Character Encoding
A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points. For example (arbitrary values): A → 17, B → 23, C → 42, …
Typical examples of encodings are ASCII, EBCDIC and Unicode.
An encoding scheme also defines a serialisation scheme that converts each code point into a sequence of bytes.
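
As an illustration of the mapping from character to code point to bytes, here is a minimal Python sketch. It uses Python's built-in codecs as stand-ins for code pages (cp037 is one EBCDIC code page); the names samples, codecs_to_try and encode_or_none are made up for this example and have nothing to do with DB2.

```python
# Sketch: one character, one Unicode code point, several byte serialisations.
samples = ["A", "\u00e4"]  # "A" and a-umlaut
codecs_to_try = ["ascii", "latin-1", "cp850", "cp037", "utf-8"]  # cp037 = an EBCDIC code page


def encode_or_none(ch, codec):
    """Return the bytes for ch in the given codec, or None if it has no mapping."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


table = {ch: {c: encode_or_none(ch, c) for c in codecs_to_try} for ch in samples}

for ch, row in table.items():
    print(f"{ch!a} (code point U+{ord(ch):04X})")
    for codec, raw in row.items():
        shown = raw.hex(" ") if raw is not None else "not representable"
        print(f"  {codec:8} -> {shown}")
```

The same letter ends up as different bytes depending on the code page, and 7-bit ASCII has no mapping for the umlaut at all.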

ASCII
Sample of an encoding scheme:
- First version 1963, standardized 1968
- Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)
- Extensions from 7-bit ASCII to 8-bit code pages
- ISO-8859-x: ASCII + special characters for some languages
- Platform-specific charsets: Windows ANSI or MacRoman
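
The consequence of having several 8-bit extensions is that the same byte means different things in different code pages. A small Python sketch, assuming Python's codec names latin-1 (ISO-8859-1), cp1252 (Windows ANSI), mac_roman and cp850 as representatives:

```python
# Sketch: one byte above 0x7F, decoded with four single-byte code pages.
raw = bytes([0xE4])
code_pages = ["latin-1", "cp1252", "mac_roman", "cp850"]

decoded = {cp: raw.decode(cp) for cp in code_pages}
for cp, ch in decoded.items():
    print(f"0x{raw.hex()} in {cp:9} -> {ch!a} (U+{ord(ch):04X})")

# The printable 7-bit ASCII range, by contrast, decodes identically everywhere.
ascii_bytes = bytes(range(0x20, 0x7F))
ascii_agree = len({ascii_bytes.decode(cp) for cp in code_pages}) == 1
print("ASCII range identical in all four code pages:", ascii_agree)
```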

Double Byte Char Sets (DBCS)
- Expansion of the SBCS concept from one byte to two bytes per character
- Mainly used for Asian languages with more than 256 characters to encode
- Latin text takes twice the space it needs in an SBCS

EUC (Extended Unix Code)
- Multi Byte Char Set (MBCS): one to four bytes per character, depending on the code set
- Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
- Uses single shift characters to switch to another code set when building a multi-byte character
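
To make the single shift mechanism concrete, here is a short Python sketch that uses the euc_jp codec as one example of an EUC encoding. The sample characters and the names samples, SS2 and encoded are chosen for this illustration only.

```python
# Sketch: byte sequences produced by EUC-JP for characters from different code sets.
samples = {
    "ASCII letter": "A",
    "Hiragana": "\u3042",             # HIRAGANA LETTER A
    "Half-width katakana": "\uff71",  # HALFWIDTH KATAKANA LETTER A
}
SS2 = 0x8E  # single shift 2: announces a character from code set 2

encoded = {label: ch.encode("euc_jp") for label, ch in samples.items()}
for label, raw in encoded.items():
    note = " (starts with single shift SS2)" if raw[0] == SS2 else ""
    print(f"{label:20} -> {raw.hex(' ')}{note}")
```

The ASCII letter stays a single byte, the hiragana character takes two bytes, and the half-width katakana character is prefixed with the single shift byte 0x8E.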

Unicode
Intended to simplify and unify the different definitions of code pages and hence conversion.
The first definition had room for 65,536 characters (16-bit, 1991, UCS-2).
Version 2.0 extended the character set by 16 additional planes to a total of up to 1,114,112 code points (32-bit, 1996, UCS-4).
As of Unicode Version 4.0, approx. 100,000 characters are assigned to code points.
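
The figure of 1,114,112 follows from simple arithmetic: the original 16-bit plane plus 16 further planes of the same size. A quick check in Python (the variable names are illustrative):

```python
import sys

PLANE_SIZE = 2 ** 16   # 65,536 code points per plane
planes = 1 + 16        # the original plane plus 16 additional planes

code_space = planes * PLANE_SIZE
highest = code_space - 1

print(code_space)                         # size of the Unicode code space
print(hex(highest), hex(sys.maxunicode))  # highest code point, computed vs. Python's own limit
```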

Unicode char sets and encodings
- UCS-2: two bytes per character
- UCS-4: four bytes per character
- UTF-16: encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
- UTF-8: variable-length encoding of characters with one to four bytes
Possible problems with UCS-2, UCS-4, UTF-16: byte order differences (big-endian vs. little-endian) between different processor architectures.
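
The byte order problem and the two-word form of UTF-16 can be shown in a few lines of Python. This is only a sketch with Python's utf-16-be and utf-16-le codecs; the sample characters are arbitrary.

```python
import codecs

euro = "\u20ac"  # EURO SIGN, inside the first 64k code points

big = euro.encode("utf-16-be")     # most significant byte first
little = euro.encode("utf-16-le")  # least significant byte first
print(big.hex(" "), "vs.", little.hex(" "))

# Reading big-endian data as little-endian silently yields a different character.
misread = big.decode("utf-16-le")
print(f"misread as U+{ord(misread):04X}")

# A code point beyond the first 64k needs two 16-bit words (a surrogate pair).
emoji = "\U0001F600"
pair = emoji.encode("utf-16-be")
print(pair.hex(" "))

# A byte order mark (BOM) in front of the data lets the reader detect the order.
with_bom = codecs.BOM_UTF16_BE + big
decoded = with_bom.decode("utf-16")
```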

UTF-8
- Encoding in a variable-length sequence of bytes
- Simple recognition of multi-byte characters
- Compact storage of text in Latin characters
- Only the shortest encoding is allowed
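
These properties can be checked directly. The following Python sketch classifies UTF-8 bytes by their leading bits and shows that an overlong (non-shortest) encoding is rejected; utf8_byte_kind and is_valid_utf8 are helper names invented for this example.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a single byte of a UTF-8 stream by its high bits."""
    if b < 0x80:
        return "single byte (ASCII)"    # 0xxxxxxx
    if b < 0xC0:
        return "continuation"           # 10xxxxxx
    if b < 0xE0:
        return "lead of 2-byte sequence"  # 110xxxxx
    if b < 0xF0:
        return "lead of 3-byte sequence"  # 1110xxxx
    if b < 0xF8:
        return "lead of 4-byte sequence"  # 11110xxx
    return "invalid in UTF-8"


def is_valid_utf8(raw: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Variable length: Latin letter, umlaut, Euro sign, emoji.
lengths = [len(ch.encode("utf-8")) for ch in ["A", "\u00e4", "\u20ac", "\U0001F600"]]
print(lengths)

# Every byte announces its own role, so character boundaries are easy to find.
kinds = [utf8_byte_kind(b) for b in "\u20ac".encode("utf-8")]
print(kinds)

# 0xC0 0xAF would be a two-byte form of "/"; only the one-byte form is allowed.
print(is_valid_utf8(b"/"), is_valid_utf8(b"\xc0\xaf"))
```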

Usage of a code page
Code pages can be specified at different levels:
- At the operating system where the application runs
- At the operating system where the server runs
- At the operating system where the application is prepared/bound
- At the database level

Default code page
By default, DB2 servers and clients use the local settings of the operating system or user:
- Windows: The server process uses the default region settings of the operating system.
- Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
- Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
- Programming language: Java always uses Unicode when connecting to a database via JDBC.

Specifying a code page: OS level
- Windows: Control Panel → Regional and Language settings, chcp command
- Linux/Unix: locale command
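
Many programs derive their default encoding from these OS settings. As a rough analogy only (this is how a Python process sees the locale, not how DB2 determines its code page), the following sketch prints what the current user's settings resolve to:

```python
import locale
import sys

# Encoding that the current locale settings resolve to for this process.
locale_encoding = locale.getpreferredencoding(False)

print("Encoding derived from the locale:", locale_encoding)
print("Encoding of standard output:     ", sys.stdout.encoding)
```

Running it under different locale or chcp settings is a quick way to see which encoding a session actually ends up with.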

At prepare/bind time
Special case during development of database software with static, embedded SQL.
Embedded SQL needs a prepare phase before the source code is compiled.
Later, the prepared package needs to be bound to the database with the bind command.
Both commands need a database connection, and the locale setting in effect at connect time is used.

Defining a database w/ code page
Explicitly set the code page at creation time:
    CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
Otherwise the current locale is used to determine the database codeset.
The chosen code page cannot be changed later.
In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).

Code page conversion
If application and server use different code pages, code page conversion happens.
Code page conversion is always done on the receiver's side:
- on the server's side for data sent from client to server
- on the client's side for data sent from server to client
Exception: importing IXF files generated on a different system with another code page.
If conversion tables are missing, SQLCODE -332 is returned.

Client to server conversion
1. Client (uses code page X): sends data using code page X.
2. Server (uses code page Y): receives the data, converts it to code page Y, processes it and returns the result in code page Y.
3. Client: receives the data in Y and converts it to code page X.
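
The exchange can be simulated in a few lines of Python. This is a sketch under assumptions, not DB2 code: cp1252 stands in for client code page X, cp850 for server code page Y, and convert is a hypothetical helper that performs the receiver-side conversion.

```python
CLIENT_CP = "cp1252"  # assumed code page X of the client
SERVER_CP = "cp850"   # assumed code page Y of the server


def convert(data: bytes, source_cp: str, target_cp: str) -> bytes:
    """Receiver-side conversion: decode from the sender's code page, re-encode in our own."""
    return data.decode(source_cp).encode(target_cp)


# 1. The client sends data in its own code page X.
sent = "M\u00fcller".encode(CLIENT_CP)

# 2. The server receives the bytes and converts them to code page Y before processing.
stored = convert(sent, CLIENT_CP, SERVER_CP)

# 3. The server returns the result in Y; the client converts it back to X.
received = convert(stored, SERVER_CP, CLIENT_CP)

print("client sent:    ", sent.hex(" "))
print("server stored:  ", stored.hex(" "))
print("client received:", received.hex(" "))
```

The u-umlaut travels as 0xFC on the client side and is stored as 0x81 on the server side; the text itself is unchanged because both code pages contain the character.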

Using DB2 Connect
1. Client (uses code page X): sends data using code page X.
2. Gateway (uses code page Y): receives the data, converts it to code page Y and sends it on in Y.
3. Server (uses code page Z): receives the data, converts it to code page Z and returns the result in code page Z.
4. Gateway: receives the data in Z, converts it to Y and returns the result in Y.
5. Client: receives the data in Y and converts it to code page X.

Other considerations
- Mapping of characters (injective): If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
- Round trip conversion (bijective): If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.
- Encoding/decoding can change the number of bytes needed to store the data.
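
All three effects can be reproduced with Python's codecs. Note the assumptions in this sketch: Python substitutes "?" when asked to replace unmappable characters, which is not necessarily the substitution character a DB2 conversion table uses, and round_trip is a helper name invented here.

```python
def round_trip(text: str, target_cp: str) -> str:
    """Convert text to the target code page (substituting unmappable characters) and back."""
    converted = text.encode(target_cp, errors="replace")
    return converted.decode(target_cp)


# All characters exist in the target code page: nothing is lost.
ok = round_trip("M\u00fcller", "latin-1")

# The Euro sign does not exist in ISO-8859-1: it is substituted and cannot be recovered.
lossy = round_trip("5 \u20ac", "latin-1")

print(f"{ok!a}  {lossy!a}")

# The same text needs a different number of bytes in different encodings.
size_latin1 = len("M\u00fcller".encode("latin-1"))
size_utf8 = len("M\u00fcller".encode("utf-8"))
print(size_latin1, "bytes in latin-1,", size_utf8, "bytes in UTF-8")
```

The byte count difference matters for column sizing: a value that fits a column in a single-byte code page may no longer fit once it is stored as UTF-8.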

More considerations
- Using different conversion tables and the € symbol: The Microsoft ANSI code page (1252) and code page 850 treat the Euro symbol differently: Windows-1252 encodes it at 0x80, while the original code page 850 does not contain it at all (its Euro-enabled variant is code page 858). If needed, code conversion tables can be replaced (ref. Administration Guide, Planning).
- Unicode support: DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode databases.
- For pureXML (V9.x) a UTF-8 database is needed.
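
The Euro symbol is a handy test case for code page differences. The sketch below asks several of Python's codecs where they place it; the codec names cp1252, cp850 and cp858 are Python's, and euro_bytes_in is a helper made up for this example. The conversion tables shipped with DB2 may differ from Python's tables.

```python
EURO = "\u20ac"


def euro_bytes_in(code_page: str):
    """Return the bytes for the Euro sign in a code page, or None if it has no Euro sign."""
    try:
        return EURO.encode(code_page)
    except UnicodeEncodeError:
        return None


euro_bytes = {cp: euro_bytes_in(cp) for cp in ["cp1252", "cp850", "cp858", "utf-8"]}
for cp, raw in euro_bytes.items():
    print(f"{cp:7} -> {raw.hex(' ') if raw is not None else 'no Euro sign'}")
```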

More considerations
- To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. So choosing the right database code page during database creation is crucial.
- Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.

Troubleshooting
- Identify the code pages used: db2 get db cfg for sample retrieves the database code page.
- Display the SQLCA during CONNECT with the CLP: when connecting to a database via the CLP, the option "-a" displays the SQLCA data area, which shows the code pages of the database and of the connecting client.
- If connecting to iSeries or zSeries machines from DB2 LUW, check whether conversion tables are available.

Performance considerations
- Try to avoid unnecessary conversions.
- Create databases with the code page your applications need from the start.
- For international databases prefer UTF-8, especially when used with Java programs.
- Conversion takes time.

Links
- IBM developerWorks white paper: http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.html
- DB2 Infocenter: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp
- Unicode: http://www.unicode.org
- UTF-8 article at Wikipedia: http://en.wikipedia.org/wiki/UTF-8
Roland Schock
ARS Computer und Consulting GmbH