Code sets, NLS and character conversion vs. DB2

Roland Schock, ARS Computer und Consulting GmbH

DB2 and Codepages

Learn details about the database DB2 and codepages

© ARS Computer und Consulting GmbH 2014

Overview

What are character sets, encoding schemes and code pages?
Where can I define the code page used?
What is code page conversion and where does it happen?
What problems can arise and how can I avoid them?
Performance considerations

Character Sets

Basically, a character set is just a collection of entities or graphical symbols that carry a meaning.

Examples of character sets are the Latin alphabet, digits, naval signal flags or other symbols:

A, B, C, ... a g p x

A b c d

ᇹ ぁ ゆ ㌹ ㌺

亹 怔 떟 떥

Character Encoding

A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points, for example: A → 17, B → 23, C → 42, …

Typical examples of encodings are ASCII, EBCDIC or Unicode.

Part of the encoding scheme is also the definition of a serialisation scheme to convert the code point into a sequence of bytes.
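
The difference between a code point and its serialised bytes can be seen in a few lines of Python. This is only an illustrative sketch using Python's built-in codecs; the sample characters and the choice of cp500 as an EBCDIC code page are assumptions made for the example, not part of the original slides.

```python
# Code point versus serialised bytes, using Python's built-in codecs.
text = "Aä€"

# The code point is the number assigned to a character (here: Unicode).
for ch in text:
    print(ch, "-> code point", ord(ch), hex(ord(ch)))

# The serialisation scheme decides which bytes are written for that number.
latin1_bytes = "ä".encode("latin-1")   # one byte in ISO-8859-1
utf8_bytes = "ä".encode("utf-8")       # two bytes in UTF-8
ebcdic_bytes = "A".encode("cp500")     # EBCDIC uses another byte for 'A' than ASCII
print(latin1_bytes, utf8_bytes, ebcdic_bytes)
```

The same character therefore ends up as different byte sequences depending on the encoding, which is why both sides of a connection have to agree on, or convert between, code pages.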

ASCII

Sample of an encoding scheme:

First version 1963, standardized 1968
Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)

Extensions from 7-bit ASCII to 8-bit code pages
ISO-8859-x: ASCII + special characters for some languages
Platform specific charsets: Windows ANSI or MacRoman
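
As a small sketch of why the 8-bit extensions are not interchangeable, the following Python lines interpret the same byte under several single-byte code pages. The codec names are those of Python's standard library; the byte values were picked for the example.

```python
# One byte value, interpreted under different 8-bit code pages.
raw = b"\xa4"
as_latin1 = raw.decode("iso8859_1")    # ISO-8859-1: currency sign
as_latin9 = raw.decode("iso8859_15")   # ISO-8859-15: euro sign
print(as_latin1, as_latin9)

# Windows-1252 ("ANSI") places the euro sign at 0x80 instead.
euro_cp1252 = "€".encode("cp1252")
print(euro_cp1252)

# The 7-bit ASCII range is identical in all of them.
ascii_results = [b"DB2".decode(codec)
                 for codec in ("iso8859_1", "iso8859_15", "cp1252", "mac_roman")]
print(ascii_results)
```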

Double Byte Char Sets (DBCS)

Expansion of the SBCS concept from one byte to two bytes per character

Mainly used for Asian languages with more than 256 characters to encode

Latin text is expanded to twice the size of SBCS

EUC (Extended Unix Code)

Multi Byte Char Set (MBCS): 2 or 4 bytes/char
Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms

Uses single shift characters to switch to another code group to build a multi byte character

Unicode

Intended to simplify and unify the different definitions of code pages and hence conversion.

The first definition provided room for 65,536 characters (16-bit, 1991, UCS-2).

Version 2.0 extended the charset by 16 additional planes for up to 1,114,112 characters (32-bit, 1996, UCS-4).

Today, in Unicode Version 4.0, approx. 100,000 characters are assigned to code points.
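
The figure of 1,114,112 follows directly from the plane structure: one original 16-bit plane plus 16 additional planes of the same size. A short check in Python (the variable names are only illustrative):

```python
import sys

PLANE_SIZE = 2 ** 16        # code points per plane (the original 16-bit range)
planes = 1 + 16             # the basic plane plus 16 additional planes
total = planes * PLANE_SIZE
highest = total - 1         # code points are numbered starting at 0

print(total)                # 1114112
print(hex(highest))         # 0x10ffff
print(sys.maxunicode == highest)
```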

Unicode char sets and encodings

UCS-2: two bytes per character
UCS-4: four bytes per character
UTF-16: Encoding of UCS-4 into one or two words: the first 64k code points use two bytes per character, all others four bytes

UTF-8: dynamic or variable length encoding of characters with one to four bytes

Possible problems with UCS-2, UCS-4, UTF-16:
Byte order differences (big-endian vs. little-endian) between different processor architectures.
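
The byte order issue and the two-or-four-byte behaviour of UTF-16 can be made visible with Python's codecs. This is a sketch for illustration; the two sample characters are assumptions chosen so that one lies inside and one outside the first 64k code points.

```python
euro = "€"        # U+20AC, inside the first 64k code points
smiley = "😀"     # U+1F600, outside the first 64k code points

# UTF-16: two bytes for the first, four bytes (a surrogate pair) for the second.
euro_len = len(euro.encode("utf-16-be"))
smiley_len = len(smiley.encode("utf-16-be"))
print(euro_len, smiley_len)

# The same character in both byte orders: the two bytes are swapped.
big = euro.encode("utf-16-be")
little = euro.encode("utf-16-le")
print(big.hex(), little.hex())

# Without an explicit byte order Python prepends a byte order mark (BOM).
with_bom = euro.encode("utf-16")
print(with_bom[:2].hex())

# UTF-8 is defined as a byte sequence and has no byte order issue.
utf8 = euro.encode("utf-8")
print(utf8.hex())
```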

UTF-8

Encoding in variable length sequence of bytes
Simple recognition of multibyte chars
Compact storage of text in Latin chars
Only the shortest encoding allowed
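
These properties can be observed with a few lines of Python. The sketch below prints the bit patterns of some sample characters (chosen for the example), shows how continuation bytes are recognised, and demonstrates that a non-shortest ("overlong") byte sequence is rejected by a conforming decoder. The helper name is_continuation is hypothetical.

```python
# Byte length grows with the code point: 1, 2, 3 and 4 bytes.
lengths = {}
for ch in ("A", "ä", "€", "😀"):
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s)  {bits}")

def is_continuation(byte: int) -> bool:
    """Continuation bytes of a multibyte character always look like 10xxxxxx."""
    return byte & 0b1100_0000 == 0b1000_0000

# An overlong (non-shortest) two-byte encoding of U+0000 is rejected.
try:
    b"\xc0\x80".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
print("overlong form accepted:", overlong_accepted)
```

Because lead bytes and continuation bytes have distinct bit patterns, a reader can find character boundaries from any position in a byte stream.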

Usage of a code page

Code pages can be specified at different levels:
At the operating system where the application runs
At the operating system where the server runs
At the operating system where the application is prepared/bound
At the database level

Default code page

By default, DB2 server and clients use the local settings of the operating system or user:

Windows: The server process uses the default region settings of the operating system.

Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).

Client (LUW): The current locale settings of the user determine the code page used during CONNECT.

Programming language: Java always uses Unicode when connecting to a database via JDBC.

Specifying a code page: OS level

Windows: Control Panel → Regional and Language settings, chcp command

Linux/Unix: locale command
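
To see which encoding a process inherits from these OS settings, a quick check from Python can help. This is only a sketch of how an ordinary program sees the locale; it does not show how DB2 itself derives its code page, and the value printed depends entirely on the machine it runs on.

```python
import codecs
import locale

# Encoding derived from the user's locale settings
# (for example LANG / LC_ALL on Linux/Unix).
preferred = locale.getpreferredencoding(False)
print("locale encoding:", preferred)

# Normalised codec name as Python knows it.
codec_name = codecs.lookup(preferred).name
print("codec name:", codec_name)
```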

At prepare/bind time

Special case during development of database software with static, embedded SQL.

Embedded SQL needs a prepare phase before compilation of the source code.

Later the prepared package needs to be bound to the database with the bind command.

Both commands need a database connection, and the locale setting in effect at connect time is used.

Defining a database w/ code page

Explicitly set the code page at creation time:
CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq

Otherwise the current locale is used to determine the database codeset.

The chosen code page cannot be changed later.

In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).

Code page conversion

If application and server use different code pages, code page conversion happens.

Code page conversion is always done at the receiver's side:
at the server's side for data sent from client to server
at the client's side for data sent from server to client

Exception: Importing IXF files generated on a different system with another code page

If conversion tables are missing: SQLCODE -332

Client to server conversion

Client uses code page X, server uses code page Y.

Client: send data using code page X
Server: receive data, convert to code page Y, process data, return result in code page Y
Client: receive data in Y, convert to code page X
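
The receiver-side conversion between client and server can be simulated in a few lines of Python. This is only a sketch of the principle, not DB2's implementation: the concrete code pages (cp1252 for X, UTF-8 for Y), the sample string and the helper name convert are assumptions made for the example.

```python
CLIENT_CODEPAGE = "cp1252"   # "code page X" (assumed for this example)
SERVER_CODEPAGE = "utf-8"    # "code page Y" (assumed for this example)

def convert(data: bytes, source: str, target: str) -> bytes:
    """Receiver-side conversion: decode from the sender's code page,
    re-encode in the receiver's own code page."""
    return data.decode(source).encode(target)

# Client sends data using its own code page X.
sent = "Müller".encode(CLIENT_CODEPAGE)

# Server receives the bytes and converts them to code page Y before processing.
on_server = convert(sent, CLIENT_CODEPAGE, SERVER_CODEPAGE)

# Server returns the result in Y; the client converts it back to X.
returned = convert(on_server, SERVER_CODEPAGE, CLIENT_CODEPAGE)

print(len(sent), len(on_server), returned == sent)
```

Note that the data occupies a different number of bytes on each side even though the text is the same.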

Using DB2 Connect

Client uses code page X, gateway uses code page Y, server uses code page Z.

Client: send data using code page X
Gateway: receive data, convert to code page Y, send data in Y
Server: receive data, convert to code page Z, return result in code page Z
Gateway: receive data in Z, convert to Y, return result in Y
Client: receive data in Y, convert to code page X

Other considerations

Mapping of characters (injective):
If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.

Round trip conversion (bijective):
If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.

Encoding/Decoding can change the number of bytes needed to store the data.
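
All three effects (substitution, lossless round trip, changing byte counts) can be reproduced with Python's codecs. This is an illustrative sketch: Python substitutes a question mark, while the substitution character actually used by a database product may differ, and the code pages and sample text are assumptions for the example.

```python
text = "5 €"

# ISO-8859-1 has no euro sign, so the encoder must substitute.
substituted = text.encode("iso8859_1", errors="replace")
print(substituted)

# The round trip through ISO-8859-1 therefore loses information ...
lossy = substituted.decode("iso8859_1")
# ... while a round trip through a code page containing the character does not.
lossless = text.encode("cp1252").decode("cp1252")
print(lossy == text, lossless == text)

# The byte count can change as well.
cp1252_len = len(text.encode("cp1252"))
utf8_len = len(text.encode("utf-8"))
print(cp1252_len, utf8_len)
```

The growth in byte count matters when sizing character columns for a target code page.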

More considerations

Using different conversion tables and the € symbol:
The Microsoft ANSI code page and the official code page 850 have different code points for the Euro symbol. If needed, code conversion tables can be replaced (ref. Administration Guide, Planning).

Unicode support:
DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode databases

For PureXML (V9.x) a UTF-8 database is needed.

More considerations

To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. So choosing the right database code page during database creation is crucial.

Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.

Troubleshooting

Identify used code pages:
db2 get db cfg for sample
Retrieves database code page

Displaying SQLCA area during CONNECT with CLP
When connecting to a database via CLP, the option "-a" displays the SQLCA data area, which shows the code page of the database and the connecting client.

If connecting to iSeries or zSeries machines from DB2 LUW, check if conversion tables are available.

Performance considerations

Try to avoid unnecessary conversions.
Create databases with the code page your applications need from the start.

For international databases prefer UTF-8, especially when used with Java programs.

Conversion takes time.

Links

IBM developerWorks white paper:
http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.html

DB2 Infocenter
http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp

Unicode
http://www.unicode.org

UTF-8 article at Wikipedia
http://en.wikipedia.org/wiki/UTF-8

Roland Schock

ARS Computer und Consulting GmbH

[email protected]