© 2010 IBM Corporation. May 19, 2010
Introducing DB2 pureScale
Bob Harbus, WW DB2 Evangelist, IBM Toronto Lab
Information Management
© 2010 IBM Corporation
DB2 pureScale
Learning from the undisputed Gold Standard... System z
Unlimited Capacity – Buy only what you need; add capacity as your needs grow
Application Transparency – Avoid the risk and cost of application changes
Continuous Availability – Deliver uninterrupted access to your data with consistent performance
DB2 pureScale Architecture
Leverages PowerHA pureScale (CF), RSCT, and GPFS technology from STG
Automatic workload balancing
Shared Data
InfiniBand network
Cluster of DB2 members running on Power servers
Integrated DB2 Cluster Services (Tivoli SA MP, RSCT, GPFS)
Proof of DB2 pureScale Architecture Scalability
How far will it scale?
Take a web-commerce-type workload – read mostly, but not read only
Don't make the application cluster aware – no routing of transactions to members; demonstrate transparent application scaling
Scale out to the 128-member limit and measure scalability
The Result
2, 4 and 8 Members: over 95% scalability
16 Members: over 95% scalability
32 Members: over 95% scalability
64 Members: 95% scalability
88 Members: 90% scalability
112 Members: 89% scalability
128 Members: 84% scalability
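The figures above express throughput efficiency relative to ideal linear scaling. A minimal sketch of that calculation, using hypothetical throughput numbers (not the actual benchmark data, which the deck does not give):

```python
# Relative scalability = (throughput ratio) / (member ratio), i.e. how
# close the cluster gets to ideal linear scaling from the baseline.
# The throughput figures below are hypothetical, for illustration only.

def relative_scalability(throughput, base=2):
    """throughput: dict of member count -> transactions/sec."""
    base_tps = throughput[base]
    return {n: (tps / base_tps) / (n / base)
            for n, tps in throughput.items()}

throughput = {2: 100.0, 64: 3060.0, 128: 5370.0}  # hypothetical tx/sec
eff = relative_scalability(throughput)
# eff[64] is roughly 0.96, eff[128] roughly 0.84 -- the same shape
# as the deck's reported curve.
```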
Scalability Example: 12 Member Cluster
Update workload – 1 update transaction for every 4 read transactions; a typical read/write ratio for many OLTP workloads
No cluster awareness in the application – no routing of transactions to members; demonstrates transparent application scaling
Redundant system – 14 8-core p550s, including duplexed PowerHA pureScale™
Scalability remains above 90%
Scalability for OLTP Applications
[Chart: Relative Scalability vs. Number of Members in the Cluster]
Application Transparency
Take advantage of extra capacity instantly – no need to modify your application code, no need to tune your database infrastructure
Developers don’t even need to know more nodes are being added
Administrators can add capacity without re-tuning or re-testing
DB2 pureScale is Easy to Deploy
Single installation for all components
Monitoring integrated into Optim tools
Single installation for fixpaks and updates
Simple command to add and remove members
DB2 pureScale: Technology Overview
[Diagram: clients see a single database view; a cluster of members, each with its own log and cluster services (CS), shares storage access to one shared database, connected over the cluster interconnect to primary and secondary servers]
DB2 engine runs on several host computers – members cooperate with each other to provide coherent access to the database from any member
Data sharing architecture – shared access to the database; members write to their own logs on shared disk; logs are accessible from other hosts (used during recovery)
PowerHA pureScale technology – efficient global locking and buffer management; synchronous duplexing to the secondary ensures availability
Low latency, high speed interconnect – special optimizations provide significant advantages on RDMA-capable interconnects (e.g. InfiniBand)
Clients connect anywhere and see a single database – clients can connect to any member; automatic load balancing and client reroute may change the underlying physical member to which a client is connected
Integrated cluster services – failure detection, recovery automation, cluster file system; in partnership with STG and Tivoli
Scale with Ease
[Diagram: single database view across a growing set of DB2 members, each with its own log]
Without changing applications – efficient coherency protocols designed to scale without application change; applications are automatically and transparently workload balanced across members
Without administrative complexity – no data redistribution required
To 128 members in the initial release – limited by testing resources
Online Recovery
A key DB2 pureScale design point is to maximize availability during failure recovery processing
When a database member fails, only data in-flight on the failed member remains locked – "in-flight" means data modifications that are part of active transactions on the member at the time it failed
Target member recovery time – under 20 seconds
[Chart: % of data available vs. time (~seconds) – at the member failure, availability dips briefly from 100% because only data with in-flight updates is locked during recovery]
Stealth Maintenance
Goal: allow DBAs to apply maintenance without negotiating an outage window
Procedure:
1. Drain (aka quiesce)
2. Remove and maintain
3. Re-integrate
4. Repeat until done
Information Management
© 2010 IBM Corporation14
What is a Member?
A DB2 engine address space – i.e. a db2sysc process and its threads
Members share data – all members access the same shared database (aka "data sharing")
Each member has its own bufferpools, memory regions, and log files
Members are logical; you can have 1 per machine or LPAR (recommended), or more than 1 per machine or LPAR (not recommended for production)
Member != database partition – a member is a db2sysc process; a database partition is a partition of the database
[Diagram: two db2sysc processes, Member 0 and Member 1, each with its own bufferpool(s), log buffer, dbheap and other heaps, and db2 agents and other threads; each writes to its own log on a shared database (a single database partition), alongside primary and secondary servers]
What is a PowerHA pureScale Server?
Software technology that assists in global buffer coherency management and global locking – derived from System z Parallel Sysplex and Coupling Facility technology; software based
Services provided include the Group Bufferpool (GBP), Global Lock Management (GLM), and the Shared Communication Area (SCA)
Members duplex GBP, GLM, and SCA state to both a primary and a secondary – done synchronously; duplexing is optional (but recommended) and set up automatically, by default
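The synchronous duplexing described above can be sketched as follows; the class and method names here are hypothetical illustrations, not DB2 internals:

```python
# Sketch of synchronous duplexing of caching-facility state
# (GBP/GLM/SCA) to a primary and a secondary server.

class CachingFacility:
    def __init__(self, name):
        self.name = name
        self.state = {}          # e.g. {"page:42": image, "lock:r17": owner}

    def apply(self, key, value):
        self.state[key] = value

class DuplexedCF:
    """A member's view: every state change is applied to BOTH the
    primary and the secondary before the call returns (synchronous),
    so the secondary can take over with no lost lock/page state."""
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, key, value):
        self.primary.apply(key, value)
        if self.secondary is not None:       # duplexing is optional
            self.secondary.apply(key, value)

    def failover(self):
        # Secondary becomes primary and carries on where it left off.
        self.primary, self.secondary = self.secondary, None
        return self.primary

primary, secondary = CachingFacility("primary"), CachingFacility("secondary")
cf = DuplexedCF(primary, secondary)
cf.write("lock:row17", "member0")
cf.write("page:42", b"new image")

new_primary = cf.failover()
# The former secondary already holds identical lock/page state,
# which is why no errors are returned to members on a primary failure.
```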
Cluster Interconnect
Requirements:
1. Low latency, high speed interconnect between the members and the primary and secondary PowerHA pureScale servers
2. RDMA-capable fabric, to make direct updates in memory without the need to interrupt the CPU
Solution: InfiniBand (IB) and uDAPL for performance – InfiniBand supports RDMA and is a low latency, high speed interconnect; uDAPL reduces kernel time on AIX
Cluster File System
Requirements:
1. Shared data requires shared disks and a cluster file system
2. Fencing of any failed members from the file system
Solution: General Parallel File System (GPFS) – shipped with, and installed and configured as part of, DB2
A pre-existing, user-managed GPFS file system is also supported – this allows GPFS to be managed at the same level across the enterprise; DB2 will not manage such a pre-existing file system, nor will it apply service updates to GPFS
SCSI-3 Persistent Reserve is recommended for rapid fencing
DB2 Cluster Services
Orchestrates:
Unplanned event notifications, to ensure seamless recovery and availability – member, PowerHA pureScale, AIX, hardware, and other unplanned events
Planned events – "stealth" maintenance
Integrates with RSCT, Tivoli SA MP and GPFS – packaged, shipped and installed with DB2 pureScale
Availability Goals
HA automation built in, "out of the box" – built-in failover, recovery and failback
Unplanned events (e.g. member and PowerHA pureScale server failures) – online recovery completes within ~20 seconds (OLTP); all data that is not being updated is fully available during recovery
Planned events (e.g. member maintenance) – "stealth" maintenance: no errors, no loss of data availability
Member SW Failure: "Member Restart on Home Host"
Scenario: kill -9 erroneously issued to a member
DB2 Cluster Services automatically detects the member's death – informs the other members and the PowerHA pureScale servers; initiates automated member restart on the same ("home") host
Member restart is like database crash recovery in a single-system database, but much faster – redo is limited to in-flight transactions (due to FAC); benefits from the page cache (GBP)
In the meantime, client connections are transparently re-routed to healthy members – based on least load (by default), or to a pre-designated failover member
Other members remain fully available throughout – "online failover"; the primary retains the update locks held by the member at the time of failure; other members can continue to read and update data not locked for write access by the failed member
Member restart completes – retained locks are released and all data is fully available; transaction work is routed to the restarted member
Automatic, ultra fast, and online.
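The "online failover" behaviour above can be sketched as a tiny lock-manager model; all names are illustrative, not DB2 internals:

```python
# Sketch: when a member dies, only rows it had write-locked (its
# in-flight updates) stay locked until its restart completes;
# every other row remains readable and updatable.

class GlobalLockManager:
    def __init__(self):
        self.write_locks = {}    # row -> owning member

    def lock_for_write(self, member, row):
        # A row can only be write-locked by one member at a time.
        assert self.write_locks.get(row, member) == member
        self.write_locks[row] = member

    def member_failed(self, member):
        # The primary retains the failed member's update locks.
        return {r for r, m in self.write_locks.items() if m == member}

    def restart_complete(self, member):
        # Retained locks released; all data fully available again.
        self.write_locks = {r: m for r, m in self.write_locks.items()
                            if m != member}

glm = GlobalLockManager()
glm.lock_for_write("member3", "row17")   # in-flight update on member 3
glm.lock_for_write("member1", "row99")

retained = glm.member_failed("member3")  # only row17 stays locked
glm.restart_complete("member3")          # row17 is available again
```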
Member HW Failure: "Member Restart on Guest Host" (aka "Restart Light")
Scenario: power cord tripped over accidentally
DB2 Cluster Services loses the heartbeat and declares the member down – informs the other members and the PowerHA pureScale servers; fences the member from logs and data; initiates automated member restart on another ("guest") host, using a reduced, pre-allocated memory model
Member restart is like database crash recovery in a single-system database, but much faster – redo is limited to in-flight transactions (due to FAC); benefits from the page cache in PowerHA pureScale
In the meantime, client connections are automatically re-routed to healthy members – based on least load (by default), or to a pre-designated failover member
Other members remain fully available throughout – "online failover"; the primary retains the update locks held by the member at the time of failure; other members can continue to read and update data not locked for write access by the failed member
Member restart completes – retained locks are released and all data is fully available
Automatic, ultra fast, and online.
Member Failback
Scenario: power restored and system rebooted
DB2 Cluster Services automatically detects system availability – informs the other members and the PowerHA pureScale servers; removes the fence; brings up the member on its home host
Client connections are automatically re-routed back to the member
Automatic Workload Balancing & Client Routing
Run-time load information is used to automatically balance load across the cluster (as in a System z sysplex)
Load information for all members is kept on each member – piggy-backed to clients regularly; used to route the next connection (or, optionally, the next transaction) to the least-loaded member; routing occurs automatically (transparent to the application)
Failover – the load of a failed member is evenly distributed to the surviving members automatically
Fallback – once the failed member is back online, fallback does the reverse
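The balancing and failover behaviour above can be sketched as a least-loaded router; this is an illustrative model, not the actual client driver logic:

```python
# Sketch: new connections go to the least-loaded member; when a
# member fails, its connections are redistributed automatically.

class Cluster:
    def __init__(self, members):
        self.load = {m: 0 for m in members}   # connections per member

    def route_connection(self):
        # Route to the least-loaded surviving member.
        member = min(self.load, key=self.load.get)
        self.load[member] += 1
        return member

    def member_failed(self, member):
        # The failed member's connections are evenly re-routed,
        # because each re-route picks the current least-loaded member.
        orphaned = self.load.pop(member)
        for _ in range(orphaned):
            self.route_connection()

cluster = Cluster(["m0", "m1", "m2", "m3"])
for _ in range(8):
    cluster.route_connection()        # 2 connections per member
cluster.member_failed("m3")           # m3's 2 connections re-routed
```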
Primary PowerHA pureScale Failure
Scenario: power cord tripped over accidentally
DB2 Cluster Services loses the heartbeat and declares the primary down – informs the members and the secondary; the PowerHA pureScale service is momentarily blocked; all other database activity proceeds normally (e.g. accessing pages in the bufferpool, existing locks, sorting, aggregation, etc.)
Members send missing data to the secondary – e.g. read locks
Secondary becomes primary – the PowerHA pureScale service continues where it left off; no errors are returned to DB2 members
Automatic, ultra fast, and online.
PowerHA pureScale Re-integration
Scenario: power restored and system rebooted
DB2 Cluster Services automatically detects system availability – the PowerHA pureScale server is started as the secondary; informs the members and the primary
The new system assumes the secondary role in 'catchup' state – members resume duplexing; members asynchronously send lock and other state information to the secondary; members asynchronously cast out pages from the primary to disk
Catchup complete – the secondary is in 'peer' state (it contains the same lock and page state as the primary)
Summary (Single Failures)
Failure mode – impact:
Member – only data that was in-flight on the failed member remains locked temporarily; connections to the failed member transparently move to another member
Primary PowerHA pureScale – momentary "blip" in the PowerHA pureScale service; transparent to members (in-flight PowerHA pureScale requests just take a few more seconds before completing normally)
Secondary PowerHA pureScale – momentary "blip" in the PowerHA pureScale service; transparent to members (in-flight PowerHA pureScale requests just take a few more seconds before completing normally)
In each case the other members remain online, and the handling is automatic and transparent.
Simultaneous Failures
Failure mode – impact:
Multiple members – only data that was in-flight on the failed members remains locked temporarily; recoveries are done in parallel; connections to the failed members transparently move to other members
Member plus primary PowerHA pureScale – same as a member failure, plus a momentary, transparent "blip" in the PowerHA pureScale service; connections to the failed member transparently move to another member
Member plus secondary PowerHA pureScale – same as a member failure, plus a momentary, transparent "blip" in the PowerHA pureScale service; connections to the failed member transparently move to another member
In each case the other members remain online, and the handling is automatic and transparent.
Group Crash Recovery (GCR)
Occurs when both the primary and secondary PowerHA pureScale servers fail simultaneously – or when the primary fails with the secondary not in peer state
DB2 Cluster Services automatically elects a member to drive GCR
GCR is very similar to crash recovery on a single-system database – redo and undo operations for updates not yet written to disk; in pureScale it operates against the merged logs
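The "merged logs" step can be sketched as a k-way merge of the per-member logs by log sequence number (LSN); the record format here is hypothetical:

```python
# Sketch: each member writes its own log, and group crash recovery
# replays the union of all member logs in global LSN order.
import heapq

def merged_redo(member_logs):
    """member_logs: list of per-member logs, each already sorted by
    LSN. Returns records across all logs in global LSN order."""
    return list(heapq.merge(*member_logs, key=lambda rec: rec[0]))

# Hypothetical (LSN, operation) records from two members' logs.
log_m0 = [(1, "update page 7"), (5, "update page 2")]
log_m1 = [(2, "update page 9"), (4, "update page 7")]

replay = merged_redo([log_m0, log_m1])
# Redo is applied in LSN order 1, 2, 4, 5 across both members' logs,
# just as a single-system crash recovery would walk one log.
```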
“Stealth” Maintenance : Example
1. Ensure automatic load balancing (the default) or automatic routing is enabled
2. db2stop member 3 quiesce <timeout>
3. db2stop instance on host <hostname>
   -- also --
   db2cluster -cm -enter -maintenance
4. Perform the desired maintenance (e.g. install an AIX PTF)
5. db2cluster -cm -exit -maintenance
   -- also --
   db2start instance on host <hostname>
6. db2start member 3
Scalability
Transaction scalability – key operations are designed for transaction scalability
Additional capacity – rapid addition and exploitation of additional capacity: add a member, add disk space
Good scaling without application change; no restrictions based on data ownership; flexible workload balancing and routing
Achieving Efficient Scaling: Key Design Points
Deep RDMA exploitation over a low latency fabric – enables round-trip response times as low as 10-15 microseconds (lock request)
Silent invalidation – informing members of page updates requires no CPU cycles on those members; no interrupt or other message processing required; increasingly important as the cluster grows
[Diagram: a member's buffer manager and lock manager exchange lock requests ("Can I have this lock?" / "Yup, here you are.") and new page images with the GBP/GLM/SCA]
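Silent invalidation can be sketched as follows: when a page is updated through the group bufferpool, a validity flag in each other member's local bufferpool is flipped directly (via RDMA on real hardware), so no interrupt or message handling runs on those members. The names here are illustrative, not DB2 internals:

```python
# Sketch of silent invalidation of stale local page copies.

class LocalBufferpool:
    def __init__(self):
        self.pages = {}      # page_id -> (image, valid_flag)

    def cache(self, page_id, image):
        self.pages[page_id] = (image, True)

    def read(self, page_id, gbp):
        image, valid = self.pages.get(page_id, (None, False))
        if not valid:
            image = gbp.pages[page_id]   # refetch the current image
            self.cache(page_id, image)
        return image

class GroupBufferpool:
    def __init__(self, members):
        self.pages = {}
        self.members = members

    def write(self, writer, page_id, image):
        self.pages[page_id] = image
        writer.cache(page_id, image)
        for m in self.members:
            # "Silently" mark stale copies invalid; the member itself
            # does no work until it next touches the page.
            if m is not writer and page_id in m.pages:
                old, _ = m.pages[page_id]
                m.pages[page_id] = (old, False)

m0, m1 = LocalBufferpool(), LocalBufferpool()
gbp = GroupBufferpool([m0, m1])
gbp.write(m0, "p42", "v1")
m1.read("p42", gbp)            # m1 now caches v1
gbp.write(m0, "p42", "v2")     # m1's copy is silently invalidated
```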
Adding Capacity

Initial installation:
Complete pre-requisite work: AIX installed, hosts on the network, access to shared disks enabled
Copies the DB2 pureScale image to the Install Initiating Host
Installs the code on the specified hosts using a response file
Creates the instance, members, and primary and secondary PowerHA pureScale servers as directed
Adds members, primary and secondary PowerHA pureScale servers, hosts, HCA cards, etc. to the domain resources
Creates the cluster file system and sets up each member's access to it

Adding a member:
1. Complete pre-requisite work: AIX installed, on the network, access to shared disks
2. Add the member: db2iupdt -add -m <MemHostName:MemIBHostName> InstName
3. DB2 does all the tasks needed to add the member to the cluster: copies the image and response file to host6, runs the install, adds M4 to the resources for the instance, and sets up access to the cluster file system for M4

You can also drop a member, and add or drop a PowerHA pureScale server.
Note: extending and shrinking the instance is an offline task in the initial release.

[Diagram: the Install Initiating Host copies the DB2 pureScale image locally, then scp's the image and response file to each host – Members 0-3 with cluster services, the primary, the secondary – and to the new host for Member 4]
Instance and Host Status

db2nodes.cfg:
0 host0 0 - MEMBER
1 host1 0 - MEMBER
2 host2 0 - MEMBER
3 host3 0 - MEMBER
4 host4 0 - CF
5 host5 0 - CF

> db2start
08/24/2008 00:52:59     0   0   SQL1063N  DB2START processing was successful.
08/24/2008 00:53:00     1   0   SQL1063N  DB2START processing was successful.
08/24/2008 00:53:01     2   0   SQL1063N  DB2START processing was successful.
08/24/2008 00:53:01     3   0   SQL1063N  DB2START processing was successful.
SQL1063N  DB2START processing was successful.

> db2instance -list
ID  TYPE    STATE    HOME_HOST  CURRENT_HOST  ALERT
0   MEMBER  STARTED  host0      host0         NO
1   MEMBER  STARTED  host1      host1         NO
2   MEMBER  STARTED  host2      host2         NO
3   MEMBER  STARTED  host3      host3         NO
4   CF      PRIMARY  host4      host4         NO
5   CF      PEER     host5      host5         NO

HOST_NAME  STATE   INSTANCE_STOPPED  ALERT
host0      ACTIVE  NO                NO
host1      ACTIVE  NO                NO
host2      ACTIVE  NO                NO
host3      ACTIVE  NO                NO
host4      ACTIVE  NO                NO
host5      ACTIVE  NO                NO
DB2 pureScale & Optim Tooling Solutions
Consistent and integrated approach to offering tools for DB2 pureScale – Optim tools offered for the enterprise and distributed versions of DB2 are fully aware of DB2 pureScale environments
Database administration – ability to perform common administration tasks across members and PowerHA pureScale servers; integrated navigation through shared-data instances
System monitoring – seamless view of status and statistics across all instances; includes information on locking, connections, storage, memory, etc.
Application development – full support for developing Java, C, and .NET applications against a pureScale environment
[Screenshots: launch the desired administration task assistant; select which member to quiesce before taking it offline; select quiesce options that define how and when the action should occur; then view, modify, or execute the commands to complete the task]
DB2 pureScale
Unlimited Capacity – start small; grow easily, with your business
Application Transparency – avoid the risk and cost of tuning your applications to the database topology
Continuous Availability – maintain service across planned and unplanned events