© 2010 IBM Corporation. May 19, 2010
Introducing DB2 pureScale
Bob Harbus, WW DB2 Evangelist, IBM Toronto Lab
Information Management
© 2010 IBM Corporation
DB2 pureScale
Learning from the undisputed Gold Standard... System z
Unlimited Capacity – Buy only what you need; add capacity as your needs grow
Application Transparency – Avoid the risk and cost of application changes
Continuous Availability – Deliver uninterrupted access to your data with consistent performance
DB2 pureScale Architecture
Leverages PowerHA pureScale (CF), RSCT, and GPFS technology from STG
Automatic workload balancing
Shared Data
InfiniBand network
Cluster of DB2 members running on Power servers
Integrated DB2 Cluster Services (Tivoli SA MP, RSCT, GPFS)
Proof of DB2 pureScale Architecture Scalability
How far will it scale?
Take a web-commerce-type workload – read mostly, but not read only
Don't make the application cluster aware – no routing of transactions to members; demonstrate transparent application scaling
Scale out to the 128-member limit and measure scalability
The Result
2, 4 and 8 Members: over 95% scalability
16 Members: over 95% scalability
32 Members: over 95% scalability
64 Members: 95% scalability
88 Members: 90% scalability
112 Members: 89% scalability
128 Members: 84% scalability
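The figures above express throughput efficiency relative to ideal linear scaling. A minimal sketch of that calculation, using hypothetical throughput numbers (not the actual benchmark data, which the deck does not give):

```python
# Relative scalability = (throughput ratio) / (member ratio), i.e. how
# close the cluster gets to ideal linear scaling from the baseline.
# The throughput figures below are hypothetical, for illustration only.

def relative_scalability(throughput, base=2):
    """throughput: dict of member count -> transactions/sec."""
    base_tps = throughput[base]
    return {n: (tps / base_tps) / (n / base)
            for n, tps in throughput.items()}

throughput = {2: 100.0, 64: 3060.0, 128: 5370.0}  # hypothetical tx/sec
eff = relative_scalability(throughput)
# eff[64] is roughly 0.96, eff[128] roughly 0.84 -- the same shape
# as the deck's reported curve.
```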
Scalability Example: 12 Member Cluster
Update workload – 1 update transaction for every 4 read transactions; a typical read/write ratio for many OLTP workloads
No cluster awareness in the application – no routing of transactions to members; demonstrates transparent application scaling
Redundant system – 14 8-core p550s, including duplexed PowerHA pureScale™
Scalability remains above 90%
Scalability for OLTP Applications
[Chart: Relative Scalability vs. Number of Members in the Cluster]
Application Transparency
Take advantage of extra capacity instantly – no need to modify your application code, no need to tune your database infrastructure
Developers don’t even need to know more nodes are being added
Administrators can add capacity without re-tuning or re-testing
DB2 pureScale is Easy to Deploy
Single installation for all components
Monitoring integrated into Optim tools
Single installation for fixpaks and updates
Simple command to add and remove members
DB2 pureScale: Technology Overview
[Diagram: clients see a single database view; a cluster of members, each with its own log and cluster services (CS), shares storage access to one shared database, connected over the cluster interconnect to primary and secondary servers]
DB2 engine runs on several host computers – members cooperate with each other to provide coherent access to the database from any member
Data sharing architecture – shared access to the database; members write to their own logs on shared disk; logs are accessible from other hosts (used during recovery)
PowerHA pureScale technology – efficient global locking and buffer management; synchronous duplexing to the secondary ensures availability
Low latency, high speed interconnect – special optimizations provide significant advantages on RDMA-capable interconnects (e.g. InfiniBand)
Clients connect anywhere and see a single database – clients can connect to any member; automatic load balancing and client reroute may change the underlying physical member to which a client is connected
Integrated cluster services – failure detection, recovery automation, cluster file system; in partnership with STG and Tivoli
Scale with Ease
[Diagram: single database view across a growing set of DB2 members, each with its own log]
Without changing applications – efficient coherency protocols designed to scale without application change; applications are automatically and transparently workload balanced across members
Without administrative complexity – no data redistribution required
To 128 members in the initial release – limited by testing resources
Online Recovery
A key DB2 pureScale design point is to maximize availability during failure recovery processing
When a database member fails, only data in-flight on the failed member remains locked – "in-flight" means data modifications that are part of active transactions on the member at the time it failed
Target member recovery time – under 20 seconds
[Chart: % of data available vs. time (~seconds) – at the member failure, availability dips briefly from 100% because only data with in-flight updates is locked during recovery]
Stealth Maintenance
Goal: allow DBAs to apply maintenance without negotiating an outage window
Procedure:
1. Drain (aka quiesce)
2. Remove and maintain
3. Re-integrate
4. Repeat until done
Information Management
© 2010 IBM Corporation14
What is a Member?
A DB2 engine address space – i.e. a db2sysc process and its threads
Members share data – all members access the same shared database (aka "data sharing")
Each member has its own bufferpools, memory regions, and log files
Members are logical; you can have 1 per machine or LPAR (recommended), or more than 1 per machine or LPAR (not recommended for production)
Member != database partition – a member is a db2sysc process; a database partition is a partition of the database
[Diagram: two db2sysc processes, Member 0 and Member 1, each with its own bufferpool(s), log buffer, dbheap and other heaps, and db2 agents and other threads; each writes to its own log on a shared database (a single database partition), alongside primary and secondary servers]
What is a PowerHA pureScale Server?
Software technology that assists in global buffer coherency management and global locking – derived from System z Parallel Sysplex and Coupling Facility technology; software based
Services provided include the Group Bufferpool (GBP), Global Lock Management (GLM), and the Shared Communication Area (SCA)
Members duplex GBP, GLM, and SCA state to both a primary and a secondary – done synchronously; duplexing is optional (but recommended) and set up automatically, by default
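The synchronous duplexing described above can be sketched as follows; the class and method names here are hypothetical illustrations, not DB2 internals:

```python
# Sketch of synchronous duplexing of caching-facility state
# (GBP/GLM/SCA) to a primary and a secondary server.

class CachingFacility:
    def __init__(self, name):
        self.name = name
        self.state = {}          # e.g. {"page:42": image, "lock:r17": owner}

    def apply(self, key, value):
        self.state[key] = value

class DuplexedCF:
    """A member's view: every state change is applied to BOTH the
    primary and the secondary before the call returns (synchronous),
    so the secondary can take over with no lost lock/page state."""
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, key, value):
        self.primary.apply(key, value)
        if self.secondary is not None:       # duplexing is optional
            self.secondary.apply(key, value)

    def failover(self):
        # Secondary becomes primary and carries on where it left off.
        self.primary, self.secondary = self.secondary, None
        return self.primary

primary, secondary = CachingFacility("primary"), CachingFacility("secondary")
cf = DuplexedCF(primary, secondary)
cf.write("lock:row17", "member0")
cf.write("page:42", b"new image")

new_primary = cf.failover()
# The former secondary already holds identical lock/page state,
# which is why no errors are returned to members on a primary failure.
```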
Cluster Interconnect
Requirements:
1. Low latency, high speed interconnect between the members and the primary and secondary PowerHA pureScale servers
2. RDMA-capable fabric, to make direct updates in memory without the need to interrupt the CPU
Solution: InfiniBand (IB) and uDAPL for performance – InfiniBand supports RDMA and is a low latency, high speed interconnect; uDAPL reduces kernel time on AIX
Cluster File System
Requirements:
1. Shared data requires shared disks and a cluster file system
2. Fencing of any failed members from the file system
Solution: General Parallel File System (GPFS) – shipped with, and installed and configured as part of, DB2
A pre-existing, user-managed GPFS file system is also supported – this allows GPFS to be managed at the same level across the enterprise; DB2 will not manage such a pre-existing file system, nor will it apply service updates to GPFS
SCSI-3 Persistent Reserve is recommended for rapid fencing
DB2 Cluster Services
Orchestrates:
Unplanned event notifications, to ensure seamless recovery and availability – member, PowerHA pureScale, AIX, hardware, and other unplanned events
Planned events – "stealth" maintenance
Integrates with RSCT, Tivoli SA MP and GPFS – packaged, shipped and installed with DB2 pureScale
Availability Goals
HA automation built in, "out of the box" – built-in failover, recovery and failback
Unplanned events (e.g. member and PowerHA pureScale server failures) – online recovery completes within ~20 seconds (OLTP); all data that is not being updated is fully available during recovery
Planned events (e.g. member maintenance) – "stealth" maintenance: no errors, no loss of data availability
Member SW Failure: "Member Restart on Home Host"
Scenario: kill -9 erroneously issued to a member
DB2 Cluster Services automatically detects the member's death – informs the other members and the PowerHA pureScale servers; initiates automated member restart on the same ("home") host
Member restart is like database crash recovery in a single-system database, but much faster – redo is limited to in-flight transactions (due to FAC); benefits from the page cache (GBP)
In the meantime, client connections are transparently re-routed to healthy members – based on least load (by default), or to a pre-designated failover member
Other members remain fully available throughout – "online failover"; the primary retains the update locks held by the member at the time of failure; other members can continue to read and update data not locked for write access by the failed member
Member restart completes – retained locks are released and all data is fully available; transaction work is routed to the restarted member
Automatic, ultra fast, and online.
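The "online failover" behaviour above can be sketched as a tiny lock-manager model; all names are illustrative, not DB2 internals:

```python
# Sketch: when a member dies, only rows it had write-locked (its
# in-flight updates) stay locked until its restart completes;
# every other row remains readable and updatable.

class GlobalLockManager:
    def __init__(self):
        self.write_locks = {}    # row -> owning member

    def lock_for_write(self, member, row):
        # A row can only be write-locked by one member at a time.
        assert self.write_locks.get(row, member) == member
        self.write_locks[row] = member

    def member_failed(self, member):
        # The primary retains the failed member's update locks.
        return {r for r, m in self.write_locks.items() if m == member}

    def restart_complete(self, member):
        # Retained locks released; all data fully available again.
        self.write_locks = {r: m for r, m in self.write_locks.items()
                            if m != member}

glm = GlobalLockManager()
glm.lock_for_write("member3", "row17")   # in-flight update on member 3
glm.lock_for_write("member1", "row99")

retained = glm.member_failed("member3")  # only row17 stays locked
glm.restart_complete("member3")          # row17 is available again
```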
Member HW Failure: "Member Restart on Guest Host" (aka "Restart Light")
Scenario: power cord tripped over accidentally
DB2 Cluster Services loses the heartbeat and declares the member down – informs the other members and the PowerHA pureScale servers; fences the member from logs and data; initiates automated member restart on another ("guest") host, using a reduced, pre-allocated memory model
Member restart is like database crash recovery in a single-system database, but much faster – redo is limited to in-flight transactions (due to FAC); benefits from the page cache in PowerHA pureScale
In the meantime, client connections are automatically re-routed to healthy members – based on least load (by default), or to a pre-designated failover member
Other members remain fully available throughout – "online failover"; the primary retains the update locks held by the member at the time of failure; other members can continue to read and update data not locked for write access by the failed member
Member restart completes – retained locks are released and all data is fully available
Automatic, ultra fast, and online.
Member Failback
Scenario: power restored and system rebooted
DB2 Cluster Services automatically detects system availability – informs the other members and the PowerHA pureScale servers; removes the fence; brings up the member on its home host
Client connections are automatically re-routed back to the member
Automatic Workload Balancing & Client Routing
Run-time load information is used to automatically balance load across the cluster (as in a System z sysplex)
Load information for all members is kept on each member – piggy-backed to clients regularly; used to route the next connection (or, optionally, the next transaction) to the least-loaded member; routing occurs automatically (transparent to the application)
Failover – the load of a failed member is evenly distributed to the surviving members automatically
Fallback – once the failed member is back online, fallback does the reverse
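The balancing and failover behaviour above can be sketched as a least-loaded router; this is an illustrative model, not the actual client driver logic:

```python
# Sketch: new connections go to the least-loaded member; when a
# member fails, its connections are redistributed automatically.

class Cluster:
    def __init__(self, members):
        self.load = {m: 0 for m in members}   # connections per member

    def route_connection(self):
        # Route to the least-loaded surviving member.
        member = min(self.load, key=self.load.get)
        self.load[member] += 1
        return member

    def member_failed(self, member):
        # The failed member's connections are evenly re-routed,
        # because each re-route picks the current least-loaded member.
        orphaned = self.load.pop(member)
        for _ in range(orphaned):
            self.route_connection()

cluster = Cluster(["m0", "m1", "m2", "m3"])
for _ in range(8):
    cluster.route_connection()        # 2 connections per member
cluster.member_failed("m3")           # m3's 2 connections re-routed
```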
Primary PowerHA pureScale Failure
Scenario: power cord tripped over accidentally
DB2 Cluster Services loses the heartbeat and declares the primary down – informs the members and the secondary; the PowerHA pureScale service is momentarily blocked; all other database activity proceeds normally (e.g. accessing pages in the bufferpool, existing locks, sorting, aggregation, etc.)
Members send missing data to the secondary – e.g. read locks
Secondary becomes primary – the PowerHA pureScale service continues where it left off; no errors are returned to DB2 members
Automatic, ultra fast, and online.
PowerHA pureScale Re-integration
Scenario: power restored and system rebooted
DB2 Cluster Services automatically detects system availability – the PowerHA pureScale server is started as the secondary; informs the members and the primary
The new system assumes the secondary role in 'catchup' state – members resume duplexing; members asynchronously send lock and other state information to the secondary; members asynchronously cast out pages from the primary to disk
Catchup complete – the secondary is in 'peer' state (it contains the same lock and page state as the primary)
Summary (Single Failures)
Failure mode – impact:
Member – only data that was in-flight on the failed member remains locked temporarily; connections to the failed member transparently move to another member
Primary PowerHA pureScale – momentary "blip" in the PowerHA pureScale service; transparent to members (in-flight PowerHA pureScale requests just take a few more seconds before completing normally)
Secondary PowerHA pureScale – momentary "blip" in the PowerHA pureScale service; transparent to members (in-flight PowerHA pureScale requests just take a few more seconds before completing normally)
In each case the other members remain online, and the handling is automatic and transparent.
Simultaneous Failures
Failure mode – impact:
Multiple members – only data that was in-flight on the failed members remains locked temporarily; recoveries are done in parallel; connections to the failed members transparently move to other members
Member plus primary PowerHA pureScale – same as a member failure, plus a momentary, transparent "blip" in the PowerHA pureScale service; connections to the failed member transparently move to another member
Member plus secondary PowerHA pureScale – same as a member failure, plus a momentary, transparent "blip" in the PowerHA pureScale service; connections to the failed member transparently move to another member
In each case the other members remain online, and the handling is automatic and transparent.
Group Crash Recovery (GCR)
Occurs when both the primary and secondary PowerHA pureScale servers fail simultaneously – or when the primary fails with the secondary not in peer state
DB2 Cluster Services automatically elects a member to drive GCR
GCR is very similar to crash recovery on a single-system database – redo and undo operations for updates not yet written to disk; in pureScale it operates against the merged logs
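The "merged logs" step can be sketched as a k-way merge of the per-member logs by log sequence number (LSN); the record format here is hypothetical:

```python
# Sketch: each member writes its own log, and group crash recovery
# replays the union of all member logs in global LSN order.
import heapq

def merged_redo(member_logs):
    """member_logs: list of per-member logs, each already sorted by
    LSN. Returns records across all logs in global LSN order."""
    return list(heapq.merge(*member_logs, key=lambda rec: rec[0]))

# Hypothetical (LSN, operation) records from two members' logs.
log_m0 = [(1, "update page 7"), (5, "update page 2")]
log_m1 = [(2, "update page 9"), (4, "update page 7")]

replay = merged_redo([log_m0, log_m1])
# Redo is applied in LSN order 1, 2, 4, 5 across both members' logs,
# just as a single-system crash recovery would walk one log.
```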
“Stealth” Maintenance : Example
1. Ensure automatic load balancing (the default) or automatic routing is enabled
2. db2stop member 3 quiesce <timeout>
3. db2stop instance on host <hostname>
   -- also --
   db2cluster -cm -enter -maintenance
4. Perform the desired maintenance (e.g. install an AIX PTF)
5. db2cluster -cm -exit -maintenance
   -- also --
   db2start instance on host <hostname>
6. db2start member 3
Scalability
Transaction scalability – key operations are designed for transaction scalability
Additional capacity – rapid addition and exploitation of additional capacity: add a member, add disk space
Good scaling without application change; no restrictions based on data ownership; flexible workload balancing and routing
Achieving Efficient Scaling: Key Design Points
Deep RDMA exploitation over a low latency fabric – enables round-trip response times as low as 10-15 microseconds (lock request)
Silent invalidation – informing members of page updates requires no CPU cycles on those members; no interrupt or other message processing required; increasingly important as the cluster grows
[Diagram: a member's buffer manager and lock manager exchange lock requests ("Can I have this lock?" / "Yup, here you are.") and new page images with the GBP/GLM/SCA]
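Silent invalidation can be sketched as follows: when a page is updated through the group bufferpool, a validity flag in each other member's local bufferpool is flipped directly (via RDMA on real hardware), so no interrupt or message handling runs on those members. The names here are illustrative, not DB2 internals:

```python
# Sketch of silent invalidation of stale local page copies.

class LocalBufferpool:
    def __init__(self):
        self.pages = {}      # page_id -> (image, valid_flag)

    def cache(self, page_id, image):
        self.pages[page_id] = (image, True)

    def read(self, page_id, gbp):
        image, valid = self.pages.get(page_id, (None, False))
        if not valid:
            image = gbp.pages[page_id]   # refetch the current image
            self.cache(page_id, image)
        return image

class GroupBufferpool:
    def __init__(self, members):
        self.pages = {}
        self.members = members

    def write(self, writer, page_id, image):
        self.pages[page_id] = image
        writer.cache(page_id, image)
        for m in self.members:
            # "Silently" mark stale copies invalid; the member itself
            # does no work until it next touches the page.
            if m is not writer and page_id in m.pages:
                old, _ = m.pages[page_id]
                m.pages[page_id] = (old, False)

m0, m1 = LocalBufferpool(), LocalBufferpool()
gbp = GroupBufferpool([m0, m1])
gbp.write(m0, "p42", "v1")
m1.read("p42", gbp)            # m1 now caches v1
gbp.write(m0, "p42", "v2")     # m1's copy is silently invalidated
```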
Adding Capacity

Initial installation:
Complete pre-requisite work: AIX installed, hosts on the network, access to shared disks enabled
Copies the DB2 pureScale image to the Install Initiating Host
Installs the code on the specified hosts using a response file
Creates the instance, members, and primary and secondary PowerHA pureScale servers as directed
Adds members, primary and secondary PowerHA pureScale servers, hosts, HCA cards, etc. to the domain resources
Creates the cluster file system and sets up each member's access to it

Adding a member:
1. Complete pre-requisite work: AIX installed, on the network, access to shared disks
2. Add the member: db2iupdt -add -m <MemHostName:MemIBHostName> InstName
3. DB2 does all the tasks needed to add the member to the cluster: copies the image and response file to host6, runs the install, adds M4 to the resources for the instance, and sets up access to the cluster file system for M4

You can also drop a member, and add or drop a PowerHA pureScale server.
Note: extending and shrinking the instance is an offline task in the initial release.

[Diagram: the Install Initiating Host copies the DB2 pureScale image locally, then scp's the image and response file to each host – Members 0-3 with cluster services, the primary, the secondary – and to the new host for Member 4]
Instance and Host Status

db2nodes.cfg:
0 host0 0 - MEMBER
1 host1 0 - MEMBER
2 host2 0 - MEMBER
3 host3 0 - MEMBER
4 host4 0 - CF
5 host5 0 - CF

> db2start
08/24/2008 00:52:59     0   0   SQL1063N  DB2START processing was successful.
08/24/2008 00:53:00     1   0   SQL1063N  DB2START processing was successful.
08/24/2008 00:53:01     2   0   SQL1063N  DB2START processing was successful.
08/24/2008 00:53:01     3   0   SQL1063N  DB2START processing was successful.
SQL1063N  DB2START processing was successful.

> db2instance -list
ID  TYPE    STATE    HOME_HOST  CURRENT_HOST  ALERT
0   MEMBER  STARTED  host0      host0         NO
1   MEMBER  STARTED  host1      host1         NO
2   MEMBER  STARTED  host2      host2         NO
3   MEMBER  STARTED  host3      host3         NO
4   CF      PRIMARY  host4      host4         NO
5   CF      PEER     host5      host5         NO

HOST_NAME  STATE   INSTANCE_STOPPED  ALERT
host0      ACTIVE  NO                NO
host1      ACTIVE  NO                NO
host2      ACTIVE  NO                NO
host3      ACTIVE  NO                NO
host4      ACTIVE  NO                NO
host5      ACTIVE  NO                NO
DB2 pureScale & Optim Tooling Solutions
Consistent and integrated approach to offering tools for DB2 pureScale – Optim tools offered for the enterprise and distributed versions of DB2 are fully aware of DB2 pureScale environments
Database administration – ability to perform common administration tasks across members and PowerHA pureScale servers; integrated navigation through shared-data instances
System monitoring – seamless view of status and statistics across all instances; includes information on locking, connections, storage, memory, etc.
Application development – full support for developing Java, C, and .NET applications against a pureScale environment
[Screenshots: launch the desired administration task assistant; select which member to quiesce before taking it offline; select quiesce options that define how and when the action should occur; then view, modify, or execute the commands to complete the task]
DB2 pureScale
Unlimited Capacity – start small; grow easily, with your business
Application Transparency – avoid the risk and cost of tuning your applications to the database topology
Continuous Availability – maintain service across planned and unplanned events