
Oracle 10g RAC Scalability – Lessons Learned


Page 1: Oracle 10g RAC Scalability – Lessons Learned

Oracle 10g RAC Scalability – Lessons Learned

Bert Scalzo, Ph.D.

[email protected]

Page 2: Oracle 10g RAC Scalability – Lessons Learned

About the Author

• Oracle Dev & DBA for 20 years, versions 4 through 10g
• Worked for Oracle Education & Consulting
• Holds several Oracle Masters (DBA & CASE)
• BS, MS, PhD in Computer Science and also an MBA
• LOMA insurance industry designations: FLMI and ACS
• Books

– The TOAD Handbook (March 2003)
– Oracle DBA Guide to Data Warehousing and Star Schemas (June 2003)
– TOAD Pocket Reference 2nd Edition (June 2005)

• Articles
– Oracle Magazine
– Oracle Technology Network (OTN)
– Oracle Informant
– PC Week (now eWeek)
– Linux Journal
– www.Linux.com

Page 3: Oracle 10g RAC Scalability – Lessons Learned

About Quest Software

Used in this paper

Page 4: Oracle 10g RAC Scalability – Lessons Learned

This paper is based upon collaborative RAC research efforts between Quest Software and Dell Computers.

Quest:
• Bert Scalzo
• Murali Vallath – author of RAC articles and books

Dell:
• Anthony Fernandez
• Zafar Mahmood

Also, an extra special thanks to Dell for allocating a million dollars' worth of equipment to make such testing possible.

Project Formation

Page 5: Oracle 10g RAC Scalability – Lessons Learned

Quest:
• To partner with a leading hardware vendor
• To field test and showcase our RAC enabled software

– Spotlight on RAC
– Benchmark Factory
– TOAD for Oracle with DBA module

Dell:
• To write a Dell Power Edge Magazine article about the OLTP scalability of Oracle 10g RAC running on typical Dell servers and EMC storage arrays
• To create a standard methodology for all benchmarking of database servers, to be used for future articles and for lab testing & demonstration purposes

Project Purpose

Page 6: Oracle 10g RAC Scalability – Lessons Learned

OLTP Benchmarking

TPC benchmark (www.tpc.org)

TPC Benchmark™ C (TPC-C) is an OLTP workload. It is a mixture of read-only and update intensive transactions that simulate the activities found in complex OLTP application environments. It does so by exercising a breadth of system components associated with such environments, which are characterized by:

• The simultaneous execution of multiple transaction types that span a breadth of complexity
• On-line and deferred transaction execution modes
• Multiple on-line terminal sessions
• Moderate system and application execution time
• Significant disk input/output
• Transaction integrity (ACID properties)
• Non-uniform distribution of data access through primary and secondary keys
• Databases consisting of many tables with a wide variety of sizes, attributes, and relationships
• Contention on data access and update

Excerpt from “TPC BENCHMARK™ C: Standard Specification, Revision 3.5”

Page 7: Oracle 10g RAC Scalability – Lessons Learned

The TPC-C like benchmark measures on-line transaction processing (OLTP) workloads. It combines read-only and update intensive transactions simulating the activities found in complex OLTP enterprise environments.

Create the Load - Benchmark Factory

Page 8: Oracle 10g RAC Scalability – Lessons Learned

Monitor the Load - Spotlight on RAC

Page 9: Oracle 10g RAC Scalability – Lessons Learned

Hardware & Software – Servers, Storage and Software

Oracle 10g RAC Cluster Servers
10 x 2-CPU Dell PowerEdge 1850
3.8 GHz P4 processors with HT
4 GB RAM (later expanded to 8 GB RAM)
1 x 1 Gb NIC (Intel) for LAN
2 x 1 Gb LOM teamed for RAC interconnect
1 x two-port HBA (QLogic 2342)
DRAC

RHEL AS 4 QU1 (32-bit)
EMC PowerPath 4.4
EMC Navisphere agent
Oracle 10g R1 10.1.0.4
Oracle ASM 10.1.0.4
Oracle Cluster Ready Services 10.1.0.4
Linux bonding driver for interconnect
Dell OpenManage

Benchmark Factory Servers
2 x 4-CPU Dell PowerEdge 6650
8 GB RAM

Windows 2003 Server
Quest Benchmark Factory Application
Quest Benchmark Factory Agents
Quest Spotlight on RAC
Quest TOAD for Oracle

Storage
1 x Dell | EMC CX700
1 x DAE unit: total 30 x 73 GB 15K RPM disks
RAID Group 1: 16 disks having 4 x 50 GB RAID 1/0 LUNs for data and backup
RAID Group 2: 10 disks having 2 x 20 GB RAID 1/0 LUNs for redo logs
RAID Group 3: 4 disks having 1 x 5 GB RAID 1/0 LUN for voting disk, OCR, and spfiles
2 x Brocade SilkWorm 3800 Fibre Channel switches (16 port)
Configured with 8 paths to each logical volume

Flare Code Release 16
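The slides do not show how these LUNs were presented to ASM. Purely as a hedged illustration (device paths, disk group names, and the use of ASMLib are assumptions, not details from the test systems), the general 10g R1 pattern looks like this, using external redundancy because the CX700 LUNs are already RAID 1/0 protected in hardware:

    # Hypothetical sketch only – device names and disk group names are illustrative
    # Stamp the PowerPath devices as ASM disks (assumes the ASMLib packages are installed)
    /etc/init.d/oracleasm createdisk DATA1 /dev/emcpowera1
    /etc/init.d/oracleasm createdisk REDO1 /dev/emcpowerb1

    # Create disk groups from an ASM instance; EXTERNAL REDUNDANCY defers mirroring
    # and striping to the CX700 RAID 1/0 LUNs
    export ORACLE_SID=+ASM1
    sqlplus / as sysdba <<'EOF'
    CREATE DISKGROUP data EXTERNAL REDUNDANCY DISK 'ORCL:DATA1';
    CREATE DISKGROUP redo EXTERNAL REDUNDANCY DISK 'ORCL:REDO1';
    EOF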

Network
1 x Gigabit 5224 Ethernet switch (24 port) for private interconnect
1 x Gigabit 5224 Ethernet switch for public LAN

Linux bonding driver used to team the dual onboard NICs for the private interconnect
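The bonding setup itself is not shown in the slides; on RHEL 4 a teamed interconnect is typically configured along these lines (interface names, bonding mode, and addresses below are illustrative assumptions only):

    # Hypothetical RHEL 4 NIC bonding sketch for the private interconnect
    cat >> /etc/modprobe.conf <<'EOF'
    alias bond0 bonding
    options bonding miimon=100 mode=balance-alb
    EOF

    cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
    DEVICE=bond0
    IPADDR=192.168.10.11
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    EOF

    cat > /etc/sysconfig/network-scripts/ifcfg-eth1 <<'EOF'
    DEVICE=eth1
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none
    EOF
    # ifcfg-eth2 mirrors ifcfg-eth1, then restart networking
    service network restart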

Page 10: Oracle 10g RAC Scalability – Lessons Learned

TOAD for Oracle

Page 11: Oracle 10g RAC Scalability – Lessons Learned

Setup Planned vs. Actual

Planned:
• Redhat 4 Update 1 64-bit
• Oracle 10.2.0.1 64-bit

Actual:
• Redhat 4 Update 1 32-bit
• Oracle 10.1.0.4 32-bit

Issues:
• Driver problems with 64-bit (no real surprise)
• Some software incompatibilities with 10g R2
• Known ASM issues require 10.1.0.4, not earlier

Page 12: Oracle 10g RAC Scalability – Lessons Learned

Testing Methodology – Steps 1 A-C

1. For a single node and instance

a. Establish a fundamental baseline
   i. Install the operating system and Oracle database (keeping all normal installation defaults)
   ii. Create and populate the test database schema
   iii. Shutdown and startup the database
   iv. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a baseline for default operating system and database settings

b. Optimize the basic operating system
   i. Manually optimize typical operating system settings
   ii. Shutdown and startup the database
   iii. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a new baseline for basic operating system improvements
   iv. Repeat the prior three steps until a performance balance results

c. Optimize the basic non-RAC database
   i. Manually optimize typical database “spfile” parameters
   ii. Shutdown and startup the database
   iii. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a new baseline for basic Oracle database improvements
   iv. Repeat the prior three steps until a performance balance results

Page 13: Oracle 10g RAC Scalability – Lessons Learned

d. Ascertain the reasonable per-node load
   i. Manually optimize scalability database “spfile” parameters
   ii. Shutdown and startup the database
   iii. Run an increasing user load benchmark (e.g. TPC-C for 100 to 800 users, increment by 100) to find the “sweet spot” of how many concurrent users a node can reasonably support
   iv. Monitor the benchmark run via the vmstat command, looking for the point where excessive paging and swapping begins – and where the CPU idle time consistently approaches zero (see the vmstat sketch at the end of this slide)
   v. Record the “sweet spot” number of concurrent users – this represents an upper limit
   vi. Reduce the “sweet spot” number of concurrent users by some reasonable percentage to account for RAC architecture and inter/intra-node overheads (e.g. reduce by say 10%)

e. Establish the baseline RAC benchmark
   i. Shutdown and startup the database
   ii. Create an increasing user load benchmark based upon the node count and the “sweet spot” (e.g. TPC-C for 100 to node count * sweet spot users, increment by 100)
   iii. Run the baseline RAC benchmark

Testing Methodology – Steps 1 D-E
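A minimal sketch of the monitoring called for in step d-iv above (the interval and sample count are arbitrary choices):

    # Watch swap-in/swap-out (si/so) and CPU idle (id) while the user load ramps up;
    # sustained non-zero si/so with idle approaching 0 marks the thrashing threshold
    vmstat 5
    # Or capture samples for later review (300 samples at 5-second intervals)
    vmstat 5 300 > vmstat_node1.log &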

Page 14: Oracle 10g RAC Scalability – Lessons Learned

Linux kernel parameters in /etc/sysctl.conf:
kernel.shmmax = 2147483648
kernel.sem = 250 32000 100 128
fs.file-max = 65536
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144

Step 1B - Optimize Linux Kernel
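After editing /etc/sysctl.conf, the settings can be loaded without a reboot and spot-checked, for example:

    # Reload kernel parameters from /etc/sysctl.conf and verify one of them
    sysctl -p
    sysctl kernel.shmmax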

Page 15: Oracle 10g RAC Scalability – Lessons Learned

Step 1C - Optimize Oracle Binaries

Oracle compiled & linked for Asynchronous IO:

1. cd to $ORACLE_HOME/rdbms/lib
   a. make -f ins_rdbms.mk async_on
   b. make -f ins_rdbms.mk ioracle

2. Set necessary “spfile” parameter settings
   a. disk_asynch_io = true (default value is true)
   b. filesystemio_options = setall (for both async and direct IO)

Note that in Oracle 10g Release 2 asynchronous IO is now compiled & linked in by default.
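Putting the two steps together, a minimal sketch of the relink-and-configure sequence (run as the Oracle software owner, with the instances on that home shut down during the relink) might look like this:

    # 1. Relink the 10g R1 binaries with asynchronous IO enabled
    cd $ORACLE_HOME/rdbms/lib
    make -f ins_rdbms.mk async_on
    make -f ins_rdbms.mk ioracle

    # 2. Start the instance and set the IO parameters in the spfile
    #    (disk_asynch_io already defaults to true; filesystemio_options is static,
    #     so it takes effect on the next restart)
    sqlplus / as sysdba <<'EOF'
    STARTUP
    ALTER SYSTEM SET filesystemio_options = SETALL SCOPE=SPFILE;
    SHUTDOWN IMMEDIATE
    STARTUP
    EOF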

Page 16: Oracle 10g RAC Scalability – Lessons Learned

spfile adjustments shown below:

cluster_database=true
cluster_database_instances=10
db_block_size=8192
processes=16000
sga_max_size=1500m
sga_target=1500m
pga_aggregate_target=700m
db_writer_processes=2
open_cursors=300
optimizer_index_caching=80
optimizer_index_cost_adj=40

The key idea was to eke out as much SGA memory usage as possible within the 32-bit operating system limit (about 1.7 GB). Since our servers had only 4 GB of RAM each, we figured that allocating half to Oracle was sufficient – with the remaining memory to be shared by the operating system and the thousands of dedicated Oracle server processes that the TPC-C like benchmark would be creating as its user load.

Step 1C - Optimize Oracle SPFILE
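With a shared spfile in a RAC database, adjustments such as those above are applied cluster-wide with SID='*'; a hedged example using values from the list above:

    sqlplus / as sysdba <<'EOF'
    -- Apply a few of the listed settings to every instance via the shared spfile
    ALTER SYSTEM SET sga_max_size = 1500M SCOPE=SPFILE SID='*';
    ALTER SYSTEM SET sga_target = 1500M SCOPE=SPFILE SID='*';
    ALTER SYSTEM SET pga_aggregate_target = 700M SCOPE=SPFILE SID='*';
    ALTER SYSTEM SET processes = 16000 SCOPE=SPFILE SID='*';
    -- Static parameters (e.g. sga_max_size, processes) require restarting each instance
    EOF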

Page 17: Oracle 10g RAC Scalability – Lessons Learned

Step 1D – Find Per Node Sweet Spot

Finding the ideal per node sweet spot is arguably the most critical aspect of the entire benchmark testing process – and especially for RAC environments with more than just a few nodes.

What we did at first:
• We initially ran a 100-800 user TPC-C on the single node
• We did not monitor the database server using the vmstat command
• We simply looked at the BMF transactions per second graph, which was positive to beyond 700 users
• We assumed this meant the “sweet spot” was 700 users per node (and did not factor in any overhead)

What was happening in reality:
• The operating system was being overstressed and exhibited thrashing characteristics at about 600 users
• Running benchmarks for 700 users per node did not scale either reliably or predictably beyond four servers
• Our belief is that by taking each box to a near-thrashing threshold through our overzealous per-node user load selection, the nodes did not have sufficient resources available to communicate in a timely enough fashion for inter/intra-node messaging – and thus Oracle began to think that nodes were either dead or non-responsive

Furthermore, when relying upon Oracle’s client- and server-side load balancing feature, which allocates connections based upon which nodes respond, the user load per node became skewed and exceeded our per-node “sweet spot” value. For example, when we tested 7,000 users across 10 nodes, some nodes appeared dead to Oracle, so the load balancer simply directed all the sessions to whatever nodes were responding. We ended up with nodes trying to handle far more than 700 users each – and thus the thrashing was even worse.

Page 18: Oracle 10g RAC Scalability – Lessons Learned

Sweet Spot Lessons Learned
• Cannot solely rely on the BMF transactions per second graph
• Throughput can still be increasing while the server is beginning to thrash
• Need to monitor the database server with vmstat and other tools
• Must stop just shy of bandwidth challenges (RAM, CPU, IO)
• Must factor in multi-node overhead, and reduce accordingly
• Prior to 10g R2, better to rely on application (BMF) load balancing
• If you’re not careful on this step, you’ll run into roadblocks which either invalidate your results or simply will not scale!
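For context, the Oracle client-side load balancing referred to above is driven by the Oracle Net alias; a hedged sketch of such an entry follows (host, alias, and service names are hypothetical, not those of the test cluster):

    # Hypothetical tnsnames.ora entry with client-side load balancing enabled
    cat >> $ORACLE_HOME/network/admin/tnsnames.ora <<'EOF'
    RACDB =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = ON)
          (FAILOVER = ON)
          (ADDRESS = (PROTOCOL = TCP)(HOST = racdb1)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = racdb2)(PORT = 1521))
        )
        (CONNECT_DATA = (SERVICE_NAME = racdb))
      )
    EOF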

Page 19: Oracle 10g RAC Scalability – Lessons Learned

2. For 2nd through Nth nodes and instances

a. Duplicate the environment

i. Install the operating system

ii. Duplicate all of the base node’s operating system settings

b. Add the node to the cluster

i. Perform node registration tasks

ii. Propagate the Oracle software to the new node

iii. Update the database “spfile” parameters for the new node

iv. Alter the database to add node specific items (e.g. redo logs – see the example at the end of this slide)

c. Run the baseline RAC benchmark

i. Update the baseline benchmark criteria to include user load scenarios from the prior run’s maximum up to the new maximum (node count * “sweet spot” of concurrent users), using the baseline benchmark’s constant increment

ii. Shutdown and startup the database – adding the new instance

iii. Run the baseline RAC benchmark

iv. Plot the transactions per second graph showing this run versus all the prior baseline benchmark runs – the results should show a predictable and reliable scalability factor

Testing Methodology – Steps 2 A-C
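As an illustration of step 2b-iv above, the node-specific database changes for a new instance typically look like the following (thread, group, file sizes, disk group, and SID names are hypothetical examples, not taken from the test cluster):

    sqlplus / as sysdba <<'EOF'
    -- Add and enable a redo thread for the new instance (names and sizes illustrative)
    ALTER DATABASE ADD LOGFILE THREAD 2
      GROUP 3 ('+DATA') SIZE 100M,
      GROUP 4 ('+DATA') SIZE 100M;
    ALTER DATABASE ENABLE PUBLIC THREAD 2;

    -- Give the new instance its own undo tablespace and instance-specific parameters
    CREATE UNDO TABLESPACE undotbs2 DATAFILE '+DATA' SIZE 500M;
    ALTER SYSTEM SET undo_tablespace = undotbs2 SCOPE=SPFILE SID='racdb2';
    ALTER SYSTEM SET instance_number = 2 SCOPE=SPFILE SID='racdb2';
    ALTER SYSTEM SET thread = 2 SCOPE=SPFILE SID='racdb2';
    EOF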

Page 20: Oracle 10g RAC Scalability – Lessons Learned

With the correct per-node user load now identified and load balancing guaranteed, it was a very simple (although time-consuming) exercise to run the TPC-C like benchmarks listed below:

1 Node: 100 to 500 users, increment by 100
2 Nodes: 100 to 1000 users, increment by 100
4 Nodes: 100 to 2000 users, increment by 100
6 Nodes: 100 to 3000 users, increment by 100
8 Nodes: 100 to 4000 users, increment by 100
10 Nodes: 100 to 5000 users, increment by 100

Benchmark Factory’s default TPC-C like test iteration requires about 4 minutes for a given user load. So for the single node with five user load scenarios, the overall OLTP benchmark test run requires 20 minutes.
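By the same arithmetic (and assuming the 4 minute figure holds as nodes are added), the 10-node series covers 50 user load scenarios, or roughly 50 x 4 = 200 minutes, and the full 1-through-10-node series above adds up to about 155 scenarios – a little over 10 hours of total benchmark run time.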

During the entire testing process the load was monitored using Spotlight on RAC to identify any hiccups.

Step 2C – Run OLTP Test per Node

Page 21: Oracle 10g RAC Scalability – Lessons Learned

As illustrated below, when we reached our four-node tests we identified that CPU utilization on nodes racdb1 and racdb3 reached 84% and 76% respectively. Analyzing the root cause, the problem was related to a temporary overload of users on these servers and to ASM response time.

Some Speed Bumps Along the Way

Page 22: Oracle 10g RAC Scalability – Lessons Learned

We increased the following parameters on the ASM instance, ran our four-node tests again, and all was well beyond this point:

Parameter         Default Value   New Value
SHARED_POOL_SIZE  32M             67M
LARGE_POOL_SIZE   12M             67M

These were the only parameter changes we had to make to the ASM instance, and beyond this everything worked smoothly.

Some ASM Fine Tuning Necessary
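A hedged sketch of how such a change is applied (assuming the ASM instances use an spfile; sizes taken from the table above):

    # Repeat on each node's ASM instance (+ASM1, +ASM2, ...); the dependent database
    # instances should be shut down first, since bouncing ASM takes its clients down too
    export ORACLE_SID=+ASM1
    sqlplus / as sysdba <<'EOF'
    ALTER SYSTEM SET shared_pool_size = 67M SCOPE=SPFILE;
    ALTER SYSTEM SET large_pool_size = 67M SCOPE=SPFILE;
    SHUTDOWN IMMEDIATE
    STARTUP
    EOF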

Page 23: Oracle 10g RAC Scalability – Lessons Learned

Shown below are the cluster-level latency charts from Spotlight on RAC during our eight-node run. They indicated that the interconnect latency was well within expectations and on par with typical industry network latency numbers.

Smooth Sailing After That

Page 24: Oracle 10g RAC Scalability – Lessons Learned

As shown below, ASM was performing extremely well at this user load. Ten instances with over 5,000 users still saw excellent service times from ASM, and the I/O rate was impressively high – topping 2,500 I/Os per second!

Full Steam Ahead!

Page 25: Oracle 10g RAC Scalability – Lessons Learned

Final Results

Other than some basic monitoring to make sure that all is well and the tests are working, there’s really not very much to do while these tests run – so bring a good book to read. The final results are shown below.

Page 26: Oracle 10g RAC Scalability – Lessons Learned

The results are quite interesting. As the previous graph clearly shows, Oracle’s RAC and ASM are very predictable and reliable in terms of scalability. Each successive node continues the near-linear line almost without issue. There are three or four noticeable troughs in the graph for the 8 and 10 node test runs that seem out of place. Note that we had one database instance that was throwing numerous ORA-00600 [4194] errors related to its UNDO tablespace, and that one node took significantly longer to start up and shut down than all the other nodes combined. A search of Oracle’s MetaLink web site located references to a known problem that would require a database restore or rebuild. Since we were tight on time, we decided to ignore those few valleys in the graph, because it is clear from the overall results that smoothing over those few inconsistent points would yield a near-perfect graph – showing that RAC is truly reliable and predictable in terms of scalability.

Interpreting the Results

Page 27: Oracle 10g RAC Scalability – Lessons Learned

Projected RAC Scalability

Using the 6 node graph results to project forward, the figure below shows a reasonable expectation in terms of realizable scalability – where 17 nodes should deliver nearly 500 TPS and support about 10,000 concurrent users.
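Working backwards from those projected figures: 500 TPS across 17 nodes is roughly 29 TPS per node, and 10,000 users across 17 nodes is roughly 590 users per node – consistent with the per-node sweet spot of around 500-600 users discussed earlier.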

Page 28: Oracle 10g RAC Scalability – Lessons Learned

Next Steps …

• Since the first iteration of testing was limited by memory, we upgraded each database server from 4 GB to 8 GB of RAM

• Now able to scale up to 50% more users per node
• Now doing zero percent paging and/or swapping
• But – now CPU bound

• Next step: replace each CPU with a dual-core Pentium

• Increase from 4 CPUs (2 real / 2 virtual) to 8 CPUs
• Should be able to double users again???
• Will we now reach IO bandwidth limits???

• Will be writing about those results in future Dell articles…

Page 29: Oracle 10g RAC Scalability – Lessons Learned

Conclusions …

There were a few minor hiccups in the initial round, where we tried to determine the optimal user load on a node for the given hardware and processor configuration.

The scalability of the RAC cluster was outstanding. The addition of every node to the cluster showed steady, close to linear scalability – close to linear rather than perfectly linear because of the small overhead that the cluster interconnect consumes during block transfers between instances.

The interconnect also performed very well. In this particular case the NIC pairing/bonding feature of Linux was implemented to provide load balancing across the redundant interconnects, which also helped provide availability should any one interconnect fail.

The Dell | EMC storage subsystem, which consisted of six ASM disk groups for the various data file types, performed with high throughput, also indicating high scalability. EMC PowerPath provided IO load balancing and redundancy utilizing the dual Fibre Channel host bus adapters on each server.

It is the unique architecture of RAC that makes this possible: irrespective of the number of instances in the cluster, the maximum number of hops performed before the requestor gets the requested block will never exceed three. This architecture removes limitations found in clustering technology available from other database vendors, giving maximum scalability – as demonstrated by the tests above.

Oracle® 10g Real Application Clusters (RAC) running on standards-based Dell™ PowerEdge™ servers and Dell/ EMC storage can provide a flexible, reliable platform for a database grid.

In particular, Oracle 10g RAC databases on Dell hardware can easily be scaled out to provide the redundancy or additional capacity that the grid environment requires.

Page 30: Oracle 10g RAC Scalability – Lessons Learned

Questions …

Thanks for coming