Large RAC Benchmark – Lessons Learned
Hanan Hit
NOCOUG, Feb 14, 2006


Page 1: Large RAC Benchmark – Lessons Learned Hanan Hit NOCOUG Feb 14, 2006

Large RAC Benchmark – Lessons Learned
Hanan Hit
NOCOUG, Feb 14, 2006

Page 2:

Today’s agenda

• SUT Definitions
• Benchmark Goals
• Used RAC Mode
• OS (HP-UX) best practices
• HMP - Hyper Messaging Protocol
• UDP Usage
• OS Scheduling
• Deadlocks
• Startup & Shutdown
• Execution Plans
• ORA Parameters
• Multiple Block Sizes

Page 3:

The Tested Application

• Type 1 – PowerBuilder front end on Windows, Tuxedo middleware, and an Oracle backend.

• Type 2 – Java on Sun servers using WebLogic as the middle tier.

• Type 3 - Batch processing written in Pro*C/Pro*Cobol.

• All the Oracle database servers are HP machines running Oracle 9.2.0.5 on HP-UX 11i v2.

Page 4:

SUT

Node    RAC Instance   CPUs   Memory (GB)   SGA (GB)
HP001   RAC1           128    512           71
HP002   RAC2           64     256           71
HP003   RAC3           32     128           50

Page 5:

SUT – (Cont.)

• Database size: 30TB
• RAW devices
• Non-shared Oracle Home
• Two full HP XP 12K storage arrays:
  82TB each (1152 disks, 73GB 15K RPM)
  128GB cache
  120,000 IOPS (max random disk performance)

Page 6:

Machines

[Diagram recap:]
• App Servers #1-#16: Sun Fire 6800 (24 CPU, 80GB)
• Web Servers #1-#38: Sun 480 (4 CPU), running iPlanet
• Tuxedo Servers #1-#4: RP8420 (32 CPU)

Page 7:

Machines – (Cont.)

Page 8:

Machines – (Cont.)

Page 9:

Benchmark Goals

• Manual Load Balancing
  Test 1 – Online only: 6K Type 1, 3.5K Type 2.
  All connections to the RAC instances should be balanced, and throughput should be comparable with an earlier non-RAC test environment.

  Test 2 – Online + Batch: 6K Type 1, 3.5K Type 2, 500 Type 3.
  All connections to the RAC instances should be balanced, throughput should be comparable with the earlier non-RAC tests, and Type 3 timings should be comparable with the non-RAC tests.

  Test 3 – Online only: 12K Type 1, 7K Type 2.
  All connections to the RAC instances should be balanced, and throughput should increase by 95% compared with Test 1.

Page 10:

Benchmark Goals – (Cont.)

• RAC Load Balancing – Test 1, Online only: 6K Type 1, 3.5K Type 2.
  All connections to the RAC instances should be balanced, and throughput should be comparable with an earlier non-RAC test environment.
  Configuration/Setup:
  o Amend /etc/tnsnames.ora with the RAC load-balancing connect strings.

• RAC Failover Testing – Test 1, Online only: 6K Type 1, 3.5K Type 2.
  All the database connections that were connected to the RAC3 instance should fail over successfully to the RAC1 instance.
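The load-balancing connect string mentioned above can be sketched as a tnsnames.ora alias. Host names, ports, and the service name below are hypothetical; the FAILOVER_MODE block enables Oracle 9i TAF, matching the failover test:

```
# Sketch only: RAC load-balancing alias with TAF (names/ports are assumptions).
RACDB =
  (DESCRIPTION =
    # Spread new connections across all three listeners
    (LOAD_BALANCE = yes)
    (FAILOVER = on)
    (ADDRESS = (PROTOCOL = TCP)(HOST = hp001)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = hp002)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = hp003)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = racdb)
      # TAF: surviving SELECTs are replayed on the failover instance
      (FAILOVER_MODE = (TYPE = select)(METHOD = basic))
    )
  )
```

With this alias in place, a session connected to RAC3 would reconnect through one of the remaining addresses when that instance goes down.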

Page 11:

RAC - Mode

[Diagram: the application or middleware connects through IP (A) and IP (B) to a listener on Node A (Primary) and Node B (Primary), each fronting its own Oracle instance; on failure, connections fail over between the nodes.]

Page 12:

OS Port-Specific Best Practices

• IPC protocols:
  HP’s HMP
  HP Tru64’s RDG (Reliable DataGram)
  Sun’s RSM (Remote Shared Memory)
  VIA on Intel

• In general, the recommendation is to use a low-latency user-mode IPC protocol rather than UDP or TCP.

Page 13:

What is HyperFabric

• HyperFabric is a high-speed cluster interconnect fabric that supports both industry-standard TCP/UDP over IP and HP’s proprietary Hyper Messaging Protocol (HMP). HyperFabric extends the scalability and reliability of TCP/UDP by providing transparent load balancing of connection traffic across multiple network interface cards (NICs) and transparent failover of traffic from one card to another without invoking MC/ServiceGuard.

The HyperFabric NIC incorporates a network processor that implements HMP and provides lower latency and lower host CPU utilization for standard TCP/UDP benchmarks over HyperFabric when compared to Gigabit Ethernet.

Hewlett-Packard released HyperFabric in 1998 with a link rate of 2.56 Gbps over copper. In 2001, it released HyperFabric 2 with a link rate of 4.0 Gbps over fiber and compatibility with the copper HyperFabric interface. Both HyperFabric products support clusters of up to 64 nodes.

Page 14:

HyperFabric Switches

• Hewlett-Packard provides the fastest cluster interconnect via its proprietary HyperFabric switches. The latest product is HyperFabric 2, a new set of hardware components with fiber connectors that enable a low-latency, high-bandwidth system interconnect.

With fiber interfaces, HyperFabric 2 provides speeds of up to 4 Gbps in full duplex over distances of up to 200 meters.

HyperFabric 2 also scales well, supporting up to 16 hosts via point-to-point connectivity and up to 64 hosts via fabric switches. It is backward compatible with previous versions of HyperFabric and is available on IA-64 and PA-RISC servers.

Page 15:

Hyper Messaging Protocol (HMP)

• HMP, also known as “Lowfat”, is HP’s reliable user-mode IPC protocol. It significantly expands on the feature set provided by TCP/UDP by providing a true reliable datagram model for both remote direct memory access (RDMA) and traditional message semantics.

Coupled with OS-bypass capability and the hardware protocol offload provided by HyperFabric, HMP delivers high bandwidth, low latency, and extremely low CPU utilization, with an interface and feature set optimized for business-critical parallel applications such as Oracle9i RAC.

Page 16:

How to Relink Oracle with HMP?

• OS Configuration
  setprivgrp dba MLOCK
  Persistent over reboots => /etc/privgroup: dba MLOCK
  Kernel MAXFILES must be at least 1024
• HMP configuration in /opt/clic/lib/skgxp/skclic.conf
• HMP relink:
  make -f ins_rdbms.mk rac_on ipc_hms ioracle
• UDP relink (back to default):
  make -f ins_rdbms.mk rac_on ipc_udp ioracle
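For context, the relink is normally run from the RDBMS library directory with all instances on the node shut down; a command sketch (paths assumed, not from the original slides):

```
# Sketch only: relink Oracle 9.2 with HMP on HP-UX.
# Assumes $ORACLE_HOME points at the 9.2.0.5 home and all local
# instances are down before relinking.
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk rac_on ipc_hms ioracle
# To revert to the default UDP transport later:
#   make -f ins_rdbms.mk rac_on ipc_udp ioracle
```

In RAC, the same relink must be repeated on every node so all instances use the same IPC protocol.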

Page 17:

HMP Diagnostics

• Check the alert.log and verify that you are using the HMP protocol.

• netstat -in
  Check that the CLIC #N interfaces are up on all nodes and that the MTU is at least 32K.

• The cluster_interconnects init.ora (spfile) parameter will not show the IP address and shouldn’t be used. Instead:
  > oradebug setmypid
  > oradebug ipc

Page 18:

HMP Findings

• Each Oracle shadow process requires 300-400K of additional memory.

• The system couldn’t scale the number of concurrent connections beyond ~1K, and no additional connections could be established to either instance once this threshold had been reached.

• The maximum (single-card) memory was 4GB (during the benchmark timeframe).

• Once the above threshold was reached, we couldn’t start more than a single instance; the others kept waiting. Direct connections could still be established to the single node that was active.

Page 19:

HMP Findings (Cont.)

• netstat may show that most of the traffic goes to a single interface, but this is not a concern: the traffic is load-balanced in the fabric, and the command cannot observe it.

• We rolled back to UDP over HyperFabric, as this is usually more than “good enough”.

Page 20:

UDP Findings

• Use UDP over HyperFabric, as it can use multiple cards.

• Because of the failover scenarios, we had to be able to cope with at least twice the node load on a single node, hence a minimum of 20K concurrent connections. The initial setup failed with ORA-27504, pointing to OS error 227.

  The cause was an OS limitation on the maximum number of UDP ports, which defaults to 16384 (49152 through 65535).

  The solution is to lower the starting port from 49152 to 20000:
  /usr/bin/ndd -set /dev/udp udp_smallest_anon_port 20000
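The ndd change above does not survive a reboot. On HP-UX it is typically persisted in /etc/rc.config.d/nddconf; a sketch, assuming this is the first tunable in the file (index [0]):

```
# Sketch: persist the lower UDP anonymous-port floor across reboots
# (HP-UX /etc/rc.config.d/nddconf; the [0] index is an assumption).
TRANSPORT_NAME[0]=udp
NDD_NAME[0]=udp_smallest_anon_port
NDD_VALUE[0]=20000
```

With 20000 through 65535 available, roughly 45K anonymous ports exist per node, comfortably above the 20K-connection failover requirement.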

Page 21:

Scheduling & process prioritization

• Most UNIX OSes implement time-sharing (TS) and real-time (RT) scheduling.

• The default for user processes is TS.

• With the regular (TS) scheduler, the longer a process runs, the weaker its priority gets and the greater the chance it will be preempted.

• The cost of a context switch is likely to be very high, especially for the LMD/LMS processes in RAC.

• The SCHED_NOAGE scheduler allows processes to hold the CPU until they give it up.

• Starting all processes at the same (highest) priority of this scheduler removes the possibility that they will ‘steal’ the CPU from each other.

Page 22:

SCHED_NOAGE

• Unlike the normal TS policy, a process scheduled with SCHED_NOAGE will not increase or decrease its priority, nor will it be preempted.

• How to implement:
  As root (after reboot):
  setprivgrp -g dba RTSCHED RTPRIO
  Enter in /etc/privgroup: dba RTSCHED RTPRIO
  In init.ora (spfile), set:
  o hpux_sched_noage=178

• Check (ps) that the Oracle background processes use priority 178.

• In RAC, make sure this is true on all nodes.
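The ps check can be sketched as below; the exact position of the PRI column in `ps -efl` output varies, so confirm it against the header line on your system:

```
# Sketch: list the RAC lock-manager background processes and eyeball
# their scheduling priority (PRI column) on each node.
ps -efl | grep -e 'ora_lmd' -e 'ora_lms' | grep -v grep
# With hpux_sched_noage=178 in effect, each process's PRI column
# should read 178 on every RAC node.
```

Running this on all nodes catches the common mistake of setting hpux_sched_noage on only one instance.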

Page 23:

Deadlock etc.

• Deadlock situations can escalate into a serious issue in a RAC environment.

• Make sure to check these issues prior to any RAC implementation.

• A hint that might help:
  *.event = "7445 trace name ERRORSTACK level 3"

• We encountered a situation in which an instance crashed due to deadlocks: RAC1 crashed while the databases on the other nodes remained up.

Page 24:

Deadlock etc. (Cont.)

• *** In a RAC environment, deadlock detection works differently than in non-RAC. In non-RAC, Oracle updates the alert.log with ORA-60 and generates a user trace file; in RAC, it puts the deadlock graph only in the LMD trace file (background_dump_dest, no user trace file).

• Since we are dealing with global resources in RAC, the deadlock detection mechanism differs from a non-RAC environment. In RAC, the LMD process periodically checks for deadlocks on the database. How often it checks is controlled by the parameter _lm_dd_interval. The default is 60 seconds on most versions/platforms and can be checked with the following query:

  select x.ksppinm, y.ksppstvl
  from x$ksppi x, x$ksppcv y
  where x.indx = y.indx
  and x.ksppinm like '%lm_dd%'
  order by x.ksppinm;

  Setting _lm_dd_interval too low can cause LMD to consume too much CPU. Setting it too high can cause applications to hang because deadlocks aren’t detected quickly enough. The default is sufficient for most cases.

Page 25:

Deadlock etc. (Cont.)

• *** Deadlocks may occur on the same node or across different nodes in the cluster.

• Diagnosing ORA-60 errors on a single instance is easier than when the deadlock spans multiple instances (getting the SQL statement).

• For the multi-instance case, starting with 9.2.0.6 / 10.1.0.3 and above, Oracle dumps additional diagnostics for deadlocks in a RAC environment.

• On a single instance you can use the following (after checking the <node> in the LMD trace file), set on all instances (spfile):
  event="60 trace name errorstack level 3;name systemstate level 10"

Page 26:

Startup & Shutdown

• Be cautious when performing startup and shutdown operations on such large SGAs.

• Shutdown may take a very long time (half an hour or more).

• While performing the procedure below, we got a few Oracle errors:
  srvctl start database -d <NODE NAME>
  Bug 2540942 (LMS may spin in kjctr_rksxp() during simultaneous shutdown).
  During simultaneous shutdown of several instances with large buffer caches, there is a chance of running out of tickets because of the number of messages that must be sent for cleanup. The most visible side effect is LMS spinning in kjctr_rksxp(); the ORA-600 [504] is a side effect of this.

Page 27:

Different Execution Plans

• Node 1 Example

Enter value for hashvalue: 2601636153

SQL_HASH SQL_TEXT

--------------- ---------------------------------------------------------------------------

2601636153 select members.MEMBER_ID into :b0:b1 from (select t.MEMBER_ID from

UFMI_INV u ,RM_MEMBER_ID t where (((((u.URBAN_ID(+)=:b2 and

u.FLEET_ID(+)=:b3) and u.MEMBER_ID(+)=t.MEMBER_ID) and u.MEMBER_ID is null

) and t.MEMBER_ID>:b4) and t.MEMBER_ID<=:b5) order by t.MEMBER_ID ) members

where ROWNUM<=:b6

Enter value for hashvalue: 2601636153

Plan Table

------------------------------------------------------------------------------------------------------------------------

| Operation | Name | Starts | E-Rows | A-Rows | Buffers | Reads | Writes | E-Time |

------------------------------------------------------------------------------------------------------------------------

| INDEX RANGE SCAN |UFMI_INV_PK | | 4 | | | | | |

| SORT JOIN | | | 4 | | | | | |

| INDEX RANGE SCAN |RM_MEMBER_ID_PK | | 142K| | | | | |

| MERGE JOIN OUTER | | | | | | | | |

| FILTER | | | | | | | | |

| FILTER | | | | | | | | |

| VIEW | | | 142K| | | | | |

| COUNT STOPKEY | | | | | | | | |

| SELECT STATEMENT | | | | | | | | |

------------------------------------------------------------------------------------------------------------------------

12 rows selected.

Page 28:

Different Execution Plans (Cont.)

• Node 3 Example

Enter value for hashvalue: 2601636153

SQL_HASH SQL_TEXT

--------------- ---------------------------------------------------------------------------

2601636153 select members.MEMBER_ID into :b0:b1 from (select t.MEMBER_ID from

UFMI_INV u ,RM_MEMBER_ID t where (((((u.URBAN_ID(+)=:b2 and

u.FLEET_ID(+)=:b3) and u.MEMBER_ID(+)=t.MEMBER_ID) and u.MEMBER_ID is null

) and t.MEMBER_ID>:b4) and t.MEMBER_ID<=:b5) order by t.MEMBER_ID ) members

where ROWNUM<=:b6

Enter value for hashvalue: 2601636153

Plan Table

-------------------------------------------------------------------------------------------------------------------------

| Operation | Name | Starts | E-Rows | A-Rows | Buffers | Reads | Writes | E-Time |

------------------------------------------------------------------------------------------------------------------------

| INDEX RANGE SCAN |UFMI_INV_PK | | 1 | | | | | |

| INDEX RANGE SCAN |RM_MEMBER_ID_PK | | 1 | | | | | |

| NESTED LOOPS OUTER | | | | | | | | |

| FILTER | | | | | | | | |

| FILTER | | | | | | | | |

| VIEW | | | 1 | | | | | |

| COUNT STOPKEY | | | | | | | | |

| SELECT STATEMENT | | | | | | | | |

------------------------------------------------------------------------------------------------------------------------

11 rows selected.

Page 29:

What might be the Reason?

The first set of binds on Node 1 returned a larger result set.

• Bind peeking: versions 9i/10g use “bind variable peeking”, whereby the CBO assumes that the bind variable values used for the first execution of a SQL statement are representative of the values that will be used in future executions of the same statement.

  Bind variable peeking occurs every time the cursor is parsed or loaded. If the data is skewed, the execution plan may change when, for example, the shared pool is flushed or the underlying objects are analyzed.

• The optimizer makes different estimates because of the different parameters set on each node.

• Histograms were built on these columns.

Page 30:

How to “Solve” the issue?

• Turn off bind peeking (_optim_peek_user_binds=false)
  Pros – eliminates any bind peeking.
  Cons – affects *all* statements, some of which may already be benefiting from bind peeking.

• Hints (if the data is skewed or the result sets of the binds vary a lot)
  Pros – always get the same plan, regardless of peeked binds.
  Cons – unfortunately requires an application change.

• Use stored outlines (we chose this option)
  Pros – no need to change the application, and applied immediately.
  Cons – extra DBA maintenance, and any change to the query text disables the use of the outline.
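The stored-outline route can be sketched as below. The outline and category names are hypothetical, and the ON clause must contain the application's SQL text verbatim (abbreviated here), or the outline will never be matched:

```
-- Sketch only: pin one plan with a 9i stored outline (names are hypothetical).
-- The statement text after ON must match the application's SQL exactly.
CREATE OR REPLACE OUTLINE member_fetch_ol
  FOR CATEGORY bench_outlines
  ON select members.MEMBER_ID from ( ... ) members where ROWNUM <= :b6;

-- USE_STORED_OUTLINES is not an init.ora parameter; enable the category
-- after startup on each instance:
ALTER SYSTEM SET use_stored_outlines = bench_outlines;
```

Because the outline is stored in the database, every RAC instance that enables the category then gets the same plan, which is exactly the cross-node consistency the benchmark needed.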

Page 31:

_ Hidden Parameters & More

• _row_cr => “global cache cr request” event – this event is generated when an instance has requested a consistent-read data block (or undo segment headers) and the block to be transferred hasn’t arrived at the requesting instance. Make sure to check V$CR_BLOCK_SERVER.

  Setting _row_cr=TRUE (default = FALSE) might reduce the number of CR rollbacks and avoid a costly cleanup/rollback in RAC: instead of performing block cleanup/rollback, Oracle will try to generate a CR copy for the given row.

  Once we used this parameter, we hit another bug, #3577772.8, which we found during a Direct Path export.

Page 32:

_ Hidden Parameters & More (Cont.)

• FAST_START_MTTR_TARGET => this parameter guarantees that instance or crash recovery will complete within the given time frame; when set to a non-zero value, it uses incremental checkpoints. The default value is 300 seconds (5 minutes).

  The customer was using 0 seconds; setting this parameter back to the Oracle default reduced inter-node communication as well as the overall I/O on the system. Make sure to amend it to suit your site’s requirements.

• _BUMP_HIGHWATER_MARK_COUNT => HW enqueue (high-water mark) – the high-water mark of a table may be constantly increased as the table grows and runs out of blocks for new inserts on the free list. The default value for bumping up the HWM is 0, which may be too low on a busy system or for the average insert size.

  We increased it to 5 and nearly eliminated this enqueue altogether.
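Restoring the recovery-time target described above is a one-liner; a sketch, assuming an spfile is in use (sid='*' applies the value to every RAC instance):

```
-- Sketch: set the 300-second default cluster-wide via the spfile.
ALTER SYSTEM SET fast_start_mttr_target = 300 SCOPE = BOTH SID = '*';
```

V$INSTANCE_RECOVERY can then be checked on each node to see the estimated recovery time converge toward the target.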

Page 33:

Multiple Block Sizes?

• The single-instance system was using a KEEP buffer pool nearly the same size as the DEFAULT buffer pool.

• We once tried to segregate the indexes from the data into a smaller buffer pool (2K for indexes):
  ORA-00379: no free buffers available in buffer pool DEFAULT for block size 2K
  Because of the asymmetric machines, we hit the above error when the proper buffer pool was not defined on all instances.

• The recommendation: combine the different buffer pool caches into one; there is no need to use KEEP or RECYCLE. Oracle can manage them well.

Page 34:

Summary

• Address application deadlocks. This is especially important when moving to RAC.

• Combine the different buffer caches into one; there is no need to use KEEP or RECYCLE. Oracle can manage them well.

• Move to locally managed tablespaces with ASSM (Automatic Segment Space Management).

• Don’t consider using HMP when going to production on very large systems. Using the fabric over UDP should be enough.

• Try to use hints when you encounter different execution plans (binds).

• If possible, partition the applications across a few physical nodes (only under truly high concurrency and volumes).

• All the bugs mentioned have been fixed in newer versions of Oracle.

• Take extra precautions when performing startup and shutdown in such large environments.
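The LMT-with-ASSM recommendation from the summary looks like this in DDL; the tablespace name, raw-device path, and sizes are hypothetical (the benchmark used RAW devices):

```
-- Sketch: locally managed tablespace with ASSM (names/paths are assumptions).
CREATE TABLESPACE bench_data
  DATAFILE '/dev/vgora/rbench_data_01' SIZE 4G
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 4M
  SEGMENT SPACE MANAGEMENT AUTO;
```

ASSM replaces freelist management with bitmaps, which sidesteps the free-list contention behind the HW-enqueue issue discussed on Page 32.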

Page 35:

References

HP Cluster Install.
Oracle 9i – Best Practices – Performance Diagnostics & Tuning.
http://www.oracle.com/technology/documentation/index.html (and search for HMP).

Page 36:

Thanks