Issues after Workload Moves, and Why We Implemented IRD to Fix Them



Workload Separation For Licence Savings

Original setup

•Mixture of licences across LPARs

•Licence charges based on LPAR size, not product usage within the LPAR

•LPAR layout on the z9 2094-713:

• SYSA: IMS, DB2, CICS, BATCH, DB2 SP, MQ

• SYSB: IMS, DB2, CICS, BATCH, DB2 SP, MQ

• SYSK: DB2, CICS, BATCH, DB2 SP, MQ


Workload Separation For Licence Savings

New Setup

•IMS, CICS and general LPAR

•IMS subsystems merged

•DB2 datasharing

•MSU licence cost on SYSB is substantially cheaper

•Move as much non-IMS, non-CICS work to SYSB as possible

•LPAR layout:

• SYSA (IMS LPAR): IMS, DB2, DLIBATCH

• SYSB (BATCH LPAR): DB2, BATCH, DB2 SP

• SYSK (CICS LPAR): DB2, CICS

• MQ on each LPAR; BMPBATCH also shown


Working day 1

•Lots of batch, overnight and during day

•Always our peak day of month

•Peak occurs during overnight batch

•Batch failures and overruns common

•Online systems are also busiest on this day

•System can often run capped


Workload Moves

•No major issues with online

•No major issues with normal batch, but

•Major issue with WD1 batch

•Many DB2 batch jobs elapsed time increased 2-3 times when moved to SYSB

•Changing priority made no difference in most cases

•If rerun on SYSA, the same jobs ran much faster

•Jobs not delayed for CPU

•Nothing in RMFWDM or DB2 thread stats to explain

•DASD response times similar on SYSA and SYSB

•LPARs on same CEC, shared channels


Thoughts and Guesses

•Delay in servicing I/O interrupts

• Higher TPI% on SYSB, but not much

• CPENABLE setting (10,30) – in line with IBM recommendations

• Logical processor share very low on SYSB: 16% vs 80% on SYSA

•Same number of CPUs online SYSB as SYSA

•Weightings not changed to reflect workload moves

• Significant when capped

• Weightings need to be more dynamic overnight vs online

•LPAR dispatch time 12.5 ms


Thoughts and Guesses

•Found ‘Short CP’ paper by Kathy Walsh

• Describes how high priority work may be delayed

• In relation to online CICS

• I/O not mentioned

• But posting I/O complete IS high priority work

•Recommends reducing online CPUs to match workload

•Recommends making sure weights are appropriate

•Recommends using IRD to monitor and automate changes


What is a short CP?

•SYSB weight set to expect 120 MSU (approx 2 engines)

•When capped, this is enforced

•SYSB z/OS thinks it has 11 engines, but each is dispatched only about 2/11 of the time

•High priority work is not dispatched

•All dispatched CPs might be disabled for interrupts due to CPENABLE

•Short CP ratio – MVSBUSY/LPARBUSY
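
The arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the function names are invented here, the 2-engine entitlement and 11 logical CPs come from the slide, and both the 3-CP configuration and the MVSBUSY/LPARBUSY percentages are hypothetical values chosen only to show the calculation.

```python
def logical_processor_share(entitled_engines: float, logical_cps: int) -> float:
    """Fraction of a physical engine each logical CP can expect when capped."""
    return entitled_engines / logical_cps


def short_cp_ratio(mvs_busy_pct: float, lpar_busy_pct: float) -> float:
    """Short CP ratio as defined on the slide: MVSBUSY / LPARBUSY."""
    return mvs_busy_pct / lpar_busy_pct


# Figures from the slide: weight worth about 2 engines, 11 logical CPs online.
share_11 = logical_processor_share(2, 11)
print(f"11 CPs online: {share_11:.1%} per logical CP")

# Same entitlement with fewer logical CPs online (hypothetical configuration).
share_3 = logical_processor_share(2, 3)
print(f" 3 CPs online: {share_3:.1%} per logical CP")

# Hypothetical busy percentages, for illustration only.
print(f"short CP ratio: {short_cp_ratio(96.0, 20.0):.2f}")
```

With the same entitlement spread over fewer logical CPs, each one is dispatched for a larger fraction of the time, which is the reasoning behind matching online CPs to the workload.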


Metrics From 01/02/2012

                   Short CP ratio   DASD response       DASD rate      Actual MSU
Interval  GCL rem    SYSA    SYSB    SYSA    SYSB    SYSA    SYSB    SYSA    SYSB    Ratio
03.00.00      321    1.00    1.24     4.7     4.2    1819    3860     112     320   25.93%
03.30.00      265    1.00    1.22     2.4     5.1    1884    2353     140     265   34.57%
04.00.00      234    1.00    1.02     4.5     8.8    2457    5670     100     203   33.00%
04.30.00      194    1.00    1.07     4.1     9.6    2478    3988     122     257   32.19%
05.00.00      166    1.00    1.04     5.2    11.7    3520    2670      92     175   34.46%
05.30.00      108    1.00    1.58     4.1     9.6    8008    4207     209     230   47.61%
06.00.00       49    1.00    1.09     5.2     8.0    6012    4777     277     342   44.75%
06.30.00       67    1.00    1.23     4.4     5.1    4866    6019     285     268   51.54%
07.00.00       76    1.00    1.20     3.4     3.6    6265    6113     281     220   56.09%
07.30.00       77    1.00    1.18     3.8     3.8    6627    4973     232     193   54.59%
08.00.00       29    1.09    3.75     3.2     3.7    6661    3069     536     153   77.79%
08.30.00       -5    1.08    3.21     3.3     2.8    3763    2155     474     204   69.91%
09.00.00      -37    1.34    4.79     3.6     3.5    5073    2448     403     137   74.63%
09.30.01      -25    1.56    4.81     3.8     3.8    4911    1854     416     124   77.04%
10.00.00      -11    1.51    5.87     3.9     3.9    5523    2045     434     100   81.27%
10.30.00        0    1.18    3.87     4.1     3.7    5198    2373     446     143   75.72%
11.00.00       -2    1.09    2.95     4.3     3.8    4651    2910     331     154   68.25%
11.30.00      -21    1.19    3.79     3.6     2.9    5984    2881     400     126   76.05%
12.00.00       -6    1.17    3.08     4.2     3.2    5682    4237     388     130   74.90%
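
The Ratio column is not defined on the slide, but the figures are consistent with it being SYSA's share of the combined SYSA and SYSB MSU. The sketch below recomputes it for three intervals from the table under that assumption; the function name is illustrative only.

```python
def sysa_msu_share(sysa_msu: int, sysb_msu: int) -> float:
    """SYSA's share of the combined SYSA + SYSB MSU (assumed meaning of 'Ratio')."""
    return sysa_msu / (sysa_msu + sysb_msu)


# (interval, SYSA actual MSU, SYSB actual MSU) copied from the table.
rows = [
    ("03.00.00", 112, 320),
    ("08.00.00", 536, 153),
    ("10.00.00", 434, 100),
]

for interval, sysa, sysb in rows:
    print(f"{interval}  {sysa_msu_share(sysa, sysb):.2%}")
```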


Notes on Metrics from 01/02/2012

•DASD responses appear similar between systems

•I/O rate appears depressed on SYSB when capped, more so than on SYSA

•90% of I/Os serviced by 4 out of 11 CPs

•But each CP only has 16% logical processor share

•Weights are about 80/20 between SYSA/SYSB

•When capped, actual MSUs moved toward these ratios

•SYSB needs more than SYSA in the batch window

•SYSA needs more than SYSB in the online window
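
The drift toward the 80/20 weights can be checked against the 01/02/2012 figures. The sketch below averages SYSA's share of the combined MSU over two groups of intervals. It assumes that a "GCL rem" value of zero or below marks a capped interval, which the slides do not state explicitly, and the names used are illustrative.

```python
# (GCL rem, SYSA actual MSU, SYSB actual MSU) for each interval on 01/02/2012.
intervals = [
    (321, 112, 320), (265, 140, 265), (234, 100, 203), (194, 122, 257),
    (166, 92, 175), (108, 209, 230), (49, 277, 342), (67, 285, 268),
    (76, 281, 220), (77, 232, 193), (29, 536, 153), (-5, 474, 204),
    (-37, 403, 137), (-25, 416, 124), (-11, 434, 100), (0, 446, 143),
    (-2, 331, 154), (-21, 400, 126), (-6, 388, 130),
]


def mean_sysa_share(rows):
    """Average of SYSA / (SYSA + SYSB) over the given rows."""
    shares = [sysa / (sysa + sysb) for _, sysa, sysb in rows]
    return sum(shares) / len(shares)


# Assumption: GCL rem <= 0 means the interval ran capped.
capped = [row for row in intervals if row[0] <= 0]
uncapped = [row for row in intervals if row[0] > 0]

capped_share = mean_sysa_share(capped)
uncapped_share = mean_sysa_share(uncapped)
print(f"capped intervals:   SYSA share {capped_share:.1%}")
print(f"uncapped intervals: SYSA share {uncapped_share:.1%}")
```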


Recreation of problem

•Lower the cap on development system

•Run several low priority looping jobs

•Run an I/O-intensive job at high priority, so that it has no CPU delays

•Problem recreated only when

• System was capped

• Weight was too low for system MSU

• Additional CPs were varied online


Metrics From 01/03/2012

                   Short CP ratio   DASD response       DASD rate      Actual MSU
Interval  GCL rem    SYSA    SYSB    SYSA    SYSB    SYSA    SYSB    SYSA    SYSB    Ratio
03.00.00      163    2.10    1.01     6.9    10.9    5287    6129     316     411   43.47%
03.30.00       79    1.25    1.01     5.5     6.8    6672    5803     396     304   56.57%
04.00.00       24    1.04    1.00     3.4     5.3    5730    7288     278     286   49.29%
04.30.00        5    1.08    1.00     3.4     4.0    5503    7855     239     243   49.59%
05.00.00       28    1.03    1.00     3.2     4.2    6522    5686     287     248   53.64%
05.30.00       11    1.04    1.00     3.3     4.5    6787    4251     385     215   64.17%
06.00.00      -15    1.01    1.00     3.5     4.2    5294    5596     338     128   72.53%
06.30.00      -39    1.00    1.28     3.5     3.9    4929    5269     330     147   69.18%
07.00.00       -7    1.00    1.00     3.3     4.1    5147    4331     331      80   80.54%
07.30.00       42    1.00    1.00     3.6     2.1    2632    5916     175      78   69.17%
08.00.00       87    1.06    1.00     4.2     4.5    3267    3282     185      81   69.55%
08.30.00       97    1.02    1.00     3.2     4.5    7594    3177     338     248   57.68%
09.00.00      101    1.02    1.01     3.5     5.0    6570    3637     260     139   65.16%
09.30.01      119    1.04    1.00     4.3     3.8    5715    3475     350     112   75.76%
10.00.00      117    1.03    1.00     3.9     3.7    4352    3526     311     102   75.30%
10.30.00      118    1.04    1.00     4.0     4.6    4893    2527     330      82   80.10%
11.00.00      112    1.04    1.00     4.1     4.4    4753    3051     294      93   75.97%
11.30.00       88    1.08    1.00     3.7     4.4    4820    2078     371     110   77.13%
12.00.00       54    1.05    1.00     3.6     3.7    5028    2756     382     100   79.25%


Comparison

•SYSB logical processor share 60-100%, compared to 16% previously

• Helps processor cache

•SYSA logical processor share 60% compared to 80%

•Job FNMDMP05

• 01/02: 15h17m elapsed, 55 min CPU, 642K EXCPs

• 01/03: 8h35m elapsed, 48 min CPU, 648K EXCPs

•DB2 DBM1 address space (does database I/O)

• 01/02: 499,271,506 I/Os took 02:50:58 (hh:mm:ss) of CPU time

• 01/03: 549,277,707 I/Os took 02:28:45 (hh:mm:ss) of CPU time
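
The before and after figures can be reduced to two derived numbers: the elapsed-time speedup for job FNMDMP05 and the CPU cost per I/O in the DBM1 address space. The sketch below is simple arithmetic on the values quoted on this slide; the helper and variable names are illustrative.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an 'hh:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Job FNMDMP05: elapsed time in minutes on each date.
elapsed_before = 15 * 60 + 17   # 15h17m on 01/02
elapsed_after = 8 * 60 + 35     # 8h35m on 01/03
speedup = elapsed_before / elapsed_after
reduction = 1 - elapsed_after / elapsed_before
print(f"FNMDMP05 elapsed: {speedup:.2f}x faster ({reduction:.0%} less elapsed time)")

# DB2 DBM1: CPU microseconds per I/O on each date.
cpu_per_io_before = hms_to_seconds("02:50:58") / 499_271_506 * 1e6
cpu_per_io_after = hms_to_seconds("02:28:45") / 549_277_707 * 1e6
print(f"DBM1 CPU per I/O: {cpu_per_io_before:.1f} us -> {cpu_per_io_after:.1f} us")
```

On these figures DBM1 handled more I/Os on 01/03 while using less CPU time in total, which fits the slide's remark that the higher logical processor share helps the processor cache.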