
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1


www.enkitec.com

Troubleshooting the Most Complex Performance Issue I’ve ever seen

Tanel Poder
http://blog.tanelpoder.com
http://tech.e2sn.com


Intro: About me

• Tanel Põder
  Oracle Database Performance geek
  Exadata Performance geek
  Hadoop Performance geek
• Enkitec
  • Consultant
  • Researcher
  • Technology Evangelist
• Just moved to Dallas
  • After Tallinn -> Stockholm -> London -> Cancun -> Singapore

Expert Oracle Exadata book
(with Kerry Osborne and Randy Johnson of Enkitec)


Two issues - actually

• For warm-up:
  • cursor: pin S wait events and sporadic CPU spikes
  • Read more from my blog entry:
    • http://blog.tanelpoder.com/2010/04/21/cursor-pin-s-waits-sporadic-cpu-spikes-and-systematic-troubleshooting/
  • Or just google for “cursor pin s”

 


Environment

• High-concurrency, high-visibility OLTP database
  • Oracle 11.1.0.7 single-instance, dedicated server processes
  • HP-UX on Itanium
  • 32 CPUs, 128 GB RAM
• Thousands of end users
  • Multiple WebLogic application servers talking to the database via connection pools


The problem

• Sporadic extreme slowness of Oracle DB and the server
  • Slowness lasts for 1 to 20 minutes at a time…
  • Queries don’t return, or are extremely slow
  • Can’t even log on to the OS during that time
    • New SSH connections succeeded once the spike was over
  • It takes minutes to run simple OS commands during the problem time
• This is a global, server-wide problem: everyone complains!

So, the scope of this problem is global, server-wide. Therefore we can use global, server-wide metrics to diagnose the problem.


Let’s pick and diagnose one occurrence of this problem

• The database response times were extremely bad again around 18:10 and this lasted for about 5 minutes…
• If it’s the users who report the problem (as opposed to application-side measurements), then there may be some discrepancies between the user-reported times and the actual problem time


Initial AWR Report

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     61921 30-Oct-10 18:00:10     2,383      28.9
  End Snap:     61922 30-Oct-10 18:20:20     2,863      24.7
   Elapsed:               20.17 (mins)
   DB Time:              559.31 (mins)

                                                     Avg   % DB
Event                           Waits     Time(s)   (ms)   time Wait Class
--------------------------- --------- ----------- ------ ------ ----------
db file sequential read     2,135,668      21,468     10   64.0 User I/O
DB CPU                                      5,860          17.5
log file sync                  92,720       1,498     16    4.5 Commit
read by other session          91,676       1,307     14    3.9 User I/O
SQL*Net message from dblink       525       1,132   2155    3.4 Network

Host CPU (CPUs: 32 Cores: 32 Sockets: 32)
~~~~~~~~      Load Average
              Begin       End     %User   %System      %WIO     %Idle
          --------- --------- --------- --------- --------- ---------
               0.37      0.31      16.3      17.6      12.1      66.1

Using a 20 minute report for diagnosing a 5 minute problem?!

Number of sessions has grown by ~500!

This 66% idle is an average over 20 minutes!


ASH data (shown in OEM)

• Average active sessions showed something different
  • Note that this data is from another period of time when a similar spike happened
• In worst times there were up to 220 active sessions trying to be on CPU!
  • Thanks to better granularity we see the spikes instead of some 20-minute or hourly averages…
• The problem with ASH samples is that they look into session state from inside Oracle
  • Perhaps the starvation is due to some other application / instance in the server?


How many logons were done?

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     61921 30-Oct-10 18:00:10     2,383      28.9
  End Snap:     61922 30-Oct-10 18:20:20     2,863      24.7
   Elapsed:               20.17 (mins)
   DB Time:              559.31 (mins)

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
index fetch by key                       24,174,148       19,971.0          43.6
index scans kdiixs1                      24,565,055       20,293.9          44.3
leaf node 90-10 splits                        5,865            4.9           0.0
leaf node splits                             14,529           12.0           0.0
lob reads                                    34,480           28.5           0.1
lob writes                                1,623,273        1,341.0           2.9
lob writes unaligned                      1,623,266        1,341.0           2.9
logons cumulative                             2,550            2.1           0.0
messages received                           133,740          110.5           0.2
messages sent                               133,740          110.5           0.2
min active SCN optimization appl            538,358          444.8           1.0
no buffer to keep pinned count                6,331            5.2           0.0
no work - consistent read gets          146,703,542      121,196.1         264.7
opened cursors cumulative                 4,168,700        3,443.9           7.5

Number of sessions has grown by ~500!

2.1 logons per second, but we don’t know how these logons are distributed over the 20 minute period!
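To see how much an average can hide, here is a small back-of-the-envelope sketch using the AWR numbers above. It compares the report-wide logon rate with a hypothetical worst case where all 2,550 logons landed inside the roughly 5-minute problem window. The helper name logon_rate is made up for this illustration, and the worst case is an assumption, not a measurement.

```shell
#!/bin/sh
# logon_rate LOGONS MINUTES: average logons per second over the given window.
# Hypothetical helper for illustration; it only does the division.
logon_rate() {
    awk -v logons="$1" -v minutes="$2" \
        'BEGIN { printf "%.1f\n", logons / (minutes * 60) }'
}

# Whole AWR interval (20.17 minutes), as reported by AWR.
logon_rate 2550 20.17

# Assumed worst case: every logon happened within the 5-minute problem window.
logon_rate 2550 5
```

The same count gives a very different rate depending on the window you divide by, which is why the 20-minute figure alone cannot rule out a logon storm.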


OS level metrics don’t lie (well, they do, but less ;-)

          |        |      |  Phys  |   Phys   |Memory|Pg Out| VM Pg  |
   Date   |  Time  | CPU %| IO Rt  |  KB Rt   |   %  | Rate | Scan Rt|
10/30/2010|17:45:00| 25.06|  7149.5|  145817.6| 74.07|   0.0|     0.0|
10/30/2010|17:50:00| 24.77|  5334.8|   60928.0| 73.98|   0.0|     0.0|
10/30/2010|17:55:00| 24.60|  7176.4|  186368.0| 73.98|   0.0|     0.0|
10/30/2010|18:00:00| 25.95|  7556.2|  192307.2| 74.11|   0.0|     0.0|
10/30/2010|18:05:00| 22.88|  5379.5|   67584.0| 74.15|   0.0|     0.0|
10/30/2010|18:10:00| 66.89|  4544.6|   58060.8| 76.97|   0.0|     0.0|
10/30/2010|18:15:00| 24.51|  7544.9|  159334.4| 76.40|   0.0|     0.0|
10/30/2010|18:20:00| 25.47|  5144.0|   59187.2| 75.04|   0.0|     0.0|
10/30/2010|18:25:00| 28.38| 10139.5|  151552.0| 74.37|   0.0|     0.0|

1) What does the Time 18:10:00 mean, beginning of the monitoring interval or end?

2) 66.89% busy during 5 minutes may actually mean 100% busy during ~3 minutes out of 5, but we don’t know that for sure without measuring in more detail (better granularity)…
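The "~3 minutes out of 5" estimate can be sketched as simple arithmetic. This assumes the CPU was either at its usual baseline (about 25%, read off the neighbouring samples in the table) or pegged at 100%, with nothing in between; both are assumptions for illustration, and spike_minutes is a made-up helper name.

```shell
#!/bin/sh
# spike_minutes AVG BASE PEAK LEN
# Solve  AVG = (BASE * (LEN - t) + PEAK * t) / LEN  for t, the time spent at PEAK.
spike_minutes() {
    awk -v avg="$1" -v base="$2" -v peak="$3" -v len="$4" \
        'BEGIN { printf "%.1f\n", len * (avg - base) / (peak - base) }'
}

# 66.89% average over a 5-minute interval, assumed 25% baseline, 100% peak.
spike_minutes 66.89 25 100 5
```

Under these assumptions the interval average corresponds to a little under three minutes of full saturation, which is only a guess until finer-grained data confirms it.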


Measuring CPU utilization in more detail

The spike lasted from 18:11 to 18:14 (3 min)

Around 90% in Kernel mode!!!


Checkpoint - measured evidence so far

• Fact: We have a 100% CPU utilization spike, lasting 3 minutes
• Fact: 90% of it is spent in KERNEL mode
• Fact: We have over 2500 logons done during the 20 minute period
  • 2.1 logons / second on average (which doesn’t sound bad)
• Kernel mode CPU usage is usually caused by system calls
  • …or some internal kernel thread activity

100% CPU usage doesn’t always automatically mean you have a serious CPU starvation problem. The CPU run queue length would indicate how much starvation (waiting for CPU service) there is. However, seeing 90% of CPU used in KERNEL mode is definitely not normal for an Oracle database server.


Diagnosing 90% kernel-mode CPU usage spikes…

1. Systematic
  • Break down this 90% of kernel mode CPU usage
    • Profiling!
    • Oh, this is a production system and the problem is acute & ongoing
  • On Solaris, I’d have used the DTrace stack() action to record the OS kernel stack traces most common on CPU (google for dstackprof)
    • Or lockstat, as it reports spins on spinlocks (which consume kernel CPU)
  • But this was HP-UX and I didn’t know the tools needed
    • But I knew what numbers I wanted to see!
    • We sent a request to HP-UX support:
      “How do we measure & break down where is kernel mode CPU used?”

2. Check for usual suspects
  • Fast, cheap checks to rule out or find known troublemakers


Kernel mode CPU usage spikes - the usual suspects

• Before starting the systematic troubleshooting & drilldown, do quick checks for usual suspects
  • Remember, the client has a business problem, time is of the essence…

1. Logon (or logoff) storms
  • Spawning, initializing new processes, opening files and attaching to SGA means system calls, kernel CPU usage

2. Oracle code getting into some crazy loop (due to a bug)
  • semop(), yield(), read /proc/…, getrusage(), etc. loop

3. OS kernel spinlock contention
  • Variety of reasons…
  • Often due to bugs in OS or some kernel module

Not so usual suspects really…


Measuring logon storms

• Use the AUD$ records or “logons cumulative” number from V$SYSSTAT or AWR, right?
  • Wrong!
• The logons cumulative number is incremented by the session itself, after it has logged on; the same applies to audit records!
  1. After the listener connection has been established…
  2. The process has been started…
  3. It has attached to SGA SHM segments…
  4. Audit file has been written (if needed)…
  5. Process, session SGA structures have been created
     • Memory from OS and shared pool allocated (shared pool latches!)
  6. Session has been authenticated
  7. Then the logons cumulative is incremented!

The logon storm may have started way before these logons finally succeeded!


Measuring logon storms

• Logon storms should be measured at the listener level
  • Process listener.log using a script:

$ tail listener.log
30-OCT-2010 23:22:03 * (CONNECT_DATA=…) * establish * E2SNDB * 0
30-OCT-2010 23:22:08 * (CONNECT_DATA=…) * establish * E2SNDB * 0
30-OCT-2010 23:22:08 * (CONNECT_DATA=…) * establish * E2SNDB * 0
30-OCT-2010 23:22:09 * service_update * E2SNDB * 0

$ fgrep "30-OCT-2010" listener.log | fgrep "establish" | \
    awk '{ print $1 " " $2 }' | awk -F: '{ print $1 }' | \
    sort | uniq -c
   5564 30-OCT-2010 20
   5571 30-OCT-2010 21
   5703 30-OCT-2010 22
   4828 30-OCT-2010 23

5703 logons during 22:00 - 22:59


Measuring logon storms

• Logon storms should be measured at the listener level
  • Process listener.log using a script:

$ fgrep "30-OCT-2010 22:" listener.log | fgrep "establish" | \
    awk '{ print $1 " " $2 }' | awk -F: '{ print $1 ":" $2 }' | \
    sort | uniq -c
…
     88 30-OCT-2010 22:00
    120 30-OCT-2010 22:01
     94 30-OCT-2010 22:02
     94 30-OCT-2010 22:03
     95 30-OCT-2010 22:04
    120 30-OCT-2010 22:05
     79 30-OCT-2010 22:06
    101 30-OCT-2010 22:07
     85 30-OCT-2010 22:08
    100 30-OCT-2010 22:09
     85 30-OCT-2010 22:10
     89 30-OCT-2010 22:11
     83 30-OCT-2010 22:12
     93 30-OCT-2010 22:13

How many connections the listener established every minute (this data is from a non-problem time)
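When hunting for a storm you usually care about the busiest minutes rather than the full list. A minimal sketch along the same lines as the pipeline above: it reads listener.log lines on stdin and prints the top N minutes by "establish" count. The function name busiest_minutes and the sample log lines are made up for illustration (the line layout follows the tail output shown earlier), and grep -F is used in place of fgrep.

```shell
#!/bin/sh
# busiest_minutes N: read listener.log lines on stdin and print the N minutes
# with the most "establish" entries, as "count date hh:mm".
# Hypothetical helper; assumes the timestamp is in the first two fields.
busiest_minutes() {
    grep -F "establish" |
        awk '{ print $1 " " substr($2, 1, 5) }' |   # keep date and hh:mm only
        sort | uniq -c | sort -rn | head -n "$1" |
        awk '{ print $1, $2, $3 }'                  # drop uniq's padding
}

# Made-up sample input in the listener.log layout.
busiest_minutes 2 <<'EOF'
30-OCT-2010 22:00:58 * (CONNECT_DATA=...) * establish * E2SNDB * 0
30-OCT-2010 22:01:02 * (CONNECT_DATA=...) * establish * E2SNDB * 0
30-OCT-2010 22:01:10 * (CONNECT_DATA=...) * establish * E2SNDB * 0
30-OCT-2010 22:01:59 * service_update * E2SNDB * 0
30-OCT-2010 22:01:59 * (CONNECT_DATA=...) * establish * E2SNDB * 0
30-OCT-2010 22:02:03 * (CONNECT_DATA=...) * establish * E2SNDB * 0
30-OCT-2010 22:02:04 * (CONNECT_DATA=...) * establish * E2SNDB * 0
EOF
```

Against a real log you would run it as busiest_minutes 10 < listener.log and compare the top entries with the times users complained.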


Continuing the OS kernel mode CPU usage diagnosis

Let’s trace system calls by one CPU heavy Oracle process

# tusc -cp 8021
( Attached to process 8021 ("oracleXYZ (LOCAL=NO)") [64-bit] )
( Detaching from process 8021 ("oracleXYZ (LOCAL=NO)") )
Syscall    Seconds   Calls  Errors
open          1.05       3       3
-----        -----   -----   -----
Total         1.05       3       3

Interestingly 3 open() syscalls take over a second in total. Could this be caused by the general CPU starvation in the server?

The next step should have been to check which file we tried to open (but the spike ended before we could do that)

Seeing errors isn’t a problem in itself, as this is how Oracle and libs sometimes check for the existence of a file…


Continuing the OS kernel mode CPU usage diagnosis

• HP-UX support got back to us and recommended the use of the runki tool for measuring kernel CPU utilization in detail
  • It had to be installed as root
  • It measured a lot of things happening in kernel, writing a big output file
    • On Solaris there’s a tool called TNF trace, AIX has the trace command for such full kernel tracing
• Now we had to wait for the problem to happen again
  • Didn’t have to wait for too long…
  • We sent the raw trace dump to HP Support, so they could run something like “tkprof” on that tracefile
    • It basically just summed up the kernel spinlock wait, spin events by lock, object locked etc


kiprof - profiled runki output

Total Hardclock traces: 30239
================================
State    Count  Percent
USER      7130    23.58
SYS      22965    75.94
IDLE        24     0.08
SSYS       120     0.40

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kernel Functions executed during profile
  Count     Pct  State  Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  15615  51.64%  SYS    timed_preArbitration
   7130  23.58%  USER   OTHER
   1275   4.22%  SYS    spinlock
   1008   3.33%  SYS    wait_for_lock_spinner
    488   1.61%  SYS    vx_dirbread
    464   1.53%  SYS    vx_bc_getblk
    417   1.38%  SYS    preArbitration
    338   1.12%  SYS    vx_dirscan
    291   0.96%  SYS    vx_bc_brelse

That’s basically spinning for locks (adaptive decisions to spin or wait)

Self-explanatory. We are spinning for a lock

VxFS directory block read?

VxFS directory contents scan !!!


Drilling down to kernel spinlock spinning

The main spinlock experiencing spinning was related to VxFS:

spn%kern  cumpct  seconds  spn%cpu  lock name
    7.24    7.24    71.96     3.77  FS:vxfs:bc_freelist_lock spin
    0.11    7.36     1.14     0.06  FS:vxfs:inode spin for sleep lock
    0.04    7.40     0.44     0.02  FS:vxfs:i_spinspin lock
    0.04    7.44     0.42     0.02  Sleep Queue lock
    0.01    7.45     0.12     0.01  v_count_lock

Oracle processes were the main ones spinning:

spn%spn  cumpct  spnsec  usrsec  kernsec  spn%kern  process name
  99.17   99.17   73.82  880.32   543.70     13.58  oracle
   0.60   99.77    0.45    0.00     7.37      6.11  vxfsd
   0.08   99.85    0.06    1.42    11.52      0.52  tnslsnr
   0.05   99.91    0.04    0.00   402.60      0.01  [IDLE]
   0.03   99.93    0.02    0.00     0.50      4.00  xyz
   0.03   99.96    0.02    0.00     0.43      4.65  sadc


Which system calls?

$ /opt/ktools/bin/kiinfo -kitrace pid=28679,syscalls -ts 1103_1307 | grep open | sort -r -n -k 13,13 | more

2.550292 cpu=6 seqcnt=43808107 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.295138 ktcsys= 0.000000 A0=0x9fffffff A1=0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/admin/xyzdb/adump/xyzdb_ora_28679_21.aud'

8.789893 cpu=6 seqcnt=43825746 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.172851 ktcsys= 0.000000 A0=0x9fffffff A1=0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/admin/xyzdb/adump/xyzdb_ora_28679_23.aud'

5.617009 cpu=6 seqcnt=43816702 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 3.066686 ktcsys= 0.000000 A0=0x9fffffff A1=0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/admin/xyzdb/adump/xyzdb_ora_28679_22.aud'

18.143066 cpu=6 seqcnt=43848837 pid=28679 ktid=526366 utid=1 open [5] err=17 ret1=12 ret2=1 syscallbeg= 2.576237 ktcsys= 0.000000 A0=0x9fffffff A1=0xffffceb0 A2=0 A3=0xd02 A4=0 A5=432 fname='/opt/apps/oracle/admin/xyzdb/adump/xyzdb_ora_28679_26.aud'


Audit file destination

• New audit file name format in 11g…
  • A new file is created for each audit file where the SPID collides with a previous file…
  • Every time a new audit file is created, Oracle has to check whether such a file already exists with suffix _1, then _2, _3, etc.

$ cd /u01/app/oracle/admin/E2SNDB/adump
$ ls -l | head
total 4788
-rw-r----- 1 oracle dba 735 Feb 28 16:06 e2sndb_ora_10028_1.aud
-rw-r----- 1 oracle dba 710 Oct 16 17:58 e2sndb_ora_10082_1.aud
-rw-r----- 1 oracle dba 735 Oct 16 17:58 e2sndb_ora_10082_2.aud
-rw-r----- 1 oracle dba 735 Feb 27 17:53 e2sndb_ora_10095_1.aud
-rw-r----- 1 oracle dba 736 Oct 16 17:58 e2sndb_ora_10120_1.aud
-rw-r----- 1 oracle dba 740 Oct 16 17:58 e2sndb_ora_10125_1.aud
-rw-r----- 1 oracle dba 735 Feb 28 16:07 e2sndb_ora_10158_1.aud
-rw-r----- 1 oracle dba 735 Feb 24 17:44 e2sndb_ora_10206_1.aud
-rw-r----- 1 oracle dba 735 Dec 22 21:28 e2sndb_ora_10482_1.aud
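The suffix probing can be pictured with a small simulation. This is only a sketch of the behaviour described above, not Oracle's actual code: the helper next_audit_file and the temporary directory are made up, and a shell existence test stands in for the failing open() calls seen in the kernel trace (err=17 there is conventionally EEXIST, "file exists").

```shell
#!/bin/sh
# next_audit_file DIR PREFIX: try PREFIX_1.aud, PREFIX_2.aud, ... in DIR until
# an unused name is found, then print "<file name> <number of probes>".
# Illustrative stand-in for the behaviour described on the slide.
next_audit_file() {
    dir=$1
    prefix=$2
    n=1
    while [ -e "$dir/${prefix}_$n.aud" ]; do
        n=$((n + 1))     # name taken: one more directory lookup needed
    done
    echo "${prefix}_$n.aud $n"
}

# Demo in a scratch directory with two files already present for SPID 10082.
demo_dir=$(mktemp -d)
touch "$demo_dir/e2sndb_ora_10082_1.aud" "$demo_dir/e2sndb_ora_10082_2.aud"
next_audit_file "$demo_dir" e2sndb_ora_10082
rm -rf "$demo_dir"
```

Every extra suffix already in use costs one more name lookup, and each lookup gets more expensive as the directory grows.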


Shouldn’t the audit files be created only for SYSDBA and SYSOPER access?

• In theory, yes.
• In practice in our case, no.
• Bug 9744092: EXCESSIVE AMOUNT OF AUD FILES BEING GENERATED IN 11.1
  • Oracle generated a new .aud file for every new database connection!
  • Not just SYSOPER/SYSDBA like normally
  • This is all despite having audit_trail = DB
• Normally the .aud files in audit_file_dest are not related to the regular audit trail, but are for SYSOPER/SYSDBA logon, startup/shutdown auditing
  • …and if AUDIT_SYS_OPERATIONS = TRUE then all commands issued as SYS


Bug 9744092: EXCESSIVE AMOUNT OF AUD FILES BEING GENERATED IN 11.1

PROBLEM:
--------
After upgrade to 11.1 the system creates 10 - 16 *aud files per minute in audit_file_dest. Out of 9528 *.aud files that the customer uploaded, 9124 files recorded ACTION:[3] "102".

DIAGNOSTIC ANALYSIS:
--------------------
The change of behavior (move audit action 102 from aud$ to audit_file_dest when audit_trail=DB.) is due to a fix for an unpublished bug 5476184 in 11.1. It is not an intended feature for 11G.

WORKAROUND:
-----------
Manual delete of audit files


Diagnosis

1. Thanks to bug 9744092 and a behavior change in Oracle 11.1, a new audit file was created for each new connection to the DB
  • If a file already existed, Oracle checked if a similar file name with a larger suffix value (_2, _3, _4 etc) existed
2. The audit_file_dest eventually had over 1.5 M files in it!
  • For each logon, multiple file existence checks (open() syscalls) had to be done
3. Checking whether a file existed (open syscall -> directory entry scan) became very slow, and it’s done in kernel mode
  • A spinlock was held during the directory entry scan
  • Other new Oracle processes also wanted to do the directory scan, resulting in spinlock contention and further kernel mode CPU usage
4. When the DB got slow, app servers fired up hundreds of new connections to “make things faster”
  • This all fed back to the problem: even more contention & spinning
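A cheap check for this condition is to count the .aud files and look at the highest suffix in use, since that bounds how many existence checks a single logon may need. The sketch below is an assumption-laden illustration: audit_file_stats is a made-up helper, it relies on the <name>_<suffix>.aud naming shown in the ls output earlier, and on a directory with millions of entries the scan itself will be slow, so run it outside problem times.

```shell
#!/bin/sh
# audit_file_stats DIR: print the number of .aud files under DIR and the
# highest numeric suffix found (the N in <name>_N.aud).
# Hypothetical helper for illustration.
audit_file_stats() {
    find "$1" -name '*.aud' |
        awk -F'[_.]' '{ n = $(NF - 1) + 0; if (n > max) max = n }
                      END { printf "files=%d max_suffix=%d\n", NR, max }'
}

# Demo on a scratch directory standing in for audit_file_dest.
demo_dir=$(mktemp -d)
touch "$demo_dir/e2sndb_ora_10028_1.aud" \
      "$demo_dir/e2sndb_ora_10082_1.aud" \
      "$demo_dir/e2sndb_ora_10082_2.aud"
audit_file_stats "$demo_dir"
rm -rf "$demo_dir"
```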


Limiting logon storms

Use Oracle Listener connection rate limiter (11gR1+)

listener.ora:

LISTENER=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521)(RATE_LIMIT=5))
    (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1522)(RATE_LIMIT=10))
    (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1523))
  )

Oracle Documentation: Oracle Net Listener Parameters (listener.ora)
http://download.oracle.com/docs/cd/B28359_01/network.111/b28317/listener.htm

Also, it is possible to limit the logoff storm rate
  _logout_storm_rate parameter (instance-wide)


Troubleshooting sporadic system performance issues

Right Data !!!

• Right scope: if your problem lasts for seconds, this should be the granularity of your data too
• OS level data, in addition to the database metrics
  • Ideally OS level metrics sampled multiple times per minute
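Sampling several times per minute needs nothing fancy. Here is a minimal sketch of a timestamped sampling loop; sample_every is a made-up helper name, and the command you sample (vmstat, sar, uptime, or whatever your platform offers) is left as an argument because the right tool differs per OS.

```shell
#!/bin/sh
# sample_every INTERVAL COUNT CMD...: run CMD COUNT times, INTERVAL seconds
# apart, prefixing every output line with the time the sample was taken.
# Hypothetical helper for illustration.
sample_every() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        ts=$(date '+%Y-%m-%d %H:%M:%S')
        "$@" 2>&1 | while IFS= read -r line; do
            printf '%s %s\n' "$ts" "$line"
        done
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Quick self-check with a trivial command. In practice CMD would be an OS
# metrics tool, e.g. an interval of 10 seconds gives 6 samples per minute.
sample_every 1 2 echo "sample"
```

Redirecting the output to a file and leaving it running until the next occurrence gives you the fine-grained OS data the averages were hiding.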


Conclusions

• Logon storms are evil!
  • They will amplify any performance hiccups as they cause extra load just when the resources are scarcest
• Connection pools firing up hundreds of new connections are evil!
  • Know your limits (both max connections and max connect rate / sec)
• Here’s a thought:
  • If you have planned the servers’ capacity to support N-thousand connections anyway (by allowing connection pools to grow that high), why not create this amount of connections right away?
  • This would avoid logon storms during worst times as all connections have already been created!