
The Cloud Story or Less is More…

by Slava Vladyshevsky
slava[at]verizon.com

Dedicated to Lee, Sarah, David, Andy and Jeff, as well as many others,
who went above and beyond to make this possible.

“Cache is evil. Full stop.” – Jeff


Table of Contents

PART I – BUILDING TESTBED
PART II – FIRST TEST
PART III – STORAGE STACK PERFORMANCE
PART IV – DATABASE OPTIMIZATION
PART V – PEELING THE ONION
PART VI – PFSENSE
PART VII – JMETER
PART VIII – ALMOST THERE
PART IX – CASSANDRA
PART X – HAPROXY
PART XI – TOMCAT
PART XII – JAVA
PART XIII – OS OPTIMIZATION
PART XIV – NETWORK STACK

Figure Register

AWS Application Deployment
Initial VCC Application Deployment
First Test Results - Comparison Chart
First Test - High CPU Load on DB Server
First Test - High CPU %iowait on DB Server
First Test - Disk I/O Skew on DB Server
Optimized Storage Subsystem Throughput
AWS i2.8xlarge CPU Load - Sysbench Test Completed in 64.42 sec
VCC 4C-28G CPU Load - Sysbench Test Completed in 283.51 sec
InnoDB Engine Internals
Optimized MySQL DB - QPS Graph
Optimized MySQL DB - TPS and RT Graph
Optimized MySQL DB - RAID Stripe I/O Metrics
Optimized MySQL DB - CPU Metrics
Optimized MySQL DB - Network Metrics
Jennifer APM Console
Initial Application Deployment - Network Diagram
Jennifer XView - Transaction Response Time Scatter Graph
Jennifer APM - Transaction Introspection
Iterative Optimization Progress Chart
Jennifer XView - Transaction Response Time Surges
VCC Cassandra Cluster CPU Usage During the Test
AWS Cassandra Cluster CPU Usage During the Test
High-Level Cassandra Architecture
Jennifer APM - Concurrent Connections and Per-server Arrival Rate
Jennifer APM - Connection Statistics After Optimization
Jennifer APM - DB Connection Pool Usage
JVM Garbage Collection Analysis
JVM Garbage Collection Analysis - Optimized Run
XEN PV Driver and Network Device Architecture
Recommended Network Optimizations
Last Performance Test Results

Table Register

Major Infrastructure Limits
AWS Infrastructure Mapping and Sizing
VCC Infrastructure Mapping and Sizing
Optimized MySQL DB - Recommended Settings
Optimized Cassandra - Recommended Settings
Network Parameter Comparison


PREFACE

One of the market-leading enterprises, hereinafter called the Customer, has multiple business units working in various areas, ranging from consumer electronics to mobile communications and cloud services. One of their strategic initiatives is to expand software capabilities to get on top of the competition.

The Customer started to use the AWS platform for development purposes and as the main hosting platform for their cloud services. Over the past years the usage of AWS grew significantly, with over 30 production applications currently hosted on AWS infrastructure.

While the Customer’s reliance on AWS increased, the number of pain points grew as well. They experienced multiple outages and incurred unnecessarily high costs to grow application performance and to accommodate unbalanced CPU/memory hardware profiles. Although the achieved application performance was satisfactory in general, several major challenges and trends emerged over time:

- Scalability and growth issues
- Very high overall infrastructure and support costs
- Single service provider lock-in

Verizon proposed to trial the Verizon Cloud Compute (VCC) beta product as an alternative hosting platform, with the goal of demonstrating that on-par application performance can be achieved at a much lower cost, effectively addressing one of the biggest challenges. An alternative hosting platform would give the Customer freedom of choice, thus addressing another issue. Last, but not least, the unique VCC platform architecture and infrastructure stack, built for low-latency and high-performance workloads, would definitely help to address another pain point – application performance and scalability.

Senior executives from both companies supported this initiative and one of the Customer’s applications was selected for the proof-of-concept project. The objective was to compare the AWS and VCC deployments side by side from both capability and performance perspectives, execute performance tests and deliver a report to senior management.

The proof-of-concept project was successfully executed in close collaboration between various Verizon teams as well as the Customer’s SMEs. It was demonstrated that the application hosted on the VCC platform, given appropriate tuning, is capable of delivering better performance than when hosted on a more powerful AWS-based footprint.


PART I – BUILDING TESTBED

The agreed high-level plan was clear and straightforward:

• (Verizon) Mirror the AWS hosting infrastructure using the VCC platform
• (Verizon) Set up infrastructure, OS and applications per the specification sheet
• (Customer) Adjust necessary configurations and settings on the VCC platform
• (Customer) Upload test data – 10 million users, 100 million contacts
• (Customer) Execute smoke, performance and aging tests in the AWS environment
• (Customer) Execute smoke, performance and aging tests in the VCC environment
• (Customer) Compare AWS and VCC results and captured metrics
• (Customer) Deliver report to senior management

The high-level diagram below depicts the application infrastructure hosted on the AWS platform.

Figure 1: AWS Application Deployment

Although both the AWS and VCC platforms use the XEN hypervisor at their core, the initial step – mirroring the AWS hosting environment by provisioning equally sized VMs in VCC – raised the first challenge. The Verizon Cloud Compute platform, in its early beta stage, imposed a number of limitations. To be fair, those limitations were neither by design nor hardware limits, but rather software or configuration settings pertinent to the corresponding product release.

The table below summarizes the most important infrastructure limits for both cloud platforms as of February 2014:

Resource Limit              VCC        AWS
VPUs per VM                 8          32
RAM per VM                  28 GB      244 GB
Volumes per VM              5          20+
IOPS per Volume (SSD)       3000       4000
Max Volume Size             1 TB       1 TB
Guaranteed IOPS per VM      15K        40K
Throughput per vNIC         500 Mbps   10 Gbps

Table 1: Major Infrastructure Limits

Besides the obvious points, like the number of CPUs or the huge difference in network throughput, it is also worth mentioning that the CPU-to-RAM ratio (processor count to memory size) is quite different as well: 1:3.5 for VCC versus 1:7.625 for AWS. This ratio is crucial for certain types of applications, specifically for databases.

Despite the aforementioned differences, it was jointly decided with the Customer to move forward with smaller VCC VMs and consider the sizing ratio while comparing performance and test results. This already set the expectation that VCC results might be lower compared to AWS, assuming linear application scalability and a 4-8 times difference in hardware footprint.

The tables below summarize infrastructure sizing and mapping for the corresponding service layers hosted on both cloud platforms. Resources sized differently on the corresponding platforms are highlighted.

VM Role      AWS VM Profile
             Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat       2       4      34.2      -      1000
MySQL        1       32     244       10K    10000
Cassandra    8       8      68.4      5K     1000
HA Proxy     4       2      7.5       -      1000
DB Cache     2       4      34.2      -      1000

Table 2: AWS Infrastructure Mapping and Sizing

VM Role      VCC VM Profile
             Count   VPUs   RAM, GB   IOPS   Net, Mbps
Tomcat       2       4      28        -      500
MySQL        1       4      28        9K     500
Cassandra    12      4      28        5K     500
HA Proxy     4       2      4         -      500
DB Cache     2       4      28        -      500

Table 3: VCC Infrastructure Mapping and Sizing

The initial setup of the disk volumes required special creativity in order to get as close as possible to the required number of IOPS. In addition to the per-disk storage limits mentioned above, there was initially another VCC limitation in place, luckily addressed later: all disks connected to a particular VM had to be provisioned with the exact same IOPS rate.

The most common setup was based on LVM2, with a linear extension for the boot-disk volume group and either two or three additional disks aggregated into an LVM stripe set (a command sketch follows the list below). This setup allowed building disk volumes of up to 3 TB in size and 9000 IOPS, getting close enough to the required 10K IOPS for the database VMs.

Besides the technical limitations, the sheer volume of provisioning and configuration work presented a challenge in itself. The hosting platform requirements were captured in a spreadsheet listing system parameters for every VM. Following this spreadsheet manually and building out the environment sequentially would have required significant time and tremendous manual effort. Additionally, it could have resulted in a number of human errors and omissions. Automating and scripting major parts of the installation and setup process addressed this.

The automation suite, implemented on top of the vzDeploymentFramework shell library (a Verizon internal development), made it possible in a matter of minutes to:

- Parse the specification spreadsheet for inputs and updates
- Generate updated OS and application configurations
- Create LVM volumes or software RAID arrays
- Roll out updated settings to multiple systems based on their functional role
- Change Linux iptables-based firewall configurations across the board
- Validate required connectivity between hosts
- Install required software packages
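As a rough illustration of the striped data-volume layout described above, the commands boil down to something along these lines (device names and stripe parameters are assumptions, not the exact values from the specification sheet):

# boot volume group extended linearly; data disks aggregated into an LVM stripe set
pvcreate /dev/xvdb /dev/xvdc /dev/xvdd
vgcreate vg_data /dev/xvdb /dev/xvdc /dev/xvdd
# 3-way stripe, 256 KiB stripe size, using all available extents
lvcreate --name lv_data --stripes 3 --stripesize 256 --extents 100%FREE vg_data
mkfs.ext4 /dev/vg_data/lv_data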

Having all configurations in a version-controlled repository allowed auditing and comparing configurations between the master and the on-host deployed versions, providing rudimentary configuration management capabilities.

Below is the high-level architecture of the originally implemented test environment.


Figure 2: Initial VCC Application Deployment

The test load was initiated by a JMeter master (test controller and management GUI) and generated by several JMeter slaves (load generators, or test agents). The generated virtual user (VU) requests were load-balanced between two Tomcat application servers, each running a single application instance.

Since F5 LTM instances were not available at build time, the proposed design utilized pfSense appliances as routers, load-balancers or firewalls for the corresponding VLANs.

The Tomcat servers communicated, via another pair of HAProxy load-balancers, with two persistent storage back-ends – MySQL (SQL DB) and Cassandra (NoSQL DB) – employing Couchbase (DB Cache) as a caching layer.


Most systems were additionally instrumented with NMON collectors for gathering key performance metrics. The Jennifer APM application was deployed to perform real-time transaction monitoring and code introspection.

Following the initial plan, the hosting environment was handed over on schedule to the Customer for adjusting configurations and uploading test data.

PART II – FIRST TEST

The first test was conducted on both the AWS and VCC platforms, and the Customer shared the test results.

During the test the load was ramped up in 100 VU increments for each subsequent 10-minute test run. During each run the corresponding number of virtual users performed various API calls, emulating human behavior using patterns observed and measured on the production application.

The chart below depicts the number of application transactions successfully processed by each platform during the 10-minute test runs.

Figure 3: First Test Results - Comparison Chart

It was obvious that the AWS infrastructure was more powerful, processing more than twice the throughput, which did not come as a big surprise. However, the Customer expressed several concerns about overall VCC platform stability, low MySQL DB server performance and uneven load distribution between the striped data volumes, dubbed I/O skews.

The data points behind the comparison chart (TPS per VU count):

VU count       200   300   400   500   600   700   800   900
AWS TPS        321   462   539   627   637   645   651   654
Verizon TPS    203   256   269   257   275   249   268   247


Indeed, the application “transactions per second” (TPS) measurements did not correlate well with the generated application load, and even with a growing number of users something prevented the application from taking off. After short increases, the overall throughput consistently dropped again, clearly pointing to a bottleneck limiting the transaction stream.

According to the Jennifer APM monitors, the increase in application transaction times was caused by slow DB responses, taking 5 seconds or more per single DB operation. At the same time the DB server was showing very high CPU %iowait, fluctuating around 85-90%.

Figure 4: First Test - High CPU Load on DB Server

Figure 5: First Test - High CPU %iowait on DB Server

Furthermore, out of the three stripes making up the data volume, one volume constantly reported significantly higher device wait times and utilization percentages, effectively causing disk I/O skews.

Figure 6: First Test - Disk I/O Skew on DB Server
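For reference, the per-device and per-CPU behavior shown in Figures 4-6 can be captured with standard sysstat utilities, in addition to the NMON collectors mentioned above (the 5-second sampling interval is arbitrary):

# per-device %util, await and average queue size
iostat -xm 5
# per-CPU breakdown, including %iowait
mpstat -P ALL 5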


Obviously, these test results were not acceptable. Investigating and identifying the bottlenecks and performance-limiting factors required good knowledge of the application architecture and its internals, as well as deep VCC product and storage stack knowledge, since the latter two issues seemed to be platform- and infrastructure-related. To address this, a dedicated cross-team taskforce was established.

PART III – STORAGE STACK PERFORMANCE

The VCC storage stack was validated once more and it was reconfirmed that there were no limiting factors or shortcomings in the layers below the block device. The resulting conclusion was that the limitations had to be at the hypervisor, OS or application layer.

On the other hand, the Customer confirmed that the AWS deployment was using exactly the same configuration and application versions as VCC. The only possible logical conclusion was that a setup and configuration optimal for AWS does not perform the same way on VCC. In other words, the VCC platform required its own optimal configuration.

Further efforts were aligned with the following objectives:

- Improve storage throughput and address I/O skews
- Identify the root cause of the low DB server performance
- Improve DB server performance and scalability
- Work with the Customer on improving overall VCC deployment performance
- Re-run the performance tests and demonstrate improved throughput and predictable performance levels

Originally the storage volumes were set up using the Customer’s specifications and OS defaults for the other parameters.

After performing research and a number of component performance tests, several interesting discoveries were made, in particular:

- Different Linux distributions (Ubuntu and CentOS) use different approaches to disk partitioning: Ubuntu aligned partitions to 4K block boundaries, while CentOS did not
- The default block device scheduler, CFQ, is not a good choice in environments using virtualized storage
- The MDADM and LVM volume managers use quite different algorithms for I/O batching and compaction
- The XFS and EXT4 file-systems yield very different results depending on the number of concurrent threads performing I/O
- Due to all the Linux optimizations and multiple caching levels, it is hard enough to measure net storage throughput from within a VM, let alone through the entire application stack

After a number of trials and studying the platform behavior, the following was suggested for achieving optimal I/O performance on the VCC storage stack (a consolidated command sketch follows the list):

- Use raw block devices instead of partitions for RAID stripes to circumvent any partition block-alignment issues
- Use MDADM software RAID instead of LVM (the latter is more flexible and may be used in combination with MDADM, however it performs a certain amount of “optimization” assuming spindle-based storage that may interfere with performance in VCC)
- Use proper stripe settings and block sizes for software RAID (don’t let the system guess – specify!)
- Use the EXT4 file-system instead of XFS. EXT4 provides journaling for both metadata and data instead of metadata only, with negligible performance overhead for the load observed
- Use optimal (and safe) settings for EXT4 file-system creation and mounts
- Ensure the NOOP block device scheduler is used (which lets the underlying storage stack, from the hypervisor down, optimize block I/O more effectively)
- Separate the various I/O profiles, e.g. sequential I/O (redo/bin-log files) and random I/O (data files) for the DB server, by writing the corresponding data to separate logical disks
- Use DIRECT_IO wherever possible and avoid OS/file-system caching (in certain situations the cache may give a false impression of high performance, which is then abruptly interrupted by the flushing of massive caches, during which the entire VM gets blocked)
- Avoid I/O bursts due to cache flushing and keep the device queue length close to 8. This corresponds to a hardware limitation on the chassis NPU. VCC storage is very low latency and quick, but if the storage queue locks up, the entire VM gets blocked. Writing early and often at a consistent rate performs dramatically better under load than caching in RAM as long as possible and then flooding the I/O queue when the cache has been exhausted
- Make sure the network device driver is not competing with the block device drivers and the application for CPU time, by relocating the associated interrupts to different vCPU cores inside the VM
- Use 4K blocks for I/O operations wherever possible for more optimal storage stack operation
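The consolidated sketch below shows what these recommendations amount to for a DB data volume (device names, chunk size and mount point are assumptions; adjust them to the actual VM layout):

# 3-disk RAID0 stripe over raw block devices, 256 KiB chunk
mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=256 /dev/xvdb /dev/xvdc /dev/xvdd
# EXT4 aligned to the stripe: stride = 256K/4K = 64, stripe-width = 64 * 3 = 192
mkfs.ext4 -b 4096 -E stride=64,stripe-width=192 /dev/md0
# data files only; redo/bin-logs go to a separate logical disk
mount -o noatime,nodiratime /dev/md0 /data
# NOOP scheduler on every member device (repeat for xvdc and xvdd)
echo noop > /sys/block/xvdb/queue/scheduler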

After implementing these suggestions on a DB server, the storage subsystem yielded predictable and consistent performance. For example, data volumes set up with 10K IOPS reported ~39 MB/s throughput, which is the expected maximum assuming a 4K I/O block size:

(4K × 10,000 IOPS) / 1024 = 39.06 MB/s, the maximum possible throughput
(4K × 15,000 IOPS) / 1024 = 58.59 MB/s, the maximum possible throughput

With a 15K IOPS setup using 3 stripes (5K IOPS each), a throughput of ~55-56 MB/s was achieved, as shown on the screenshot below:

Page 14: The Cloud Story or Less is More

Figure 7: Optimized Storage Subsystem Throughput

Although some minor deviation in the I/O figures (+/- 5%) was still observed, this is typically considered acceptable and within the normal range.

While performing additional tests on the optimized systems, it was observed that all block device interrupts were being served by CPU0, which was becoming a hot spot even with the netdev interrupts moved off to different CPUs. The following method may be used to spread block device interrupts evenly for the devices implementing RAID stripes:

# inspect current interrupt assignments and affinity masks
cat /proc/interrupts
cat /proc/irq/183[3-6]/smp_affinity*

# distribute block device interrupts between CPU4-CPU7
# (values are hexadecimal CPU bitmasks: 0x10 = CPU4 ... 0x80 = CPU7)
echo 80 > /proc/irq/1836/smp_affinity
echo 40 > /proc/irq/1835/smp_affinity
echo 20 > /proc/irq/1834/smp_affinity
echo 10 > /proc/irq/1833/smp_affinity
# remaining block device interrupt pinned to CPU3 (mask 0x8)
echo 8 > /proc/irq/1838/smp_affinity


Please note that the IRQ numbers and assignments may differ on your system. Consult the /proc/interrupts table for the specific assignments pertinent to your system.

For additional details and theory, please refer to the following online materials:
http://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
https://www.kernel.org/doc/ols/2009/ols2009-pages-235-238.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt

PART IV – DATABASE OPTIMIZATION

Since the Customer had not yet shared the application and testing know-how, the only way to reproduce the abnormal DB behavior seen during the test was to replay the DB transaction log against a DB snapshot recovered from backup. This was a slow, cumbersome and not fully repeatable process. The Percona tools were really instrumental for this task, allowing multithreaded transaction replay while inserting delays between transactions as recorded. A plain SQL script import would have been processed by a single thread only, with all requests processed as one stream.

Although the transaction replay did create some DB server load, the load type and its I/O patterns were quite different from the I/O patterns observed during the test. The transaction logs included only DML statements (insert, update, delete), but no data read (select) requests. Knowing that those “select” requests represented 75% of all requests, it quickly became apparent that such a testing approach was flawed and would not be able to recreate real-life conditions.

We came to a point where more advanced tools and techniques were required for iterating over various DB parameters in a repeatable fashion while measuring their impact on DB performance and the underlying subsystems. Moreover, it was not clear whether the unexpected DB behavior and performance issues were caused by the virtualization infrastructure, the DB engine settings, or the way the DB was used, i.e. the combination of application logic and the data stored in the DB tables.

To separate those concerns it was proposed to perform load tests using synthetic OLTP transactions generated by sysbench, a well-known load-testing toolkit. Such tests were executed on both the VCC and AWS platforms. The results spoke for themselves.
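For reference, a typical sysbench OLTP invocation of that era looks along these lines (the host, credentials, table size and thread count below are placeholders, not the values used in the actual comparison):

# prepare the test table, then run a mixed read/write OLTP workload
sysbench --test=oltp --mysql-host=db-host --mysql-user=sbtest --mysql-password=secret \
         --oltp-table-size=10000000 prepare
sysbench --test=oltp --mysql-host=db-host --mysql-user=sbtest --mysql-password=secret \
         --oltp-table-size=10000000 --num-threads=64 --max-time=300 --max-requests=0 run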

Page 16: The Cloud Story or Less is More

Figure 8: AWS i2.8xlarge CPU Load - Sysbench Test Completed in 64.42 sec

Figure 9: VCC 4C-28G CPU Load - Sysbench Test Completed in 283.51 sec

At this point it was clear that the DB server’s performance issues had nothing to do with the application logic and were not specific to the SQL workload, but were rather related to configuration and infrastructure. The OLTP test provided the capability to stress test the DB engine and optimize it independently, without having to rely on the Customer’s application know-how and the solution-wide test harness.


A thorough research and study of the InnoDB engine began… Studying the source code, as well as consulting the following online resources, was key to a clear understanding of the DB engine internals and its behavior:

- http://www.mysqlperformanceblog.com
- http://www.percona.com
- http://dimitrik.free.fr/blog/
- https://blog.mariadb.org

The drawing below, published by Percona engineers, shows the key factors and settings impacting DB engine throughput and performance.

Figure 10: InnoDB Engine Internals

Obviously, there is no quick win and no single dial to turn in order to achieve the optimal result. The main factors impacting InnoDB engine performance are easy to explain, though optimizing them in practice is quite a challenging task.


InnoDB Performance – Theory and Practice

The two most important parameters for InnoDB performance are innodb_buffer_pool_size and innodb_log_file_size. InnoDB works with data in memory, and all changes to data are performed in memory. In order to survive a crash or system failure, InnoDB logs changes into the InnoDB transaction logs. The size of the InnoDB transaction log defines how many changed pages can be tolerated in memory at any given point in time. The obvious question is: “why can’t we simply use a gigantic InnoDB transaction log?” The answer is that the size of the transaction log affects recovery time after a crash. The rule of thumb (until recently) was: the bigger the log, the longer the recovery time. So we have this other variable, innodb_log_file_size. Let’s imagine it as a distance on an imaginary axis:

Our current state is the checkpoint age, which is the age of the oldest modified, non-flushed page. The checkpoint age is located somewhere between 0 and innodb_log_file_size. Point 0 means there are no modified pages. The checkpoint age can’t grow past innodb_log_file_size, as that would mean we would not be able to recover after a crash.

In fact, InnoDB has two safety nets, or protection points: “async” and “sync”. When the checkpoint age reaches the “async” point, InnoDB tries to flush as many pages as possible while still allowing other queries; however, throughput drops through the floor. The “sync” stage is even worse: when we reach the “sync” point, InnoDB blocks other queries while trying to flush pages and return the checkpoint age to a point before “async”. This is done to prevent the checkpoint age from exceeding innodb_log_file_size. These are both abnormal operational stages for InnoDB and should be avoided at all costs. In current versions of InnoDB, the “sync” point is at about 7/8 of innodb_log_file_size, and the “async” point is at about 6/8 = 3/4 of innodb_log_file_size.

So, there is one critically important balancing act: on the one hand we want the “checkpoint age” to be as large as possible, as it defines performance and throughput; on the other hand, we should never reach the “async” point.


The idea is to define another point T (target), located before “async” in order to leave a gap for flexibility, and to try at all costs to keep the checkpoint age from going past T. We assume that if we can keep the checkpoint age in the range 0 – T, we will achieve stable throughput even for a more or less unpredictable workload.

Now, which factors affect the checkpoint age? When we execute DML queries that change data (insert/update/delete), we write to the log and change pages, and the checkpoint age grows. When we flush changed pages, the checkpoint age goes down again. That means the main way to keep the checkpoint age around point “T” is to change the number of pages flushed per second, or to make this number variable and suited to the specific workload. That way, we can keep the checkpoint age down. If this doesn’t help and the checkpoint age keeps growing beyond “T” towards “async”, we have a second control mechanism: we can add a delay to insert/update/delete operations. This way we prevent the checkpoint age from growing and reaching “async”.

To summarize, the idea of the optimization algorithm is: under load we must keep the checkpoint age around point “T” by increasing or decreasing the number of pages flushed per second. If the checkpoint age continues to grow, we need to throttle throughput to prevent further growth. The throttling depends on the position of the checkpoint age – the closer it gets to “async”, the higher the level of throttling needed.

From Theory to Practice – Test Framework

There is a saying: in theory, there is no difference between theory and practice, but in practice there is…

In practice, there are a lot more variables to bear in mind. Factors such as I/O limits, thread contention and locking also come into play, and improving performance becomes more like solving an equation with a number of interdependent variables…

Obviously, to be able to iterate over various parameter and setting combinations, DB tests need to be executed in a repeatable and well-defined (read: automated) manner, while capturing test results for correlation and further analysis. Quick research showed that although there are many load-testing frameworks available, some specifically tailored for testing MySQL DB performance, unfortunately none of them would cover all requirements and provide the needed tools and automation.

Eventually, we developed our own fully automated and flexible load-testing framework.


This framework was mainly used to stress test and analyze MySQL and InnoDB behavior; nonetheless, it is open enough to plug in any other tools or to be used for testing different applications. The developed toolkit includes the following components:

- Test Runner
- Remote Test Agent (load generator)
- Data Collector (sampler)
- Data Processor
- Graphing facility

Using this framework it was possible to identify the optimal MySQL and InnoDB engine configuration. The goal was to deliver the best possible InnoDB engine performance in terms of transactions and queries served per second (TPS and QPS), while eliminating I/O spikes and achieving consistent and predictable system load – in other words, fulfilling the critically important balancing act mentioned above: keeping the “checkpoint age” as large as possible while trying not to reach the “async” (or, even worse, the “sync”) point.
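For orientation, the dials behind this balancing act are ordinary MySQL server variables. A minimal sketch is shown below; the values are purely illustrative and are not the tuned settings summarized later in Table 4, which remain workload-specific:

# append illustrative InnoDB settings to the server configuration
cat >> /etc/my.cnf <<'EOF'
[mysqld]
innodb_buffer_pool_size        = 20G       # working set kept in memory
innodb_log_file_size           = 2G        # upper bound for the checkpoint age
innodb_flush_method            = O_DIRECT  # bypass the OS page cache (DIRECT_IO)
innodb_io_capacity             = 5000      # flushing rate, matched to provisioned IOPS
innodb_max_dirty_pages_pct     = 75        # cap on modified pages in the buffer pool
innodb_flush_log_at_trx_commit = 1         # durable redo-log write on every commit
EOF
# note: on older MySQL versions resizing the redo log requires a clean shutdown
# and removal of the existing ib_logfile* files before restarting mysqld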

The graphs below show that an optimally configured DB server can easily deliver 1000+ OLTP transactions per second, translating to 20K+ queries per second, generated by 500 concurrent DB connections during a 6-hour test.

Figure 11: Optimized MySQL DB - QPS Graph (queries per second shown in green)

After a warm-up phase the system consistently delivered about 22K queries per second.

Figure 12: Optimized MySQL DB - TPS and RT Graph (transactions per second in green, response time in blue)


After ramping the load up to 500 concurrent users, the system consistently delivered 1200 TPS on average. The average response time of 1600 ms is measured end-to-end and includes both network and communication overhead (~1000 ms) and SQL processing time (~600 ms).

Figure 13: Optimized MySQL DB - RAID Stripe I/O Metrics (%util in red, await in green, avgqu-sz in blue)

It is easy to see that after the warm-up and stabilization phases the disk stripe performed consistently, with an average disk queue size of ~8, which was suggested by the storage team as the optimum value for the VCC storage stack. The “await” iostat metric – the average time for I/O requests to be issued to the device and served – stays constantly below 20 ms. Device utilization is below 25% on average, showing that there is still plenty of spare capacity to serve I/O requests.

Figure 14: Optimized MySQL DB - CPU Metrics (%idle in red, %user in green, %system in blue, %iowait in yellow)

The CPU metrics show that on average 55% of CPU time was idle, 35% was spent in user space (i.e. executing applications), 5% was spent on kernel (system) tasks including interrupt processing, and just 5% was spent waiting for device I/O.


Figure 15: Optimized MySQL DB - Network Metrics (bytes sent in green, bytes received in blue)

The network traffic measurements suggest that the network capacity is fully consumed – in other words, the network is saturated – with ~48 MB/s sent and ~2 MB/s received. This ~50 MB/s of cumulative traffic is very close to the practical maximum throughput that can be achieved on a 500 Mbps network interface (500 Mbps / 8 = 62.5 MB/s theoretical, less protocol overhead).

In plain English this means that the network is the limiting factor here; with other resources still available, the DB server could deliver much higher TPS and QPS figures if additional network capacity were provisioned. The ultimate system capacity limit was not established, due to time constraints and the fact that the Customer’s application did not utilize more than 300 concurrent DB connections.

Optimal DB Configuration

Below is a summary of the major changes between the MySQL database configurations on the AWS and VCC platforms. As with the file-system configuration, the objective was to achieve consistent and predictable performance by avoiding resource usage surges and stalls.

The proposed optimizations may have a positive effect in general; however, they are specific to a certain workload and use-case. Therefore, these optimizations cannot be considered universally applicable in VCC environments and must be tailored to a specific workload. Settings marked with an asterisk (*) are defaults for the DB version used.

<  …  removed  …  >  

Table 4: Optimized MySQL DB - Recommended Settings

Besides the parameter changes listed above, the binary logs (also known as transaction logs) were moved to a separate volume, where an Ext4 file-system was set up with the following parameters:

<  …  removed  …  >  


Further areas for DB improvement:

- Consider using the latest stable Percona XtraDB version – the enhanced InnoDB fork shipped with Percona Server and MariaDB – which provides many improvements, including patches from Google and Facebook:
  o Redesign of the locking subsystem, with no reliance on kernel mutexes
  o Latest versions have removed a number of known contention points, resulting in fewer spins and lock waits and, eventually, better overall performance
  o Buffer pool dump and pre-load features, allowing much quicker startup and warm-up phases
  o Online DDL – changing the schema does not require downtime
  o Better query analyzer and overall query performance
  o Better page compression support and performance
  o Better monitoring and integration with the performance schema
  o A more intelligent flushing algorithm, taking into consideration page change rates, I/O rates, system load and capabilities, and thus providing out of the box performance better adjusted to the workload
  o Better suited for fast SSD-based storage (no added cost for random I/O), with adaptive algorithms that do not attempt to accommodate the shortcomings of spinning disks
  o Scales better on SMP (multi-core) systems and better utilizes a higher number of CPU threads
  o Provides fast checksums (hardware-assisted CRC32), lowering CPU overhead while retaining data consistency and integrity
  o New configuration options allowing the InnoDB engine to be tailored even better to a specific workload
- Consider using a more efficient memory allocator, e.g. jemalloc or tcmalloc (see the sketch after this list):
  o The memory allocator provided as part of GLIBC is known to fall short under high concurrency
  o GLIBC malloc wasn’t designed for multithreaded workloads and has a number of internal contention points
  o Using modern memory allocators suited for high concurrency can significantly improve throughput by reducing internal locking and contention
- Perform DB optimization. While optimizing the infrastructure may result in significant improvement, even better results may be achieved by tailoring the DB structure itself:
  o Consider clustered indexes to avoid locking and contention
  o Consider page compression. Besides a slight CPU penalty, this may significantly improve throughput while reducing on-disk storage several times, resulting in turn in quicker replication and backups
  o Monitor the performance schema to find out more about in-flight DB engine performance and adjust the required parameters
  o Monitor the performance and information schemas to find more details about index effectiveness and build better, more effective indexes
- Perform SQL optimization. No infrastructure optimization can compensate for badly written SQL requests. Caching and other optimization techniques often mask bad code. SQL queries joining multi-million-record tables may work just fine in development and completely break down on a production DB. Continuously analyze the most expensive SQL queries to avoid full table scans and on-disk temporary tables.
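A sketch of the allocator swap suggested above – the library path is an assumption, so check where your distribution actually installs jemalloc:

# verify jemalloc is installed and visible to the dynamic linker
ldconfig -p | grep jemalloc
# have mysqld_safe preload it for mysqld
cat >> /etc/my.cnf <<'EOF'
[mysqld_safe]
malloc-lib = /usr/lib64/libjemalloc.so.1
EOF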

 

PART V – PEELING THE ONION

It is a common saying that performance improvement is like peeling an onion: after addressing one issue, the next one, previously masked, is uncovered, and so on. Likewise, in our case, after addressing the storage and DB layers and improving overall application throughput, it became apparent that something else was holding the application back from delivering the best possible performance. By this time the DB layer was understood very well; however, the overall application stack and the associated connection flows were not yet completely understood.

The Customer demonstrated willingness to cooperate and assisted by providing instructions for reproducing the JMeter load tests, as well as on-site resources for an architecture workshop.

From this point on, the optimization project sped up tremendously. Not only was it possible to iterate reliably and run load tests against the complete application stack; the understanding of the application architecture and access to the Application Performance Management (APM) tool Jennifer made a huge difference in terms of visibility into internal application operation and the major performance metrics.

 


Figure 16: Jennifer APM Console

Besides providing visual feedback and displaying a number of metrics, Jennifer revealed the next bottleneck – the network.

PART VI – PFSENSE

The original network design, replicating the network structure in AWS, was proposed and agreed with the Customer. Separate networks were created to replicate the functionality of the AWS VPC, and pfSense appliances were used to provide network segmentation, routing and load balancing.

<  …  removed  …  >  

Figure 17: Initial Application Deployment - Network Diagram

pfSense is an open-source firewall/router software distribution based on FreeBSD. It is installed on a VM and turns that VM into a dedicated firewall/router for a network. It also provides additional important functions such as load balancing, VPN and DHCP. It is easy to manage through the web-based UI, even for users with little knowledge of the underlying FreeBSD system.

The FreeBSD network stack is known for its exceptional stability and performance. The pfSense appliances had been used many times before and after, so nobody expected issues coming from that side…

Watching the Jennifer XView chart closely in real time is fun in itself, like watching a fire. It is also a powerful analysis tool that helps to understand the behavior of the application components.


Figure 18: Jennifer XView - Transaction Response Time Scatter Graph

On the graph above, the distance between the layers is exactly 10000 ms, pointing to the fact that one of the application services is timing out on a 10-second interval and repeating connection attempts several times.

Figure 19: Jennifer APM - Transaction Introspection

Network socket operations were taking a significant amount of time, resulting in multiple repeated attempts at 10-second intervals.

Following the old sysadmin adage – “…always blame the network…” – the application flows were analyzed again and pfSense was suspected of losing or delaying packets. Interestingly enough, the web UI reported low to moderate VM load and didn’t show any reason for concern.


Nonetheless, console access revealed the truth – the load created by a large number of short thread spins was not properly reported in the web UI and was hidden by averaging calculations. A closer look using advanced CPU and system metrics confirmed that the appliance was experiencing unexpectedly high CPU load, adding to latency and dropping network packets.

Adding more CPUs to the pfSense appliances resulted in a doubling of the network traffic passed through them. However, even with the maximum CPU count the network was not yet saturated, suggesting that the pfSense appliances might still be limiting application performance.

Since the pfSense appliances were not an essential requirement – they were just used to provide routing and load-balancing capability – it was decided to remove them from the application network flow and access the subnets by adding additional network cards to the VMs, with each NIC connected to the corresponding subnet.

To summarize: it would be wrong to conclude that pfSense does not fit the purpose and is not a viable option for building virtual network deployments. Most definitely, additional research and tuning would help to overcome the observed issues. Due to time constraints this area was not fully researched and is still pending thorough investigation.
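For completeness, the console-level checks that expose this kind of hidden load on a FreeBSD-based appliance are of the following sort (standard FreeBSD utilities; exact flags vary by release):

# per-CPU view including kernel and interrupt threads
top -SHP
# interrupt counts and rates per device
vmstat -i
# live per-interface throughput
systat -ifstat 1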

PART VII – JMETER

With pfSense removed and HAProxy used for load balancing, overall application throughput definitely improved. Increasing the number of CPUs on the DB servers and the Cassandra nodes seemed to help as well. The collaborative effort with the Customer yielded great results and we were definitely on the right track.

With the floodgates wide open we were able to push more than 1000 concurrent users during our tests. At about the same time we started seeing another anomaly – one out of three JMeter load agents (generators) was behaving quite strangely. After reaching the end of the test at the 3600-second mark, the java threads belonging to two of the JMeter servers shut down quickly, while the third instance’s shutdown took a while, effectively increasing the test window duration and, as a result, negatively impacting the average test metrics.

All three JMeter servers were reconfigured to use the same settings. For some reason they had been using slightly different configurations and were logging data to different paths. It didn’t resolve the underlying issue, though. Due to time constraints it was decided to build a replacement VM rather than troubleshoot the issues with one of the existing VMs.

Eventually, a fourth JMeter server was deployed. Besides fixing the issue with java thread startup and shutdown, it allowed us to generate higher loads and provided additional flexibility in defining load patterns.


Lesson learned: for low to moderate loads JMeter works just fine. For high loads, JMeter may become a breaking point itself. In this case, it is recommended to use a scale-out approach rather than scale-up, keeping the number of java threads per server below a certain threshold.
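A typical non-GUI, distributed invocation along these lines is driven from the JMeter master (the test plan name and load-generator addresses below are placeholders):

# -R starts the test on the listed remote load generators, -l collects the results
jmeter -n -t contacts_api_testplan.jmx -R 10.0.0.11,10.0.0.12,10.0.0.13,10.0.0.14 -l results.jtl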

PART VIII – ALMOST THERE

Although the AWS performance measurements were still better, we had already significantly improved performance compared to the figures captured during the first round of performance tests.

With pfSense removed, an average of 587 TPS at 800 VU was achieved. In this test the load was spread statically rather than balanced, by manually specifying different target application server IP addresses in the JMeter configuration files. With a HAProxy load-balancer put in place, the TPS figure initially went down to 544 and, after some optimizations (disabling connection tracking and netfilter), increased to 607 TPS at 800 VU – the maximum we had seen to date. This represents a 22% increase over the best previous result (498 TPS at 800 VU, still with pfSense) and a 100% increase over the initial performance test. Overall the results were looking more than promising.
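The connection-tracking change mentioned above amounts to taking netfilter state tracking out of the load-balancer path; one way to do this is sketched below (the exact ruleset on the HAProxy VMs is an assumption):

# bypass connection tracking for all traffic via the raw table
iptables -t raw -A PREROUTING -j NOTRACK
iptables -t raw -A OUTPUT -j NOTRACK
# or, if netfilter is not needed on the load-balancer at all, unload the modules
modprobe -r nf_conntrack_ipv4 nf_conntrack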

Figure 20: Iterative Optimization Progress Chart


Despite the good progress, the following points still required further investigation:

- Disk I/O skew issues still remained
- Cassandra server disk I/O was uneven and quite high

Our enthusiasm rose further as we discovered that the VCC platform could serve more users than AWS. The AWS test results showed that past 600 VU performance started to decline, while we were able to push as high as 1600 VU with the application still supporting the load and showing higher throughput numbers (~760-780 TPS), until…

The next day something happened which became another turning point in this project. The application became unstable and the application throughput we had seen just a couple of hours earlier decreased significantly. More importantly, it started to fluctuate, with the application freezing at random times. The TPS-scatter landscape in Jennifer was showing a new anomaly…

Figure 21: Jennifer XView - Transaction Response Time Surges

Since the other known bottlenecks had been removed and the MySQL DB was no longer the weak link in the chain – basically being bored during the performance test – the Cassandra cluster became the next suspect.

PART IX – CASSANDRA


The Tomcat logs were pointing to Cassandra as well. There were numerous warning messages about excluding one node or another from the connection pool due to connectivity timeouts.

After taking a closer look at the Cassandra nodes, several points drew our attention:

- There was no consistency in the Cassandra ring load
- The amount of data stored on the Cassandra nodes varied significantly
- Memory usage and I/O profiles were different across the board

As a common trend, after a short period of normal operation the average system load on several random Cassandra nodes started growing exponentially, eventually making those nodes unresponsive. During this time the I/O subsystem was over-utilized as well, yielding very high CPU %wait and long queues on the block devices.

Everything pointed to the fact that certain Cassandra nodes initiated compaction (internal data structure optimization) right during the load test, spiraling down in a deadly loop. Another quick conversation with the Customer's architect confirmed the same – it was most likely the SSTable compaction causing the issue.
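One way to confirm this on a suspect node is nodetool, using the same JMX port as in the cluster-wide loops shown later in this chapter (the node name below is illustrative):

# pending and active compactions on a suspect node
./nodetool -h node07 -p 9199 compactionstats
# thread pool statistics; blocked flush-related stages point in the same direction
./nodetool -h node07 -p 9199 tpstats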

 Figure  22:  VCC  Cassandra  Cluster  CPU  Usage  During  the  Test  

As seen on the graph above, during the various test runs one or another Cassandra node maxed out its CPU utilization. The same configuration in AWS had been working just fine, with a not perfect but still reasonably even load and no sustained load spikes.


 Figure  23:  AWS  Cassandra  Cluster  CPU  Usage  During  the  Test  

Comparing the VCC and AWS Cassandra deployments led to quite contradictory conclusions:

- VCC has more nodes – 12 vs. 8 in AWS – but that should improve performance, right?
- AWS is using spinning disks for the Cassandra VMs while the VCC storage stack is SSD-based, which should improve performance too…

Like with MySQL, it was clear that the optimal, or even "good enough", settings taken from AWS are not good, or at times even harmful, on the VCC platform.

For historical reasons the Customer's application uses both SQL and NoSQL databases. When mapping the AWS infrastructure to VCC, it was decided to build the Cassandra ring using 12 nodes in VCC instead of the 8 nodes in AWS, since the latter were a lot more powerful in terms of individual node specifications. As further tests revealed, the better approach would have been just the opposite: to use a bigger number of smaller VMs for the Cassandra cluster. It is also worth mentioning that Cassandra was originally designed to run on a number of low-end systems with slow spinning disks.

During the past couple of years SSDs started to appear more and more often in data centers. While not a commodity yet, SSDs became a heavily used component in modern infrastructures, and the Cassandra codebase was adjusted to make its internal decisions and algorithms suitable for use with SSDs, not only spinning disks. Therefore, deploying the latest stable Cassandra version could have provided additional benefits right away. Unfortunately, the specification required a specific version, and therefore all optimizations were performed against the older version.

Let's have a quick look at Cassandra's architecture and some key definitions.


 Figure  24:  High-­‐Level  Cassandra  Architecture  

Cassandra is a distributed key-value store initially developed at Facebook. It was designed to handle large amounts of data spread across many commodity servers. Cassandra provides high availability through a symmetric architecture that contains no single point of failure and replicates data across nodes.

Cassandra's architecture is a combination of Google's BigTable and Amazon's Dynamo. As in Dynamo's architecture, all Cassandra nodes form a ring that partitions the key space using consistent hashing (see figure above); this is known as a distributed hash table (DHT). The data model and single-node architecture are mainly based on BigTable, including its terminology. Cassandra can be classified as an extensible row store, since it can store a variable number of attributes per row. Each row is accessible through a globally unique key. Although columns can differ per row, columns are grouped into more static column families, which are treated like tables in a relational database. Each column family is stored in separate files. In order to allow the flexibility of a different schema per row, Cassandra stores metadata with each value. The metadata contains the column name as well as a timestamp for versioning.

Like BigTable, Cassandra has an in-memory storage structure called the Memtable, one instance per column family. The Memtable acts as a write cache that allows for fast sequential writes to disk. Data on disk is stored in immutable Sorted String Tables (SSTables). SSTables consist of three structures: a key index, a bloom filter and a data file. The key index points to the rows in the SSTable; the bloom filter enables checking for the existence of keys in the table and, due to its limited size, is also cached in memory. The data file is ordered for faster scanning and merging.

For consistency and fault tolerance, all updates are first written to a sequential log (the Commit Log), after which they can be confirmed. In addition to the Memtable, Cassandra provides an optional row cache and key cache. The row cache stores a consolidated, up-to-date version of a row, while the key cache acts as an index to the SSTables. If these are used, write operations have to keep them updated.


It is worth mentioning that only previously accessed rows are cached in Cassandra, in both caches. As a result, new rows will only be written to the Memtable but not to the caches.

In order to deliver the lowest possible latency and best performance on low-end hardware, writes in Cassandra follow a multi-step process: requests are first written to the commit log, then to a Memtable structure, and eventually, when flushed, they are appended to disk as immutable SSTable files. Over time, as the number of SSTables grows, data becomes fragmented, which impacts read performance.

To put it simply, flushing and compaction operations are vitally important for Cassandra. However, if set up incorrectly or executed at the "wrong" time, they can decrease performance significantly, at times making an entire Cassandra node unresponsive. Exactly this was happening during the tests, when several nodes stopped responding, showed very high system load and performed huge amounts of I/O. Obviously, Cassandra's configuration had been tuned for spinning disks on AWS, resulting in unexpected behavior on the SSD-based VCC storage stack.

As a first measure to gain better visibility into Cassandra's operation, the DataStax OpsCenter application was deployed. It allowed iterating over various parameters and executing a number of tests against the Cassandra cluster while measuring their impact and helping to observe overall cluster behavior.

Applying all the lessons learned earlier and working with the VCC storage team, the following configuration changes were applied:

<  …  removed  …  >  

 

Table  5:  Optimized  Cassandra  -­‐  Recommended  Settings  

Similar to the MySQL optimization, the basic idea is to use more frequent I/O, saturating the block device queues less and, as a result, utilizing the storage stack resources more evenly. Besides the recommended option changes, the commit log was moved to a separate volume. Those changes led to predictable and consistent Cassandra performance, steadily forcing in-memory data to disk, avoiding I/O spikes and minimizing stalls due to compaction. Below is a summary of the volumes created for the Cassandra nodes:

xvda   600 IOPS – boot and root
xvdb   600 IOPS – lvm2 root extension
xvdc  4600 IOPS – data mdadm stripe disk 1 – no partitioning
xvde  4600 IOPS – data mdadm stripe disk 2 – no partitioning
xvdf  4600 IOPS – data mdadm stripe disk 3 – no partitioning
xvdg  5000 IOPS – commit log disk – no partitioning
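For reference, a minimal sketch of how such a layout can be assembled is shown below. The device names match the list above, but the file system, default chunk size and mount points are assumptions, not the project's exact commands.

# stripe the three 4600-IOPS data disks into a single RAID-0 device
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdc /dev/xvde /dev/xvdf
mkfs.ext4 /dev/md0
mkfs.ext4 /dev/xvdg

# mount the data stripe and the commit-log volume separately, then point
# cassandra.yaml (data_file_directories, commitlog_directory) at these paths
mkdir -p /var/lib/cassandra/data /var/lib/cassandra/commitlog
mount -o noatime /dev/md0  /var/lib/cassandra/data
mount -o noatime /dev/xvdg /var/lib/cassandra/commitlog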

There are two more parameters worth mentioning, which control the streaming and compaction throughput limits within the Cassandra cluster. Both values were set to 50 MB/s, which is sufficient for normal cluster operation and in line with the storage sub-system throughput configured on the Cassandra nodes. However, sometimes those thresholds need to be changed. For cluster rebalancing, maintenance and similar operations the following handy shortcuts may be used to adjust the thresholds cluster-wide.

# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setcompactionthroughput 150 ; done
# for n in 01 02 03 04 05 06 07 08 09 10 11 12 ; do ./nodetool -h node$n -p 9199 setstreamthroughput 150 ; done

Obviously,  after  maintenance  has  completed,  those  thresholds  should  be  set  back  to  appropriate  values  for  normal  production  use.    

PART X – HAPROXY

With the DB layer fixed, application performance became stable across tests, although two points were still raising some concerns:

- After an initial spike at the beginning of a load test, the number of concurrent connections abruptly dropped by almost a factor of two
- The number of Virtual User requests reaching each application server differed significantly, sometimes reaching a 1:2 ratio

 

 


Figure  25:  Jennifer  APM  -­‐  Concurrent  Connections  and  Per-­‐server  Arrival  Rate  

It was time to take a closer look at the software load-balancers based on HAProxy. This application is known to be able to serve 100K+ concurrent connections, so just one thousand concurrent connections should not get anywhere close to the limit.

Additional research showed that the round-robin load-balancing scheme was not performing as expected and was concentrating requests on one or another system in an unpredictable manner. The most even request distribution was achieved by using the least-connections (leastconn) algorithm. After implementing this change, the load finally spread evenly across all systems.
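A minimal backend stanza illustrating the change is shown below; the backend name, server names and addresses are placeholders rather than the actual topology.

backend tomcat_app
    balance leastconn              # was: balance roundrobin
    server app01 10.0.1.11:8080 check
    server app02 10.0.1.12:8080 check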

 Figure  26:  Jennifer  APM  -­‐  Connection  Statistics  After  Optimization  

Furthermore, a number of SYN flood kernel warnings in the log files, as well as nf_conntrack complaints (the Linux connection tracking facility used by iptables) about overrun buffers and dropped connections, pointed to the next optimization steps.

Initially, it was decided to increase the size of the connection tracking tables and internal structures and to disable the SYN flood protection mechanisms.

<  …  removed  …  >  

This did show some improvement; eventually, however, it was decided to turn iptables off completely to remove any possible obstacles and latency introduced by this facility.


During the subsequent tests, when the generated load was increased further, HAProxy hit another issue often referred to as "TCP socket exhaustion".

A quick reminder – there were two layers of HAProxies deployed. The first layer was load-balancing the incoming HTTP requests originating from the application clients between the java application server (Tomcat) instances, and the second layer was passing requests from the java application servers to the primary and stand-by MySQL DB servers.

HAProxy works as a reverse proxy and therefore uses its own IP address to establish connections to the server. Most operating systems implementing a TCP stack typically have around 64K (or fewer) TCP source ports available for connections to a remote IP:port. Once a combination of "source IP:port => destination IP:port" is in use, it cannot be re-used. As a consequence, there cannot be more than 64K open connections from a HAProxy box to a single remote IP:port pair.

On the front layer the HTTP request rate was a few hundred per second, so we would never get anywhere near the limit of 64K simultaneous open connections to the remote service. On the backend layer there should not have been more than a couple of hundred persistent connections during peak time, since connection pooling was used on the application server. So this was not the problem either.

It turned out that there was an issue with the MySQL client implementation. When a client sends its "QUIT" sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump revealed this behavior. Note that this issue cannot be reproduced on a loopback interface or on the same system, because the server answers fast enough. But over a LAN connection between two different machines the latency rises past the threshold where the issue becomes apparent. Basically, here is the sequence performed by a MySQL client:

MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client ==>      FIN        ==> MySQL Server
MySQL Client <==    FIN ACK      <== MySQL Server
MySQL Client ==>      ACK        ==> MySQL Server

This results in the client connection remaining unavailable for twice the MSL (Maximum Segment Lifetime), which defaults to 2 minutes. Note that this type of close has no negative impact when the MySQL connection is established over a UNIX socket.
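A quick way to watch this building up on the HAProxy box is to count sockets stuck in TIME_WAIT; the command below assumes the iproute2 ss utility is available on the system.

# sockets currently parked in TIME_WAIT (subtract one line for the header)
ss -tan state time-wait | wc -l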


Explanation of the issue by Charlie Schluting¹: "There is no way for the person who sent the first FIN to get an ACK back for that last ACK. You might want to reread that now. The person that initially closed the connection enters the TIME_WAIT state; in case the other person didn't really get the ACK and thinks the connection is still open. Typically, this lasts one to two minutes."

Since a source port is unavailable to the system for 2 minutes, anything above roughly 533 connection requests per second will lead to TCP source port exhaustion:

64000  (available  ports)  /  120  (number  of  seconds  in  2  minutes)  =  533.333  

This TCP port exhaustion can occur on a direct MySQL client-server connection as well, but it also happens through HAProxy, because it forwards the client traffic to the server. And since several clients were talking to the same HAProxy, it happened much faster on the HAProxy.

However, this does not explain the front-side HAProxy issues, where HTTP connections were used, not the MySQL protocol.

The problem: keep-alive. A very useful feature in the past, when servers were huge yet slow and client concurrency was generally much lower. Formerly, you would think twice before forking another process to serve the next incoming connection on a web server – the process creation overhead was way too expensive.

With server hardware becoming more and more powerful and the Linux kernel and software stack getting more and more optimized, most server implementations today use threads and can pick up new connections in a very fast and efficient manner. In the modern world, specifically with the advent of REST and web services, short-lived stateless connections are much more favorable. As the number of clients and the concurrency grow, it becomes less and less optimal to keep sockets busy anticipating another request from the same client.

Another lesson learned: be very conservative with the keep-alive feature and consider turning it off or reducing the keep-alive timeout significantly for certain use-cases. This was addressed in the Tomcat connector configuration as well; see the corresponding chapter.

While HAProxy provides a number of options and mechanisms for dealing with connection time-outs and keep-alive HTTP connections, it still operates above the transport layer and may not always be able to help with half-closed TCP connections.
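As an illustration of the HAProxy side of this, the directives below are the standard knobs for taming keep-alive on the front layer; the timeout values are examples only, not the settings used in the project.

defaults
    mode http
    option http-server-close       # close the server-side connection after each response
    timeout http-keep-alive 1s     # keep the client-side keep-alive window very short
    timeout client 30s
    timeout server 30s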

¹ Taken from http://www.enterprisenetworkingplanet.com/print/netsp/article.php/3595616/Networking-101-TCP-In-More-Depth.htm


So how can TCP source port exhaustion be avoided?

First, a "clean" sequence should be used (spot the difference from the one above ;)

MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client <==      FIN        <== MySQL Server
MySQL Client ==>    FIN ACK      ==> MySQL Server
MySQL Client <==      ACK        <== MySQL Server

Actually, this sequence occurs when both the MySQL client and server are hosted on the same box and use the loopback interface, which is why it was mentioned earlier that added latency between the client and the server is crucial to reproduce the issue. Until the MySQL developers rewrite the code to follow the sequence above, there won't be any improvement here!

Second, increase the source port range.

By default, on a Linux box, there are around 28K source ports available for a single IP:port tuple:

$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    61000

This  limit  can  be  increased  to  close  to  64K  source  ports:    

<  …  removed  …  >  

And don't forget to update the /etc/sysctl.conf file. It is a good idea to add this configuration to the busiest network servers.

Third, allow usage of source ports in TIME_WAIT.

A few settings can be used to tell the kernel to reuse or recycle connections in the TIME_WAIT state faster:

<  …  removed  …  >  

The tw_reuse option can be used safely, but be careful with tw_recycle – it may have side effects.

Fourth, use multiple IPs to connect to a single server.


 In  the  HAProxy  configuration,  the  source  IP  address  to  be  used  to  establish  a  connection  to  a  server  can  be  specified  on  the  server  line.  Additional  server  lines  using  different  source  IP  addresses  can  be  added.  In  the  example  below,  the  source  IPs  10.0.0.100  and  10.0.0.101  are  local  IP  addresses  of  the  HAProxy  box:  

[...]
    server mysql_A    10.0.0.1:3306 check source 10.0.0.100
    server mysql_B    10.0.0.1:3306 check source 10.0.0.101
[...]

This would allow us to open up to 2 x 64K = 128K outgoing TCP connections. The kernel is responsible for selecting a new TCP port when HAProxy requests one. Despite improving things a bit, we may still run into source port exhaustion down the road, which would happen at around 80K connections in TIME_WAIT.

Fifth and last, let HAProxy manage the TCP source ports.

You can let HAProxy decide which source port to use when opening a new TCP connection, instead of leaving it to the kernel. For this purpose HAProxy has built-in functions which are more efficient than those of a regular kernel. Let's update the configuration above accordingly:

[...]
    server mysql_A    10.0.0.1:3306 check source 10.0.0.100:1025-65000
    server mysql_B    10.0.0.1:3306 check source 10.0.0.101:1025-65000
[...]

A test showed 170K+ connections in TIME_WAIT with 4 source IPs while avoiding source port exhaustion.

As explained in the OS Optimization chapter, since HAProxy is a single-threaded, non-blocking application, it may also be a good idea to pin the haproxy process to a specific CPU, allowing the other CPUs to handle netdev and blkdev interrupts.

# service haproxy status
haproxy (pid 29621) is running...
# taskset -c -p 3 29621
pid 29621's current affinity list: 0-3
pid 29621's new affinity list: 3

Although  this  additional  step  will  not  result  in  a  significant  performance  boost,  it  will  make  system  operation  much  smoother  and  will  better  utilize  available  system  resources.    
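Recent HAProxy versions can also express the same pinning in the configuration itself, which survives restarts without re-running taskset; the sketch below assumes a single process and the same CPU core as in the example above.

global
    nbproc  1
    cpu-map 1 3     # bind haproxy process #1 to CPU core 3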


PART XI – TOMCAT

The last, or actually the topmost, application in the stack was the Java application server – Tomcat. Within Tomcat itself there is not much to tune, but the few points below can make a huge difference.

Tomcat Connectors Configuration

The following shows the resulting connector configuration with the modified or added options highlighted:

<  …  removed  …  >  

 In  plain  English:  

- We don't want to resolve hostnames for incoming requests
- We do want to get rid of keep-alive by all means, since the application is essentially a web service serving short stateless requests. Allowing keep-alive would quickly hog all connector threads; they would not be available to accept new requests and would not be returned to the pool until timed out or until the client closes the connection, with the latter typically not happening since the clients are trying to keep the connection alive in the first place.

Generally speaking, in the options above the maximum thread limit is also set a bit too high. It may be a better idea to run a cluster configuration with several instances and a lower thread count per instance (e.g. 2 x 500) in order to lessen the thread management overhead. In practice, any configuration change like this requires thorough testing, and due to time constraints a cluster configuration was not tested.

Tomcat, Java and Application Logging

The java application logging was a real resource hog in terms of I/O, disk space and CPU utilization. Application messages were logged to several pipelines, including the console (stdout and stderr), and each 10-minute test run generated about 10-20 GB worth of log files. Yes, gigabytes, not megabytes. At this point it shall be noted that, with all admiration for Jennifer, this toolkit is also very generous when it comes to logging; Jennifer APM itself produced a few GB of log files after every test run…

Obviously, without a proper log rotation facility and configuration in place, those logs were exhausting the underlying disk volumes rather quickly, resulting in unpredictable application server behavior and possible system failure.

The logging configuration was adjusted:

- To avoid duplicate logging
- To use the valve logging facility, which performs better
- To decrease the amount of log messages by raising the log severity threshold to WARN or ERROR, depending on the use-case

This resulted in a significant drop in I/O, with the request volume down by almost an order of magnitude. Correspondingly, log file sizes were reduced by an order of magnitude without sacrificing any information important for operations.

DB Connection Pooling

Finally, since DB requests were now processed within 500 ms on average, there was no longer a need to keep an overblown connection pool to accommodate DB slowness and delays. Therefore, the connection pool configuration was adjusted as well:

<  …  removed  …  >  

This resulted in more effective connection pool utilization. The example below shows pool usage averaging about 45-48 connections for 900 virtual users.

 Figure  27:  Jennifer  APM  -­‐  DB  Connection  Pool  Usage  

Obviously, connection pool timeouts and eviction rules have to be set up in concert with the DB server connection timeouts to decrease the number of half-closed connections on either end.

Further areas for improving Tomcat performance and throughput:

- Fewer threads mean a smaller memory footprint and less context switching, resulting in better CPU cache utilization. Therefore, it is generally recommended to start with a lower number of threads. If Tomcat's thread pool is exhausted too quickly, it is worth investigating further to establish the root cause. Is it a problem of individual requests taking too long? Are threads returning to the pool? If, for example, database connections are not released, threads pile up waiting to obtain a database connection, making it impossible to process additional requests. This might indicate a problem in the application code.

-­‐ Generally  it  is  not  recommended  to  configure  a  single  connector  with  more  than  500-­‐750  threads.  If  there  is  such  a  need,  it’s  worth  looking  at  setting  up  a  cluster  configuration  with  several  instances  


-­‐ It’s  recommended  to  validate  the  Tomcat  deployment  and  remove  unused  classes  and  libraries  to  reduce  the  overall  memory  footprint  

-­‐ It  may  be  worthwhile  to  test  different  database  connection  pool  (DBCP)  implementations  

   

PART XII – JAVA

When dealing with any kind of Java application, the immediate concern that comes to mind is Garbage Collection. While analyzing GC behavior it was established that full garbage collections were happening more often than desired.

 Figure  28:  JVM  Garbage  Collection  Analysis  

From the chart above one can see that during the test run the application executed quite a few full garbage collections, leading to sub-optimal application throughput of around 91%.
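For reference, the data behind such an analysis is typically captured by enabling GC logging on the application server JVM. The flags below are standard pre-Java-9 HotSpot options; the setenv.sh location and the log path are assumptions, not the project's actual configuration.

# e.g. in $CATALINA_BASE/bin/setenv.sh
CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/var/log/tomcat/gc.log"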


After minor tuning of the garbage collection algorithm, throughput increased while full garbage collections were almost completely avoided.

<  …  removed  …  >  

 

 Figure  29:  JVM  Garbage  Collection  Analysis  –  Optimized  Run  

That having been said, a thorough study and full JVM tuning was not possible due to the time constraints, and there is still some room for improving java application throughput while reducing the latency caused by garbage collection. It is recommended to further study the application's memory allocation patterns and do a proper sizing of the JVM memory segments.

Besides the usual suspects, there are a couple more minor points to watch out for. On systems where both IPv4 and IPv6 stacks are supported, the JVM may become confused and try to use IPv6 where IPv4 is expected. You can either disable this behavior using JVM startup options or disable IPv6 addressing on the OS level:

<  …  removed  …  >  


PART XIII – OS OPTIMIZATION

OS optimization could be a subject for a whole book on generic tuning and improvements, yet each and every use-case and application will require a specific approach. Due to the lack of time and various constraints it was decided to take a holistic approach and address the low-hanging fruit that would have the most impact across the board.

Following the application flows, packets first enter the system via a network interface, so the network device itself and the TCP stack were the first points where generic optimizations could be applied. As a next step, the application processes the received network data in memory while reading from and writing to persistent storage. The storage subsystem optimizations have already been covered, so here we will be looking for possible contention points between those two subsystems. Memory allocation and memory pressure is another area where some generic optimizations are possible.

Finally, since the application was running in a multitasking environment, we wanted to make sure that it received a higher priority and all required resources, while other non-critical processes were assigned a lower priority and did not compete for resources with the prime application.

When taking any optimization steps it is important to understand that a modern OS is a coherent and very well balanced construct. Applying certain changes to various OS parameters may cause an imbalance elsewhere and may lead to performance deterioration in the long run and to negative overall results. Therefore, it is mandatory to test all optimization steps thoroughly prior to applying them to production systems.

Having the above disclaimer in place, let's get to the bits and bytes.

PART XIV – NETWORK STACK

Setting up an efficient network can be a daunting task. In contrast to the physical server world, on a virtualized infrastructure we have to consider both the physical NICs available to the hypervisor and the virtual network devices as seen by the VMs. There are many possible scenarios where network throughput can be relevant:

• Hypervisor Xen dom0 throughput: The traffic sent/received directly by `dom0`.
• Single-VM throughput: The traffic sent/received by a single VM.
• Multi-VM throughput: The traffic sent/received by multiple VMs, concurrently. Here we are interested in aggregate network throughput.


• Single-­‐VCPU  VM  throughput:  The  traffic  sent/received  by  VMs  using  a  single  VCPU  only.  

• Single-VCPU single-TCP-thread VM throughput: The traffic sent/received by a single TCP thread in single-VCPU VMs.

• Multi-VCPU VM throughput: The traffic sent/received by multi-VCPU VMs.
• Network throughput for storage: The traffic sent/received for virtualized storage access, which uses different underlying physical NICs.

The figure below applies to PV XEN guests, and to HVM guests with PV drivers.

 Figure  30:  XEN  PV  Driver  and  Network  Device  Architecture  


When  a  process  in  a  VM,  e.g.  a  VM  with  domID  =  X,  wants  to  send  a  network  packet,  the  following  occurs:  

1. A  process  in  the  VM  generates  a  network  packet  P,  and  sends  it  to  a  VM's  virtual  network  interface  (VIF),  e.g.  ethY_n  for  some  network  Y  and  some  connection  n.  

2. The  driver  for  that  VIF,  netfront  driver,  then  shares  the  memory  page  (which  contains  the  packet  P)  with  the  backend  domain  by  establishing  a  new  grant  entry.  A  grant  reference  is  part  of  the  request  pushed  onto  transmit  shared  ring  (Tx  Ring).  

3. The  netfront  driver  then  notifies,  via  an  event  channel  (not  depicted  in  the  diagram),  one  of  the  netback  threads  in  dom0  (the  one  responsible  for  ethY_n)  where  in  the  shared  pages  the  packet  P  is  stored.  

4. The netback (in dom0) fetches P, processes it, and forwards it to vifX.Y_n;
5. The packet is then handed to the back-end network stack, where it is treated according to its configuration just like any other packet arriving on a network device.

When a VM is to receive a packet, the process is almost the reverse of the above. The key difference is that on receive a copy is being made: it happens in dom0, and it is a copy from back-end owned memory into a Tx Buf which the guest has granted to the back-end domain. The grant references to these buffers are in the requests on the Rx Ring (not the Tx Ring).

Sounds easy, right? One of the promises of virtualization was to remove complexity from technology; from a performance tuning point of view, the complexity has just increased and shifted under the hypervisor hood…

For a complete technical explanation it is recommended to refer to the XEN Wiki, from which the information above was taken.

As a short synopsis, in order to achieve the best throughput, the following recommendations have to be considered:

• Proper PV drivers must be in place, so that the VM can make full use of the underlying hardware.

• Enabling NIC offloading may help to save some CPU cycles
• It is recommended to use multi-threaded applications for sending/receiving network traffic. This gives the OS a better chance to distribute the workload among multiple CPUs

• For  some  use-­‐cases  it  may  be  beneficial  to  consider  using  several  1VCPU  load-­‐balanced  VMs  rather  than  one  huge  VM  with  multiple  VCPUs.  

• If  the  network  driver  heavily  utilizes  one  of  the  available  VCPUs,  consider  associating  the  application  with  one  or  more  less  loaded  VCPU,  thus  reducing  VCPU  contention  and  better  utilizing  resources.  

• Consider  using  modern  Linux  kernels  where  the  underlying  architecture  has  been  improved  so  that  the  VM's  non-­‐first  VCPUs  can  process  interrupt  requests  

• Check  in  /proc/interrupts  whether  your  device  exposes  multiple  interrupt  queues.  If  the  device  supports  this  feature,  make  sure  that  it  is  enabled.  


• If the device supports multiple interrupt queues, distribute their processing either automatically (by using the irqbalance daemon) or manually (by setting /proc/irq/<irq-no>/smp_affinity) to all or a selected subset of VCPUs – see the sketch after this list.

• Enable  Jumbo  Frames  for  the  whole  connection.  This  should  decrease  the  number  of  interrupts,  and  therefore  decrease  the  load  on  the  associated  VCPUs  (for  a  specific  amount  of  network  traffic).  

• If  a  host  has  spare  CPU  capacity,  give  more  VCPUs  to  dom0,  increase  the  number  of  netback  threads,  and  restart  the  VMs  (to  force  re-­‐allocation  of  VIFs  to  netback  threads).  

• Experiment  with  the  TCP  parameters,  e.g.  window  size  and  message  size  to  identify  the  ideal  combination  for  your  workload  and  scenario.  
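As an illustration of the interrupt-queue items above, checking and steering IRQ affinity from inside a guest boils down to the following; the interface name and IRQ number are purely illustrative.

# list the interrupt lines (and queues, if any) exposed by the virtual NIC
grep eth0 /proc/interrupts
# steer IRQ 42 to VCPUs 2 and 3 (hex mask 0xc), keeping VCPU 0 free for the application
echo c > /proc/irq/42/smp_affinity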

The VCC platform is a shared environment, therefore changes to the hypervisor and dom0 have to go through the change and release management process and, if approved by the QA team, may be included in the next release. This is to say that ad-hoc changes to the network stack outside of the VMs were not considered during the project, with the exception of Jumbo frames. This feature was already available and supported and was just lacking the automation that would orchestrate setting the MTU size on both the guest- and hypervisor-side network devices. Therefore, only changes limited to the guest VM network settings have been performed, as outlined below.

<  …  removed  …  >  

Figure  31:  Recommended  Network  Optimizations  

Besides the measures mentioned above, some additional research was performed into improving the TCP stack settings. These are the default settings:

net.core.rmem_max = 229376
net.core.wmem_max = 229376
net.ipv4.tcp_rmem = 4096    87380   4194304
net.ipv4.tcp_wmem = 4096    16384   4194304
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_congestion_control = cubic
txqueuelen: 1000
generic-segmentation-offload: on
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_fin_timeout = 60

A number of tests were conducted using iperf to measure the throughput between two systems in the same subnet; the average throughput is shown below.
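A typical invocation for this kind of measurement looks like the following; the peer address, test duration and reporting interval are illustrative and not necessarily the exact flags used.

# on the receiving system
iperf -s
# on the sending system: 20-second run with per-second interval reports
iperf -c 10.0.0.2 -t 20 -i 1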


[ ID] Interval       Transfer     Bandwidth
[ 24]  0.0- 1.0 sec  57.2 MBytes  480 Mbits/sec
[ 24]  1.0- 2.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  2.0- 3.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  3.0- 4.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  4.0- 5.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  5.0- 6.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  6.0- 7.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  7.0- 8.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  8.0- 9.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  9.0-10.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 10.0-11.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 11.0-12.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 12.0-13.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 13.0-14.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 14.0-15.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 15.0-16.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 16.0-17.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 17.0-18.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 18.0-19.0 sec  57.0 MBytes  478 Mbits/sec
[ 24] 19.0-20.0 sec  57.0 MBytes  478 Mbits/sec
[ 24]  0.0-20.1 sec  1144 MBytes  478 Mbits/sec

After adjusting the TCP stack using the script below:

<  …  removed  …  >  

 An  additional  ~0.5  MB/s  throughput  was  achieved:  

[ 35]  0.0- 1.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  1.0- 2.0 sec  57.5 MBytes  483 Mbits/sec
[ 35]  2.0- 3.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  3.0- 4.0 sec  57.2 MBytes  480 Mbits/sec
[ 35]  4.0- 5.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  5.0- 6.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  6.0- 7.0 sec  57.6 MBytes  483 Mbits/sec
[ 35]  7.0- 8.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  8.0- 9.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  9.0-10.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 10.0-11.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 11.0-12.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 12.0-13.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 13.0-14.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 14.0-15.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 15.0-16.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 16.0-17.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 17.0-18.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 18.0-19.0 sec  57.5 MBytes  482 Mbits/sec
[ 35] 19.0-20.0 sec  57.5 MBytes  482 Mbits/sec
[ 35]  0.0-20.2 sec  1148 MBytes  482 Mbits/sec

The Bandwidth-Delay Product (BDP) settings above were sized for a 10GE connection that we may have in VCC one day, but they should be adjusted further for the current real-life scenario. For the currently available 500 Mbit bandwidth, the BDP values have to be re-calculated and set much more conservatively; for example, assuming a 2 ms round-trip time, 500 Mbit/s x 0.002 s is roughly 125 KB, so buffer maximums in the hundreds of kilobytes rather than megabytes would be appropriate.

AWS Optimizations

Since the beginning of the project there was some suspicion along the lines of: "AWS can't be using off-the-shelf default settings. They must be tuning their instances to perform best on their infrastructure, and those changes may vary from specialized kernel settings and drivers to certain application settings allowing them to realize the full potential." Only at the end of the project did we find out that this suspicion was well founded; the table below shows some network stack settings and their values at the beginning of the journey:

<  …  removed  …  >  

Table  6:  Network  Parameter  Comparison  

On the positive side, by this time we had already come up with these optimizations and applied them as well. However, those options just scratch the surface, and the number of settings and combinations that might have been, or may still be, tuned is pretty much countless. You may refer to this slide deck for more details: http://www.slideshare.net/cpwatson/cpn302-yourlinuxamioptimizationandperformance


CONCLUSION – LESSONS LEARNED

Since the very early stages of the project there was a question: are we comparing apples to apples, or apples to oranges? Some differences between the AWS and VCC platforms may be as obvious as the number of VMs, CPUs, gigabytes of RAM and IOPS. These are apparent and easy to count with some napkin math. However, looking beyond those numbers, it quickly becomes clear that while the temptation is high to conclude that 8 CPUs are going to perform better than 4, reality shows that there is no quick and simple answer. There are a number of unknowns associated with every resource type, e.g.:

- VCPU vs. VCPU:
  o Are we talking about cores, native threads or hyper-threads?
  o What is the CPU and bus frequency on the physical host?
  o What are the physical CPU generation and family used?

- RAM vs. RAM:
  o Are we talking about on-board RAM, directly addressable by the CPU?
  o Are we talking about a NUMA architecture?
  o What is the bus speed?
  o What bus architecture is used?
  o How many memory access channels are there?

- IOPS vs. IOPS:
  o Are those IOPS backed by SSDs or rotating disks?
  o What storage controller is used?
  o What storage stack is used?
  o Are those IOPS guaranteed or just provisioned?
  o Etc…

As the saying goes, the devil is in the details. Comparing the resources provided by various platforms can help with some initial rough sizing decisions, but it cannot be considered a reliable metric for measuring cloud infrastructure performance.

Obviously, there is a need for some artificial metric, let's call it, hmm… *cloud stones*, that may be used to assess various cloud platforms. A simplistic way of measuring would then be to say: the AWS platform is capable of delivering X cloud stones, GAE is able to do Y and VCC is pushing the limits with Z. But does it make any sense? No, it does not. How would you know how many *cloud stones* your application needs?

Another popular approach is to tie those *cloud stones* to cost and create a chart per cloud provider, showing which one delivers the best bang for the buck. This makes decisions somewhat simpler, since the budget is presumably known and it is relatively easy to see where your investment will result in the best possible delivered performance. Yet possibly delivered performance does not equal realized performance; this is where the crux is, and this is why less can really become more.


The right question to ask would be: how do you realize the maximum performance for your application? Obviously, there are countless guides available, provided by vendors, by both sponsored and independent consulting entities, and by myriads of bloggers on the net. Will they help? Possibly. It is a win-or-lose game: one optimization can boost performance, another can diminish it again. And the chance that there is a combination that perfectly matches your application and infrastructure is far lower than that of winning the lottery. So the question is still open: what is the right recipe for realizing the best possible performance?

First of all, there is no magic bullet or setting that can be universally applied with consistent and predictable results. Second, even following best practices and vendor prescriptions won't necessarily provide the best possible outcome. Using architectural blueprints can help avoid known mistakes, but will not address unknown ones.

The only way is to work iteratively, by clearly setting your goals and employing repeatable automated testing. The optimization process follows these steps:

- Execute the test and capture performance metrics
- Correlate and interpret the metrics
- Identify the bottleneck and understand the root cause
- Address the bottleneck by implementing well documented changes
- Repeat the process until you have achieved your objectives

It may seem that in certain cases this loop could be endless, since improving one metric may conflict with another, and common wisdom says you cannot have your cake and eat it too. In reality, however, with each cycle you will learn a lot about your application: how it interacts with the infrastructure and, the other way around, how the infrastructure tolerates certain application shortcomings. Very soon you will become more and more effective in identifying the next optimization step. To put it simply: optimization is an iterative learning process, not a tweak or a milestone.

Coming back to the proof-of-concept project, this is exactly what has been done: building a repeatable test framework, decomposing the whole application into subsystems, which were tuned in isolation and then integrated again and optimized in combination, with the associated re-testing. And here is the result in a picture that says more than a thousand words.


 Figure  32:  Last  Performance  Test  Results  

Interesting questions related to this might be:
- Would the same approach help to improve application performance on the AWS platform, while reducing the infrastructure footprint? Yes, definitely.

- Can the findings outlined above be reused for the AWS platform? Some – maybe; others – unlikely, and in most cases not.

And the ultimate question: which platform is the best for your application? This can be answered quite simply: the one that helps you realize all the performance your workload requires and that you are paying for.