capacity planning for LAMP: what happens after you’re scalable (MySQL Conf and Expo, April 2007)

Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


Page 1: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

capacity planning for LAMP

what happens after you’re scalable

MySQL Conf and Expo, April 2007

Page 2: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

John Allspaw

• Engineering Manager (Operations) at Flickr (Yahoo!)


This is me. I work at Flickr.

Page 3: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


• You’re scalable! (or not)

• Now you can simply add hardware as you need capacity.

• (right ?)

Page 4: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

• But:

• How many servers ?

Page 5: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

BUT, um, wait....

• How many databases ?

• How many webservers ?

• How much shared storage ?

• How many network switches ?

• What about caching ?

• How many CPUs in all of these ?

• How much RAM ?

• How many drives in each ?

• WHEN should we order all of these ?

All of these are very legitimate questions. But what is capacity ?

Page 6: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

some stats

• - ~35M photos in squid cache (total)

• - ~2M photos in squid’s RAM

• - ~470M photos, 4 or 5 sizes of each

• - 38k req/sec to memcached (12M objects)

• - 2 PB raw storage (consumed about 1.5TB on Sunday)

Page 7: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


Increased growth (usage) means needing capacity.

SCALABILITY (horizontal or vertical) = ability to easily add capacity to accommodate growth.

We’ll talk about MySQL, squid, and memcached.

Page 8: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


Stop performance tuning. Stop tweaking. Accept what performance you do have now, and make predictions based on that. Capacity isn’t performance.

Page 9: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

capacity is for business

Bring capacity planning into the product discussion EARLY.

Get buy-in from the $$$ people (and engineering management) that it’s something to watch.

We’ll talk about MySQL, Squid, and memcached.

Page 10: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

[Figure: "too soon" and "too late", with "buying enough for now" between them]






Don’t buy too much equipment just because you’re scared/happy that your site will explode.

Page 11: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

3 main parts

• - Planning (what ?/why ?/when ?)

• - Deployment (install/config/manage)

• - Measurement (graph the world)

Planning includes realizing what you have right NOW, and predicting what you’ll need later.

Deployment includes making sure you can deploy new capacity easily.

Measurement is ongoing, all the time. Save as much data as you can.

Page 12: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

boring queueing theory

• Forced Flow Law: $X_i = V_i \times X_0$

• Little’s Law: $N = X \times R$

• Service Demand Law: $D_i = V_i \times S_i = U_i / X_0$

We can use these...but they’re boring. And take a long time. Too long.

Don’t read books with these equations in them to learn about capacity planning for web operations.
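For reference, here is a small sketch of what the three laws compute. The slides do not define the symbols, so the meanings below are the usual textbook ones and are an assumption here: X_0 is system throughput, V_i is visits to resource i per request, S_i is service time per visit, U_i is utilization of resource i, N is requests in flight, R is response time. All numbers are made up.

```python
def forced_flow(visits_per_request: float, system_throughput: float) -> float:
    """Forced Flow Law: X_i = V_i * X_0 (throughput seen by resource i)."""
    return visits_per_request * system_throughput


def littles_law(throughput: float, response_time: float) -> float:
    """Little's Law: N = X * R (average number of requests in the system)."""
    return throughput * response_time


def service_demand(visits_per_request: float, service_time: float) -> float:
    """Service Demand Law: D_i = V_i * S_i (seconds of resource i per request)."""
    return visits_per_request * service_time


if __name__ == "__main__":
    x0 = 100.0    # hypothetical: page requests per second for the whole site
    v_db = 5.0    # hypothetical: database queries per page request
    s_db = 0.001  # hypothetical: seconds of database time per query

    d_db = service_demand(v_db, s_db)
    print("db queries/sec:", forced_flow(v_db, x0))
    print("db seconds per page request:", d_db)
    # Rearranging D_i = U_i / X_0 gives the utilization U_i = D_i * X_0.
    print("db utilization:", d_db * x0)
    print("requests in flight at 0.2 s response time:", littles_law(x0, 0.2))
```

The point of the slide still stands: these need per-resource visit counts and service times that are tedious to collect, which is why the rest of the talk works from measured production numbers instead.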

Page 13: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

my theory

• capacity planning math should be based on real things, not abstract ones.

I don’t have time for a dissertation on how many MySQL machines we’ll need in an abstract “future”.

Page 14: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

predicting the future

Can’t predict the future until you know what the past and present are like.

Must find out what you have right now for capacity.

TWO TYPES OF CAPACITY: consumable, and concurrent/peak-driven

Page 15: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


Disk space, RAM caches, bandwidth (sorta consumable). (Like candy)

Page 16: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

concurrent usage

Concurrent usage: MySQL, memcached, squid, apache....things that don’t deplete over time.

The trick here is to plan for peaks.

(Like engines)

Page 17: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

considerations: social applications

• - Have the ‘network effect’

• - Exponential growth


more users means more content

more content means more connections

more connections means more usage

etc., etc., etc.

Page 18: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

considerations: social applications

• Event-related growth

• (press, news event, social trends, etc.)

• Examples:

• London bombing, holidays, tsunamis, etc.

We get 20-40% more uploads on the first work day of the year than at any peak during the previous year.

40-50% more uploads on Sundays than the rest of the week, on average.

Page 19: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

What do you have NOW ?

• When will your current capacity be depleted or outgrown ?

Predicting the future is hard, since it’s impossible.

Try to graph usage per resource (cluster) and plot how that changes over time.
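One rough way to turn such a graph into a date is to fit a trend to recent samples and see where it crosses the limit. The sketch below fits a straight line with ordinary least squares; the function names and sample numbers are hypothetical, and since growth is rarely a straight line for long, treat the result as a short-horizon estimate to be refit regularly.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used, capacity):
    """Days after the last sample until the fitted line reaches capacity.

    Returns None when usage is flat or shrinking (no crossing ahead).
    """
    slope, intercept = fit_line(days, used)
    if slope <= 0:
        return None
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - days[-1])


if __name__ == "__main__":
    # Hypothetical daily samples of disk consumed on one cluster, in TB.
    days = [0, 1, 2, 3, 4, 5, 6]
    used_tb = [100.0, 101.2, 102.1, 103.5, 104.4, 105.9, 107.0]
    print("days of headroom left:", days_until_full(days, used_tb, capacity=150.0))
```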

Page 20: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

finding ceilings

• MySQL (disk IO ?)

• SQUID (disk IO ? or CPU ?)

• memcached (CPU ? or network ?)

Probably the most important slide.

What is the maximum amount of work that each server can do ? How close are you to that maximum, and how is it trending ?

Page 21: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

forget benchmarks

• boring

• to use in capacity planning...not usually worth the time

• not representative of real load

Benchmarks are fine for getting a general idea of capabilities, but not for planning.

Artificial tests give artificial results, and the time is better used with testing for real.

Page 22: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

• test in production

Don’t be afraid, it’s ok. :)

Best approximation to how it will perform in real life, because it’s real life.

This means building mechanisms into the architecture (config flags, load balancing, etc.) that let you deploy new hardware easily into (and out of) production.
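As an illustration of the "config flags" idea, here is a minimal sketch of a weighted host pool where a weight of 0 takes a box out of rotation and a small weight lets new hardware see a trickle of real traffic. The pool layout, host names, and `pick_host` function are all hypothetical, not how any particular site does it.

```python
import random

# Hypothetical config: host name -> weight. Weight 0 means "out of production".
POOL = {
    "db1": 10,
    "db2": 10,
    "db3": 1,   # new hardware, taking a small share of real traffic
    "db4": 0,   # pulled out of rotation
}


def pick_host(pool, rng=random):
    """Pick one host at random, in proportion to its configured weight."""
    live = {host: weight for host, weight in pool.items() if weight > 0}
    if not live:
        raise RuntimeError("no hosts in rotation")
    hosts = list(live)
    return rng.choices(hosts, weights=[live[h] for h in hosts], k=1)[0]


if __name__ == "__main__":
    counts = {host: 0 for host in POOL}
    for _ in range(10000):
        counts[pick_host(POOL)] += 1
    print(counts)
```

Raising the weight on the new box until its graphs hit a ceiling is one way to find that ceiling with production traffic.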

Page 23: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

what do you expect ?

• define what is acceptable

• examples:

• squid hits should take less than X milliseconds

• SQL queries less than Y milliseconds, and also keep up with replication

These are your internal SLAs, to help you guide capacity.
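A check like this can be a few lines. The sketch below tests whether a chosen percentile of measured response times stays under a limit; the percentile, the limits, and the sample values are hypothetical stand-ins for the X and Y on the slide.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(samples_ms, limit_ms, pct=95):
    """True when the pct-th percentile response time is within the limit."""
    return percentile(samples_ms, pct) <= limit_ms


if __name__ == "__main__":
    # Hypothetical squid hit times in milliseconds, and a hypothetical 100 ms limit.
    squid_hit_ms = [12, 15, 18, 22, 25, 31, 40, 48, 55, 140]
    print("squid hits within limit:", meets_sla(squid_hit_ms, limit_ms=100, pct=90))
```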

Page 24: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


Page 25: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

accept the observer effect

• measurement is a necessity.

• it’s not optional.

Speed freaks and tweakers and overclockers out there: suck it up.

Measurement and pretty graphs are good.

Page 26: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


- Uses multicast and/or unicast to squirt xml data into an rrdtool frontend.

- Super super easy to make custom graphs

- Originally written to handle stats data from HPC clusters

Page 27: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007








[Diagram: hosts db1, db2, db3; links labeled "xml over UDP" and "XML over TCP"]

You have redundant machines in clusters, why not use the redundancy for cluster stats as well ?

Page 28: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007








[Diagram: hosts db1, db2, db3; links labeled "xml over UDP" and "XML over TCP"]


One box goes away, then another can be used as a spokesperson for that cluster.

Page 29: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007
Page 30: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

super simple graphing

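#!/bin/sh
# Take two extended iostat reports for sda, 4 seconds apart, and keep the tail of the second.
/usr/bin/iostat -x 4 2 sda | grep -v ^$ | tail -4 > /tmp/disk-io.tmp
# Column 14 is %util (the last column) in the iostat version used here; check yours.
UTIL=`grep sda /tmp/disk-io.tmp | awk '{print $14}'`
# Hand the number to ganglia as a custom metric.
/usr/bin/gmetric -t uint16 -n disk-util -v$UTIL -u '%'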

Basically, if you can produce a number on the command line, then you can spit it into rrdtool with ganglia.

Page 31: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


This is one of our memcached boxes. 4GB, 2 instances of 1.5GB each.

2.5% user CPU and 10% system CPU at peak.

This was built within ganglia with the excellent add-on: http://wtf.ath.cx/screenshots.html

Page 32: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

what if you have graphs but no raw data ?

• GraphClick

• http://www.arizona-software.ch/applications/graphclick/en/

$8 US. Worth it.

This is helpful for MRTG graphs given by ISPs. You have the images, but no raw data.

Page 33: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

GraphClick allows you to digitize any image of a graph with units, and spit out tabular data for use in Excel, etc.

Page 34: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

application usage

• Usage stats are just as important as server stats!

• Examples:

• # of user registrations

• # of photos uploaded every hour

Build in custom metrics to tie real-world usage to server-based stats.

Example: How many users can 1 database machine handle, given W/X/Y/Z selects, inserts, updates, deletes ?

More on this later.

Page 35: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

not a straight line

Growth is exponential or nonlinear in some way.

Page 36: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

another not straight line

Page 37: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

but straight relationships!

This means that you can intuitively tie photos growth to user growth.

Not rocket science, and expected, but knowing how steep that line is...is important.

Page 38: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

measurement examples

Page 39: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007


A week view of one of our MySQL machines. A Dell PE2950 with six 15K RPM SCSI drives in RAID10, with 16GB of RAM. 2x Quad Core CPUs, Intel Clovertown E5320 @ 1.86GHz.

CPU never went above 2%, so these machines are not CPU-bound. It’s a good bet that we are disk IO-bound...so, let’s watch that...

Page 40: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

disk I/O

Watch disk IOwait. Any amount of disk utilization is “ok” if IOwait doesn’t increase dramatically.

By “ok” we mean:

- no slave lag

- query response time is still acceptable

Page 41: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

What we know now

• we can do at least 1500 qps (peak) without:

- slave lag

- unacceptable avg response time

- waiting on disk IO

So...how many users were on this database ? 400,000 ?

Then plan that every X of that h/w platform can support (within the architecture it’s in) 400k × X users. Add a pair of them for every 400k registrations.
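The arithmetic is simple enough to keep in a script next to the graphs. This sketch takes the 400k-registrations-per-pair figure from the example above; the function name and the projected registration numbers are hypothetical.

```python
USERS_PER_PAIR = 400_000  # from the example above: one pair per 400k registrations


def db_pairs_needed(registered_users, users_per_pair=USERS_PER_PAIR):
    """Number of database pairs needed, rounding up to whole pairs."""
    if registered_users <= 0:
        return 0
    return -(-registered_users // users_per_pair)  # integer ceiling division


if __name__ == "__main__":
    # Hypothetical registration counts to plan for.
    for users in (350_000, 400_000, 1_000_000, 2_500_000):
        pairs = db_pairs_needed(users)
        print(f"{users:>9} registrations -> {pairs} pair(s), {2 * pairs} machines")
```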

Page 42: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

MySQL capacity

1. find ceilings of existing h/w

2. tie app usage to server stats

3. find ceiling:usage ratio

4. do this again:

- regularly (monthly)

- when new features are released

- when new h/w is deployed

Of course this is architecture-specific! A master/multi-slave layout will perform differently and see different limitations than a partitioned or federated multi-master situation.
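Steps 2 and 3 can be sketched as a ratio calculation: take an observed pairing of application usage and server load, assume the relationship stays roughly linear, and scale up to the measured ceiling. The function names and the observed numbers below are hypothetical; only the 1500 qps ceiling echoes the earlier example.

```python
import math


def users_at_ceiling(observed_users, observed_peak_qps, ceiling_qps):
    """Users one machine could carry at its ceiling, assuming qps scales linearly with users."""
    return math.floor(ceiling_qps * observed_users / observed_peak_qps)


def headroom_fraction(observed_peak_qps, ceiling_qps):
    """Fraction of the ceiling still unused at the observed peak."""
    return 1.0 - observed_peak_qps / ceiling_qps


if __name__ == "__main__":
    # Hypothetical observation: 300,000 users on a machine peaking at 1,000 qps,
    # against a measured ceiling of 1,500 qps.
    print("users at ceiling:", users_at_ceiling(300_000, 1_000, 1_500))
    print("headroom:", headroom_fraction(1_000, 1_500))
```

Rerunning this monthly, after feature launches, and after hardware changes is what keeps the ratio honest, since any of those can change how many queries a user generates.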

Page 43: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

caching maximums

Everyone loves caching. Everyone loves memory. Everyone thinks RAM is the answer to everything.They might be right.

Page 44: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

caching ceilings

squid, memcache

• working-set specific:

• - tiny enough to all fit in memory ?

• - some/more/all on disk ?

• - watch LRU churn

Least Recently Used replacement policy chooses to replace the page which has not been referenced for the longest time.
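To make the policy concrete, here is a toy LRU cache that also counts hits, misses, and evictions, which are the numbers you would graph to watch churn. It is an illustration only, not how squid or memcached implement LRU internally.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache: the least recently used item is evicted when full."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.items = OrderedDict()  # oldest reference first, newest last
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        return None

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.max_items:
            self.items.popitem(last=False)  # drop the least recently used item
            self.evictions += 1

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = LRUCache(max_items=2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")     # "a" is now the most recently used
    cache.put("c", 3)  # evicts "b"
    print(cache.get("b"), cache.get("a"), cache.evictions, cache.hit_ratio())
```

When the working set is much larger than `max_items`, evictions climb and the hit ratio drops, which is the churn the slide says to watch.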

Page 45: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

churning full caches

• Ceilings at:

• - LRU ref age small enough to affect hit ratio too much

• - Request rate large enough to affect disk IO (to 100%)

Caching isn’t helpful if churn is too high.

Page 46: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

squid requests and hits

2 days graphed. Daytime peaks are pretty clear here.

Page 47: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

squid hit ratio

We MISS on purpose for some larger objects. Usually the hit ratio is somewhere between 75% and 80%.

Page 48: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

LRU reference age

Fits with request rates. More requests for more unique objects, more churn.

Low point = 0.15 days = 3.66 hours.

Max point = 0.23 days = 5.52 hours.

Must make a sanity check for these with response times.

Page 49: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

hit response times

50ms is still within reasonableness. So, ok!

Page 50: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

What we know now

• we can do at least 620 req/sec (peak) without:

- LRU affecting hit ratio

- unacceptable avg response time

- waiting too much on disk IO

Squid request rate goes up over time, so week-by-week graphs are made.

This is an example; we know that we can actually do 900-1000 req/sec without the response time for hits getting above 100ms.

So for every 900 req/sec we expect, we should be adding another squid machine.

Page 51: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

not full caches

• (working set smaller than max size)

• - request rate large enough to bring network or CPU to 100%

Hard pressed to get memcached to eat up CPU, but squid can.

Page 52: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

deployment

Being able to deploy capacity easily is a necessity.

Page 53: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

Automated Deploy Tools

• SystemImager/SystemConfigurator:

• - http://wiki.systemimager.org

• CVSup:

• - http://www.cvsup.org

• Subcon:

• - http://code.google.com/p/subcon/

These are all great OSS projects which can automate the installation, configuration, deployment, and general management of clusters of machines.

Page 54: Slides from ‘Capacity Planning for LAMP’ talk at MySQL Conf 2007

questions ?
