Capacity Management for Web Operations
John Allspaw, Operations Engineering
the book I’m writing
???
Rules of Thumb
Planning/Forecasting
Stupid Capacity Tricks
(with some Flickr statistics sprinkled in)
bugs (disguised as capacity problems)
edge cases (disguised as capacity problems)
security incidents
real capacity problems*
* (should be the last thing you need to worry about)
Things that can cause downtime
Capacity != Performance
Forget about performance for right now
Measure what you have right NOW
Don’t count on it getting any better
Thank You HPC Industry!
Automated Stuff
Scalable Metric Collection/Display
a lot of great deployment and management tricks come from them, adopted by web ops
Good Measurement
Tools
record and store metrics in/out
custom metrics
easily compare
lightweight-ish
I
Clouds need planning too
Makes deployment and procurement easy and quick
But clouds are still resources with costs and limits, just like your own stuff
Black-boxes: you may need to pay even more attention than before
Metrics
System Statistics
Metrics: “Application” Level
(photos processed per minute)
(average processing time per photo)
(apache requests)
(concurrent busy apache procs)
Metrics: App-level meets system-level
here, total CPU = ~1.12 * # busy apache procs (ymmv)
2400
photos per minute being uploaded right NOW (Tuesday afternoon)
Ceilings
the most “work” your resources will allow before degradation or failure
Forget Benchmarking
Find your ceilings
The End
what you have left
Use real live production data to find ceilings
Production: “it’s like a lab, but bigger!”
Like: database ceilings
replication lag: bad!
Ceilings
waiting on disk too much
sustained disk I/O wait above 40% creates slave lag*
* for us, YMMV
35,000 photo requests per second on a Tuesday peak
Safety Factors
Ceiling * Factor of Safety = UR LIMITZ
Safety Factors
webserver!
what you have left
“safe” ceiling
@85% CPU
Safety Factors
85% total CPU = ~76 busy apache procs
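The arithmetic behind this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the linear relationship quoted earlier (total CPU = ~1.12 * # busy apache procs) holds up to the 85% mark; the function and constant names are made up for the example, and the 1.12 slope is specific to the servers measured (ymmv).

```python
# Slope from the earlier slide: total CPU % ~= 1.12 * busy Apache procs.
# This is a measured, site-specific number; measure your own.
CPU_PERCENT_PER_BUSY_PROC = 1.12


def busy_proc_limit(safe_cpu_percent, cpu_per_proc=CPU_PERCENT_PER_BUSY_PROC):
    """Convert a "safe" CPU ceiling (in percent) into a busy-process limit."""
    return round(safe_cpu_percent / cpu_per_proc)


if __name__ == "__main__":
    per_box = busy_proc_limit(85)
    print(per_box)        # 76 busy procs per webserver
    print(15 * per_box)   # 1140 busy procs across 15 webservers
```

The per-box limit times the number of boxes gives the cluster-wide limit used in the forecasting example (15 servers, 1140 total procs).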
Safety Factors: Yahoo Front Page
link to Chinese New Year Photos
(photo requests/second)
(8% spike)
Forecasting
Fictional Example: webservers
Forecasting
Fictional example: 15 webservers. 1 week.
peak of the week
...bigger sample, 6 weeks....isolate the peaks...
Forecasting
...”Add a Trendline” with some decent correlation...
Forecasting
not too shabby
now
Forecasting
15 servers @76 busy apache proc limit = 1140 total procs
when is this?
ceiling
this will tell you when it is
what you have left
Forecasting
(week #10, duh)
(1140-726) / 42.751 = 9.68
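The trendline step can be reproduced without a spreadsheet. The sketch below is an illustration only: it reads 42.751 as the trendline slope (busy procs per week) and 726 as its intercept, which is an interpretation of the arithmetic on the slide, and the weekly peaks in the usage section are hypothetical, not the data behind the chart.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight-line fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def week_of_ceiling(ceiling, slope, intercept):
    """Solve slope * x + intercept = ceiling for x (the week the trend hits the ceiling)."""
    if slope <= 0:
        return float("inf")  # flat or shrinking trend never reaches the ceiling
    return (ceiling - intercept) / slope


if __name__ == "__main__":
    # Numbers from the slide: ceiling 1140, trendline 42.751 * x + 726.
    print(round(week_of_ceiling(1140, 42.751, 726), 2))  # 9.68, i.e. week #10

    # Hypothetical weekly peaks, only to show fit_line in use.
    weeks = [1, 2, 3, 4, 5, 6]
    peaks = [770, 810, 855, 895, 940, 985]
    slope, intercept = fit_line(weeks, peaks)
    print(week_of_ceiling(1140, slope, intercept))
```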
Writing excel macros is boring
All we want is “days remaining”, so all we need is the curve-fit
Forecasting Automation
Use http://fityk.sf.net to automate the curve-fit
Forecasting
Fictional Example: storage consumption
Forecasting Automation
actual flickr storage consumption from early 2005, in GB (ceiling is fictional)
this will tell you when this is
Forecasting Automation: cmd line script output
[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
Forecasting Automation
(SAME)
fityk gave:
y = 0.786854x^2 + 146.657x + 14147.4
(R^2 = 99.84)
Excel gave:
y = 0.7675x^2 + 146.96x + 14147.3
(R^2 = 99.84)
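Once the curve-fit is in hand, "when do we hit the ceiling" is just the quadratic formula. The sketch below uses the coefficients fityk reported; the ceiling value in the usage section is hypothetical (the slide's ceiling is fictional and not stated), and x is in whatever time unit the .xy file uses, which the slides do not specify.

```python
import math

# Coefficients from the fityk output: y = C + B*x + A*x^2 (y in GB).
A, B, C = 0.786854, 146.657, 14147.4


def storage_at(x, a=A, b=B, c=C):
    """Fitted storage consumption (GB) at time index x."""
    return c + b * x + a * x * x


def x_at_ceiling(ceiling_gb, a=A, b=B, c=C):
    """Solve a*x^2 + b*x + c = ceiling_gb for the larger root.

    A negative result means the fitted curve already passed the ceiling
    before x = 0.
    """
    disc = b * b - 4 * a * (c - ceiling_gb)
    if disc < 0:
        raise ValueError("the fitted curve never reaches this ceiling")
    return (-b + math.sqrt(disc)) / (2 * a)


if __name__ == "__main__":
    hypothetical_ceiling_gb = 20000  # assumption, not from the slides
    print(x_at_ceiling(hypothetical_ceiling_gb))
```

Subtracting the current time index from the result gives the "remaining" figure the deck is after, in the same unit as x.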
Capacity Health
12,629 nagios checks
1314 hosts
6 datacenters
4 photo “farms”
farm = 2 DCs (east/west)
High and Low Water Marks
alert if higher
alert if lower
Per server, squid requests per second
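A high/low water mark check is small enough to sketch directly. The thresholds below are hypothetical, not Flickr's; the point is that a value that is too low (a squid box getting almost no requests) is as worth alerting on as one that is too high.

```python
def check_water_marks(value, low, high):
    """Classify a metric against low and high water marks."""
    if value > high:
        return "HIGH"
    if value < low:
        return "LOW"
    return "OK"


if __name__ == "__main__":
    # Hypothetical per-server squid thresholds, in requests per second.
    LOW_MARK, HIGH_MARK = 100, 950
    for req_per_sec in (40, 600, 1100):
        print(req_per_sec, check_water_marks(req_per_sec, LOW_MARK, HIGH_MARK))
```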
A good dashboard looks something like...
type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak  est. days left
www       20  80         busy procs     1600           1000            62.50%  36
shard db  20  40         I/O wait       800            220             27.50%  120
squid     18  950        req/sec        17,100         11,400          66.67%  48
(yes, fictional numbers)
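The derived columns of that dashboard follow from the first three inputs. A minimal sketch, using the same fictional numbers; the function name is made up for the example, and "est. days left" is left out because it needs a growth rate from the forecasting step, which the table does not give.

```python
def summarize(box_count, limit_per_box, current_peak):
    """Return (total limit, percent of limit used at peak)."""
    total_limit = box_count * limit_per_box
    percent_of_peak = round(100 * current_peak / total_limit, 2)
    return total_limit, percent_of_peak


if __name__ == "__main__":
    # (type, number of boxes, limit per box, ceiling units, current peak)
    rows = [
        ("www", 20, 80, "busy procs", 1000),
        ("shard db", 20, 40, "I/O wait", 220),
        ("squid", 18, 950, "req/sec", 11400),
    ]
    for name, count, per_box, units, peak in rows:
        total, pct = summarize(count, per_box, peak)
        print(f"{name}: limit {total} {units}, at {pct}% of limit")
```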
Diagonal Scaling
Image processing machines
Replace Dell PE860s with HP DL140 G3s
vertically scaling your already horizontal nodes
Diagonal Scaling example: image processing
4 cores
8 cores
(about the same CPU “usage” per box)
~45 images/min @ peak
~140 images/min @ peak
(same CPU usage, but ~3x more work)
“processing” means making 4 sizes from originals
Diagonal Scaling example: image processing throughput
went from: 23 Dell PE860s, 1035 photos/min, 3008.4 Watts, 23U rack
to: 8 HP DL140 G3s, 1120 photos/min, 1036.8 Watts, 8U rack
(75% faster, even)
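The before/after figures can be checked with a short calculation. This sketch uses only numbers from the slides (box counts, ~45 and ~140 images/min per box at peak, total watts, rack units); the dictionary layout and function names are illustrative.

```python
BEFORE = {"boxes": 23, "images_per_min_per_box": 45, "watts": 3008.4, "rack_units": 23}
AFTER = {"boxes": 8, "images_per_min_per_box": 140, "watts": 1036.8, "rack_units": 8}


def throughput(cluster):
    """Total images processed per minute at peak."""
    return cluster["boxes"] * cluster["images_per_min_per_box"]


def watts_per_image_per_min(cluster):
    """Power drawn per unit of peak throughput."""
    return cluster["watts"] / throughput(cluster)


def power_saving(before, after):
    """Fraction of total power saved by moving from before to after."""
    return 1 - after["watts"] / before["watts"]


if __name__ == "__main__":
    print(throughput(BEFORE), throughput(AFTER))       # 1035 1120
    print(round(watts_per_image_per_min(BEFORE), 2))   # 2.91
    print(round(watts_per_image_per_min(AFTER), 2))    # 0.93
    print(round(100 * power_saving(BEFORE, AFTER), 1)) # 65.5 (percent less power)
```

Slightly more throughput for roughly a third of the power and rack space is the case for vertically scaling nodes that are already scaled horizontally.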
Diagonal Scaling example: image processing
!!!
3.52
terabytes will be consumed today (on a Tuesday)
2nd Order Effects (beware the wandering bottleneck)
www
LB
www
memcached  db  search
running hot, so add more
2nd Order Effects (beware the wandering bottleneck)
www
LB
www www www
memcached  db  search
running great now, so more traffic!
now these run hot
Stupid Capacity Tricks
Stupid Capacity Tricks: quick and dirty management
DSH: http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2
Stupid Capacity Tricks: quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
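For readers without dsh installed, the same idea (one command, a flat file of hostnames, prefixed output) can be approximated in Python. This is not how dsh is implemented; it is a rough sequential sketch that assumes key-based ssh access to every host, and every function name here is made up for the illustration.

```python
import subprocess


def parse_group(text):
    """Turn the contents of a group file (one hostname per line) into a list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def ssh_argv(host, command):
    """Build the ssh command line; BatchMode makes ssh fail instead of prompting."""
    return ["ssh", "-o", "BatchMode=yes", host, command]


def run_everywhere(hosts, command, timeout=10):
    """Run the command on each host in turn and print "host: output" lines."""
    for host in hosts:
        result = subprocess.run(
            ssh_argv(host, command),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        print(f"{host}: {result.stdout.strip()}")


# Usage, assuming a group file like the one shown above:
#   with open("group.of.servers") as fh:
#       run_everywhere(parse_group(fh.read()), "date")
```

Output in this shape makes an odd host easy to spot, such as the one box above reporting PDT while the rest report UTC.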
Stupid Capacity Tricks: Turn Stuff OFF
Disable heavy-ish features of the site (on/off switches)
We have 195 different things to disable in case of emergency.
Stupid Capacity Tricks: Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things
etc., etc.
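An on/off switch of this kind can be as simple as a lookup that request-handling code consults before doing expensive work. The sketch below is a generic in-memory illustration, not Flickr's implementation; the switch names are invented from the list above, and a real deployment would need the state shared across all webservers.

```python
class FeatureSwitches:
    """In-memory on/off switches; everything starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def disable(self, name):
        self._set(name, False)

    def enable(self, name):
        self._set(name, True)

    def is_enabled(self, name):
        return self._enabled[name]  # KeyError on an unknown switch name

    def _set(self, name, value):
        if name not in self._enabled:
            raise KeyError(f"unknown switch: {name}")
        self._enabled[name] = value


if __name__ == "__main__":
    # Hypothetical switch names.
    switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
    switches.disable("video_uploads")  # emergency: shed the heavy feature
    if not switches.is_enabled("video_uploads"):
        print("Video uploads are temporarily turned off.")
```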
Host your outage/status/blog page in more than one datacenter.
Tell your users WTF is going on, they’ll appreciate it.
Stupid Capacity Tricks: Outages Happen
Stupid Capacity Tricks: Hit the Pause Button
Bake the dynamic into static
Some Y! properties have a big red button to instantly bake (and un-bake) at will
thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/
We’re Hiring! flickr.com/jobs
Come see me!
questions?