How Do you Scale for both Predictable and Unpredictable Events on such a Large Scale? Surge 2013

The Canadian Broadcasting Corporation is Canada's national public broadcaster. Our website, www.cbc.ca, is one of the largest and most visited in the country, delivering 700 million hits per day on an origin infrastructure composed of only six web servers. With the right combination of publishing methods, content delivery networks and fine-tuned caching rules, the CBC’s infrastructure has enough headroom to handle spikes of 40x normal traffic during major news events. How do you scale to almost infinite capacity when you can't predict the world’s events? It's impossible to prepare for the influx of visitors when a celebrity dies, a natural disaster occurs, or other news breaks. Scaling for predictable events is easier, but although we know when the next Federal Election, Olympic Games or FIFA World Cup is scheduled, these events present different challenges. Balancing the architecture for both scenarios is important.

How Do you Scale for both Predictable and Unpredictable Events on such a Large Scale?

Surge 2013

We’re going to talk about this:

Whitney Houston Death: February 11, 2012

… and this:

Without your site going down…

Who Am I?

• Team Lead of CBC.ca System Administration team.

• Been with CBC for over 11 years (since 2002).

• @blakecrosby

Let’s go back in time… way back

2010

2008

2007

2006

2005

2004

2003

“News stories must appear on the site as fast as possible!”

- Every Journalist at CBC

This architecture doesn’t work for news websites.

This was an important lesson for CBC

Breaking news traffic: it’s unpredictable and short-lived.

From 12k hits/s to 30k hits/s

Royal Baby: July 22, 2013

From 1 Gbps to 2.5 Gbps in ~7 minutes.

Boston Marathon Bombing: April 15, 2013

From 1 Gbps to 14 Gbps in ~10 minutes.

Whitney Houston Death: February 11, 2012

Challenges we (or you) face

It is too expensive to build out infrastructure for traffic levels that are sustained for less than 1% of the year.

Content must be flexible enough to adapt to changing traffic conditions.

We have valuable information that users need in a crisis.

“News stories must appear on the site as fast as possible!”

- Every Journalist at CBC

How we fixed this problem (back in 2003, remember?)

Save everything to disk.

Advantages

• Observes the principle of least surprise.

• Fast

• Takes advantage of OS and FS caches

• Easy to turn off certain site features.

Using SSIs (Server Side Includes)

• Primitive, but fast and secure.

• Can turn off site features or change look and feel by editing one file.

• All pages are updated instantly, without having to wait for pages to be republished.
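
As a rough illustration of the "edit one file" point above, the sketch below mimics how an include directive is expanded at request time. The directive syntax follows Apache's `<!--#include virtual="..." -->` form; the fragment path, the page markup and the `expand_includes` helper are hypothetical and not taken from the talk.

```python
import re

# Matches Apache-style SSI include directives such as
# <!--#include virtual="/includes/weather.html" -->
INCLUDE_RE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')

def expand_includes(page: str, fragments: dict) -> str:
    """Replace each include directive with the named fragment (empty if missing)."""
    return INCLUDE_RE.sub(lambda m: fragments.get(m.group(1), ""), page)

# Hypothetical fragment store standing in for include files on the origin's disk.
fragments = {"/includes/weather.html": "<div>Weather widget</div>"}
page = '<body><!--#include virtual="/includes/weather.html" --><p>Story</p></body>'

normal = expand_includes(page, fragments)

# Turning the feature off is a one-file edit: blank the shared fragment.
fragments["/includes/weather.html"] = ""
degraded = expand_includes(page, fragments)
```

Because every page pulls in the same fragment when it is served, blanking that one file changes all pages at once and nothing has to be republished.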

Use a Content Delivery Network

Use Conditional GETs (If-Modified-Since)

Using Expiry and Validation

• Object has a TTL of 30 Seconds.

• Object has a last modified time of Jan 1, 2013 00:00:00

• Once TTL has expired, cache/CDN will check if object is updated.

• If the object is unchanged, the origin will return "304 Not Modified" and the cache will reset the TTL and serve the object from its cache store.

• The 30-second TTL protects the origin from a deluge of "If-Modified-Since" requests.
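
The interaction between expiry and validation described in the bullets above can be sketched with a toy cache and origin. This is a minimal model under stated assumptions, not how any particular CDN works: the `Origin`, `Cache` and `CachedObject` classes are invented for illustration, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass

@dataclass
class CachedObject:
    body: str
    last_modified: float  # origin's last modified time, as a Unix timestamp
    expires_at: float     # time after which the cache must revalidate

class Origin:
    """Toy origin holding one object; counts how often it is contacted."""
    def __init__(self, body: str, last_modified: float):
        self.body, self.last_modified = body, last_modified
        self.requests = 0

    def conditional_get(self, if_modified_since):
        self.requests += 1
        if if_modified_since is not None and self.last_modified <= if_modified_since:
            return 304, None, self.last_modified   # unchanged: no body sent
        return 200, self.body, self.last_modified

class Cache:
    def __init__(self, origin: Origin, ttl: float = 30.0):
        self.origin, self.ttl, self.entry = origin, ttl, None

    def get(self, now: float) -> str:
        if self.entry and now < self.entry.expires_at:
            return self.entry.body                  # fresh: origin not contacted
        ims = self.entry.last_modified if self.entry else None
        status, body, lm = self.origin.conditional_get(ims)
        if status == 304:
            self.entry.expires_at = now + self.ttl  # reset TTL, keep stored body
        else:
            self.entry = CachedObject(body, lm, now + self.ttl)
        return self.entry.body

origin = Origin("story", last_modified=1000.0)
cache = Cache(origin, ttl=30.0)
bodies = [cache.get(float(t)) for t in range(90)]   # one request per second for 90 s
```

In this run the cache answers 90 requests while contacting the origin only three times (at t=0, 30 and 60), and the last two of those are answered with a 304 rather than a full body.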

Use Last Mile Acceleration (GZIP Compression)
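
A quick way to see why compression matters for text content is to gzip some repetitive markup. The HTML below is synthetic and only stands in for a typical page; real ratios depend on the content.

```python
import gzip

# Synthetic, repetitive markup standing in for an HTML page (illustrative only).
html = ("<li><a href='/news/story'>Headline text for a story</a></li>\n" * 200).encode("utf-8")

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)   # fraction of the original size sent on the wire

# Decompressing gives back exactly the bytes we started with.
restored = gzip.decompress(compressed)
```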

Use persistent HTTP connections

Use Appropriate Cache TTLs. Keep them simple!

Keep tunable options at the origin
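
One way to picture a tunable kept at the origin is a small settings structure that decides the `Cache-Control` header sent with each response. The sketch below is an assumption about how this could look, not CBC's configuration: the `TUNABLES` layout, the mode names and the 300-second value are hypothetical (only the 30-second TTL appears in the slides).

```python
# Hypothetical tunables kept on the origin; names and the surge value are illustrative.
TUNABLES = {"mode": "normal", "ttl_seconds": {"normal": 30, "surge": 300}}

def cache_control(tunables: dict) -> str:
    """Build the Cache-Control header the origin sends for the current mode."""
    ttl = tunables["ttl_seconds"][tunables["mode"]]
    return f"public, max-age={ttl}"

normal_header = cache_control(TUNABLES)

# Flipping one value at the origin changes what every cache is told from then on.
TUNABLES["mode"] = "surge"
surge_header = cache_control(TUNABLES)
```

Because caches obey the headers the origin sends, the change takes effect as objects are next revalidated, with no CDN configuration push involved.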

Move personalization to the client

Outcomes (Where we are now in 2013)

Outcomes

• 2003 to 2010 – No need to grow origin

• 2010 to today – 9 origin web servers

• HP DL360 G7

• Average 45-50% CPU utilization

• Capital cost for hardware? $15,000!

Our secret sauce. (Or how to serve 800M requests a day from 9 web servers)
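
To put the 800M figure in perspective, the arithmetic below converts it to an average request rate. The `origin_rate` helper shows how an offload fraction would reduce what reaches the origin; the 0.95 used in the example call is a made-up value for illustration, not a CBC number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def average_rate(requests_per_day: float) -> float:
    """Average requests per second over a full day."""
    return requests_per_day / SECONDS_PER_DAY

def origin_rate(edge_rate: float, offload: float) -> float:
    """Requests/s reaching the origin when `offload` (0..1) is served by the CDN."""
    return edge_rate * (1.0 - offload)

avg = average_rate(800_000_000)          # roughly 9,259 requests/s on average

# Hypothetical offload fraction, chosen only to show the shape of the calculation.
example_origin = origin_rate(1000.0, 0.95)
```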

Offload (Bandwidth)

Offload (Hits)

Scaling for Unpredictable Events

Checking the last time a file has changed is faster than delivering that file to a user.

Conditional GETs (304s) will save you.
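
On the origin side, answering a conditional GET only requires comparing the file's modification time with the date the client sent. The function below is a minimal sketch of that decision using Python's standard library; `respond` is a hypothetical helper, and a real web server performs this check itself for static files.

```python
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Optional

def respond(file_mtime: float, if_modified_since: Optional[str]):
    """Return (status, headers) for a GET of a file last changed at file_mtime."""
    mtime = int(file_mtime)  # HTTP dates have one-second resolution
    headers = {"Last-Modified": formatdate(mtime, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if mtime <= since.timestamp():
                return 304, headers  # unchanged: headers only, no body
        except (TypeError, ValueError):
            pass  # unparseable date: fall through to a full response
    return 200, headers

# In real use file_mtime would come from os.stat(path).st_mtime.
mtime = datetime(2013, 1, 1, tzinfo=timezone.utc).timestamp()
first_status, first_headers = respond(mtime, None)
repeat_status, _ = respond(mtime, first_headers["Last-Modified"])
changed_status, _ = respond(mtime + 60, first_headers["Last-Modified"])
```

The first request gets a full 200 response, the repeat with the returned date gets a 304, and a request made after the file has changed gets a 200 again.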

Make sure users don’t have to search for content

Increase your TTLs

Turn off dynamic components

Scaling for predictable events

Predicting traffic levels is impossible

Some (loose) rules.

• Scheduled events don't peak as high as unpredictable ones.

• Scheduled events last longer, so increase in traffic is spread out over hours, days, or weeks.

• Scheduled events are more "niche", unlike breaking news, where everyone wants to know what's going on.

• Might have to worry about 95th-percentile (95/5) billing and bandwidth overages.
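
The last bullet is easier to see with numbers. Under 95th-percentile billing, the top 5% of bandwidth samples in the billing period are discarded and the highest remaining sample is billed. The sketch below uses one common way of picking that sample; providers differ in the details, and both sample sets are synthetic.

```python
def billable_95th(samples_mbps: list) -> float:
    """Sort samples, discard the top 5%, and return the highest remaining one."""
    if not samples_mbps:
        raise ValueError("no samples")
    ordered = sorted(samples_mbps)
    n = len(ordered)
    index = (95 * n + 99) // 100 - 1   # ceil(0.95 * n) - 1, in integer arithmetic
    return ordered[index]

# Synthetic billing periods of 100 samples each (values in Mbps, illustrative only).
short_spike = [1000.0] * 96 + [14000.0] * 4   # a large spike covering 4% of samples
long_event = [1000.0] * 90 + [2500.0] * 10    # a modest rise covering 10% of samples

short_bill = billable_95th(short_spike)
long_bill = billable_95th(long_event)
```

Here the short, very large spike falls inside the discarded 5% and is billed at the 1000 Mbps baseline, while the longer, smaller rise is billed at 2500 Mbps. That is why long-running scheduled events are the ones that can trigger overages.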

How do you scale for write operations?

We let someone else deal with that:

In Summary…

• Ensure your TTLs are appropriate

• Make sure your applications/content return Last-Modified headers.

• Don't be afraid to change your site to turn off components that aren't critical during high traffic periods.

• Keep tunables at the Origin. This allows you to make changes quickly without waiting for CDN propagation.

• A CDN will not replace or fix bad origin infrastructure!

• Predicting the scale of a scheduled event is impossible. You will either overestimate or underestimate.

• Use previous traffic levels during unscheduled events as a high-water mark.

• Don't be afraid to ask someone else (SaaS provider) to implement a feature that is not your core business/expertise.

Usenix Paper

http://tinyurl.com/lisa-paper

Thank You

@blakecrosby