Guerilla Scaling In The Wild - University of Birmingham

Guerilla Scaling In The Wild

Jake Grimley, Managing Director, Made Media Ltd.

Made Media Ltd.

• Digital Marketing Agency

• Web Design Agency

• Arts, Media, Advertising

• Design & Code

• Disposable Software

Employing...

[Pie chart: staff by role. Segments: Developers, Designers, Project Managers, SysOps, Management. Values: 38%, 19%, 19%, 6%, 19%]

Including...

• 1 Birmingham Uni SoftEng Graduate

• And Stef Lewandowski

• And One Placement

• Working on the EuDML Project as a partner alongside Bham Uni

Jake Grimley, Managing Director

• Studied Physics

• Learnt to Program on a ZX Spectrum

• Reasonable Grasp of OOP, REST, LAMP

• Big Picture Company Overview

Guerilla Scaling In The Wild

Scaling

Scalability is the ability of a system, network, or process to handle growing amounts of work in a graceful manner, or its ability to be enlarged to accommodate that growth.

Commercial Definition

Adding more users should make Total Cost of Ownership cheaper, per-user, not more expensive.

In The Wild

Guerilla

Real World

• Client may not consider scaling at all. Problem not feature.

• Timescales tight

• Limited opportunities to test and learn

• Hardware scale-out not always possible

• Your reputation on the line

Case Study 1

Embarrassing Bodies

• Channel 4’s most successful multi-platform project

• 1 episode: 4m viewers, 150 thousand web users, 1.2m page views

Things we’d rather they didn’t ask

• Would you like to look at some boobs?

• Do you want to take an autism test? (38,000 users in 1 minute)

Scaling Up / Scaling Out

Up

Quickest

‘Big Iron’

Hit Limits Fast

Out

Preferable

‘Unlimited’

Requires Planning

App

Embarrassing Bodies

Web Server

Database Server

Reverse Proxy

Firewall

Varnish

Smoothwall

Apache (PHP)

MySQL

Optimise Last

• Chances are you won’t need to

• Servers are cheaper than programmers

• You don’t know how much you need to optimise when you start

• You don’t know where the bottlenecks will be

• But don’t be reckless

Cache FIRST

Edge Cache / CDN

Reverse Proxy

Application

Object Cache (Memcached)

Database (Query Cache)

Caching

• Start from front and work backwards

• Works best with high read-to-write ratio

• Low levels of personalisation
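A minimal sketch of why a high read-to-write ratio matters, assuming a single in-process dictionary standing in for an object cache such as Memcached. The class name `ReadThroughCache`, its `loader` callback and the `clock` parameter are hypothetical names chosen for illustration, not part of any library.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Tiny in-process cache with a time-to-live (a stand-in for Memcached)."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # the slow call, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired: ask the backend once, then remember the answer.
        self.misses += 1
        value = self._loader(key)
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call on every write so readers do not keep seeing stale data."""
        self._store.pop(key, None)
```

With one write and a thousand reads of the same key inside the time-to-live, the loader runs once. With heavy personalisation every user needs a different key, so the hit rate collapses.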

“There are only two hard problems in Computer Science: cache invalidation and naming things.”

Phil Karlton

Ultimately, this strategy will run out of steam

HTTP Optimisations

• Minimise HTTP requests

• Add Expires / Cache-Control headers

• Don’t start an Application process when you don’t need one

• Offload Images / Javascript etc. to S3

• Cache output into flat files

• Optimise connection limits and spare child processes

• Be careful with Keep-alive
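The "cache output into flat files" item can be sketched roughly as follows, assuming pages are keyed by URL path and may be reused for a fixed number of seconds. The function `cached_page` and its `render` callback are hypothetical; in practice the web server would usually serve the flat file directly, without starting an application process at all.

```python
import hashlib
import os
import tempfile
import time
from pathlib import Path
from typing import Callable


def cached_page(cache_dir: Path, url_path: str, max_age: float,
                render: Callable[[str], str]) -> str:
    """Return the page for url_path, rendering only when the flat file is stale."""
    name = hashlib.sha256(url_path.encode("utf-8")).hexdigest() + ".html"
    target = cache_dir / name
    try:
        age = time.time() - target.stat().st_mtime
        if age < max_age:
            return target.read_text(encoding="utf-8")
    except FileNotFoundError:
        pass  # never rendered before

    html = render(url_path)
    # Write to a temporary file, then rename it into place, so that
    # concurrent readers never see a half-written page.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(html)
    os.replace(tmp_name, target)
    return html
```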

Connection Pooling

• Ideal: Connection established, Request served, Connection closed

• Connections are expensive to open, and each open connection then holds on to RAM

• Too many connections: server will lock up

• Too few connections, requests will be denied

• We want to safe-guard against lock-up whilst refusing as few requests as possible

• Default configurations are conservative
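The trade-off above (refuse a few requests rather than lock up) can be illustrated with a hard cap on concurrent work. This is only a sketch in application code, assuming a thread-per-request model; the class `ConnectionLimiter` is a hypothetical name, and a real deployment would set the equivalent limits in the web server configuration instead.

```python
import threading
from typing import Callable, Optional


class ConnectionLimiter:
    """Allow at most max_connections requests in flight; refuse the rest."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.refused = 0

    def handle(self, work: Callable[[], str]) -> Optional[str]:
        # Non-blocking acquire: if every slot is busy, refuse immediately
        # instead of queueing up and exhausting memory.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.refused += 1
            return None  # a real server would answer with an error status here
        try:
            return work()
        finally:
            self._slots.release()
```

Tuning is then a matter of raising `max_connections` until the machine is close to, but not past, the point where it runs out of RAM, and watching how `refused` changes.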

Stick to REST

• REST was designed by smarter people than you

• Distributed caching is built in by design

• Understand the difference between GET and POST

• Stateless systems scale better

Distributed Caching

Cache-Control: max-age=86400, must-revalidate

max-age=86400
The number of seconds the resource will be considered fresh. Similar to the Expires header, except that it counts seconds from now rather than naming a specific date/time.

s-maxage=86400
The same as max-age, except that it only applies to shared caches.

public
Makes the response always cacheable, even if it wouldn't normally be cacheable (behind authentication, etc.).

no-cache
Forces caches to ask the server for validation before releasing a cached copy. If the server says that the cached version is still fresh, it is served from the cache.

must-revalidate
Forces caches to obey any freshness information you give them about a resource.

proxy-revalidate
The same as must-revalidate, except that it only applies to shared (proxy) caches.
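To make the directives concrete, here is a deliberately simplified sketch of how a cache might read the header and decide how long a response stays fresh. The function names are hypothetical, and the logic covers only max-age, s-maxage and no-cache; it ignores Expires, heuristic freshness and the other directives a real cache handles.

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header into {directive: value or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


def freshness_lifetime(header: str, shared_cache: bool) -> int:
    """Seconds the response may be reused without asking the server (0 = always ask)."""
    directives = parse_cache_control(header)
    if "no-cache" in directives:
        return 0
    s_maxage = directives.get("s-maxage")
    if shared_cache and s_maxage is not None:
        return int(s_maxage)  # s-maxage overrides max-age for shared caches
    max_age = directives.get("max-age")
    if max_age is not None:
        return int(max_age)
    return 0
```

For example, `freshness_lifetime("max-age=60, s-maxage=86400", shared_cache=True)` gives a day for a reverse proxy, while the same header gives a browser one minute.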

GET/POST

[Diagram labels: Client, Cache, App, DB; request paths marked Get and Post]

State stored in Application

[Diagram labels: Client, Load Balancer, App1 to App5, DB; marked ‘Sticky Session’]

Stateless

[Diagram labels: Client, App1 to App6, DB; marked ‘Round Robin’]

Stateless?

• Store session state on client (encrypted cookie)

• Push session state down into database (now we have a database scaling problem)
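A rough sketch of the first option, keeping session state on the client so that any app server can handle any request. Note the swap: this example signs the cookie with an HMAC so it cannot be tampered with, but it does not encrypt it, because the Python standard library has no symmetric encryption; the contents remain readable by the user. The function names and the shared `secret` are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json
from typing import Any, Dict, Optional


def encode_session(data: Dict[str, Any], secret: bytes) -> str:
    """Serialise session data and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode("utf-8"))
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_session(cookie: str, secret: bytes) -> Optional[Dict[str, Any]]:
    """Return the session data, or None if the cookie is malformed or forged."""
    payload, sep, signature = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison, on bytes so odd characters cannot raise.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Every app server that shares the secret can verify the cookie on its own, which is what lets the load balancer use plain round robin instead of sticky sessions.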

Case Study 2

Embarrassing Bodies Live

• Live post-broadcast show with video streaming

• User generated content:

• Images

• Comments

• Polls

• Votes

• 40 transactions/second

Scaling Out

• Smart Client built in Javascript

• Poll for JSON data-update

• JSON distributed across EC2 varnish instances (GET)

• All POST requests filtered through one server

• Doctor’s Waiting Room (Velvet Rope Policy)

Pragmatic Architecture

De-Normalise

[Diagram labels: Aggregate Query (x4), Votes, Poll Total, 1 simple query, Data mining/reporting]
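One way to read the de-normalisation idea is sketched below, assuming a votes table plus a separate running-total table that is updated on every write. The schema, table names and function names are hypothetical and use SQLite only so the example is self-contained; the talk's stack used MySQL.

```python
import sqlite3
from typing import Dict


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE votes (poll_id INTEGER, choice TEXT)")
    db.execute("CREATE TABLE poll_totals (poll_id INTEGER, choice TEXT, total INTEGER, "
               "PRIMARY KEY (poll_id, choice))")
    return db


def cast_vote(db: sqlite3.Connection, poll_id: int, choice: str) -> None:
    with db:  # one transaction: raw vote and running total change together
        db.execute("INSERT INTO votes (poll_id, choice) VALUES (?, ?)", (poll_id, choice))
        updated = db.execute(
            "UPDATE poll_totals SET total = total + 1 WHERE poll_id = ? AND choice = ?",
            (poll_id, choice)).rowcount
        if updated == 0:
            db.execute("INSERT INTO poll_totals (poll_id, choice, total) VALUES (?, ?, 1)",
                       (poll_id, choice))


def totals_fast(db: sqlite3.Connection, poll_id: int) -> Dict[str, int]:
    """The simple query used on the hot path: read the stored totals."""
    return dict(db.execute(
        "SELECT choice, total FROM poll_totals WHERE poll_id = ?", (poll_id,)))


def totals_aggregate(db: sqlite3.Connection, poll_id: int) -> Dict[str, int]:
    """The aggregate query, kept for reporting rather than for every page view."""
    return dict(db.execute(
        "SELECT choice, COUNT(*) FROM votes WHERE poll_id = ? GROUP BY choice", (poll_id,)))
```

The cost moves from reads to writes: each vote does a little more work so that each of the far more numerous reads does much less.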

• JSON feed included ‘refresh-time’ variable and ‘drop-%’ variable

• Primary Concern: Keep the site (and video feed) up

• ‘Red, Yellow, Green’, ‘Defcon 1, Defcon 2’

• Triumph of Pragmatism over theory
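The slides name the two control variables but not their exact meaning. The sketch below assumes that ‘refresh-time’ is the number of seconds the client should wait before polling again and that ‘drop-%’ is the percentage of outgoing submissions the client should silently discard. The original client was written in Javascript; this Python version, with hypothetical function names and default values, only illustrates the logic.

```python
import json
import random
from typing import Tuple


def read_controls(feed_json: str) -> Tuple[float, float]:
    """Extract (refresh seconds, drop percentage) from the polled JSON feed."""
    feed = json.loads(feed_json)
    refresh_seconds = float(feed.get("refresh-time", 5))  # default is an assumption
    drop_percent = min(100.0, max(0.0, float(feed.get("drop-%", 0))))
    return refresh_seconds, drop_percent


def should_send(drop_percent: float, rng: random.Random) -> bool:
    """Decide client-side whether this submission is actually posted."""
    # rng.random() is in [0, 1), so 0 never drops and 100 always drops.
    return rng.random() * 100.0 >= drop_percent
```

Because every client reads the same feed, the operators can slow down polling or shed write load across the whole audience by changing two numbers in one cached file.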

One Massive Hack

[Diagram labels: Votes, Poll Total, Aggregate Query, X Scaling Factor]

War Room

• Live monitoring

• Roles assigned

• Recovery, failure and escalation procedures

Success

• Site stayed up

• Failed to pay attention to moderation tool (5 users)

• Grand strategy paid off

• Subsequent broadcasts, no hacks.

Grown-up Database Scaling

• Clustering

• Sharding

Clustering

Sharding

NoSQL

The Cloud

• Trivial to set up multiple servers with different roles

• We use Amazon EC2

• Managed via Chef or Puppet

Architecture diagram (labels only):

• Internet, HTTPS

• Load Balancer

• HAProxy

• Master Web Server: Linux, Apache 2.2, PHP 5.3, Silverstripe CMS

• Web Server (x2): varnish, Linux, Apache 2.2, PHP 5.3, Silverstripe CMS

• Web Server (Survey): varnish, Linux, Apache 2.2, PHP 5.3

• memcache

• MySQL Master (read/write)

• MySQL Slave (read) (x2)

• Akamai NetStorage (static asset files)

• Images/Files pushed via FTP

Case Study 3

Glyndebourne

• Opera Tickets, Known for failure online

• Dedicated International Audience, 1,000 loyal fans queued at midnight

• Credit Card Transactions, c. £0.5m in one night

• Many dependencies, points of failure

Backstage CMS

Varnish Webcache

Tessitura API

Magento e-Commerce

SSL

Unified Shopping Basket

Unified User Accounts

MySQL Database

Purchase Path & Membership Pages

Waiting Room

Marketing Pages

Merchandise Pages

Restaurant System

Car Park System

Fulfillment System

Booksolve

Email System (Campaign Monitor)

Google Analytics

Third-Party SAAS

Open-Source Software

Proprietary Software

Custom/Adapted Development

Tessitura

DAM

CDN Edge Servers

Media Player

Geotargetting

Identify Bottlenecks

Scalable Tessitura Ticketing Path

[Diagram labels: User Journey through Browser, Waiting Room, Backstage, Ticketing Pages; sections Anonymous and Personalised; SSL; Cache, Backstage, Tessitura. Throughput figures: 2000 requests / second, 300 requests / second, 2000 requests / second, 360; Target: 48 requests / second]

Queues

• Velvet Rope Policy (Doctor’s Waiting Room)

• Rabbit MQ, Amazon Simple Queue Service

• Send requests for completion later (e.g. billing credit cards)
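A minimal sketch of the velvet rope idea, assuming a fixed number of places on the expensive part of the site and a first-come, first-served queue for everyone else. The class `WaitingRoom` and its methods are hypothetical, and the state lives in one process here; a production version would need shared storage or a message queue such as those named above.

```python
from collections import deque
from typing import Deque, List, Set


class WaitingRoom:
    """Admit at most `capacity` users at once; queue the rest in arrival order."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.inside: Set[str] = set()
        self.queue: Deque[str] = deque()

    def arrive(self, user_id: str) -> bool:
        """Return True if the user is admitted, False if they must wait."""
        if user_id in self.inside:
            return True
        # Nobody may jump ahead of people who are already queueing.
        if len(self.inside) < self.capacity and not self.queue:
            self.inside.add(user_id)
            return True
        if user_id not in self.queue:
            self.queue.append(user_id)
        return False

    def leave(self, user_id: str) -> List[str]:
        """Free a place and return the users admitted from the queue."""
        self.inside.discard(user_id)
        admitted: List[str] = []
        while self.queue and len(self.inside) < self.capacity:
            next_user = self.queue.popleft()
            self.inside.add(next_user)
            admitted.append(next_user)
        return admitted
```

The waiting page itself can be anonymous and fully cacheable, so the queue protects the slow, personalised back end without adding much load of its own.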

Load Testing

“So did it pass the load test? No? Oh.”

Commercial Load Testing

• Expensive

• Difficult to simulate 500,000 real users

• All or nothing

• Won’t tell you much

• Should confirm what you know, not be your first sign something is wrong

Test Plan

• Load testing integrated into development process

• Business criteria. Number of checkouts, not requests/second.

• Help your client understand
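Translating a business criterion into a technical target is simple arithmetic. The sketch below uses made-up figures (12 requests per checkout, a 10-minute window) purely for illustration; only the idea of starting from checkouts rather than requests/second comes from the slide, and the function name is hypothetical.

```python
def required_requests_per_second(checkouts: int, requests_per_checkout: int,
                                 window_seconds: float, peak_factor: float = 1.0) -> float:
    """Average request rate needed to complete `checkouts` within the window.

    peak_factor allows for traffic arriving unevenly (1.0 means perfectly even).
    """
    return checkouts * requests_per_checkout * peak_factor / window_seconds


# Hypothetical example: 1,000 checkouts, 12 requests each, within 10 minutes.
EXAMPLE_RATE = required_requests_per_second(1000, 12, 600)
```

With those assumed numbers the target is 20 requests per second on average, which is a figure a component test can then be measured against.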

Component Test First

• Apache Bench (ab)

• Break down chain into components

• Identify bottlenecks

• Requests/second. 100% complete without errors.

ApacheBench

Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking www.cyberciti.biz (be patient).....done

Server Software:
Server Hostname:        www.somewhere.com
Server Port:            80

Document Path:          /
Document Length:        16289 bytes

Concurrency Level:      1
Time taken for tests:   16.885975 seconds
Complete requests:      10
Failed requests:        0
Write errors:           0
Total transferred:      166570 bytes
HTML transferred:       162890 bytes
Requests per second:    0.59 [#/sec] (mean)
Time per request:       1688.597 [ms] (mean)
Time per request:       1688.597 [ms] (mean, across all concurrent requests)
Transfer rate:          9.59 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      353   375  16.1    386    391
Processing:  1240  1312  52.1   1339   1369
Waiting:      449   472  16.2    476    499
Total:       1593  1687  67.7   1730   1756

Percentage of the requests served within a certain time (ms)
  50%   1730
  66%   1733
  75%   1741
  80%   1753
  90%   1756
  95%   1756
  98%   1756
  99%   1756
 100%   1756 (longest request)

Component Tests

• Suspect one part of the process?

• Write a script that just does that part

• Load test that part in isolation

• It’s like bug fixing
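A small harness along these lines can load test one suspect component in isolation, assuming that component can be wrapped in a Python callable that raises on failure. The function `load_test` and the shape of its result are hypothetical; it reports the same two things the slides ask for, requests/second and whether everything completed without errors.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def load_test(component: Callable[[], None], requests: int,
              concurrency: int) -> Dict[str, float]:
    """Call `component` `requests` times across `concurrency` threads."""

    def one_call(_: int) -> bool:
        try:
            component()
            return True
        except Exception:
            return False  # count the failure and keep going

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_call, range(requests)))
    elapsed = time.perf_counter() - started

    completed = sum(results)
    return {
        "requests": requests,
        "failed": requests - completed,
        "seconds": elapsed,
        "requests_per_second": completed / elapsed if elapsed > 0 else float("inf"),
    }
```

Running it at increasing concurrency against each link in the chain, one at a time, shows which link stops scaling first.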

Selenium: Better Testing

Bring It Together in the Browser

Widgetisation

[Diagram labels: Service (x3), Web Layer, User, Browser]

Natural Integrator Approach

[Diagram labels: Service (x3), Web Layer (x3), User, Browser; Easily Cacheable (CMS); Heavily Personalised (E-Commerce); Might Scale Better]

When it all goes wrong

• Fail gracefully

• Have an escalation plan

• A good use of your time

Essay Topics

• Twitter famously couldn’t scale their architecture to keep pace with growth. Discuss the architectural differences between a blogging platform and a messaging platform. Research the early architectural design of Twitter, and describe an architecture that might have scaled better.

• Consider one of your other university projects; something with a data-schema or an Entity Relationship Diagram. Consider challenges you might have scaling that system to many thousands of users. Where would the bottlenecks be? What optimisations or compromises could you make?

• Imagine you were tasked with delivering a high-volume live chat-room over the web (something like Google+ perhaps, or my EB Live Case Study). Research and compare modern approaches to delivering the project. [hint: look up WebSockets and Node.js]. Compare with a simple polling approach and discuss with reference to REST.
