Guerilla Scaling In The Wild
Jake Grimley, Managing Director, Made Media Ltd.
Made Media Ltd.
• Digital Marketing Agency
• Web Design Agency
• Arts, Media, Advertising
• Design & Code
• Disposable Software
Employing...
[Pie chart of staff by role. Roles: Developers, Designers, Project Managers, SysOps, Management. Segment values: 38%, 19%, 19%, 6%, 19%.]
Including...
• 1 Birmingham Uni SoftEng Graduate
• And Stef Lewandowski
• And One Placement
• Working on the EuDML Project as a partner alongside Bham Uni
Jake Grimley, Managing Director
• Studied Physics
• Learnt to Program on a ZX Spectrum
• Reasonable Grasp of OOP, REST, LAMP
• Big Picture Company Overview
Guerilla Scaling In The Wild
Scaling
Scalability is the ability of a system, network, or process to handle growing amounts of work in a graceful manner, or its ability to be enlarged to accommodate that growth.
Commercial Definition
Adding more users should make Total Cost of Ownership cheaper, per-user, not more expensive.
In The Wild
Guerilla
Real World
• Client may not consider scaling at all. Problem not feature.
• Timescales tight
• Limited opportunities to test and learn
• Hardware scale-out not always possible
• Your reputation on the line
Case Study 1
Embarrassing Bodies
• Channel 4’s most successful multi-platform project
• 1 episode: 4m viewers, 150 thousand web users, 1.2m page views
Things we’d rather they didn’t ask
• Would you like to look at some boobs?
• Do you want to take an autism test? (38,000 users in 1 minute)
Scaling Up / Scaling Out
Up
• Quickest
• ‘Big Iron’
• Hit Limits Fast
Out
• Preferable
• ‘Unlimited’
• Requires Planning
Embarrassing Bodies
[Stack diagram]
• Firewall: Smoothwall
• Reverse Proxy: Varnish
• Web Server: Apache (PHP)
• Database Server: MySQL
• App
Optimise Last
• Chances are you won’t need to
• Servers are cheaper than programmers
• You don’t know how much you need to optimise when you start
• You don’t know where the bottlenecks will be
• But don’t be reckless
Cache FIRST
Edge Cache / CDN
Reverse Proxy
Application
Object Cache (Memcached)
Database (Query Cache)
Caching
• Start from front and work backwards
• Works best with high read-to-write ratio
• Low levels of personalisation
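The object-cache layer above can be sketched as a read-through cache with a time-to-live. This is a minimal in-process illustration, not Memcached itself; the class name `ReadThroughCache`, the `load` callback and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Tiny in-process stand-in for an object cache such as Memcached."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str, load: Callable[[], Any]) -> Any:
        """Return the cached value, calling load() only on a miss or expiry."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = load()  # the expensive call, e.g. a database query
        self._store[key] = (now + self.ttl_seconds, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key after a write so the next read reloads it."""
        self._store.pop(key, None)
```

With a high read-to-write ratio most calls to `get` are hits, which is why this works best for pages with little personalisation.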
“There are only two hard problems in Computer Science: cache invalidation and naming things.”
Phil Karlton
Ultimately, this strategy will run out of steam
HTTP Optimisations
• Minimise HTTP requests
• Add Expires / Cache-Control headers
• Don’t start an Application process when you don’t need one
• Offload Images / Javascript etc. to S3
• Cache output into flat files
• Optimise connection limits and spare child processes
• Be careful with Keep-alive
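As one illustration of "cache output into flat files", here is a minimal sketch. The function name `cached_render`, the SHA-256 file naming and the `render` callback are assumptions for the example; in a real deployment the web server would normally serve the flat file directly, so no application process starts at all.

```python
import hashlib
import os
import tempfile
from typing import Callable


def cached_render(cache_dir: str, url_path: str,
                  render: Callable[[], str]) -> str:
    """Serve url_path from a flat file if present, else render and store it."""
    name = hashlib.sha256(url_path.encode("utf-8")).hexdigest() + ".html"
    file_path = os.path.join(cache_dir, name)
    try:
        with open(file_path, "r", encoding="utf-8") as fh:
            return fh.read()  # cache hit: no rendering work done
    except FileNotFoundError:
        pass
    html = render()  # cache miss: do the expensive work once
    # Write to a temporary file, then rename, so that concurrent readers
    # never see a half-written page.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(html)
    os.replace(tmp_path, file_path)
    return html
```

Invalidation is then a matter of deleting the file when the underlying content changes.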
Connection Pooling
• Ideal: Connection established, Request served, Connection closed
• Connections are expensive to open, and each one held open consumes RAM
• Too many connections and the server will lock up
• Too few connections and requests will be denied
• We want to safeguard against lock-up whilst refusing as few requests as possible
• Default configurations are conservative
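The trade-off above (cap connections to avoid lock-up, but refuse as few requests as possible) can be sketched with a bounded semaphore. `ConnectionLimiter` and its `handle` method are hypothetical names; a real web server enforces this limit in its own configuration rather than in application code.

```python
import threading
from typing import Callable, Optional


class ConnectionLimiter:
    """Refuse work beyond max_connections instead of letting the server lock up."""

    def __init__(self, max_connections: int):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.refused = 0

    def handle(self, serve: Callable[[], str]) -> Optional[str]:
        """Run serve() if a slot is free; return None (refused) otherwise."""
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.refused += 1
            return None  # the caller would answer with an error such as 503
        try:
            return serve()
        finally:
            self._slots.release()  # free the slot as soon as the request ends
```

Tracking `refused` while raising `max_connections` in small steps is one way to find a limit that is less conservative than the default without risking lock-up.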
Stick to REST
• REST was designed by smarter people than you
• Distributed caching is built in by design
• Understand the difference between GET and POST
• Stateless systems scale better
Distributed Caching
Cache-Control: max-age=86400, must-revalidate
• max-age=86400: The number of seconds for which the resource is considered fresh. Similar to the Expires header, except that it counts seconds from now rather than giving a specific date/time.
• s-maxage=86400: The same as max-age, except that it only applies to shared caches.
• public: Makes the response always cacheable, even if it wouldn't normally be cacheable (behind authentication, etc.).
• no-cache: Forces caches to ask the server for validation before releasing a cached copy; if the server says the cached version is still fresh, it is served from the cache.
• must-revalidate: Forces caches to obey any freshness information you give them about a resource.
• proxy-revalidate: The same as must-revalidate, except that it only applies to shared (proxy) caches.
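To make the directives concrete, here is a minimal sketch of how a cache might read a Cache-Control header and decide how long a response stays fresh. The function names are assumptions, and the logic covers only max-age, s-maxage and no-cache; real caches implement many more rules.

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header into {directive: value-or-None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


def freshness_lifetime(header: str, shared_cache: bool) -> Optional[int]:
    """Seconds the response may be served without revalidation.

    Returns 0 when no-cache forces validation on every use, and None when
    the header gives no explicit lifetime.
    """
    d = parse_cache_control(header)
    if "no-cache" in d:
        return 0
    if shared_cache and d.get("s-maxage") is not None:
        return int(d["s-maxage"])  # s-maxage applies to shared caches only
    if d.get("max-age") is not None:
        return int(d["max-age"])
    return None
```

For example, `freshness_lifetime("max-age=60, s-maxage=600", shared_cache=True)` gives a proxy such as Varnish ten minutes, while a browser cache gets one minute from the same header.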
GET/POST
[Diagram labels: Client, Cache, App, DB; request types: GET, POST]
State stored in Application
[Diagram labels: Client, Load Balancer, Sticky Session, App1 to App5, DB]
Stateless
[Diagram labels: Client, Round Robin, App1 to App6, DB]
Stateless?
• Store session state on client (encrypted cookie)
• Push session state down into database (now we have a database scaling problem)
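Storing session state on the client can be sketched as follows. Note the swap: the slide says "encrypted cookie", but this example only signs the cookie with an HMAC (tamper-proof, not secret), because the Python standard library has no symmetric encryption. `SECRET_KEY` and both function names are placeholders for illustration.

```python
import base64
import hashlib
import hmac
import json
from typing import Any, Dict, Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, shared by all app servers


def encode_session(state: Dict[str, Any], key: bytes = SECRET_KEY) -> str:
    """Serialise session state into a signed cookie value."""
    payload = base64.urlsafe_b64encode(
        json.dumps(state, sort_keys=True).encode("utf-8"))
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_session(cookie: str, key: bytes = SECRET_KEY) -> Optional[Dict[str, Any]]:
    """Return the session state, or None if the cookie was tampered with."""
    payload, sep, signature = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Because any app server holding the key can verify the cookie, requests can be distributed round robin with no sticky sessions.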
Case Study 2
Embarrassing Bodies Live
• Live post-broadcast show with video streaming
• User generated content:
• Images
• Comments
• Polls
• Votes
• 40 transactions/second
Scaling Out
• Smart Client built in Javascript
• Poll for JSON data-update
• JSON distributed across EC2 varnish instances (GET)
• All POST requests filtered through one server
• Doctor’s Waiting Room (Velvet Rope Policy)
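The velvet rope idea can be sketched as simple admission control: only a fixed number of users are active at once, and everyone else waits in arrival order. `WaitingRoom` and its method names are assumptions for the example, not the code used on the project.

```python
from collections import deque
from typing import Deque, Set


class WaitingRoom:
    """Admit at most `capacity` users; queue the rest first come, first served."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.active: Set[str] = set()
        self.queue: Deque[str] = deque()

    def arrive(self, user_id: str) -> str:
        """Return 'admitted' or 'waiting' for this user."""
        if user_id in self.active:
            return "admitted"
        if user_id in self.queue:
            return "waiting"  # already queued; keep their place
        if len(self.active) < self.capacity and not self.queue:
            self.active.add(user_id)
            return "admitted"
        self.queue.append(user_id)
        return "waiting"

    def leave(self, user_id: str) -> None:
        """Free a slot and admit whoever has waited longest."""
        self.active.discard(user_id)
        while self.queue and len(self.active) < self.capacity:
            self.active.add(self.queue.popleft())
```

The waiting page itself can be a static, cacheable page that polls `arrive`, so queued users cost very little.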
Pragmatic Architecture
De-Normalise
[Diagram labels: Votes, Poll Total, Aggregate Query (repeated), 1 simple query, Data mining/reporting]
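One reading of the de-normalisation diagram is that a running poll total is stored alongside the raw votes, so the live site reads it with one simple query while aggregate queries are left for reporting. The sketch below uses SQLite for illustration; the table and function names are assumptions.

```python
import sqlite3


def create_schema(db: sqlite3.Connection) -> None:
    db.execute("CREATE TABLE votes (poll_id INTEGER, option TEXT)")
    db.execute(
        "CREATE TABLE poll_totals (poll_id INTEGER, option TEXT, "
        "total INTEGER, PRIMARY KEY (poll_id, option))")


def record_vote(db: sqlite3.Connection, poll_id: int, option: str) -> None:
    """Store the raw vote and bump the de-normalised total in one transaction."""
    with db:
        db.execute("INSERT INTO votes (poll_id, option) VALUES (?, ?)",
                   (poll_id, option))
        db.execute("INSERT OR IGNORE INTO poll_totals (poll_id, option, total) "
                   "VALUES (?, ?, 0)", (poll_id, option))
        db.execute("UPDATE poll_totals SET total = total + 1 "
                   "WHERE poll_id = ? AND option = ?", (poll_id, option))


def totals_denormalised(db: sqlite3.Connection, poll_id: int) -> dict:
    """The cheap read used by the live site: no aggregation at query time."""
    rows = db.execute("SELECT option, total FROM poll_totals WHERE poll_id = ?",
                      (poll_id,))
    return dict(rows.fetchall())


def totals_aggregate(db: sqlite3.Connection, poll_id: int) -> dict:
    """The expensive read, kept for data mining and reporting."""
    rows = db.execute("SELECT option, COUNT(*) FROM votes WHERE poll_id = ? "
                      "GROUP BY option", (poll_id,))
    return dict(rows.fetchall())
```

Both reads return the same numbers; the difference is that the first costs the same however many votes have been cast.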
• JSON feed included ‘refresh-time’ variable and ‘drop-%’ variable
• Primary Concern: Keep the site (and video feed) up
• ‘Red, Yellow, Green’. ‘Defcon 1, Defcon 2’
• Triumph of Pragmatism over theory
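A sketch of how the ‘refresh-time’ and ‘drop-%’ variables might throttle clients from the server side. The slides give only the variable names and the colour scheme, so the three levels, their numbers, and the interpretation of ‘drop-%’ as the share of client submissions to discard are all assumptions.

```python
import json
import random
from typing import Callable, Dict

# Hypothetical settings per alert level; the real values are not in the slides.
LOAD_LEVELS: Dict[str, Dict[str, int]] = {
    "green": {"refresh-time": 5, "drop-%": 0},
    "yellow": {"refresh-time": 15, "drop-%": 25},
    "red": {"refresh-time": 60, "drop-%": 75},
}


def build_feed(data: dict, level: str) -> str:
    """Server side: attach the throttling variables to the JSON feed."""
    feed = dict(data)
    feed.update(LOAD_LEVELS[level])
    return json.dumps(feed)


def client_plan(feed_json: str,
                rand: Callable[[], float] = random.random) -> Dict[str, object]:
    """Client side: decide when to poll next and whether to send POSTs."""
    feed = json.loads(feed_json)
    send_posts = rand() * 100 >= feed["drop-%"]
    return {"next_poll_in": feed["refresh-time"], "send_posts": send_posts}
```

Because the feed is cacheable JSON fetched with GET, changing the level on the server slows every client within one refresh cycle without redeploying anything.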
One Massive Hack
[Diagram labels: Votes, Poll Total, Aggregate Query, × Scaling Factor]
War Room
• Live monitoring
• Roles assigned
• Recovery, failure and escalation procedures
Success
• Site stayed up
• Failed to pay attention to moderation tool (5 users)
• Grand strategy paid off
• Subsequent broadcasts, no hacks.
Grown-up Database Scaling
• Clustering
• Sharding
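Sharding means routing each record to one of several databases by a key. A minimal sketch using hash-based routing follows; the shard names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
from typing import List


def shard_for(key: str, shards: List[str]) -> str:
    """Pick a shard deterministically from the key, e.g. a user id."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

The same key always lands on the same shard, so reads find what writes stored. The weakness of plain modulo routing is that changing the number of shards moves most keys, which is one reason sharding requires planning.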
Clustering
Sharding
NoSQL
The Cloud
• Trivial to set up multiple servers with different roles
• We use Amazon EC2
• Managed via Chef or Puppet
[Architecture diagram components]
• Internet (HTTPS)
• Load Balancer (HAProxy)
• Web Server (×2): varnish, Linux, Apache 2.2, PHP 5.3, Silverstripe CMS
• Master Web Server: Linux, Apache 2.2, PHP 5.3, Silverstripe CMS
• Web Server (Survey): varnish, Linux, Apache 2.2, PHP 5.3
• memcache
• MySQL Master (read/write)
• MySQL Slave (read) (×2)
• Akamai NetStorage (static asset files); images/files pushed via FTP
Case Study 3
Glyndebourne
• Opera Tickets, Known for failure online
• Dedicated International Audience, 1,000 loyal fans queued at midnight
• Credit Card Transactions, c. £0.5m in one night
• Many dependencies, points of failure
Backstage CMS
Varnish Webcache
Tessitura API
Magento e-Commerce
SSL
Unified Shopping Basket
Unified User Accounts
MySQL Database
Purchase Path & Membership Pages
Waiting Room
Marketing Pages
Merchandise Pages
Restaurant System
Car Park System
Fulfillment System
Booksolve
Email System (Campaign Monitor)
Google Analytics
Third-Party SAAS
Open-Source Software
Proprietary Software
Custom/Adapted Development
Tessitura
DAM
CDN Edge Servers
Media Player
Geotargetting
Identify Bottlenecks
Scalable Tessitura Ticketing Path
[Diagram of the user journey. Labels: Browser, Waiting Room (Anonymous), Backstage Ticketing Pages (Personalised), Cache, SSL, Backstage, Tessitura. Figures shown: 2000 requests/second, 300 requests/second, 2000 requests/second, 360, and a target of 48 requests/second.]
Queues
• Velvet Rope Policy (Doctor’s Waiting Room)
• Rabbit MQ, Amazon Simple Queue Service
• Send requests for completion later (e.g. billing credit cards)
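"Send requests for completion later" can be sketched with an in-process queue and a background worker. This is only a stand-in for a broker such as RabbitMQ or Amazon SQS; the function names and the shape of an order are assumptions.

```python
import queue
import threading
from typing import Callable


def start_worker(jobs: "queue.Queue",
                 process: Callable[[dict], None]) -> threading.Thread:
    """Consume jobs in the background; a None job tells the worker to stop."""
    def run() -> None:
        while True:
            job = jobs.get()
            try:
                if job is None:
                    return
                process(job)  # the slow step, e.g. billing a credit card
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def accept_order(jobs: "queue.Queue", order: dict) -> str:
    """Web request path: enqueue and answer immediately."""
    jobs.put(order)
    return "accepted"
```

The user gets a fast response while the slow dependency is worked through at whatever rate it can sustain.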
Load Testing
“So did it pass the load test? No? Oh.”
Commercial Load Testing
• Expensive
• Difficult to simulate 500,000 real users
• All or nothing
• Won’t tell you much
• Should confirm what you know, not be your first sign something is wrong
Test Plan
• Load testing integrated into development process
• Business criteria. Number of checkouts, not requests/second.
• Help your client understand
Component Test First
• Apache Bench (ab)
• Break down chain into components
• Identify bottlenecks
• Requests/second. 100% complete without errors.
ApacheBench
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking www.cyberciti.biz (be patient).....done

Server Software:
Server Hostname:        www.somewhere.com
Server Port:            80

Document Path:          /
Document Length:        16289 bytes

Concurrency Level:      1
Time taken for tests:   16.885975 seconds
Complete requests:      10
Failed requests:        0
Write errors:           0
Total transferred:      166570 bytes
HTML transferred:       162890 bytes
Requests per second:    0.59 [#/sec] (mean)
Time per request:       1688.597 [ms] (mean)
Time per request:       1688.597 [ms] (mean, across all concurrent requests)
Transfer rate:          9.59 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      353   375  16.1    386    391
Processing:  1240  1312  52.1   1339   1369
Waiting:      449   472  16.2    476    499
Total:       1593  1687  67.7   1730   1756

Percentage of the requests served within a certain time (ms)
  50%   1730
  66%   1733
  75%   1741
  80%   1753
  90%   1756
  95%   1756
  98%   1756
  99%   1756
 100%   1756 (longest request)
Component Tests
• Suspect one part of the process?
• Write a script that just does that part
• Load test that part in isolation
• It’s like bug fixing
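A component test script can be as small as the sketch below: it calls one suspect step repeatedly at a chosen concurrency and reports requests per second and failures, in the spirit of ApacheBench. The `load_test` function and its `component` callable are assumptions; in practice the callable would wrap the single call under suspicion, such as one API request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def load_test(component: Callable[[], None], total: int,
              concurrency: int) -> Dict[str, float]:
    """Call component() `total` times using `concurrency` threads."""
    def one_call(_: int) -> bool:
        try:
            component()
            return True
        except Exception:
            return False  # count any exception as a failed request

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_call, range(total)))
    elapsed = time.perf_counter() - started
    completed = sum(results)
    return {
        "completed": completed,
        "failed": total - completed,
        "requests_per_second": total / elapsed if elapsed > 0 else float("inf"),
    }
```

The pass criterion from the slides still applies: the run only counts if 100% of calls complete without errors, so `failed` should be zero before the rate is taken seriously.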
Selenium: Better Testing
Bring It Together in the Browser
Widgetisation
Service Service Service
Web Layer
User
Browser
Natural Integrator Approach
Service Service Service
Web Layer
User
Browser
Web Layer
Web Layer
Easily Cacheable (CMS)
Heavily Personalised (E-Commerce)
Might Scale Better
When it all goes wrong
• Fail gracefully
• Have an escalation plan
• A good use of your time
Essay Topics
• Twitter famously couldn’t scale their architecture to keep pace with growth. Discuss the architectural differences between a blogging platform and a messaging platform. Research the early architectural design of Twitter, and describe an architecture that might have scaled better.
• Consider one of your other university projects; something with a data-schema or an Entity Relationship Diagram. Consider challenges you might have scaling that system to many thousands of users. Where would the bottlenecks be? What optimisations or compromises could you make?
• Imagine you were tasked with delivering a high-volume live chat-room over the web (something like Google+ perhaps, or my EB Live Case Study). Research and compare modern approaches to delivering the project. [hint: look up WebSockets and node.js]. Compare with a simple polling approach and discuss with reference to REST.