Achieving Rapid Response Times in Large Online Services
Jeff Dean, Google Fellow
[email protected]
Monday, March 26, 2012


Description

A great recent talk by Googler Jeff Dean on the problems caused by occasional latency hiccups in large-scale distributed systems. Some highlights: it is better to synchronize than to randomize regularly scheduled OS tasks and log rotations across machines (a surprising result to me, and rarely done to my knowledge); the number of replicas per shard should vary with demand, such as keeping more replicas of recent log data or popular web-page snippets (may seem obvious, but I've seen people skip this and just use three replicas for all data); requests should stop going to nodes with high latency, with replacement nodes brought up instead (also may seem obvious, but not used enough in the systems I've seen); and redundant requests should be issued for data: send a second request if the first does not respond almost immediately (<2 ms), use whichever reply comes first, and cancel the other (again seemingly obvious but rarely done; more common is a retry loop with a much longer (>50 ms) wait between retries).



Faster Is Better

Large Fanout Services

[Diagram: a query enters a frontend web server and fans out, via cache servers, an ad system, and a "super root," to many index services: News, Images, Web, Blogs, Video, Books, Local.]

Why Does Fanout Make Things Harder?

• Overall latency ≥ latency of slowest component
  – small blips on individual machines cause delays
  – touching more machines increases likelihood of delays

• Server with 1 ms avg. but 1 sec 99%ile latency
  – touch 1 of these: 1% of requests take ≥ 1 sec
  – touch 100 of these: 63% of requests take ≥ 1 sec
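The arithmetic behind the 63% figure is easy to check. A minimal Go sketch, using only the slide's numbers (each server independently has a 1% chance of taking ≥ 1 sec):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// P(one server is slow on a given request) = 0.01 (the 99%ile is 1 sec).
	p := 0.01
	for _, n := range []int{1, 10, 100} {
		// P(at least one of n servers is slow) = 1 - (1-p)^n.
		slow := 1 - math.Pow(1-p, float64(n))
		fmt.Printf("fanout %3d: %2.0f%% of requests see a >= 1 sec server\n",
			n, 100*slow)
	}
}
```

For a fanout of 100 this prints 63%, matching the slide.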

One Approach: Squash All Variability

• Carefully engineer all components of the system

• Possible at small scale
  – dedicated resources
  – complete control over whole system
  – careful understanding of all background activities
  – less likely to have hardware fail in bizarre ways

• System changes are difficult
  – software or hardware changes affect delicate balance

Not tenable at large scale: need to share resources

Shared Environment

• Huge benefit: greatly increased utilization

• ... but hard-to-predict effects increase variability
  – network congestion
  – background activities
  – bursts of foreground activity
  – not just your jobs, but everyone else's jobs, too

• Exacerbated by large fanout systems

Shared Environment

[Diagram, built up over several slides: a single machine runs Linux plus a file system chunkserver, a scheduling system, various other system services, a Bigtable tablet server, a CPU-intensive job, a random MapReduce #1, and random apps #1 and #2, all sharing the same hardware.]

Basic Latency Reduction Techniques

• Differentiated service classes
  – prioritized request queues in servers
  – prioritized network traffic

• Reduce head-of-line blocking
  – break large requests into sequence of small requests

• Manage expensive background activities
  – e.g. log compaction in distributed storage systems
  – rate limit activity
  – defer expensive activity until load is lower
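As a concrete illustration of differentiated service classes, here is a minimal Go sketch of a prioritized request queue: the dispatcher always prefers interactive work over batch work. The Request type and the two-channel setup are hypothetical, not from the talk:

```go
package main

import (
	"fmt"
	"time"
)

type Request struct {
	ID          int
	Interactive bool
}

// dispatch always drains the high-priority (interactive) queue first and
// only takes low-priority (batch) work when no interactive work is waiting.
func dispatch(hi, lo <-chan Request, handle func(Request)) {
	for {
		select {
		case r := <-hi: // interactive work is waiting: take it immediately
			handle(r)
		default:
			select { // otherwise block on whichever queue delivers first
			case r := <-hi:
				handle(r)
			case r := <-lo:
				handle(r)
			}
		}
	}
}

func main() {
	hi := make(chan Request, 8)
	lo := make(chan Request, 8)
	for i := 1; i <= 3; i++ {
		lo <- Request{ID: i} // three batch requests already queued
	}
	hi <- Request{ID: 99, Interactive: true} // one interactive request

	go dispatch(hi, lo, func(r Request) {
		fmt.Printf("served request %d (interactive=%v)\n", r.ID, r.Interactive)
	})
	time.Sleep(50 * time.Millisecond) // let the dispatcher drain the queues
}
```

Request 99 is served first even though the batch requests arrived earlier.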

Synchronized Disruption

• Large systems often have background daemons
  – various monitoring and system maintenance tasks

• Initial intuition: randomize when each machine performs these tasks
  – actually a very bad idea for high fanout services
  – at any given moment, at least one or a few machines are slow

• Better to actually synchronize the disruptions
  – run every five minutes "on the dot"
  – one synchronized blip better than unsynchronized
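A minimal Go sketch of running a disruption "on the dot": every machine independently computes the same next five-minute boundary, so the blips line up across the fleet. The runOnTheDot helper is a hypothetical name, not from the talk:

```go
package main

import (
	"fmt"
	"time"
)

// runOnTheDot runs task at the top of every interval (e.g. every five
// minutes "on the dot"). Since wall clocks agree, all machines blip at
// the same moment instead of each disrupting a different slice of requests.
func runOnTheDot(interval time.Duration, task func()) {
	for {
		now := time.Now()
		next := now.Truncate(interval).Add(interval) // next interval boundary
		time.Sleep(next.Sub(now))
		task()
	}
}

func main() {
	runOnTheDot(5*time.Minute, func() {
		fmt.Println("rotating logs at", time.Now().Format(time.RFC3339))
	})
}
```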

Tolerating Faults vs. Tolerating Variability

• Tolerating faults:
  – rely on extra resources
    • RAIDed disks, ECC memory, dist. system components, etc.
  – make a reliable whole out of unreliable parts

• Tolerating variability:
  – use these same extra resources
  – make a predictable whole out of unpredictable parts

• Time scales are very different:
  – variability: 1000s of disruptions/sec, scale of milliseconds
  – faults: 10s of failures per day, scale of tens of seconds

Latency Tolerating Techniques

• Cross-request adaptation
  – examine recent behavior
  – take action to improve latency of future requests
  – typically relates to balancing load across a set of servers
  – time scale: 10s of seconds to minutes

• Within-request adaptation
  – cope with slow subsystems in context of higher-level request
  – time scale: right now, while user is waiting

Fine-Grained Dynamic Partitioning

• Partition large datasets/computations
  – more than 1 partition per machine (often 10-100/machine)
  – e.g. BigTable, query serving systems, GFS, ...

[Diagram: a master assigns many small partitions (1, 3, 17, 2, 12, 78, 4, 9, ...) across Machine 1 through Machine N.]

Load Balancing

• Can shed load in few percent increments
  – prioritize shifting load when imbalance is more severe

[Diagram, animated over several slides: the master notices one machine is overloaded and moves a few of its partitions to other machines.]
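A minimal Go sketch of the kind of incremental rebalancing a master might do; the data structures and thresholds here are hypothetical, but the point is the one on the slide: with fine-grained partitions, each move shifts only a few percent of a machine's load.

```go
package main

import (
	"fmt"
	"sort"
)

type machine struct {
	name   string
	shards []float64 // per-shard load, as a fraction of one machine's capacity
}

func load(m machine) float64 {
	total := 0.0
	for _, s := range m.shards {
		total += s
	}
	return total
}

// rebalance repeatedly moves the smallest shard from the most loaded
// machine to the least loaded one. Because partitions are fine-grained
// (10-100 per machine), each move sheds load in few-percent increments.
func rebalance(ms []machine, maxImbalance float64) {
	for {
		sort.Slice(ms, func(i, j int) bool { return load(ms[i]) < load(ms[j]) })
		coldest, hottest := &ms[0], &ms[len(ms)-1]
		diff := load(*hottest) - load(*coldest)
		if diff <= maxImbalance {
			return // balanced enough
		}
		sort.Float64s(hottest.shards)
		s := hottest.shards[0]
		if 2*s >= diff {
			return // moving even the smallest shard would overshoot
		}
		hottest.shards = hottest.shards[1:]
		coldest.shards = append(coldest.shards, s)
	}
}

func main() {
	ms := []machine{
		{"machine 1", []float64{0.05, 0.05, 0.04}},
		{"machine 2", []float64{0.10, 0.09, 0.12, 0.11, 0.08}}, // overloaded
		{"machine 3", []float64{0.06, 0.05}},
	}
	rebalance(ms, 0.10)
	for _, m := range ms {
		fmt.Printf("%s: %.2f load, %d shards\n", m.name, load(m), len(m.shards))
	}
}
```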

Speeds Failure Recovery

• Many machines each recover one or a few partitions
  – e.g. BigTable tablets, GFS chunks, query serving shards

[Diagram, animated: a machine holding partitions 1, 3, and 17 fails; each of its partitions is recovered in parallel on a different surviving machine.]

Selective Replication

• Find heavily used items and make more replicas
  – can be static or dynamic

• Example: query serving system
  – static: more replicas of important docs
  – dynamic: more replicas of Chinese documents as Chinese query load increases

[Diagram, animated: the master adds extra replicas of hot partitions across the machines.]

Latency-Induced Probation

• Servers sometimes become slow to respond
  – could be data dependent, but...
  – often due to interference effects
    • e.g. CPU or network spike for other jobs running on shared server

• Non-intuitive: remove capacity under load to improve latency (?!)

• Initiate corrective action
  – e.g. make copies of partitions on other servers
  – continue sending shadow stream of requests to server
    • keep measuring latency
    • return to service when latency back down for long enough
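A minimal Go sketch of latency-induced probation as described above: track smoothed latency per replica, pull a replica out of live rotation when it rises, and keep feeding it a shadow stream of measurements until it recovers. The EWMA smoothing and the thresholds are my assumptions, not from the talk:

```go
package main

import (
	"fmt"
	"time"
)

// A replica is put on probation (removed from live rotation) when its
// smoothed latency rises, and returns to service only after latency has
// come back down. While on probation it still receives a shadow stream
// of requests so we keep measuring it.
type replica struct {
	name        string
	ewma        time.Duration // smoothed response latency
	onProbation bool
}

const alpha = 0.2 // weight of the newest latency sample

func (r *replica) observe(sample, slow, healthy time.Duration) {
	r.ewma = time.Duration(alpha*float64(sample) + (1-alpha)*float64(r.ewma))
	switch {
	case !r.onProbation && r.ewma > slow:
		r.onProbation = true // remove capacity under load to improve latency
	case r.onProbation && r.ewma < healthy:
		r.onProbation = false // latency back down long enough: reinstate
	}
}

func main() {
	r := &replica{name: "leaf-17", ewma: 5 * time.Millisecond}
	// A burst of interference makes the replica slow; then it recovers.
	for _, ms := range []int{5, 80, 90, 100, 8, 6, 5, 5, 5, 5, 5, 5, 5, 5} {
		r.observe(time.Duration(ms)*time.Millisecond,
			20*time.Millisecond, // put on probation above this EWMA
			10*time.Millisecond) // reinstate below this EWMA
		fmt.Printf("sample %4dms  ewma %8v  probation=%v\n",
			ms, r.ewma.Round(time.Millisecond), r.onProbation)
	}
}
```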

Handling Within-Request Variability

• Take action within single high-level request

• Goals:
  – reduce overall latency
  – don't increase resource use too much
  – keep serving systems safe

Data Independent Failures

[Diagram, animated over three slides: a query fans out to replicated servers.]

Canary Requests

[Diagram, animated over three slides: the query is first sent to one or two "canary" servers; only after they respond successfully is it fanned out to the rest.]

Backup Requests

[Diagram, animated: a client has requests 3, 5, 6, and 8 queued across Replica 1, 2, and 3. It sends req 9 to one replica; when no reply arrives quickly, it sends a second copy of req 9 to another replica. The first reply wins, and the client sends "Cancel req 9" to the other replica, which drops its queued copy.]

Backup Requests Effects

• In-memory BigTable lookups
  – data replicated in two in-memory tables
  – issue requests for 1000 keys spread across 100 tablets
  – measure elapsed time until data for last key arrives

  Policy               Avg     Std Dev   95%ile   99%ile   99.9%ile
  No backups           33 ms   1524 ms   24 ms    52 ms    994 ms
  Backup after 10 ms   14 ms   4 ms      20 ms    23 ms    50 ms
  Backup after 50 ms   16 ms   12 ms     57 ms    63 ms    68 ms

• Modest increase in request load:
  – 10 ms delay: <5% extra requests; 50 ms delay: <1%
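A minimal Go sketch of the client-side backup-request pattern measured above: send the read to one replica, fire a second copy after a fixed delay, take whichever reply comes first, and cancel the loser. The fakeReplica simulation and all names are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

type readFn func(context.Context, string) (string, error)

// backupRead sends the read to the primary replica and, if no reply has
// arrived after backupDelay, sends the same read to a second replica.
// The first reply wins; cancelling ctx tells the loser to stop.
// (Sketch assumes at least one replica eventually replies.)
func backupRead(ctx context.Context, key string, primary, backup readFn,
	backupDelay time.Duration) (string, error) {

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever attempt is still in flight

	replies := make(chan string, 2)
	send := func(read readFn) {
		if v, err := read(ctx, key); err == nil {
			replies <- v
		}
	}

	go send(primary)
	select {
	case v := <-replies:
		return v, nil // primary answered quickly: no backup needed
	case <-time.After(backupDelay):
		go send(backup) // primary is slow: fire the backup request
	case <-ctx.Done():
		return "", ctx.Err()
	}

	select {
	case v := <-replies:
		return v, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// fakeReplica simulates a tablet server whose response time varies.
func fakeReplica(name string, worst time.Duration) readFn {
	return func(ctx context.Context, key string) (string, error) {
		select {
		case <-time.After(time.Duration(rand.Int63n(int64(worst)))):
			return name + " -> " + key, nil
		case <-ctx.Done(): // the client cancelled this attempt
			return "", ctx.Err()
		}
	}
}

func main() {
	v, err := backupRead(context.Background(), "row-42",
		fakeReplica("replica-1", 50*time.Millisecond),
		fakeReplica("replica-2", 50*time.Millisecond),
		10*time.Millisecond) // backup after 10 ms, as in the table above
	fmt.Println(v, err)
}
```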

Backup Requests w/ Cross-Server Cancellation

• Each request identifies other server(s) to which the request might be sent

[Diagram, animated: the client sends "req 9, also: server 2" to Server 1 and "req 9, also: server 1" to Server 2, each behind that server's existing queue. When Server 2 reaches req 9 first, it starts the work and tells Server 1 ("Server 2: starting req 9"); Server 1 drops its queued copy, and Server 2 returns the reply.]
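A minimal Go sketch of the server-side bookkeeping for cross-server cancellation: each queued copy of a request knows its peer server, and whichever server starts the request first tells the other to drop its copy. The types and method names are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// Each copy of a queued request names the other server that also holds a
// copy ("req 9, also: server 2"). When a server dequeues the request and
// starts working on it, it tells the peer, which drops its queued copy.
type server struct {
	name   string
	peer   *server
	mu     sync.Mutex
	queued map[int]bool // request ID -> still sitting in the queue
}

func (s *server) enqueue(id int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queued[id] = true
}

// cancel handles the "Server X: starting req N" message from the peer:
// drop our copy if we have not started it yet.
func (s *server) cancel(id int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.queued, id)
}

// start claims the request. It returns false if the peer started first
// and our copy was already cancelled out of the queue. In the rare bad
// case, both servers dequeue at the same instant, the cancellations cross
// on the wire, and both do the work (see the next slide).
func (s *server) start(id int) bool {
	s.mu.Lock()
	ok := s.queued[id]
	delete(s.queued, id)
	s.mu.Unlock()
	if ok {
		s.peer.cancel(id)
	}
	return ok
}

func main() {
	s1 := &server{name: "server 1", queued: map[int]bool{}}
	s2 := &server{name: "server 2", queued: map[int]bool{}}
	s1.peer, s2.peer = s2, s1

	s1.enqueue(9) // req 9, also: server 2
	s2.enqueue(9) // req 9, also: server 1
	fmt.Println("server 1 runs req 9:", s1.start(9)) // true: does the work
	fmt.Println("server 2 runs req 9:", s2.start(9)) // false: copy cancelled
}
```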

Backup Requests: Bad Case

[Diagram, animated: both servers' queues drain at the same moment, so Server 1 and Server 2 each start req 9 simultaneously. Their "starting req 9" cancellations cross on the wire, neither copy is dropped, and the work is done twice, producing two replies.]

Backup Requests w/ Cross-Server Cancellation (measurements)

• Read operations in distributed file system client
  – send request to first replica
  – wait 2 ms, and send to second replica
  – servers cancel request on other replica when starting read

• Time for Bigtable monitoring ops that touch disk

  Cluster state   Policy              50%ile   90%ile   99%ile   99.9%ile
  Mostly idle     No backups          19 ms    38 ms    67 ms    98 ms
                  Backup after 2 ms   16 ms    28 ms    38 ms    51 ms
  +Terasort       No backups          24 ms    56 ms    108 ms   159 ms
                  Backup after 2 ms   19 ms    35 ms    67 ms    108 ms

• Backups cut 99%ile latency by 43% on the mostly idle cluster (67 ms → 38 ms) and by 38% with Terasort running (108 ms → 67 ms)

• Backups cause about 1% extra disk reads

• Backups with the big sort job running give the same read latencies as no backups on an idle cluster!

Backup Request Variants

• Many variants possible:
  – send to third replica after longer delay
    • sending to two gives almost all the benefit, however
  – keep requests in other queues, but reduce priority
  – can handle Reed-Solomon reconstruction similarly

Tainted Partial Results

• Many systems can tolerate inexact results
  – information retrieval systems
    • searching 99.9% of docs in 200 ms is better than 100% in 1000 ms
  – complex web pages with many sub-components
    • e.g. okay to skip spelling correction service if it is slow

• Design to proactively abandon slow subsystems
  – set cutoffs dynamically based on recent measurements
    • can trade off completeness vs. responsiveness
  – important to mark such results as tainted in caches
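A minimal Go sketch of producing tainted partial results: fan out to subsystems, abandon any that miss the deadline, and mark the merged result as tainted so caches can treat it accordingly. The gather helper and the fake backends are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

type result struct {
	docs    []string
	tainted bool // true if some subsystems were abandoned
}

// gather fans out to all backends and proactively abandons any that miss
// the deadline. The merged result is marked tainted so that caches know
// it is incomplete.
func gather(backends []func() []string, deadline time.Duration) result {
	ch := make(chan []string, len(backends))
	for _, b := range backends {
		b := b // capture loop variable for the goroutine
		go func() { ch <- b() }()
	}

	var r result
	timeout := time.After(deadline) // one deadline for the whole request
	for range backends {
		select {
		case docs := <-ch:
			r.docs = append(r.docs, docs...)
		case <-timeout:
			r.tainted = true // abandon the stragglers; return what we have
			return r
		}
	}
	return r
}

func main() {
	fast := func() []string { return []string{"web results"} }
	slow := func() []string { // e.g. a slow spelling correction service
		time.Sleep(time.Second)
		return []string{"spelling suggestion"}
	}
	r := gather([]func() []string{fast, slow}, 200*time.Millisecond)
	fmt.Println(r.docs, "tainted:", r.tainted)
}
```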

Hardware Trends

• Some good:
  – lower latency networks make things like backup request cancellations work better

• Some not so good:
  – plethora of CPU and device sleep modes save power, but add latency variability
  – higher number of "wimpy" cores => higher fanout => more variability

• Software techniques can reduce variability despite increasing variability in underlying hardware

Conclusions

• Tolerating variability
  – important for large-scale online services
  – large fanout magnifies importance
  – makes services more responsive
  – saves significant computing resources

• Collection of techniques
  – general good engineering practices
    • prioritized server queues, careful management of background activities
  – cross-request adaptation
    • load balancing, micro-partitioning
  – within-request adaptation
    • backup requests, backup requests w/ cancellation, tainted results

Thanks

• Joint work with Luiz Barroso and many others at Google

• Questions?