Building Resource Efficient Distributed Systems At Scale
Michael Pellon (@p3ll0n), Operations Engineer
In the ideal world . . . we want to be here.
[Chart: work vs. cost, ideal curve]
But in the “real” world . . . we usually find ourselves here.
[Chart: work vs. cost, real curve]
Big “jumps” are possible in a relatively short timeframe!
[Chart: requests per second vs. joules, ~2009 - 2012 vs. ~2013 - ???]
RPS/dollar: 4.1x
RPS/joule: 4.3x
RPS/rack: 10.4x
Avoid “density without value”!
“Respect the problem.”
- Theo Schlossnagle, OmniTI
There is no free lunch.
Tradeoffs cannot be solved by marketing.
How to play with the “big boys” when you are not as “big” as them ...
Lesson #1
Understand deeply the relationship between latency, bandwidth and capacity
across all levels of your infrastructure.
Fewer disk seeks = higher performance.
More caching = higher performance.
We end up with an ever-increasing amount of our cheap DRAM being used to hide the terrible latency of our cheap storage.
This growing split between the bandwidth and latency of our storage systems only becomes apparent at large scale.
            CPU    DRAM   LAN    Disk
Bandwidth   1.50   1.27   1.39   1.28
Latency     1.17   1.07   1.12   1.09
Annual Bandwidth and Latency Improvements (Patterson, 2004)
* Extracted from leading commodity components over the last 25 years; reported values are the multiplicative performance increase per year.
➔ CPU is the fastest to change and DRAM is the slowest.
➔ Latency is driven by physical limits, whereas bandwidth can be addressed through parallelism.
➔ Bountiful bandwidth with lagging latency!
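The compounding effect of these annual rates is easy to underestimate. A quick sketch, using the rates from the Patterson table above, shows that over a decade bandwidth improves by at least the square of the latency improvement in every component class:

```python
# Annual multiplicative improvement rates (Patterson, 2004).
rates = {
    "CPU":  {"bandwidth": 1.50, "latency": 1.17},
    "DRAM": {"bandwidth": 1.27, "latency": 1.07},
    "LAN":  {"bandwidth": 1.39, "latency": 1.12},
    "Disk": {"bandwidth": 1.28, "latency": 1.09},
}

for name, r in rates.items():
    bw_decade = r["bandwidth"] ** 10   # bandwidth gain over 10 years
    lat_decade = r["latency"] ** 10    # latency gain over 10 years
    print(f"{name:4s}: bandwidth {bw_decade:6.1f}x, "
          f"latency {lat_decade:4.1f}x per decade")
```

For example, CPU bandwidth grows ~58x per decade while CPU latency improves only ~4.8x; the gap is the whole story of the next several slides.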
            CPU    DRAM   LAN    Disk
Bandwidth   1.50   1.27   1.39   1.28
Capacity    --     1.52   --     1.48
Annual Bandwidth and Capacity Improvements (Patterson, 2004)
* Extracted from leading commodity components over the last 25 years; reported values are the multiplicative performance increase per year.
➔ Widening gap between bandwidth and capacity.
➔ Time to read a complete disk with random IO is increasing 22x per decade, or 36% per year.
➔ Now our applications cannot afford to have a cache miss!
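The 22x-per-decade figure follows directly from the 36%-per-year rate by compounding:

```python
# 36% annual growth in full-disk random-read time, compounded over a decade.
growth_per_year = 1.36
growth_per_decade = growth_per_year ** 10
print(f"{growth_per_decade:.1f}x per decade")  # ~21.6x, i.e. roughly 22x
```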
Solutions
Caching, prediction and replication.
Tape is dead. Disk is tape. Flash is disk.
RAM locality is king.
- Jim Gray, Microsoft (2006)
Requires very careful attention to durability.
Expend bandwidth to reduce apparent latency.
Expend capacity to reduce apparent latency.
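As a concrete illustration of the caching solution, here is a minimal read-through cache sketch in Python. The `slow_fetch` backend is a stand-in for any high-latency store (disk, remote database); the names are illustrative, not from the talk:

```python
import time

def slow_fetch(key):
    """Stand-in for a high-latency backing store (disk, remote DB, ...)."""
    time.sleep(0.01)  # simulate ~10 ms of seek/network latency
    return f"value-for-{key}"

cache = {}

def cached_fetch(key):
    # Expend DRAM capacity to hide backing-store latency.
    if key not in cache:
        cache[key] = slow_fetch(key)  # miss: pay full latency once
    return cache[key]                 # hit: DRAM-speed access

cached_fetch("a")            # cold read: ~10 ms
print(cached_fetch("a"))     # warm read: microseconds
```

This is exactly the move the previous slides warn about: the cache is cheap to add, but as the bandwidth/capacity gap widens, a miss becomes something the application cannot afford.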
Avoid the problem entirely by using more servers with cheaper, lower-powered processors that more closely match the capabilities of the memory subsystem.
➔ Leverages the massive volume economics of the smart-device (e.g., cell phones and tablets) market.
➔ Most workloads are not pushing CPU limits but are IO (disk, network or memory) bound, so spending more on a faster CPU will not deliver results.
➔ Price/performance in the device market is far better than for current-generation server CPUs; with far less competition in server processors, prices tend to be higher and price/performance relatively low.
➔ Server CPU = ~$300 - ~$1000
➔ ARM CPU = ~$15 / Intel Atom S1200 = ~$65 ➔ ~25% the processing rate @ ~10% the cost!
➔ Volume of the device ecosystem fuels innovation, so the performance gap shrinks each generation!
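The economics of the low-power-core argument can be sanity-checked with the rough figures above (treat these as order-of-magnitude estimates, not benchmarks):

```python
# Rough figures from the slide: a low-power part delivering ~25% of the
# processing rate of a midrange server CPU at ~10% of the price.
server_price, server_perf = 650.0, 1.00   # midpoint of the ~$300-$1000 range
wimpy_price,  wimpy_perf  = 65.0, 0.25

server_perf_per_dollar = server_perf / server_price
wimpy_perf_per_dollar  = wimpy_perf / wimpy_price

advantage = wimpy_perf_per_dollar / server_perf_per_dollar
print(f"perf/$ advantage: {advantage:.1f}x")  # 2.5x, for IO-bound fleets
```

A 2.5x perf-per-dollar edge only materializes if the workload really is IO-bound, which is why the bullet about CPU limits comes first.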
➔ These machines also help with one of the biggest, and certainly the fastest growing, costs of any data center -- power!
➔ Your typical 8-core server uses ~200W idle and above 600W TDP (full tilt boogie)!
➔ Bringing 30A @ 208V to each rack gives a 6.2 kW rack (and I know of folks provisioning 12 - 14 kW racks just to fill them up 50%!)
➔ If you can save a lot on op-ex by spending a little more on cap-ex, it’s a great bargain! (ask your CFO!)
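The rack-power figure is simple arithmetic: power (W) = current (A) × voltage (V). Combining it with the ~200 W idle figure above gives a rough rack-density ceiling:

```python
amps, volts = 30, 208
rack_kw = amps * volts / 1000.0
print(f"{rack_kw:.2f} kW")  # 6.24 kW -- the ~6.2 kW rack above

# At ~200 W idle per 8-core server, that power budget caps a rack at roughly:
servers_at_idle = rack_kw * 1000 // 200
print(int(servers_at_idle), "servers at idle")  # 31
```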
➔ People costs dominate the enterprise players’ data centers, but it is very easy and cheap to keep them from dominating your costs.
➔ The barrier to entry for automation tools (Puppet, Chef, etc.) has never been lower, and their penetration into existing systems (networking devices, etc.) has never been higher.
Lesson #2
Understand that distributed systems are fundamentally about dealing with
distance and having more than one thing.
Today, writing distributed applications is nothing like writing non-distributed applications.
Despite the non-zero probability of failure within nearly every aspect of modern computers, developers of non-distributed applications do not routinely maintain a concept of failing hardware.
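A minimal sketch of what "maintaining a concept of failing hardware" means in code: the retry-with-backoff discipline that every remote call needs and a local function call never does. The names here are illustrative:

```python
import random
import time

class RemoteError(Exception):
    pass

def flaky_remote_call():
    """Stand-in for an RPC that fails with some non-zero probability."""
    if random.random() < 0.3:
        raise RemoteError("peer unreachable")
    return "ok"

def call_with_retries(fn, attempts=5, base_delay=0.01):
    # A local call either returns or raises; a remote call can also time
    # out, partially complete, or reach a dead peer -- so the caller must
    # own an explicit retry/backoff policy.
    for i in range(attempts):
        try:
            return fn()
        except RemoteError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)  # exponential backoff

print(call_with_retries(flaky_remote_call))
```

This boilerplate appearing in distributed code and nowhere else is precisely the qualitative gap the next slides argue should be made merely quantitative.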
[Diagram: complexity — a programming language maps instructions to behaviors, within hardware limitations]
The difference between an entire data center and a single computer should only be quantitative, not qualitative.
Since software development is an entirely quantitative pursuit, we should be able to conceal the entire complexity of the Internet within software.
A clear trajectory in the same direction …
➔ Erlang OTP (Ericsson) and GoCircuit (Tumblr).
➔ General-purpose distributed file systems (and protocols) spanning multiple globally distributed data centers.
➔ Datacenter-scale job schedulers also abound (Google’s Borg/Omega, Apache Mesos, Airbnb’s Chronos, etc.)
➔ nanomsg scalability protocols (M. Sustrik).
➔ Not only possible but the clear “silent” choice of the majority!
So how to play “big” when you’re “small”?
➔ You need to understand your technical substrate both broadly and deeply so you know where to focus all your resources most effectively.
➔ That understanding will allow you to operate at economies of scale that free up your most important resource -- people.
➔ But remember: where we focus our resources is not necessarily where your resources should be focused, nor is anyone else’s.
So how to play “big” when you’re “small”?
➔ Look for areas where a qualitative difference could easily become merely a quantitative difference.
➔ Quantitative problems are easy to solve through technology; qualitative problems, however, are largely intractable through technology alone.