[@IndeedEng] Redundant Array of Inexpensive Datacenters

Redundant Array of Inexpensive Datacenters

Charles Valentine and Chris Graf

June 2013

Overview

Charles Valentine

VP, Technology Services

I help people get jobs.

Indeed

● 100 million unique visitors per month
● Over 50 countries and 26 languages
● 3 billion job searches per month

Indeed Ops

● Assist development in designing new products
● Engineer scalable systems to support applications
● Monitor applications
● Fix systems when they break

Indeed Lingo

Datacenter = Point of Presence

Each Presence is Full Stack

● Applications
● Services
● Read/Write Data systems
● Communications
● Monitoring

We need serious processing power in each datacenter!

Applications per Datacenter

● Over 40 Java-based web applications
● Over 90 Java-based services

Data Systems

● MySQL databases
● Mongo databases
● Memcached instances
● LSM Trees
● Search indexes
● Numerous other data stores

Goals

● Fast
● Reliable
● Inexpensive

Triple Constraint

[Diagram: Fast, Reliable, Inexpensive]

Traditional Method

[Diagram: Fast, Reliable, Inexpensive]

Indeed Method

[Diagram: Fast, Reliable, Inexpensive]

Fast

Speed is a product feature

● Server Time
● Client Time

Monthly Job Searches

1 ms, 3 Billion Times/Month

1 ms = 34 job seeker days per month


20 ms = 22 jobseeker months


100 ms = 9.5 jobseeker years
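The arithmetic behind these figures can be checked with a short sketch. It takes the 3 billion searches per month from the slide above and assumes a 365-day year split into 12 equal months; the function name is illustrative only.

```python
SEARCHES_PER_MONTH = 3_000_000_000  # from the slide above
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365                 # assumption: non-leap year
DAYS_PER_MONTH = DAYS_PER_YEAR / 12 # assumption: average month length


def total_delay_seconds(delay_ms: int, searches: int = SEARCHES_PER_MONTH) -> float:
    """Total waiting time added per month if every search takes delay_ms longer."""
    return delay_ms * searches / 1000


for delay_ms in (1, 20, 100):
    days = total_delay_seconds(delay_ms) / SECONDS_PER_DAY
    print(
        f"{delay_ms} ms per search: {days:,.1f} days, "
        f"{days / DAYS_PER_MONTH:,.1f} months, "
        f"{days / DAYS_PER_YEAR:,.2f} years of jobseeker time per month"
    )
```

The slide figures (34 days, 22 months, 9.5 years) appear to be these values rounded down.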

Reliable

Reliability is a product feature

Impact of Downtime

8,000 disappointed job seekers every minute

People get hired on Indeed

7 seconds

Availability

● Jobseekers can find jobs
● Less focus on mitigating failure
● More focus on recovering quickly

Availability is Good for Job Seekers

9's

Good

99.9% availability => down for 525 minutes per year

At peak 4,500 jobseekers don't get a job

Better

99.99% availability => down for 52 minutes per year

At peak 450 jobseekers don't get a job

Almost Best

99.999% availability => down for 5 minutes per year

At peak 45 jobseekers don't get a job

Indeed is Always there for Job Seekers

Availability > 99.999%

Less than 5 minutes downtime per year
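The downtime figures for each level of 9's follow directly from the number of minutes in a year. The sketch below reproduces them. It also assumes that the "7 seconds" slide means one hire every 7 seconds at peak, which is an interpretation rather than something the slides state; under that assumption the missed-hire counts come out close to the 4,500 / 450 / 45 shown above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year
SECONDS_PER_HIRE = 7              # assumption: one hire every 7 seconds at peak


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year allowed at a given availability percentage."""
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100


def missed_hires(downtime_minutes: float) -> float:
    """Hires that would have happened during the downtime, at the assumed rate."""
    return downtime_minutes * 60 / SECONDS_PER_HIRE


for pct in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}%: {minutes:.1f} min/year, about {missed_hires(minutes):.0f} missed hires")
```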

How It Works

Chris Graf

Operations Manager

Maximize Availability

Beyond 99.999%

No downtime, scheduled or otherwise

Maximize Performance

Optimize page load times to the millisecond

Minimize Cost

Minimize cost while meeting performance and availability goals

Hosting Models

● Traditional Colocation
● The Cloud
● Managed Hosting

Traditional Colocation

● You buy the servers, network gear, cables...
● You send people to set it up
● You send people to fix stuff when it breaks
● You manage your own pipes (maybe)

Traditional Colocation Expansion

1. Acquire rack space
2. Buy the hardware
3. Wait for manufacturing
4. Wait for delivery
5. Send people to the datacenter to set it all up

Expansion can take weeks

Traditional Colocation

Good if you have
● Fairly static environment
● Really beefy hardware
● Some centralized functionality
● Time to wait
● Lots of cap-ex budget
● Willingness to sign long-term deals
● People to do stuff

The Cloud

● You rent access to computing power
● You pay to reserve it if you aren't using it
● Usually abstracted from hardware layer

Expanding Cloud-based systems

1. Order new instances
2. Wait a few minutes
3. Provision them

Expansion takes minutes.

The Cloud is good!

If you have significant, unpredictable changes in load

The Cloud is bad!

Costs more if you need all your instances available all of the time

Managed Hosting

● Rent hardware from provider
● Provider buys and hosts servers, network, etc.
● Provider deals with hardware issues

Expanding Managed Hosting

1. Order new servers
2. Wait a few hours
3. Provision

Expansion takes hours (depending on provider)

Indeed Uses Managed Hosting

Least expensive overall

Access to real bare metal hardware

Agile enough

Steps for beyond 99.999% uptime

1. Find a provider
2. Sign contract for 100% uptime with 100% revenue protection
3. Profit

Right?

Providers "guarantee" availability

"Service Level Agreement" (SLA) guarantees some percentage of uptime

SLA: brief outages aren't outages

Less than 30 minutes downtime not counted against "100% SLA"

One 5-minute outage per month < 99.99%

Two 25-minute outages per month < 99.9%

The provider can call that 100% available

SLA: maintenance is not downtime

Scheduled maintenance not counted against SLA

1 hour maintenance each month < 99.9%

The provider can call that 100% available
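The gap between what an SLA reports and what jobseekers experience can be worked through in a few lines. This is a sketch assuming a 30-day month and the two exclusions named on the slides above (outages shorter than 30 minutes, and scheduled maintenance); the function names and the exact SLA rule are illustrative, not any provider's actual contract terms.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def actual_availability(outages, maintenance=0.0):
    """Availability (%) as jobseekers experience it: all downtime counts."""
    down = sum(outages) + maintenance
    return 100 * (1 - down / MINUTES_PER_MONTH)


def sla_availability(outages, maintenance=0.0, min_counted=30.0):
    """Availability (%) as the SLA reports it.

    Assumed rule: outages shorter than min_counted minutes and all
    scheduled maintenance are excluded from the calculation.
    """
    counted = sum(m for m in outages if m >= min_counted)
    return 100 * (1 - counted / MINUTES_PER_MONTH)


scenarios = {
    "one 5-minute outage": ([5], 0),
    "two 25-minute outages": ([25, 25], 0),
    "1 hour of maintenance": ([], 60),
}
for name, (outages, maintenance) in scenarios.items():
    print(
        f"{name}: actual {actual_availability(outages, maintenance):.3f}%, "
        f"SLA {sla_availability(outages, maintenance):.3f}%"
    )
```

In all three scenarios the SLA figure stays at 100% while the actual figure drops below 99.99% or 99.9%.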

SLA credits don't cover your business

You get a refund for the services, not for lost business and lost customer confidence

Providers lose your hosting fees

You lose your revenue

100% is not really 100%

Hosting is complicated

A single datacenter is rarely 100% available

Bug in provider hardware caused total loss of Internet access under certain load

Core network problem

Power outage

1. Utility power was disrupted
2. Backup generator and UPS couldn't handle load
3. Core network went offline
4. Servers lost power
5. Upon power restoration, router did not recover

Power Outage Aftermath

● Event duration = 54 minutes
● Recovery duration = 12 hours
● 5% monthly credit for affected hardware

Backhoe Induced Fiber Failure (BIFF)

Wet servers

Tornado peeled back the roof of an AT&T datacenter in 2004.

Other Disasters

● Hurricanes
● Floods
● Earthquakes
● Fires
● Etc.

Need better uptime than providers

Can only get ~99.7% after asterisks

We have to build something better

Save a document to a hard disk

[Diagram: Doc, Hard Disk]

Saved

[Diagram: Doc, Hard Disk]

Disk failure

[Diagram: Hard Disk A]

Disaster Recovery

Restore from an external USB drive?

Redundant Storage

Simple case - RAID 1

[Diagram: Hard Disk A, Hard Disk B]

RAID - Save it twice

[Diagram: Doc, Hard Disk A, Hard Disk B]

RAID - Two copies of everything

[Diagram: Hard Disk A, Hard Disk B, Doc, Doc]

RAID

[Diagram: Hard Disk A, Hard Disk B, Doc, Doc]

RAID == Redundant Array of Inexpensive Datacenters

[Diagram: Jobseekers, Datacenter A, Datacenter B]

RAID makes datacenters more reliable

[Diagram: Jobseekers, Datacenter A, Datacenter B]

Building a more reliable system

Using inexpensive, less reliable components

99.7% in, 99.999% out

Now our system can get better availability as a whole than any single provider can give us.
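A rough way to see how 99.7% in can become 99.999% out: if the site is up whenever at least one datacenter is up, the whole system is down only when every datacenter is down at once. The sketch below assumes failures are independent, that failover is instant, and that any single datacenter can carry all traffic. Real failures can be correlated, so treat this as an upper bound rather than a prediction.

```python
def combined_availability(per_datacenter: float, count: int) -> float:
    """Probability that at least one of `count` datacenters is up.

    Assumes each datacenter is available with probability `per_datacenter`
    and that failures are independent of one another.
    """
    return 1 - (1 - per_datacenter) ** count


for count in (1, 2, 3):
    print(f"{count} datacenter(s) at 99.7%: {combined_availability(0.997, count):.7%}")
```

Under these assumptions two 99.7% datacenters already exceed 99.999% combined.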

Expect your datacenter to fail

Failure is inevitable

Design for it

Simpler datacenters with RAID

Only need one of everything inside each datacenter:
● Firewalls
● Load balancers
● Servers provisioned primarily for capacity, not redundancy

Primary and secondary datacenters

[Diagram: datacenters 1 and 2]

Datacenter level redundancy

Protects against a single datacenter failure


But there are problems that can affect more than one datacenter on the same provider

Denial of service attacks

A distributed denial-of-service attack against another customer with servers in the same facilities took multiple facilities offline

Network configuration errors

Provider pushed a bad global route which took their entire global network offline

The biggest threat

Humans

Protect against global provider failure

Use multiple providers to get provider-level redundancy

Provider-level redundancy

[Diagram: datacenters 1 and 2, with two sites marked X]

Recovering from Failure

● Offline
● Active/Passive
● Active/Active

Offline

● One active datacenter handles all traffic
● Backup systems are offline and incomplete
● Restore backups to new systems
● Downtime during switchover is ~days

Active / Passive (Dark)

● One active datacenter handles all traffic
● A second datacenter has provisioned systems and all data
● Switch from primary to secondary
● Downtime during switchover is minutes to hours

Active / Active

● Every datacenter handles traffic
● Data and systems are replicated
● Failover activated automatically
● Downtime during switchover measured in seconds
● Scales beyond two facilities

Jobseeker Impact

Offline: extended downtime for all jobseekers

Active/Passive: some downtime for all jobseekers

Active/Active: brief downtime for some jobseekers

Which jobseekers go to which datacenter?

Offline: go to single datacenter

Active/Passive: go to single datacenter

Active/Active: go to many datacenters?

Send jobseekers to the best datacenter

Use a dynamic DNS service to send jobseekers to the best healthy datacenter

Anycast DNS

Resolving same hostname to different IP addresses

● Client A: nslookup www.indeed.com
  Server: dns.client-a.com
  Address: 1.1.1.1

● Client B: nslookup www.indeed.com
  Server: dns.client-b.com
  Address: 2.2.2.2

DNS Lookup

[Diagram: Jobseeker A asks Jobseeker DNS Server 5.5.5.5, which asks the Indeed DNS Service for www.indeed.com; the answer is 1.1.1.1]

Vary response from primary DNS

[Diagram: Jobseeker A, via Jobseeker DNS Server 5.5.5.5, gets 1.1.1.1 for www.indeed.com from the Indeed DNS Service; Jobseeker B, via Jobseeker DNS Server 8.8.8.8, gets 2.2.2.2]

Similar jobseekers get similar responses

[Diagram: Jobseeker A, via Jobseeker DNS Server 5.5.5.5, gets 1.1.1.1 for www.indeed.com; Jobseekers B and C, both via Jobseeker DNS Server 8.8.8.8, get 2.2.2.2]

Remap jobseekers via DNS changes

[Diagram: Jobseeker A, via Jobseeker DNS Server 5.5.5.5, gets 1.1.1.1 for www.indeed.com before the reconfig and 2.2.2.2 after it]

Outsource your DNS service

Doing this well is an investment

Outsource your DNS service

● Robust
● Flexible
● Inexpensive

Our core competency is jobs

Their core competency is DNS

Global DNS Service

Degradation and Failure

Manually switch datacenter on service degradation

Automatically switch datacenter on failure
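The two switching modes can be pictured as two flags per datacenter: one set by an automatic health check, one set by an operator. The sketch below is a hypothetical illustration of that decision only. The class, field, and function names are made up here, and the managed DNS service's real configuration model is not described in this deck.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Datacenter:
    name: str
    address: str
    healthy: bool = True  # updated automatically by a health check
    enabled: bool = True  # cleared manually by an operator on degradation


def pick_address(preference: List[Datacenter]) -> Optional[str]:
    """Return the address of the first datacenter that is healthy and enabled.

    `preference` is ordered best-first for a given group of jobseekers.
    Returns None when no datacenter can take traffic.
    """
    for dc in preference:
        if dc.healthy and dc.enabled:
            return dc.address
    return None


a = Datacenter("A", "1.1.1.1")
b = Datacenter("B", "2.2.2.2")
print(pick_address([a, b]))  # both up: the preferred datacenter answers
a.healthy = False            # automatic switch: health check failed
print(pick_address([a, b]))
```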

DNS propagation delays

1. Healthcheck cycle - up to 30 seconds
2. Healthcheck server to nearest PoP
3. Jobseeker's DNS server cache refresh
4. Jobseeker's local DNS cache refresh

DNS Time-to-live (TTL)

TTL tells local name servers and clients how long to wait before looking up a domain name again

TTL limits load, but also slows change propagation

Some clients and servers ignore TTL

We specify a 30 second TTL, but local DNS servers and clients can ignore it
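To make the TTL trade-off concrete, here is a minimal sketch of a well-behaved caching resolver. It reuses an answer until the TTL expires, which limits load on the authoritative service but also means a changed answer is not seen until the cached one ages out. The class name and the `lookup` callable are hypothetical; a cache that ignores TTL would simply hold the old answer for longer.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Caches DNS answers and honors the TTL returned with each answer."""

    def __init__(
        self,
        lookup: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # returns (address, ttl_seconds)
        self._clock = clock    # injectable so the behavior can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: no upstream query
        address, ttl = self._lookup(hostname)
        self._entries[hostname] = (address, now + ttl)
        return address
```

With a 30 second TTL, a client that resolved the name just before a datacenter switch keeps using the old address for up to 30 seconds.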

Impact of propagation delay

90 second traffic hole

[Chart labels: 30 minute tail; Well-behaved clients; Ignoring our TTL]

Big Picture

[Chart labels: 90 second hole; Failing datacenter; Total traffic]

Accepting DNS limitations

Complete datacenter failure is extremely rare

Predictable limitation

Massive costs to reduce propagation delay

Remapping Manually

The same system allows us to reroute traffic whenever we want
● Datacenter maintenance
● Non-critical performance problems
● Non-critical feature loss
● Other degradation of jobseeker experience

Datacenter Redirection

datacenter disabled

traffic moves to others

Anycast DNS for performance

This capability is also used to improve performance

Closer to the jobseekers

The DNS service can give the IP address of the datacenter closest to the jobseeker.

Network hops

Based on network hops between the jobseeker's DNS server and our DNS service PoP

Network paths

Estimates how many networks traffic must pass through to reach our servers

Count hops

Picks estimated shortest path
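As a toy illustration of "count hops, pick the shortest path": given an estimated hop count from a jobseeker's DNS server to each datacenter, choose the smallest. The datacenter names and hop counts below are made up, and how the DNS service actually estimates network distance is not covered here.

```python
from typing import Dict


def closest_datacenter(hops_by_datacenter: Dict[str, int]) -> str:
    """Return the datacenter with the fewest estimated network hops.

    Ties go to whichever datacenter appears first in the mapping.
    """
    if not hops_by_datacenter:
        raise ValueError("no datacenters to choose from")
    return min(hops_by_datacenter, key=hops_by_datacenter.__getitem__)


# Hypothetical estimates for one jobseeker DNS server.
estimates = {"east": 7, "central": 4, "west": 9}
print(closest_datacenter(estimates))
```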

Optimize for network distance

We can push our data center presences closer to the jobseekers to reduce network latency

Datacenters for redundancy only

Fast for some jobseekers

Datacenters close to the jobseekers

Fast for most jobseekers

Sent to the East Coast

Sent to Central US

Sent to the West Coast

No downtime for datacenter replacement

Incrementally send traffic to new datacenters

Incrementally reduce traffic to old data centers
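One way to picture an incremental move is as a schedule of traffic shares that shifts from the old datacenter to the new one in steps. The sketch below assumes a simple linear ramp; the step count and the linear shape are illustrative choices, not the pace Indeed used.

```python
from typing import List, Tuple


def ramp_schedule(steps: int) -> List[Tuple[int, int]]:
    """Return (old_share, new_share) percentages for each stage of a linear ramp.

    Stage 0 sends everything to the old datacenter; the last stage sends
    everything to the new one. Shares always add up to 100.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    schedule = []
    for i in range(steps + 1):
        new_share = round(100 * i / steps)
        schedule.append((100 - new_share, new_share))
    return schedule


for old_share, new_share in ramp_schedule(4):
    print(f"old datacenter {old_share}%, new datacenter {new_share}%")
```

Because each stage is just a DNS weighting change, any stage can be held or reversed if the new datacenter misbehaves.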

Move West Coast hosting?

?

Move West Coast hosting!

-20 ms

Move European hosting?

?

Don't move European hosting!

+50 ms!

Search Engine Performance

Source: GrabPerf.org

[Chart: Page Load Time; axis labels 1,000 ms and 9,000 ms]

Summary and Results

Charles Valentine

Traditional Scaling Model

● Higher-capacity network equipment
● Redundant firewalls
● Redundant load balancers
● Bigger Internet connections
● Redundant Internet connections

This is "vertical scaling."

Horizontal Scaling with RAID

Add capacity by adding datacenters

Add redundancy by adding datacenters

Rent "good" datacenters, not "best"

You can RAID too!

Avoid using proprietary features

● Load balancer
● Security devices
● Virtualization
● Servers

Be Hardware Agnostic

More potential providers

Use free software

No licensing costs or recurring maintenance fees

Agile Providers

● New hardware racked and ready in a few hours

● No need to over provision

Automate configuration

● Cobbler
● Puppet

Rent instead of buying

● Obsolete hardware is not your problem
● No depreciation
● No hardware maintenance
● No need to hire people to maintain the hardware

Architect Applications for RAID

Work with your development teams

Traditional Hardware Scaling

● Old hardware supports baseline traffic
● New hardware supports growth

Indeed Hardware Scaling

Old hardware gets replaced by new, on demand

Moore's Law

Hardware is always getting better
● Faster processors
● More memory per chassis
● Larger, faster disks

Higher capacity, lower cost

● Number of machines drives cost
● Power of machines drives cost
● More machines => more problems
● Compute power grows faster than compute cost

Replace hardware every 18 months

Managed hosting + Moore's Law + RAID = new and powerful hardware every 18 months

Amazon EC2?

● Amazon is a single provider
● Costs more to run 24x7
  ○ 2x without bandwidth cost
● Can't be as close to the jobseeker

What RAID gets you

● Servers closer to your customers
● Disposable datacenters
  ○ Datacenter-level failover
  ○ Get modern hardware every 18 months
● Many hosting options

Spend Time On...

● Automation
● Managed DNS
● Investigating Providers
● Monitoring

Spend Less On

● Proprietary hardware
● Network Infrastructure
● Support Contracts
● Software Licenses
● Headcount

Monthly Server Count vs Job Search

Inexpensive

● Cost as a percentage of revenue
● Cost of delivery per job search

Revenue vs Infrastructure Cost

Revenue/Search vs. Cost/Search

Fast
● 100 ms average client time

Reliable
● > 99.999% availability in 2012

Cost Effective
● Cost of delivery < 0.5% of revenue

RAIDing FTW

Q&A