indeedeng
DESCRIPTION
Video available: http://youtu.be/hOsA5UpPUSU Learn how Indeed built one of the fastest and most reliable websites in the world. Indeed Operations ensures indeed.com is always available and always fast for the jobseeker. Operations leaders Charles Valentine and Chris Graf will share how we configure and provision multiple datacenters around the world to provide a massively scalable platform for connecting job seekers with jobs. Charles and Chris will detail a simple and inexpensive method to build a platform that provides DNS-based global load balancing and failover, provider portability, and disposable datacenters. Speakers: Charles Valentine (VP of Technology Services at Indeed) leads the Operations, IT, and Security teams. Prior to joining Indeed in 2011, Charles served as VP Technology Services at The Knot. Chris Graf has managed operations at Indeed since 2011. In that time, Indeed's traffic has grown by more than 300%. Prior to Indeed, Chris managed Web operations in the online gaming industry.
Redundant Array of Inexpensive Datacenters
Charles Valentine and Chris Graf
June 2013
Overview
Charles Valentine
VP, Technology Services
I help people get jobs.
Indeed
● 100 million unique visitors per month
● Over 50 countries and 26 languages
● 3 billion job searches per month
Indeed Ops
● Assist development in designing new products
● Engineer scalable systems to support applications
● Monitor applications
● Fix systems when they break
Indeed Lingo
Datacenter = Point of Presence
Each Presence is Full Stack
● Applications
● Services
● Read/Write Data systems
● Communications
● Monitoring
We need serious processing power in each datacenter!
Applications per Datacenter
● Over 40 Java-based web applications
● Over 90 Java-based services
Data Systems
● MySQL databases
● Mongo databases
● Memcached instances
● LSM Trees
● Search indexes
● Numerous other data stores
Goals
● Fast
● Reliable
● Inexpensive
Triple Constraint
Fast
Reliable
Inexpensive
Traditional Method
Fast
Reliable
Inexpensive
Indeed Method
Fast
Reliable
Inexpensive
Fast
Speed is a product feature
● Server Time
● Client Time
Monthly Job Searches
1 ms, 3 Billion Times/Month
1 ms = 34 job seeker days per month
20 ms, 3 Billion Times/Month
20 ms = 22 jobseeker months
100 ms, 3 Billion Times/Month
100 ms = 9.5 jobseeker years
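The aggregate-latency figures above are plain arithmetic: per-search delay times 3 billion searches per month. A minimal Python sketch of that calculation follows; the function names are illustrative and not part of the talk.

```python
SEARCHES_PER_MONTH = 3_000_000_000  # figure quoted in the slides
SECONDS_PER_DAY = 86_400


def aggregate_wait_seconds(latency_ms: float, searches: int = SEARCHES_PER_MONTH) -> float:
    """Total time all job seekers spend waiting each month, in seconds."""
    return latency_ms / 1000 * searches


def as_days(seconds: float) -> float:
    """Convert seconds to days."""
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for ms in (1, 20, 100):
        days = as_days(aggregate_wait_seconds(ms))
        print(f"{ms:>3} ms per search -> {days:,.0f} days ({days / 365:.1f} years) per month")
```

Under these numbers, 1 ms works out to roughly 34 days of combined waiting per month and 100 ms to roughly 9.5 years, matching the slides.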
Reliable
Reliability is a product feature
Impact of Downtime
8,000 disappointed job seekers every minute
People get hired on Indeed
7 seconds
Availability
● Jobseekers can find jobs
● Less focus on mitigating failure
● More focus on recovering quickly
Availability is Good for Job Seekers
9's
Good
99.9% availability => down for 525 minutes per year
At peak 4,500 jobseekers don't get a job
Better
99.99% availability => down for 52 minutes per year
At peak 450 jobseekers don't get a job
Almost Best
99.999% availability => down for 5 minutes per year
At peak 45 jobseekers don't get a job
Indeed is Always there for Job Seekers
Availability > 99.999%
Less than 5 minutes downtime per year
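The "nines" figures can be reproduced with a few lines of Python. This sketch assumes downtime is measured over a 365-day year and reads the "7 seconds" slide as one hire every 7 seconds at peak, which is how the 4,500 / 450 / 45 numbers appear to be derived; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600
SECONDS_PER_HIRE = 7              # assumption: one hire every 7 seconds at peak


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year allowed by a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def missed_hires_per_year(availability_pct: float) -> float:
    """Hires that would not happen if all of that downtime fell at peak."""
    return downtime_minutes_per_year(availability_pct) * 60 / SECONDS_PER_HIRE


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(
            f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year, "
            f"about {missed_hires_per_year(pct):.0f} missed hires"
        )
```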
How It Works
Chris Graf
Operations Manager
Maximize Availability
Beyond 99.999%
No downtime, scheduled or otherwise
Maximize Performance
Optimize page load times to the millisecond
Minimize Cost
Minimize cost while meeting performance and availability goals
Hosting Models
● Traditional Colocation
● The Cloud
● Managed Hosting
Traditional Colocation
● You buy the servers, network gear, cables...
● You send people to set it up
● You send people to fix stuff when it breaks
● You manage your own pipes (maybe)
Traditional Colocation Expansion
1. Acquire rack space
2. Buy the hardware
3. Wait for manufacturing
4. Wait for delivery
5. Send people to the datacenter to set it all up
Expansion can take weeks
Traditional Colocation
Good if you have
● Fairly static environment
● Really beefy hardware
● Some centralized functionality
● Time to wait
● Lots of cap-ex budget
● A willingness to sign long-term deals
● People to do stuff
The Cloud
● You rent access to computing power
● You pay to reserve it even when you aren't using it
● Usually abstracted from hardware layer
Expanding Cloud-based systems
1. Order new instances
2. Wait a few minutes
3. Provision them
Expansion takes minutes.
The Cloud is good!
If you have significant, unpredictable changes in load
The Cloud is bad!
Costs more if you need all your instances available all of the time
Managed Hosting
● Rent hardware from provider
● Provider buys and hosts servers, network, etc.
● Provider deals with hardware issues
Expanding Managed Hosting
1. Order new servers
2. Wait a few hours
3. Provision
Expansion takes hours (depending on provider)
Indeed Uses Managed Hosting
Least expensive overall
Access to real bare metal hardware
Agile enough
Steps for beyond 99.999% uptime
1. Find a provider
2. Sign contract for 100% uptime with 100% revenue protection
3. Profit
Right?
Providers "guarantee" availability
"Service Level Agreement" (SLA) guarantees some percentage of uptime
SLA: brief outages aren't outages
Less than 30 minutes downtime not counted against "100% SLA"
One 5-minute outage per month < 99.99%
Two 25-minute outages per month < 99.9%
The provider can call that 100% available
SLA: maintenance is not downtime
Scheduled maintenance not counted against SLA
1 hour maintenance each month < 99.9%
The provider can call that 100% available
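The SLA examples above can be checked by converting uncounted downtime per month into a yearly availability figure. A small sketch, assuming twelve identical months and a 365-day year; the scenario labels simply restate the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def real_availability_pct(uncounted_minutes_per_month: float) -> float:
    """Availability actually experienced when this much downtime is excluded from the SLA."""
    yearly_downtime = uncounted_minutes_per_month * 12
    return 100 * (1 - yearly_downtime / MINUTES_PER_YEAR)


SCENARIOS = {
    "one 5-minute outage per month": 5,
    "two 25-minute outages per month": 50,
    "1 hour of maintenance per month": 60,
}

if __name__ == "__main__":
    for label, minutes in SCENARIOS.items():
        print(f"{label}: {real_availability_pct(minutes):.3f}% (reported as 100%)")
```

In each case the provider's report can still read 100%, while the availability seen by users falls below the threshold named on the slide.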
SLA credits don't cover your business
You get a refund for the services, not for lost business and lost customer confidence
Providers lose your hosting fees
You lose your revenue
100% is not really 100%
Hosting is complicated
A single datacenter is rarely 100% available
Bug in provider hardware caused total loss of Internet access under certain load
Core network problem
Power outage
1. Utility power was disrupted
2. Backup generator and UPS couldn't handle load
3. Core network went offline
4. Servers lost power
5. Upon power restoration, router did not recover
Power Outage Aftermath
● Event duration = 54 minutes
● Recovery duration = 12 hours
● 5% monthly credit for affected hardware
Backhoe Induced Fiber Failure (BIFF)
Wet servers
Tornado peeled back the roof of an AT&T datacenter in 2004.
Other Disasters
● Hurricanes
● Floods
● Earthquakes
● Fires
● Etc.
Need better uptime than providers
Can only get ~99.7% after asterisks
We have to build something better
Save a document to a hard disk
Hard Disk
Doc
Saved
Hard Disk
Doc
Disk failure
Hard Disk A
Disaster Recovery
Restore from an external USB drive?
Redundant Storage
Simple case - RAID 1
Hard Disk A
Hard Disk B
RAID - Save it twice
Hard Disk A
Hard Disk B
Doc
RAID - Two copies of everything
Hard Disk A
Hard Disk B
Doc Doc
RAID
Hard Disk A
Hard Disk B
Doc Doc
RAID == Redundant Array of Inexpensive Datacenters
Datacenter A
Datacenter B
Jobseekers
RAID makes datacenters more reliable
Datacenter A
Datacenter B
Jobseekers
Building a more reliable system
Using inexpensive, less reliable components
99.7% in, 99.999% out
Now our system can get better availability as a whole than any single provider can give us.
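The "99.7% in, 99.999% out" claim follows from basic probability if datacenter failures are independent and traffic can move to a surviving datacenter immediately. Both are idealizations assumed for this sketch; the function name is illustrative.

```python
def combined_availability(per_datacenter: float, count: int) -> float:
    """Probability that at least one of `count` independent datacenters is up.

    Assumes failures are independent and failover is instantaneous,
    which real systems only approximate.
    """
    return 1 - (1 - per_datacenter) ** count


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} datacenter(s) at 99.7% each -> {combined_availability(0.997, n) * 100:.5f}%")
```

With two datacenters at 99.7% each, both are down at once only 0.003 × 0.003 = 0.0009% of the time, which puts the combined figure just above 99.999%.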
Expect your datacenter to fail
Failure is inevitable
Design for it
Simpler datacenters with RAID
Only need one of everything inside each datacenter:
● Firewalls
● Load balancers
● Servers provisioned primarily for capacity, not redundancy
Primary and secondary datacenters
[Diagram: datacenters 1 and 2]
Datacenter level redundancy
Protects against a single datacenter failure
But there are problems that can affect more than one datacenter on the same provider
Denial of service attacks
Distributed denial of service attack against another customer who had servers in the same facilities took multiple facilities offline
Network configuration errors
Provider pushed a bad global route which took their entire global network offline
The biggest threat
Humans
Protect against global provider failure
Use multiple providers to get provider-level redundancy
Provider-level redundancy
[Diagram: datacenters 1 and 2, plus two marked X]
Recovering from Failure
● Offline
● Active/Passive
● Active/Active
Offline
● One active datacenter handles all traffic
● Backup systems are offline and incomplete
● Restore backups to new systems
● Downtime during switchover is ~days
Active / Passive (Dark)
● One active datacenter handles all traffic
● A second datacenter has provisioned systems and all data
● Switch from primary to secondary
● Downtime during switchover is minutes to hours
Active / Active
● Every datacenter handles traffic
● Data and systems are replicated
● Failover activated automatically
● Downtime during switchover measured in seconds
● Scales beyond two facilities
Jobseeker Impact
Offline: extended downtime for all jobseekers
Active/Passive: some downtime for all jobseekers
Active/Active: brief downtime for some jobseekers
Which jobseekers go to which datacenter?
Offline: go to single datacenter
Active/Passive: go to single datacenter
Active/Active: go to many datacenters?
Send jobseekers to the best datacenter
Use a dynamic DNS service to send jobseekers to the best healthy datacenter
Anycast DNS
Resolving same hostname to different IP addresses
● Client A: nslookup www.indeed.com
Server: dns.client-a.com
Address: 1.1.1.1
● Client B: nslookup www.indeed.com
Server: dns.client-b.com
Address: 2.2.2.2
DNS Lookup
[Diagram: Jobseeker A asks the jobseeker's DNS server (5.5.5.5), which asks the Indeed DNS service; www.indeed.com resolves to 1.1.1.1]
Vary response from primary DNS
[Diagram: Jobseeker A, via DNS server 5.5.5.5, gets www.indeed.com = 1.1.1.1; Jobseeker B, via DNS server 8.8.8.8, gets www.indeed.com = 2.2.2.2]
Similar jobseekers get similar responses
[Diagram: Jobseeker A, via DNS server 5.5.5.5, gets www.indeed.com = 1.1.1.1; Jobseekers B and C, both via DNS server 8.8.8.8, get www.indeed.com = 2.2.2.2]
Remap jobseekers via DNS changes
[Diagram: before the reconfiguration, Jobseeker A, via DNS server 5.5.5.5, gets www.indeed.com = 1.1.1.1; after it, the same lookup returns 2.2.2.2]
Outsource your DNS service
Doing this well is an investment
Outsource your DNS service
● Robust● Flexible● Inexpensive
Our core competency is jobs
Their core competency is DNS
Global DNS Service
Degradation and Failure
Manually switch datacenter on service degradation
Automatically switch datacenter on failure
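The two switching modes above can be pictured as two flags per datacenter: one set by an automated health check, one cleared by an operator. The sketch below is a hypothetical model of that idea, not the API of any DNS service; the `Datacenter` class, its fields, and the fail-open fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    ip: str
    healthy: bool = True  # hypothetical flag set by an automated health check (failure)
    enabled: bool = True  # hypothetical flag cleared manually by an operator (degradation)


def answer_pool(datacenters: list[Datacenter]) -> list[str]:
    """Return the IPs that may be handed out for the site's hostname."""
    pool = [dc.ip for dc in datacenters if dc.healthy and dc.enabled]
    # Assumption: if nothing qualifies, answer with every IP rather than with nothing.
    return pool or [dc.ip for dc in datacenters]


if __name__ == "__main__":
    dcs = [Datacenter("dc-a", "1.1.1.1"), Datacenter("dc-b", "2.2.2.2")]
    dcs[0].healthy = False   # automatic: health check failed
    print(answer_pool(dcs))  # only dc-b remains eligible
```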
DNS propagation delays
1. Healthcheck cycle - up to 30 seconds
2. Healthcheck server to nearest PoP
3. Jobseeker's DNS server cache refresh
4. Jobseeker's local DNS cache refresh
DNS Time-to-live (TTL)
TTL tells local name servers and clients how long to wait before looking up a domain name again
TTL limits load, but also slows change propagation
Some clients and servers ignore TTL
We specify a 30 second TTL, but local DNS servers and clients can ignore it
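One rough way to bound the switchover time for well-behaved clients is to add up the stages listed above. This sketch treats the stages as sequential and lets each cache restart its own 30-second timer, which is a pessimistic simplification (a resolver that passes along the remaining TTL would shorten it); it also assumes the hop from the healthcheck server to the PoP is negligible. The function and parameter names are illustrative.

```python
HEALTHCHECK_CYCLE_S = 30  # slides: up to 30 seconds
TTL_S = 30                # slides: 30 second TTL


def worst_case_switch_seconds(
    healthcheck_s: float = HEALTHCHECK_CYCLE_S,
    resolver_cache_s: float = TTL_S,
    client_cache_s: float = TTL_S,
    pop_propagation_s: float = 0.0,  # assumption: negligible
) -> float:
    """Upper bound on how long a TTL-respecting client keeps using a failed datacenter."""
    return healthcheck_s + pop_propagation_s + resolver_cache_s + client_cache_s


if __name__ == "__main__":
    print(f"worst case for well-behaved clients: {worst_case_switch_seconds():.0f} s")
```

Clients and resolvers that ignore the TTL fall outside this bound entirely, which is why the talk treats them separately.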
Impact of propagation delay
90 second traffic hole
30 minute tail
Well-behaved clients
Ignoring our TTL
Big Picture
90 second hole
Failing datacenter
Total traffic
Accepting DNS limitations
Complete datacenter failure is extremely rare
Predictable limitation
Massive costs to reduce propagation delay
Remapping Manually
The same system allows us to reroute traffic whenever we want
● Datacenter maintenance
● Non-critical performance problems
● Non-critical feature loss
● Other degradation of jobseeker experience
Datacenter Redirection
datacenter disabled
traffic moves to others
Anycast DNS for performance
This capability is also used to improve performance
Closer to the jobseekers
The DNS service can give the IP address of the datacenter closest to the jobseeker.
Network hops
Based on network hops between the jobseeker's DNS server and our DNS service PoP
Network paths
Estimates how many networks traffic must pass through to reach our servers
Count hops
Picks estimated shortest path
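The selection rule described above, fewest estimated hops among healthy datacenters, can be sketched as follows. The hop counts and datacenter names are made-up inputs; how a real DNS service estimates network paths is not covered in the talk.

```python
def nearest_healthy_datacenter(hops_by_datacenter: dict[str, int], healthy: set[str]) -> str:
    """Pick the healthy datacenter with the fewest estimated network hops."""
    candidates = {dc: hops for dc, hops in hops_by_datacenter.items() if dc in healthy}
    if not candidates:
        raise LookupError("no healthy datacenter available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical hop estimates from one jobseeker's DNS server.
    estimates = {"east": 4, "central": 7, "west": 11}
    print(nearest_healthy_datacenter(estimates, {"east", "central", "west"}))
    print(nearest_healthy_datacenter(estimates, {"central", "west"}))  # east is down
```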
Optimize for network distance
We can push our datacenter presences closer to the jobseekers to reduce network latency
Datacenters for redundancy only
Fast for some jobseekers
Datacenters close to the jobseekers
Fast for most jobseekers
Sent to the East Coast
Sent to Central US
Sent to the West Coast
No downtime for datacenter replacement
Incrementally send traffic to new datacenters
Incrementally reduce traffic to old datacenters
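An incremental migration can be modeled as a sequence of traffic weights that shift from the old datacenter to the new one. The sketch below assumes the DNS service can hand out answers in proportion to configured weights, which the talk does not spell out; the function names are illustrative.

```python
import random
from typing import Iterator


def migration_schedule(old: str, new: str, steps: int) -> Iterator[dict[str, float]]:
    """Yield weight maps that move traffic from `old` to `new` in equal steps."""
    for i in range(steps + 1):
        share = i / steps
        yield {old: 1 - share, new: share}


def pick_datacenter(weights: dict[str, float], rng: random.Random) -> str:
    """Choose a datacenter for one lookup, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for weights in migration_schedule("old-west", "new-west", steps=4):
        sample = [pick_datacenter(weights, rng) for _ in range(1000)]
        print(weights, "->", sample.count("new-west"), "of 1000 lookups to new-west")
```

Stepping the weights lets operators watch the new datacenter under partial load and back out before all traffic has moved.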
Move West Coast hosting?
?
Move West Coast hosting!
-20 ms
Move European hosting?
?
Don't move European hosting!
+50 ms!
Search Engine Performance
Source: GrabPerf.org
Page Load Time
1,000ms
9,000ms
Summary and Results
Charles Valentine
Traditional Scaling Model
● Higher-capacity network equipment
● Redundant firewalls
● Redundant load balancers
● Bigger Internet connections
● Redundant Internet connections
This is "vertical scaling."
Horizontal Scaling with RAID
Add capacity by adding datacenters
Add redundancy by adding datacenters
Rent "good" datacenters, not "best"
You can RAID too!
Avoid using proprietary features
● Load balancer
● Security devices
● Virtualization
● Servers
Be Hardware Agnostic
More potential providers
Use free software
No licensing costs or recurring maintenance fees
Agile Providers
● New hardware racked and ready in a few hours
● No need to over provision
Automate configuration
● Cobbler
● Puppet
Rent instead of buying
● Obsolete hardware is not your problem
● No depreciation
● No hardware maintenance
● No need to hire people to maintain the hardware
Architect Applications for RAID
Work with your development teams
Traditional Hardware Scaling
● Old hardware supports baseline traffic
● New hardware supports growth
Indeed Hardware Scaling
Old hardware gets replaced by new, on demand
Moore's Law
Hardware is always getting better
● Faster processors
● More memory per chassis
● Larger, faster disks
Higher capacity, lower cost
● Number of machines drives cost
● Power of machines drives cost
● More machines => more problems
● Compute power grows faster than compute cost
Replace hardware every 18 months
Managed hosting + Moore's Law + RAID = new and powerful hardware every 18 months
Amazon EC2?
● Amazon is a single provider
● Costs more to run 24x7
  ○ 2x without bandwidth cost
● Can't be as close to the jobseeker
What RAID gets you
● Servers closer to your customers
● Disposable datacenters
  ○ Datacenter-level failover
  ○ Get modern hardware every 18 months
● Many hosting options
Spend Time On...
● Automation
● Managed DNS
● Investigating Providers
● Monitoring
Spend Less On
● Proprietary hardware
● Network Infrastructure
● Support Contracts
● Software Licenses
● Headcount
Monthly Server Count vs Job Search
Inexpensive
● Cost as a percentage of revenue
● Cost of delivery per job search
Revenue vs Infrastructure Cost
Revenue/Search vs. Cost/Search
Fast
● 100 ms average client time
Reliable
● > 99.999% availability in 2012
Cost Effective
● Cost of delivery < 0.5% of revenue
RAIDing FTW
Q&A