ACMS: The Akamai Configuration Management System

Alex Sherman (Akamai Technologies, Columbia University)
Phil Lisiecki (Akamai Technologies)
Andy Berkheimer (Akamai Technologies)
Joel Wein (Akamai Technologies, Polytechnic University)

Page 2: ACMS: The Akamai Configuration Management System

ACMS

The Akamai Platform

• Akamai operates a Content Delivery Network of 15,000+ servers distributed across 1,200+ ISPs in 60+ countries

• Web properties (Akamai’s customers) use these servers to bring their web content and applications closer to the end-users

Page 3: ACMS: The Akamai Configuration Management System

Problem: configuration and control

• Even with a widely distributed platform, customers need to maintain control over how their content is served

• Customers need to configure their service options with the same ease and flexibility as if it were a centralized, locally hosted system

Page 4: ACMS: The Akamai Configuration Management System

Problem (cont)

• Customer profiles include hundreds of parameters. For example:
  – Cache TTLs
  – Allow lists
  – Whether cookies are accepted
  – Whether application sessions are stored

• In addition, internal Akamai services require dynamic reconfigurations (mapping, load-balancing, provisioning services)

Page 5: ACMS: The Akamai Configuration Management System

Why is this difficult?

• 15,000 servers must synchronize to the latest configurations within a few minutes

• Some servers may be down or “partitioned-off” at the time of reconfiguration

• A server that comes up after some downtime must re-synchronize quickly

• Configuration may be initiated from anywhere on the network and must reach all other servers

Page 6: ACMS: The Akamai Configuration Management System

Outline

• High-level overview of the functioning configuration system

• Distributed protocols that guarantee fault-tolerance (based on earlier literature)

• Operational experience and evaluation

Page 7: ACMS: The Akamai Configuration Management System

Assumptions

• Configuration files may vary in size from a few hundred bytes to 100MB

• Submissions may originate from anywhere on the Internet

• Configuration files are submitted in their entirety (no diffs)

Page 8: ACMS: The Akamai Configuration Management System

System Requirements

• High availability – system must be up 24x7 and accessible from various points on the network

• Fault-tolerant storage of configuration files for asynchronous delivery

• Efficient delivery – configuration files must be delivered to the “live” edge servers quickly

• Recovery – edge servers must “recover” quickly

• Consistency – for a given configuration file the system must synchronize to a “latest” version

• Security – configuration files must be authenticated and encrypted

Page 9: ACMS: The Akamai Configuration Management System

Proposed Architecture: Two Subsystems

• Front-end – a small collection of Storage Points responsible for accepting, storing, and synchronizing configuration files

• Back-end – reliable and efficient delivery of configuration files to all of the edge servers; leverages the Akamai CDN

Page 10: ACMS: The Akamai Configuration Management System

[Figure: Publishers, Storage Points, and 15,000 Edge Servers]

1. Publisher transmits a file to a Storage Point

2. Storage Points store, synchronize and upload the new file on local web servers

3. Edge servers download the new file from the SPs via the CDN

Page 11: ACMS: The Akamai Configuration Management System

Front-end fault-tolerance

– We implement an agreement protocol on top of replication

– Vector Exchange: a quorum-based agreement scheme

– No dependence on a single Storage Point

• Eliminate dependence on any given network – SPs hosted by distinct ISPs

• Mitigate distributed communication failures

Page 12: ACMS: The Akamai Configuration Management System

Quorum Requirement

• We define a quorum as a majority (e.g. 3 out of 5 SPs)

• A quorum of SPs must agree on a submission

• Every future majority overlaps with the earlier majority that agreed on a file

• If there is no quorum of alive and communicating SPs, pending agreements halt until a quorum is reestablished
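
The majority rule and its overlap property can be checked with a short sketch. The names `quorum_size` and `has_quorum` are illustrative and not taken from ACMS; the five SP labels follow the "3 out of 5" example above.

```python
from itertools import combinations

def quorum_size(n_sps: int) -> int:
    """Size of the smallest strict majority of n_sps Storage Points."""
    return n_sps // 2 + 1

def has_quorum(reachable: set, all_sps: set) -> bool:
    """True if the reachable SPs form a majority of all SPs."""
    return len(reachable & all_sps) >= quorum_size(len(all_sps))

sps = {"A", "B", "C", "D", "E"}
majorities = [set(c) for c in combinations(sorted(sps), quorum_size(len(sps)))]

# Two majorities of 5 SPs hold at least 3 + 3 = 6 memberships among 5 SPs,
# so they must share at least one SP.
every_pair_overlaps = all(m1 & m2 for m1 in majorities for m2 in majorities)

print(quorum_size(5), has_quorum({"A", "B"}, sps), every_pair_overlaps)
# prints: 3 False True
```

The same counting argument holds for any number of SPs, which is why an earlier agreement is always visible to at least one member of a later majority.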

Page 13: ACMS: The Akamai Configuration Management System

Accepting a file

• A publisher contacts an accepting SP

• The accepting SP replicates a temporary file to a majority of SPs

• If replication succeeds the accepting SP initiates an agreement algorithm called Vector Exchange

• Upon success the accepting SP “accepts” and all SPs upload the new file
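
The acceptance steps above can be sketched as a single function. This is a hypothetical outline, not the ACMS implementation: `accept_submission`, the `replicate` and `run_vector_exchange` callbacks, and the status strings are all assumptions made for illustration.

```python
def accept_submission(payload, sps, replicate, run_vector_exchange):
    """Return a status string for one submission (hypothetical flow)."""
    needed = len(sps) // 2 + 1
    # Step 1: replicate the temporary file and count the SPs that stored it.
    stored = [sp for sp in sps if replicate(sp, payload)]
    if len(stored) < needed:
        return "rejected"
    # Step 2: only after replication succeeds, try to reach agreement.
    if not run_vector_exchange(stored):
        return "pending"
    # Step 3: agreement reached, so all SPs may upload the new file.
    return "accepted"

sps = ["A", "B", "C", "D", "E"]
reachable = {"A", "B", "C"}
status = accept_submission(
    b"ttl=30",
    sps,
    replicate=lambda sp, payload: sp in reachable,
    run_vector_exchange=lambda stored: len(stored) >= 3,
)
print(status)  # prints: accepted
```

Keeping replication ahead of agreement means that an agreed submission is already stored on a majority of SPs.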

Page 14: ACMS: The Akamai Configuration Management System

Vector Exchange (based on vector clocks)

• For each agreement, SPs exchange a bit vector.

• Each bit corresponds to the “commitment” status of one SP.

• Once a majority of bits are set we say that “agreement” takes place

• When any SP “learns” of an agreement it can upload the submission

Page 15: ACMS: The Akamai Configuration Management System

Vector Exchange: an example

• “A” initiates and broadcasts a vector: A:1 B:0 C:0 D:0 E:0

• “C” sets its own bit and rebroadcasts: A:1 B:0 C:1 D:0 E:0

• “D” sets its bit and rebroadcasts: A:1 B:0 C:1 D:1 E:0

• Any SP learns of the “agreement” when it sees a majority of bits set.

[Figure: Storage Points A, B, C, D, and E]
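
The example can be replayed with a small in-memory simulation. This is a sketch under assumptions: there is no networking, the helper names are invented, and merging vectors with a bitwise OR is an assumed rule that the slides do not spell out.

```python
SP_NAMES = ["A", "B", "C", "D", "E"]

def merge(local, received):
    """Combine two commitment vectors with a bitwise OR (assumed merge rule)."""
    return {sp: local[sp] | received[sp] for sp in SP_NAMES}

def agreed(vector):
    """Agreement holds once a majority of bits are set."""
    return sum(vector.values()) >= len(vector) // 2 + 1

def vector_exchange(initiator, responders):
    """Return the sequence of vectors broadcast until agreement."""
    vector = {sp: 0 for sp in SP_NAMES}
    vector[initiator] = 1
    broadcasts = [dict(vector)]
    for sp in responders:
        if agreed(vector):
            break
        own = {name: int(name == sp) for name in SP_NAMES}
        vector = merge(vector, own)  # sp sets its own bit
        broadcasts.append(dict(vector))
    return broadcasts

for v in vector_exchange("A", ["C", "D", "B"]):
    line = " ".join(f"{sp}:{bit}" for sp, bit in v.items())
    print(line, "agreement" if agreed(v) else "pending")
# prints:
# A:1 B:0 C:0 D:0 E:0 pending
# A:1 B:0 C:1 D:0 E:0 pending
# A:1 B:0 C:1 D:1 E:0 agreement
```

"B" never needs to respond in this run: three set bits out of five already form a majority.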

Page 16: ACMS: The Akamai Configuration Management System

Vector Exchange Guarantees

• If a submission is accepted, at least a majority of SPs have stored and agreed on the submission

• The agreement is never lost by a future quorum. Q: Why?

• A: Any future quorum contains at least one SP that saw an initiated agreement.

• VE borrows ideas from Paxos, BFS [Liskov]
  – Weaker: cannot implement a state machine with VE
  – VE offers simplicity, flexibility

Page 17: ACMS: The Akamai Configuration Management System

Recovery Routine

• Each SP runs a recovery routine continuously to query other SPs for “missed” agreements.

• If an SP finds that it missed an agreement, it downloads the corresponding configuration file

• Recovery allows
  – SPs that experience downtime to recover state
  – Termination of VE messages once agreement occurs
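
One pass of such a recovery routine might look like the sketch below. It assumes that every file carries a comparable integer version and that each peer exposes a table of file names to versions; `recover` and its `download` callback are hypothetical names.

```python
def recover(local, peers, download):
    """Bring `local` (file name -> version) up to date from `peers`.

    `peers` maps an SP name to that SP's file name -> version table.
    Versions are assumed to be comparable integers.
    """
    fetched = []
    for peer, table in peers.items():
        for name, version in table.items():
            if version > local.get(name, 0):
                download(peer, name, version)  # fetch the missed file
                local[name] = version
                fetched.append((peer, name, version))
    return fetched

local = {"ttl.conf": 3}
peers = {
    "B": {"ttl.conf": 4, "allow.conf": 1},
    "C": {"ttl.conf": 4},
}
log = []
fetched = recover(local, peers, download=lambda *args: log.append(args))
print(fetched)
# prints: [('B', 'ttl.conf', 4), ('B', 'allow.conf', 1)]
```

Running this pass in a loop lets an SP that missed the Vector Exchange messages still converge on the agreed files.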

Page 18: ACMS: The Akamai Configuration Management System

Recovery Optimization: Snapshots

• A snapshot is a hierarchical index structure that describes the latest versions of all accepted files

• Each SP updates its own snapshot when it learns of agreements

• As part of the recovery process an SP queries snapshots on other SPs

• Side-effect: snapshots are also used by the edge servers (back-end) to detect changes.
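
A two-level index illustrates why a hierarchy helps: a poller can compare one summary value per group and read a group's file list only when that summary differs. The layout below is an assumption for illustration, not the ACMS snapshot format. It relies on each accepted update receiving a globally increasing sequence number, so that a group's highest number changes whenever any file in the group changes.

```python
def summarize(snapshot):
    """Top level of the index: the highest sequence number in each group."""
    return {group: max(files.values(), default=0)
            for group, files in snapshot.items()}

def changed_files(local, remote):
    """List (group, file, sequence) entries that are newer in `remote`."""
    local_top = summarize(local)
    changes = []
    for group, latest in summarize(remote).items():
        if local_top.get(group) == latest:
            continue  # group unchanged: no need to read its file list
        for name, seq in remote[group].items():
            if seq > local.get(group, {}).get(name, 0):
                changes.append((group, name, seq))
    return sorted(changes)

local = {"cache": {"ttl.conf": 3, "allow.conf": 5},
         "mapping": {"regions.conf": 4}}
remote = {"cache": {"ttl.conf": 3, "allow.conf": 5},
          "mapping": {"regions.conf": 7}}
print(changed_files(local, remote))
# prints: [('mapping', 'regions.conf', 7)]
```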

Page 19: ACMS: The Akamai Configuration Management System

Back-end: Delivery

• Processes on edge servers subscribe to specific configurations via their local Receiver process

• Receivers periodically query the snapshots on the SPs to learn of any updates.

• If the updates match any subscriptions, the Receivers download the files via HTTP If-Modified-Since (IMS) requests.

[Figure: Edge Server, Receiver, SP]
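
The effect of an If-Modified-Since request can be shown without a network. The sketch below uses a toy stand-in for an SP web server: `conditional_get` and the `config` dictionary are hypothetical, and only the 200/304 behaviour mirrors real HTTP.

```python
from email.utils import formatdate, parsedate_to_datetime

def conditional_get(resource, if_modified_since=None):
    """Toy stand-in for an SP web server answering a GET.

    Returns (status, body, last_modified_header).
    """
    last_modified = formatdate(resource["mtime"], usegmt=True)
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since).timestamp()
        if resource["mtime"] <= since:
            return 304, None, last_modified  # Not Modified: no body sent
    return 200, resource["body"], last_modified

config = {"mtime": 1_100_000_000, "body": b"ttl=30"}

status, body, stamp = conditional_get(config)           # first download
again, _, _ = conditional_get(config, stamp)            # nothing changed
config.update(mtime=config["mtime"] + 60, body=b"ttl=60")
updated, new_body, _ = conditional_get(config, stamp)   # new version
print(status, again, updated)
# prints: 200 304 200
```

A Receiver that polls often pays for a full transfer only when the file has actually changed.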

Page 20: ACMS: The Akamai Configuration Management System

Delivery (continued)

• Delivery is accelerated via the CDN
  – Local Akamai caches
  – Hierarchical download
  – Optimized overlay routing

• Delivery scales with the growth of the CDN

• Akamai caches use a short TTL (on the order of 30 seconds) for the configuration files

Page 21: ACMS: The Akamai Configuration Management System

Operational Experience

• We rely heavily on the Network Operations Control Center (NOCC) for early fault detection

Page 22: ACMS: The Akamai Configuration Management System

Operational Experience (continued)

• Quorum Assumption
  – 36 instances of an SP disconnected from the quorum for more than 10 minutes due to network outages during Jan-Sep of 2004
  – In all instances there was an operating quorum of other SPs
  – Shorter network outages do occur (e.g. two outages of several minutes each between a pair of SPs over a 6-day period)

• Permanent Storage – files may get corrupted
  – NOCC recorded 3 instances of file corruption on the SPs over a 6-month period
  – We use an MD5 hash when writing state files
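
Guarding a state file with an MD5 digest can be sketched as follows. The record layout (hex digest, newline, payload) and the names `seal` and `unseal` are assumptions; the slides say only that an MD5 hash is used when writing state files. MD5 serves here to detect accidental corruption, not as a security measure.

```python
import hashlib

def seal(payload: bytes) -> bytes:
    """Prefix the payload with its MD5 hex digest (assumed layout)."""
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload

def unseal(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if the digest does not match."""
    digest, _, payload = record.partition(b"\n")
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        raise ValueError("state file corrupted")
    return payload

record = seal(b"ttl=30")
print(unseal(record))  # prints: b'ttl=30'
```

An SP that hits the `ValueError` path could discard the file and fetch a good copy from another SP through the recovery routine.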

Page 23: ACMS: The Akamai Configuration Management System

Operational Experience - safeguards

• To prevent CDN-wide outages due to a corrupted configuration, some files are “zoned”
  – Publish a file to a set of edge servers = zone 1
  – If the system processes the file successfully, publish to zone 2, etc.

• Receivers fail over from the CDN to the SPs

• Recovery = backup for VE – useful in building state on a fresh SP
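
Zoned publishing amounts to a staged rollout that halts at the first zone reporting a problem. The sketch below is hypothetical: `zoned_publish`, the `deploy` and `processed_ok` callbacks, and the server names are invented for illustration.

```python
def zoned_publish(zones, deploy, processed_ok):
    """Publish zone by zone; stop at the first zone that reports a failure.

    Returns (servers_deployed, index_of_failed_zone or None).
    """
    deployed = []
    for index, zone in enumerate(zones):
        for server in zone:
            deploy(server)
            deployed.append(server)
        if not all(processed_ok(server) for server in zone):
            return deployed, index  # later zones never receive the file
    return deployed, None

zones = [["e1", "e2"], ["e3", "e4", "e5"], ["e6"]]
broken = {"e4"}  # pretend e4 cannot process the new file
deployed, failed_zone = zoned_publish(
    zones,
    deploy=lambda server: None,
    processed_ok=lambda server: server not in broken,
)
print(deployed, failed_zone)
# prints: ['e1', 'e2', 'e3', 'e4', 'e5'] 1
```

A bad file therefore reaches only the zones published so far, leaving `e6` untouched in this run.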

Page 24: ACMS: The Akamai Configuration Management System

File Stats

• Configuration file sizes range from a few hundred bytes to 100MB. The average file size is around 121KB.

• Submission time dominated by replication to SPs (may take up to 2 minutes for very large files)

• 15,000 files submitted over 48 hours

Page 25: ACMS: The Akamai Configuration Management System

Propagation Time

• Randomly sampled 250 edge servers to measure propagation time.

• 55 seconds on avg.

• Dominated by cache TTL and polling intervals

Page 26: ACMS: The Akamai Configuration Management System

Propagation vs. File Sizes

• Mean and 95th percentile propagation time vs. file size

• 99.95% of updates arrived within 3 minutes

• The rest delayed due to temporary connectivity issues

Page 27: ACMS: The Akamai Configuration Management System

Tail of Propagation

• Another random sample of 300 edge servers over a 4 day period

• Measured propagation of small files (under 20KB)

• 99.8% of the time the file is received within 2 minutes

• 99.96% of the time the file is received within 4 minutes

Page 28: ACMS: The Akamai Configuration Management System

Scalability

• Front-end scalability is dominated by replication
  – With 5 SPs and 121KB avg. file size, Vector Exchange overhead is 0.4% of bandwidth
  – With 15 SPs, overhead is 1.2%
  – For a larger footprint, hashing can be used to pick a set of SPs for each configuration file

• Back-end scalability
  – Cacheability grows as the CDN penetrates more ISPs
  – Reachability of edge machines inside remote ISPs improves with more alternate paths
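
One way to pick a set of SPs per configuration file by hashing is rendezvous (highest-random-weight) hashing, sketched below. The slides do not say which hashing scheme is meant, so the technique, the function name `sps_for_file`, the SP labels, and the default of 5 replicas are all assumptions.

```python
import hashlib

def sps_for_file(filename, all_sps, replicas=5):
    """Pick `replicas` SPs for a file by ranking per-(SP, file) hash scores."""
    def score(sp):
        return hashlib.sha256(f"{sp}/{filename}".encode("utf-8")).hexdigest()
    return sorted(all_sps, key=score)[:replicas]

all_sps = [f"sp{i:02d}" for i in range(15)]
chosen = sps_for_file("customer-123/cache.conf", all_sps)
print(chosen)
```

Every node computes the same set from the file name alone, and removing an SP that was not chosen leaves the selection for that file unchanged.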

Page 29: ACMS: The Akamai Configuration Management System

Conclusion

• ACMS uses a set of distributed algorithms that ensure a high level of fault tolerance

• A quorum-based system allows operators to ignore transient faults, and gives them more time to react to significant Storage Point failures

• ACMS is a core subsystem of the Akamai CDN that customers rely on to administer content