Apache Helix presentation at VMware.
Building distributed systems using Helix
Kishore Gopalakrishna, @kishoreg1980
http://www.linkedin.com/in/kgopalak
http://helix.incubator.apache.org
Apache Incubation, Oct 2012
@apachehelix
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
Examples of distributed data systems
Lifecycle: Single node → Multi-node → Fault tolerance → Cluster expansion
• Multi-node: Partitioning, Discovery, Co-location
• Fault tolerance: Replication, Fault detection, Recovery
• Cluster expansion: Throttle data movement, Re-distribution
An application built directly on Zookeeper (a consensus system) gets only low-level primitives: file system, lock, ephemeral node.
An application built on a framework over the consensus system gets high-level primitives: node, partition, replica, state, transition.
Zookeeper provides low-level primitives; we need high-level primitives.
Architecture
Terminologies
• Node: a single machine
• Cluster: set of nodes
• Resource: a logical entity, e.g. database, index, task
• Partition: subset of a resource
• Replica: copy of a partition
• State: status of a partition replica, e.g. Master, Slave
• Transition: action that lets replicas change status, e.g. Slave → Master
Core concept: state machine
• Set of legal states: S1, S2, S3
• Set of legal state transitions: S1→S2, S2→S1, S2→S3, S3→S2
Core concept: constraints
• Minimum and maximum number of replicas that should be in a given state, e.g. S3: max=1, S2: min=2
• Maximum concurrent transitions: per node, per resource, across the cluster
Core concept: objectives
• Partition placement: even distribution of replicas in states S1, S2 across the cluster
• Failure/expansion semantics: create new replicas and assign state, change state of existing replicas, keep replicas evenly distributed
Augmented finite state machine
State machine
• States: S1, S2, S3
• Transitions: S1→S2, S2→S1, S2→S3, S3→S1
Constraints
• States: S1: max=1, S2: min=2
• Transitions: concurrent (S1→S2) across cluster < 5
Objectives
• Partition placement
• Failure semantics
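The augmented state machine above can be sketched in plain Java. This is a schematic of the concept, not the Helix API; the state names and the S1 max=1 bound come from the slide:

```java
import java.util.*;

// Schematic augmented FSM: legal states, legal transitions,
// and a max-count constraint per state (S1: max=1, from the slide).
class AugmentedFsm {
    static final Set<String> STATES = Set.of("S1", "S2", "S3");
    static final Set<String> TRANSITIONS =
        Set.of("S1->S2", "S2->S1", "S2->S3", "S3->S1");
    static final Map<String, Integer> MAX_PER_STATE = Map.of("S1", 1);

    // A transition is legal only if it is declared and the target
    // state's max-count constraint would not be violated.
    static boolean canTransition(String from, String to,
                                 Map<String, Integer> currentCounts) {
        if (!TRANSITIONS.contains(from + "->" + to)) return false;
        Integer max = MAX_PER_STATE.get(to);
        return max == null || currentCounts.getOrDefault(to, 0) < max;
    }
}
```

Helix's controller performs an analogous check cluster-wide before firing any transition.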
Message consumer group: problem
Partitioned queues must be assigned to consumers: assignment, scaling, fault tolerance.
Partition management
• One consumer per queue
• Even distribution
Elasticity
• Re-distribute queues among consumers
• Minimize movement
Fault tolerance
• Re-distribute
• Minimize movement
• Limit max queues per consumer
Message consumer group: solution
ONLINE-OFFLINE state model
• OFFLINE → ONLINE: start consumption; ONLINE → OFFLINE: stop consumption
• MAX=1 per partition
• MAX 10 queues per consumer
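The consumer-group constraints (even distribution, a cap of queues per consumer) can be illustrated with a small assignment sketch. The round-robin strategy and names are illustrative only; Helix computes this assignment itself:

```java
import java.util.*;

// Round-robin queues over consumers: even distribution with a
// hard cap per consumer (MAX 10 queues per consumer on the slide).
class QueueAssigner {
    static Map<String, List<Integer>> assign(List<String> consumers,
                                             int numQueues, int maxPerConsumer) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        consumers.forEach(c -> assignment.put(c, new ArrayList<>()));
        for (int q = 0; q < numQueues; q++) {
            String c = consumers.get(q % consumers.size());
            if (assignment.get(c).size() >= maxPerConsumer)
                throw new IllegalStateException("capacity exceeded");
            assignment.get(c).add(q);
        }
        return assignment;
    }
}
```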
Distributed data store
[Figure: partitions P.1 to P.12, each with MASTER and SLAVE replicas spread across Node 1, Node 2, and Node 3]
Partition management
• Multiple replicas, 1 designated master
• Even distribution
Fault tolerance
• Fault detection
• Promote slave to master
• Even distribution
• No SPOF
Elasticity
• Minimize downtime
• Minimize data movement
• Throttle data movement
Distributed data store: solution
MASTER-SLAVE state model: states O (offline), S (slave), M (master); transitions t1 to t4
• Per partition: MASTER COUNT=1, SLAVE COUNT=2
• Objectives: minimize(max_{nj∈N} S(nj)) and minimize(max_{nj∈N} M(nj)), i.e. spread slave and master replicas evenly across nodes
• Throttle transitions, e.g. t1 ≤ 5
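The objective minimize(max_{nj∈N} M(nj)) can be approximated greedily: always place the next master on the node currently holding the fewest masters. A self-contained sketch, illustrative only; Helix's actual placement also balances slaves and minimizes movement:

```java
import java.util.*;

// Greedy master placement: give each partition's master to the node
// with the fewest masters so far, minimizing max M(n) over nodes.
class MasterPlacement {
    static Map<Integer, String> placeMasters(List<String> nodes, int partitions) {
        Map<String, Integer> load = new HashMap<>();
        nodes.forEach(n -> load.put(n, 0));
        Map<Integer, String> masters = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            // Pick the least-loaded node (first one on ties).
            String best = Collections.min(nodes, Comparator.comparingInt(load::get));
            masters.put(p, best);
            load.merge(best, 1, Integer::sum);
        }
        return masters;
    }
}
```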
Distributed search service
[Figure: index shards P.1 to P.6, each with replicas placed across Node 1, Node 2, and Node 3]
Partition management
• Multiple replicas
• Even distribution
• Rack-aware placement
Fault tolerance
• Fault detection
• Auto-create replicas
• Controlled creation of replicas
Elasticity
• Re-distribute partitions
• Minimize movement
• Throttle data movement
Distributed search service: solution
• MAX=3 (number of replicas)
• MAX per node=5
Internals
IDEALSTATE
P1: N1:M, N2:S
P2: N2:M, N3:S
P3: N3:M, N1:S
Configuration
• 3 nodes • 3 partitions • 2 replicas • StateMachine
Constraints
• 1 master • 1 slave • Even distribution
The idealstate captures both replica placement and replica state.
CURRENT STATE
N1: P1:OFFLINE, P3:OFFLINE
N2: P2:MASTER, P1:MASTER
N3: P3:MASTER, P2:SLAVE
EXTERNAL VIEW
P1: N1:O, N2:M
P2: N2:M, N3:S
P3: N3:M, N1:O
Helix-based system roles
[Figure: PARTICIPANT nodes (Node 1 to Node 3) host the partition replicas and report CURRENT STATE; the CONTROLLER compares CURRENT STATE with IDEAL STATE and sends transition commands, receiving responses; SPECTATORs read the external view for partition routing logic]
Logical deployment
How to use Helix
Helix-based solution
1. Define
2. Configure
3. Run
Define: state model definition
• States: all possible states, with priority
• Transitions: legal transitions, with priority
• Applicable to each partition of a resource
• e.g. MasterSlave (states O, S, M)
Define: state model

StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MASTERSLAVE");
// Add states and their rank to indicate priority.
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// Set the initial state when the node starts.
builder.initialState(OFFLINE);

// Add transitions between the states.
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
Define: constraints
Constraints apply to states and transitions at each scope:
Scope     | State | Transition
Partition | Y     | Y
Resource  | -     | Y
Node      | Y     | Y
Cluster   | -     | Y
Example (MASTER-SLAVE model): per partition, M=1, S=2, i.e. MASTER COUNT=1, SLAVE COUNT=2.
Define: constraints

// static constraint
builder.upperBound(MASTER, 1);

// dynamic constraint: "R" means the replica count of the resource
builder.dynamicUpperBound(SLAVE, "R");

// unconstrained
builder.upperBound(OFFLINE, -1);
Define: participant plug-in code
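The participant code itself is missing from the transcript (it was a slide screenshot). In Helix, participants extend a StateModel class whose callback methods (e.g. onBecomeSlaveFromOffline) are invoked per transition. A self-contained sketch of that callback pattern, not the actual Helix API:

```java
import java.util.*;
import java.util.function.BiConsumer;

// Participant plug-in sketch: register a callback per transition;
// the framework invokes it when the controller fires that transition.
class ParticipantStateModel {
    private final Map<String, BiConsumer<String, String>> handlers = new HashMap<>();

    void onTransition(String from, String to, BiConsumer<String, String> handler) {
        handlers.put(from + "->" + to, handler);
    }

    // Invoked by the framework with the partition and target state.
    void fire(String partition, String from, String to) {
        BiConsumer<String, String> h = handlers.get(from + "->" + to);
        if (h == null) throw new IllegalStateException("no handler for " + from + "->" + to);
        h.accept(partition, to);
    }
}
```

A MasterSlave participant would register handlers for OFFLINE→SLAVE (start replicating), SLAVE→MASTER (start serving writes), and so on.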
Step 2: Configure
helix-admin --zkSvr <zkAddress>

CREATE CLUSTER
  --addCluster <clusterName>
ADD NODE
  --addNode <clusterName> <instanceId(host:port)>
CONFIGURE RESOURCE
  --addResource <clusterName> <resourceName> <partitions> <stateModel>
REBALANCE (sets the IDEALSTATE)
  --rebalance <clusterName> <resourceName> <replicas>
Zookeeper view: IDEALSTATE
Step 3: Run

START CONTROLLER
  run-helix-controller --zkSvr localhost:2181 --cluster MyCluster
START PARTICIPANT

Zookeeper view: znode content (CURRENT STATE, EXTERNAL VIEW)
Spectator plug-in code
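The spectator slide's code is also missing from the transcript. A spectator routes requests using the external view (Helix provides a RoutingTableProvider class for this); here is a self-contained sketch assuming an external view shaped like the earlier slides:

```java
import java.util.*;

// Spectator routing sketch: the external view maps
// partition -> (node -> state); writes go to the MASTER replica.
class Router {
    final Map<String, Map<String, String>> externalView;

    Router(Map<String, Map<String, String>> externalView) {
        this.externalView = externalView;
    }

    // Node holding the MASTER replica of a partition, or empty if none.
    Optional<String> masterOf(String partition) {
        return externalView.getOrDefault(partition, Map.of()).entrySet().stream()
            .filter(e -> e.getValue().equals("MASTER"))
            .map(Map.Entry::getKey)
            .findFirst();
    }
}
```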
Helix execution modes
IDEALSTATE
P1: N1:M, N2:S
P2: N2:M, N3:S
P3: N3:M, N1:S
Configuration
• 3 nodes • 3 partitions • 2 replicas • StateMachine
Constraints
• 1 master • 1 slave • Even distribution
The idealstate has two parts: replica placement and replica state.
Execution modes: who controls what
Mode           | Replica placement | Replica state
AUTO REBALANCE | Helix             | Helix
AUTO           | App               | Helix
CUSTOM         | App               | App
Auto rebalance vs. Auto
In action: Auto rebalance vs. Auto (MasterSlave, p=3, r=2, N=3)
Initial placement in both modes:
Node 1: P1:M, P2:S; Node 2: P2:M, P3:S; Node 3: P3:M, P1:S
AUTO, on failure of Node 1: only change states of existing replicas to satisfy constraints (Node 1's replicas go OFFLINE; P1:S on Node 3 is promoted to P1:M).
AUTO REBALANCE, on failure of Node 1: auto-create replicas on the surviving nodes and assign them states.
Custom mode: example
Custom mode: handling failure
• Custom code invoker
• Code that lives on all nodes, but is active in one place
• Invoked when a node joins/leaves the cluster
• Computes the new idealstate
• The Helix controller fires the transitions without violating constraints
Old idealstate:
P1: N1:M, N2:S
P2: N2:M, N3:S
P3: N3:M, N1:S
New idealstate computed by the custom code:
P1: N1:S, N2:M
P2: N2:M, N3:S
P3: N3:M, N1:S
Transitions for P1:
1. N1: M → S
2. N2: S → M
Firing 1 and 2 in parallel could violate the single-master constraint, so Helix sends 2 only after 1 is finished.
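The scheduling rule above can be sketched: order the pending transitions so demotions run before promotions, which keeps the master count for a partition from exceeding one. A schematic, not Helix's actual throttling logic:

```java
import java.util.*;

// Transition scheduling sketch: demotions (M->S) are ordered before
// promotions (S->M), so the master slot is freed before it is claimed.
class TransitionScheduler {
    record T(String node, String from, String to) {}

    static List<T> order(List<T> pending) {
        List<T> ordered = new ArrayList<>(pending);
        // Transitions into M sort last; the stable sort keeps other order.
        ordered.sort(Comparator.comparingInt((T t) -> t.to().equals("M") ? 1 : 0));
        return ordered;
    }
}
```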
Controller deployment
Separate
• At least 2 separate controller processes to avoid SPOF
• Only one controller active
• Extra process to manage
• Recommended for large clusters
• Upgrading the controller is easy
Embedded
• Controller embedded within each participant
• Only one controller active
• No extra process to manage
• Suitable for small clusters
• Upgrading the controller is costly
• Participant health impacts the controller
Controller fault tolerance
Controller 1: LEADER, Controller 2: STANDBY, Controller 3: STANDBY, all connected to Zookeeper.
Zookeeper ephemeral-node-based leader election decides the controller leader.
When the leader fails, another controller becomes the new leader: Controller 1: OFFLINE, Controller 2: LEADER, Controller 3: STANDBY.
Managing the controllers
Scaling the controller: Leader-Standby model
[Figure: a set of controllers manages many clusters; for each cluster one controller is LEADER (L) and the others are STANDBY (S) or OFFLINE (O)]
Scaling the controller: failure
[Figure: when a controller fails, a STANDBY controller becomes LEADER for the clusters it was managing]
Tools
• Chaos monkey
• Data-driven testing and debugging
• Rolling upgrade
• On-demand task scheduling and intra-cluster messaging
• Health monitoring and alerts
Data-driven testing
• Instrument: Zookeeper, controller, and participant logs
• Simulate: chaos monkey
• Analyze: check invariants such as
  • respect state transition constraints
  • respect state count constraints
  • and so on
• Debugging made easy: reproduce the exact sequence of events
Structured log file: sample
timestamp  partition  instanceName  sessionId  state
1323312236368 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236426 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236530 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236530 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236561 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236561 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236685 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236719 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236719 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236814 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
Invariant 1: no more than R=2 slaves per partition
Time  State  Number of slaves  Instance
42632 OFFLINE 0 10.117.58.247_12918
42796 SLAVE 1 10.117.58.247_12918
43124 OFFLINE 1 10.202.187.155_12918
43131 OFFLINE 1 10.220.225.153_12918
43275 SLAVE 2 10.220.225.153_12918
43323 SLAVE 3 10.202.187.155_12918
85795 MASTER 2 10.220.225.153_12918
How long was it out of whack?
Number of slaves  Time  Percentage
0 1082319 0.5
1 35578388 16.46
2 179417802 82.99
3 118863 0.05
83% of the time, there were 2 slaves to a partition; 93% of the time, there was 1 master to a partition.
Number of masters  Time  Percentage
0  15490456   7.164960359
1  200706916  92.83503964
Invariant 2: state transitions
FROM     TO       COUNT
MASTER   SLAVE    55
OFFLINE  DROPPED  0
OFFLINE  SLAVE    298
SLAVE    MASTER   155
SLAVE    OFFLINE  0
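The time-in-state percentages in the tables above are time-weighted: each logged count holds from its timestamp until the next event. A sketch of that computation on made-up event data:

```java
import java.util.*;

// Time-weighted histogram: events are (timestamp, slaveCount); each
// count holds from its timestamp until the next event's timestamp.
class InvariantAnalyzer {
    static Map<Integer, Double> percentages(long[] times, int[] counts) {
        Map<Integer, Long> held = new TreeMap<>();
        long total = times[times.length - 1] - times[0];
        for (int i = 0; i < times.length - 1; i++) {
            held.merge(counts[i], times[i + 1] - times[i], Long::sum);
        }
        Map<Integer, Double> pct = new TreeMap<>();
        held.forEach((c, t) -> pct.put(c, 100.0 * t / total));
        return pct;
    }
}
```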
Helix usage at LinkedIn
Espresso
In flight
• Apache S4: partitioning, co-location; dynamic cluster expansion
• Archiva: partitioned replicated file store; rsync-based replication
• Others in evaluation: Bigtop
Auto-scaling software deployment tool
• States: Download, Configure, Start, Active, Standby, Offline
• Constraint for each state:
  • Download < 100
  • Active: 1000
  • Standby: 100
Summary
• Helix: a generic framework for building distributed systems
• Modifying/enhancing system behavior is easy; abstraction and modularity are key
• Simple programming model: declarative state machine
Roadmap
• Span multiple data centers
• Automatic load balancing
• Distributed health monitoring
• YARN generic application master for real-time apps
• Stand-alone Helix agent
website: http://helix.incubator.apache.org
user list: [email protected]
twitter: @apachehelix, @kishoreg1980