Apache Helix presentation at VMware


DESCRIPTION

Apache Helix presentation at VMware. Download the file to see the animations.


1  

Building  distributed  systems  using  Helix  

Kishore Gopalakrishna, @kishoreg1980    http://www.linkedin.com/in/kgopalak

http://helix.incubator.apache.org    Apache Incubation, Oct 2012    @apachehelix

Outline  

•  Introduction  •  Architecture  •  How to use Helix  •  Tools  •  Helix usage

2  

3  

Examples  of  distributed  data  systems  

4  

Single  Node  

Multi node

Fault  tolerance  

Cluster  Expansion  

•  Partitioning  •  Discovery  •  Co-location

•  Replication  •  Fault detection  •  Recovery

•  Throttle data movement  •  Re-distribution

Lifecycle  

5  

Application

Zookeeper  

Application

Framework  

Consensus  System  

•  File  system  •  Lock  •  Ephemeral  

•  Node  •  Partition  •  Replica  •  State  •  Transition

Zookeeper provides low-level primitives. We need high-level primitives.

6  

Outline  

•  Introduction  •  Architecture  •  How to use Helix  •  Tools  •  Helix usage

7  

8  

Terminologies

Node: A single machine
Cluster: Set of nodes
Resource: A logical entity, e.g. database, index, task
Partition: Subset of the resource
Replica: Copy of a partition
State: Status of a partition replica, e.g. Master, Slave
Transition: Action that lets replicas change status, e.g. Slave -> Master

Core concept - state machine

•  Set  of  legal  states  – S1,  S2,  S3  

•  Set of legal state transitions – S1→S2, S2→S1, S2→S3, S3→S2

   

9  

Core concept - constraints

•  Minimum and maximum number of replicas that should be in a given state – S3: max=1, S2: min=2

•  Maximum concurrent transitions – per node – per resource – across the cluster

   

10  

Core concept: objectives

•  Partition placement – Even distribution of replicas in states S1, S2 across the cluster

•  Failure/expansion semantics – Create new replicas and assign state – Change state of existing replicas – Even distribution of replicas

11  

Augmented  finite  state  machine  

12  

State  Machine  

• States  • S1,S2,S3  

• Transitions  • S1→S2, S2→S1, S2→S3, S3→S1

Constraints  

• States  • S1: max=1, S2: min=2

• Transitions  • Concurrent(S1->S2) across cluster < 5

Objectives

• Partition placement  • Failure semantics

Message consumer group - problem

13  

ASSIGNMENT   SCALING   FAULT  TOLERANCE  

PARTITIONED  QUEUE   CONSUMER  

Partition management

• One  consumer  per  queue  

•  Even distribution

Elasticity

• Re-distribute queues among consumers

• Minimize  movement  

Fault  tolerance  

• Re-distribute  • Minimize movement  • Limit max queues per consumer

Message consumer group: solution

Offline   Online  

14  

MAX=1 per partition    Start consumption

Stop consumption

MAX  10  queues  per  consumer  

ONLINE  OFFLINE  STATE  MODEL  
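A minimal sketch of what the consumer-side plug-in for this ONLINE-OFFLINE model could look like, assuming Helix's participant callback convention (onBecome<TO>From<FROM>); the class name and the println placeholders are illustrative, not part of the original slides:

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;

// One instance per queue partition; Helix calls these methods on each transition.
public class ConsumerStateModel extends StateModel {

  public void onBecomeOnlineFromOffline(Message message, NotificationContext context) {
    // Helix assigned this queue partition to this consumer: start consumption
    System.out.println("Start consuming " + message.getPartitionName());
  }

  public void onBecomeOfflineFromOnline(Message message, NotificationContext context) {
    // The partition was moved away (rebalance or shutdown): stop consumption
    System.out.println("Stop consuming " + message.getPartitionName());
  }
}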

Distributed  data  store      

[Diagram: partitions P.1–P.12, each with replicas distributed across Node 1, Node 2, and Node 3]

Partition management

• Multiple replicas  • 1 designated master

• Even distribution

Fault  tolerance  

• Fault detection  • Promote slave to master

• Even distribution

• No  SPOF  

Elasticity

• Minimize downtime

• Minimize  data  movement  

• Throttle data movement

MASTER  

SLAVE  

Distributed data store: solution

16  

[MASTER-SLAVE STATE MODEL diagram: states O, S, M with transitions t1–t4; per-partition constraints COUNT=1 (master) and COUNT=2 (slaves); transition constraint t1 ≤ 5; objectives minimize(max_{nj∈N} S(nj)) and minimize(max_{nj∈N} M(nj)), i.e. spread slave and master replicas evenly across nodes]

Distributed  search  service      

[Diagram: index shards P.1–P.6 with replicas distributed across Node 1, Node 2, and Node 3]

Partition management

• Multiple replicas  • Even distribution

• Rack  aware  placement  

Fault  tolerance  

• Fault detection  • Auto-create replicas

• Controlled creation of replicas

Elasticity

• Re-distribute partitions

• Minimize  movement  

• Throttle data movement


Distributed search service: solution

18  

MAX=3  (number  of  replicas)  

MAX  per  node=5  

Internals  

19  

IDEALSTATE

P1: N1:M, N2:S
P2: N2:M, N3:S
P3: N3:M, N1:S

20  

Configuration

• 3 nodes  • 3 partitions  • 2 replicas  • StateMachine

Constraints  

• 1 Master  • 1 Slave  • Even distribution

Replica  placement  

Replica    State  

CURRENT  STATE  

N1:  P1:OFFLINE,  P3:OFFLINE
N2:  P2:MASTER,   P1:MASTER
N3:  P3:MASTER,   P2:SLAVE

21  

22  

EXTERNAL VIEW

P1: N1:O, N2:M
P2: N2:M, N3:S
P3: N3:M, N1:O

23  

Helix  Based  System  Roles  

[Diagram: Helix-based system roles - PARTICIPANT nodes (Node 1–3 hosting partitions P.1–P.12) exchange COMMAND/RESPONSE messages with the CONTROLLER, which drives CURRENT STATE toward IDEAL STATE; SPECTATORs hold the partition routing logic]

Logical  deployment  

24  

Outline  

•  Introduction  •  Architecture  •  How to use Helix  •  Tools  •  Helix usage

25  

Helix-based solution

1. Define    2. Configure    3. Run  

26  

Define: state model definition

•  States  –  All  possible  states  –  Priority  

•  Transitions – Legal transitions – Priority

•  Applicable to each partition of a resource

•  e.g.  MasterSlave  

27  

[State model diagram: states O, S, M]

Define:  state  model  

28  

StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MASTERSLAVE");
// Add states and their rank to indicate priority.
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// Set the initial state when the node starts
builder.initialState(OFFLINE);

// Add transitions between the states.
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);

Define: constraints

Scope       State   Transition
Partition   Y       Y
Resource    -       Y
Node        Y       Y
Cluster     -       Y

29  

[State model diagram: states O, S, M, with COUNT=1 on MASTER and COUNT=2 on SLAVE]

Scope       State      Transition
Partition   M=1, S=2   -

Define: constraints

30  

// static constraint
builder.upperBound(MASTER, 1);

// dynamic constraint: at most R slaves, where R is the resource's replica count
builder.dynamicUpperBound(SLAVE, "R");

// unconstrained
builder.upperBound(OFFLINE, -1);

Define: participant plug-in code

31  
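The slide's code is not reproduced in this text; below is a hedged sketch of participant plug-in code for the MasterSlave model used in this deck. The callback bodies are placeholders for application logic, and the exact createNewStateModel signature has varied across Helix releases:

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelFactory;

// Per-partition callbacks; Helix invokes onBecome<TO>From<FROM> for each transition.
public class MasterSlaveStateModel extends StateModel {
  private final String partition;

  public MasterSlaveStateModel(String partition) {
    this.partition = partition;
  }

  public void onBecomeSlaveFromOffline(Message msg, NotificationContext ctx) {
    // e.g. open the local store for this partition and start replicating from the master
  }

  public void onBecomeMasterFromSlave(Message msg, NotificationContext ctx) {
    // e.g. stop replicating and start serving writes for this partition
  }

  public void onBecomeSlaveFromMaster(Message msg, NotificationContext ctx) {
    // e.g. stop serving writes and resume replicating from the new master
  }

  public void onBecomeOfflineFromSlave(Message msg, NotificationContext ctx) {
    // e.g. close the local store for this partition
  }
}

// Factory Helix uses to create one state model per partition (registered in Step 3: Run).
class MasterSlaveStateModelFactory extends StateModelFactory<MasterSlaveStateModel> {
  public MasterSlaveStateModel createNewStateModel(String partitionName) {
    return new MasterSlaveStateModel(partitionName);
  }
}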

Step  2:  configure  

32  

helix-admin --zkSvr <zkAddress>

CREATE CLUSTER

--addCluster <clusterName>

ADD NODE

--addNode <clusterName instanceId(host:port)>

CONFIGURE RESOURCE

--addResource <clusterName resourceName partitions statemodel>

REBALANCE => SET IDEALSTATE

--rebalance <clusterName resourceName replicas>
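The same configure steps can also be scripted against Helix's Java admin API; a minimal sketch assuming the ZKHelixAdmin class, with illustrative cluster, node, and resource names:

import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.InstanceConfig;

public class SetupCluster {
  public static void main(String[] args) {
    ZKHelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // CREATE CLUSTER
    admin.addCluster("MyCluster");

    // ADD NODE: one InstanceConfig per participant, id = host_port
    InstanceConfig node = new InstanceConfig("localhost_12918");
    node.setHostName("localhost");
    node.setPort("12918");
    admin.addInstance("MyCluster", node);

    // CONFIGURE RESOURCE: resource name, number of partitions, state model
    admin.addResource("MyCluster", "MyDB", 6, "MasterSlave");

    // REBALANCE => compute and write the IDEALSTATE for 2 replicas per partition
    admin.rebalance("MyCluster", "MyDB", 2);
  }
}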

33  

zookeeper  view  IDEALSTATE  

Step  3:  Run  

34  

START CONTROLLER

run-helix-controller --zkSvr localhost:2181 --cluster MyCluster

START  PARTICIPANT  
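A hedged sketch of starting a participant, assuming the HelixManagerFactory API; the instance name and addresses are illustrative, and the factory is the one sketched on the participant plug-in slide:

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;

public class RunParticipant {
  public static void main(String[] args) throws Exception {
    // Connect this node to the cluster as a PARTICIPANT
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "localhost_12918", InstanceType.PARTICIPANT, "localhost:2181");

    // Register the participant plug-in code for the MasterSlave state model
    manager.getStateMachineEngine().registerStateModelFactory(
        "MasterSlave", new MasterSlaveStateModelFactory());

    // From here on, the controller drives this node's state transitions
    manager.connect();
  }
}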

zookeeper  view  

35  

Znode  content  

CURRENT  STATE   EXTERNAL  VIEW  

36  

Spectator plug-in code

37  
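The spectator plug-in code is not reproduced in this text; a hedged sketch of the usual pattern, assuming Helix's RoutingTableProvider (resource, partition, and instance names are illustrative):

import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.model.InstanceConfig;
import org.apache.helix.spectator.RoutingTableProvider;

public class RouterSpectator {
  public static void main(String[] args) throws Exception {
    // Connect as a SPECTATOR: observes the EXTERNAL VIEW but owns no partitions
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "router", InstanceType.SPECTATOR, "localhost:2181");
    manager.connect();

    // RoutingTableProvider keeps an up-to-date partition -> instance mapping
    RoutingTableProvider routingTable = new RoutingTableProvider();
    manager.addExternalViewChangeListener(routingTable);

    // Partition routing logic: send writes for partition MyDB_0 to its current MASTER
    List<InstanceConfig> masters = routingTable.getInstances("MyDB", "MyDB_0", "MASTER");
    System.out.println("Masters for MyDB_0: " + masters);
  }
}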

38  

Helix execution modes

39  

IDEALSTATE

P1: N1:M, N2:S
P2: N2:M, N3:S
P3: N3:M, N1:S

Configuration

• 3 nodes  • 3 partitions  • 2 replicas  • StateMachine

Constraints  

• 1 Master  • 1 Slave  • Even distribution

Replica  placement  

Replica    State  

Execution modes

•  Who  controls  what    

40  

Execution mode:      AUTO REBALANCE   AUTO    CUSTOM
Replica placement:   Helix            App     App
Replica state:       Helix            Helix   App

Auto rebalance vs. Auto

AUTO  REBALANCE   AUTO  

41  

In action - Auto rebalance

MasterSlave p=3 r=2 N=3

Node 1   Node 2   Node 3

P1:M   P2:M   P3:M  

P2:S   P3:S   P1:S  

Auto - MasterSlave p=3 r=2 N=3

42  

Node  1   Node  2   Node  3  

P1:O   P2:M   P3:M  

P2:O   P3:S   P1:S  

P1:M   P2:S  

Node  1   Node  2   Node  3  

P1:M   P2:M   P3:M  

P2:S   P3:S   P1:M  

Node  1   Node  2   Node  3  

P1:M   P2:M   P3:M  

P2:S   P3:S   P1:S  

On failure: only change states to satisfy constraints

On  failure:  Auto  create  replica    and  assign  state  

Custom  mode:  example  

43  

44  

Custom mode: handling failure - custom code invoker

•  Code that lives on all nodes, but is active in only one place
•  Invoked when a node joins/leaves the cluster
•  Computes the new idealstate (a sketch of publishing it follows the transition example below)
•  The Helix controller fires the transitions without violating constraints

Idealstate before:
P1: N1:M, N2:S
P2: N2:M, N3:S
P3: N3:M, N1:S

Idealstate after (recomputed by the custom code):
P1: N1:S, N2:M
P2: N2:M, N3:S
P3: N3:M, N1:S

Transitions:
1. N1: M -> S
2. N2: S -> M

Executing 1 & 2 in parallel would violate the single-master constraint.
Helix sends 2 after 1 is finished.
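A hedged sketch of how the custom code invoker could publish its recomputed ideal state, assuming HelixAdmin's getResourceIdealState/setResourceIdealState and the underlying ZNRecord map fields; partition and node names follow the example above:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class CustomModeRebalance {
  public static void publishNewIdealState(String zkAddress) {
    HelixAdmin admin = new ZKHelixAdmin(zkAddress);
    IdealState idealState = admin.getResourceIdealState("MyCluster", "MyDB");

    // Recomputed mapping: swap the roles of N1 and N2 for partition P1
    // (assumes the per-partition map field already exists in the record)
    idealState.getRecord().getMapField("P1").put("N1", "SLAVE");
    idealState.getRecord().getMapField("P1").put("N2", "MASTER");

    // The controller diffs this against the current state and sequences the
    // transitions (N1 M->S before N2 S->M) without violating constraints.
    admin.setResourceIdealState("MyCluster", "MyDB", idealState);
  }
}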

45  

Separate
•  At least 2 separate controller processes to avoid SPOF
•  Only one controller active
•  Extra process to manage
•  Recommended for large clusters
•  Upgrading the controller is easy

Embedded
•  Controller embedded within each participant
•  Only one controller active
•  No extra process to manage
•  Suitable for small clusters
•  Upgrading the controller is costly
•  Participant health impacts the controller

Controller  deployment  

46  

Controller  fault  tolerance  

[Diagram: Controller 1 (LEADER), Controller 2 (STANDBY), Controller 3 (STANDBY), all connected to Zookeeper]

Zookeeper ephemeral-node based leader election decides the controller leader.
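For illustration only, a minimal sketch of ephemeral-znode leader election of the kind described here, using the raw ZooKeeper client; the znode path is illustrative rather than Helix's actual layout:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ControllerLeaderElection {
  // Returns true if this controller won leadership for the cluster.
  static boolean tryBecomeLeader(ZooKeeper zk, String controllerId) throws Exception {
    try {
      // Ephemeral node: removed automatically when this controller's session dies,
      // so a STANDBY controller can take over as the new LEADER.
      zk.create("/MyCluster/CONTROLLER/LEADER", controllerId.getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      return true;   // LEADER
    } catch (KeeperException.NodeExistsException e) {
      return false;  // STANDBY: watch the znode and retry when it disappears
    }
  }
}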

47  

Controller  fault  tolerance  

[Diagram: Controller 1 (OFFLINE), Controller 2 (LEADER), Controller 3 (STANDBY), all connected to Zookeeper]

When the leader fails, another controller becomes the new leader.

48  

Managing  the  controllers  

49  

Scaling  the  controller:    Leader  Standby  Model  

[Diagram: three controllers managing six clusters; for each cluster one controller is the LEADER (L) and the others are STANDBY (S) or OFFLINE (O)]

50  

Scaling  the  controller:  Failure  

[Diagram: the same layout after a controller failure; a STANDBY controller takes over as LEADER for the affected clusters]

Outline  

•  Introduction  •  Architecture  •  How to use Helix  •  Tools  •  Helix usage

51  

Tools  

•  Chaos monkey  •  Data-driven testing and debugging  •  Rolling upgrade  •  On-demand task scheduling and intra-cluster messaging

•  Health  monitoring  and  alerts  

52  

Data-driven testing

•  Instrument – Zookeeper, controller, and participant logs

•  Simulate – Chaos monkey
•  Analyze – invariants:
   •  Respect state transition constraints
   •  Respect state count constraints
   •  And so on

•  Debugging made easy – reproduce the exact sequence of events

 53  

Structured log file - sample

timestamp partition instanceName sessionId state

1323312236368 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236426 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236530 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236530 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236561 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

1323312236561 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236685 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

1323312236685 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236685 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236719 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

1323312236719 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236814 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

No more than R=2 slaves

Time    State    Number of Slaves    Instance

42632 OFFLINE 0 10.117.58.247_12918

42796 SLAVE 1 10.117.58.247_12918

43124 OFFLINE 1 10.202.187.155_12918

43131 OFFLINE 1 10.220.225.153_12918

43275 SLAVE 2 10.220.225.153_12918

43323 SLAVE 3 10.202.187.155_12918

85795 MASTER 2 10.220.225.153_12918

How long was it out of whack?

Number of Slaves    Time    Percentage

0   1082319   0.5  

1   35578388   16.46  

2   179417802   82.99  

3   118863   0.05  

83% of the time, there were 2 slaves per partition; 93% of the time, there was 1 master per partition.

Number  of  Masters   Time   Percentage  

0    15490456     7.164960359
1    200706916    92.83503964

Invariant 2: State transitions

FROM      TO        COUNT

MASTER SLAVE 55

OFFLINE DROPPED 0

OFFLINE SLAVE 298

SLAVE MASTER 155

SLAVE OFFLINE 0

Outline  

•  Introduction  •  Architecture  •  How to use Helix  •  Tools  •  Helix usage

58  

Helix  usage  at  LinkedIn  

 

 

59  

Espresso  

In  flight  

•  Apache S4 – Partitioning, co-location – Dynamic cluster expansion

•  Archiva – Partitioned, replicated file store – Rsync-based replication

•  Others in evaluation – Bigtop

60  

Auto-scaling software deployment tool

61  

•  States  •  Download, Configure, Start  •  Active, Standby

•  Constraint for each state  •  Download < 100  •  Active 1000  •  Standby 100

[State diagram: states Offline, Download, Configure, Start, Active, Standby; constraints: Download < 100, Active 1000, Standby 100]
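Expressed with the same state-model builder shown earlier, the deployment tool's model and constraints could look roughly like this; the state-model name and the transition edges are inferred from the pipeline on the slide, and the bounds are the numbers quoted above:

import org.apache.helix.model.StateModelDefinition;

StateModelDefinition.Builder builder =
    new StateModelDefinition.Builder("DeploymentStateModel");

// States with priority, plus the per-state constraints from the slide
builder.addState("ACTIVE", 1);
builder.addState("STANDBY", 2);
builder.addState("START", 3);
builder.addState("CONFIGURE", 4);
builder.addState("DOWNLOAD", 5);
builder.addState("OFFLINE");
builder.initialState("OFFLINE");

builder.upperBound("DOWNLOAD", 100);  // "Download < 100"
builder.upperBound("ACTIVE", 1000);   // "Active 1000"
builder.upperBound("STANDBY", 100);   // "Standby 100"

// Legal transitions, following the pipeline on the slide (assumed ordering)
builder.addTransition("OFFLINE", "DOWNLOAD");
builder.addTransition("DOWNLOAD", "CONFIGURE");
builder.addTransition("CONFIGURE", "START");
builder.addTransition("START", "ACTIVE");
builder.addTransition("ACTIVE", "STANDBY");
builder.addTransition("STANDBY", "ACTIVE");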

Summary  

•  Helix: a generic framework for building distributed systems

•  Modifying/enhancing system behavior is easy – abstraction and modularity are key

•  Simple programming model: declarative state machine

62  

Roadmap  

•  Features  •  Span multiple data centers  •  Automatic load balancing  •  Distributed health monitoring  •  YARN generic application master for real-time apps

•  Standalone Helix agent

 

64  

website:  http://helix.incubator.apache.org

user:  user@helix.incubator.apache.org

dev:  dev@helix.incubator.apache.org

twitter:  @apachehelix, @kishoreg1980
