
Page 1: Akka Cluster in Production

Akka Cluster in Production

Page 2: Akka Cluster in Production

• Reactive?
• Actors?
• Akka?
• Clustering?
• Online shopping?
• Online payment?

Page 3: Akka Cluster in Production

Device Ident

• Detect fraudulent devices in real time
• Retail, telco, finance, …

• Integrated via snippet into the webshop
• Analyses client devices
• Results obtained via REST API
• > 300 sites
• > 100M devices
• 24/7, 3 nines

Page 4: Akka Cluster in Production

Reactive Manifesto

Responsive

Message Driven

Elastic

Resilient

Page 5: Akka Cluster in Production

Actors

"The actor model in computer science is a mathematical model of concurrent computation that treats 'actors' as the universal primitives of concurrent computation. In response to a message that it receives, an actor can:
• make local decisions,
• create more actors,
• send more messages, and
• determine how to respond to the next message received.
Actors may modify private state, but can only affect each other through messages (avoiding the need for any locks)."

https://en.wikipedia.org/wiki/Actor_model

Page 6: Akka Cluster in Production

Akka Cluster Components

Remoting

Clustering

Distributed PubSub

Cluster Singleton

Sharding

Distributed Data

Persistence

Cluster Client

Cluster Aware Routers

Actors

Page 7: Akka Cluster in Production

Akka Remote

• Replace the LocalActorRefProvider with the RemoteActorRefProvider

• ActorRef: akka://systemName/user/parent/actorName

• Remote ActorRef: akka.tcp://systemName@hostName:1234/user/parent/actorName

• Look up remote actors
• Start remote actors
• Cluster-aware routing
• Death watch
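A minimal configuration sketch for enabling classic remoting; the hostname and port here are placeholders for your environment:

```hocon
akka {
  actor {
    # Swap the default LocalActorRefProvider for the remote one
    provider = "akka.remote.RemoteActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = "127.0.0.1"  # externally reachable address
      port = 2552
    }
  }
}
```

With this in place, actors on this system are addressable as `akka.tcp://systemName@127.0.0.1:2552/user/...`.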

Page 8: Akka Cluster in Production

Failure Detector

• The Phi Accrual Failure Detector: phi = -log10(1 - F(timeSinceLastHeartbeat))
• Each node is monitored by a small number of other nodes, determined using a hash ring
• Output: confidence that a node is unreachable
• Also notices when a node becomes reachable again
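As a rough illustration of the formula above, here is a self-contained sketch (not Akka's implementation) that models heartbeat inter-arrival times as normally distributed and computes phi from the time since the last heartbeat:

```scala
// Illustrative sketch of the phi calculation, NOT Akka's implementation:
// heartbeat inter-arrival times are modelled as normally distributed and
// phi = -log10(1 - F(timeSinceLastHeartbeat)).
object PhiSketch {

  // Abramowitz–Stegun approximation of the error function
  private def erf(x: Double): Double = {
    val t = 1.0 / (1.0 + 0.3275911 * math.abs(x))
    val poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
      - 0.284496736) * t + 0.254829592) * t
    val y = 1.0 - poly * math.exp(-x * x)
    if (x >= 0) y else -y
  }

  // CDF of a normal distribution with the given mean and standard deviation
  private def normalCdf(x: Double, mean: Double, stdDev: Double): Double =
    0.5 * (1.0 + erf((x - mean) / (stdDev * math.sqrt(2.0))))

  // phi grows as the current silence becomes less and less likely
  // given the observed heartbeat history
  def phi(timeSinceLastHeartbeat: Double, mean: Double, stdDev: Double): Double =
    -math.log10(1.0 - normalCdf(timeSinceLastHeartbeat, mean, stdDev))
}
```

At exactly the mean inter-arrival time, F = 0.5 and phi ≈ 0.3; the longer the silence relative to the observed history, the higher phi climbs.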

Page 9: Akka Cluster in Production

Akka Cluster

• Cluster membership managed using a Gossip Protocol
• Dynamo-based system
• Subscribe to cluster state events
• Roles
• Restrictions on the number of nodes possible (also per role)

• When gossip convergence is reached, a leader can be determined deterministically
• Head of the list of nodes in alphanumeric order
• Leader joins / removes members
• Leader can auto-down members

• Join manually or to seed nodes
• The first seed node joins itself
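A seed-node configuration sketch; the system name, hosts, and ports are placeholders:

```hocon
akka {
  actor.provider = "akka.cluster.ClusterActorRefProvider"
  remote.netty.tcp {
    hostname = "127.0.0.1"
    port = 2551
  }
  cluster {
    seed-nodes = [
      "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "akka.tcp://ClusterSystem@127.0.0.1:2552"
    ]
  }
}
```

The first node in the list joins itself if no cluster exists yet; the others join whichever seed node answers first.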

Page 10: Akka Cluster in Production

(Diagram: gossip rounds spreading membership information between numbered cluster nodes.)

Page 11: Akka Cluster in Production

case class Gossip(
  members: immutable.SortedSet[Member], // sorted set of members with their status, sorted by address
  overview: GossipOverview = GossipOverview(),
  version: VectorClock = VectorClock()) // vector clock version

case class GossipOverview(
  seen: Set[UniqueAddress] = Set.empty,
  reachability: Reachability = Reachability.empty)

case class VectorClock(
  versions: TreeMap[VectorClock.Node, Long] = TreeMap.empty[VectorClock.Node, Long]) {

  /**
   * Compare two vector clocks. The outcome will be one of the following:
   * {{{
   *   1. Clock 1 is SAME (==) as Clock 2 iff for all i c1(i) == c2(i)
   *   2. Clock 1 is BEFORE (<) Clock 2 iff for all i c1(i) <= c2(i)
   *      and there exists a j such that c1(j) < c2(j)
   *   3. Clock 1 is AFTER (>) Clock 2 iff for all i c1(i) >= c2(i)
   *      and there exists a j such that c1(j) > c2(j)
   *   4. Clock 1 is CONCURRENT (<>) to Clock 2 otherwise
   * }}}
   */
  def compareTo(that: VectorClock): Ordering =
    compareOnlyTo(that, FullOrder)
}
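The four orderings in the comment above can be demonstrated with a small self-contained sketch (clocks as node-to-counter maps, not Akka's VectorClock class):

```scala
// Self-contained illustration of vector clock ordering (not Akka's class).
object VectorClockDemo {
  sealed trait Ordering
  case object Same extends Ordering
  case object Before extends Ordering
  case object After extends Ordering
  case object Concurrent extends Ordering

  // Compare two clocks represented as node -> counter maps;
  // entries missing from a clock count as 0.
  def compare(c1: Map[String, Long], c2: Map[String, Long]): Ordering = {
    val nodes = c1.keySet ++ c2.keySet
    val anyLess    = nodes.exists(n => c1.getOrElse(n, 0L) < c2.getOrElse(n, 0L))
    val anyGreater = nodes.exists(n => c1.getOrElse(n, 0L) > c2.getOrElse(n, 0L))
    (anyLess, anyGreater) match {
      case (false, false) => Same       // equal on every node
      case (true, false)  => Before     // c1 <= c2 everywhere, < somewhere
      case (false, true)  => After      // c1 >= c2 everywhere, > somewhere
      case (true, true)   => Concurrent // conflicting concurrent updates
    }
  }
}
```

The Concurrent case is what gossip convergence has to resolve: neither clock supersedes the other, so the states must be merged.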

Page 12: Akka Cluster in Production

Akka Cluster Lifecycle

Page 13: Akka Cluster in Production

Cluster Singletons

• E.g. single point of entry, centralized routing logic, …
• Lives on the oldest node
• ClusterSingletonManager started on each node
• ClusterSingletonProxy for accessing the current singleton
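A wiring sketch for the two components, assuming the akka-cluster-tools singleton API; the actor class and names here are placeholders. It needs a running cluster, so it is illustrative rather than standalone-runnable:

```scala
import akka.actor.{ActorSystem, PoisonPill, Props}
import akka.cluster.singleton._

val system = ActorSystem("ClusterSystem")

// Started on every node; actually runs the singleton only on the oldest one.
system.actorOf(
  ClusterSingletonManager.props(
    singletonProps = Props[PaymentRouter],          // placeholder actor class
    terminationMessage = PoisonPill,
    settings = ClusterSingletonManagerSettings(system)),
  name = "paymentRouter")

// Started wherever the singleton is used; routes to its current location.
val proxy = system.actorOf(
  ClusterSingletonProxy.props(
    singletonManagerPath = "/user/paymentRouter",
    settings = ClusterSingletonProxySettings(system)),
  name = "paymentRouterProxy")

proxy ! "route this"  // delivered to the singleton, wherever it currently lives
```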

Page 14: Akka Cluster in Production

Cluster Singleton

(Diagram: Node1, Node2, and Node3 each run a Singleton Manager and a Singleton Proxy; the Singleton instance itself runs on only one node.)

Page 15: Akka Cluster in Production

Cluster Singleton

(Diagram: the same topology after the oldest node changes; the Singleton Managers hand the singleton over, and the proxies resolve to its new location.)

Page 16: Akka Cluster in Production

Cluster Singletons

• E.g. single point of entry, centralized routing logic, …
• Lives on the oldest node
• ClusterSingletonManager started on each node
• ClusterSingletonProxy for accessing the current singleton
• Caveats:

• Single point of bottleneck
• Must recover state on migration
• In case of split brain, multiple singletons

Page 17: Akka Cluster in Production

Distributed PubSub

• DistributedPubSubMediator started on all nodes
• Subscriptions are gossiped, eventually consistent
• Modes: Publish, Group Publish, Send
• Used e.g. for cluster-wide config, chat systems, …
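A subscribe/publish sketch using the mediator; the topic name and actor are placeholders, and a running cluster is assumed:

```scala
import akka.actor.{Actor, ActorRef}
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.{Publish, Subscribe}

// Subscriber side: register with the local mediator for a topic.
class ConfigSubscriber extends Actor {
  val mediator: ActorRef = DistributedPubSub(context.system).mediator
  mediator ! Subscribe("config-updates", self)   // topic name is a placeholder

  def receive = {
    case update: String => // apply the cluster-wide config update
  }
}

// Publisher side, from any node in the cluster:
//   mediator ! Publish("config-updates", "new-config")
```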

Page 18: Akka Cluster in Production

Cluster Sharding

• Distribute work
• Workload partitioned by a shard key derived from the message
• Messages must be serializable
• Each node is responsible for n shards, and each shard is allocated to one node
• ShardRegion is the entry point for messages and controls the workers
• The ShardCoordinator singleton assigns shards
• Shards are distributed by number of workers by default
• Shards migrate for rebalancing or on failure
• Shard assignments can be persisted
• Running workers per shard can be remembered

• Workers must step down gracefully
• Workers must persist state if they need it after migration
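The shard-key derivation can be sketched with the numbers used on the following slides, where e.g. key 1234 maps to shard 12 (here simply key / 100; real extractors are application-specific, often a hash of the entity id modulo a fixed shard count):

```scala
// Sketch of deriving entity and shard ids from a message key,
// mirroring the examples in this deck (key 1234 -> shard 12).
object ShardKeys {
  final case class Envelope(key: Int, payload: String)

  // The entity id identifies the individual worker actor.
  def extractEntityId(msg: Envelope): (String, String) =
    (msg.key.toString, msg.payload)

  // The shard id groups entities; all keys 1200-1299 land in shard 12.
  def extractShardId(msg: Envelope): String =
    (msg.key / 100).toString
}
```

The important property is that the same key always yields the same shard id on every node, so any ShardRegion can route the message consistently.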

Page 19: Akka Cluster in Production

Cluster Sharding

(Diagram: Shard Regions on Node1 and Node2, a Shard Region Proxy on Node3, and the Shard Coordinator singleton; no shards allocated yet.)

Page 20: Akka Cluster in Production

Cluster Sharding

(Diagram: the message "Hello World" with key 1234 arrives at a Shard Region, which derives shard id 12 from the key.)

Page 21: Akka Cluster in Production

Cluster Sharding

(Diagram: the Shard Region does not know shard 12 yet and asks the Shard Coordinator where it lives: "? 12".)

Page 22: Akka Cluster in Production

Cluster Sharding

(Diagram: the Shard Coordinator allocates shard 12 to Node1 and answers "! Node1"; the allocation is recorded.)

Page 23: Akka Cluster in Production

Cluster Sharding

(Diagram: the Shard Region on Node1 creates shard 12 and starts Entity 1234 inside it.)

Page 24: Akka Cluster in Production

Cluster Sharding

(Diagram: the buffered message "Hello World" is delivered to Entity 1234.)

Page 25: Akka Cluster in Production

Cluster Sharding

(Diagram: another "Hello World" for key 1234 arrives at a different Shard Region, which does not yet know shard 12 and also asks the Coordinator: "? 12".)

Page 26: Akka Cluster in Production

Cluster Sharding

(Diagram: the Coordinator again answers "! Node1"; the asking Shard Region caches the allocation.)

Page 27: Akka Cluster in Production

Cluster Sharding

(Diagram: all Shard Regions now know that shard 12 lives on Node1.)

Page 28: Akka Cluster in Production

Cluster Sharding

(Diagram: the message "Hello World" is routed directly to Entity 1234 on Node1, without asking the Coordinator.)

Page 29: Akka Cluster in Production

Cluster Sharding

(Diagram: the message "Happy Day" with key 1299 also maps to shard 12 and is routed directly to Node1.)

Page 30: Akka Cluster in Production

Cluster Sharding

(Diagram: Entity 1299 is created in shard 12 next to Entity 1234 and receives "Happy Day".)

Page 31: Akka Cluster in Production

Cluster Sharding

(Diagram: over time more shards are allocated, e.g. shard 9 on Node2 with entities 0901 and 0902, and shard 3 on Node1 with entity 345, alongside entities 1234 and 1299 in shard 12 on Node1.)

Page 32: Akka Cluster in Production

Cluster Sharding

(Diagram: Node1 fails. Without remembered shards, the restarted Shard Coordinator only retains the allocation of shard 9 on Node2; the shards and entities that lived on Node1 are gone until new messages recreate them.)

Not remembering shards

Page 33: Akka Cluster in Production

Cluster Sharding

(Diagram: the same failure with distributed-data state: the Coordinator recovers the shard allocations from ddata, but the entities from Node1 still have to be recreated by new messages.)

Not remembering shards / ddata https://github.com/akka/akka/issues/19003

Page 34: Akka Cluster in Production

Cluster Sharding

(Diagram: with remembered shards and persistence, after Node1 fails its shards 12 and 3 are reallocated to Node2 and entities 1234, 345, and 1299 are automatically recreated there.)

Remembering shards / persistence

Page 35: Akka Cluster in Production

Sharding and Persistence

• Persist ShardCoordinator state:
  akka.cluster.sharding {
    state-store-mode = "persistence"
    journal-plugin-id = "akka-contrib-mongodb-persistence-journal-sharding"
    snapshot-plugin-id = "akka-contrib-mongodb-persistence-snapshot-sharding"
  }

• or: akka.cluster.sharding.state-store-mode = "ddata"

• Remember shard entities:
  akka.cluster.sharding.remember-entities = "on"

• Step-down of workers:
  context.parent ! Passivate(StopMessage)

Page 36: Akka Cluster in Production

Distributed Data

• KV store based on Conflict-free Replicated Data Types (CRDTs)
  • Counters: GCounter, PNCounter
  • Sets: GSet, ORSet
  • Maps: ORMap, ORMultiMap, LWWMap, PNCounterMap
  • Registers: LWWRegister, Flag

• Not intended for Big Data: in memory, full state replicated
• Start the ddata.Replicator

val Counter1Key = PNCounterKey("counter1")
replicator ! Update(Counter1Key, PNCounter(), WriteLocal)(_ + 1)

val readFrom3 = ReadFrom(n = 3, timeout = 1.second)
replicator ! Get(Counter1Key, readFrom3)

• Consistency levels:
  • ReadLocal / WriteLocal
  • ReadFrom / WriteTo
  • ReadMajority / WriteMajority
  • ReadAll / WriteAll

• Can subscribe to changes
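The convergence property of these data types can be illustrated with a self-contained grow-only counter (the same idea as GCounter, but not Akka's implementation): each node increments only its own slot, and merging takes the per-node maximum, so replicas agree regardless of the order in which updates arrive.

```scala
// Minimal G-Counter CRDT sketch (illustrative, not Akka's implementation).
object GCounterDemo {
  final case class GCounter(state: Map[String, Long] = Map.empty) {
    // Each node only ever increments its own slot.
    def increment(node: String): GCounter =
      copy(state.updated(node, state.getOrElse(node, 0L) + 1))

    // The counter value is the sum over all node slots.
    def value: Long = state.values.sum

    // Merge takes the per-node maximum: commutative, associative,
    // and idempotent, so replicas converge under gossip.
    def merge(that: GCounter): GCounter =
      GCounter((state.keySet ++ that.state.keySet).map { n =>
        n -> math.max(state.getOrElse(n, 0L), that.state.getOrElse(n, 0L))
      }.toMap)
  }
}
```

Because merge is a maximum rather than an addition, replaying or reordering the same gossip messages never double-counts an increment.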

Page 37: Akka Cluster in Production

Putting everything together

Page 38: Akka Cluster in Production

(Diagram: a webshop at http://shop.com with the Device Ident snippet embedded.)

Page 39: Akka Cluster in Production

(Diagram: the shop talks to the Device Ident cluster, which distributes work via Sharding.)

Page 40: Akka Cluster in Production

(Diagram: configuration is distributed to all cluster nodes; a shop back office manages the config.)

Page 41: Akka Cluster in Production

(Diagram: the shop back office connects to the cluster through a proxy; expensive components sit behind it.)

Page 42: Akka Cluster in Production

Live Example...

• Config
• Scaling horizontally
• Cluster client
• Multi-JVM test
• Singleton monitoring throughput and lifecycle

Page 43: Akka Cluster in Production

Caveats and Lessons learned

• Remoting setup
  • TLS certificates during a rolling update

  • Difficult to test whether new settings work while the old ones are still in place
  • In our case: export-restricted crypto
  • Not too critical if noticed early in the rolling upgrade

• Adjust failure detector settings to your environment
  • Quite strict for us
  • Must accept higher latencies and short interruptions in cloud environments

• Configure internal and external hostnames in containerized / NATed / … environments
  • Hostname and IP are completely different things for Akka!

Page 44: Akka Cluster in Production

Caveats and Lessons learned

• Cluster setup
  • Currently a rather static hardware environment

  • Joining via a list of seed nodes
  • Had a split brain once; needed to carefully restart the right part of the cluster

• Log the cluster state each node sees (or use JMX)!

• Preventing split brain:
  • Maybe disable auto-down
  • Use a split brain resolver
  • Adjust failure detector settings to your environment

• Restart in order youngest to oldest to minimize singleton migrations

Page 45: Akka Cluster in Production

Caveats and Lessons learned

• Sharding
  • Recovery of persistent actors should be planned (it might take time)
  • If the shard coordinator fails to recover, you're doomed

  • Use a separate journal for the internal sharding state
  • It can be cleaned if the cluster is shut down
  • In case of emergency: delete the journal

  • That will allow the ShardCoordinator to recover
  • But it will make the shard allocation state in the ShardRegions inconsistent
  • Fix the state with a rolling restart

Page 46: Akka Cluster in Production

Shutdown

// Play cannot stop accepting requests.
// Fail the healthcheck, so the load balancer removes this node.
GlobalHealthcheck.fail(reason = shuttingDown)

// Migrate all shards to other nodes, stop accepting new ones.
val cluster = Cluster(context.system)
context.watch(region)
region ! ShardRegion.GracefulShutdown

// After shutdown of the ShardRegion, shut down the ActorSystem.
case Terminated(`region`) =>
  cluster.leave(cluster.selfAddress)
  cluster.registerOnMemberRemoved { system.terminate() }

// After shutdown of the ActorSystem, shut down the app.
system.registerOnTermination {
  System.exit(0)
}

Page 47: Akka Cluster in Production

Q&A

https://riskident.com/en/about/jobs/