DESCRIPTION
This talk focuses on Cassandra's anti-entropy mechanisms. Jason will discuss the details of read repair, hinted handoff, node repair, and more, as they aid in resolving data that has become inconsistent across nodes. In addition, he'll provide insight into how those techniques are used to ensure data consistency at Netflix.
When Bad Things Happen to Good Data:
Understanding Anti-Entropy in Cassandra
Jason Brown
@jasobrown [email protected]
About me
• Senior Software Engineer @ Netflix
• Apache Cassandra committer
• E-Commerce Architect, Major League Baseball Advanced Media
• Wireless developer (J2ME and BREW)
Maintaining consistent state is hard in a distributed system
CAP theorem works against you
Inconsistencies creep in
• Node is down
• Network partition
• Dropped mutations
• Process crash before commit log flush
• File corruption
Cassandra trades C for AP
Anti-Entropy Overview
• write time
  o tunable consistency
  o atomic batches
  o hinted handoff
• read time
  o consistent reads
  o read repair
• maintenance time
  o node repair
Write Time
Cassandra Writes Basics
• determine all replica nodes in all DCs
• send to replicas in the local DC
• send to one replica node in each remote DC
  o it will forward to its peers
• all respond back to the original coordinator
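The fan-out above can be sketched as a small helper (hypothetical code, not Cassandra's actual implementation): the coordinator messages every replica in its own DC directly, but only one replica per remote DC, which then forwards the mutation to its peers.

```python
def plan_write_messages(replicas_by_dc, local_dc):
    """Return (direct_targets, forwarders) for a mutation.

    replicas_by_dc: dict mapping DC name -> list of replica node names.
    """
    direct, forwarders = [], {}
    for dc, replicas in replicas_by_dc.items():
        if dc == local_dc:
            direct.extend(replicas)      # message each local-DC replica directly
        else:
            leader, *peers = replicas    # only one message crosses the WAN...
            direct.append(leader)
            forwarders[leader] = peers   # ...and that node forwards to its peers
    return direct, forwarders
```

Sending one message per remote DC keeps cross-datacenter traffic proportional to the number of DCs rather than the total replica count.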
Writes - request path
Writes - response path
Writes - Tunable consistency
The coordinator blocks until the specified count of replicas respond
• consistency level
  o ALL
  o EACH_QUORUM
  o LOCAL_QUORUM
  o ONE / TWO / THREE
  o ANY
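A rough sketch of how many replica acks each level requires (simplified; Cassandra's real logic also handles SERIAL levels, pending ranges, and other cases):

```python
def block_for(cl, rf_by_dc, local_dc):
    """Number of replica responses the coordinator blocks for.

    rf_by_dc: dict mapping DC name -> replication factor in that DC.
    """
    quorum = lambda rf: rf // 2 + 1
    if cl == "ALL":
        return sum(rf_by_dc.values())          # every replica everywhere
    if cl == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if cl == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])      # majority in the local DC only
    if cl in ("ONE", "TWO", "THREE"):
        return {"ONE": 1, "TWO": 2, "THREE": 3}[cl]
    if cl == "ANY":
        return 1                               # even a stored hint can satisfy ANY
    raise ValueError(cl)
```

With RF=3 in two DCs, LOCAL_QUORUM needs 2 acks while EACH_QUORUM needs 4, which is why LOCAL_QUORUM is the common choice for latency-sensitive multi-DC deployments.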
Hinted handoff
Save a copy of the write for down nodes, and replay later
hint = target replica + mutation data
Hinted handoff - storing
• on coordinator, store a hint for any nodes not currently 'up'
• if a replica doesn't respond within write_request_timeout_in_ms, store a hint
• max_hint_window_in_ms - the maximum amount of time hints will be generated for a dead host
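The storing rule above can be sketched as follows (an illustration, not Cassandra's implementation; the parameter name mirrors the cassandra.yaml setting): a hint = target replica + mutation is kept only while the target has been down for less than the hint window.

```python
def maybe_store_hint(hints, target, mutation, down_since_ms, now_ms,
                     max_hint_window_ms=3 * 60 * 60 * 1000):
    """Store a hint for an unresponsive replica, unless it has been down
    longer than max_hint_window_ms (in which case repair must fix it)."""
    if now_ms - down_since_ms >= max_hint_window_ms:
        return False                              # window exceeded: drop the hint
    hints.setdefault(target, []).append(mutation) # hint = target replica + mutation
    return True
```

The window cap keeps a long-dead node from accumulating an unbounded backlog of hints on its peers.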
Hinted handoff - replay
• try to send hints to nodes
• runs every ten minutes
• multithreaded (as of 1.2)
• throttleable (KB per second)
Hinted Handoff - R2 down
R2 down, coordinator (R1) stores hint
Hinted handoff - replay
R2 comes back up, R1 plays hints for it
What if coordinator dies?
Atomic Batches
• coordinator stores the incoming mutation on two peers in the same DC
  o deletes it from the peers on successful completion
• peers will replay the batch if it is not deleted
  o runs every 60 seconds
• with 1.2, all mutations use atomic batches
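A loose sketch of the batchlog peer's role as described above (illustrative only): a peer keeps each stored batch until the coordinator deletes it, and replays any copy that survives past the replay interval.

```python
class BatchlogPeer:
    REPLAY_AFTER_MS = 60_000                      # replay cycle from the slide

    def __init__(self):
        self.stored = {}                          # batch_id -> (written_at_ms, mutations)

    def store(self, batch_id, mutations, now_ms):
        self.stored[batch_id] = (now_ms, mutations)

    def delete(self, batch_id):
        # coordinator finished the batch successfully
        self.stored.pop(batch_id, None)

    def batches_to_replay(self, now_ms):
        # anything still here after the interval means the coordinator died mid-batch
        return [bid for bid, (t, _) in self.stored.items()
                if now_ms - t >= self.REPLAY_AFTER_MS]
```

Storing on two peers means the batch survives even if the coordinator and one peer both fail.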
Read Time
Cassandra Reads - setup
• determine endpoints to invoke
  o consistency level vs. read repair
• first data node sends back the full data set; the other nodes return only a digest
• wait until the CL number of nodes have responded
LOCAL_QUORUM read
Pink nodes contain requested row key
Consistent reads
• compare the digests of the returned data sets
• if any mismatch, send the request again to the same CL data nodes
  o this time no digests; each returns the full data set
• compare the full data sets, send updates to out-of-date replicas
• block until those fixes are acknowledged
• return data to the caller
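The digest check and the mismatch resolution can be sketched like this (hypothetical helpers, not Cassandra's code). Here a row set is `{key: (value, write_timestamp)}`, a digest is just a hash of the full data set, and resolution keeps the newest write per key:

```python
import hashlib

def digest(rows):
    # stand-in for the replica-side digest of a result set
    return hashlib.md5(repr(sorted(rows.items())).encode()).hexdigest()

def digests_match(full_data, replica_digests):
    # cheap check: compare the one full data set against the other replicas' digests
    return all(d == digest(full_data) for d in replica_digests)

def resolve(replica_rows):
    # mismatch path: merge the full data sets, newest write timestamp wins per key
    merged = {}
    for rows in replica_rows:
        for key, (value, ts) in rows.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged
```

Digests keep the common case cheap: full data crosses the network only when replicas actually disagree.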
Read Repair
• synchronizes the client-requested data amongst all replicas
• piggy-backs on normal reads, but waits for all replicas to respond asynchronously
• then, just like consistent reads, compares the digests and fixes any mismatches
Read Repair
green lines = LOCAL_QUORUM nodes
blue lines = nodes for read repair
Read Repair - configuration
• setting per column family
• percentage of all calls to the CF
• local DC vs. global chance
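The per-read dice roll above can be sketched as follows (a simplified illustration; the parameter names mirror the column family settings `read_repair_chance` and `dclocal_read_repair_chance`):

```python
import random

def read_repair_decision(read_repair_chance, dclocal_read_repair_chance,
                         rng=random.random):
    """Decide, per read, whether to read-repair globally, locally, or not at all."""
    roll = rng()
    if roll < read_repair_chance:
        return "GLOBAL"      # wait on replicas in every DC
    if roll < read_repair_chance + dclocal_read_repair_chance:
        return "DC_LOCAL"    # wait only on replicas in the local DC
    return "NONE"            # plain consistency-level read
```

Because the check runs per request, even a small chance (say 10%) steadily repairs hot data without adding latency to every read.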
Read repair fixes data that is actually requested,
... but what about data that isn't requested?
Node Repair - introduction
• repairs inconsistencies across all replicas for a given range
• nodetool repair
  o repairs the ranges the node contains
  o one or more column families (within the same keyspace)
  o can choose the local datacenter only (c* 1.2)
• should be part of standard operations maintenance for c*, especially if you delete data
  o ensures tombstones are propagated, avoiding resurrected data
Node Repair - cautions
• repair is IO and CPU intensive
Node Repair - details 1
• determine peer nodes with matching ranges
• triggers a major (validation) compaction on the peer nodes
  o read and generate a hash for every row in the CF
  o add the result to a Merkle tree
  o return the tree to the initiator
Node Repair - details 2
• initiator awaits trees from all nodes
• compares each tree to every other tree
• if any differences exist, the two nodes exchange the conflicting ranges
  o these ranges get written out as new, local sstables
'ABC' node is repair initiator
Nodes sharing range A
Nodes sharing range B
Nodes sharing range C
Five nodes participating in repair
Anti-Entropy wrap-up
• CAP Theorem lives, tradeoffs must be made
• C* contains processes to make diverging data sets consistent
• Tunable controls exist at write and read times, as well on-demand
Thank you!
Q & A time
@jasobrown
Notes from Netflix
• carefully tune RR_chance
• schedule repair operations
• tickler
• store more hints vs. running repair