Beolink.org Data replication Fabrizio Manfredi Furuholmen

OSDC 2014: Fabrizio Manfredi - Data replication


DESCRIPTION

Data replication is a crucial component of distributed services deployed in a multi-data-center environment. The replication scheme needs to be carefully evaluated before it is implemented; a wrong design or misuse in most cases ends in a major service outage. To understand replication, one needs to understand the algorithms behind it. For this reason the session starts by explaining the algorithms most commonly used to deal with the CAP theorem (Consistency, Availability and Partition Tolerance), such as consistent hashing, vector clocks, gossip protocols, Paxos and Raft. The second part of the talk analyzes how products on the market do replication (replication in action), with their advantages and disadvantages. It covers distributed filesystems (Ceph, Tahoe, XtreemFS, ...), distributed databases (DB replication primitives and external tools like Tungsten), NoSQL (Riak, Cassandra, MongoDB, CouchDB) and frameworks for in-house solutions (beardb, open replication, ...). The talk also shows the evaluation methods and testing process for identifying the best solution for your environment.


Page 1: OSDC 2014: Fabrizio Manfredi - Data replication

Beolink.org

Data replication
Fabrizio Manfredi Furuholmen

Page 2: OSDC 2014: Fabrizio Manfredi - Data replication

FOSDEM 2014

Agenda

•  Introduction
    •  Overview
    •  Theorem
    •  Common Pattern
•  Implementation
    •  Filesystem
    •  RDBMS
    •  NoSQL
    •  Framework
•  Example

Page 3: OSDC 2014: Fabrizio Manfredi - Data replication

Data Replication

http://blog.open-e.com/in-a-nutshell-data-replication-snapshots-and-backup/

Page 4: OSDC 2014: Fabrizio Manfredi - Data replication

Data Replication

http://www.dreamstime.com/stock-images-cloud-computing-scalability-reliability-background-concept-word-image34898574

Page 5: OSDC 2014: Fabrizio Manfredi - Data replication


Introduction

Page 6: OSDC 2014: Fabrizio Manfredi - Data replication


World Connection

Page 7: OSDC 2014: Fabrizio Manfredi - Data replication

Main Problem

VS

Page 8: OSDC 2014: Fabrizio Manfredi - Data replication


Main Problem

Page 9: OSDC 2014: Fabrizio Manfredi - Data replication

CAP theorem

According to Brewer’s CAP theorem, it is impossible for any distributed computer system to simultaneously provide all three of Consistency, Availability and Partition Tolerance.

You can’t have the three at the same time and get an acceptable latency.

Page 10: OSDC 2014: Fabrizio Manfredi - Data replication

CAP

RDBMS: ACID
•  Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
•  Consistent: A transaction cannot leave the database in an inconsistent state.
•  Isolated: Transactions cannot interfere with each other.
•  Durable: Completed transactions persist, even when servers restart etc.
-  Strong consistency for transactions is the highest priority
-  Pessimistic
-  Complex mechanisms

NoSQL: BASE
•  Basically Available
•  Soft state
•  Eventual consistency
-  Availability and scaling are the highest priorities
-  Weak consistency
-  Optimistic
-  Best effort
-  Simple and fast

Page 11: OSDC 2014: Fabrizio Manfredi - Data replication

Data Distribution

Business Decision

Page 12: OSDC 2014: Fabrizio Manfredi - Data replication


Start with some Algorithms

Page 13: OSDC 2014: Fabrizio Manfredi - Data replication

Data Distribution

Replication

Data Placement

Data Consistency

System Coordination

Data Transmission

Page 14: OSDC 2014: Fabrizio Manfredi - Data replication

Data Placement

Better distribution = partitioning
Parallel operation = parallel stream/multi core

Page 15: OSDC 2014: Fabrizio Manfredi - Data replication


Data Placement

Page 16: OSDC 2014: Fabrizio Manfredi - Data replication

Data placement by HASH

It isn’t rocket science!
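As a minimal sketch of the simplest form of placement by hash, the snippet below assigns each key to `hash(key) mod N`. The node names, the key names and the choice of MD5 are assumptions made for illustration, not something taken from the talk. It also measures the weakness of this scheme: when a fifth node is added, about four fifths of the keys are expected to change node.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash a key to an integer that is identical across runs and machines."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def place(key: str, nodes: list) -> str:
    """Pick the node for a key with plain hash-modulo placement."""
    return nodes[stable_hash(key) % len(nodes)]


# Hypothetical cluster and keys, used only to measure how many keys move.
four_nodes = ["node-a", "node-b", "node-c", "node-d"]
five_nodes = four_nodes + ["node-e"]
keys = [f"object-{i}" for i in range(10_000)]

before = {k: place(k, four_nodes) for k in keys}
after = {k: place(k, five_nodes) for k in keys}
moved_fraction = sum(1 for k in keys if before[k] != after[k]) / len(keys)
print(f"{moved_fraction:.0%} of keys changed node after adding a fifth node")
```

A key stays put only when its hash gives the same remainder modulo 4 and modulo 5, which is why most of the data has to move on every resize.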

Page 17: OSDC 2014: Fabrizio Manfredi - Data replication

Data Distribution

http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html

Consistent HASH

Chord

Space based/multi-dimensional

Page 18: OSDC 2014: Fabrizio Manfredi - Data replication

Data placement

http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/

Vnode based
Proximity based

Replication
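The vnode-based placement and replication named on this slide can be sketched as a consistent hash ring. This is an illustrative sketch, not the implementation of any product mentioned in the talk: the class name `HashRing`, the `vnodes` default and the node names are assumptions. Each physical node owns several points on the ring, and the replicas of a key are the first `n` distinct nodes found walking clockwise from the key's hash.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper class)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns `vnodes` points on the ring.
        self._points = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def replicas(self, key, n=3):
        """Walk clockwise from the key and collect n distinct physical nodes."""
        start = bisect.bisect_right(self._hashes, stable_hash(key))
        found = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in found:
                found.append(node)
                if len(found) == n:
                    break
        return found


keys = [f"object-{i}" for i in range(10_000)]
small = HashRing(["node-a", "node-b", "node-c", "node-d"])
large = HashRing(["node-a", "node-b", "node-c", "node-d", "node-e"])

moved_keys = [k for k in keys if small.replicas(k, 1) != large.replicas(k, 1)]
moved_fraction = len(moved_keys) / len(keys)
print(f"{moved_fraction:.0%} of keys changed primary node")
print(large.replicas("object-42"))
```

In this sketch only the keys taken over by the new node move, roughly one fifth of them, and every moved key lands on `node-e`. Proximity-based placement would additionally need topology information, which is left out here.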

Page 19: OSDC 2014: Fabrizio Manfredi - Data replication

Data Consistency

http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/

To avoid a full ACID implementation while still guaranteeing consistency, some solutions leave ownership of the algorithm to the client:
-  Read and write quorum
-  Write quorum, read all
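The quorum idea can be sketched as follows, assuming N replicas, a read quorum R and a write quorum W with R + W > N, so that every read set overlaps the latest write set. The class `QuorumStore` and its in-memory replicas are hypothetical. The sketch assumes a single client and no failures, and real systems add repair and conflict resolution on top. Reading all replicas corresponds to R = N.

```python
import random


class QuorumStore:
    """Versioned value replicated on n in-memory replicas (illustration only)."""

    def __init__(self, n, r, w, seed=0):
        if r + w <= n:
            raise ValueError("R + W must exceed N so read and write sets overlap")
        self.n, self.r, self.w = n, r, w
        self.replicas = [(0, None)] * n  # one (version, value) pair per replica
        self.rng = random.Random(seed)

    def read(self):
        # Contact any R replicas and keep the answer with the highest version.
        contacted = self.rng.sample(range(self.n), self.r)
        return max((self.replicas[i] for i in contacted), key=lambda item: item[0])

    def write(self, value):
        # Learn the current version from a read quorum, then update W replicas.
        version = self.read()[0] + 1
        for i in self.rng.sample(range(self.n), self.w):
            self.replicas[i] = (version, value)
        return version


store = QuorumStore(n=5, r=3, w=3)
for i in range(100):
    store.write(f"value-{i}")
latest = store.read()
print(latest)
```

Although each operation touches a random subset of replicas, the overlap guarantees that the read returns the last written value.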


Page 21: OSDC 2014: Fabrizio Manfredi - Data replication

Coordination Protocol

Consensus protocol
Paxos, Raft, etc.
Based on the state machine approach (a technique for converting an algorithm into a fault-tolerant, distributed implementation).

Epidemic (Gossip)
Epidemic: anybody can infect anyone else with equal probability.

Anti-entropy protocols assume that synchronization is performed on a fixed schedule: every node regularly chooses another node at random or by some rule and exchanges database contents, resolving differences.

O(log n)
http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf
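A small simulation can make the logarithmic spreading time of epidemic protocols concrete. It is a sketch under simplifying assumptions that are not from the talk: synchronous rounds, push-only gossip, no message loss, and each informed node contacting one uniformly random node per round. The function name `gossip_rounds` is hypothetical.

```python
import math
import random


def gossip_rounds(n, seed=0):
    """Rounds until all n nodes are informed, starting from node 0."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        # Every informed node pushes the update to one random node.
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds


for n in (10, 100, 1000, 10000):
    print(n, gossip_rounds(n), round(math.log2(n), 1))
```

The informed set can at most double per round, so at least log2(n) rounds are needed. In the simulation the round count grows slowly with n, in line with the O(log n) behaviour quoted on the slide.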

Page 22: OSDC 2014: Fabrizio Manfredi - Data replication

Transmission Protocol

Optimization
-  Reorder
-  Deduplication

Transmission
-  By difference (Merkle tree)
-  Callback
-  Compression
-  Auto correction

Locking
-  Distributed locking
-  Multiversioning
-  …
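Transfer by difference with a Merkle tree can be sketched as below. The helper names `build` and `diff`, the use of SHA-256 and the block contents are assumptions for illustration, and the sketch assumes both sides hold the same number of blocks. Two replicas compare root hashes first and descend only into subtrees whose hashes differ, so only the changed blocks need to be transferred.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build(blocks):
    """Return the tree levels, leaves first; the last level holds the root."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash when a level is odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(a, b):
    """Indices of leaf blocks that differ, visiting only mismatching subtrees."""
    assert len(a[0]) == len(b[0]), "sketch assumes equal block counts"
    suspects = [0] if a[-1] != b[-1] else []
    for depth in range(len(a) - 2, -1, -1):
        children = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if child < len(a[depth]) and a[depth][child] != b[depth][child]:
                    children.append(child)
        suspects = children
    return suspects


local = [f"block-{i}".encode() for i in range(7)]
remote = list(local)
remote[2] = b"changed"
remote[6] = b"also changed"
changed = diff(build(local), build(remote))
print(changed)
```

Identical replicas are detected with a single root comparison, which is what makes this pattern attractive for anti-entropy between large data sets.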


Page 23: OSDC 2014: Fabrizio Manfredi - Data replication


Implementation

Page 24: OSDC 2014: Fabrizio Manfredi - Data replication

Answer … no Answer

Block replication, file -> Distributed file system
Information -> RDBMS
Document, blog, session -> NoSQL
Content with a TTL over a 1m -> Caching system

Page 25: OSDC 2014: Fabrizio Manfredi - Data replication

Distributed Filesystem

DFS is a service that provides a single point of reference and a logical tree structure for file system resources that may be physically located anywhere on the network.

One significant responsibility of a file system is to ensure that, regardless of the actions by programs accessing the data, the structure remains consistent…

Page 26: OSDC 2014: Fabrizio Manfredi - Data replication

Filesystem

Properties of DFS
•  Simple from the application point of view
•  Data consistency

Depending on the solution
•  Partition tolerance
•  Scalability
•  High availability

Page 27: OSDC 2014: Fabrizio Manfredi - Data replication

Filesystem DRBD

DRBD
Replication mode: asynchronous, memory synchronous, synchronous
Transfer optimization: DRBD Proxy

Main Goals
Disk replication, single service availability
Disaster recovery

Page 28: OSDC 2014: Fabrizio Manfredi - Data replication

Filesystem CEPH

Ceph
Data distribution: hash based
Consensus protocol: Paxos for consensus
Write mode: write one, read one; the client is notified when all replicas have been written
Weak consistency with cache pool

OpenStack backend at CERN
1128 OSDs
3 PB
XXX VMs
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern

Main Goals
- Block device/base for other filesystems
- Cloud support, image storage and VM storage

Page 29: OSDC 2014: Fabrizio Manfredi - Data replication

CEPH

Users: > 5000
VMs: > 7000
> 250k VMs spawned

http://www.synnefo.org/resources.html

Page 30: OSDC 2014: Fabrizio Manfredi - Data replication

RDBMS

Properties of RDBMS
•  Quite simple from the application point of view
•  Data consistency

Depending on the solution
•  Low partition tolerance
•  Low scalability
•  Low availability

Page 31: OSDC 2014: Fabrizio Manfredi - Data replication

RDBMS

•  Asynchronous replication
•  Semi-synchronous

Postgres
•  Synchronous
•  Asynchronous

Page 32: OSDC 2014: Fabrizio Manfredi - Data replication

NoSQL

Properties of NoSQL
•  Fast

Depending on the solution
•  Partition tolerance
•  Scalability
•  High availability
•  Simple

Page 33: OSDC 2014: Fabrizio Manfredi - Data replication

NoSQL Performance

http://planetcassandra.org/nosql-performance-benchmarks/

Page 34: OSDC 2014: Fabrizio Manfredi - Data replication

Riak

Geo Replication

Tunable trade-offs for distribution and replication (N, R, W)

Distributed Hash Table

Page 35: OSDC 2014: Fabrizio Manfredi - Data replication

Filesystem over NoSQL

FUSE
In most cases not stable

S3 interface
De facto Internet standard

Page 36: OSDC 2014: Fabrizio Manfredi - Data replication

Filesystem over NoSQL

Wooga

http://www.slideshare.net/wooga/riak-at-woogariak-meetup-sept-2013?qid=4809eca2-8378-4e70-8e75-0db29b635fa5&v=qf1&b=&from_search=3

https://fosdem.org/2014/schedule/event/nyt_cassandra/

Page 37: OSDC 2014: Fabrizio Manfredi - Data replication

Combine different solutions

[Diagram labels: Edge node (Varnish); NoSQL; Local cache; Centralized cache; Info; Storage; DFS; Origin (distributed cache); Local DB; NoSQL. Arrows: decrease the number of requests; increase of the age of the data.]

Page 38: OSDC 2014: Fabrizio Manfredi - Data replication

Framework

Build your own system if you need to … do you really need to?

CERN

Page 39: OSDC 2014: Fabrizio Manfredi - Data replication

Framework

Don’t forget Rsync!

Page 40: OSDC 2014: Fabrizio Manfredi - Data replication

Framework

Replication or Caching?

Page 41: OSDC 2014: Fabrizio Manfredi - Data replication

Build a solution

•  Split in pieces
•  Track version
•  Transfer when needed
•  Transfer the difference
•  Use notification when possible
•  Move data close to computation
•  Move master close to write operation
•  Split counters to avoid deadlock
•  In HTTP, don’t forget ETag and Last-Modified
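The HTTP advice in the last bullet can be sketched as a conditional GET: the server sends an `ETag`, the client returns it in `If-None-Match`, and unchanged content is answered with `304 Not Modified` and no body. The functions `make_etag` and `respond` are hypothetical helpers that model only the decision logic, with no network code. Deriving the ETag from a content hash is one common choice, and weak validators and `*` are not handled here.

```python
import hashlib
from typing import Optional


def make_etag(body: bytes) -> str:
    """Strong ETag derived from the content hash (an assumption of this sketch)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str] = None):
    """Return (status, headers, payload) for a GET on a resource with this body."""
    etag = make_etag(body)
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates:
            return 304, {"ETag": etag}, b""  # client copy is still valid
    return 200, {"ETag": etag}, body


status, headers, payload = respond(b"version 1")
cached_etag = headers["ETag"]
unchanged = respond(b"version 1", cached_etag)
changed = respond(b"version 2", cached_etag)
print(status, unchanged[0], changed[0])
```

This covers "transfer when needed" at the level of whole objects. Transferring only the difference inside an object needs a block-level scheme on top.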

openkad

open-chord

openReplica

Raft

Page 42: OSDC 2014: Fabrizio Manfredi - Data replication


Build a solution

Page 43: OSDC 2014: Fabrizio Manfredi - Data replication

Five pylons

Objects
•  Separation between data and metadata
•  Each element is marked with a revision
•  Each element is marked with a hash

Cache
•  Client side
•  Callback/Notify
•  Persistent

Transmission
•  Parallel operation
•  HTTP-like protocol
•  Compression
•  Transfer by difference

Distribution
•  Resource discovery by DNS
•  Data spread on a multi-node cluster
•  Decentralized
•  Independent clusters
•  Data replication

Security
•  Secure connection
•  Client-side encryption
•  Extended ACL
•  Delegation/Federation
•  Admin delegation

Page 44: OSDC 2014: Fabrizio Manfredi - Data replication

Build a solution

-  Consistent HASH
-  Zmq transport protocol
-  Gossip protocol for failure detection
-  Tunable trade-offs

Pisa is a simple block data replication system for a wide range of nodes
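Gossip-based failure detection, one of the building blocks listed above, can be sketched with heartbeat counters. This is not Pisa's code: the class `Member`, the timeout value and the explicit `now` clock are assumptions, and the ZeroMQ transport is replaced by direct method calls. Each node increments its own counter, gossips its table, and suspects any peer whose counter has not increased for longer than the timeout.

```python
class Member:
    """One node's view: name -> (heartbeat counter, local time it last rose)."""

    def __init__(self, name, now=0.0):
        self.name = name
        self.table = {name: (0, now)}

    def beat(self, now):
        count, _ = self.table[self.name]
        self.table[self.name] = (count + 1, now)

    def merge(self, other_table, now):
        """Apply a gossiped table: keep higher counters, stamp with local time."""
        for name, (count, _) in other_table.items():
            if name not in self.table or count > self.table[name][0]:
                self.table[name] = (count, now)

    def suspects(self, now, timeout):
        return sorted(
            name
            for name, (_, seen) in self.table.items()
            if name != self.name and now - seen > timeout
        )


a, b, c = Member("a"), Member("b"), Member("c")
for t in range(1, 11):
    for node in (a, b):
        node.beat(t)
    if t <= 3:  # node c stops sending heartbeats after t = 3 (simulated crash)
        c.beat(t)
        a.merge(c.table, t)
    b.merge(a.table, t)
    a.merge(b.table, t)

suspected_by_a = a.suspects(now=10, timeout=5)
suspected_by_b = b.suspects(now=10, timeout=5)
print(suspected_by_a, suspected_by_b)
```

Node `b` never talks to `c` directly in this run, yet it reaches the same verdict as `a` because the stale counter for `c` spreads through gossip.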

Page 45: OSDC 2014: Fabrizio Manfredi - Data replication

And …

“There is always a failure waiting around the corner”

Werner Vogels

Page 46: OSDC 2014: Fabrizio Manfredi - Data replication

Thank you http://[email protected]