50
On Data Placement in Distributed Systems LADIS’14 Jo˜ ao Paiva, Lu´ ıs Rodrigues {joao.paiva, ler}@tecnico.ulisboa.pt Instituto Superior T´ ecnico / INESC-ID, Lisboa, Portugal October 23, 2014

On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

On Data Placement in Distributed SystemsLADIS’14

Joao Paiva, Luıs Rodrigues{joao.paiva, ler}@tecnico.ulisboa.pt

Instituto Superior Tecnico / INESC-ID, Lisboa, Portugal

October 23, 2014

Page 2: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

What is Data Placement?

I Deciding how to assign data items to nodes in a distributedsystem in such way that they can be later retrieved.

Page 3: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Data Placement Affects

Data Access Locality

Placing correlated data together can reduce latency of operations

Load Balancing

By knowing the workload, data can be placed in a way to even outthe load across all nodes

Availability

Data can be replicated depending on probability of node failure

Page 4: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Constraints to data placement practicality

I Lack of flexibility limits data placement improvements

I Scalability imposes limits on the flexibility of placement

Example

I Using a centralized directory is flexible, but not scalable

I Using consistent hashing is scalable, but not flexible

Page 5: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Constraints to data placement practicality

I Lack of flexibility limits data placement improvements

I Scalability imposes limits on the flexibility of placement

Example

I Using a centralized directory is flexible, but not scalable

I Using consistent hashing is scalable, but not flexible

Page 6: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Main Goal

Provide better options between

I Strong flexibility, limited scalability

I Limited flexibility, good scalability

Page 7: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Two Scenarios

Internet Scale

I Millions of nodes

I Short term connections

I Asymmetric, inconstant network

Datacenter Scale

I Thousands of nodes

I Stable connections

I Controlled network infrastructure

Page 8: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Two Scenarios: Previous state of the art

Internet Scale

I Scalable solutions with little flexibility, concerned with churn

Datacenter Scale

I Very flexible solutions, concerned with workload changes

Page 9: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Summary of Findings

Improvements for both scenarios:

I More flexible solution for Internet-Scale

I More scalable solution for Datacenter-Scale

Page 10: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Outline

Introduction

Internet Scale

Datacenter Scale

Conclusion

Page 11: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Internet Scale: Rollerchain

I Data assigned to node groups

I Variable replication degree

I Nodes have no fixed position

Page 12: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Variable Replication Degree

Page 13: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Variable Replication Degree

Page 14: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Variable Replication Degree

Page 15: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Variable Replication Degree

Page 16: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Variable Replication Degree

Page 17: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Variable Replication Degree

Page 18: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Variable Replication Degree

Page 19: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Nodes have no fixed position

Page 20: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Nodes have no fixed position

Page 21: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Nodes have no fixed position

Page 22: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Internet Scale: Implementation

Rollerchain

I Gossip-based and structured overlay

I Better churn resilience than state of the art

I Decreased replication costs

”Rollerchain: a DHT for Efficient Replication”, Joao Paiva, Joao Leitao and Luıs

Rodrigues, Symposium on Network Computing and Applications (IEEE NCA), August

2013. (Best student paper award)

Page 23: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Internet Scale: Implementation

Data Placement Policies

I Avoid-Surplus: Reducing monitoring costs

I Resilient Load-Balancing : Improving load balancing

I Supersize-me: Reducing replication costs

Read the paper to know the best policies:

”Policies for Efficient Data Replication in P2P Systems”, Joao Paiva, and Luıs

Rodrigues, International Conference on Parallel and Distributed Systems (IEEE

ICPADS), December 2013.

Page 24: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Internet Scale: Implementation

Data Placement Policies

I Avoid-Surplus: Reducing monitoring costs

I Resilient Load-Balancing : Improving load balancing

I Supersize-me: Reducing replication costs

Read the paper to know the best policies:

”Policies for Efficient Data Replication in P2P Systems”, Joao Paiva, and Luıs

Rodrigues, International Conference on Parallel and Distributed Systems (IEEE

ICPADS), December 2013.

Page 25: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Outline

Introduction

Internet Scale

Datacenter Scale

Conclusion

Page 26: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Datacenter scale: AutoPlacer

System where data placement is defined by combining:

I Consistent hashing for most items

I Precise placement for selected items

Locality-improving round-based algorithm for in-memory data grids

”AutoPlacer: scalable self-tuning data placement in distributed key-value stores”,

J. Paiva, P. Ruivo, P. Romano and L. Rodrigues, International Conference on

Autonomic Computing (USENIX ICAC), June 2013. (Best paper finalist)

”AutoPlacer: scalable self-tuning data placement in distributed key-value stores”,

J. Paiva, P. Ruivo, P. Romano and L. Rodrigues, ACM Transactions on Autonomous

and Adaptive Systems (ACM TAAS)

Page 27: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Datacenter scale: AutoPlacer

System where data placement is defined by combining:

I Consistent hashing for most items

I Precise placement for selected items

Locality-improving round-based algorithm for in-memory data grids

”AutoPlacer: scalable self-tuning data placement in distributed key-value stores”,

J. Paiva, P. Ruivo, P. Romano and L. Rodrigues, International Conference on

Autonomic Computing (USENIX ICAC), June 2013. (Best paper finalist)

”AutoPlacer: scalable self-tuning data placement in distributed key-value stores”,

J. Paiva, P. Ruivo, P. Romano and L. Rodrigues, ACM Transactions on Autonomous

and Adaptive Systems (ACM TAAS)

Page 28: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Algorithm overview

Online, round-based approach:

1. Statistics: Monitor data access to collect hotspots

2. Optimization: Decide placement for hotspots

3. Lookup: Encode / broadcast data placement

4. Move data

Page 29: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Algorithm overview

Online, round-based approach:

1. Statistics: Monitor data access to collect hotspots

2. Optimization: Decide placement for hotspots

3. Lookup: Encode / broadcast data placement

4. Move data

Page 30: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Statistics: Data access monitoring

Key concept: Top-K stream analysis algorithm

I Lightweight

I Sub-linear space usage

I Inaccurate result... But with bounded error

Page 31: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Statistics: Data access monitoring

Key concept: Top-K stream analysis algorithm

I Lightweight

I Sub-linear space usage

I Inaccurate result... But with bounded error

Page 32: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Statistics: Data access monitoring

Key concept: Top-K stream analysis algorithm

I Lightweight

I Sub-linear space usage

I Inaccurate result... But with bounded error

Page 33: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Algorithm overview

Online, round-based approach:

1. Statistics: Monitor data access to collect hotspots

2. Optimization: Decide placement for hotspots

3. Lookup: Encode / broadcast data placement

4. Move data

Page 34: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Optimization

Integer Linear Programming problem formulation:

min∑j∈N

∑i∈O

X ij(crr rij + crwwij) + Xij(cl

r rij + clwwij) (1)

subject to:

∀i ∈ O :∑j∈N

Xij = d ∧ ∀j ∈ N :∑i∈O

Xij ≤ Sj

Inaccurate input:

I Does not provide optimal placement

I Upper-bound on error

Page 35: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Accelerating optimization

1. ILP Relaxed to Linear Programming problem

2. Distributed optimization

LP relaxation

I Allow data item ownership to be in [0− 1] interval

Distributed Optimization

I Partition by the N nodes

I Each node optimizes hotspots mapped to it by CH

I Strengthen capacity constraint

Page 36: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Algorithm overview

Online, round-based approach:

1. Statistics: Monitor data access to collect hotspots

2. Optimization: Decide placement for hotspots

3. Lookup: Encode / broadcast data placement

4. Move data

Page 37: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Lookup: Encoding placement

Probabilistic Associative Array (PAA)

I Associative array interface (keys→values)

I Probabilistic and space-efficient

I Trade-off space usage for accuracy

Page 38: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Probabilistic Associative Array: Usage

Building

1. Build PAA from hotspot mappings

2. Broadcast PAA

Looking up objects

I If item is hotspot, return PAA mapping

I Otherwise, default to Consistent Hashing

Page 39: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Probabilistic Associative Array: Usage

Building

1. Build PAA from hotspot mappings

2. Broadcast PAA

Looking up objects

I If item is hotspot, return PAA mapping

I Otherwise, default to Consistent Hashing

Page 40: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

PAA: Building blocks

I Bloom FilterSpace-efficient membership test (is item in PAA?)

I Decision tree classifierSpace-efficient mapping (where is hotspot mapped to?)

Page 41: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

PAA: Building blocks

I Bloom FilterSpace-efficient membership test (is item in PAA?)

I Decision tree classifierSpace-efficient mapping (where is hotspot mapped to?)

Page 42: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

PAA: Properties

Bloom Filter:

I No False Negatives: never return ⊥ for items in PAA.

I False Positives: match items that it was not supposed to.

Decision tree classifier:

I Inaccurate values (bounded error).

I Deterministic response: deterministic (item→node)mapping.

Page 43: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

PAA: Properties

Bloom Filter:

I No False Negatives: never return ⊥ for items in PAA.

I False Positives: match items that it was not supposed to.

Decision tree classifier:

I Inaccurate values (bounded error).

I Deterministic response: deterministic (item→node)mapping.

Page 44: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

PAA: Properties

Bloom Filter:

I No False Negatives: never return ⊥ for items in PAA.

I False Positives: match items that it was not supposed to.

Decision tree classifier:

I Inaccurate values (bounded error).

I Deterministic response: deterministic (item→node)mapping.

Page 45: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Outline

Introduction

Internet Scale

Datacenter ScaleAutoplacerEvaluation

ConclusionConclusion

Page 46: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Evaluation: Throughput

10

100

1000

0 5 10 15 20 25 30

Tran

sact

ions

per

sec

ond

(TX/

s)

Time (minutes)

100% locality90% locality50% locality

0% localitybaseline

Page 47: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Evaluation: Optimization

Page 48: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Outline

Introduction

Internet Scale

Datacenter Scale

Conclusion

Page 49: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Conclusions

Internet Scale

I More flexible overlay for data placement

I Policies to improve metrics using added flexibility

Datacenter Scale

I Scalable mechanism for data placement

I Algorithm to improve locality through hotspot placement

Page 50: On Data Placement in Distributed Systems - LADIS'14jgpaiva/pubs/ladis14-presentation.pdf · I Supersize-me: Reducing replication costs Read the paper to know the best policies: "Policies

Thank you

[email protected]/joao.paiva