An Efficient Multi-Dimensional Index for Cloud Data Management
Xiangyu Zhang, Jing Ai, Zhongyuan Wang, Jiaheng Lu, Xiaofeng Meng
School of Information, Renmin University of China

Page 1: An Efficient Multi-Dimensional Index for Cloud Data Management

An Efficient Multi-Dimensional Index for Cloud Data Management

Xiangyu Zhang, Jing Ai, Zhongyuan Wang, Jiaheng Lu, Xiaofeng Meng

School of Information, Renmin University of China

Page 2: An Efficient Multi-Dimensional Index for Cloud Data Management

Outline
• Motivation
• Query Answering on the Cloud
• Related Work
• EMINC: Index the Cloud Efficiently
  ◦ Node Bounding
  ◦ Extended Node Bounding
  ◦ Cost Estimation based Index Update
• Evaluation
• Conclusion & Future Work

Page 4: An Efficient Multi-Dimensional Index for Cloud Data Management

Motivation
• Cloud systems have proven to work very well for web search applications
  ◦ Simple structure, mostly key-value pairs
  ◦ Flexible and efficient for analytic work
• However, they are insufficient for complex data management needs
  ◦ No powerful query language such as SQL
  ◦ Hard to process complex queries
  ◦ Lack of efficient index structures

Page 5: An Efficient Multi-Dimensional Index for Cloud Data Management

Distributed Cloud base?
• BigTable
• HBase

How can we query on attributes other than the primary key?

Page 6: An Efficient Multi-Dimensional Index for Cloud Data Management

Motivation
• As part of our Cloud-based DBMS project, we aim to build an efficient index structure on the Cloud.

Page 8: An Efficient Multi-Dimensional Index for Cloud Data Management

Query Answering in the Cloud
• Fast location of the relevant slave nodes
• Efficient lookup on each slave node

Page 10: An Efficient Multi-Dimensional Index for Cloud Data Management

Related Work
• S. Wu and K.-L. Wu, “An indexing framework for efficient retrieval on the cloud,” IEEE Data Eng. Bull., vol. 32, pp. 75–82, 2009.
• H.-c. Yang and D. S. Parker, “Traverse: Simplified indexing on large map-reduce-merge clusters,” in Proceedings of DASFAA 2009, Brisbane, Australia, April 2009, pp. 308–322.
• M. K. Aguilera, W. Golab, and M. A. Shah, “A practical scalable distributed B-tree,” in Proceedings of VLDB’08, Auckland, New Zealand, August 2008, pp. 598–609.

Page 11: An Efficient Multi-Dimensional Index for Cloud Data Management

Distributed Database
• Data slicing in DDBS
  ◦ Horizontal, vertical, etc.
  ◦ Slice based on conditions
  ◦ Check condition conflict on query processing
• Data distribution on the Cloud is different and could be very complex if expressed as a set of conditions
• Condition check is too expensive

Page 14: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Node Bounding
• Node cube of a table on a slave node
  ◦ Value range of the table on this node

Id  A  B
1   1  1
2   2  2
3   3  4
4   6  7
5   5  10

Node Cube: (1,1), (6,10)
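
The node cube in the table above can be reproduced with a minimal sketch. It assumes that a node cube is the axis-aligned bounding box (per-attribute minimum and maximum) of the indexed attributes A and B; the names `node_cube`, `Point` and `Cube` are illustrative and do not come from the slides.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Cube = Tuple[Point, Point]  # (lower corner, upper corner)


def node_cube(rows: Sequence[Point]) -> Cube:
    """Return the bounding cube of the indexed attribute values on one node."""
    if not rows:
        raise ValueError("cannot bound an empty table")
    # zip(*rows) regroups the values column by column (one column per attribute).
    lower = tuple(min(column) for column in zip(*rows))
    upper = tuple(max(column) for column in zip(*rows))
    return lower, upper


# (A, B) values of the five rows shown on the slide; Id is not indexed.
rows = [(1, 1), (2, 2), (3, 4), (6, 7), (5, 10)]
print(node_cube(rows))  # ((1, 1), (6, 10))
```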

Page 15: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Architecture
• Each leaf node of the R-Tree on the master node corresponds to one node cube
• Use a KD-Tree to maintain the local index on slave nodes
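
As an illustration of the local index, the following is a small pure-Python KD-tree with a range search. It is only a sketch of the general technique, not the authors' implementation; the class and function names (`KDNode`, `build`, `range_search`) are hypothetical, and the sample points reuse the (A, B) values from the node bounding slide.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class KDNode:
    def __init__(self, point: Point, axis: int,
                 left: "Optional[KDNode]", right: "Optional[KDNode]") -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Build a KD-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build(ordered[:mid], depth + 1),
                  build(ordered[mid + 1:], depth + 1))


def range_search(node: Optional[KDNode], lower: Point, upper: Point,
                 out: Optional[List[Point]] = None) -> List[Point]:
    """Collect all points p with lower[i] <= p[i] <= upper[i] on every axis."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(lo <= v <= hi for lo, v, hi in zip(lower, node.point, upper)):
        out.append(node.point)
    split = node.point[node.axis]
    # The left subtree holds values <= split, the right subtree values >= split,
    # so a subtree is skipped when the query range cannot reach it.
    if lower[node.axis] <= split:
        range_search(node.left, lower, upper, out)
    if upper[node.axis] >= split:
        range_search(node.right, lower, upper, out)
    return out


tree = build([(1, 1), (2, 2), (3, 4), (6, 7), (5, 10)])
print(sorted(range_search(tree, (2, 2), (5, 10))))  # [(2, 2), (3, 4), (5, 10)]
```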

Page 16: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Query Processing
• Get the query cube of the query and look it up in the R-Tree to get the relevant data nodes
  ◦ 1<x<2, 3<y<4 => Query Cube: (1,3), (2,4)

[Figure: two cases. YES: the query cube overlaps the node cube. No: the query cube does not overlap the node cube.]
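
The overlap check behind the YES/No cases can be sketched as follows. This is a simplified stand-in: a linear scan over the node cubes replaces the R-Tree lookup on the master node, and the node names and the second node cube are made-up values for illustration only.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Cube = Tuple[Point, Point]  # (lower corner, upper corner)


def overlaps(a: Cube, b: Cube) -> bool:
    """True if two axis-aligned cubes intersect on every dimension."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    # Closed intervals are used, so cubes that only touch count as overlapping.
    # For strict query bounds this can only add a false positive, never miss data.
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi))


def relevant_nodes(query_cube: Cube, node_cubes: Dict[str, Cube]) -> List[str]:
    """Linear-scan stand-in for the R-Tree lookup on the master node."""
    return [name for name, cube in node_cubes.items()
            if overlaps(query_cube, cube)]


node_cubes = {
    "slave-1": ((1, 1), (6, 10)),  # node cube from the node bounding slide
    "slave-2": ((7, 0), (9, 2)),   # hypothetical second node
}
query_cube = ((1, 3), (2, 4))      # from 1<x<2, 3<y<4
print(relevant_nodes(query_cube, node_cubes))  # ['slave-1']
```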

Page 18: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Extended Node Bounding
• Problem with single bounding
  ◦ Bad performance for sparse nodes
• Many queries will be misdirected to such a node

Page 19: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Cube Cutting

Single Node Cube with Low Accuracy

Multiple Node Cubes with High Accuracy

Page 20: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Cube Cutting Methods
• Random cutting
• Equal cutting
• Clustering-based cutting
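
The slides only name the cutting methods, so the following sketch of equal cutting rests on an assumption: the value range of one chosen dimension is split into equal-width intervals, and a separate bounding cube is built for the rows that fall into each interval. The function names and the choice of dimension are illustrative.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Cube = Tuple[Point, Point]  # (lower corner, upper corner)


def node_cube(rows: Sequence[Point]) -> Cube:
    """Bounding cube (per-attribute min and max) of a set of rows."""
    return (tuple(min(c) for c in zip(*rows)),
            tuple(max(c) for c in zip(*rows)))


def volume(cube: Cube) -> float:
    lower, upper = cube
    return prod(hi - lo for lo, hi in zip(lower, upper))


def equal_cut(rows: Sequence[Point], dim: int, parts: int) -> List[Cube]:
    """Assumed 'equal cutting': equal-width intervals along one dimension."""
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if parts <= 1 or hi == lo:
        return [node_cube(rows)]
    width = (hi - lo) / parts
    buckets: Dict[int, List[Point]] = {}
    for r in rows:
        # The maximum value would land in slot `parts`; clamp it into the last slot.
        slot = min(int((r[dim] - lo) / width), parts - 1)
        buckets.setdefault(slot, []).append(r)
    return [node_cube(buckets[s]) for s in sorted(buckets)]


rows = [(1, 1), (2, 2), (3, 4), (6, 7), (5, 10)]
cubes = equal_cut(rows, dim=1, parts=2)
print(cubes)                          # [((1, 1), (3, 4)), ((5, 7), (6, 10))]
print(volume(node_cube(rows)))        # 45
print(sum(volume(c) for c in cubes))  # 9
```

On this toy data the two cubes cover a total area of 9 instead of 45, which is the intuition behind the higher accuracy of multiple node cubes: fewer queries overlap empty space.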

Page 22: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Index Update Strategy
• Index update issues:
  ◦ Cubes may become invalid after certain data updates and then need reconstruction
• Insertion invalidates a cube
  ◦ Create a node cube containing the new data
• For regular maintenance of the index
  ◦ Cost estimation based update strategy
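
The insertion case can be sketched as below. The slide does not say what the new cube looks like, so this sketch assumes a degenerate cube that holds only the inserted record and leaves any merging to the regular, cost-based maintenance; `covers` and `insert_point` are hypothetical names.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Cube = Tuple[Point, Point]  # (lower corner, upper corner)


def covers(cube: Cube, point: Point) -> bool:
    """True if the point lies inside the cube (boundaries included)."""
    lower, upper = cube
    return all(lo <= v <= hi for lo, v, hi in zip(lower, point, upper))


def insert_point(cubes: List[Cube], point: Point) -> bool:
    """Register an inserted record; return True if a new cube was created."""
    if any(covers(cube, point) for cube in cubes):
        return False  # existing cubes stay valid, nothing to send to the master
    # Assumption: the new cube contains only the new record.
    cubes.append((point, point))
    return True


cubes = [((1, 1), (3, 4)), ((5, 7), (6, 10))]
print(insert_point(cubes, (2, 3)))  # False
print(insert_point(cubes, (8, 8)))  # True
print(cubes[-1])                    # ((8, 8), (8, 8))
```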

Page 23: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Cost Estimation Strategy
• Cost of index update:
  ◦ Recalculate cubes on the local node
  ◦ Transfer to the master node and maintain the R-Tree
  ◦ Query performance will be affected
• Benefit of index update:
  ◦ More accurate query directing, less waste

Page 24: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Two Phase Method
After one update:
1. Wait for a time period of deltaT
2. deltaT expires, check if an update is needed

• Determine deltaT
• Check for update
• Assumption: the number of queries to be processed depends on the total size of the node cubes of this node

Page 25: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Phase One
After the previous update:
• benefit = decrement-of-query/time * deltaT
  ◦ We enjoy the benefit of the previous update for a time period of deltaT
• cost = number-of-queries missed
  ◦ Number of queries we could have processed if we had used the time of the previous update to answer queries
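
Requiring benefit > cost gives a lower bound on deltaT: deltaT > cost / (decrement-of-query/time). The sketch below works through that arithmetic. All numbers are invented for illustration, and estimating the missed queries as throughput multiplied by update duration is an assumption, not something the slides specify.

```python
def min_wait(benefit_rate: float, queries_missed: float) -> float:
    """Break-even deltaT: the point where benefit_rate * deltaT equals the cost.

    benefit_rate   -- reduction in queries per unit time after the update
                      (decrement-of-query/time on the slide)
    queries_missed -- queries that could have been answered during the update
    """
    if benefit_rate <= 0:
        return float("inf")  # the update brought no benefit, so it never pays off
    return queries_missed / benefit_rate


# Hypothetical figures: the update took 2 s on a node that answers 50 queries/s,
# and it saves the node 4 misdirected queries per second afterwards.
queries_missed = 50 * 2
delta_t = min_wait(benefit_rate=4, queries_missed=queries_missed)
print(delta_t)  # 25.0
```

Under these assumed figures, the node should wait at least 25 seconds before it considers the next update.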

Page 26: An Efficient Multi-Dimensional Index for Cloud Data Management

EMINC: Phase Two
• Requiring benefit > cost determines deltaT
• After deltaT expires, check if an update is needed. This check involves the following:
  ◦ Record update frequency
  ◦ Expected benefit ratio
  ◦ Performance requirement
• We leave this as future work

Page 28: An Efficient Multi-Dimensional Index for Cloud Data Management

Evaluation
• 6 machines
  ◦ 1 as master node
  ◦ 5 slave nodes simulating 100~1000 nodes
• Each machine had a 2.33GHz Intel Core2 Quad CPU, 4GB of main memory, and a 320GB disk.
• Machines ran Ubuntu 9.04 Server OS.

Page 29: An Efficient Multi-Dimensional Index for Cloud Data Management

Evaluation: Point Query

Page 30: An Efficient Multi-Dimensional Index for Cloud Data Management

Evaluation: Range Query

Page 32: An Efficient Multi-Dimensional Index for Cloud Data Management

Conclusion
• In this paper we presented a series of approaches for building an efficient multi-dimensional index on a Cloud platform.
• We developed the node bounding technique to reduce query processing cost on the cloud platform.
• To maintain the efficiency of the index, we proposed a cost estimation-based approach for index update.

Page 33: An Efficient Multi-Dimensional Index for Cloud Data Management

Future Work
• Complete the cost estimation model
• Take replication of data into consideration
• Implement in HBase to further verify performance

Page 34: An Efficient Multi-Dimensional Index for Cloud Data Management

Thanks
Please visit our lab for more information: http://idke.ruc.edu.cn/