Index for Cloud Data Management
Lab of Web And Mobile Data Management (WAMDM)
Youzhong MA
Outline
Motivating Applications Existing Technologies Conclusions & Future work
Motivating Application
Cloud System: Big Data in a Private Cloud
Table: Product
rowkey  name  price  number
1       beer  3.00$  1000
2       beer  7.00$  2500
3       milk  2.00$  1300
4       milk  4.50$  2100
select sum(number) from Product where product.name = ‘beer’ and product.price <= 10$ and product.price >= 5$
Queries on multiple attributes and on non-rowkey attributes are quite common!
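A minimal sketch of why this query is awkward on a store that is only indexed by rowkey. It assumes a plain Python dict keyed by rowkey as a stand-in for the cloud store, and the function name `sum_number` is hypothetical. With no index on name or price, every row has to be visited.

```python
# Product table from the slide, keyed by rowkey (a dict stands in for the store).
PRODUCT = {
    1: {"name": "beer", "price": 3.00, "number": 1000},
    2: {"name": "beer", "price": 7.00, "number": 2500},
    3: {"name": "milk", "price": 2.00, "number": 1300},
    4: {"name": "milk", "price": 4.50, "number": 2100},
}


def sum_number(table, name, min_price, max_price):
    """Full scan: name and price are not the rowkey, so no row can be skipped."""
    total = 0
    for row in table.values():
        if row["name"] == name and min_price <= row["price"] <= max_price:
            total += row["number"]
    return total


print(sum_number(PRODUCT, "beer", 5.0, 10.0))  # 2500
```

Only rowkey 2 satisfies both predicates, yet all four rows are read. The index schemes surveyed in the rest of the deck aim to avoid this full scan.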
Motivating Application: Mobile Coupon Distribution
Mobile Coupon Distributor
Distribution Policy
• Area
• # of coupons
[Figure labels: Coupon, Current Location]
Motivating Application: Mobile Coupon Distribution
[Figure labels: many Current Location reports, several Coupons]
Distribution Policy
• Area
• # of coupons
125,000,000 subscribers in Japan
• Large amounts of data
• High throughput
• System scalability
• Multi-dimensional query
• Nearest neighbors query
• Efficient complex queries
Outline
Motivating Applications Existing Technologies Conclusions & Future work
Existing Technologies
[Figure axes: Multi-dimensional Queries vs. Scalability]
• Relational DBs, Spatial DBs
  • Commercial products, but expensive
  • Open source products
• Key-Value Stores
• What We Want, at a reasonable price
Solutions overview
Rowkey
• Single-dimensional index: [BigTable, HBase] [Point Query, Range Query]
Non-rowkey
• Single-dimensional index: [Aguilera PVLDB’08] [S. Wu Data Eng’09] [S. Wu PVLDB’10]
• Multi-dimensional index: [X. Zhang CloudDB’09] [J. Wang SIGMOD’10] [G. Chen VLDB’11] [Y. Zou NPC’10] [Shoji Nishimura MDM’11]
[Figure labels: Local Index + Global Index; NEC; CAS]
Efficient B-tree Based Indexing for Cloud Data Processing
S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. PVLDB'10
Motivation
• Design a scalable, high-throughput indexing scheme that supports efficient queries over huge volumes of data in the cloud
• Keep maintenance cost low while also supporting parallel search
System Architecture
① Local Index
② BATON overlay network
③ publish
Challenges
• How to select the local B+-tree nodes to publish in the global index?
• How to organize the global index?
• How to maximize the throughput?
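Selecting local B+-tree nodes: cost modeling
Query cost
1. Routing cost (search in global index): $\frac{1}{2}\log_2 N \cdot \alpha$
2. Local search cost (search in local index): $h(n) \cdot \beta$
Update cost: $\frac{1}{2}\, g(n) \log_2 N \cdot \alpha$
Here $\alpha$ is the cost of sending an index message and $\beta$ is the cost of random I/O. (The two symbols did not survive on the slide; $\alpha$ and $\beta$ are stand-ins.)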
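A small sketch of the cost terms on this slide, reading them as (1/2)·log2(N)·α for routing, h(n)·β for local search, and (1/2)·g(n)·log2(N)·α for updates. The symbol names `alpha` (cost of sending an index message) and `beta` (cost of random I/O) are assumptions, since the slide's own symbols were lost, and `N`, `h(n)` and `g(n)` are treated as plain numeric inputs because the slide does not define them. The numbers are illustrative only.

```python
import math


def routing_cost(num_nodes, alpha):
    """Search in the global index: (1/2) * log2(N) * alpha."""
    return 0.5 * math.log2(num_nodes) * alpha


def local_search_cost(h_n, beta):
    """Search in the local index: h(n) * beta."""
    return h_n * beta


def update_cost(g_n, num_nodes, alpha):
    """Update cost: (1/2) * g(n) * log2(N) * alpha."""
    return 0.5 * g_n * math.log2(num_nodes) * alpha


# Illustrative inputs, not taken from the slides.
N, alpha, beta = 16, 1.0, 2.0
print(routing_cost(N, alpha))      # 2.0
print(local_search_cost(3, beta))  # 6.0
print(update_cost(4, N, alpha))    # 8.0
```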
Adaptive indexing strategy
• Index expand
• Index collapse
[Figure label: Local Index]
BATON: Balanced Tree Overlay Network
• A distributed tree structure for P2P systems
• Supports range search
Index Construction
• Assign a range to each node
• For each node n
  • The range of its left sub-tree is less than that of n
  • The range of its right sub-tree is larger than that of n
• Publish local B+-tree nodes to BATON
Maximizing the throughput
• Eventually consistent model
• Lazy update: if the update does not affect the key range of a local B+-tree, the stale index will not affect the correctness of query processing.
• Eager update: updates in the left-most and right-most nodes
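The lazy/eager distinction can be sketched as a range check. This is a simplified illustration, not the paper's procedure: it assumes the global index holds one published `(low, high)` key range per local B+-tree, and the function name `needs_eager_update` is hypothetical.

```python
def needs_eager_update(published_range, key):
    """Return True when inserting `key` changes the key range of the local B+-tree."""
    low, high = published_range
    return key < low or key > high


# Hypothetical key range already published to the global index.
published = (100, 500)
print(needs_eager_update(published, 250))  # False: the stale global entry is still correct
print(needs_eager_update(published, 750))  # True: the range grows, so publish eagerly
```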
Pros and cons
Pros
• Supports efficient point queries and range queries on non-rowkey attributes
• Proposes an adaptive indexing strategy based on the cost model of overlay routing
Cons
• Cannot support multi-dimensional queries
Multi-dimensional index
[X.Zhang CloudDB’09]
Multi-dimensional index
[J.Wang SIGMOD’10]
[G.Chen VLDB’11]
MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services
Shoji Nishimura, Sudipto Das. MDM'11
Contributions
• Using linearization to implement a scalable multi-dimensional index structure layered over a range-partitioned key-value store
• Implementing a K-d tree and a Quad tree by the design
Ordered Key-Value Stores
[Figure: an Index (key00, key11, …, keynn) pointing to Buckets of key-value pairs sorted by key: key00…key0X, key11…key1Y, …, keynn]
Good at 1-D Range Query
But our target is multi-dimensional (Longitude, Latitude, Time)…
Naïve Solution: Linearization
[Figure: the same Index and sorted Buckets of an ordered key-value store]
Projects n-D space to 1-D space
Simple, but problematic…
Apply a Z-ordering curve…
5 7 13 15
4 6 12 14
1 3 9 11
0 2 8 10
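The numbering in this grid can be reproduced by interleaving the bits of the two coordinates. The sketch below assumes 2-bit coordinates with x increasing to the right, y increasing upward, and the x bit placed before the y bit in each pair. That layout is inferred from the grid; the function names are hypothetical.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of x and y (x bit first) into one Z-order value."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_grid(bits=2):
    """Rows are listed from the largest y down to y = 0, as drawn on the slide."""
    size = 1 << bits
    return [[z_value(x, y, bits) for x in range(size)] for y in reversed(range(size))]


for row in z_grid():
    print(" ".join(str(v) for v in row))
```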
Problem: False positive scans
MD-query on linearized space
• Translate an MD-query to a linearized range query.
  • Ex. Query from 2 to 9.
• Scan the queried linearized range.
• Filter out points outside the queried area.
  • Ex. blue-hatched area (4 to 7)
• Requires the boundary information of the original space.
5 7 13 15
4 6 12 14
1 3 9 11
0 2 8 10
[Figure labels: query corners 2 and 9]
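The false positives in this example can be checked with a short sketch. It assumes the query rectangle covers x in [1, 2] and y in [0, 1] (the cells numbered 2, 3, 8 and 9) and reuses the x-bit-first interleaving inferred from the grid. The function names are hypothetical.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of x and y (x bit first) into one Z-order value."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def false_positives(x_range, y_range, bits=2):
    """Return the wanted Z-values and the extra ones a single linear scan would read."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    wanted = {
        z_value(x, y, bits)
        for x in range(x_lo, x_hi + 1)
        for y in range(y_lo, y_hi + 1)
    }
    scanned = range(min(wanted), max(wanted) + 1)
    return sorted(wanted), [z for z in scanned if z not in wanted]


wanted, extra = false_positives((1, 2), (0, 1))
print(wanted)  # [2, 3, 8, 9]
print(extra)   # [4, 5, 6, 7]
```

Half of the scanned range (4 to 7) lies outside the queried area, which is the overhead the slide points out.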
Build a Multi-dimensional Index Layer on top of an Ordered Key-Value Store
MD-HBase
• Multi-Dimensional Index
• Single Dimensional Index
• Ordered Key-Value Store (ex. BigTable, HBase, …)
Space Partition by the K-d tree
Binary Z-ordering space (bitwise interleaving)
      00   01   10   11
11  0101 0111 1101 1111
10  0100 0110 1100 1110
01  0001 0011 1001 1011
00  0000 0010 1000 1010
[Figure: the same space partitioned by the K-d tree]
How do we represent these subspaces?
Key Idea: The longest common prefix naming scheme
[Figure: the binary Z-ordering space with subspaces 000* and 1*** highlighted]
Subspaces are represented as the longest common prefix of their keys!
Remarkable Property
• Preserves boundary information of the original space
Ex. subspace 1***
• Left-bottom corner: * → 0 gives 1000, i.e. (10, 00)
• Right-top corner: * → 1 gives 1111, i.e. (11, 11)
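The boundary reconstruction can be sketched in a few lines, assuming keys are bit strings with the x bit before the y bit in each pair (as in the Z-ordering grid). The function name `prefix_to_corners` is hypothetical.

```python
def prefix_to_corners(prefix):
    """Rebuild the (x, y) corners of a subspace from its prefix name, e.g. '1***'."""
    low = prefix.replace("*", "0")   # key of the left-bottom corner
    high = prefix.replace("*", "1")  # key of the right-top corner

    def split(key):
        # Undo the bitwise interleaving: even positions hold x bits, odd positions y bits.
        return key[0::2], key[1::2]

    return split(low), split(high)


print(prefix_to_corners("1***"))  # (('10', '00'), ('11', '11'))
```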
Build an index with the longest common prefix of keys
[Figure: the Z-ordering space partitioned into subspaces 000*, 001*, 01**, 1***]
Index
• 000*
• 001*
• 01**
• 1***
Buckets: allocate per subspace
Multi-dimensional Range Query
[Figure: the Z-ordering space partitioned into subspaces 000*, 001*, 01**, 10**, 11**]
Index: 000*, 001*, 01**, 10**, 11**
1. Scan 0010 - 1001 on the index: candidates 001*, 01**, 10**
2. Filter (subspace pruning): reconstruct the boundary info. and check whether each candidate intersects the queried area
3. Scan the remaining buckets: 001*, 10**
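A hedged sketch of this query plan, assuming 2-bit coordinates, x-bit-first interleaving, and a query rectangle with corners (1, 0) and (2, 1), which linearizes to 0010 and 1001. The names `interleave`, `corners` and `plan` are hypothetical and the index is a plain list rather than an HBase table.

```python
INDEX = ["000*", "001*", "01**", "10**", "11**"]


def interleave(x, y, bits=2):
    """Bit string of the Z-order key for (x, y), x bit first."""
    xb, yb = format(x, f"0{bits}b"), format(y, f"0{bits}b")
    return "".join(a + b for a, b in zip(xb, yb))


def corners(prefix):
    """(x_lo, y_lo, x_hi, y_hi) of the subspace named by a prefix."""
    low, high = prefix.replace("*", "0"), prefix.replace("*", "1")
    return int(low[0::2], 2), int(low[1::2], 2), int(high[0::2], 2), int(high[1::2], 2)


def plan(index, query, bits=2):
    """Scan the index over the linearized range, then prune by real boundaries."""
    qx_lo, qy_lo, qx_hi, qy_hi = query
    z_lo, z_hi = interleave(qx_lo, qy_lo, bits), interleave(qx_hi, qy_hi, bits)
    # Equal-length bit strings compare in the same order as their numeric values.
    candidates = [p for p in index
                  if p.replace("*", "1") >= z_lo and p.replace("*", "0") <= z_hi]
    kept = []
    for p in candidates:
        x_lo, y_lo, x_hi, y_hi = corners(p)
        if x_lo <= qx_hi and qx_lo <= x_hi and y_lo <= qy_hi and qy_lo <= y_hi:
            kept.append(p)
    return candidates, kept


candidates, kept = plan(INDEX, (1, 0, 2, 1))
print(candidates)  # ['001*', '01**', '10**']
print(kept)        # ['001*', '10**']
```

The subspace 01** falls inside the linearized range but covers y in [2, 3], so it is pruned without scanning its bucket.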
Variations of Storage Layer
Table Share Model
• Use a single table and maintain bucket boundaries
• Most space efficient
Table per Bucket Model
• Allocate a table per bucket
• Most flexible mapping: one-to-one, one-to-many, many-to-one
• Bucket split is expensive: all points are copied to the new buckets.
Region per Bucket Model
• Allocate a region per bucket
• Most efficient bucket split
• Requires modification of HBase
Experimental Results: Multi-dimensional Range Query
• Dataset: 400,000,000 points
• Queries: select objects within MD ranges and change selectivity
• Cluster size: 16 nodes
• MD-HBase responds 10 to 100 times faster than the others, and its response time is proportional to selectivity.
[Figure: Response Time (sec), 1 to 1000, vs. Selectivity (%), 0.01 to 10, for MD-HBase, HBase (ZOrder), and MapReduce]
Experimental Results: Insert
• Dataset: spatially skewed data
• MD-HBase shows good scalability without significant overhead.
[Figure: Throughput (records/sec), 0 to 250,000, vs. Number of nodes, 0 to 20, for MD-HBase and HBase (ZOrder)]
Conclusions
• Designed a scalable multi-dimensional data store.
  • Mapping multiple dimensions to a single dimension
  • Key idea: indexing the longest common prefix of keys
• Demonstrated scalable insert throughput and excellent query performance.
  • Range query: 10 to 100 times faster than existing technologies.
  • Insert: 220K inserts/sec on a 16-node cluster without significant overhead
CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries
Y. Zou, J. Liu, S. Wang. NPC’10
Introduction
Motivation
• Building an index in DOTs to support multi-dimensional range queries
• High performance, low space overhead, high reliability
DOT
• Distributed Ordered Table
• BigTable, HBase
Observations
• Usually 3 to 5 replicas in DOTs
• The number of indexes is usually less than 5
• Random read is significantly slower than scan
Basic idea: Complemental Clustering Index
• CCIT: convert slow random reads to fast sequential scans
• CCT: for fast data recovery
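A minimal sketch of the idea behind turning random reads into a sequential scan: a clustering index stores the full record next to the index key, so a range query on the indexed column reads one contiguous slice instead of looking up each rowkey in the base table. This is an in-memory illustration with hypothetical data and function names, not the CCIndex implementation.

```python
import bisect

# Hypothetical base table: rowkey -> record.
BASE = {
    "r1": {"price": 7.0, "name": "beer"},
    "r2": {"price": 2.0, "name": "milk"},
    "r3": {"price": 4.5, "name": "milk"},
    "r4": {"price": 3.0, "name": "beer"},
}


def build_clustering_index(base, column):
    """Entries ((index value, rowkey), full record), sorted by the composite key."""
    entries = [((rec[column], rowkey), rec) for rowkey, rec in base.items()]
    return sorted(entries, key=lambda entry: entry[0])


def range_query(index, low, high):
    """Answer low <= value <= high with one contiguous slice of the sorted index."""
    values = [key[0] for key, _ in index]
    start = bisect.bisect_left(values, low)
    stop = bisect.bisect_right(values, high)
    return [rec for _, rec in index[start:stop]]


price_index = build_clustering_index(BASE, "price")
print([rec["name"] for rec in range_query(price_index, 3.0, 5.0)])  # ['beer', 'milk']
```

The price paid is one extra copy of the data per indexed column, which is the space overhead discussed later in the deck.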
Challenges
• Performance
• Reliability
• Space overhead
Performance
• HBase 0.20.1
• 16 nodes
• 90 million records
• Query optimization based on the region-to-server mapping information
Reliability: Fault tolerance
• Get other index values from CCTs
• Query the CCITs to recover data
• Replicate CCTs
Space overhead
• N: the number of index columns
• X-axis: ratio of record length to length of index columns
• Y-axis: overhead ratio
Conclusions
• Proposed CCIndex to support multi-dimensional range queries in DOTs
• Not suitable for more than 5 index columns
• Write operations are slower than on the original table
Outline
Motivating Applications Existing Technologies Conclusions & Future work
Conclusions
• Index for non-rowkey attributes in cloud data management systems
• Solutions
  • Local index + global index
  • Linearization
  • Secondary index
• Key issues
  • Index reliability
  • Query result correctness
  • Index maintenance
  • …
Future work
• Study the architecture of HDFS and HBase in detail
• Test the existing index solutions in the cloud
• Index framework and index structure
References
• M. K. Aguilera, W. Golab, and M. A. Shah. A practical scalable distributed B-tree. PVLDB, 1(1):598–609, 2008.
• Y. Zou, J. Liu, and S. Wang. CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries. NPC, 2010.
• S. Wu and K.-L. Wu. An indexing framework for efficient retrieval on the cloud. IEEE Data Eng. Bull., vol. 32, pp. 75–82, 2009.
• J. Wang, S. Wu, H. Gao, J. Li, and B. C. Ooi. Indexing multi-dimensional data in a cloud system. SIGMOD, 2010.
• S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. Efficient B-tree based indexing for cloud data processing. PVLDB, 3(1):1207–1218, 2010.
• X. Zhang, J. Ai, Z. Wang, J. Lu, and X. Meng. An efficient multidimensional index for cloud data management. CloudDB, 2009, pp. 17–24.
• S. Nishimura and S. Das. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. MDM, 2011.
Thank you