2012 Storage Developer Conference. © Tata Consultancy Services Limited 2012. All Rights Reserved.
Storage Efficiency in Clustered Storage Environment
Ankur Saran Tata Consultancy Services
Abstract
Improving data storage efficiency, mainly by de-duplicating and compressing data, is very popular and is one of the key selling points for storage vendors. These features have an inherent problem: their CPU-intensive scanning operations generate a heavy workload and consume a large share of CPU resources at run time, so they are usually run as background applications on the storage system. Despite the many de-dupe and compression techniques available, none handles data efficiently in a production environment.
As data grows at an exponential rate, almost all storage vendors are coming up with scale-out/clustered solutions to address the big data problem in their production environments. In a clustered environment, however, there is leftover CPU bandwidth that can be used to improve the data efficiency ratio without significantly affecting other operations. The idea is to distribute load from an already busy Node to relatively free Nodes using a few distribution algorithms, MapReduce being one of them.
Topics to be covered
Storage Trend
Storage Efficiency & Challenges
Storage Efficiency in Cluster Environment
Benefits & Challenges
Conclusion
Storage Trend
Storage need has been continuously growing
Next Generation Storage needs
  System to handle exponential growth in data from
    Medical sciences, e.g. data from DNA sequencing
    Finance data storage and analysis
    Scientific data storage and analysis, i.e. big data handling
  Need for an extremely scalable data storage system
Clustered Storage is definitely one of the next generation Storage systems.
Clustered Storage Structure
Fig1: One of the Clustered Storage Architectures
Structure of Clustered Storage
A common namespace is shared as the point of transaction.
All the Nodes in the system have storage space and computing ability.
Each Node has storage tightly coupled with its processing unit.
The Nodes together form a loosely coupled cluster.
Storage Efficiency & Challenges
Storage Efficiency
Storage Efficiency comprises storage features such as:
Deduplication
Compression
Snapshots
Thin Provisioning, etc.
(Reference: http://en.wikipedia.org/wiki/Storage_efficiency)
Storage and Storage Efficiency
Storage market players compete on the features they provide, such as dedupe.
Many Storage Efficiency processes, such as fingerprint searching for duplicate data present in storage, are CPU intensive; hence these processes run as low-priority processes.
A design in which CPU-intensive processes are distributed among the Nodes with leftover CPU can be an asset to capitalize on.
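One way to picture distributing CPU-intensive work to Nodes with leftover CPU is a greedy least-loaded assignment. The sketch below is only an illustration under assumptions: the function name assign_scan_tasks, the fixed per-task CPU cost, and the 80% busy threshold are hypothetical and do not come from the presentation.

```python
import heapq


def assign_scan_tasks(cpu_load, tasks, cost_per_task=15.0, busy_threshold=80.0):
    """Greedily give each task to the Node with the most leftover CPU.

    cpu_load maps a Node name to its current CPU utilization in percent.
    cost_per_task and busy_threshold are hypothetical tuning values.
    """
    # Min-heap keyed on utilization; Nodes at or above the threshold are skipped.
    heap = [(load, node) for node, load in cpu_load.items() if load < busy_threshold]
    heapq.heapify(heap)

    assignment = {}
    for task in tasks:
        if not heap:
            break  # every Node is busy; remaining tasks stay unassigned
        load, node = heapq.heappop(heap)
        assignment[task] = node
        new_load = load + cost_per_task
        if new_load < busy_threshold:
            heapq.heappush(heap, (new_load, node))
    return assignment


cpu_load = {"node1": 90.0, "node2": 20.0, "node3": 40.0}
plan = assign_scan_tasks(cpu_load, ["scan-a", "scan-b", "scan-c"])
```

In this example the already busy node1 receives no fingerprint-scanning work, while node2 and node3 share it.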
9
2012 Storage Developer Conference. © Tata Consultancy Services Limited 2012. All Rights Reserved.
Storage Efficiency in Cluster Environment
Structure of Clustered Storage Efficiency
Fig2: Setup with Global Fingerprint Database
Storage Efficiency in Clustered Storage
The Global Fingerprint Database (GFDB) contains the unique Fingerprints of all data across the Nodes.
Every data storage Node can query and update the GFDB to keep track of data in the system.
A duplicate Fingerprint in the GFDB means a duplicate data block in the system.
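The query/update behaviour described above can be sketched as a small in-memory structure. This is a minimal illustration, not the presenters' implementation: the class name GlobalFingerprintDB and its methods are hypothetical, and a real GFDB would need persistence and concurrent access from many Nodes.

```python
from collections import defaultdict


class GlobalFingerprintDB:
    """Hypothetical in-memory stand-in for the GFDB."""

    def __init__(self):
        # fingerprint -> list of (node name, block id) locations
        self._locations = defaultdict(list)

    def update(self, fingerprint, node, block_id):
        """Record a stored block; return True if the fingerprint was already known."""
        is_duplicate = fingerprint in self._locations
        self._locations[fingerprint].append((node, block_id))
        return is_duplicate

    def query(self, fingerprint):
        """Return every known location of the block with this fingerprint."""
        return list(self._locations.get(fingerprint, []))

    def duplicates(self):
        """Return the fingerprints that are stored at more than one location."""
        return {fp: locs for fp, locs in self._locations.items() if len(locs) > 1}


gfdb = GlobalFingerprintDB()
first = gfdb.update("A", "Slave1", 1)
gfdb.update("B", "Slave1", 2)
second = gfdb.update("A", "Slave2", 3)
```

Here the second update for fingerprint "A" reports a duplicate, which is the signal a Node would use to reference the existing block instead of keeping a second copy.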
Fig3: Initial Storage setup without data
Clustered Storage System Setup and GFDB
Important Components
Storage Master: Controls the block flow and maintains information about the stored blocks and CPU utilization
Storage Slaves: Storage units of the system
Fingerprint: Unique identification key for a data block
Fingerprint Database: Database that stores each unique key with the data block address as its value
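As an illustration of how a fingerprint could serve as the key for a data block, the sketch below hashes fixed-size blocks with SHA-256. The choice of SHA-256, the fixed-size chunking, and the function name fingerprint_blocks are assumptions; the presentation does not say which fingerprint function or chunking scheme was used.

```python
import hashlib


def fingerprint_blocks(data: bytes, block_size: int):
    """Yield (block_id, fingerprint) for each fixed-size block of data.

    SHA-256 is an assumed fingerprint function; block ids start at 1.
    """
    offsets = range(0, len(data), block_size)
    for block_id, offset in enumerate(offsets, start=1):
        block = data[offset:offset + block_size]
        yield block_id, hashlib.sha256(block).hexdigest()


# Four 4-byte blocks: zeros, zeros, ones, zeros.
stream = b"\x00" * 8 + b"\x01" * 4 + b"\x00" * 4
fps = list(fingerprint_blocks(stream, block_size=4))
```

Blocks 1, 2, and 4 have identical content and therefore identical fingerprints, so only one copy of that block would need to be stored.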
Clustered Storage System with Single Instance of Data
Fig4: System after the first stream of data is stored, and the related changes in the FPDB
System with Duplicate Entries
Fig5: System after the second stream of data is stored, and the related changes in the FPDB
Processing of Duplicate data
The mapper runs over the fingerprint database, with the fingerprint as key and the path id of the fingerprint as value
Mapper(fingerprint, path)
  // For each fingerprint
  emit(fingerprint, 1)
Reducer Function
The Reducer collects the fingerprints and reduces the duplicate count, if any
Reducer(fingerprint, iteration value)
  if fingerprint == new_fingerprint
    emit(result)
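The Mapper and Reducer pseudocode can be read as a count-per-fingerprint job. The following single-process sketch is one possible interpretation, not the presenters' code: it assumes the reducer emits a fingerprint only when it occurs more than once, and run_map_reduce is a hypothetical stand-in for the shuffle phase that a MapReduce framework would provide.

```python
from collections import defaultdict


def mapper(fingerprint, path):
    """Emit (fingerprint, 1) for every fingerprint record."""
    yield fingerprint, 1


def reducer(fingerprint, counts):
    """Emit (fingerprint, total) only for fingerprints seen more than once."""
    total = sum(counts)
    if total > 1:
        yield fingerprint, total


def run_map_reduce(records):
    """Hypothetical local driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for fingerprint, path in records:
        for key, value in mapper(fingerprint, path):
            grouped[key].append(value)

    result = {}
    for key, values in grouped.items():
        for fp, total in reducer(key, values):
            result[fp] = total
    return result


records = [("A", "Slave1/1"), ("B", "Slave1/2"), ("A", "Slave2/3")]
duplicate_counts = run_map_reduce(records)
```

With these sample records only fingerprint "A" is reported, with a count of 2, marking it as a candidate for deduplication.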
Cluster Level Deduplication
Before Deduplication
FP   Slave Name       Block Id
A    Slave1           1
B    Slave1           2
C    Slave2           1
D    Slave2           2
...  ...              ...
E    SlaveN           1
F    SlaveN           2

After Deduplication
FP   Slave Name       Block Id
A    Slave1,Slave2    1,3
B    Slave1           2
C    Slave2           1
D    Slave2           2
...  ...              ...
E    SlaveN,Slave1    1,3
F    SlaveN,Slave1    2,4

Fig6: Before and after Deduplication
Fingerprint Database
Before Deduplication
FP   Slave Name       Block Id
A    Slave1           1
B    Slave1           2
C    Slave2           1
D    Slave2           2
...  ...              ...
E    SlaveN           1
F    SlaveN           2

After Deduplication
FP   Slave Name       Block Id
A    Slave1,Slave2    1,3
B    Slave1           2
C    Slave2           1
D    Slave2           2
...  ...              ...
E    SlaveN,Slave1    1,3
F    SlaveN,Slave1    2,4

Fig7: FPDB before and after dedupe
Benefits & Challenges
PoC : Scenario & Results
Graph1: Data type vs. Dedupe saving in percentage across Cluster
[Two bar charts. Y-axis: Percentage Saving (%age Saving), 0 to 100.]
Chart 1, X-axis labels in order: Random Mix; Mix of Zero+Random; All Zero. Data sizes in order: 1g, 1g, 1g.
Chart 2, X-axis labels in order: Similar; All Zero; 2 Files with all Zero; 2 Files with all Zero; 1 File with all Zero; 10X400mb all Zero; 2 Files with all Zero; 1 File with all Zero; 4 File with all Zero; 2 Files with all Zero; 1 File with all Zero; 5 Files with all Zero; 2 Files with all Zero. Data sizes in order: 128m, 1g, 2g, 4g, 4g, 4g, 10g, 10g, 16g, 16g, 16g, 19g, 19g.
Constants for the PoC
No. of Nodes: 2
Block Size: 64m
Replication Factor: 1
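To make the "percentage saving" axis concrete, the sketch below shows one common way to compute a dedupe saving from block counts. It is arithmetic under assumptions, not a measured PoC result: the formula, the function name dedupe_saving_percent, and the 1g all-zero example are illustrative, and the presentation does not state how its savings were calculated.

```python
def dedupe_saving_percent(total_blocks: int, unique_blocks: int) -> float:
    """Assumed definition: share of blocks that need not be stored, in percent."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    return 100.0 * (total_blocks - unique_blocks) / total_blocks


# Illustrative case: a 1g all-zero file split into 64m blocks (the PoC block size).
BLOCK_SIZE_MB = 64
file_size_mb = 1024
total_blocks = file_size_mb // BLOCK_SIZE_MB

# Every all-zero block has the same fingerprint, so one unique block remains.
saving = dedupe_saving_percent(total_blocks, unique_blocks=1)
```

Under these assumptions the file occupies 16 blocks of which 15 are duplicates, so the computed saving is 93.75 percent.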
Results
Storage Efficiency in a cluster environment reduces the storage requirement many times over.
Searching the Global Fingerprint Database proved a boon to the design, as it can be used to reduce the time required to process duplicate data in the cluster.
Challenges
Creation of the Global Fingerprint Database (GFDB)
Removal of data blocks and updating the GFDB and the Storage System accordingly
Master Node: single point of failure
Conclusion
Multi-Node storage is the storage of the future
Next Generation storage can be more efficient with the utilization of a distributed architecture
Features like dedupe will be the key to success in the industry
Storage Efficiency will help in optimizing Storage requirements and reducing carbon footprints
Questions?
Mail us @ [email protected]
Special thanks to Arvind Kumar, Gaurav Gupta and Suraj Bhatnagar
Thanks to Nishi Gupta, Udayan Singh and my project members