Scale-out Data
Deduplication Architecture
Gideon Senderov
Product Management & Technical Marketing
NEC Corporation of America
Outline
• Data Growth and Retention
• Deduplication Methods
• Legacy Architecture Limitations
• Adaptive Platform for Long-term Data
• Enhanced Scale-out Attributes
– Scalability
– Performance
– Resiliency
Page 2
Long-term Data
Source: IDC 2010
0
10
20
30
40
50
60
70
80
90
2008 2009 2010 2011 2012 2013 2014
Content Depots & CloudUnstructured dataReplicated dataStructured data
Consumption of Enterprise
Disk Capacity by Type
CAGR
23.6%
54.8%
(EB)
24.2%
75.6%
88%
• According to SNIA, 80% of files are dormant, unchanged.
• According to Contoural, average file shares grow 40% annually.
• Non-mission critical data filling costly Tier 1 storage
Page 3
Increasing Enterprise Size
Data Outlasts Its Container
Page 4
Months Years Decades Centuries
DATA
MINING
FILE
SHARING
FILE
SHARING
WWW
DATA
MINING
MIGRATE
DATA
MIGRATE
DATA
MIGRATE
DATA
Data Migration Costs
• In 2011 there will be 1775 Exabytes of information
• Archive data outlasts a storage container, data migrate
costs $1K – $1.8K per TB
• 15% of Storage Admin’s time spent on data migration
Page 5
Source: IDC, Gartner, Contoural Consulting
Deduplication Granularity
• File (SIS)
– Data deduplication performed at a file or object granularity
• Sub-file (Block)
– Data deduplication performed at a sub-file granularity
• Fixed size chunk
• Variable size chunk
• Sub-file variable size chunk deduplication benefits
– Greater efficiency by deduplicating data at finer granularity
– Automatic adjustments (“slide”) across inserted data
Page 6
Inline vs. Post-process
• Inline
– Data deduplication performed before writing the deduplicated data
• Post-Process
– Data deduplication performed after the data to be deduplicated has been initially stored
• Inline deduplication benefits
– Eliminate need for disk buffer for un-deduplicated data
– Eliminate need for subsequent “idle time” for processing
– Immediate replication and shorter RTO for DR
Page 7
Generic vs. Application-aware
• Generic
– Deduplication with same algorithm against all data streams
• Application-aware
– Custom deduplication for specific data streams to account for metadata inserted by the corresponding applications
• Application-aware deduplication benefits
– Greater efficiency by eliminating negative impact of application metadata on deduplication ratio
– Cross-application deduplication for greater scope and higher deduplication efficiency
Page 8
Local vs. Global
• Local
– Deduplication across a single node or sub-node, requiring separate deduplication repositories for multiple nodes
• Global
– Deduplication spanning multiple nodes with a single shared deduplication repository across all nodes
• Global deduplication benefits
– Greater scalability of a single deduplication repository
– Greater efficiency with a broader scope spanning multiple nodes
Page 9
Legacy Scalability Limitations
• Inadequate scalability of capacity & performance
– Cannot scale performance to keep up with growth
– Multiple products with different architectures
– More siloed capacity to manage
• Limited deduplication scope
– Limited scalability proliferates duplicate data across appliances
– Lower deduplication ratio for large environments
Page 10
Local vs. Global Scope
• Single deduplication repository across entire solution
• Data deduplication across ALL data from ALL nodes
– Cross-node dedupe for greater efficiency
– Cross-application dedupe with application-awareness
Page 11
Local – Volume
Dedupe
Repository
Dedupe
Repository
Dedupe
Repository
Local – Node
Dedupe
Repository
Dedupe
Repository
Global
Dedupe
Repository
Legacy Resiliency Limitations
Page 12
5 6 7 8
2 3 41
Q: What data
can be restored
if block #1
lost?
A: NONE!
14 6 2 1
7 1 4
6
1
3
6
4 6
1
7
6
8
1 4
7
5 1
Day 1: FullDay 2:
Incremental…Day 7: Full
Day 3:
Incremental
1
Traditional RAID is not sufficient for deduped data
Non-disruptive
RemoveNon-disruptive
Add, Upgrade
V1 Grid + V1 + V2 + Vx = 1 System
Page 13Page 13
Adaptive Scale-out ArchitectureOne system becomes Greener, Faster, Denser
• Concurrently support multiple HW generations
• Non-disruptive resource scaling
• Zero provisioning, dynamic self-management
• Automatically balance system workload
• In-place technology refresh – No migration
Page 14Page 14
Independent Linear Scalability
• Customized performance and capacity per
application
• Dynamically adjust to changing data growth
Uniquely Scale
Performance and Capacity
to Meet Current and
Future Needs
Capacity
Pe
rfo
rma
nce
SAFE DEDUPEProtection
SCALABLE DEDUPEPerformance
GLOBAL DEDUPEScope
AttributesCriteria
Page 15
Enhanced Scale-out Attributes
Scope
• Multiple applications and data sources across a single scalable deduplication repository
• Broader scope leads to greater efficiencies
– Higher dedupe ratios
– Improved capacity utilization
– Easier management
– Longer retention periods
– Lower costs
• Single scalable system vs. multiple disparate silos
– Cross-application deduplication for greater efficiency
– Global deduplication for entire environment
Page 16
Performance
• Linear performance scalability to keep up with data growth
• Scalability
– Linear scalability vs. high degradation
– Independent performance scalability
– Beyond the physical boundaries of the system
• Inline dedupe vs. post-process
– Keeping up with the workload
• Effects of deduplication rates
– Higher performance with higher dedupe
Page 17
Scale-out Dedupe Architecture
• Multi-node architecture
– Inline data routing
– Processing and memory scales with capacity
• Distributed hash table
• Linear performance scalability
• Global deduplication across ALL nodes
Page 18
Linear Performance Scalability
Page 19
Resiliency
• Enhanced data resiliency, especially with fewer copies due to deduplication
• The greater the dedupe ratio, the greater the exposure
• Maximum protection against multiple failures
– Component failures
– Appliance or node failures
• Upgradeability and technology refresh
– In-place technology refresh vs. forklift upgrade
– Online versus downtime
– Configurable resiliency for different applications
– Investment protection
Page 20
Erasure-coded Resiliency
• “User dialable” disk/node protection
– Intermix of multiple resiliency levels
– Dynamically allocated protection
• Greater protection with less overhead
– No idle spare drives
• Faster healing while maximizing I/O performance
– Only data is reconstructed
– Reconstruction across multiple spindles/processors
Page 21
Long-term Evolution
• Functionality
– Application-specific requirements (Security, Resiliency)
– Regulatory compliance requirements
• Performance
– Linear scalability throughput HW/SW enhancements
– Multiple I/O profiles
• Attributes
– Flexibility
– Efficiency
– Granularity
Page 22
Page 23
Long-term Data Protection
Sto
rag
e S
pe
nd
ing
TimeMonths Years Decades Centuries
Clones/Snapshots
WAN-Optimized Replication
Erasure-Coded Resiliency
Inline Global Deduplication
Automatic Load Balancing
Dynamic Provisioning
Massive ScalabilityFeatures drive
savings