Scale-out Data Deduplication Architecture

Scale-out Data

Deduplication Architecture

Gideon Senderov

Product Management & Technical Marketing

NEC Corporation of America

Outline

• Data Growth and Retention

• Deduplication Methods

• Legacy Architecture Limitations

• Adaptive Platform for Long-term Data

• Enhanced Scale-out Attributes

– Scalability

– Performance

– Resiliency

Page 2

Long-term Data

Source: IDC 2010

0

10

20

30

40

50

60

70

80

90

2008 2009 2010 2011 2012 2013 2014

Content Depots & CloudUnstructured dataReplicated dataStructured data

Consumption of Enterprise

Disk Capacity by Type

CAGR

23.6%

54.8%

(EB)

24.2%

75.6%

88%

• According to SNIA, 80% of files are dormant, unchanged.

• According to Contoural, average file shares grow 40% annually.

• Non-mission critical data filling costly Tier 1 storage

Page 3

Increasing Enterprise Size

Data Outlasts Its Container

Page 4

Months Years Decades Centuries

DATA

MINING

FILE

SHARING

FILE

SHARING

WWW

DATA

MINING

MIGRATE

DATA

MIGRATE

DATA

MIGRATE

DATA

Data Migration Costs

• In 2011 there will be 1775 Exabytes of information

• Archive data outlasts a storage container, data migrate

costs $1K – $1.8K per TB

• 15% of Storage Admin’s time spent on data migration

Page 5

Source: IDC, Gartner, Contoural Consulting

Deduplication Granularity

• File (SIS)

– Data deduplication performed at a file or object granularity

• Sub-file (Block)

– Data deduplication performed at a sub-file granularity

• Fixed size chunk

• Variable size chunk

• Sub-file variable size chunk deduplication benefits

– Greater efficiency by deduplicating data at finer granularity

– Automatic adjustments (“slide”) across inserted data

Page 6

Inline vs. Post-process

• Inline

– Data deduplication performed before writing the deduplicated data

• Post-Process

– Data deduplication performed after the data to be deduplicated has been initially stored

• Inline deduplication benefits

– Eliminate need for disk buffer for un-deduplicated data

– Eliminate need for subsequent “idle time” for processing

– Immediate replication and shorter RTO for DR

Page 7

Generic vs. Application-aware

• Generic

– Deduplication with same algorithm against all data streams

• Application-aware

– Custom deduplication for specific data streams to account for metadata inserted by the corresponding applications

• Application-aware deduplication benefits

– Greater efficiency by eliminating negative impact of application metadata on deduplication ratio

– Cross-application deduplication for greater scope and higher deduplication efficiency

Page 8

Local vs. Global

• Local

– Deduplication across a single node or sub-node, requiring separate deduplication repositories for multiple nodes

• Global

– Deduplication spanning multiple nodes with a single shared deduplication repository across all nodes

• Global deduplication benefits

– Greater scalability of a single deduplication repository

– Greater efficiency with a broader scope spanning multiple nodes

Page 9

Legacy Scalability Limitations

• Inadequate scalability of capacity & performance

– Cannot scale performance to keep up with growth

– Multiple products with different architectures

– More siloed capacity to manage

• Limited deduplication scope

– Limited scalability proliferates duplicate data across appliances

– Lower deduplication ratio for large environments

Page 10

Local vs. Global Scope

• Single deduplication repository across entire solution

• Data deduplication across ALL data from ALL nodes

– Cross-node dedupe for greater efficiency

– Cross-application dedupe with application-awareness

Page 11

Local – Volume

Dedupe

Repository

Dedupe

Repository

Dedupe

Repository

Local – Node

Dedupe

Repository

Dedupe

Repository

Global

Dedupe

Repository

Legacy Resiliency Limitations

Page 12

5 6 7 8

2 3 41

Q: What data

can be restored

if block #1

lost?

A: NONE!

14 6 2 1

7 1 4

6

1

3

6

4 6

1

7

6

8

1 4

7

5 1

Day 1: FullDay 2:

Incremental…Day 7: Full

Day 3:

Incremental

1

Traditional RAID is not sufficient for deduped data

Non-disruptive

RemoveNon-disruptive

Add, Upgrade

V1 Grid + V1 + V2 + Vx = 1 System

Page 13Page 13

Adaptive Scale-out ArchitectureOne system becomes Greener, Faster, Denser

• Concurrently support multiple HW generations

• Non-disruptive resource scaling

• Zero provisioning, dynamic self-management

• Automatically balance system workload

• In-place technology refresh – No migration

Page 14Page 14

Independent Linear Scalability

• Customized performance and capacity per

application

• Dynamically adjust to changing data growth

Uniquely Scale

Performance and Capacity

to Meet Current and

Future Needs

Capacity

Pe

rfo

rma

nce

SAFE DEDUPEProtection

SCALABLE DEDUPEPerformance

GLOBAL DEDUPEScope

AttributesCriteria

Page 15

Enhanced Scale-out Attributes

Scope

• Multiple applications and data sources across a single scalable deduplication repository

• Broader scope leads to greater efficiencies

– Higher dedupe ratios

– Improved capacity utilization

– Easier management

– Longer retention periods

– Lower costs

• Single scalable system vs. multiple disparate silos

– Cross-application deduplication for greater efficiency

– Global deduplication for entire environment

Page 16

Performance

• Linear performance scalability to keep up with data growth

• Scalability

– Linear scalability vs. high degradation

– Independent performance scalability

– Beyond the physical boundaries of the system

• Inline dedupe vs. post-process

– Keeping up with the workload

• Effects of deduplication rates

– Higher performance with higher dedupe

Page 17

Scale-out Dedupe Architecture

• Multi-node architecture

– Inline data routing

– Processing and memory scales with capacity

• Distributed hash table

• Linear performance scalability

• Global deduplication across ALL nodes

Page 18

Linear Performance Scalability

Page 19

Resiliency

• Enhanced data resiliency, especially with fewer copies due to deduplication

• The greater the dedupe ratio, the greater the exposure

• Maximum protection against multiple failures

– Component failures

– Appliance or node failures

• Upgradeability and technology refresh

– In-place technology refresh vs. forklift upgrade

– Online versus downtime

– Configurable resiliency for different applications

– Investment protection

Page 20

Erasure-coded Resiliency

• “User dialable” disk/node protection

– Intermix of multiple resiliency levels

– Dynamically allocated protection

• Greater protection with less overhead

– No idle spare drives

• Faster healing while maximizing I/O performance

– Only data is reconstructed

– Reconstruction across multiple spindles/processors

Page 21

Long-term Evolution

• Functionality

– Application-specific requirements (Security, Resiliency)

– Regulatory compliance requirements

• Performance

– Linear scalability throughput HW/SW enhancements

– Multiple I/O profiles

• Attributes

– Flexibility

– Efficiency

– Granularity

Page 22

Page 23

Long-term Data Protection

Sto

rag

e S

pe

nd

ing

TimeMonths Years Decades Centuries

Clones/Snapshots

WAN-Optimized Replication

Erasure-Coded Resiliency

Inline Global Deduplication

Automatic Load Balancing

Dynamic Provisioning

Massive ScalabilityFeatures drive

savings

Documents

Scale-out Data Deduplication Architecture