NetApp Deduplication Concepts

NetApp Deduplication

Deduplication refers to the elimination of redundant data in storage. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored; however, indexing of all data is still retained should that data ever be required. Deduplication reduces the required storage capacity because only the unique data is stored.

NetApp deduplication provides block-level deduplication within the entire flexible volume. Essentially, deduplication removes duplicate blocks, storing only unique blocks in the flexible volume, and it creates a small amount of additional metadata in the process.

Notable features of deduplication include:
1. It works with a high degree of granularity: at the 4KB block level.
2. It operates on the active file system of the flexible volume.
3. It is a background process that can be configured to run automatically, be scheduled, or be run manually through the command-line interface (CLI) or NetApp System Manager (see the example after this list).
4. It is enabled and managed using a simple CLI or a GUI such as System Manager.
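For example, enabling, scheduling, and manually running deduplication might look like this. A minimal sketch assuming 7-Mode Data ONTAP syntax and a hypothetical volume named vol1:

    sis on /vol/vol1                    (enable deduplication on the volume)
    sis config -s sun-sat@0 /vol/vol1   (schedule an automatic run every night at midnight)
    sis start /vol/vol1                 (or trigger a run manually)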

How Deduplication Works

The core enabling technology of deduplication is the fingerprint: a unique digital signature computed for every 4KB data block in the flexible volume.

When deduplication runs for the first time on a flexible volume with existing data, it scans the blocks in the flexible volume and creates a fingerprint database, which contains a sorted list of all fingerprints for the used blocks in the volume. After the fingerprint file is created, the fingerprints are checked for duplicates; when a match is found, a byte-by-byte comparison of the blocks is done first to make sure the blocks are indeed identical. If they are identical, the duplicate block's pointer is updated to the already existing data block, the duplicate data block is released, and the inode is updated.
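This first-time scan of existing data can be requested explicitly; assuming 7-Mode syntax and a hypothetical volume vol1:

    sis start -s /vol/vol1   (scan the existing blocks and build the fingerprint database)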

How Deduplication Works: The Two Phases

When you 'sis' a volume, the behavior of that volume changes, and the change takes place in two phases:

Phase 1 – SIS enabled (pre-process, before the block is written to the array): collecting fingerprints.

Note: this covers new blocks only. For existing data blocks written before SIS was enabled, a scan must be run on the existing data to pull those fingerprints into the catalogue (the sis start -s example above).

Phase 2 – SIS start (post-process, after the block is written to the array): sorting, comparing, and deduping.

Phase 1: the moment SIS is enabled, every time SIS notices a block write request coming in, the SIS process makes a call to Data ONTAP to get a copy of the fingerprint for that block and stores the fingerprint in its catalogue file. Note: this request interrupts the write stream and results in roughly a 7% performance penalty for all writes into any volume with SIS enabled.

Phase 2: at some point you dedupe the volume, either manually with the 'sis start' command or automatically on a schedule. SIS then goes through the process of comparing fingerprints from the fingerprint database (catalogue file), validating the data, and deduping the blocks that pass the validation phase. A run can be triggered and monitored as shown below.
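A minimal sketch, assuming 7-Mode Data ONTAP syntax and a hypothetical volume vol1:

    sis start /vol/vol1       (begin the post-process deduplication run)
    sis status -l /vol/vol1   (show detailed status and progress of the run)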

Important Note

Nothing about the basic data structure of the WAFL file system has changed; we are simply traversing a different path in the file structure to reach the desired data block. That is why NetApp dedupe usually has no perceivable impact on read performance: all we have done is redirect some block pointers. Accessing your data might be a little faster, a little slower, or, most likely, unchanged. It all depends on the pattern of the file system data structure and the pattern of requests coming from the application.

What is a Fingerprint?

A fingerprint is a small digital representation of a larger data object. It is essentially the checksum that WAFL generates for each block for the purpose of consistency checking.

Is the fingerprint generated by SIS? No. Each time a WAFL block is created, a checksum is generated for the purpose of consistency checking. NetApp deduplication (SIS) simply borrows a copy of this checksum and stores it in a catalogue as the fingerprint.

What happens during post-process deduplication? The fingerprint catalogue is sorted and searched for identical fingerprints. When a fingerprint match is made, the associated data blocks are retrieved and compared byte by byte. Assuming successful validation, the inode pointer metadata of the duplicate block is redirected to the original block, and the duplicate block is marked as "free" and returned to the system, eligible for reuse.
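Once the run completes, the savings from those freed blocks can be checked; assuming 7-Mode syntax and a hypothetical volume vol1:

    df -s /vol/vol1   (show used space, saved space, and percentage saved for the volume)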

Volume (or Data Constituent) and Aggregate Deduplication Overhead

For each volume with deduplication enabled, up to 4% of the physical amount of data written to that volume is required to store the volume deduplication metadata.

For each aggregate that contains any volumes with deduplication enabled, up to 3% of the physical amount of data contained in all of those deduplication-enabled volumes within the aggregate is required to store the aggregate deduplication metadata.
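As a worked example of these limits: if 1TB of data is physically written to a deduplication-enabled volume, up to 40GB (4%) may be consumed in the volume for volume deduplication metadata, plus up to 30GB (3%) in the containing aggregate for aggregate deduplication metadata, i.e. roughly 70GB of worst-case overhead.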

Thin and Thick Provisioning

Thin Provisioning

Definition: A thin-provisioned volume is a volume for which storage is not set aside up front; instead, the storage for the volume is allocated as it is needed. The storage architecture uses aggregates to virtualize the physical storage into pools for logical allocation. The volumes and LUNs see the logical space, and the aggregate controls the physical space. This architecture provides the flexibility to create multiple volumes and LUNs whose combined size can exceed the physical space available in the aggregate. All volumes and LUNs in the aggregate use the available storage within the aggregate as a shared pool, allocating space efficiently as data is written rather than preallocating (reserving) it. This is called thin provisioning.
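In practice, a thin-provisioned volume is one whose space guarantee is turned off. A minimal sketch assuming 7-Mode Data ONTAP syntax, with the aggregate aggr1 and volume thinvol as hypothetical names:

    vol create thinvol -s none aggr1 10t   (10TB volume, no space reserved in aggr1 up front)
    vol options thinvol                    (verify that guarantee=none appears in the option list)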

Thick Provisioning

Definition: In virtual storage, thick provisioning is a type of storage allocation in which the full storage capacity of a volume is preallocated on physical storage (the aggregate) at the time the volume is created.
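By contrast, a thick-provisioned volume reserves its full size in the aggregate at creation time. The same sketch with a hypothetical volume thickvol:

    vol create thickvol -s volume aggr1 10t   (guarantee=volume: all 10TB is reserved in aggr1 immediately)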

Multi-Tenancy: What is it?

Secure Multi-Tenancy – Definition

Supporting multiple "tenants" (users, customers, etc.) from a single shared infrastructure while keeping all data isolated and secure.

Customers concerned with security and privacy require secure multi-tenancy:
– Government agencies
– Financial companies
– Service providers
– Etc.

Multi-Tenancy and Cloud Infrastructure

Secure Multi-tenancy for virtualized environments

Solution: the only validated solution to support end-to-end multi-tenancy across applications and data. Data is securely isolated from the virtual server, through the network, to the virtual storage.

Introducing MultiStore

MultiStore and vFiler

MultiStore is a logical partitioning of network and storage resources in Data ONTAP that provides a secure storage consolidation solution.

When enabled, the MultiStore license creates a logical unit called vFiler0, which contains all of the storage and network resources of the physical FAS unit. Additional vFiler units can then be created, with storage and network resources assigned specifically to them.
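A minimal sketch of enabling MultiStore on a 7-Mode system (the license code shown is a placeholder):

    license add XXXXXXX   (enable the MultiStore license; code is a placeholder)
    vfiler status         (vfiler0 now appears, owning all physical storage and network resources)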

What is a vFiler?

A vFiler unit is a lightweight instance of the Data ONTAP multiprotocol server; all system resources are shared between vFiler units.

Storage units in a vFiler are FlexVol volumes and qtrees.

Network units are IP addresses, VLANs, VIFs, aliases, and IPspaces.

vFiler units are not hypervisors: a vFiler unit's resources cannot be accessed or discovered by any other vFiler unit.

MultiStore configuration:

– Up to 65 secure partitions (vFiler units) on a single storage system (64 + vFiler0)
– IP-storage based (NFS, CIFS, and iSCSI servers)
– Additional storage and network resources can be moved, added, or deleted
– NFS, CIFS, iSCSI, HTTP, NDMP, FTP, SSH, and SFTP protocols are supported
– Protocols can be enabled or disabled per vFiler
– Destroying a vFiler does not destroy its data
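For example, creating a vFiler unit and restricting its protocols might look like the following. This is a sketch assuming 7-Mode syntax; the vFiler name, IP address, and root volume path are hypothetical:

    vfiler create vf1 -i 192.168.10.50 /vol/vf1root   (new vFiler with one IP address and a root volume)
    vfiler allow vf1 proto=cifs proto=nfs             (enable CIFS and NFS for this vFiler)
    vfiler disallow vf1 proto=ftp                     (disable FTP for this vFiler)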

MultiStore: One Physical System, Multiple Virtual Storage Partitions

What Makes MultiStore Secure?

MultiStore provides multiple layers of security:
– IPspaces
– Administrative separation
– Protocol separation
– Storage separation

An IPspace has a dedicated routing table. Each physical interface (Ethernet port) or logical interface (VLAN) is bound to a single IPspace.

A single IPspace may have multiple physical and logical interfaces bound to it, and each customer (tenant) has a unique IPspace. Using VLANs or VIFs together with IPspaces is a best practice.
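Putting these pieces together, per-tenant isolation might be configured as follows. This is a sketch assuming 7-Mode syntax; the tenant names, VLAN ID, interface, IP address, and volume path are hypothetical:

    ipspace create ips-tenant1            (dedicated IPspace with its own routing table)
    vlan create e0b 100                   (create VLAN 100 on interface e0b)
    ipspace assign ips-tenant1 e0b-100    (bind the VLAN interface to the tenant's IPspace)
    vfiler create vf-tenant1 -s ips-tenant1 -i 10.10.100.5 /vol/tenant1root
                                          (create the tenant's vFiler inside its IPspace)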

Typical MultiStore use cases include file services consolidation and application hosting.

Always-On Data Mobility

No planned downtime for:
– Storage capacity expansion
– Scheduled maintenance outages
– Software upgrades
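One way this is delivered is by migrating a vFiler unit between physical systems. A sketch assuming 7-Mode syntax, run on the destination system, with hypothetical names:

    vfiler migrate vf1@filer-old   (move vFiler vf1 and its data from filer-old to this system)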

Adding Mobility to Multi-Tenancy

Automated disaster recovery to a DR site is recommended: a vFiler unit can be mirrored to a remote system and activated there if the primary system fails.
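A sketch of such a DR relationship, assuming 7-Mode syntax and run on the DR system, with hypothetical names:

    vfiler dr configure vf1@filer-prod   (mirror vFiler vf1 from the production system to this DR system)
    vfiler dr activate vf1@filer-prod    (bring the mirrored vFiler online after a disaster)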