STORAGE SWITZERLAND
THE BIG DATA ARCHIVE
Big Data is often thought of as a specialized use case
involving machine generated data, typically associated
with web search logs, satellite imagery or other sensor
data, on which analytics are performed to enable some
sort of decision support application. While this is an
important example of a Big Data project, another is the
collection of human generated data that also needs to be
retained, organized and made readily available for data
mining or compliance reasons. An archive is needed to
store both machine and human generated data, and a Big
Data storage solution can be ideal for the job.
Human generated data is essentially file data created from
office productivity applications. These files are the
contracts, designs, proposals, video, audio, images and
analytical summary data that drives the organization. Also
included in this category would be data files generated by
multi-media tools such as video camcorders, mobile
phone cameras, notebook PC microphones, etc. Just like
machine generated data, this data has value and a potentially higher compliance requirement, at least from a
litigation perspective. But unlike machine generated data
which often goes straight to low-cost archival storage, this
human generated file data is frequently stored on
expensive, high performance primary storage over its full
lifespan.
At the point of creation and during early modification these
human generated files can justify being stored on the more
expensive primary storage location where rapid access is
not only essential but expected. Over time though, most
file based data rapidly loses its need for immediacy and
could be more appropriately stored on something cost
effective but maybe not as responsive as primary storage.
However, like old copies of a database, these older files
need to remain somewhat accessible, whether for mining and
compliance or just to ease the minds of users, so the right
destination for them is a secondary tier of storage rather than
offline media.
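As a purely illustrative sketch of what such an age-based tiering pass could look like (the share paths, the 180-day threshold and the copy-then-delete approach are all hypothetical, not a description of any particular archiving product):

import os
import shutil
import time

PRIMARY_ROOT = "/mnt/primary/projects"   # hypothetical primary file share
ARCHIVE_ROOT = "/mnt/archive/projects"   # hypothetical secondary/archive tier
AGE_THRESHOLD_DAYS = 180                 # illustrative "no longer immediate" cutoff

def tier_old_files():
    cutoff = time.time() - AGE_THRESHOLD_DAYS * 86400
    for dirpath, _, filenames in os.walk(PRIMARY_ROOT):
        for name in filenames:
            src = os.path.join(dirpath, name)
            # Use last-access time as the "immediacy" signal; mtime is another option.
            if os.stat(src).st_atime < cutoff:
                rel = os.path.relpath(src, PRIMARY_ROOT)
                dst = os.path.join(ARCHIVE_ROOT, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)   # copy with metadata first...
                os.remove(src)           # ...then reclaim primary capacity

if __name__ == "__main__":
    tier_old_files()

A production archiving or HSM product would typically leave a stub or link behind on the primary tier so users can still reach the file transparently; the sketch omits that detail.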
This secondary storage area is an ideal use case for a disk
archive tier, something designed specifically to store this
type of data cost effectively. Again, this data will be
retained because the organization has to, but also
because it wants to. Companies will mine this data to
provide insight to support future decisions. This mining
requirement means that archiving alone is not enough for
the organization. They need all the capabilities of a Big Data
storage infrastructure, but in a more capacity-centric form.
George Crump, Senior Analyst
The Value of a Big Data Archive
A Big Data Archive brings three specific value areas to the
enterprise. First, similar to a classic archive, it should allow
for the reduction of primary storage consumption and
support growth. According to studies, well over 80% of file
data on most primary storage systems is not in active use
and is therefore wasting the performance capabilities of this
high-cost resource. If this data were moved to a secondary,
high-capacity storage area, but still one with moderate
performance, most subsequent file access could occur without
impact to users. This could have a significant, positive impact
on the IT budget.
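As a purely illustrative calculation: on a 100 TB primary array exhibiting that 80% pattern, roughly 80 TB of file data could be relocated to a capacity-oriented secondary tier, leaving the expensive primary system serving only the 20 TB that is actually in active use.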
Secondary disk tiers have been available for years, as have
software products to classify and move that data. The cost
savings on primary storage alone motivated many users to
move to a two-tier storage infrastructure. But many other
data centers were not so inclined. The Big Data Archive brings
three more motivating points to the equation that should
encourage all data centers to adopt this multi-tier approach to
storage.
The first point is that organizations are beginning to
understand the value of this data and to acknowledge there is
a real desire to retain, categorize and, in the future, mine this
information to help make better business decisions or speed
product development. They are coming to realize that
archiving makes practical sense in the data center and that its
shortcomings are being eliminated by Big Data storage
architectures.
The second motivational point is the need for compliance.
Organizations and litigators are beginning to understand that
retention is more than just making sure email is saved or that
it can be found (discovered). Retention means keeping all the
files that exist in relation to a case as well. In the past this
meant providing boxes of paper documents. Today most
documents are digital and are never printed. Retaining
electronic documents is not only important; it may be the only
way that this information can be preserved as evidence.
Finally, a Big Data Archive is complementary to, and may even be
part of, what was previously considered a separate project.
This makes the cost of adding a Big Data Archive to a current
big data project minimal, or may allow it to be the
foundational component of a future big data project. In short,
by leveraging both initiatives, costs can be contained and
ROI realized sooner.
As a result, a Big Data Archive has unique requirements that a
simple second tier of storage, or even a basic archive
solution, typically cannot meet. Whether from machine or
human generated data sources, a Big Data Archive must
match the compliance capabilities of disk archiving while
meeting requirements like dense scaling, high throughput
and fast retrieval.
Requirements For The Big Data Archive
Density Scaling
Legacy second tier disk systems and even archive systems
both have scaling issues when measured against the Big Data
Archive challenge. The requirement to scale to
Petabytes is now the starting point for many of these
systems. This quickly eliminates single box architectures.
Even legacy scale out storage architectures may not be
suitable for the Big Data Archive challenge. These systems
were designed to add nodes rapidly and as a result their
capacity per node is limited and they quickly consume
available data center floor space. The modern Big Data
Archive will need a very dense architecture to maximize
capacity on a per node basis and not waste that floor space.
In these environments storage (disk drives) has practically
become less expensive than the sheet metal (the other
components in each node) that surrounds it. This makes it
critical to use each node to its full potential before
adding another.
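As an illustrative example only (not a description of any specific node design): a dense 4U node holding 36 high-capacity 3 TB drives provides roughly 108 TB of raw capacity, so about ten such nodes can reach a petabyte in little more than a single rack, while a low-density design with a dozen drives per node would need several times the node count and rack space, along with the extra controllers, power and networking that come with it, to hold the same data.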
High Throughput
Big Data Archives must also have the ability to ingest large
amounts of data quickly. Legacy archive solutions were
designed to have data trickle into them over the course of
time. Big Data Archives may store very large numbers of
different sized files on an ongoing basis. There can be
millions of small files being archived from a traditional
Big Data project or relatively few, very large rich media files
being archived from user projects.
In both cases the ingestion of these files requires that the
receiving nodes encode the data and then segment it to the
other nodes in the cluster. This background work could
cripple legacy archive solutions whose nodes are typically
interconnected via a 1GbE infrastructure. Instead, a higher
speed backbone is required so that additional throughput
can be maintained. Solutions like Isilon's NL Scale Out NAS
connect via an internal InfiniBand backbone for very high
throughput performance, enabling them to sustain ingest
rates that match the requirements of a Big Data Archive.
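A rough back-of-the-envelope comparison makes the point; the 50 TB ingest job and the usable-bandwidth figures below are assumptions chosen for illustration, not measured or vendor-published numbers:

# Rough ingest-time comparison for an illustrative 50 TB archive job.
# Bandwidth values are assumed usable throughput, not vendor specifications.
INGEST_TB = 50

links = {
    "1 GbE backbone": 0.1,      # ~0.1 GB/s usable
    "10 GbE backbone": 1.0,     # ~1 GB/s usable
    "InfiniBand fabric": 3.0,   # ~3 GB/s usable (illustrative)
}

for name, gb_per_s in links.items():
    hours = (INGEST_TB * 1000) / gb_per_s / 3600
    print(f"{name}: ~{hours:.0f} hours to ingest {INGEST_TB} TB")

At roughly 139 hours on the slow backbone versus under 5 on the fast fabric, the interconnect, not the drives, becomes the gating factor for ingest.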
Fast Retrieval
Retrieval is also different for the Big Data Archive than it is
for the traditional archive storage system. It may need to
produce thousands or millions of files very quickly or in
some cases it may be desirable to actually perform the
search and analysis on the Big Data Archive itself.
Traditional archive architectures and legacy second tier
storage systems are typically found lacking when asked to
provide data quickly as capacity scales beyond 1 PB. It's
important to remember that archive systems were designed
to provide performance better than the platform they were
replacing, which for most was optical disk.
Big Data Archives operate against a different standard. They
need to provide consistent performance that's comparable
to that of most primary storage systems, no matter the
capacity level. Again, Isilon's NL series surpasses this
expectation and provides near-primary-storage performance
but with the throughput and density that Big Data Archiving
requires.
Protection & Disaster Recovery
Protecting 1PB+ environments requires a change in thinking.
Nightly backups are no longer a reality, not only because of
the size of the solution but also because of the amount of
data that can be ingested at any time. If a large archive job
is submitted, and then later a catastrophic failure occurs, a
significant amount of data could be permanently lost. For
example, in the case of machine generated sensor data, there
may be no way to ever recover it.
Data protection needs to be integrated into the Big Data
Archive and then augmented. First, the system should have no
single points of failure, and users should be able to set
the data protection level by data type. This would
accommodate unrecoverable data, like point-in-time sensor
data, which might need a higher level of redundancy than
traditional file data.
Next, the data needs to be transferred in real time to a
second location via built-in replication tools. That data
again needs to be prioritized based on whether it can be
replaced.
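A hedged sketch of what setting protection level by data type might look like as policy; the class names, protection codes and priorities below are entirely hypothetical and do not correspond to any vendor's actual configuration syntax:

# Hypothetical per-data-class protection policy for a Big Data Archive.
# "redundancy" is the number of failures each class should survive;
# "replicate" and "priority" drive real-time copy to the second site.
PROTECTION_POLICY = {
    "sensor_data":  {"redundancy": "N+4", "replicate": True,  "priority": 1},  # point-in-time, unrecoverable if lost
    "office_files": {"redundancy": "N+2", "replicate": True,  "priority": 2},  # contracts, designs, proposals
    "rich_media":   {"redundancy": "N+2", "replicate": False, "priority": 3},  # can often be re-exported
}

def protection_for(data_class):
    # Default to the most conservative policy when a class is unknown.
    return PROTECTION_POLICY.get(data_class, PROTECTION_POLICY["sensor_data"])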
Finally, there are always some organizations that will want to
move data to an alternate device altogether, even tape, in case
of a regional disaster. The Big Data Archive should have the
ability to add copy-out performance when needed. As an
example, Isilon can add a class of nodes to its cluster,
called backup accelerators, that are specifically designed
to move data to another device. This allows the other nodes
to continue to deliver high throughput and fast retrieval while
the cluster gets its data copied to alternate storage devices.
Summary
The Big Data Archive can be a component of a larger Big
Data project or it can be an archive designed specifically for
Big Data. In either case, leveraging that investment to also
include human generated data that needs to be stored for
mining or compliance reasons is an excellent way to achieve
a greater ROI on the Big Data project. It can also help
discover new ways to make better decisions by retaining
and analyzing existing information.
About Storage Switzerland
Storage Switzerland is an analyst firm focused on the virtualization and storage marketplaces. For more information,
please visit our web site: http://www.storage-switzerland.com
Copyright 2011 Storage Switzerland, Inc. - All rights reserved