© MomentumSI 2011 Understanding Big Data Page 1
White Paper An Overview of Big Data Problems and Solutions
Understanding Big Data: Storage, Analytics and Learning
Table of Contents
Overview
  How Much Data is Big Data?
Big Databases
  Introducing NoSQL Data Stores
  As-A-Service Data Stores
Big Analytics
    Column-Oriented Storage
    Simultaneous Load and Query
    Large Blocks and Compressed Data
    Massively Parallel / Share Nothing Architecture
  MapReduce and Hadoop Solutions
Big Learning
  Big Learning Technology
Our Solutions
Overview
This paper provides an overview of the exponential growth of data in organizations and
the need for businesses to extract value from it. Big Data represents the next phase of
data management, both from an opportunity perspective and a delivery capability view.
Organizations will be challenged to use new techniques to capture data, analyze it and
mine it for gems.
How Much Data is Big Data?
An interesting question is, “How much data is big data?” Although there is no right answer, there are some commonly held views. First, it’s worth noting that what was big just a few years ago is no longer considered big today. For example, at the time of this writing, a general-purpose 1-terabyte hard drive costs about $70 online. Increases in storage density, I/O operations and throughput have enabled the use of cheap disks to store massive amounts of information. Although views differ, most people agree that Big Data starts in the tens of terabytes but becomes more interesting and appropriate as the data approaches hundreds or even thousands of terabytes (that is, a petabyte or more).

Some have pointed out that they’re not sure whether they have enough data to warrant a Big Data approach. Many of the Big Data technologies are so disruptive that they are also being applied to “Small and Medium Data.” In some ways, the term “Big Data” is unfortunate in that it implies that the techniques only work with the largest of data sets. The truth is that it depends on which aspects and technologies you’re working with. Some solutions may be overkill for the problem at hand; however, MomentumSI sees many of the Big Data techniques being used at small- and medium-sized shops.
Big Databases
Big Data deals with more than just the amount of storage. One of the key considerations of a Big Data database is the ability to on-board data at a much more rapid pace than its predecessors. This feature is often called “streaming” because the data comes in a fluid, often non-stop motion. Consider the case presented by Associate Professor Deb Roy, head of the MIT Media Lab’s Cognitive Machines research group:

Roy is recording nearly all of his new son’s waking hours in an ambitious attempt to use these data to unravel the mystery of how humans naturally acquire language within the context of their primary social setting. He will pay particular attention to the role of physical and social context in how his son, 9 months old, learns early words and early grammatical constructions. Roy’s vast recording and analysis effort, known as “The Human Speechome Project” (speech + home), will yield some 400,000 hours of audio and video data over three years. Roy and his wife have already gathered more than 300 gigabytes per day of compressed data by recording an average of 12-14 hours a day.

“The data storage requirements of the Human Speechome Project present challenges that cannot be easily addressed with conventional storage technologies. Basic requirements include high-performance reads/writes in excess of 160 gigabits/second, massive shared volumes in excess of several hundred terabytes, and smooth scalability from an initial 50 terabytes to capacity well in excess of a petabyte. Additional requirements include 100 percent data redundancy, file access by computers running multiple operating systems, a fully virtualized storage fabric, and affordability using low cost, high capacity SATA hard drives.”

Obviously, capturing so much real-time information at such a rapid pace requires a
rethinking of the approach and technologies.
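As a rough check on the scale of the figures quoted above, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes recording happens every day of the year at the quoted daily rate and uses decimal units, neither of which is stated in the source.

```python
# Back-of-the-envelope estimate based on the figures quoted for the project.
GB_PER_DAY = 300       # "more than 300 gigabytes per day" (quoted figure)
YEARS = 3              # "over three years" (quoted figure)
DAYS_PER_YEAR = 365    # assumption: recording every day of the year

total_gb = GB_PER_DAY * DAYS_PER_YEAR * YEARS
total_tb = total_gb / 1000  # assumption: decimal units, 1 TB = 1,000 GB

print(f"{total_tb:.1f} TB")  # 328.5 TB
```

The result, a few hundred terabytes of compressed recordings alone, is in line with the “several hundred terabytes” of shared volume mentioned in the requirements.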
This is the new world of Big Data. Individuals and organizations are taking advantage of cheap
storage and commodity computing cycles to solve problems that would not have been within
their reach only a few years ago.
The need to capture information in rapid succession has driven the requirements for Big Data
databases. These requirements have produced an array of new databases, many of which do not
conform to the classic SQL standard and are dubbed “NoSQL.”
Introducing NoSQL Data Stores
The need to capture new kinds of information, and at rates previously unheard of, drove a series
of exciting introductions. Recent advances include:
- Document-Oriented Databases – focus on capturing data even when you don’t have a
predefined schema for what is going to be needed. Hence, the system needs to
allow one to easily grow the document structure by adding new “leaves” (or extensions).
These systems often scale by using read-only replicas and auto-sharding. Examples
include MongoDB and CouchDB.
- Key/Value Stores – focus on caching small pieces of information to either memory or
disk-backed memory. Because the focus is really caching, little emphasis is placed on
query languages. Examples include Memcached, Redis and Project Voldemort.
- Peer Data Stores – focus on scaling data across multiple servers (or peers) in order to
alleviate potential read or write bottlenecks on any one server. In addition, by creating
replicas of data on multiple servers, this method provides built-in redundancy,
increasing the availability of the system. A key goal of the Peer Data Store is to provide
near-linear horizontal scaling. Examples include Cassandra, Riak and HBase.
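The idea behind spreading keys and their replicas across peers can be sketched as follows. This is a minimal illustration only, not the placement scheme of any product named above: the function `peers_for_key`, the node names and the simple modulo hashing are all assumptions made for the example.

```python
import hashlib

def peers_for_key(key: str, peers: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` distinct peers for a key by hashing it onto the peer list."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(peers)
    # Place the key on the chosen peer and on the peers that follow it.
    return [peers[(start + i) % len(peers)] for i in range(replicas)]

peers = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical server names
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", peers_for_key(key, peers))
```

Because every client computes the same placement from the key alone, reads and writes are spread over the peers without a central lookup. Production systems use more elaborate schemes (consistent hashing, for example) so that adding a server moves only a fraction of the keys.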
Many of the databases that were developed for special use cases chose not to implement the
SQL interface and collectively became known as “NoSQL” solutions. However, many of them
have now gone back and implemented a SQL access layer on top of their solution to make it
easier for developers to use. In some cases though, like Key/Value stores, it just didn’t make
sense to force SQL on their model.
As-A-Service Data Stores
In addition to the data stores mentioned above, there is another movement towards delivering
software and data in an on-demand, over-the-network, model called “as a Service.” This model
is enabled by cloud computing where resources are opaquely provisioned based on user
demand. Popular options found here include:
- BLOB Storage – BLOBs (Binary Large Objects) are just contiguous streams of bits ranging
from a few bits long to gigabytes in length. BLOBs are typically things like images,
videos, software files or documents. Amazon provides a BLOB system called S3 (Simple
Storage Service), while Microsoft offers a similar service called “Azure BLOB Storage.”
Similar features are now being found in private clouds: Eucalyptus provides Walrus,
while OpenStack offers OpenStack Object Storage.
- Relational Database as a Service – Since so many companies have their databases
rooted in the relational world, there is tremendous pressure to preserve the model that
their systems currently use and that their staff is productive at using. The focus of Relational
Database as a Service is to offer developers a model that they are used to, but to make the
access, operations and scaling of the database much easier than before. This is
accomplished by having a service that gives developers an on-demand model while also
providing automated scaling, automated backup and automated recovery of the system.
Examples include Amazon’s Relational Database Service, Microsoft SQL Azure Database and
Salesforce’s Database.com.
In addition to the categories mentioned, other popular services include Amazon SimpleDB
and Google’s Bigtable. These services fall into the Peer Data Store category, but in this case
they are delivered in the as-a-Service model. It is worth noting that both offerings are considered
significant advances in the way they store and process large volumes of data. They
serve as the inspiration for many of the open-source databases that have been available under
traditional licenses, but that are now being introduced into the as-a-Service world to compete
with the large Internet companies.
Big Analytics
Big Analytics refers to the need to analyze our large data stores using new tools and
techniques. This kind of analytics is driven by a person using a report-writing tool. We
point this out because other types of analytics (such as the Big Learning discussed later
in this paper) are now driven by automated routines that use machine learning to make
recommendations and the like.
It wasn’t so long ago that running data analysis routines across tens or hundreds of gigabytes
was a real challenge. As the size of the data continues to grow, so does the need to use
new techniques to solve the problems. Over the past few years we’ve seen two trends
emerge:
- Use of proprietary MPP (massively parallel processing) techniques
- Use of MapReduce (a standardized MPP technique)
Both solutions are viable techniques for analyzing large datasets, as they focus on
decomposing very large problems into smaller units. Recently, however, there is
growing interest in using MapReduce as a common foundation for batch jobs, data
analytics, data mining and more.
“Big Database” – Organizational Call-To-Action
The efficient use of data has always been the heart of competitive advantage. Today, organizations must revisit their data collection strategy. Although the technology enables cost-effective and simplified on-boarding and processing, this doesn’t mean “collecting data for the sake of collecting data” – it means experimenting with large data sets to understand if business advantage can be established. In addition, companies should review the current use of databases in their organization. The modern data stores that are available today have significant cost advantages over their predecessors.

First, it’s important to understand the key characteristics of modern data analytics
platforms:
Column-Oriented Storage
Huge efficiency gains are available if a hard drive is able to read contiguous data. Many analytics requests enable this type of read, hence the associated gains. Relational databases have traditionally stored data in rows, for instance:
1, Bill Smith, 9900 Main Street
2, John Carter, 8804 Vine Street
3, Tim Jones, 505 1st Street
In a column-oriented database, all of the values of each column are serialized together:
1, 2, 3;
Bill Smith, John Carter, Tim Jones;
9900 Main Street, 8804 Vine Street, 505 1st Street;
In general, row-oriented storage is more commonly found in transactional relational database systems, while column-oriented storage is common in OLAP / data warehousing.
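The difference between the two layouts can be sketched in Python using the same three records. This is an in-memory illustration only; the dictionary-of-lists structure and the column names `id`, `name` and `address` are assumptions for the example, not how any particular database stores data on disk.

```python
# Row-oriented layout: each record is stored together.
rows = [
    (1, "Bill Smith", "9900 Main Street"),
    (2, "John Carter", "8804 Vine Street"),
    (3, "Tim Jones", "505 1st Street"),
]

# Column-oriented layout: one contiguous list per column.
ids, names, addresses = (list(col) for col in zip(*rows))
columns = {"id": ids, "name": names, "address": addresses}

# An analytic query that needs only one column touches only that list.
print(columns["name"])  # ['Bill Smith', 'John Carter', 'Tim Jones']

# Reassembling one full record requires visiting every column.
print(tuple(columns[c][1] for c in ("id", "name", "address")))
# (2, 'John Carter', '8804 Vine Street')
```

The trade-off is visible even at this scale: scanning a single column is a sequential read, while fetching a whole record means one lookup per column.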
Simultaneous Load and Query
Historically, organizations had a choice: Do you want to load new data into the warehouse – or do you want to query data that’s already in there? Naturally, customers answered, “both.” Vendors have responded with solutions to the problem by offering what is often called, “real-time analytics.” Fundamentally, it enables incremental data changes to be added on the fly without having to take down the system and execute a massive reload.
Large Blocks and Compressed Data
Since data analytics deals with very large data sets, it’s necessary to manage the data in a very efficient way. Modern analytics solutions typically work on large blocks of data (tens or hundreds of megabytes) in a single pass. In addition, there is a need to compress the data, typically reducing its size by 3X to 10X. Naturally, compression must be lossless so that it does not affect the results of the data analysis routines.
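A small sketch of block compression using Python's standard `zlib` module follows. The `status_column` data is hypothetical and deliberately repetitive, so the ratio it prints will be far higher than the 3X to 10X typical of real data; the point is only that a block of similar values compresses well and decompresses to exactly the original bytes.

```python
import zlib

# Hypothetical column of repetitive values, as often found in analytic data.
status_column = ["shipped", "shipped", "pending", "shipped"] * 25_000
block = "\n".join(status_column).encode("utf-8")

compressed = zlib.compress(block)
ratio = len(block) / len(compressed)

print(f"raw={len(block)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.0f}x")

# Compression is lossless: analysis routines see exactly the original values.
assert zlib.decompress(compressed) == block
```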
Massively Parallel / Share Nothing Architecture
“Divide-and-conquer” is a time-tested technique for breaking large problems into smaller ones. Massively parallel processing applies it by splitting a large data set into parts and passing the local processing of each part to an independent server. This enables multiple servers to work on different parts of the problem in parallel, expediting the overall execution. A side benefit of this approach is the inherent resilience gained by not relying on a single system to do all the work. If designed using a Share Nothing Architecture, an MPP system has no shared critical outage point (that is, no single point of failure).
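The divide-and-conquer pattern can be sketched on a single machine as follows. This is a simplified stand-in, not an MPP system: worker threads play the role of independent servers, and the helper names `partition` and `local_sum` are assumptions for the example. In a real cluster each slice would live on, and be processed by, a separate node.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, parts):
    """Split data into `parts` nearly equal, non-overlapping slices."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def local_sum(chunk):
    """Work done independently by one worker on its own slice."""
    return sum(chunk)

data = list(range(1, 1_000_001))
chunks = partition(data, parts=4)

# Each worker processes its own slice; no state is shared between them.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(local_sum, chunks))

total = sum(partial_sums)  # combine the partial results
print(total)  # 500000500000
```

Because each worker touches only its own slice, a failed slice could be rerun on another worker without affecting the others, which is the resilience property described above.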
MapReduce and Hadoop Solutions
MapReduce is a software technique introduced by Google to enable massively parallel
processing of very large data sets across clusters of computers. Hadoop is an open-
source implementation of the MapReduce technique written in Java and available from
Apache. Fundamentally, Hadoop enables very large data sets to be spread over a
distributed file system so that clusters of commodity computers can process the data. It
is now common for data-intensive shops, including enterprise IT organizations, to either
run their own Hadoop cluster or to leverage one from an external cloud provider (e.g.,
Amazon offers Elastic MapReduce).
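The MapReduce technique itself can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are assumptions for the example. In a real cluster the map and reduce calls would run on many machines, with the framework handling the grouping step in between.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["big data is big", "data is everywhere"]  # hypothetical input splits

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Each call to `map_phase` depends only on its own document, and each call to `reduce_phase` depends only on one key, which is what allows the work to be spread across a cluster.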
“Hadoop has been instrumental in enabling ‘agile’ data analysis. In software development,
‘agile practices’ are associated with faster product cycles, closer interaction between
developers and consumers, and testing. Traditional data analysis has been hampered by
extremely long turn-around times. If you start a calculation, it might not finish for hours, or
even days. But Hadoop (and particularly Elastic MapReduce) make it easy to build clusters that
can perform computations on long datasets quickly. Faster computations make it easier to test
different assumptions, different datasets, and different algorithms. It’s easier to consult with
clients to figure out whether you’re asking the right questions, and it’s possible to pursue
intriguing possibilities that you’d otherwise have to drop for lack of time.” – An O’Reilly Radar
Report: What is Data Science?

Organizations that have committed to the MapReduce and Hadoop model are now
interested in using their current clusters to solve Big Analytics problems. Success stories
from companies like Yahoo! have indicated that the size of the data that can be
processed exceeds the petabyte mark – with the ability to run analytics routines on
thousands of commodity servers simultaneously. In other words, it’s a proven model
that has open industry support and a tremendous amount of momentum behind the
movement. For this reason, virtually all of the major analytics vendors have had to
readdress their execution model as customers are questioning the use of specialized
hardware, proprietary architectures and, often, exceedingly high maintenance and
operations costs:
- HP/Vertica has announced the “Vertica Connector for Hadoop”
- EMC/Greenplum has “Greenplum MapReduce”
- IBM has “InfoSphere BigInsights” (for Hadoop)
- Aster Data has “Aster Hadoop Data Connector”
In addition to the mainstream players, the Apache Software Foundation has also released
“Hive,” an open-source analytics tool designed specifically to run on Hadoop. Although early
in its maturity cycle, Hive shows significant promise and is sure to spawn a number of
complementary or even competitive projects, all aiming to bring a simple but powerful model of
Big Analytics to the Hadoop world.
“Big Analytics” – Organizational Call-To-Action
Over the last decade, we’ve witnessed a strong move from very large and expensive symmetric multiprocessor (SMP) boxes to more cost-effective MPP solutions. The new push is to avoid using dedicated “analytics” equipment and to move toward a more reusable and scalable clustered approach using Hadoop on either bare metal or Infrastructure-as-a-Service. Organizations should revisit their strategy to understand the pros and cons of each model and determine the solution that best fits their needs.
Big Learning
By “Big Learning,” we’re referring to the use of artificial intelligence techniques (with an
emphasis on Machine Learning) on very large data sets. The scenarios for which mining
approaches are used vary widely. The techniques are applied to virtually any field (business,
engineering, core research, etc.) to identify better ways to solve a problem. Examples include:
- People who bought X are likely to also buy Y
- The characteristics of fraud are X: Does the transaction look like X? (where X changes)
- A tumor looks like X: Is this a picture of a tumor?
Computing power has become significantly cheaper, and the ability to easily store and process data
on large clusters is readily available. In addition, artificial intelligence routines have become
more mainstream. Once, only the largest companies could afford the kinds of computers
necessary to perform these complex calculations. Only the richest companies could afford to
hire the data scientists needed to manage the process. Today, most students graduating with a
computer science degree have the requisite skills to be productive by using off-the-shelf
software and a credit card to access an on-line cloud computing cluster.
It wasn’t so long ago that most companies didn’t have a “warehousing / analytics group” or even
a “database group.” Now, it’s more common to see organizations with a data mining or
machine learning group. This function has moved from an interesting research area to a core
function for driving competitive business advantage.
Big Learning Technology
From a technology perspective, a variety of software systems and algorithms are used. Many of
the solutions are prepackaged into specialized solutions. For instance, there are several packages
on the market that help companies determine if someone is attempting to hack into their
network (intrusion detection). Most of the advanced systems use some kind of machine learning
algorithm to identify these rogue patterns. Recommendation engines are also commonly found
in bundled solutions. These solutions are an easy way to get started, but often come up short
when the required intelligence surpasses their built-in functionality or the sheer amount of data
exceeds their ability to analyze it.
General-purpose machine learning routines are now becoming more widespread to
accommodate the large data sets and need for custom routines. In these cases, companies will
often use their existing Hadoop cluster to store and analyze large volumes of data but will apply
new machine learning routines against the data. This enables them to share the costs associated
with Big Database and Big Analytics, both from an infrastructure and operations perspective.
One of the more interesting projects for working with machine learning algorithms is Apache
“Mahout.” Like other Big Data solutions, it leverages Hadoop to provide the backbone for
analysis.
“Currently Mahout supports mainly four use cases: Recommendation mining takes users'
behavior and from that tries to find items users might like. Clustering takes e.g. text documents
and groups them into groups of topically related documents. Classification learns from existing
categorized documents what documents of a specific category look like and is able to assign
unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set
of item groups (terms in a query session, shopping cart content) and identifies, which individual
items usually appear together.” – Mahout.Apache.Org Website
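To make the last of these use cases concrete, frequent itemset mining in its simplest form (counting pairs of items that appear together) can be sketched as follows. This is a small in-memory illustration, not Mahout's API or its distributed algorithm; the shopping carts and the `min_support` threshold are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping carts; each set is one transaction.
carts = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

pair_counts = Counter()
for cart in carts:
    # Sort so that each pair is always counted under the same key.
    pair_counts.update(combinations(sorted(cart), 2))

min_support = 2  # assumption: keep pairs seen in at least two carts
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)  # {('bread', 'milk'): 3}
```

The same counting idea underlies the “people who bought X are likely to also buy Y” scenario; at Big Data scale the counting is distributed across a cluster rather than done in one process.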
“Big Learning” – Organizational Call-To-Action
Organizations should identify the areas where “Big Learning” will impact their business. Create a plan to incrementally grow out an internal capability related to this field. Assume that large amounts of data will be collected and analyzed via commodity compute clusters. Build out the necessary computing environment to harvest the intelligence and train up staff on the new model.
Our Solutions
MomentumSI helps organizations to take advantage of Big Data concepts and to grow
their internal capabilities.
Hadoop Fast Start
Our Fast Start program is designed for organizations that are new to Hadoop and want
to quickly implement a state-of-the-art cluster. Momentum offers our Tough Hadoop
Distribution which binds to popular private cloud (IaaS) technologies like Eucalyptus and
vCloud Director from VMware.
Big Database Projects
Selecting and implementing the right Big Database can make the difference between
success and failure. MomentumSI’s consultants are experts in Big Database
implementations and bring their consultative experience to your problem.
Big Analytics Projects
Momentum provides the consulting expertise to turn your big analytics problem into a
manageable solution. Our solutions leverage Hadoop and industry-leading packaged
solutions to deliver analytical solutions on commodity compute stacks at a fraction of
the cost of traditional infrastructures.
Big Learning Projects
Machine learning on Big Data represents the next wave of innovation. Momentum
provides consulting expertise on the use of Apache Mahout on Hadoop.
About MomentumSI
MomentumSI is a leading IT services and solutions company focused on enterprise
transformation. It helps organizations quickly and cost-efficiently adopt innovative, agile
practices to align business needs with IT processes. MomentumSI specializes in helping
companies incorporate disruptive technologies, including Cloud Computing, DevOps, BPM and
SOA. Industries served include financial services, insurance, healthcare, pharmaceuticals, high-
tech, retail and manufacturing. Founded in 1997, MomentumSI is a privately held company that
operates globally with headquarters in Austin, Texas and offices in San Francisco, Washington
D.C., New York and Sydney. For more information, contact [email protected], call
1-888-886-8560 or visit our website at http://www.MomentumSI.com.
MomentumSI and Tough are brand names of Momentum Software, Inc. All other brand names and product names
are trademarks or registered trademarks of their respective companies.