White Paper An Overview of Big Data Problems and Solutions Understanding Big Data: Storage, Analytics and Learning


This paper provides an overview of the exponential growth of data in organizations and the need for businesses to extract value from it. Big Data represents the next phase of data management, from both an opportunity perspective and a delivery capability view. Organizations will be challenged to use new techniques to capture data, analyze it and mine it for gems.

How Much Data is Big Data?

An interesting question is, “How much data is big data?” Although there is no right answer, there are some commonly held views. First, it’s worth noting that what was big just a few years ago is no longer considered big today. For example, at the time of this writing, a general-purpose 1 terabyte hard drive costs about $70 online. Increases in storage density, I/O operations and throughput have enabled the use of cheap disks to store massive amounts of information. Although views differ, most people agree that Big Data starts in the tens of terabytes but becomes more interesting and appropriate as the data approaches hundreds of terabytes or even thousands (that is, petabytes). Some organizations are not sure they have enough data to warrant a Big Data approach. However, many of the Big Data technologies are so disruptive that they are being applied to “Small and Medium Data.” In some ways, the term “Big Data” is unfortunate in that it implies the techniques only work with the largest of data sets. The truth is that it depends on which aspects and technologies you’re working with. Some solutions may be overkill for the problem at hand; however, as a rule MomentumSI sees many of the Big Data techniques being used at small- and medium-sized shops.

Big Databases

Big Data deals with more than just the amount of storage. One of the key considerations of a Big Data database is the ability to on-board data at a much more rapid pace than its predecessors. This feature is often called “streaming” because the data comes in a fluid, often non-stop motion. Consider the case presented by Associate Professor Deb Roy, head of the MIT Media Lab’s Cognitive Machines research group:

“The data storage requirements of the Human Speechome Project present challenges that cannot be easily addressed with conventional storage technologies. Basic requirements include high-performance reads/writes in excess of 160 gigabits/second, massive shared volumes in excess of several hundred terabytes, and smooth scalability from an initial 50 terabytes to capacity well in excess of a petabyte. Additional requirements include 100 percent data redundancy, file access by computers running multiple operating systems, a fully virtualized storage fabric, and affordability using low cost, high capacity SATA hard drives.”

Roy is recording nearly all of his new son’s waking hours in an ambitious attempt to use these data to unravel the mystery of how humans naturally acquire language within the context of their primary social setting. He will pay particular attention to the role of physical and social context in how his son, 9 months old, learns early words and early grammatical constructions. Roy’s vast recording and analysis effort, known as “The Human Speechome Project” (speech + home), will yield some 400,000 hours of audio and video data over three years. Roy and his wife have already gathered more than 300 gigabytes per day of compressed data by recording an average of 12-14 hours a day.

Obviously, capturing so much real-time information at such a rapid pace requires a rethinking of the approach and technologies.

This is the new world of Big Data. Individuals and organizations are taking advantage of cheap

storage and commodity computing cycles to solve problems that would not have been within

their reach only a few years ago.

The need to capture information in rapid succession has driven the requirements for Big Data

databases. The various needs have driven an array of new databases, many of which do not

conform to the classic SQL standard and are dubbed “NoSQL.”

Introducing NoSQL Data Stores

The need to capture new kinds of information, and at rates previously unheard of, drove a series

of exciting introductions. Recent advances include:

- Document-Oriented Databases – focus on capturing data even when you don’t have a predefined schematic view of what is going to be needed. Hence, the system needs to allow one to easily grow the document structure by adding new “leaves” (or extensions). These systems often scale by using read-only replicas and auto-sharding. Examples include MongoDB and CouchDB.

- Key/Value Stores – focus on caching small pieces of information in either memory or disk-backed memory. Because the focus is really caching, little emphasis is placed on query languages. Examples include Memcached, Redis and Project Voldemort.

- Peer Data Stores – focus on scaling data across multiple servers (or peers) in order to alleviate potential read or write bottlenecks on any one server. In addition, by creating replicas of data on multiple servers, this method provides built-in redundancy, increasing the availability of the system. A key goal of the Peer Data Store is to provide near-linear horizontal scaling. Examples include Cassandra, Riak and HBase.
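The Peer Data Store idea of spreading keys across servers and keeping replicas can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Cassandra, Riak or HBase actually place data: the `owners` function, the `PeerStore` class and the simple hash-modulo placement are all hypothetical, and production systems use more elaborate schemes such as consistent hashing or range partitioning.

```python
import hashlib


def owners(key, nodes, replicas=2):
    """Pick the nodes that hold `key`: a hashed primary plus its successors.

    Hypothetical placement rule (hash modulo node count), chosen for brevity.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


class PeerStore:
    """Toy in-memory stand-in for a cluster of peer servers."""

    def __init__(self, nodes, replicas=2):
        self.nodes = list(nodes)
        self.replicas = replicas
        # One dictionary per node simulates that node's local storage.
        self.data = {node: {} for node in self.nodes}

    def put(self, key, value):
        # Write the value to every replica that owns the key.
        for node in owners(key, self.nodes, self.replicas):
            self.data[node][key] = value

    def get(self, key, down=()):
        # Read from the first owning node that is not marked as down.
        for node in owners(key, self.nodes, self.replicas):
            if node not in down:
                return self.data[node].get(key)
        raise RuntimeError("all replicas for this key are unavailable")


store = PeerStore(["node-a", "node-b", "node-c"])
store.put("user:1", {"name": "Bill Smith"})
primary = owners("user:1", store.nodes)[0]
# The read still succeeds when the primary is treated as unavailable.
value = store.get("user:1", down={primary})
```

Because each key lives on two of the three simulated nodes, losing any single node leaves every key readable, which is the redundancy benefit described above.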

Many of the databases that were developed for special use cases chose not to implement the

SQL interface and collectively became known as “NoSQL” solutions. However, many of them

have now gone back and implemented a SQL access layer on top of their solution to make it

easier for developers to use. In some cases though, like Key/Value stores, it just didn’t make

sense to force SQL on their model.

As-A-Service Data Stores

In addition to the data stores mentioned above, there is another movement towards delivering

software and data in an on-demand, over-the-network, model called “as a Service.” This model

is enabled by cloud computing where resources are opaquely provisioned based on user

demand. Popular options found here include:

- BLOB Storage – BLOBs (Binary Large Objects) are just contiguous streams of bits ranging from a few bits to gigabytes in length. BLOBs are typically things like images, videos, software files or documents. Amazon provides a BLOB system called S3 (Simple Storage Service), while Microsoft offers a similar service called “Azure Blob Storage.” Similar features are now being found in private clouds: Eucalyptus provides Walrus, while OpenStack offers OpenStack Object Storage.

- Relational Database as a Service – Since so many companies have their databases rooted in the relational world, there is tremendous pressure to preserve the model that their systems currently use and that their staff is productive with. The focus of Relational Database as a Service is to offer developers a model that they are used to, while making the access, operations and scaling of the database much easier than before. This is accomplished by having a service that gives developers an on-demand model while also providing automated scaling, automated backup and automated recovery of the system. Examples include Amazon’s Relational Database Service, Microsoft SQL Azure Database and Salesforce’s Database.com.

In addition to the categories mentioned, other popular services include Amazon SimpleDB and Google’s Bigtable. These services fall into the Peer Data Store category but are delivered in the as-a-Service model. Both are considered significant advances in the way they store and process large volumes of data. They serve as the inspiration for many of the open-source databases that have been available under traditional licenses, but that are now being introduced into the as-a-Service world to compete with the large Internet companies.

Big Analytics

Big Analytics refers to the need to analyze our large data stores using new tools and

techniques. This kind of analytics is driven by a person using a report-writing tool. The

reason for pointing this out is that other types of analytics (like Big Mining) are now

being driven by automated routines using machine learning to make recommendations

and the like.

It wasn’t so long ago that running data analysis routines across tens or hundreds of gigabytes was a real challenge. As the size of the data continues to grow, so does the need to use new techniques to solve the problems. Over the past few years we’ve seen two trends:


- Use of proprietary MPP (massively parallel processing) techniques

- Use of MapReduce (a standardized MPP technique)

Both solutions are viable techniques for analyzing large datasets, as they focus on

decomposing very large problems into smaller units. Recently, however, there is

growing interest in using MapReduce as a common foundation for batch jobs, data

analytics, data mining and more.

First, it’s important to understand the key characteristics of modern data analytics.


“Big Database” – Organizational Call-To-Action

The efficient use of data has always been at the heart of competitive advantage. Today, organizations must revisit their data collection strategy. Although the technology enables cost-effective and simplified on-boarding and processing, this doesn’t mean “collecting data for the sake of collecting data.” It means experimenting with large data sets to understand whether business advantage can be established. In addition, companies should review the current use of databases in their organization. The modern data stores that are available today have significant cost advantages over their predecessors.

Column-Oriented Storage

Huge efficiency gains are available if a hard drive is able to read contiguous data. Many analytics requests enable this type of read, hence the associated gains. Relational databases have traditionally focused on storing data in rows, for instance:

1, Bill Smith, 9900 Main Street
2, John Carter, 8804 Vine Street
3, Tim Jones, 505 1st Street

In a column-oriented database, the values of each column are serialized together:

1, 2, 3;
Bill Smith, John Carter, Tim Jones;
9900 Main Street, 8804 Vine Street, 505 1st Street;

In general, row-oriented storage is more commonly found in relational database systems, while column-oriented storage is common in OLAP / data warehousing.
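The two layouts can be compared with a small sketch that uses the same three sample records. The `to_columns` helper and the column names `id`, `name` and `address` are illustrative assumptions; a real column store serializes each column to disk rather than keeping Python lists in memory.

```python
# Row-oriented layout: each record is stored together.
rows = [
    (1, "Bill Smith", "9900 Main Street"),
    (2, "John Carter", "8804 Vine Street"),
    (3, "Tim Jones", "505 1st Street"),
]


def to_columns(records, names):
    """Pivot row tuples into one list per column (hypothetical helper)."""
    columns = {name: [] for name in names}
    for record in records:
        for name, value in zip(names, record):
            columns[name].append(value)
    return columns


# Column-oriented layout: all values of one attribute are stored together.
columns = to_columns(rows, ["id", "name", "address"])

# A query that needs only one attribute scans a single contiguous list
# instead of stepping through every full record.
names_only = columns["name"]
```

An analytic query such as "list all customer names" touches only `columns["name"]`, which is the contiguous read that the section describes.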

Simultaneous Load and Query

Historically, organizations had a choice: Do you want to load new data into the warehouse – or do you want to query data that’s already in there? Naturally, customers answered, “both.” Vendors have responded with solutions to the problem by offering what is often called, “real-time analytics.” Fundamentally, it enables incremental data changes to be added on the fly without having to take down the system and execute a massive reload.

Large Blocks and Compressed Data

Since data analytics deals with very large data sets, it’s necessary to manage the data in a very efficient way. Modern analytics solutions typically work on large blocks of data (tens or hundreds of megabytes) in a single pass. In addition, there is a need to be able to compress the data, typically reducing the size by 3X to 10X. Naturally, data compression can’t negatively affect the actual data analysis routines.
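The requirement that compression must not affect analysis amounts to a lossless round trip: a block is compressed for storage and decompressed to identical bytes before it is analyzed. The sketch below shows this with Python's standard `zlib` module. The sample column of state codes and the `compress_block` helper are assumptions for illustration, and the ratio achieved on such repetitive sample data says nothing about real workloads.

```python
import zlib


def compress_block(values):
    """Serialize a block of column values and compress it losslessly."""
    raw = "\n".join(values).encode("utf-8")
    packed = zlib.compress(raw, 6)  # 6 is zlib's usual speed/size trade-off
    return raw, packed


# Hypothetical column block: a low-cardinality attribute repeated many times.
state_column = ["TX", "CA", "NY", "TX"] * 25000
raw, packed = compress_block(state_column)

# The analysis routine sees exactly the original bytes after decompression.
restored = zlib.decompress(packed)
texas_rows = restored.decode("utf-8").split("\n").count("TX")
```

Columns with few distinct values tend to compress well, which is one reason compression is often paired with column-oriented storage.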

Massively Parallel / Shared-Nothing Architecture

“Divide-and-conquer” is a time-tested technique to break large problems into smaller ones. Massively parallel processing breaks large problems into smaller ones and passes the local processing of a data set to an independent server. This enables multiple servers to work on different parts of the problem in parallel, expediting the overall execution. A side benefit of this approach is the inherent resilience gained by not relying on a single system to do all the work. If designed using a Shared-Nothing Architecture, an MPP system has no component whose outage stops the whole system (that is, no Single Point of Failure).
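The divide-and-conquer pattern can be sketched as follows: the data is partitioned, each worker computes a partial result using only its own partition, and the partial results are merged at the end. This is an illustration under assumptions, not an MPP product's design. Threads on one machine stand in for independent servers, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, workers):
    """Split the data into one slice per worker (round-robin)."""
    return [values[i::workers] for i in range(workers)]


def local_aggregate(chunk):
    """Work done by one worker using only its own partition."""
    return sum(chunk), len(chunk)


def parallel_mean(values, workers=4):
    chunks = partition(values, workers)
    # Each chunk is processed independently; no worker reads another's data.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_aggregate, chunks))
    # Merge step: combine the small partial results into the final answer.
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


mean = parallel_mean(list(range(1, 101)))
```

Only the small `(sum, count)` pairs travel back to the coordinator, so the bulk of the data never leaves the worker that owns it.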

MapReduce and Hadoop Solutions

MapReduce is a software technique introduced by Google to enable massively parallel processing of very large data sets across clusters of computers. Hadoop is an open-source implementation of the MapReduce technique written in Java and available from Apache. Fundamentally, Hadoop enables very large data sets to be spread over a distributed file system so that clusters of commodity computers can process the data. It is now common for data-intensive shops, including enterprise IT organizations, to either run their own Hadoop cluster or to leverage one from an external cloud provider (e.g., Amazon offers Elastic MapReduce).
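The shape of a MapReduce job can be shown with the classic word-count example. The sketch below runs in a single Python process and does not use the Hadoop API; the function names are hypothetical. In a real cluster the map and reduce steps run on many machines and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big analytics"])
```

Each call to `map_phase` depends on one document only, and each call to `reduce_phase` depends on one key only, which is what allows the work to be spread across a cluster.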

“Hadoop has been instrumental in enabling ‘agile’ data analysis. In software development, ‘agile practices’ are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) make it easy to build clusters that can perform computations on long datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It’s easier to consult with clients to figure out whether you’re asking the right questions, and it’s possible to pursue intriguing possibilities that you’d otherwise have to drop for lack of time.” – An O’Reilly Radar Report: What is Data Science?

Organizations that have committed to the MapReduce and Hadoop model are now interested in using their current clusters to solve Big Analytics problems. Success stories from companies like Yahoo! have indicated that the size of the data that can be processed exceeds the petabyte mark – with the ability to run analytics routines on thousands of commodity servers simultaneously. In other words, it’s a proven model that has open industry support and a tremendous amount of momentum behind it. For this reason, virtually all of the major analytics vendors have had to readdress their execution model as customers are questioning the use of specialized hardware, proprietary architectures and, often, exceedingly high maintenance and operations costs. For example:

- HP/Vertica have announced the “Vertica Connector for Hadoop”

- EMC/Greenplum have “Greenplum MapReduce”

- IBM has “InfoSphere BigInsights” (for Hadoop)

- Aster Data has “Aster Hadoop Data Connector”

In addition to the mainstream players, the Apache Software Foundation has also released an open source analytics tool designed specifically to run on Hadoop called “Hive.” Although early in its maturity cycle, Hive shows significant promise and is sure to spawn a number of complementary or even competitive projects, all aiming to bring a simple but powerful model of Big Analytics to the Hadoop world.

“Big Analytics” – Organizational Call-To-Action

Over the last decade, we’ve witnessed a strong move from very large and expensive symmetric multiprocessor (SMP) boxes to more cost-effective MPP solutions. The new push is to move away from dedicated “analytics” equipment and toward a more reusable and scalable clustered approach using Hadoop on either bare metal or Infrastructure-as-a-Service. Organizations should revisit their strategy to understand the pros and cons of each model and determine the solution that best fits their needs.

Big Learning

By “Big Learning,” we’re referring to the use of artificial intelligence techniques (with an emphasis on Machine Learning) on very large data sets. The scenarios for which mining approaches are used vary widely. The techniques are applied to virtually any field (business, engineering, core research, etc.) to identify better ways to solve a problem. Examples include:

- People who bought X are likely to also buy Y

- The characteristics of fraud are X: Does the transaction look like X? (where X changes)

- A tumor looks like X: Is this a picture of a tumor?

Computing power has become significantly cheaper, and the ability to easily store and process data on large clusters is readily available. In addition, artificial intelligence routines have become more mainstream. Once, only the largest companies could afford the kinds of computers necessary to perform these complex calculations, and only the richest could afford to hire the data scientists needed to manage the process. Today, most students graduating with a computer science degree have the requisite skills to be productive using off-the-shelf software and a credit card to access an online cloud computing cluster.

It wasn’t so long ago that most companies didn’t have a “warehousing / analytics group” or even

a “database group.” Now, it’s more common to see organizations with a data mining or

machine learning group. This function has moved from an interesting research area to a core

function for driving competitive business advantage.

Big Learning Technology

From a technology perspective, a variety of software systems and algorithms are used. Many of

the solutions are prepackaged into specialized solutions. For instance, there are several packages

on the market that help companies determine if someone is attempting to hack into their

network (intrusion detection). Most of the advanced systems use some kind of machine learning

algorithm to identify these rogue patterns. Recommendation engines are also commonly found

in bundled solutions. These solutions are an easy way to get started, but often come up short

when the required intelligence surpasses their built-in functionality or the sheer amount of data

exceeds their ability to analyze it.

General-purpose machine learning routines are now becoming more widespread to

accommodate the large data sets and need for custom routines. In these cases, companies will

often use their existing Hadoop cluster to store and analyze large volumes of data but will apply

new machine learning routines against the data. This enables them to share the costs associated

with Big Database and Big Analytics, both from an infrastructure and operations perspective.

One of the more interesting projects to use while learning these algorithms is Apache’s “Mahout.” Like other Big Data solutions, it leverages Hadoop to provide the backbone for its processing.


“Currently Mahout supports mainly four use cases: Recommendation mining takes users'

behavior and from that tries to find items users might like. Clustering takes e.g. text documents

and groups them into groups of topically related documents. Classification learns from existing

categorized documents what documents of a specific category look like and is able to assign

unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set

of item groups (terms in a query session, shopping cart content) and identifies, which individual

items usually appear together.” – Mahout.Apache.Org Website
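The frequent itemset use case, and the earlier “people who bought X are likely to also buy Y” example, can be sketched by counting how often pairs of items share a shopping cart. This is a brute-force illustration on invented sample baskets. It is not Mahout's API or algorithm, and the function names and the `min_support` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support=2):
    """Count item pairs that appear together in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def also_bought(item, pairs):
    """Rank the items that most often share a basket with `item`."""
    scores = Counter()
    for (first, second), n in pairs.items():
        if first == item:
            scores[second] += n
        elif second == item:
            scores[first] += n
    return [other for other, _ in scores.most_common()]


# Invented sample data for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "jam"],
]
pairs = frequent_pairs(baskets)
suggestions = also_bought("bread", pairs)
```

Counting every pair grows quickly with basket size, which is why large data sets call for distributed implementations running on a cluster.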

“Big Learning” – Organizational Call-To-Action

Organizations should identify the areas where “Big Learning” will impact their business. Create a plan to incrementally grow out an internal capability related to this field. Assume that large amounts of data will be collected and analyzed via commodity compute clusters. Build out the necessary computing environment to harvest the intelligence and train up staff on the new model.

About MomentumSI

