Business white paper Make all your information matter Hadoop and HP Vertica Analytics Platform

Make all your Information Matter: Hadoop and HP Vertica Analytics

Table of contents

Executive summary

The Big Data problem

Complementary analytics platforms

The best of both worlds: HP Vertica Analytics Platform and Hadoop together

Consider these real-world use cases

Use cases with HBase and HP Vertica Analytics Platform

Strengths and limitations of popular Hadoop components

Key takeaways

To learn more

Executive summary

HP Vertica Analytics Platform and Hadoop are highly complementary systems for Big Data analytics. The HP Vertica Analytics Platform is ideal for interactive real-time analytics, and the Hadoop open-source platform is well suited for batch-oriented data processing.

When used together, the HP Vertica Analytics Platform and Hadoop provide your organization with a powerful set of data analytics capabilities that can do far more than either platform could do on its own. This combination enables your enterprise to extract significantly higher levels of value from massive amounts of structured, unstructured, and semi-structured data.

The Big Data problem

Volume, velocity, and variety create complexity

In today’s information-driven world, your enterprise faces an onslaught of structured, unstructured, and semi-structured data. While conventional business systems continue to swell in size, storage environments are being hit with a hurricane of social media content, audio and video files, email, text messages, image files, documents, transactional information, and more.

In one case in point, large communications networks and their associated switches, billing systems, and service departments generate hundreds of millions of individual call detail records (CDRs) daily. These terabytes of dynamic customer data will continue to grow exponentially as carriers add new services and as IP-based traffic increases.

Already, the number of subscribers to mobile, fixed-line, and cable communications services is growing by millions of people every year. And the volume of CDR, Internet Protocol Detail Record (IPDR), subscriber profile, network probe, and machine-to-machine data that communications companies must store and analyze is expected to grow by 12 to 13% per year.¹
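To put the 12 to 13% annual growth figure in perspective, the short Python sketch below works out how quickly a data volume doubles at those rates. It assumes the growth compounds at a constant rate each year, which the source does not state; the function names are illustrative only.

```python
import math


def years_to_double(annual_growth_rate: float) -> float:
    """Years for a volume to double, assuming constant compound annual growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def projected_volume(start_volume: float, annual_growth_rate: float, years: int) -> float:
    """Volume after a number of years of compound growth."""
    return start_volume * (1 + annual_growth_rate) ** years


for rate in (0.12, 0.13):
    print(f"{rate:.0%} per year: doubles in about {years_to_double(rate):.1f} years")
```

Under this assumption, the stored volume roughly doubles every six years, before counting any new services or data sources.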

To gain value from massive amounts of data, your enterprise needs powerful Big Data analytics tools. These tools go far beyond the capabilities of traditional relational database management systems (RDBMSs), which were designed for online transaction processing (OLTP) and structured data—and not for the volume, velocity, and variety of a world of Big Data.

With these needs in mind, hundreds of organizations are deploying HP Vertica Analytics Platform for interactive real-time analytics and the Hadoop open-source platform for batch-oriented data processing. This combination of complementary data toolsets enables your enterprise to unify your structured, unstructured, and semi-structured data—and make all your information matter.

1 Source: “Market trends: Big Data opportunities in vertical industries,” Gartner, July 2012.

Complementary analytics platforms

Two platforms purpose-built for Big Data

HP Vertica Analytics Platform and Hadoop are highly complementary analytics platforms. Both are modern, scalable, massively parallel processing (MPP) systems built for commodity hardware and low-cost processing of Big Data. While they have some overlapping capabilities, each of the platforms offers unique features that help your organization capitalize on the full range of your data.

HP Vertica Analytics Platform is a real-time analytics database platform purpose-built for Big Data. It consists of a massively parallel database and an extensible analytics framework optimized for fast data analysis—scaling from gigabytes to petabytes. Additionally, HP Vertica Analytics Platform supports standards like SQL, JDBC, ODBC, and R for data analysis. This standards-based versatility makes it easier for your organization to preserve your existing business intelligence (BI) investments.

The Hadoop Distributed File System (HDFS), in turn, is an open-source distributed file system that can serve as an effective storage ground for large amounts of data. Hadoop is extremely efficient at loading any type of data—structured, unstructured, or semi-structured. Hadoop is also well suited for batch processing where immediate interactive analytics are not required.

Complementary, not competitive

HP Vertica Analytics Platform is custom built for high-performance analytics. It is orders of magnitude more efficient than Hadoop in highly analytical use cases. In an internal benchmark comparison, Counting Triangles, the HP Vertica Analytics Platform was 40 times faster than a comparable Hadoop program and 22 times faster than a program written in Pig, a framework that provides a higher-level language to increase developer productivity in Hadoop.

This level of performance can significantly reduce the time required to extract knowledge from your data, creating more business opportunities to monetize your data. That point was underscored in a study that pitted parallel DBMSs against the Hadoop MapReduce framework on a variety of tasks.² The study, "MapReduce and Parallel DBMSs: Friends or Foes?", yielded the results shown in Figure 1 on three tasks.

Unlike Hadoop, HP Vertica Analytics Platform is a next-generation analytical database platform with standard SQL and ACID transaction support, combined with much more advanced analytics and processing capabilities.³ Hadoop is not a database. HP Vertica Analytics Platform also supports popular ecosystems for business intelligence, ETL (extract, transform, load) data warehouses, and data management, including Cognos, MicroStrategy, Tableau, and others.

In general, Hadoop is well suited for long-running batch mode data processing and some analytics, while HP Vertica Analytics Platform is designed purposefully for interactive and real-time analytics as well as data processing.

2 Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., and Rasin, A. MapReduce and Parallel DBMSs: Friends or Foes? In Communications of the ACM, January 2010. http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf

3 For detailed information on the analytics and processing capabilities of HP Vertica Analytics Platform, visit: vertica.com/the-analytics-platform/.

Figure 1: Benchmark performance on a 100-node cluster (elapsed time, in seconds)

Task        Hadoop    HP Vertica
Grep        284s      108s
Web log     1146s     268s
Join        1158s     55s
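The elapsed times in Figure 1 can be turned into speedup ratios with a few lines of Python. This is a minimal sketch: the pairing of each time with a platform and a task is read from the figure labels and should be treated as an assumption, and the dictionary and function names are illustrative.

```python
# Elapsed seconds from Figure 1. The assignment of each value to a task and
# platform is an assumption recovered from the chart labels.
FIGURE_1_SECONDS = {
    "Grep": {"hadoop": 284, "vertica": 108},
    "Web log": {"hadoop": 1146, "vertica": 268},
    "Join": {"hadoop": 1158, "vertica": 55},
}


def speedup(task: str) -> float:
    """How many times faster HP Vertica finished the task than Hadoop."""
    times = FIGURE_1_SECONDS[task]
    return times["hadoop"] / times["vertica"]


for task in FIGURE_1_SECONDS:
    print(f"{task}: {speedup(task):.1f}x")
```

Read this way, the gap is smallest on the Grep task and largest on the Join task.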

A few caveats about Hadoop

Many technology teams, frustrated with the high costs and limitations of traditional row-oriented data warehouses and ETL infrastructures, see Hadoop as a potential replacement for their entire data warehouse infrastructure. Like many open-source solutions, Hadoop, at least at first glance, appears to be a solution for every problem. However, Hadoop has its limitations, including those that come with a document-oriented, batch-oriented system.

Hadoop should be viewed as one step toward an enhanced analytic infrastructure. While it is effective for exploration, Hadoop, unlike HP Vertica Analytics Platform, is not optimized for analysis or performance. In addition, Hadoop creates multiple integration points that can result in a fragile flow of data. Adding new flows is often time-consuming and costly. What’s more, with Hadoop, analytic cycles (the timeframe from receiving raw data to making effective decisions) tend to be longer than they should be. None of the Hadoop tools solves the problem of real-time, ad hoc access to data for your analysts, who specialize in understanding the data—not just the code.

Given its limitations, a Hadoop-only deployment should be thought of as a short-term solution targeted at certain needs, such as low-cost exploratory analysis of unstructured data in a lab setting, or applying structure to semi-structured data (e.g., web logs). Hadoop’s greatest strengths emerge when it is paired with a higher-end analytics platform that delivers the performance and speed needed to keep a business competitive.

Consider total cost of ownership

When you deploy Hadoop into production, maintenance requirements and programming demands can rapidly overwhelm any potential cost advantages. A Hadoop ecosystem typically requires a team of developers who specialize in massively parallel development and create custom code for queries, driving up your total cost of ownership.

In many use cases, HP Vertica Analytics Platform delivers higher performance with less hardware, with less administrative complexity, and with standard SQL and no custom coding—so you don’t need to be a developer to write a query. These factors can equate to a significantly lower total cost of ownership when you choose the HP Vertica Analytics Platform for certain use cases (see the use cases below).

An analytics hub

With the release of HP Vertica 6, HP took Big Data analytics to a new level of usability and performance. Your enterprise can now locate data where it makes sense and access it through your interfaces of choice, including standard SQL, business intelligence tools, or advanced analytics languages, such as R. This universal data access framework, combined with the platform’s massively parallel architecture, enables your organization to gain richer insights into all your data in a much shorter timeframe—minutes, not weeks or months.

The best of both worlds: HP Vertica Analytics Platform and Hadoop together

Using Hadoop and HP Vertica Analytics Platform together gives you more value than you could realize using either of the platforms separately. You get the best of both worlds. And HP makes it easy to connect the two platforms.

Connecting the platforms

To enable the integration of the two platforms, HP offers connectors that allow you to seamlessly move data back and forth between Hadoop and HP Vertica Analytics Platform. With the release of HP Vertica 6, HP helps your organization accelerate your Hadoop environment by using HP Vertica Analytics Platform analytics capabilities broadly across your Hadoop systems. With these connectors in place, your users can choose to load the data upfront or at query time—an invaluable capability for data scientists pursuing data exploration.

HP provides several ways to use Hadoop and Vertica in a complementary manner with support for:

• MapReduce—If you choose to use MapReduce programming, HP provides a bi-directional Hadoop connector that acts as a source and sink for your MapReduce jobs.

• Sqoop—A JDBC connector also works with Sqoop, a tool that enables you to transfer bulk data between Hadoop and your databases.

• Hadoop Distributed File System (HDFS)—HP provides an HDFS connector that allows you to directly load data from HDFS into HP Vertica Analytics Platform tables using the COPY command. HP Vertica Analytics Platform also supports external tables that read data directly from HDFS at query time. This allows you to use HP Vertica SQL and analytics directly on your HDFS data.
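The two HDFS options above, loading upfront versus reading at query time, can be sketched as a pair of SQL statements. The Python helper below only builds the statement text. The `SOURCE Hdfs(url=..., username=...)` clause reflects the HDFS connector syntax as commonly documented for HP Vertica, but treat it as an assumption and check it against the documentation for your release. The table name, columns, host, path, and user are all hypothetical.

```python
from typing import Sequence, Tuple


def hdfs_source(webhdfs_url: str, username: str) -> str:
    """Build the assumed HDFS connector source clause (no quoting or escaping)."""
    return f"SOURCE Hdfs(url='{webhdfs_url}', username='{username}')"


def load_upfront(table: str, webhdfs_url: str, username: str) -> str:
    """Statement that copies the HDFS file into a regular table once."""
    return f"COPY {table} {hdfs_source(webhdfs_url, username)};"


def read_at_query_time(
    table: str, columns: Sequence[Tuple[str, str]], webhdfs_url: str, username: str
) -> str:
    """Statement that defines an external table, read from HDFS on each query."""
    column_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_defs}) "
        f"AS COPY {hdfs_source(webhdfs_url, username)};"
    )


# Hypothetical location and schema, for illustration only.
URL = "http://namenode.example.com:50070/webhdfs/v1/logs/clicks.txt"
print(load_upfront("clicks", URL, "hadoop_user"))
print(read_at_query_time("clicks_ext", [("user_id", "INT"), ("url", "VARCHAR(2048)")], URL, "hadoop_user"))
```

The first statement suits data that will be queried repeatedly; the second suits exploration, where the data stays in HDFS until a query needs it.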

Joint use cases

You can leverage the relative strengths of the two platforms for several use cases. In general, Hadoop is suitable for batch mode analysis and HP Vertica Analytics Platform for interactive analytics. Here are a few such use cases:

• Hadoop for ETL and HP Vertica Analytics Platform for analytics—Convert unstructured or semi-structured logs into a structured format (relational tuples) for analysis with HP Vertica Analytics Platform. In this scenario, Hadoop serves primarily as an ETL tool and HP Vertica Analytics Platform as the data analytics engine.

• HDFS for storage and HP Vertica Analytics Platform plus Hadoop for analytics—Run real-time analytics on HP Vertica Analytics Platform to capitalize on the speed and the performance of the analytics platform. Long-running and exploratory analytics run on Hadoop, relying on the fault tolerance of the Hadoop platform. This scenario enables you to load data from HDFS directly to HP Vertica Analytics Platform and provide HP Vertica SQL access to HDFS—again, using Hadoop primarily for data storage (or a data “parking lot”) and HP Vertica Analytics Platform for fast analysis.

• HP Vertica Analytics Platform for storage and analytics and Hadoop as a multi-purpose tool—A less common use case is to use HP Vertica Analytics Platform primarily for data storage, and take advantage of Hadoop’s capabilities beyond MapReduce, such as scheduling and load balancing, data conversion tools for other formats (for example, STATA), and backup for HDFS via Sqoop.
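The first use case above, converting semi-structured logs into relational tuples, can be illustrated with a small parser. This Python sketch assumes web logs in the Apache Common Log Format, which the source does not specify; the `LogRow` type and `parse_line` function are hypothetical names. In a Hadoop job, logic like this would typically sit in the map step, with the resulting rows loaded into HP Vertica Analytics Platform for analysis.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed input: Apache Common Log Format lines.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


class LogRow(NamedTuple):
    """One relational tuple, ready to load into a table."""
    host: str
    timestamp: datetime
    method: str
    path: str
    status: int
    size: int


def parse_line(line: str) -> Optional[LogRow]:
    """Return a structured row, or None if the line does not match the format."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    size = match["size"]
    return LogRow(
        host=match["host"],
        timestamp=datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"),
        method=match["method"],
        path=match["path"],
        status=int(match["status"]),
        size=0 if size == "-" else int(size),  # "-" means no body was sent
    )


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_line(sample))
```

Lines that do not match are returned as None so that a batch job can count or quarantine them rather than fail.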

Overlapping use cases

HP Vertica Analytics Platform handles a range of use cases just as well as or better than Hadoop. In many cases, HP Vertica Analytics Platform requires less hardware and less administrative complexity in delivering higher performance. Also, HP Vertica Analytics Platform uses standard SQL, as opposed to custom coding, to analyze data, contributing to an overall lower total cost of ownership. That said, in some use cases, either HP Vertica Analytics Platform or Hadoop could be used effectively.

Here are a few examples:

• Analyzing logs and machine data—Depending on your preference or development skills, you can use the HP Vertica SDK to write custom C++ or R code, or use Hadoop and Java, for log parsing and analyzing machine data. HP also makes parsers available on GitHub for web logs, tag clouds, and more. HP Vertica Analytics Platform is typically used for any use case that does not require the fault tolerance Hadoop provides for long-running analysis.

• Ingesting XML, JSON, and Avro formats—Again, depending on your preference, you can ingest these formats in Hadoop or HP Vertica Analytics Platform. The choice depends on where you primarily import and store the data; either way, you can then perform real-time analytics with HP Vertica Analytics Platform.

Consider these real-world use cases

Here are some real-world examples—based on how organizations are using the HP Vertica Analytics Platform and Hadoop today. These examples show how your organization could use HP Vertica Analytics Platform and Hadoop in a complementary manner.

Processing social video events

A social video company uses Hadoop for batch processing of logs and HP Vertica Analytics Platform for ETL, ad hoc analytics, and interactive dashboards. In addition, the company uses a key-value (KV) store for serving low-latency data needs. This architecture allows the company to collect and process hundreds of millions of events daily on a petabyte-scale infrastructure.

Accelerating drug discovery

A pharmaceutical company sought to analyze gene variants for improved drug targeting and discovery. The company found its solution in a combination of Hadoop and HP Vertica Analytics Platform, with a few additional supporting tools. It uses Hadoop to find the variants between a sample sequence and a reference genome, and uses HP Vertica Analytics Platform to run structured analysis on very large sets of data to determine oncology targets. In addition, the company uses HDFS for a raw data store and Hadoop/MapReduce for genomic algorithms that aren’t based on structured data.

Delivering digital consumer insights

A digital intelligence company uses HDFS to store raw input behavioral data and Hadoop/MapReduce to find conversions (regular-expressions processing) by determining what type of user clicked on a particular advertisement, and HP Vertica Analytics Platform to store and operationalize high-value business data. In addition, the company’s Big Data solution supports reporting and analytics via Tableau and the R programming language, and it uses custom ETL. This combination of technologies helps the company achieve faster insights that are delivered more consistently with less administrative overhead and lower-cost, commodity hardware.

Enabling privacy assurance

A company focused on web privacy uses HDFS to collect user privacy reporting requests, MapReduce to process and structure the data into HP Vertica Analytics Platform (ETL), and the platform to analyze statistics for every third-party tag on a website in measuring site performance. Consumers benefit from a free browser plug-in that can tell them who is tracking them. Advertisers, in turn, can provide greater transparency to end users and better understand the impact of third-party tags on website performance.

Use cases with HBase and HP Vertica Analytics Platform

Analyzing Facebook data

You can use the capabilities of HP Vertica Analytics Platform and Hadoop in a complementary manner to gain insights from Facebook data. Say you want to look up the number of users who hit the “like” button for items appearing on a page on Facebook. HBase, the open-source, column-oriented database that relies on Hadoop, is well suited to this task—a single key-value lookup.

But what if you want to analyze all of the recent clicks across Facebook to identify the 20 fastest growing “likes” in the Facebook environment? While the Hadoop ecosystem offers no practical way to run interactive analytics over hundreds of millions of rows in a database, HP Vertica Analytics Platform is well suited to the challenge. What’s more, HP Vertica Analytics Platform has a higher-level language that allows your users to express queries easily, while Hadoop typically requires a developer to do the same.
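The difference between the two access patterns can be shown on a toy data set. The Python sketch below is purely illustrative and uses neither HBase nor HP Vertica: the dictionary-style lookup stands in for a single key-value read, and the function that scans every event stands in for an analytic query. The event data, names, and the definition of "fastest growing" (largest increase in likes between two days) are assumptions.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical "like" events as (item_id, day) pairs.
EVENTS: List[Tuple[str, int]] = [
    ("a", 1), ("a", 2), ("a", 2), ("a", 2),
    ("b", 1), ("b", 1), ("b", 2),
    ("c", 2), ("c", 2),
]

# Serving pattern: totals are precomputed, so each request is one key lookup.
LIKE_TOTALS = Counter(item for item, _ in EVENTS)


def likes_for(item_id: str) -> int:
    """Single key-value lookup; unknown items count as zero."""
    return LIKE_TOTALS[item_id]


def fastest_growing(events: List[Tuple[str, int]], previous_day: int,
                    current_day: int, top_n: int) -> List[str]:
    """Analytic pattern: scan all events and rank items by day-over-day growth."""
    previous = Counter(item for item, day in events if day == previous_day)
    current = Counter(item for item, day in events if day == current_day)
    growth = {item: current[item] - previous[item] for item in current}
    # Largest growth first; ties broken alphabetically for a stable result.
    return sorted(growth, key=lambda item: (-growth[item], item))[:top_n]


print(likes_for("a"))
print(fastest_growing(EVENTS, previous_day=1, current_day=2, top_n=2))
```

The lookup touches one key, while the ranking has to read every event, which is why the two workloads favor different platforms.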

Understanding power usage trends

A power utility benefits from using the complementary capabilities of HP Vertica Analytics Platform and Hadoop to help its customers and engineers understand power usage trends. Hadoop is well suited to “personal analytics” applications that allow customers to retrieve information that provides insights into their power-usage trends over different time periods and different weather conditions. HBase is adept at pulling up small numbers of rows from database tables.

At the same time, utility engineers can run power-usage analytics over the entire body of customer usage data to help them gain insights into optimal configuration of the company’s electric grid. HP Vertica Analytics Platform is ideally suited for this undertaking. It is designed for analytics over petabytes of data.

Strengths and limitations of popular Hadoop components

In addition to HDFS and MapReduce, popular components of the Hadoop ecosystem include HBase, Hive, and Pig. Here is a look at these components, including their strengths and limitations, and how they can complement HP Vertica Analytics Platform.

As noted in the use cases above, HBase and HP Vertica Analytics Platform can work in a complementary manner. You can use HBase as a serving platform and HP Vertica Analytics Platform for complex analytics and model development. For example, in an advertising use case, you might look up user profiles employed in ad-serving from HBase and use HP Vertica Analytics Platform to do analysis that generates the user profiles.

HP Vertica external tables over HDFS are a complete replacement for Hive. HP Vertica Analytics Platform offers much more complete SQL and analytic support and much better performance. What’s more, you have an easy migration path into HP Vertica Analytics Platform when it is time to operationalize the analytics.

Pig and HP Vertica Analytics Platform can be complementary. Pig is a useful scripting language for ad hoc processing of unstructured and semi-structured data. If you use Pig, HP provides a Pig connector that allows you to load and store data from the HP Vertica Analytics Platform.

For processing that is similar to SQL, HP Vertica external tables over HDFS are a complete replacement for Pig. HP Vertica Analytics Platform provides much more complete SQL and analytic support and better performance. And as with Hive, you have an easy migration path into HP Vertica Analytics Platform when it is time to operationalize the analytics.

HBase

Description: Open-source, column-oriented database, modeled after BigTable (from Google)

Strengths:

• Provides random access to Hadoop Distributed File System (HDFS) data

• Is independent from batch MapReduce

Limitations:

• No standard SQL support; your existing BI tools will not work

• Designed for workloads with simple key-value or range lookups, not complex analytics

• No support for ACID transactions; you cannot use HBase to replace existing database applications without custom coding

• Large hardware footprint and added complexity of the Hadoop stack

Hive

Description: A tool developed at Facebook to provide SQL-like language access (HQL) to Hadoop

Strengths:

• Provides a SQL-like interface to Hadoop

Limitations:

• Very limited subset of SQL, compared to HP Vertica Analytics Platform

• Fundamentally a Hadoop-based solution, with similar properties to Hadoop in terms of performance, administration complexity, and a large hardware footprint

Pig

Description: A framework that provides a higher-level language to increase developer productivity in Hadoop

Strengths:

• Is useful for developers who lack experience in Java

Limitations:

• Does not have an optimizer and therefore requires developers to optimize and tune their scripts

• Meant for power users; you have to do a lot more programming compared to HP Vertica Analytics Platform

• Most of the BI tools that you can use with HP Vertica Analytics Platform will not be useful with Pig

• Fundamentally a Hadoop-based solution; Pig has similar properties to Hadoop in terms of performance, administration complexity, and a large hardware footprint

Key takeaways

To address today’s rich array of data sources, rapidly growing data volumes, and complex analytics challenges, you need to evolve your infrastructure in a way that leverages technology most effectively and doesn’t break your budget. The HP Vertica Analytics Platform and the Hadoop open-source platform help you get there today. This combination gives you access to the right tools for the job at hand—from storage and batch processing to real-time analytics on Big Data.

A large and growing number of organizations around the world are proving the value of this combination by using the HP Vertica Analytics Platform with Hadoop to get the best of both worlds—high-performance analytics and a distributed open-source file system.

What’s more, with HP, you can be confident that you have a partner who can offer you a total solution that allows you to gain more value from your investments in both the HP Vertica Analytics Platform and your Hadoop environment. This combination enables your enterprise to bring together your structured, unstructured, and semi-structured data—and make all your information matter.

To learn more

For a technically oriented view of the integration of HP Vertica Analytics Platform and Hadoop, check out these blog posts on the HP Vertica website:

• Teaching the elephant new tricks

• Counting triangles

To test drive HP Vertica Enterprise Edition, visit vertica.com/evaluate. HP also offers HP Vertica Community Edition software, a free version of HP Vertica Enterprise Edition limited to one terabyte of data and three nodes. Sign up for HP Vertica Community Edition at vertica.com/community.

© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Java is a registered trademark of Oracle Corporation and/or its affiliates.

4AA4-4526ENW, Created November 2012