Introduction to Data Analyst Training

Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop

Tom Wheeler

Agenda

- Why Cloudera Training?
- Target Audience and Prerequisites
- Course Outline
- Short Presentation Based on Actual Course Material
  - Understanding Hadoop, Pig, Hive, Sqoop, and Impala
- Question and Answer Session

Rising demand for Big Data and analytics experts but a deficiency of talent will result in a shortfall of 32,000 trained professionals by 2015.

Source: Accenture, “Analytics in Action,” March 2013.

1 Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science

2 Most Experienced Instructors: More than 15,000 students trained since 2009

3 Widest Geographic Coverage: 50 cities worldwide plus online

4 Leader in Certification: Over 5,000 accredited Cloudera professionals

5 Leading Platform & Community: CDH deployed more than all other distributions combined

6 Relevant Training Material: Classes updated regularly as tools evolve

7 Practical Hands-On Exercises: Real-world labs complement live instruction

8 Ongoing Learning: Video tutorials and e-learning complement training

Why Cloudera Training?

Big Data professionals from 55% of the Fortune 100 have attended live Cloudera training

Cloudera has trained employees from 100% of the top 20 global technology firms to use Hadoop

Source: Fortune, “Fortune 500” and “Global 500,” May 2012.

Cloudera Trains the Top Companies

94% would recommend or highly recommend Cloudera training to friends and colleagues

88% indicate Cloudera training provided the Hadoop expertise their roles require

66% draw on lessons from Cloudera training on at least a monthly basis

Sources: Cloudera Past Public Training Participant Study, December 2012. Cloudera Customer Satisfaction Study, January 2013.

What Do Our Students Say?

“Cloudera is the best vendor evangelizing the Big Data movement and is doing a great service promoting Hadoop in the industry. Developer training was a great way to get started on my journey.”

Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop

About the Course

This course was created for people in analytical roles, including
  – Data Analyst
  – Business Intelligence Analyst
  – Operations Analyst
  – Reporting Specialist

Also useful for others who want to use high-level Big Data tools
  – Business Intelligence Developer
  – Data Warehouse Engineer
  – ETL Developers

Intended Audience

Developers who want to learn details of MapReduce programming
  – Recommend Cloudera Developer Training for Apache Hadoop

System administrators who want to learn how to install/configure tools
  – Recommend Cloudera Administrator Training for Apache Hadoop

Who Should Not Take this Course

No prior knowledge of Hadoop is required

What is required is an understanding of
  – Basic relational database concepts
  – Basic knowledge of SQL
  – Basic end-user UNIX commands

Course Prerequisites

SELECT id, first_name, last_name
  FROM customers
  ORDER BY last_name;

$ mkdir /data
$ cd /data
$ rm /home/tomwheeler/salesreport.txt

During this course, you will learn

The purpose of Hadoop and its related tools

The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis

How to identify typical use cases for large-scale data analysis

How to load data from relational databases and other sources

How to manage data in HDFS and export it for use with other systems

How Pig, Hive, and Impala improve productivity for typical analysis tasks

The language syntax and data formats supported by these tools

Course Objectives

How to design and execute queries on data stored in HDFS

How to join diverse datasets to gain valuable business insight

How to analyze structured, semi-structured, and unstructured data

How Hive and Pig can be extended with custom functions and scripts

How to store and query data for better performance

How to determine which tool is the best choice for a given task

Course Objectives (cont’d)

Hadoop Fundamentals
  – Hands-On Exercise: Data Ingest with Hadoop Tools

Introduction to Pig

Basic Data Analysis with Pig
  – Hands-On Exercise: Using Pig for ETL Processing

Processing Complex Data with Pig
  – Hands-On Exercise: Analyzing Ad Campaign Data with Pig

Multi-Dataset Operations with Pig
  – Hands-On Exercise: Analyzing Disparate Data Sets with Pig

Extending Pig
  – Hands-On Exercise: Extending Pig with Streaming and UDFs

Course Outline

Pig Troubleshooting and Optimization
  – Demo: Troubleshooting a Failed Job with the Web UI

Introduction to Hive

Relational Data Analysis with Hive
  – Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue

Hive Data Management
  – Hands-On Exercise: Data Management with Hive

Text Processing with Hive
  – Hands-On Exercise: Gaining Insight with Sentiment Analysis

Hive Optimization

Extending Hive
  – Hands-On Exercise: Data Transformation with Hive

Course Outline (cont’d)

Introduction to Impala

Analyzing Data with Impala
  – Hands-On Exercise: Interactive Analysis with Impala

Choosing the Best Tool for the Job

Course Outline (cont’d)

We are generating data faster than ever
  – Processes are increasingly automated
  – People are increasingly interacting online
  – Systems are increasingly interconnected

Velocity

We are producing a wide variety of data
  – Social network connections
  – Images, audio, and video
  – Server and application log files
  – Product ratings on shopping and review Web sites
  – And much more…

Not all of this maps cleanly to the relational model

Variety

Every day…
  – More than 1.5 billion shares are traded on the New York Stock Exchange
  – Facebook stores 2.7 billion comments and ‘Likes’
  – Google processes about 24 petabytes of data

Every minute…
  – Foursquare handles more than 2,000 check-ins
  – TransUnion makes nearly 70,000 updates to credit files

And every second…
  – Banks process more than 10,000 credit card transactions

Volume

This data has many valuable applications
  – Product recommendations
  – Predicting demand
  – Marketing analysis
  – Fraud detection
  – And many, many more…

We must process it to extract that value
  – And processing all the data can yield more accurate results

Data Has Value

We’re generating too much data to process with traditional tools

Two key problems to address
  – How can we reliably store large amounts of data at a reasonable cost?
  – How can we analyze all the data we have stored?

We Need a System that Scales

Scalable and economical data storage and processing
  – Distributed and fault-tolerant
  – Harnesses the power of industry standard hardware

Heavily inspired by technical documents published by Google

‘Core’ Hadoop consists of two main components
  – Storage: the Hadoop Distributed File System (HDFS)
  – Processing: MapReduce

What is Apache Hadoop?
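To make the MapReduce half of ‘Core’ Hadoop more concrete, here is a minimal single-process sketch of the programming model in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines against data stored in HDFS.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Call the reducer once per key with the full list of values for that key."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(line):
    """Illustrative mapper: emit (word, 1) for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Illustrative reducer: add up the counts emitted for one word."""
    return sum(values)


def word_count(lines):
    """Chain the three phases together for a simple word count."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

The point of the sketch is the division of labor: the analyst supplies only the mapper and the reducer, while grouping by key happens in between.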

Apache Pig builds on Hadoop to offer high-level data processing
  – This is an alternative to writing low-level MapReduce code
  – Pig is especially good at joining and transforming data

Apache Pig

people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;
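For readers who do not yet know Pig Latin, the data flow in the script above can be approximated step by step in plain Python. This is only a sketch of the logic on in-memory tuples; the function names are hypothetical, and Pig itself would compile these steps into jobs that run on the cluster.

```python
from collections import defaultdict


def total_by_customer(orders):
    """Mirror GROUP orders BY cust_id followed by SUM(orders.cost)."""
    totals = defaultdict(float)
    for ord_id, cust_id, cost in orders:
        totals[cust_id] += cost
    return dict(totals)


def join_totals_with_people(totals, people):
    """Mirror JOIN totals BY group, people BY cust_id (an inner join).

    Each output tuple carries the fields of both inputs:
    (group, t, cust_id, name).
    """
    result = []
    for cust_id, name in people:
        if cust_id in totals:
            result.append((cust_id, totals[cust_id], cust_id, name))
    return result


def run_flow(people, orders):
    """Run the whole flow and return what DUMP result would display."""
    return join_totals_with_people(total_by_customer(orders), people)
```

Customers with no orders drop out of the result, because the join is an inner join.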

Pig is also widely used for Extract, Transform, and Load (ETL) processing

Use Case: ETL Processing

Hive is another abstraction on top of MapReduce
  – Like Pig, it also reduces development time
  – Hive uses a SQL-like language called HiveQL

Apache Hive

SELECT customers.cust_id, SUM(cost) AS total
  FROM customers
  JOIN orders ON customers.cust_id = orders.cust_id
  GROUP BY customers.cust_id
  ORDER BY total DESC;
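The query above can be tried without a Hadoop cluster. The sketch below runs the same statement against an in-memory SQLite database from Python; SQLite stands in for Hive purely for illustration, the table layouts are assumptions based on the column names in the query, and the top_customers helper is hypothetical.

```python
import sqlite3


def top_customers(customers, orders):
    """Load sample rows into SQLite and return (cust_id, total) by total, descending."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schemas; the slide only shows the column names used in the query.
        conn.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
        conn.execute(
            "CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost REAL)"
        )
        conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
        query = """
            SELECT customers.cust_id, SUM(cost) AS total
              FROM customers
              JOIN orders ON customers.cust_id = orders.cust_id
             GROUP BY customers.cust_id
             ORDER BY total DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

The SQL is unchanged from the slide, which illustrates why HiveQL is approachable for anyone who already writes SQL.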

Server log files are an important source of data

Hive allows you to treat a directory of log files like a table
  – Allows SQL-like queries against raw data

Use Case: Log File Analytics
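The idea of querying a directory of log files as if it were a table can be imitated on a small scale in Python. In this sketch the three-field log layout (IP address, URL, status code), the weblogs table name, and both function names are assumptions made for illustration. Unlike Hive, which applies a table definition to files where they already sit, this version copies the parsed rows into an in-memory SQLite table first.

```python
import os
import sqlite3


def load_logs(log_dir):
    """Read every file in log_dir and parse lines of the form 'ip url status'."""
    rows = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                fields = line.split()
                # Skip lines that do not match the assumed three-field layout.
                if len(fields) != 3 or not fields[2].isdigit():
                    continue
                ip, url, status = fields
                rows.append((ip, url, int(status)))
    return rows


def count_by_status(log_dir):
    """Run a SQL aggregate over all log lines found in the directory."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE weblogs (ip TEXT, url TEXT, status INTEGER)")
        conn.executemany("INSERT INTO weblogs VALUES (?, ?, ?)", load_logs(log_dir))
        query = "SELECT status, COUNT(*) FROM weblogs GROUP BY status ORDER BY status"
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

Adding another log file to the directory changes the query result without any change to the table definition, which is the property the slide highlights.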

Apache Sqoop

Sqoop exchanges data between a database and Hadoop

It can import all tables, a single table, or a portion of a table into HDFS
  – Result is a directory in HDFS containing comma-delimited text files

Sqoop can also export data from HDFS back to the database
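To picture what an import produces, the following Python sketch copies one table from a SQLite database into a local directory of comma-delimited text files. It does not use Sqoop or HDFS: the import_table function, the part-NNNNN file names, and the num_files parameter (standing in for parallel import tasks) are all hypothetical.

```python
import csv
import os
import sqlite3


def import_table(conn, table, target_dir, num_files=2):
    """Write the rows of one table as comma-delimited files under target_dir."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # Order by the first column so the split across files is deterministic.
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for i in range(num_files):
        path = os.path.join(target_dir, f"part-{i:05d}")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            # Each file receives every num_files-th row, starting at row i.
            csv.writer(handle, lineterminator="\n").writerows(rows[i::num_files])
        paths.append(path)
    return paths
```

The result is a directory rather than a single file, so downstream tools read the whole directory as one dataset.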

Massively parallel SQL engine which runs on a Hadoop cluster
  – Inspired by Google’s Dremel project
  – Can query data stored in HDFS or HBase tables

High performance
  – Typically at least 10 times faster than Pig, Hive, or MapReduce
  – High-level query language (subset of SQL)

Impala is 100% Apache-licensed open source

Cloudera Impala

Where Impala Fits Into the Data Center

MapReduce
  – Low-level processing and analysis

Pig
  – Procedural data flow language executed using MapReduce

Hive
  – SQL-based queries executed using MapReduce

Impala
  – High-performance SQL-based queries using a custom execution engine

Recap of Data Analysis/Processing Tools

Comparing Pig, Hive, and Impala

Description of Feature               Pig    Hive   Impala
SQL-based query language             No     Yes    Yes
User-defined functions (UDFs)        Yes    Yes    No
Process data with external scripts   Yes    Yes    No
Extensible file format support       Yes    Yes    No
Complex data types                   Yes    Yes    No
Query latency                        High   High   Low
Built-in data partitioning           No     Yes    Yes
Accessible via ODBC / JDBC           No     Yes    Yes

• Submit questions in the Q&A panel

• Watch on-demand video of this webinar at http://cloudera.com

• Follow Cloudera University @ClouderaU

• Attend Tom’s talk at OSCON: http://tinyurl.com/oscontom

• Or Tom’s talks at StampedeCon: http://tinyurl.com/stampedetom

• Thank you for attending!

Register now for Cloudera training at http://university.cloudera.com

Use discount code Wheeler_10 to save 10% on new enrollments in Data Analyst Training classes delivered by Cloudera until September 1, 2013

Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until September 1, 2013

