Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
Tom Wheeler
Agenda
Why Cloudera Training?
Target Audience and Prerequisites
Course Outline
Short Presentation Based on Actual Course Material
– Understanding Hadoop, Pig, Hive, Sqoop, and Impala
Question and Answer Session
Rising demand for Big Data and analytics experts but a DEFICIENCY OF TALENT will result in a shortfall of 32,000 trained professionals by 2015
Source: Accenture “Analytics in Action,” March 2013.
1 Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
2 Most Experienced Instructors: More than 15,000 students trained since 2009
3 Widest Geographic Coverage: 50 cities worldwide plus online
4 Leader in Certification: Over 5,000 accredited Cloudera professionals
5 Leading Platform & Community: CDH deployed more than all other distributions combined
6 Relevant Training Material: Classes updated regularly as tools evolve
7 Practical Hands-On Exercises: Real-world labs complement live instruction
8 Ongoing Learning: Video tutorials and e-learning complement training
Why Cloudera Training?
Big Data professionals from 55% of the Fortune 100 have attended live Cloudera training
Cloudera has trained employees from 100% of the top 20 global technology firms to use Hadoop
Source: Fortune, “Fortune 500” and “Global 500,” May 2012.
Cloudera Trains the Top Companies
94% Would recommend or highly recommend Cloudera training to friends and colleagues
88% Indicate Cloudera training provided the Hadoop expertise their roles require
66% Draw on lessons from Cloudera training on at least a monthly basis
Sources: Cloudera Past Public Training Participant Study, December 2012. Cloudera Customer Satisfaction Study, January 2013.
What Do Our Students Say?
“Cloudera is the best vendor evangelizing the Big Data movement and is doing a great service promoting Hadoop in the industry. Developer training was a great way to get started on my journey.”
Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
About the Course
This course was created for people in analytical roles, including
– Data Analyst
– Business Intelligence Analyst
– Operations Analyst
– Reporting Specialist
Also useful for others who want to use high-level Big Data tools
– Business Intelligence Developer
– Data Warehouse Engineer
– ETL Developer
Intended Audience
Developers who want to learn details of MapReduce programming
– Recommend Cloudera Developer Training for Apache Hadoop
System administrators who want to learn how to install/configure tools
– Recommend Cloudera Administrator Training for Apache Hadoop
Who Should Not Take this Course
No prior knowledge of Hadoop is required
What is required is an understanding of
– Basic relational database concepts
– Basic knowledge of SQL
– Basic end-user UNIX commands
Course Prerequisites
SELECT id, first_name, last_name FROM customers ORDER BY last_name;
$ mkdir /data
$ cd /data
$ rm /home/tomwheeler/salesreport.txt
During this course, you will learn
The purpose of Hadoop and its related tools
The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
How to identify typical use cases for large-scale data analysis
How to load data from relational databases and other sources
How to manage data in HDFS and export it for use with other systems
How Pig, Hive, and Impala improve productivity for typical analysis tasks
The language syntax and data formats supported by these tools
Course Objectives
How to design and execute queries on data stored in HDFS
How to join diverse datasets to gain valuable business insight
How to analyze structured, semi-structured, and unstructured data
How Hive and Pig can be extended with custom functions and scripts
How to store and query data for better performance
How to determine which tool is the best choice for a given task
Course Objectives (cont’d)
Hadoop Fundamentals
– Hands-On Exercise: Data Ingest with Hadoop Tools
Introduction to Pig
Basic Data Analysis with Pig
– Hands-On Exercise: Using Pig for ETL Processing
Processing Complex Data with Pig
– Hands-On Exercise: Analyzing Ad Campaign Data with Pig
Multi-Dataset Operations with Pig
– Hands-On Exercise: Analyzing Disparate Data Sets with Pig
Extending Pig
– Hands-On Exercise: Extending Pig with Streaming and UDFs
Course Outline
Pig Troubleshooting and Optimization
– Demo: Troubleshooting a Failed Job with the Web UI
Introduction to Hive
Relational Data Analysis with Hive
– Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue
Hive Data Management
– Hands-On Exercise: Data Management with Hive
Text Processing with Hive
– Hands-On Exercise: Gaining Insight with Sentiment Analysis
Hive Optimization
Extending Hive
– Hands-On Exercise: Data Transformation with Hive
Course Outline (cont’d)
Introduction to Impala
Analyzing Data with Impala
– Hands-On Exercise: Interactive Analysis with Impala
Choosing the Best Tool for the Job
Course Outline (cont’d)
We are generating data faster than ever
– Processes are increasingly automated
– People are increasingly interacting online
– Systems are increasingly interconnected
Velocity
We are producing a wide variety of data
– Social network connections
– Images, audio, and video
– Server and application log files
– Product ratings on shopping and review Web sites
– And much more…
Not all of this maps cleanly to the relational model
Variety
Every day…
– More than 1.5 billion shares are traded on the New York Stock Exchange
– Facebook stores 2.7 billion comments and ‘Likes’
– Google processes about 24 petabytes of data
Every minute…
– Foursquare handles more than 2,000 check-ins
– TransUnion makes nearly 70,000 updates to credit files
And every second…
– Banks process more than 10,000 credit card transactions
Volume
This data has many valuable applications
– Product recommendations
– Predicting demand
– Marketing analysis
– Fraud detection
– And many, many more…
We must process it to extract that value
– And processing all the data can yield more accurate results
Data Has Value
We’re generating too much data to process with traditional tools
Two key problems to address
– How can we reliably store large amounts of data at a reasonable cost?
– How can we analyze all the data we have stored?
We Need a System that Scales
Scalable and economical data storage and processing
– Distributed and fault-tolerant
– Harnesses the power of industry-standard hardware
Heavily inspired by technical documents published by Google
‘Core’ Hadoop consists of two main components
– Storage: the Hadoop Distributed File System (HDFS)
– Processing: MapReduce
What is Apache Hadoop?
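To make the MapReduce processing model named above more concrete, here is a minimal single-machine sketch in Python of its three steps (map, shuffle, reduce) applied to a word count. It is an illustration only, not Hadoop code: the function names and the sample input lines are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Collect all values that share a key; the framework performs this
    # step between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

# Hypothetical input; on a cluster these lines would come from files in HDFS.
lines = ["pig hive impala", "hive impala", "impala"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each phase only looks at one line or one key at a time, which is what lets the work be split across machines.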
Apache Pig builds on Hadoop to offer high-level data processing
– This is an alternative to writing low-level MapReduce code
– Pig is especially good at joining and transforming data
Apache Pig
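-- Load customer and order records from HDFS
people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
-- Total the order cost for each customer
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
-- Attach customer details to each total and display the result
result = JOIN totals BY group, people BY cust_id;
DUMP result;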
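For readers new to Pig's data flow style, the following Python sketch walks through the same three steps as the Pig script above (group, sum, join) on a few in-memory rows. The sample customers and orders are made up for illustration, and the sketch assumes each cust_id appears only once in the customer data. It shows what the script computes, not how Pig executes it.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the two HDFS inputs.
people = [(1, "Alice"), (2, "Bob")]                        # (cust_id, name)
orders = [(100, 1, 25.0), (101, 1, 10.0), (102, 2, 7.5)]   # (ord_id, cust_id, cost)

# GROUP orders BY cust_id
groups = defaultdict(list)
for ord_id, cust_id, cost in orders:
    groups[cust_id].append((ord_id, cust_id, cost))

# FOREACH groups GENERATE group, SUM(orders.cost) AS t
totals = [(cust_id, sum(row[2] for row in rows)) for cust_id, rows in groups.items()]

# JOIN totals BY group, people BY cust_id
# (assumes cust_id is unique in people, so a dict lookup is enough)
names = dict(people)
result = [(cust_id, t, cust_id, names[cust_id])
          for cust_id, t in totals if cust_id in names]

# DUMP result
for row in result:
    print(row)
```

As in Pig, each joined row carries the fields from both inputs, so the customer ID appears twice.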
Pig is also widely used for Extract, Transform, and Load (ETL) processing
Use Case: ETL Processing
Hive is another abstraction on top of MapReduce
– Like Pig, it also reduces development time
– Hive uses a SQL-like language called HiveQL
Apache Hive
SELECT customers.cust_id, SUM(cost) AS total
  FROM customers
  JOIN orders
    ON customers.cust_id = orders.cust_id
 GROUP BY customers.cust_id
 ORDER BY total DESC;
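The query above uses only standard SQL constructs, so its result can be previewed without a cluster. The sketch below runs it against an in-memory SQLite database from Python. SQLite is a stand-in chosen for convenience and is not Hive; the table definitions and sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and sample rows; the real tables would live in Hive.
conn.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(100, 1, 25.0), (101, 1, 10.0), (102, 2, 7.5)])

# Same query text as on the slide.
query = """
SELECT customers.cust_id, SUM(cost) AS total
  FROM customers
  JOIN orders
    ON customers.cust_id = orders.cust_id
 GROUP BY customers.cust_id
 ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for cust_id, total in rows:
    print(cust_id, total)
```

The point of the comparison is familiarity: an analyst who can read this query already knows most of what is needed to read HiveQL.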
Server log files are an important source of data
Hive allows you to treat a directory of log files like a table
– Allows SQL-like queries against raw data
Use Case: Log File Analytics
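The idea of querying log lines as table rows can be sketched in Python as follows. The log format (client IP, method, path, status) and the sample lines are hypothetical, and SQLite again stands in for Hive. One difference worth noting: this sketch copies the parsed lines into a table, whereas Hive applies a table definition to the files where they already sit.

```python
import sqlite3

# Hypothetical raw log lines: client IP, HTTP method, path, status code.
raw_lines = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing.html 404",
    "10.0.0.1 GET /cart 200",
]

def parse(line):
    # Split one whitespace-delimited log line into typed columns.
    ip, method, path, status = line.split()
    return ip, method, path, int(status)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ip TEXT, method TEXT, path TEXT, status INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)",
                 [parse(line) for line in raw_lines])

# Count requests per status code, as an analyst might do with a SQL-like query.
status_counts = conn.execute(
    "SELECT status, COUNT(*) FROM logs GROUP BY status ORDER BY status"
).fetchall()
print(status_counts)
```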
Apache Sqoop
Sqoop exchanges data between a database and Hadoop
It can import all tables, a single table, or a portion of a table into HDFS
– Result is a directory in HDFS containing comma-delimited text files
Sqoop can also export data from HDFS back to the database
Massively parallel SQL engine which runs on a Hadoop cluster
– Inspired by Google’s Dremel project
– Can query data stored in HDFS or HBase tables
High performance
– Typically at least 10 times faster than Pig, Hive, or MapReduce
– High-level query language (subset of SQL)
Impala is 100% Apache-licensed open source
Cloudera Impala
Where Impala Fits Into the Data Center
MapReduce
– Low-level processing and analysis
Pig
– Procedural data flow language executed using MapReduce
Hive
– SQL-based queries executed using MapReduce
Impala
– High-performance SQL-based queries using a custom execution engine
Recap of Data Analysis/Processing Tools
Comparing Pig, Hive, and Impala
Description of Feature                Pig    Hive   Impala
SQL-based query language              No     Yes    Yes
User-defined functions (UDFs)         Yes    Yes    No
Process data with external scripts    Yes    Yes    No
Extensible file format support        Yes    Yes    No
Complex data types                    Yes    Yes    No
Query latency                         High   High   Low
Built-in data partitioning            No     Yes    Yes
Accessible via ODBC / JDBC            No     Yes    Yes
• Submit questions in the Q&A panel
• Watch on-demand video of this webinar at http://cloudera.com
• Follow Cloudera University @ClouderaU
• Attend Tom’s talk at OSCON: http://tinyurl.com/oscontom
• Or Tom’s talks at StampedeCon: http://tinyurl.com/stampedetom
• Thank you for attending!
Register now for Cloudera training at http://university.cloudera.com
Use discount code Wheeler_10 to save 10% on new enrollments in Data Analyst Training classes delivered by Cloudera until September 1, 2013
Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until September 1, 2013