joel-bond
Getting Started with Hadoop
Who We Are
How We Do It
We deliver relevant products and services.
A distribution of Apache Hadoop that is tested, certified and supported
Comprehensive support and professional service offerings
A suite of management software for Hadoop operations
Training and certification programs for developers, administrators, managers and data scientists
Technical Team
Unmatched knowledge and experience.
Founders, committers and contributors to Hadoop
A wealth of experience in the design and delivery of production software
Credentials
The Apache Hadoop experts.
Number 1 distribution of Apache Hadoop in the world
Largest contributor to the open source Hadoop ecosystem
More committers on staff than any other company
More than 100 customers across a wide variety of industries
Strong growth in revenue and new accounts
Mission: To help organizations profit from their data
Leadership
Strong executive team with proven abilities.
Mike Olson, CEO
Kirk Dunn, COO
Charles Zedlewski, VP, Product
Mary Rorabaugh, CFO
Jeff Hammerbacher, Chief Scientist
Amr Awadalla, VP Engineering
Doug Cutting, Chief Architect
Omer Trajman, VP, Customer Solutions
©2011 Cloudera, Inc. All Rights Reserved.
Users of Cloudera
Financial, Web, Retail & Consumer, Media, Telecom
What is Apache Hadoop?
Hadoop Distributed File System (HDFS)
File Sharing & Data Protection Across Physical Servers
MapReduce
Distributed Computing Across Physical Servers
Flexibility
A single repository for storing, processing and analyzing any type of data
Not bound by a single schema
Scalability
Scale-out architecture divides workloads across multiple nodes
Flexible file system eliminates ETL bottlenecks
Low Cost
Can be deployed on commodity hardware
Open source platform guards against vendor lock-in
Hadoop is a platform for data storage and processing that is…
Scalable, fault tolerant, open source
CORE HADOOP COMPONENTS
What Makes Hadoop Different?
• Ability to scale out to Petabytes in size using commodity hardware
• Processing (MapReduce) jobs are sent to the data versus shipping the data to be processed
• Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured and unstructured data
• Manages fault tolerance and data replication automatically
Why the Need for Hadoop?
1.8 trillion gigabytes of data was created in 2011…
More than 90% is unstructured data
Approx. 500 quadrillion files
Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005 to 2015, structured data vs. unstructured data]
Source: IDC 2011
Hadoop Use Cases
Industry        Advanced Analytics Use Case      Data Processing Use Case
Web             Social Network Analysis          Clickstream Sessionization
Media           Content Optimization             Clickstream Sessionization
Telco           Network Analytics                Mediation
Retail          Loyalty & Promotions Analysis    Data Factory
Financial       Fraud Analysis                   Trade Reconciliation
Federal         Entity Analysis                  SIGINT
Bioinformatics  Sequencing Analysis              Genome Mapping
Hadoop in the Enterprise
Logs, Files, Web Data, Relational Databases
IDEs, BI / Analytics, Enterprise Reporting
Enterprise Data Warehouse
Web Application
Management Tools
OPERATORS ENGINEERS ANALYSTS BUSINESS USERS
CUSTOMERS
What is CDH?
Fastest Path to Success
No need to write your own scripts or do integration testing on different components
Works with a wide range of operating systems, hardware, databases and data warehouses
Stable and Reliable
Extensive Cloudera QA systems, software & processes
Tested & run in production at scale
Proven at scale in dozens of enterprise environments
Community Driven
Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings
FREE
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…
100% Apache open source
Contains all components needed for deployment
Fully documented and supported
Released on a reliable schedule
Cloudera’s Commitment to the Open Source Community
Component   Cloudera Committers   Cloudera Founder   2011 Commits
Common      6                     Yes                #1
HDFS        6                     Yes                #2
MapReduce   5                     Yes                #1
HBase       2                     No                 #2
Zookeeper   1                     Yes                #2
Oozie       1                     Yes                #1
Pig         0                     No                 #3
Hive        1                     No                 #2
Sqoop       2                     Yes                #1
Flume       3                     Yes                #1
Hue         3                     Yes                #1
Snappy      2                     No                 #1
Bigtop      8                     Yes                #1
Avro        4                     Yes                #1
Whirr       2                     Yes                #1
Components of CDH
Coordination: APACHE ZOOKEEPER
Data Integration: APACHE FLUME, APACHE SQOOP
Fast Read/Write Access: APACHE HBASE
Languages / Compilers: APACHE PIG, APACHE HIVE
Workflow: APACHE OOZIE
Scheduling: APACHE OOZIE
File System Mount: FUSE-DFS
User Interface: HUE
Cloudera Enterprise
Hadoop Distributed File System (HDFS)
Block Size = 64MB
Replication Factor = 3
Cost is $400-$500/TB
[Diagram: a file split into blocks 1 through 5, with copies of each block spread across several DataNodes]
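The block size and replication factor above determine how a file is laid out. The following is a minimal sketch of that arithmetic, not Hadoop code; the function name and the 300 MB file are hypothetical, chosen only for illustration.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (number of blocks, total block replicas, raw MB stored)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # A partially filled last block only uses the space it needs, so raw
    # usage is the file size multiplied by the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


# Hypothetical 300 MB file: splits into 5 blocks, each stored 3 times.
blocks, replicas, raw_mb = hdfs_footprint(300)
print(blocks, replicas, raw_mb)
```

With these assumptions, a 300 MB file becomes 5 blocks and 15 block replicas, and consumes about 900 MB of raw disk across the cluster.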
Components of Hadoop
• NameNode
– Holds all metadata for HDFS
– Needs to be a highly reliable machine
• RAID drives – typically RAID 10
• Dual power supplies
• Dual network cards – Bonded
– The more memory the better – typically 36GB to 64GB
• Secondary NameNode
– Provides checkpointing for the NameNode. The same hardware as the NameNode should be used
Components of Hadoop
• DataNodes
– Hardware will depend on the specific needs of the cluster
– No RAID needed, JBOD (just a bunch of disks) is used
– Typical ratio is:
• 1 hard drive
• 2 cores
• 4GB of RAM
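As a rough illustration of the ratio above, the sketch below scales cores and RAM from the number of data drives. The function name and the 12-drive example are assumptions for illustration; real sizing depends on the workload, as the slide notes.

```python
def datanode_spec(drives, cores_per_drive=2, ram_gb_per_drive=4):
    """Scale cores and RAM from the drive count using the 1 : 2 : 4GB ratio."""
    return {
        "drives": drives,
        "cores": drives * cores_per_drive,
        "ram_gb": drives * ram_gb_per_drive,
    }


# Hypothetical DataNode with 12 JBOD drives.
print(datanode_spec(12))
```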
Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, together with a core switch
• Be careful about oversubscribing the backplane of the switch!
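Oversubscription can be estimated as the traffic the ports could offer divided by what the switch can actually carry. The sketch below assumes a hypothetical 48-port, 1 Gbps top-of-rack switch with 32 Gbps of usable backplane capacity; none of these figures come from the slides.

```python
def oversubscription_ratio(ports, port_gbps, capacity_gbps):
    """Ratio of maximum offered load to switching capacity (above 1.0 is oversubscribed)."""
    return (ports * port_gbps) / capacity_gbps


# Hypothetical switch: 48 ports at 1 Gbps sharing 32 Gbps of capacity.
ratio = oversubscription_ratio(ports=48, port_gbps=1, capacity_gbps=32)
print(f"{ratio:.1f}:1")
```

A ratio above 1.0 means that the nodes in the rack can collectively generate more traffic than the switch can forward at once.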
Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.
[Diagram: Map Task emits (key 1, values), (key 2, values), (key 3, values); the Shuffle Phase groups the intermediate values for key 1; the Reduce Task produces the final (key, values)]
Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce() combines those intermediate values into one or more final values for that same output key
[Diagram: Map Task emits (key 1, values), (key 2, values), (key 3, values); the Shuffle Phase groups the intermediate values for key 1; the Reduce Task produces the final (key, values)]
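The map, shuffle and reduce steps can be sketched in a few lines of plain Python. This is a single-process illustration of the data flow using word counting, not the Hadoop API; the function names and sample records are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(filename, line):
    """Emit an intermediate (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Collect all intermediate values for each output key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Combine the intermediate values for one key into a final value."""
    return key, sum(values)


def run_job(records):
    """Run map over every (filename, line) record, then shuffle and reduce."""
    intermediate = []
    for filename, line in records:
        intermediate.extend(map_fn(filename, line))
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())


records = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog")]
print(run_job(records))
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the framework performs the shuffle between the two phases.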
MapReduce Execution
Sqoop
SQL to Hadoop
Tool to import/export any JDBC-supported database into Hadoop
Transfer data between Hadoop and external databases or EDW
High performance connectors for some RDBMS
Developed at Cloudera
Flume
Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated
Design goals
Reliability, Scalability, Manageability, Extensibility
Developed at Cloudera
Flume: high-level architecture
[Diagram: Agents send data through Processors to Collector(s); a Master sends configuration to all Agents]
Configurable levels of reliability
Guarantee delivery in event of failure
Deployable, centrally administered
Flexibly deploy decorators (compress, encrypt, batch) at any step to improve performance, reliability or security
Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
Parallelized writes across many collectors – as much write throughput as
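The idea of chaining decorators such as batch and compress onto an event stream can be illustrated with plain Python generators. This is only a conceptual sketch under assumed names; it does not use Flume's configuration language or API.

```python
import gzip
import json


def batch(events, size):
    """Group a stream of events into lists of at most `size` items."""
    chunk = []
    for event in events:
        chunk.append(event)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def compress(batches):
    """Serialize each batch as JSON and gzip the result."""
    for chunk in batches:
        yield gzip.compress(json.dumps(chunk).encode("utf-8"))


def decompress(blob):
    """Inverse of compress() for a single batch, as a collector might apply it."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical log events passing through batch, then compress.
events = [f"log line {i}" for i in range(5)]
blobs = list(compress(batch(events, size=2)))
print(len(blobs), decompress(blobs[0]))
```

Because each step only consumes and produces a stream, steps can be added, removed or reordered without changing the others, which is the property the decorator design aims for.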
HBase
Column-family store. Based on design of Google BigTable
Provides interactive access to information
Holds extremely large datasets (multi-TB)
Constrained access model
(key, value) lookup
Limited transactions (only one row)
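The constrained access model can be pictured as a map from row key to cells, where each cell is addressed by column family and qualifier and updates are atomic only within one row. The class below is an in-memory stand-in written for illustration; its names are hypothetical and it is not the HBase client API.

```python
import threading
from collections import defaultdict


class TinyTable:
    """In-memory sketch of a column-family store (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(dict)  # row key -> {(family, qualifier): value}
        self._lock = threading.Lock()

    def put_row(self, row_key, cells):
        """Apply all cell updates for a single row as one atomic step."""
        for family, _qualifier in cells:
            if family not in self.families:
                raise KeyError(f"unknown column family: {family}")
        with self._lock:
            self.rows[row_key].update(cells)

    def get(self, row_key, family, qualifier):
        """Look up one cell by (row key, family, qualifier); None if absent."""
        return self.rows.get(row_key, {}).get((family, qualifier))


table = TinyTable(families=["info"])
table.put_row("user1", {("info", "name"): "Ada", ("info", "city"): "Paris"})
print(table.get("user1", "info", "name"))
```

There is no operation here that spans two rows atomically, which mirrors the "only one row" limitation listed above.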
HBase
Hive
SQL-based data warehousing application
Language is SQL-like
Supports SELECT, JOIN, GROUP BY, etc.
Features for analyzing very large data sets
Partition columns, Sampling, Buckets
Example:
SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5;
Pig
Data-flow oriented language – “Pig Latin”
Datatypes include sets, associative arrays, tuples
High-level language for routing data, allows easy integration of Java for complex tasks
Example:
emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people.txt';
Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop
Zookeeper
Zookeeper is a distributed consensus engine
Provides well-defined concurrent access semantics:
Leader election
Service discovery
Distributed locking / mutual exclusion
Message board / mailboxes
Pipes and Streaming
Multi-language connector libraries for MapReduce
Write native-code MapReduce in C++
Write MapReduce passes in any scripting language, including Perl and Python
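A Streaming job exchanges records with scripts as lines of text, conventionally `key<TAB>value`, and delivers the reducer's input sorted by key. The sketch below shows a word-count mapper and reducer in that style; the function names are assumptions, and a local sort stands in for the framework so the example runs without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the framework: sort the map output, then reduce it.
lines = ["to be or not", "to be"]
mapped = sorted(mapper(lines))
print(list(reducer(mapped)))
```

In an actual Streaming job each function would read from standard input and write to standard output in its own script, and the sort between them would be done by Hadoop.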
FUSE - DFS
Allows mounting of HDFS volumes via the Linux FUSE file system
Does allow easy integration with other systems for data import/export
Does not imply HDFS can be used as a general-purpose file system
Hadoop Security
Authentication is secured by Kerberos v5 and integrated with LDAP
Hadoop server can ensure that users and groups are who they say they are
Job Control includes Access Control Lists, which means jobs can specify who can view logs, counters and configurations, and who can modify a job
Tasks now run as the user who launched the job
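The job-level access control described above amounts to two per-job lists: who may view and who may modify. The sketch below models that check with a hypothetical `JobAcl` class; it assumes the job owner always has both rights and is not Hadoop's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class JobAcl:
    """Hypothetical per-job ACL: separate view and modify lists."""
    owner: str
    viewers: set = field(default_factory=set)
    modifiers: set = field(default_factory=set)

    def can_view(self, user):
        """May the user see the job's logs, counters and configuration?"""
        return user == self.owner or user in self.viewers

    def can_modify(self, user):
        """May the user modify the job?"""
        return user == self.owner or user in self.modifiers


acl = JobAcl(owner="alice", viewers={"bob"}, modifiers={"carol"})
print(acl.can_view("bob"), acl.can_modify("bob"))
```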
Cloudera Enterprise
Simplify and Accelerate Hadoop Deployment
Reduce Adoption Costs and Risks
Lower the Cost of Administration
Increase the Transparency and Control of Hadoop
Leverage the Experience of Our Experts
Cloudera Enterprise makes open source Hadoop enterprise-easy
EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop in Production
Cloudera Manager: End-to-End Management Application for Apache Hadoop
Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
CLOUDERA ENTERPRISE COMPONENTS
Cloudera Manager
The industry’s first end-to-end management application for Apache Hadoop
Proactively manages the Apache Hadoop stack
Automates the full operational lifecycle of Apache Hadoop
DISCOVER, DIAGNOSE, OPTIMIZE, ACT
HDFS MAPREDUCE HBASE
ZOOKEEPER OOZIE HUE
Cloudera Enterprise
Including Cloudera Support
Feature: Benefit
Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera University
Public and Private Training to Enable Your Success
Class: Description
Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as “when is Hadoop appropriate?”, “what are people using Hadoop for?” and “what do I need to know about choosing Hadoop?”
Cloudera Consulting Services
Put Our Expertise To Work For You.
Service: Description
Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
Production Pilot: Deploy your first production-level project using Hadoop
Process and Team Development: Define the requirements and processes for creating a new Hadoop team
Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters
Cloudera’s team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.
Journey of the Cloudera Customer
Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
Cloudera’s Distribution: The fastest, surest path to success with Apache Hadoop
Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment
Cloudera in Production
Logs, Files, Web Data, Relational Databases
IDEs, BI / Analytics, Enterprise Reporting
Enterprise Data Warehouse
Operational Rules Engines
Management Tools
OPERATORS ENGINEERS ANALYSTS BUSINESS USERS
Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express
Cloudera Enterprise: Cloudera Management Suite, Cloudera Support
Cloudera Services: Consulting Services, Cloudera University
Web Application
CUSTOMERS
Cloudera helps you profit from all your data.
cloudera.com
+1 (888) [email protected]
twitter.com/cloudera
facebook.com/cloudera
Get Hadoop
Cloudera Manager
The first and only Hadoop management application that:
1. Manages the full Hadoop lifecycle
2. Manages and monitors the complete Hadoop stack
3. Incorporates comprehensive log and event management
4. Has Technical Support integration built-in
Cloudera Manager
Key Features and Functionality:
Automated Deployment: Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
Centralized Management: Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
Service & Configuration Management: Set server roles, configure services and manage security across the cluster. Gracefully start, stop and restart services as needed
Audit Trails: Maintains a complete record of configuration changes for SOX compliance
Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
Intelligent Log Management: Gather, view and search Hadoop logs collected from across the cluster. Scans Hadoop logs for irregularities and warns you before they impact the cluster
ONLY CLOUDERA
Cloudera Manager
Key Features and Functionality:
Global Time Control: Establishes the time context globally for almost all views. Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
Alerting: Generates email alerts when certain events occur
Operational Reports: Visualize current and historical disk usage by user, group and directory. Track MapReduce activity on the cluster by job or user
Host Level Monitoring: View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles
ONLY CLOUDERA
Two Editions: FREE EDITION and ENTERPRISE EDITION**
Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition)
[Table: features compared across the two editions]
Automated Deployment
Host-Level Monitoring
Secure Communication Between Server & Agents
Configuration Management
Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper
Audit Trails
Start/Stop/Restart Services
Add/Restart/Decommission Role Instances
Configuration Versioning & History
Support for Kerberos
Service Monitoring
Proactive Health Checks
Status & Health Summary
Intelligent Log Management
Events Management & Alerts
Activity Monitoring
Operational Reporting
Global Time Control
Support Integration
** Part of the Cloudera Enterprise subscription
View Service Health and Performance
Get Host-Level Snapshots
Monitor and Diagnose Cluster Workloads
Gather, View and Search Hadoop Logs
Track Events From Across the Cluster
Run Reports on System Performance & Usage
New in Cloudera Manager 3.7
1. Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
2. Intelligent Log Management: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
3. Global Time Control: Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
4. Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
5. Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
6. Alerts: Generates email alerts when certain events occur
7. Audit Trails: Maintains a complete record of configuration changes for SOX compliance
8. Operational Reporting: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user
ONLY CLOUDERA
Cloudera Support
Our team of experts on call to help you meet your SLAs
Feature: Benefit
Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
Proactive Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera Enterprise
Why Cloudera Enterprise?
Apache Hadoop is a distributed system that presents unique operational challenges
The fixed cost of managing an internal patch and release infrastructure is prohibitive
Apache Hadoop skills and expertise are scarce
It’s challenging to track consistently to community development efforts
Only Cloudera Enterprise
Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
Has production support backed by the Apache committers
Has the depth of experience supporting hundreds of production Apache Hadoop clusters
The Fastest Path to Success Running Apache Hadoop in Production.