CCD-410 Cloudera Study Material




1

Cloudera Certified Developer for Apache Hadoop (CCDH)


Who We Are

How We Do It
We deliver relevant products and services:
- A distribution of Apache Hadoop that is tested, certified and supported
- Comprehensive support and professional service offerings
- A suite of management software for Hadoop operations
- Training and certification programs for developers, administrators, managers and data scientists

Technical Team
Unmatched knowledge and experience:
- Founders, committers and contributors to Hadoop
- A wealth of experience in the design and delivery of production software

Credentials
The Apache Hadoop experts:
- Number 1 distribution of Apache Hadoop in the world
- Largest contributor to the open source Hadoop ecosystem
- More committers on staff than any other company
- More than 100 customers across a wide variety of industries
- Strong growth in revenue and new accounts

Mission: To help organizations profit from their data

Leadership
Strong executive team with proven abilities:

Mike Olson, CEO; Kirk Dunn, COO; Charles Zedlewski, VP Product; Mary Rorabaugh, CFO; Jeff Hammerbacher, Chief Scientist; Amr Awadalla, VP Engineering; Doug Cutting, Chief Architect; Omer Trajman, VP Customer Solutions


How we do it: by offering a complete set of products and services to enable our customers: training, services, support and management software.

We ARE the experts. We have the #1 Hadoop distribution, the most project founders and committers, and the most customers. We train thousands of people and certify many. Our services team can take you from best practices to cluster certification to fine-tuning your HBase implementation.

Our tech team is made up of project founders and committers, and our executive team is broad and deep across open source, web and enterprise companies.

Users of Cloudera

3

Financial, Web, Retail & Consumer, Media, Telecom


What is Apache Hadoop?

Hadoop Distributed File System (HDFS)

File Sharing & Data Protection Across Physical Servers

MapReduce

Distributed Computing Across Physical Servers

Flexibility
- A single repository for storing, processing & analyzing any type of data
- Not bound by a single schema

Scalability
- Scale-out architecture divides workloads across multiple nodes
- Flexible file system eliminates ETL bottlenecks

Low Cost
- Can be deployed on commodity hardware
- Open source platform guards against vendor lock-in

Hadoop is a platform for data storage and processing that is:
- Scalable
- Fault tolerant
- Open source

CORE HADOOP COMPONENTS

What Makes Hadoop Different?
- Ability to scale out to petabytes in size using commodity hardware
- Processing (MapReduce) jobs are sent to the data rather than shipping the data to be processed
- Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured and unstructured data
- Manages fault tolerance and data replication automatically

5

The largest known cluster under management is at Facebook, with 21PB across 2,000 nodes.

Why the Need for Hadoop?

6

[Chart: gigabytes of data created (in billions), 2005-2015, structured vs. unstructured data. Source: IDC 2011]

- 1.8 trillion gigabytes of data was created in 2011
- More than 90% is unstructured data
- Approx. 500 quadrillion files
- Quantity doubles every 2 years

Hadoop Use Cases

7

Use cases by industry, split between advanced analytics and data processing:

Industry         Advanced Analytics               Data Processing
Web              Social Network Analysis          Clickstream Sessionization
Media            Content Optimization             Clickstream Sessionization
Telco            Network Analytics                Mediation
Retail           Loyalty & Promotions Analysis    Data Factory
Financial        Fraud Analysis                   Trade Reconciliation
Federal          Entity Analysis                  SIGINT
Bioinformatics   Genome Mapping                   Sequencing Analysis

Hadoop in the Enterprise

8

[Diagram: data sources (logs, files, web data, relational databases) flow into Hadoop; IDEs, BI/analytics tools, enterprise reporting, the enterprise data warehouse and web applications consume from it; management tools oversee the cluster. Users: operators, engineers, analysts, business users, customers]

Apache Hadoop is a new solution in your existing infrastructure. It does not replace any major existing investment. It brings data that you're already generating into context and integrates it with your business. You get access to key information about how your business is operating by pulling together:
- Web and application logs
- Unstructured files
- Web data
- Relational data

Hadoop is used by your team to analyze this data and deliver it to business users directly and via existing data management technologies.

What is CDH?

Fastest Path to Success
- No need to write your own scripts or do integration testing on different components
- Works with a wide range of operating systems, hardware, databases and data warehouses

Stable and Reliable
- Extensive Cloudera QA systems, software & processes
- Tested & run in production at scale
- Proven at scale in dozens of enterprise environments

Community Driven
- Incorporates only main-line components from the Apache Hadoop ecosystem (no forks or proprietary underpinnings)

FREE

Cloudera's Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is:
- 100% Apache open source
- Contains all components needed for deployment
- Fully documented and supported
- Released on a reliable schedule

9

Component    Cloudera Committers    Cloudera Founder    2011 Commits
Common       6                      Yes                 #1
HDFS         6                      Yes                 #2
MapReduce    5                      Yes                 #1
HBase        2                      No                  #2
Zookeeper    1                      Yes                 #2
Oozie        1                      Yes                 #1
Pig          0                      No                  #3
Hive         1                      No                  #2
Sqoop        2                      Yes                 #1
Flume        3                      Yes                 #1
Hue          3                      Yes                 #1
Snappy       2                      No                  #1
Bigtop       8                      Yes                 #1
Avro         4                      Yes                 #1
Whirr        2                      Yes                 #1

Clouderas Commitment to the Open Source Community

10

Components of CDH

11

Coordination: Apache ZooKeeper
Data Integration: Apache Flume, Apache Sqoop
Fast Read/Write Access: Apache HBase
Languages / Compilers: Apache Pig, Apache Hive
Workflow: Apache Oozie
Scheduling: Apache Oozie
File System Mount: FUSE-DFS
User Interface: Hue

Cloudera Enterprise


Hadoop Distributed File System
- Block Size = 64MB
- Replication Factor = 3
- Cost is $400-$500/TB

12

[Diagram: five data blocks (1-5) distributed across six DataNodes; with replication factor 3, each block is stored on three different nodes]

HDFS

- Pools commodity servers in a single hierarchical namespace
- Designed for large files that are written once and read many times
- The example shows a replication factor of 3: each data block is present on at least 3 separate DataNodes
- A typical Hadoop node has eight cores, 16GB of RAM and four 1TB SATA disks
- The default block size is 64MB, though many deployments now set it to 128MB
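The block and replication arithmetic above can be sketched in a few lines of Python (an illustration of the defaults quoted in the notes, not Hadoop code):

```python
# Illustrative sketch: how a file is split into fixed-size blocks and how
# much raw cluster capacity replication consumes. Values follow the
# defaults quoted above (64MB blocks, replication factor 3).
import math

BLOCK_SIZE = 64 * 1024 * 1024   # default block size in bytes
REPLICATION = 3                  # default replication factor

def storage_plan(file_size_bytes):
    """Return (num_blocks, raw_bytes_stored) for a file in HDFS.
    The last block is not padded, so raw storage is just size x replication."""
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return num_blocks, file_size_bytes * REPLICATION

# A 200MB file occupies 4 blocks (3 full + 1 partial) and, with 3x
# replication, 600MB of raw cluster capacity.
blocks, raw = storage_plan(200 * 1024 * 1024)
```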


Components of Hadoop

NameNode
- Holds all metadata for HDFS
- Needs to be a highly reliable machine:
  - RAID drives, typically RAID 10
  - Dual power supplies
  - Dual network cards (bonded)
- The more memory the better: typically 36GB to 64GB

Secondary NameNode
- Provides checkpointing for the NameNode
- Should use the same hardware as the NameNode

13

Components of Hadoop

DataNodes
- Hardware will depend on the specific needs of the cluster
- No RAID needed; JBOD (just a bunch of disks) is used
- Typical ratio per node: 1 hard drive : 2 cores : 4GB of RAM

14

Networking
- One of the most important things to consider when setting up a Hadoop cluster
- Typically a top-of-rack switch is used, with a core switch connecting the racks
- Be careful not to oversubscribe the backplane of the switch!

15

Map

Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).

map() produces one or more intermediate values along with an output key from the input.
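As a sketch of the model (the classic word-count example, written in plain Python rather than any particular Hadoop API), a mapper turns one input record into zero or more intermediate pairs:

```python
# Word-count mapper sketch: the input record is a (key, value) pair such
# as (filename, line); map() emits one intermediate (word, 1) pair per word.
def map_fn(key, line):
    """Emit (word, 1) for every word in the input line."""
    return [(word, 1) for word in line.split()]

pairs = map_fn("file1.txt", "the quick brown fox the")
# pairs == [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1), ("the", 1)]
```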

[Diagram: the map task emits (key, values) pairs; the shuffle phase groups intermediate values by key; the reduce task produces the final (key, values) output]

16

Reduce

After the map phase is over, all the intermediate values for a given output key are combined together into a list.

reduce() combines those intermediate values into one or more final values for that same output key
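Continuing the word-count sketch in plain Python (again an illustration of the model, not Hadoop's API), the shuffle groups intermediate pairs by key and reduce() collapses each group to a final value:

```python
from collections import defaultdict

# Sketch of the shuffle and reduce phases: intermediate (key, value) pairs
# are grouped by key, then reduce() combines each key's value list.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    # Word count: the final value for a word is the sum of its 1s.
    return key, sum(values)

intermediate = shuffle([("the", 1), ("fox", 1), ("the", 1)])
final = dict(reduce_fn(k, v) for k, v in intermediate.items())
# final == {"the": 2, "fox": 1}
```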


17

MapReduce Execution

18


Sqoop

SQL to Hadoop
- Tool to import/export any JDBC-supported database into Hadoop
- Transfers data between Hadoop and external databases or EDWs
- High-performance connectors for some RDBMSs
- Developed at Cloudera

19

Flume

- Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
- Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated
- Design goals: reliability, scalability, manageability, extensibility
- Developed at Cloudera

20

Flume: high-level architecture

[Diagram: multiple Agents feed Processors, which feed one or more Collectors; a Master sends configuration to all Agents; decorators such as compress, encrypt and batch can be applied along the path]

Agents
- Configurable levels of reliability
- Guarantee delivery in the event of failure
- Deployable and centrally administered

Decorators
- Flexibly deploy decorators (compress, encrypt, batch) at any step to improve performance, reliability or security
- Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment

Collectors
- Write to multiple HDFS file formats (text, sequence, JSON, Avro, others)
- Writes are parallelized across many collectors, providing as much write throughput as needed

Master
- Sends configuration to all Agents

21

HBase

- Column-family store, based on the design of Google BigTable
- Provides interactive access to information
- Holds extremely large datasets (multi-TB)
- Constrained access model:
  - (key, value) lookup
  - Limited transactions (single row only)
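The access model can be pictured with a toy Python class (a simplified illustration, not the HBase API): values are addressed by (row key, column family, qualifier), and operations touch one row at a time:

```python
# Toy model of a column-family store. Like HBase, column families are
# fixed at table creation, while qualifiers within a family are dynamic,
# and each put/get touches a single row.
class ColumnFamilyStore:
    def __init__(self, families):
        self.families = set(families)   # declared up front, as in HBase
        self.rows = {}                  # row key -> family -> qualifier -> value

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        # (key, value) lookup; returns None for absent cells.
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

table = ColumnFamilyStore(families=["info"])
table.put("user42", "info", "name", "Ada")
```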

22

HBase

23

Hive

SQL-based data warehousing application
- Language is SQL-like: supports SELECT, JOIN, GROUP BY, etc.
- Features for analyzing very large data sets: partition columns, sampling, buckets

Example:

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;

24

Pig

Data-flow oriented language ("Pig Latin")
- Datatypes include sets, associative arrays, tuples
- High-level language for routing data; allows easy integration of Java for complex tasks

Example:

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people.txt';

25

Oozie

Oozie is a workflow/coordination service to manage data processing jobs for Hadoop.

Zookeeper

Zookeeper is a distributed consensus engine. It provides well-defined concurrent access semantics for:
- Leader election
- Service discovery
- Distributed locking / mutual exclusion
- Message boards / mailboxes
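Leader election, for example, is commonly built on ZooKeeper's ephemeral sequential nodes. The recipe can be simulated in plain Python (a sketch of the idea, not the ZooKeeper client API):

```python
import itertools

# Simulation of the ZooKeeper leader-election recipe: each candidate
# creates an "ephemeral sequential" znode, and whoever holds the lowest
# sequence number is the leader. When a session dies, its ephemeral node
# disappears and leadership passes to the next-lowest number.
class Election:
    def __init__(self):
        self._seq = itertools.count()
        self.znodes = {}                 # sequence number -> candidate name

    def join(self, candidate):
        """Create an ephemeral sequential node for this candidate."""
        n = next(self._seq)
        self.znodes[n] = candidate
        return n

    def leader(self):
        """The lowest surviving sequence number wins."""
        return self.znodes[min(self.znodes)]

    def crash(self, seq):
        """Session loss deletes the ephemeral node."""
        del self.znodes[seq]

e = Election()
a = e.join("server-a")
b = e.join("server-b")
# server-a joined first, so it leads; if it crashes, server-b takes over.
```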

27

Pipes and Streaming

Multi-language connector libraries for MapReduce:
- Pipes: write native-code MapReduce in C++
- Streaming: write MapReduce passes in any scripting language, including Perl and Python
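A Streaming-style pass can be sketched in Python: Streaming scripts read lines on stdin and emit tab-separated key/value pairs on stdout, and Hadoop sorts mapper output by key before the reducer runs. The functions below model that protocol over ordinary iterators (an illustration, not a ready-to-submit job):

```python
# Streaming-style word count. The framework guarantees the reducer sees
# its input sorted by key, so equal keys arrive contiguously and can be
# summed with a single pass.
def streaming_mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current and current is not None:
            yield f"{current}\t{total}"   # key changed: flush the previous count
            total = 0
        current = key
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"       # flush the last key

mapped = sorted(streaming_mapper(["b a", "a"]))  # stand-in for the sort phase
result = list(streaming_reducer(mapped))         # ["a\t2", "b\t1"]
```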

28

FUSE - DFS

- Allows mounting of HDFS volumes via the Linux FUSE file system
- Allows easy integration with other systems for data import/export
- Does not imply HDFS can be used as a general-purpose file system

29

Hadoop Security

- Authentication is secured by Kerberos v5 and integrated with LDAP: Hadoop servers can ensure that users and groups are who they say they are
- Job control includes Access Control Lists: jobs can specify who can view logs, counters and configurations, and who can modify a job
- Tasks now run as the user who launched the job
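The job ACL idea can be illustrated with a small Python sketch (the class and field names here are hypothetical, not Hadoop configuration keys):

```python
# Hypothetical sketch of job-level ACLs as described above: a job carries
# view and modify lists, and requests are checked against them plus the
# job owner, who always has full access.
class JobACL:
    def __init__(self, owner, view_users=(), modify_users=()):
        self.owner = owner
        self.view_users = set(view_users)
        self.modify_users = set(modify_users)

    def can_view(self, user):
        # Viewing covers logs, counters and configuration.
        return user == self.owner or user in self.view_users

    def can_modify(self, user):
        # Modifying covers killing or altering the job.
        return user == self.owner or user in self.modify_users

acl = JobACL(owner="alice", view_users={"bob"})
```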

Cloudera Enterprise

31

Cloudera Enterprise makes open source Hadoop enterprise-easy:
- Simplify and accelerate Hadoop deployment
- Reduce adoption costs and risks
- Lower the cost of administration
- Increase the transparency and control of Hadoop
- Leverage the experience of our experts

EFFECTIVENESS: ensuring you get value from your Hadoop deployment
EFFICIENCY: enabling you to affordably run Hadoop in production

Cloudera Manager

End-to-End Management Application for Apache Hadoop

Production-Level Support

Our Team of Experts On-Call to Help You Meet Your SLAs

CLOUDERA ENTERPRISE COMPONENTS

Cloudera Manager

32

- The industry's first end-to-end management application for Apache Hadoop
- Proactively manages the Apache Hadoop stack
- Automates the full operational lifecycle of Apache Hadoop
- DISCOVER, DIAGNOSE, OPTIMIZE, ACT

Covers: HDFS, MapReduce, HBase, ZooKeeper, Oozie, Hue

Cloudera Enterprise

Demo

33

Cloudera Enterprise

Including Cloudera Support

Feature / Benefit
- Flexible Support Windows: choose from 8x5 or 24x7 options to meet SLA requirements
- Configuration Checks: verify that your Hadoop cluster is fine-tuned for your environment
- Issue Resolution and Escalation Processes: proven processes ensure that support cases get resolved with maximum efficiency
- Comprehensive Knowledgebase: browse through hundreds of articles and tech notes to expand your knowledge of Apache Hadoop
- Certified Connectors: connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
- Notification of New Developments and Events: stay up to speed with what's going on in the Apache Hadoop community

34

Cloudera University

Public and Private Training to Enable Your Success

Class / Description
- Developer Training & Certification (4 Days): hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
- System Administrator Training & Certification (3 Days): hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
- HBase Training (2 Days): covers the HBase architecture, data model and Java API as well as some advanced topics and best practices
- Analyzing Data with Hive and Pig (2 Days): designed for people who have a basic understanding of how Apache Hadoop works and want to use these languages to analyze their data
- Essentials for Managers (1 Day): provides decision-makers the information they need about Apache Hadoop, answering questions such as "when is Hadoop appropriate?", "what are people using Hadoop for?" and "what do I need to know about choosing Hadoop?"

Cloudera Consulting Services

Put Our Expertise To Work For You.

Service / Description
- Use Case Discovery: assess the appropriateness and value of Hadoop for your organization
- New Hadoop Deployment: set up and configure high-performance, production-ready Hadoop clusters
- Proof of Concept: verify the prototype functionality and project feasibility for a new Hadoop cluster
- Production Pilot: deploy your first production-level project using Hadoop
- Process and Team Development: define the requirements and processes for creating a new Hadoop team
- Hadoop Deployment Certification: perform periodic health checks to certify and tune up existing Hadoop clusters

Cloudera's team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.

Journey of the Cloudera Customer

Discover the Benefits of Apache Hadoop
Flexibility to store and mine all types of data

Adopt Cloudera's Distribution (CDH)
The fastest, surest path to success with Apache Hadoop

Subscribe to Cloudera Enterprise
Simplify and accelerate Apache Hadoop deployment

Apache Hadoop
- Gain the flexibility to store and mine all types of data
- Leverage the scale-out architecture for complex data analysis
- Easily scale to meet growing data requirements
- Avoid vendor lock-in with an open source technology

CDH
- The fastest, surest path to success with Apache Hadoop
- Stable, reliable version of Apache Hadoop without the vendor lock-in imposed by proprietary vendors
- Integrates with your other technology platforms, ensuring investment protection

Cloudera Enterprise
- Simplify and accelerate Apache Hadoop deployment
- Reduce adoption costs and risks
- More effectively manage cluster resources
- Leverage the experience of our experts

37

Cloudera in Production

38

[Diagram: data sources (logs, files, web data, relational databases) feed the cluster; IDEs, BI/analytics tools, enterprise reporting, the enterprise data warehouse, operational rules engines and web applications consume from it; management tools oversee the cluster. Users: operators, engineers, analysts, business users, customers]

Cloudera's Distribution Including Apache Hadoop (CDH) & SCM Express

Cloudera Enterprise: Cloudera Management Suite, Cloudera Support

Cloudera Services: Consulting Services, Cloudera University


Cloudera helps you profit from all your data.

cloudera.com

+1 (888) [email protected]

twitter.com/cloudera
facebook.com/cloudera
Get Hadoop

39

Cloudera Manager

The first and only Hadoop management application that:

1. Manages the full Hadoop lifecycle

2. Manages and monitors the complete Hadoop stack

3. Incorporates comprehensive log and event management

4. Has Technical Support integration built-in

Cloudera Manager

Key Features and Functionality:

Automated Deployment
- Installs the complete Hadoop stack in minutes; a simple, wizard-based interface guides you through the steps

Centralized Management
- Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface

Service & Configuration Management
- Set server roles, configure services and manage security across the cluster
- Gracefully start, stop and restart services as needed

Audit Trails
- Maintains a complete record of configuration changes for SOX compliance

Proactive Health Checks
- Monitors dozens of service performance metrics and alerts you when you approach critical thresholds

Intelligent Log Management
- Gather, view and search Hadoop logs collected from across the cluster
- Scans Hadoop logs for irregularities and warns you before they impact the cluster


Cloudera Manager
Key Features and Functionality (continued):

Global Time Control
- Establishes the time context globally for almost all views
- Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis

Support Integration
- Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution

Event Management
- Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities, and makes them available for alerting and searching

Alerting
- Generates email alerts when certain events occur

Operational Reports
- Visualize current and historical disk usage by user, group and directory
- Track MapReduce activity on the cluster by job or user

Host-Level Monitoring
- View information pertaining to hosts in your cluster, including status, resident memory, virtual memory and roles


Two Editions: Free Edition and Enterprise Edition** (** part of the Cloudera Enterprise subscription)

- Max number of nodes supported: 50 (Free) / unlimited (Enterprise)
- Automated deployment
- Host-level monitoring
- Secure communication between server & agents
- Configuration management: manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper; audit trails; start/stop/restart services; add/restart/decommission role instances; configuration versioning & history; support for Kerberos
- Service monitoring: proactive health checks; status & health summary
- Intelligent log management
- Events management & alerts
- Activity monitoring
- Operational reporting
- Global time control
- Support integration

44

View Service Health and Performance

45

Get Host-Level Snapshots

46

Monitor and Diagnose Cluster Workloads

Gather, View and Search Hadoop Logs

Track Events From Across the Cluster

Run Reports on System Performance & Usage

New in Cloudera Manager 3.7

1. Proactive Health Checks: monitors dozens of service performance metrics and alerts you when you approach critical thresholds
2. Intelligent Log Management: gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
3. Global Time Control: correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
4. Support Integration: takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
5. Event Management: creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities, and makes them available for alerting and searching
6. Alerts: generates email alerts when certain events occur
7. Audit Trails: maintains a complete record of configuration changes for SOX compliance
8. Operational Reporting: visualize current and historical disk usage by user, group and directory, and track MapReduce activity on the cluster by job or user


Cloudera Support

Our team of experts on call to help you meet your SLAs

Feature / Benefit
- Flexible Support Windows: choose from 8x5 or 24x7 options to meet SLA requirements
- Configuration Checks: verify that your Hadoop cluster is fine-tuned for your environment
- Issue Resolution and Escalation Processes: proven processes ensure that support cases get resolved with maximum efficiency
- Comprehensive Knowledgebase: browse through hundreds of articles and tech notes to expand your knowledge of Apache Hadoop
- Certified Connectors: connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics and MicroStrategy
- Proactive Notification of New Developments and Events: stay up to speed with what's going on in the Apache Hadoop community

51

Cloudera Enterprise

52

Why Cloudera Enterprise?
- Apache Hadoop is a distributed system that presents unique operational challenges
- The fixed cost of managing an internal patch and release infrastructure is prohibitive
- Apache Hadoop skills and expertise are scarce
- It's challenging to track consistently to community development efforts

Only Cloudera Enterprise:
- Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
- Has production support backed by the Apache committers
- Has the depth of experience of supporting hundreds of production Apache Hadoop clusters

The Fastest Path to Success Running Apache Hadoop in Production.



MapReduce: Distributed Processing

54

Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is similar to an RDBMS, which executes the queries, and SQL, which is the language for the queries.
- MapReduce can run on top of HDFS or a selection of other storage systems
- Intelligent scheduling algorithms for locality, sharing and resource optimization

Thank you.