Cloudera Certified Developer for Apache Hadoop (CCDH)
Who We Are
How We Do It
We deliver relevant products and services:
- A distribution of Apache Hadoop that is tested, certified and supported
- Comprehensive support and professional service offerings
- A suite of management software for Hadoop operations
- Training and certification programs for developers, administrators, managers and data scientists

Technical Team
Unmatched knowledge and experience:
- Founders, committers and contributors to Hadoop
- A wealth of experience in the design and delivery of production software

Credentials
The Apache Hadoop experts:
- Number 1 distribution of Apache Hadoop in the world
- Largest contributor to the open source Hadoop ecosystem
- More committers on staff than any other company
- More than 100 customers across a wide variety of industries
- Strong growth in revenue and new accounts
Mission: To help organizations profit from their data
Leadership
Strong executive team with proven abilities:
- Mike Olson, CEO
- Kirk Dunn, COO
- Charles Zedlewski, VP, Product
- Mary Rorabaugh, CFO
- Jeff Hammerbacher, Chief Scientist
- Amr Awadalla, VP, Engineering
- Doug Cutting, Chief Architect
- Omer Trajman, VP, Customer Solutions
How we do it: we offer a complete set of products and services to enable our customers: training, services, support and management software.
We ARE the experts. We have the #1 Hadoop distribution, the most project founders and committers, and the most customers. We train thousands of people and certify many. Our services team can take you from best practices to cluster certification to fine-tuning your HBase implementation.
Our technical team is made up of project founders and committers, and our executive team is broad and deep across open source, web and enterprise companies.
Users of Cloudera
Financial, Web, Retail & Consumer, Media, Telecom
What is Apache Hadoop?
CORE HADOOP COMPONENTS

Hadoop Distributed File System (HDFS)
File sharing & data protection across physical servers

MapReduce
Distributed computing across physical servers

Hadoop is a platform for data storage and processing that is:
- Scalable
- Fault tolerant
- Open source

Flexibility
- A single repository for storing, processing & analyzing any type of data
- Not bound by a single schema

Scalability
- Scale-out architecture divides workloads across multiple nodes
- Flexible file system eliminates ETL bottlenecks

Low Cost
- Can be deployed on commodity hardware
- Open source platform guards against vendor lock-in
What Makes Hadoop Different?
- Ability to scale out to petabytes in size using commodity hardware
- Processing (MapReduce) jobs are sent to the data rather than shipping the data to be processed
- Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured and unstructured data
- Manages fault tolerance and data replication automatically

The largest known cluster under management is at Facebook, with 21PB across 2,000 nodes.
Why the Need for Hadoop?
[Chart: gigabytes of data created (in billions), structured vs. unstructured, 2005-2015. Source: IDC 2011]
- 1.8 trillion gigabytes of data was created in 2011
- More than 90% is unstructured data
- Approximately 500 quadrillion files
- Quantity doubles every 2 years
Hadoop Use Cases
Industry         Advanced Analytics               Data Processing
Web              Social Network Analysis          Clickstream Sessionization
Media            Content Optimization             Clickstream Sessionization
Telco            Network Analytics                Mediation
Retail           Loyalty & Promotions Analysis    Data Factory
Financial        Fraud Analysis                   Trade Reconciliation
Federal          Entity Analysis                  SIGINT
Bioinformatics   Sequencing Analysis              Genome Mapping
Hadoop in the Enterprise
[Diagram: data sources (logs, files, web data, relational databases) feed Hadoop, which connects to IDEs, BI/analytics tools, enterprise reporting, the enterprise data warehouse and web applications via management tools, serving operators, engineers, analysts, business users and customers]
Apache Hadoop is a new solution in your existing infrastructure. It does not replace any major existing investment. Hadoop brings data that you're already generating into context and integrates it with your business. You get access to key information about how your business is operating by pulling together:
- Web and application logs
- Unstructured files
- Web data
- Relational data
Hadoop is used by your team to analyze this data and deliver it to business users directly and via existing data management technologies.
What is CDH?

Cloudera's Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that:
- Is 100% Apache open source
- Contains all components needed for deployment
- Is fully documented and supported
- Is released on a reliable schedule

Fastest Path to Success
- No need to write your own scripts or do integration testing on different components
- Works with a wide range of operating systems, hardware, databases and data warehouses

Stable and Reliable
- Extensive Cloudera QA systems, software & processes
- Tested & run in production at scale
- Proven at scale in dozens of enterprise environments

Community Driven
- Incorporates only main-line components from the Apache Hadoop ecosystem (no forks or proprietary underpinnings)

FREE
Cloudera's Commitment to the Open Source Community

Component    Cloudera Committers   Cloudera Founder   2011 Commits
Common       6                     Yes                #1
HDFS         6                     Yes                #2
MapReduce    5                     Yes                #1
HBase        2                     No                 #2
ZooKeeper    1                     Yes                #2
Oozie        1                     Yes                #1
Pig          0                     No                 #3
Hive         1                     No                 #2
Sqoop        2                     Yes                #1
Flume        3                     Yes                #1
Hue          3                     Yes                #1
Snappy       2                     No                 #1
Bigtop       8                     Yes                #1
Avro         4                     Yes                #1
Whirr        2                     Yes                #1
Components of CDH

Coordination: Apache ZooKeeper
Data Integration: Apache Flume, Apache Sqoop
Fast Read/Write Access: Apache HBase
Languages / Compilers: Apache Pig, Apache Hive
Workflow: Apache Oozie
Scheduling: Apache Oozie
File System Mount: FUSE-DFS
User Interface: Hue
(Cloudera Enterprise)
Hadoop Distributed File System
- Block size = 64MB
- Replication factor = 3
- Cost is $400-$500/TB
[Diagram: five blocks (1-5) of a file stored in HDFS, each replicated on three of six DataNodes]
- Pools commodity servers in a single hierarchical namespace
- Designed for large files that are written once and read many times
- The example shows a replication factor of 3: each data block is present on at least 3 separate DataNodes
- A typical Hadoop node is eight cores with 16GB RAM and four 1TB SATA disks
- The default block size is 64MB, though many deployments now set it to 128MB
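The replication scheme above can be pictured with a short Python sketch. This is a toy model, not HDFS code: the block IDs and DataNode names are invented, and real HDFS placement is rack-aware rather than round-robin.

```python
import itertools

def place_blocks(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.
    (Toy model only: real HDFS placement is rack-aware.)"""
    placement = {}
    ring = itertools.cycle(range(len(datanodes)))
    for block in blocks:
        start = next(ring)
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(replication)]
    return placement

blocks = [1, 2, 3, 4, 5]                       # five 64MB blocks of one file
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"]
placement = place_blocks(blocks, nodes)

# Every block lives on 3 distinct nodes, so losing any single
# DataNode still leaves at least 2 live replicas of each block.
for block, replicas in placement.items():
    assert len(set(replicas)) == 3
```

The point of the model is the invariant checked at the end: with replication factor 3, no single node failure can make a block unreadable.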
Components of Hadoop

NameNode: holds all metadata for HDFS. It needs to be a highly reliable machine:
- RAID drives, typically RAID 10
- Dual power supplies
- Dual network cards (bonded)
- The more memory the better: typically 36GB to 64GB

Secondary NameNode: provides checkpointing for the NameNode. The same hardware as the NameNode should be used.
DataNodes: hardware will depend on the specific needs of the cluster.
- No RAID needed; JBOD (just a bunch of disks) is used
- Typical ratio is 1 hard drive : 2 cores : 4GB of RAM
Networking
- One of the most important things to consider when setting up a Hadoop cluster
- Typically top-of-rack switches are used, connected to a core switch
- Be careful not to oversubscribe the backplane of the switch!
Map
Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g. (filename, line).
map() produces one or more intermediate values along with an output key from the input.
[Diagram: map tasks emit (key, values) pairs; the shuffle phase groups the intermediate values by key; the reduce task produces the final (key, values)]
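As a concrete illustration (not taken from the slides), here is what a word-count map function looks like in plain Python; the (filename, line) input pair follows the convention described above.

```python
def map_wordcount(key, value):
    """map(): takes an input (key, value) pair -- here (filename, line) --
    and emits zero or more intermediate (key, value) pairs."""
    for word in value.lower().split():
        yield (word, 1)          # one intermediate pair per word occurrence

pairs = list(map_wordcount("poem.txt", "the cat and the hat"))
# → [('the', 1), ('cat', 1), ('and', 1), ('the', 1), ('hat', 1)]
```

Note that the map function sees each record independently; it never needs the whole file, which is what lets the framework run many map tasks in parallel on separate blocks.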
Reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list.

reduce() combines those intermediate values into one or more final values for that same output key.
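Continuing the illustrative Python word count, the shuffle groups intermediate pairs by key and reduce() folds each list of values into a final value:

```python
from collections import defaultdict

def shuffle(intermediate_pairs):
    """Group all intermediate values by key, as the shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in intermediate_pairs:
        grouped[key].append(value)
    return grouped

def reduce_wordcount(key, values):
    """reduce(): combine all intermediate values for one key into a final value."""
    return (key, sum(values))

intermediate = [('the', 1), ('cat', 1), ('and', 1), ('the', 1), ('hat', 1)]
grouped = shuffle(intermediate)                  # {'the': [1, 1], 'cat': [1], ...}
final = dict(reduce_wordcount(k, v) for k, v in grouped.items())
# → {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
```

In a real job the framework performs the shuffle across the network and guarantees that all values for one key reach the same reduce task.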
MapReduce Execution
Sqoop ("SQL to Hadoop")
- Tool to import/export any JDBC-supported database into Hadoop
- Transfers data between Hadoop and external databases or EDWs
- High-performance connectors for some RDBMSs
- Developed at Cloudera
Flume
- Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
- Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated
- Design goals: reliability, scalability, manageability, extensibility
- Developed at Cloudera
Flume: high-level architecture
[Diagram: multiple Agents feed Processors, which feed one or more Collectors writing to HDFS; a Master sends configuration to all Agents]

- Agents: configurable levels of reliability; guarantee delivery in the event of failure; deployable and centrally administered
- Decorators (compress, encrypt, batch): flexibly deploy decorators at any step to improve performance, reliability or security; optionally pre-process incoming data (transformations, suppressions, metadata enrichment)
- Collectors: write to multiple HDFS file formats (text, sequence, JSON, Avro, others); writes are parallelized across many collectors for as much write throughput as needed
HBase
- Column-family store, based on the design of Google BigTable
- Provides interactive access to information
- Holds extremely large datasets (multi-TB)
- Constrained access model: (key, value) lookup; limited transactions (single row only)
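The constrained access model can be pictured with a toy in-memory sketch. This is plain Python, not the HBase API: rows are addressed by key, values live under family:qualifier column names, and each mutation touches exactly one row, mirroring HBase's single-row transaction guarantee. The table and column names are made up.

```python
class ToyColumnFamilyStore:
    """Toy model of a column-family store: row key -> {family:qualifier -> value}.
    Illustrative only -- real HBase adds versions, regions and persistence."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, columns):
        # Single-row mutation: HBase transactions span only one row.
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        # Constrained access model: lookup by row key only, no ad-hoc queries.
        return self.rows.get(row_key, {})

t = ToyColumnFamilyStore()
t.put("user42", {"info:name": "Ada", "info:city": "London"})
t.put("user42", {"metrics:visits": 7})   # new column family, same row
assert t.get("user42")["info:name"] == "Ada"
```

The design trade-off the model shows: by giving up general queries and multi-row transactions, the store can scale rows across many servers while keeping single-row reads and writes fast.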
Hive
- SQL-based data warehousing application
- Language is SQL-like: supports SELECT, JOIN, GROUP BY, etc.
- Features for analyzing very large data sets: partition columns, sampling, buckets

Example (joining word-frequency tables, aliased s and k):
SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
Pig
- Data-flow oriented language ("Pig Latin")
- Datatypes include sets, associative arrays, tuples
- High-level language for routing data; allows easy integration of Java for complex tasks

Example:
emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people.txt';
Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop.
ZooKeeper
ZooKeeper is a distributed consensus engine. It provides well-defined concurrent access semantics:
- Leader election
- Service discovery
- Distributed locking / mutual exclusion
- Message board / mailboxes
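Leader election on ZooKeeper is commonly built from sequential ephemeral znodes: each client creates a numbered node, and the lowest surviving number is the leader. Here is a minimal Python model of that recipe; it involves no real ZooKeeper, and the class and client names are invented for illustration.

```python
class ToyElection:
    """Models ZooKeeper's sequential-ephemeral-znode leader election recipe."""

    def __init__(self):
        self.next_seq = 0
        self.members = {}          # client name -> sequence number

    def join(self, client):
        # Like create("/election/node-", EPHEMERAL | SEQUENTIAL):
        # each joiner gets a monotonically increasing sequence number.
        self.members[client] = self.next_seq
        self.next_seq += 1

    def leave(self, client):
        # Like the ephemeral znode vanishing when a client's session dies.
        del self.members[client]

    def leader(self):
        # The lowest surviving sequence number wins.
        return min(self.members, key=self.members.get)

e = ToyElection()
for c in ("a", "b", "c"):
    e.join(c)
assert e.leader() == "a"
e.leave("a")                      # leader dies; next-lowest takes over
assert e.leader() == "b"
```

The nice property of the real recipe is that each client only watches the node just below its own, so a leader failure wakes exactly one successor rather than causing a thundering herd.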
Pipes and Streaming
- Multi-language connector libraries for MapReduce
- Pipes: write native-code MapReduce in C++
- Streaming: write MapReduce passes in any scripting language, including Perl and Python
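A Streaming pass is just a program that reads lines on stdin and writes tab-separated key/value lines on stdout; the framework sorts the mapper output by key before the reducer sees it. A minimal Python word-count mapper and reducer might look like this (a sketch; the functions take iterables of lines so the same logic works on stdin or on test data):

```python
def stream_map(lines):
    """Mapper pass: emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reduce(lines):
    """Reducer pass: input arrives sorted by key, so all counts
    for one word are adjacent and can be summed in a single pass."""
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# In a real job these would run as separate scripts wired to
# sys.stdin/sys.stdout and launched via the hadoop-streaming jar.
mapped = list(stream_map(["cat hat", "cat"]))
reduced = list(stream_reduce(sorted(mapped)))   # sorted() stands in for the shuffle
```

The single-pass reducer works only because of the sort guarantee; that is the contract Streaming inherits from the MapReduce shuffle.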
FUSE-DFS
- Allows mounting of HDFS volumes via the Linux FUSE file system
- Allows easy integration with other systems for data import/export
- Does not imply HDFS can be used as a general-purpose file system
Hadoop Security
- Authentication is secured by Kerberos v5 and integrated with LDAP
- Hadoop servers can ensure that users and groups are who they say they are
- Job control includes Access Control Lists: jobs can specify who can view logs, counters and configurations, and who can modify a job
- Tasks now run as the user who launched the job
Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy:
- Simplify and accelerate Hadoop deployment
- Reduce adoption costs and risks
- Lower the cost of administration
- Increase the transparency and control of Hadoop
- Leverage the experience of our experts

EFFECTIVENESS: ensuring you get value from your Hadoop deployment
EFFICIENCY: enabling you to affordably run Hadoop in production
Cloudera Manager
End-to-End Management Application for Apache Hadoop
Production-Level Support
Our Team of Experts On-Call to Help You Meet Your SLAs
CLOUDERA ENTERPRISE COMPONENTS
Cloudera Manager
The industry's first end-to-end management application for Apache Hadoop:
- Proactively manages the Apache Hadoop stack
- Automates the full operational lifecycle of Apache Hadoop (discover, diagnose, optimize, act)
- Covers HDFS, MapReduce, HBase, ZooKeeper, Oozie and Hue
Cloudera Enterprise
Demo
Cloudera Enterprise
Including Cloudera Support

Feature                                       Benefit
Flexible Support Windows                      Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks                          Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes     Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase                   Browse through hundreds of articles and tech notes to expand your knowledge of Apache Hadoop
Certified Connectors                          Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
Notification of New Developments and Events   Stay up to speed with what's going on in the Apache Hadoop community
Cloudera University
Public and Private Training to Enable Your Success

Class                                                    Description
Developer Training & Certification (4 days)              Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
System Administrator Training & Certification (3 days)   Hands-on training and certification for administrators responsible for setting up, configuring and monitoring an Apache Hadoop cluster
HBase Training (2 days)                                  Covers the HBase architecture, data model and Java API, as well as some advanced topics and best practices
Analyzing Data with Hive and Pig (2 days)                Designed for people who have a basic understanding of how Apache Hadoop works and want to use these languages to analyze their data
Essentials for Managers (1 day)                          Provides decision-makers the information they need about Apache Hadoop: when is Hadoop appropriate, what are people using Hadoop for, and what do I need to know about choosing Hadoop
Cloudera Consulting Services
Put Our Expertise To Work For You

Service                            Description
Use Case Discovery                 Assess the appropriateness and value of Hadoop for your organization
New Hadoop Deployment              Set up and configure high-performance, production-ready Hadoop clusters
Proof of Concept                   Verify the prototype functionality and project feasibility for a new Hadoop cluster
Production Pilot                   Deploy your first production-level project using Hadoop
Process and Team Development       Define the requirements and processes for creating a new Hadoop team
Hadoop Deployment Certification    Perform periodic health checks to certify and tune up existing Hadoop clusters

Cloudera's team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.
Journey of the Cloudera Customer
1. Discover the benefits of Apache Hadoop: gain the flexibility to store and mine all types of data
   - Leverage the scale-out architecture for complex data analysis
   - Easily scale to meet growing data requirements
   - Avoid vendor lock-in with an open source technology

2. Deploy Cloudera's Distribution (CDH): the fastest, surest path to success with Apache Hadoop
   - A stable, reliable version of Apache Hadoop without the vendor lock-in imposed by proprietary vendors
   - Integrates with your other technology platforms, ensuring investment protection

3. Subscribe to Cloudera Enterprise: simplify and accelerate Apache Hadoop deployment
   - Reduce adoption costs and risks
   - More effectively manage cluster resources
   - Leverage the experience of our experts
Cloudera in Production
[Diagram: CDH & SCM Express at the core, surrounded by Cloudera Enterprise (Cloudera Management Suite, Cloudera Support) and Cloudera Services (Consulting Services, Cloudera University); data sources (logs, files, web data, relational databases) flow in; IDEs, BI/analytics tools, enterprise reporting, the enterprise data warehouse, operational rules engines and web applications connect out, serving operators, engineers, analysts, business users and customers]
Cloudera helps you profit from all your data.

cloudera.com
+1 (888) [email protected]
twitter.com/cloudera
facebook.com/cloudera
Get Hadoop
Cloudera Manager

The first and only Hadoop management application that:
1. Manages the full Hadoop lifecycle
2. Manages and monitors the complete Hadoop stack
3. Incorporates comprehensive log and event management
4. Has technical support integration built in
Cloudera Manager

Key Features and Functionality:
- Automated Deployment: installs the complete Hadoop stack in minutes; a simple, wizard-based interface guides you through the steps
- Centralized Management: gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
- Service & Configuration Management: set server roles, configure services and manage security across the cluster; gracefully start, stop and restart services as needed
- Audit Trails: maintains a complete record of configuration changes for SOX compliance
- Proactive Health Checks: monitors dozens of service performance metrics and alerts you when you approach critical thresholds
- Intelligent Log Management: gather, view and search Hadoop logs collected from across the cluster; scans Hadoop logs for irregularities and warns you before they impact the cluster
Key Features and Functionality (continued):
- Global Time Control: establishes the time context globally for almost all views; correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
- Support Integration: takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
- Event Management: creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities, and makes them available for alerting and searching
- Alerting: generates email alerts when certain events occur
- Operational Reports: visualize current and historical disk usage by user, group and directory; track MapReduce activity on the cluster by job or user
- Host-Level Monitoring: view information pertaining to hosts in your cluster, including status, resident memory, virtual memory and roles
Two Editions (** part of the Cloudera Enterprise subscription):
- Max number of nodes supported: 50 (Free Edition) / unlimited (Enterprise Edition**)
- Features compared across editions: Automated Deployment; Host-Level Monitoring; Secure Communication Between Server & Agents; Configuration Management; Manage HDFS, MapReduce, HBase, Hue, Oozie & ZooKeeper; Audit Trails; Start/Stop/Restart Services; Add/Restart/Decommission Role Instances; Configuration Versioning & History; Support for Kerberos; Service Monitoring; Proactive Health Checks; Status & Health Summary; Intelligent Log Management; Events Management & Alerts; Activity Monitoring; Operational Reporting; Global Time Control; Support Integration
View Service Health and Performance
Get Host-Level Snapshots
Monitor and Diagnose Cluster Workloads
Gather, View and Search Hadoop Logs
Track Events From Across the Cluster
Run Reports on System Performance & Usage
New in Cloudera Manager 3.7
1. Proactive Health Checks: monitors dozens of service performance metrics and alerts you when you approach critical thresholds
2. Intelligent Log Management: gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
3. Global Time Control: correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
4. Support Integration: takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
5. Event Management: creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities, and makes them available for alerting and searching
6. Alerts: generates email alerts when certain events occur
7. Audit Trails: maintains a complete record of configuration changes for SOX compliance
8. Operational Reporting: visualize current and historical disk usage by user, group and directory, and track MapReduce activity on the cluster by job or user
Cloudera Support
Our team of experts on call to help you meet your SLAs

Feature                                                 Benefit
Flexible Support Windows                                Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks                                    Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes               Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase                             Browse through hundreds of articles and tech notes to expand your knowledge of Apache Hadoop
Certified Connectors                                    Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics and MicroStrategy
Proactive Notification of New Developments and Events   Stay up to speed with what's going on in the Apache Hadoop community
Cloudera Enterprise
Why Cloudera Enterprise?
- Apache Hadoop is a distributed system that presents unique operational challenges
- The fixed cost of managing an internal patch and release infrastructure is prohibitive
- Apache Hadoop skills and expertise are scarce
- It's challenging to track consistently to community development efforts

Only Cloudera Enterprise:
- Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
- Has production support backed by the Apache committers
- Has the depth of experience that comes from supporting hundreds of production Apache Hadoop clusters

The fastest path to success running Apache Hadoop in production.
MapReduce: Distributed Processing
Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is similar to an RDBMS, which executes the queries, and SQL, which is the language for the queries.
- MapReduce can run on top of HDFS or a selection of other storage systems
- Intelligent scheduling algorithms handle locality, sharing and resource optimization
Thank you.