Hadoop past, present and future

Ever wonder what Hadoop might look like in 12 months, 24 months or longer? Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric that supports MapReduce and other application paradigms. As a result, Hadoop looks very different than it did 12 months ago. This talk will take you through some ideas for YARN itself and the many ways it is moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.


1. Hadoop: Past, Present and Future
  Chris Harris
  Email: charris@hortonworks.com
  Twitter: cj_harris5
  Hortonworks Inc. 2013

2. Past

3. A little history: it's 2005

4. A Brief History of Apache Hadoop
  [Timeline, 2004 to 2013, with markers at 2004, 2006, 2008, 2010, 2012 and 2013. Milestones shown: Apache Project Established; Yahoo! begins to operate at scale; Hortonworks Data Platform; Enterprise Hadoop.]
  • 2005: Yahoo! creates team under E14 to work on Hadoop

5. Key Hadoop Data Types
  1. Sentiment: Understand how your customers feel about your brand and products right now
  2. Clickstream: Capture and analyze website visitors' data trails and optimize your website
  3. Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines
  4. Geographic: Analyze location-based data to manage operations where they occur
  5. Server Logs: Research logs to diagnose process failures and prevent security breaches
  6. Unstructured (txt, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents

6. Hadoop is NOT
  • ESB
  • NoSQL
  • HPC
  • Relational
  • Real-time
  • The "Jack of all Trades"

7. Hadoop 1
  • Limited to up to 4,000 nodes per cluster
  • O(# of tasks in a cluster)
  • JobTracker bottleneck: resource management, job scheduling and monitoring
  • Only has one namespace for managing HDFS
  • Map and Reduce slots are static
  • Only job to run is MapReduce

8. Hadoop 1 - Basics
  [Diagram: MapReduce (Computation Framework) on top of HDFS (Storage Framework), with blocks A, B and C replicated across nodes.]

9. Hadoop 1 - Reading Files
  [Diagram: the Hadoop Client sends "read file" to the NameNode, which returns DNs, block ids, etc.; the client then reads blocks from the DataNodes. The SNameNode (fsimage/edit) checkpoints the NameNode. DN | TT nodes across Rack1, Rack2, Rack3 ... RackN send heartbeats and block reports.]

10. Hadoop 1 - Writing Files
  [Diagram: the Hadoop Client sends "request write" to the NameNode, which returns DNs, etc.; the client then writes blocks to DN | TT nodes across Rack1, Rack2, Rack3 ... RackN, with replication pipelining between DataNodes. The SNameNode (fsimage/edit) checkpoints the NameNode; DataNodes send block reports.]

11. Hadoop 1 - Running Jobs
  [Diagram: the Hadoop Client submits a job to the JobTracker, which deploys it to DN | TT nodes across Rack1, Rack2, Rack3 ... RackN. Stages shown: map, shuffle, reduce; output: part 0.]

12. Hadoop 1 - Security
  [Diagram: Users authenticate and authorize (authN/authZ) against LDAP/AD and the KDC. Service requests pass through a firewall from the Client Node/Spoke Server to the Hadoop Cluster using a block token and a delegate token. An Encryption Plugin is also shown.]
  * block token is for accessing data
  * delegate token is for running jobs

13. Hadoop 1 - APIs
  • org.apache.hadoop.mapreduce.Partitioner
  • org.apache.hadoop.mapreduce.Mapper
  • org.apache.hadoop.mapreduce.Reducer
  • org.apache.hadoop.mapreduce.Job

14. Present

15. Hadoop 2
  • Potentially up to 10,000 nodes per cluster
  • O(cluster size)
  • Supports multiple namespaces for managing HDFS
  • Efficient cluster utilization (YARN)
  • MRv1 backward and forward compatible
  • Any apps can integrate with Hadoop
  • Beyond Java

16. Hadoop 2 - Basics
  [Diagram; no text recovered.]

17. Hadoop 2 - Reading Files (w/ NN Federation)
  [Diagram: the Hadoop Client sends "read file" to one of the federated NameNodes (NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4), which returns DNs, block ids, etc.; the client then reads blocks from the DataNodes. Each NN has an SNameNode (fsimage/edit copy, checkpoint) or a Backup NN (fs sync, checkpoint). DN | NM nodes across Rack1, Rack2, Rack3 ... RackN register, heartbeat and send block reports. Block pools for ns1, ns2, ns3 and ns4 are labelled "dn1, dn2", "dn1, dn3", "dn4, dn5" and "dn4, dn5".]

18. Hadoop 2 - Writing Files
  [Diagram: the Hadoop Client sends "request write" to one of the federated NameNodes (NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4), which returns DNs, etc.; the client then writes blocks to DN | NM nodes across Rack1, Rack2, Rack3 ... RackN, with replication pipelining between DataNodes. Each NN has an SNameNode (fsimage/edit copy, checkpoint) or a Backup NN (fs sync, checkpoint). DataNodes send block reports.]

19. Hadoop 2 - Running Jobs
  [Diagram: Hadoop Client 1 and Hadoop Client 2 create and submit app1 and app2 to the ResourceManager, which contains the Scheduler (with queues) and the ASM. Labels: "negotiates Containers", "reports to ASM", "partitions Resources", "status report". NodeManagers across Rack1, Rack2 ... RackN host AM1 and AM2 together with containers C1.1 to C1.4 and C2.1 to C2.3.]

20. Hadoop 2 - Security
  [Diagram: JDBC clients, REST clients and browsers (HUE) reach the Hadoop Cluster through firewalls and a Knox Gateway Cluster in the DMZ. Also shown: KDC, LDAP/AD, Enterprise/Cloud SSO Provider, and native Hive/HBase encryption.]

21. Hadoop 2 - APIs
  • org.apache.hadoop.yarn.api.ApplicationClientProtocol
  • org.apache.hadoop.yarn.api.ApplicationMasterProtocol
  • org.apache.hadoop.yarn.api.ContainerManagementProtocol

22. Future

23. Apache Tez
  A New Hadoop Data Processing Framework

24. HDP: Enterprise Hadoop Distribution
  [Diagram: Hortonworks Data Platform (HDP). Layer labels: Operational Services, Data Services, Hadoop Core, Platform Services. Components shown: Ambari, Flume, Pig, Falcon*, Oozie, Sqoop, Hive & HCatalog, HBase, Load & Extract, NFS, WebHDFS, Knox*, MapReduce*, Tez*, YARN*, HDFS, Other.]
  • Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
  • The ONLY 100% open source and complete distribution
  • Enterprise grade, proven and tested at scale
  • Ecosystem endorsed to ensure interoperability

25. Tez (Speed)
  What is it?
  • A data processing framework as an alternative to MapReduce
  • A new incubation project in the ASF
  Who else is involved?
  • 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
  Why does it matter?
  • Widens the platform for Hadoop use cases
  • Crucial to improving the performance of low-latency applications
  • Core to the Stinger initiative
  • Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop

26. Moving Hadoop Beyond MapReduce
  • Low-level data-processing execution engine
  • Built on YARN
  • Enables pipelining of jobs
  • Removes task and job launch times
  • Does not write intermediate output to HDFS, so disk and network usage is much lighter
  • New base of MapReduce, Hive, Pig, Cascading, etc.
  • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline

27. Tez - Core Idea
  • Task with pluggable Input, Processor & Output
  [Diagram: Input + Processor + Output = Task (Tez Task).]
  • YARN ApplicationMaster to run DAG of Tez Tasks

28. Building Blocks for Tasks
  Each task is assembled from an Input, a Processor and an Output:
  • MapReduce "Map" (MapReduce "Map" Task): HDFS Input, Map Processor, Sorted Output
  • MapReduce "Reduce" (MapReduce "Reduce" Task): Shuffle Input, Reduce Processor, HDFS Output
  • Intermediate "Reduce" for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output
  • Special Pig/Hive "Map" (Tez Task): HDFS Input, Map Processor, Pipeline Sorter Output
  • Special Pig/Hive "Reduce" (Tez Task): Shuffle Skip-merge Input, Reduce Processor, Sorted Output
  • In-memory Map (Tez Task): HDFS Input, Map Processor, In-memory Sorted Output

29. Pig/Hive-MR versus Pig/Hive-Tez
    SELECT a.state, COUNT(*), AVERAGE(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state
  [Diagram: under Pig/Hive - MR the query runs as Job 1, Job 2 and Job 3, separated by I/O synchronization barriers; under Pig/Hive - Tez it runs as a single job.]

30. Tez on YARN: Going Beyond Batch
  • Tez Optimizes Execution: New runtime engine for more efficient data processing
  • Always-On Tez Service: Low latency processing for all Hadoop data processing

31. Apache Knox
  Secure Access to Hadoop

32. Knox Initiative
  Make Hadoop security simple. Themes: Simplify Security, Aggregate Access, Client Agility.
  • Simplify security for both users and operators.
  • Deliver unified and centralized access to the Hadoop cluster.
  • Provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation.
  • Make Hadoop feel like a single application to users.
  • Ensure service users are abstracted from where services are located and how services are configured & scaled.

33. Knox: Make Hadoop Security Simple
  [Diagram: Client to Knox Gateway over {REST}, Knox Gateway to Hadoop Cluster. The gateway performs Authentication & Verification against a User Store: KDC, AD, LDAP.]

34. Knox: Next Generation of Hadoop Security
  • All users see one end-point website
  • All online systems see one endpoint RESTful service
  • Consistency across all interfaces and capabilities
  • Firewalled cluster that no end users need to access
  • More IT-friendly. Enables: systems admins, DB admins, security admins, network admins
  [Diagram: end users and online apps + analytics tools reach the gateway; the Hadoop cluster sits between firewalls.]

35. Apache Falcon
  Data Lifecycle Management for Hadoop

36. Data Lifecycle on Hadoop is Challenging
  Data Management Needs: Data Processing, Replication, Retention, Scheduling, Reprocessing, Multi Cluster Management
  Tools: Oozie, Sqoop, Distcp, Flume, Map / Reduce, Hive and Pig Jobs
  • Problem: Patchwork of tools complicates data lifecycle management.
  • Result: Long development cycles and quality challenges.

37. Falcon: One-stop Shop for Data Lifecycle
  Apache Falcon provides (Data Management Needs): Data Processing, Replication, Retention, Scheduling, Reprocessing, Multi Cluster Management
  Apache Falcon orchestrates (Tools): Oozie, Sqoop, Distcp, Flume, Map / Reduce, Hive and Pig Jobs
  • Falcon provides a single interface to orchestrate data lifecycle.
  • Sophisticated DLM easily added to Hadoop applications.

38. Falcon At A Glance
  [Diagram: Data Processing Applications talk to the Falcon Data Lifecycle Management Service through Spec Files or REST APIs. Services: Data Import and Replication, Scheduling and Coordination, Data Lifecycle Policies, Multi-Cluster Management, SLA Management.]
  > Falcon provides the key services data processing applications need.
  > Complex data processing logic handled by Falcon instead of hard-coded in apps.
  > Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.

39. Falcon Core Capabilities
  Core Functionality
  • Pipeline processing
  • Replication
  • Retention
  • Late data handling
  Automates
  • Scheduling and retry
  • Recording audit, lineage and metrics
  Operations and Management
  • Monitoring, management, metering
  • Alerts and notifications
  • Multi Cluster Federation
  CLI and REST API

40. Falcon Example: Multi-Cluster Failover
  [Diagram: the Primary Hadoop Cluster holds Staged Data, Cleansed Data, Conformed Data and Presented Data, feeding BI and Analytics. Replication copies Staged Data and Presented Data to the Failover Hadoop Cluster.]
  > Falcon manages workflow, replication or both.
  > Enables business continuity without requiring full data reprocessing.
  > Failover clusters require less storage and CPU.

41. Falcon Example: Retention Policies
  • Staged Data: Retain 5 Years
  • Cleansed Data: Retain 3 Years
  • Conformed Data: Retain 3 Years
  • Presented Data: Retain Last Copy Only
  > Sophisticated retention policies expressed in one place.
  > Simplify data retention for audit, compliance, or for data re-processing.

42. Falcon Example: Late Data Handling
  [Diagram: Online Transaction Data (Pull via Sqoop) and Web Log Data (Push via FTP) land in the Staging Area and are merged into the Combined Dataset. Rule shown: wait up to 4 hours for FTP data to arrive.]
  > Processing waits until all data is available.
  > Developers don't write complex data handling rules within applications.

43. Multi Cluster Management with Prism
  > Prism is the part of Falcon that handles multi-cluster.
  > Key use cases: Replication and data processing that spans clusters.

44. Hortonworks Sandbox
  Go from Zero to Big Data in 15 minutes

45. Sandbox: A Guided Tour of HDP
  • Tutorials and videos give a guided tour of HDP and Hadoop
  • Perfect for beginners or anyone learning more about Hadoop
  • Installs easily on your laptop or desktop
  • Easy-to-use editors for Apache Pig and Hive
  • Easily import data and create tables
  • Browse and manage HDFS files
  • Latest tutorials pushed directly to your Sandbox

46. THANK YOU!
  Chris Harris, charris@hortonworks.com
  Download Sandbox: hortonworks.com/sandbox
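
The map, shuffle and reduce stages named on slide 11 can be illustrated without a cluster. The following is a minimal single-process sketch in Python, not the Hadoop Mapper/Reducer API listed on slide 13; the function names and the word-count task are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit (word, 1) pairs, as a mapper would for each input record."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Group values by key, standing in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three stages in the order a MapReduce job runs them."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

In a real cluster each stage runs on many nodes and the shuffle moves data across the network; here all three stages share one process, which only shows the data flow.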
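
The core idea on slide 27, a task built from a pluggable Input, Processor and Output, with tasks arranged in a DAG, can be sketched as follows. This is a toy model in Python under stated assumptions: `Task`, `run_dag` and the sample data are hypothetical names, none of this is the Tez API, and results are handed between tasks in memory purely to mirror the "does not write intermediate output to HDFS" point on slide 26.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List


@dataclass
class Task:
    """A task assembled from three pluggable parts (hypothetical model)."""
    name: str
    read_input: Callable[[Dict[str, Any]], Any]  # Input: sees upstream results
    process: Callable[[Any], Any]                # Processor
    write_output: Callable[[Any], Any]           # Output
    upstream: List[str] = field(default_factory=list)


def run_dag(tasks):
    """Run tasks in dependency order, passing results along in memory."""
    by_name = {task.name: task for task in tasks}
    graph = {task.name: set(task.upstream) for task in tasks}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        task = by_name[name]
        upstream_results = {u: results[u] for u in task.upstream}
        data = task.read_input(upstream_results)
        results[name] = task.write_output(task.process(data))
    return results


def average_by_state(pairs):
    """Processor for the aggregate task: mean price per state."""
    totals = {}
    for state, price in pairs:
        count, total = totals.get(state, (0, 0.0))
        totals[state] = (count + 1, total + price)
    return {state: total / count for state, (count, total) in totals.items()}


# Made-up sample rows of (state, price), loosely echoing the query on slide 29.
rows = [("CA", 10.0), ("NY", 4.0), ("CA", 20.0)]

scan = Task("scan",
            read_input=lambda upstream: rows,
            process=lambda data: [r for r in data if r[1] > 0],
            write_output=sorted)
aggregate = Task("aggregate",
                 read_input=lambda upstream: upstream["scan"],
                 process=average_by_state,
                 write_output=dict,
                 upstream=["scan"])

results = run_dag([aggregate, scan])
```

Swapping `write_output=sorted` for another callable changes how a task hands off its data without touching the processor, which is the kind of recombination slide 28 describes for the Pig/Hive variants.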
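
The retention rules on slide 41 amount to a small lookup table plus a date comparison. The sketch below expresses them in Python as an illustration only: it is not Falcon's specification format, `RETENTION` and `instances_to_keep` are hypothetical names, and a year is approximated as 365 days.

```python
from datetime import date, timedelta

# Retention per dataset, taken from slide 41. None stands for
# "retain last copy only". A year is approximated as 365 days (assumption).
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,
}


def instances_to_keep(dataset, instance_dates, today):
    """Return the dataset instances (by date) that the policy retains."""
    limit = RETENTION[dataset]
    if limit is None:
        # Keep only the most recent copy, if there is one.
        return [max(instance_dates)] if instance_dates else []
    return sorted(d for d in instance_dates if today - d <= limit)
```

For example, with instances dated 2007-01-01, 2010-06-01 and 2013-11-30 and a current date of 2013-12-01, the staged policy keeps the last two, while the cleansed and presented policies keep only the newest.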