© Hortonworks Inc. 2011 – 2015. All Rights Reserved. Hadoop Present – Open Enterprise Hadoop. Yifeng Jiang, Solutions Engineer, Hortonworks, Inc. July 26, 2015

Hadoop Present - Open Enterprise Hadoop

  1. Hadoop Present: Open Enterprise Hadoop. Yifeng Jiang, Solutions Engineer, Hortonworks, Inc. July 26, 2015
  2. Yifeng Jiang: Solutions Engineer @ Hortonworks Japan. HBase book author. Twitter: @uprush
  3. Agenda: Hadoop Core Updates; Data Access in Hadoop; Hadoop Security; Hadoop Management
  4. Hadoop Present: Enterprise Ready Hadoop
  5. Hadoop: Number of Issues Resolved; Number of Lines of Code Increased (http://ajisakaa.blogspot.jp)
  6. Open Leadership: Code Contributed in 2014 by Organization (Hortonworks)
  7. June 2011: Yahoo! Hadoop, 24. December 2014: 600, Hadoop. Apache project committers and PMC members:
       Project     Committers  PMC Members
       Hadoop      27          21
       Pig         5           5
       Hive        18          6
       Tez         16          15
       HBase       6           4
       Phoenix     4           4
       Accumulo    2           2
       Storm       3           2
       Slider      11          11
       Falcon      5           3
       Flume       1           1
       Sqoop       1           1
       Ambari      36          28
       Oozie       3           2
       Zookeeper   2           1
       Knox        13          3
       Ranger      11          n/a
       TOTAL       164         109
  8. Hortonworks Data Platform 2.2 Stack
  9. Hadoop Core. HDFS + YARN: Data Operating System
  10. HDFS: Scalable & Efficient Data Lake Storage
  11. HDFS: More Efficient Data Lake Storage. HDFS NFS Gateway: mount an HDFS path. Erasure Coding (under development): reduce storage overhead from 3x (triple replication) to 1.4x. Tiered Storage: a DataNode becomes a collection of tiered storages (DISK, SSD, RAM, ARCHIVAL).
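The 3x and 1.4x figures on slide 11 can be checked with simple arithmetic. The sketch below assumes a Reed-Solomon layout with 10 data blocks and 4 parity blocks, which is one scheme that yields 1.4x; the slide does not name the scheme, and the function names are illustrative.

```python
def erasure_coding_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per byte of user data for a (data, parity) erasure code."""
    if data_blocks <= 0 or parity_blocks < 0:
        raise ValueError("data_blocks must be positive and parity_blocks non-negative")
    return (data_blocks + parity_blocks) / data_blocks


def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data when every block is copied `replicas` times."""
    return float(replicas)


# Assumption: RS(10, 4). The slide only states the resulting 1.4x figure.
three_way = replication_overhead(3)
rs_10_4 = erasure_coding_overhead(10, 4)
print(f"3-way replication: {three_way:.1f}x, RS(10,4): {rs_10_4:.1f}x")
```

Under this assumption the raw capacity needed for the same data drops by a little more than half, while the layout can still lose any 4 of its 14 blocks.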
  12. Storage Growth Challenges. Storage needs of some clusters grow very fast: high volumes of data; more users and new use cases for Hadoop. The only way to grow storage is to add more nodes. [Chart: cluster storage and compute capacity vs. cluster storage utilization and compute utilization.]
  13. Archival Storage Scenario: Data Usage. Hot: less than 7 days old, with very high usage. Warm: less than 1 month old, used ~20 times per month. Cold: less than 3 months old, used 5 times per month. Frozen: 3 months to 7 years old, used approximately 2 times per year. [Chart (eBay): temperature of data in Hadoop. X axis: time (data age); Y axis: frequency of data usage (per month); regions: hot, warm and cold data.]
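The age bands on slide 13 translate directly into a small classifier. This is a minimal sketch, assuming a month is 30 days and that only data age (not access frequency) decides the band; the function name is illustrative, and the 7-year upper bound for frozen data is not enforced.

```python
def data_temperature(age_days: float) -> str:
    """Classify a dataset by age using the bands from the slide.

    Assumptions: 1 month = 30 days, 3 months = 90 days.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 7:
        return "hot"
    if age_days < 30:
        return "warm"
    if age_days < 90:
        return "cold"
    return "frozen"


for age in (1, 14, 60, 400):
    print(age, data_temperature(age))
```

A classifier like this could feed a job that moves datasets between storage tiers as they age.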
  14. Archival Storage for Cost Efficiency. Scale storage independently from compute. Archival storage tier: deploy storage-dense hardware nodes; utilize storage policies for datasets (Hot, Warm, Cold); achieve ~4x lower price point per GB. [Chart: cluster storage capacity and utilization vs. cluster compute capacity and utilization.]
  15. HDFS Storage Architecture - Before
  16. HDFS Storage Architecture - Now
  17. Storage Policy: SSD & Hot. DataSet A (e.g., HBase) uses the SSD policy: all replicas on SSD. DataSet B (others) uses the Hot policy: all replicas on DISK. [Diagram: an HDP cluster with i2.8x nodes (SSD) holding the replicas of A and d2.8x nodes (DISK) holding the replicas of B.]
  18. Storage Policy: Ambari HDFS Configuration Groups. Ambari configuration groups (I2, D2) give each DataNode group its own dfs.datanode.data.dir. I2 group: [SSD]/hadoop/hdfs/data1,[SSD]/hadoop/hdfs/data2, D2 group: [DISK]/hadoop/hdfs/data1,[DISK]/hadoop/hdfs/data2,
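The `[SSD]` and `[DISK]` prefixes on slide 18 tag each data directory with a storage type. As a rough illustration, the following sketch splits such a value into (storage type, path) pairs. It assumes untagged paths count as DISK and ignores empty entries from trailing commas; the parser is illustrative and is not HDFS's own code.

```python
import re
from typing import List, Tuple

# Matches an optional leading "[TYPE]" tag followed by the directory path.
_TAGGED = re.compile(r"^\[(?P<type>[A-Za-z_]+)\](?P<path>.+)$")


def parse_data_dirs(value: str) -> List[Tuple[str, str]]:
    """Split a dfs.datanode.data.dir value into (storage_type, path) pairs."""
    dirs = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas
        match = _TAGGED.match(entry)
        if match:
            dirs.append((match.group("type").upper(), match.group("path")))
        else:
            dirs.append(("DISK", entry))  # assumption: untagged means DISK
    return dirs


print(parse_data_dirs("[SSD]/hadoop/hdfs/data1,[SSD]/hadoop/hdfs/data2,"))
```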
  19. Storage Policy
      $ hdfs dfs -mkdir /hbase
      $ hdfs dfsadmin -setStoragePolicy /hbase ALL_SSD
      Set storage policy ALL_SSD on /hbase
      $ hdfs dfsadmin -getStoragePolicy /ssd
      The storage policy of /ssd:
      BlockStoragePolicy{ALL_SSD:12, storageTypes=[SSD], creationFallbacks=[DISK], replicationFallbacks=[DISK]}
      HBase on SSD (I2 nodes): /hbase set to ALL_SSD
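The `storageTypes` and `creationFallbacks` fields in the output on slide 19 suggest how replica placement degrades when SSD is unavailable. The sketch below is a simplified model of that idea, not the HDFS implementation: the class, the function, and the rule that the last listed storage type covers the remaining replicas are stated here as assumptions.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class StoragePolicy:
    name: str
    storage_types: List[str]        # preferred type per replica
    creation_fallbacks: List[str]   # tried in order when the preferred type is missing


# Values copied from the BlockStoragePolicy output on the slide.
ALL_SSD = StoragePolicy("ALL_SSD", ["SSD"], ["DISK"])


def choose_storage_types(policy: StoragePolicy, replication: int,
                         available: Set[str]) -> List[str]:
    """Pick a storage type for each replica of a newly created block (simplified)."""
    chosen = []
    for i in range(replication):
        # Assumption: the last listed type is reused for any remaining replicas.
        preferred = policy.storage_types[min(i, len(policy.storage_types) - 1)]
        if preferred in available:
            chosen.append(preferred)
            continue
        fallback = next((t for t in policy.creation_fallbacks if t in available), None)
        if fallback is None:
            raise RuntimeError(f"no usable storage type for replica {i}")
        chosen.append(fallback)
    return chosen


print(choose_storage_types(ALL_SSD, 3, {"SSD", "DISK"}))
print(choose_storage_types(ALL_SSD, 3, {"DISK"}))
```

In this model a cluster with no SSD volumes still accepts writes to /hbase, but the replicas land on DISK.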
  20. HDFS: Next Step. Erasure Coding GA (HDFS-7285). Ozone: an object store in HDFS (HDFS-7240).
  21. YARN: Extends Hadoop into a Data OS
  22. Recap: What's YARN. Cluster resource management. Resource sharing: Capacity Scheduler; fair sharing with pluggable queue policies (new). Isolation: memory, CPU; node labels (new). Workload types: batch, interactive, in-memory.
  23. Exclusive Node Labels Enable Isolated Partitions (HDP 2.2). Configure partitions: nodes carry labels. Exclusive labels enforce isolation. [Diagram: nodes labeled S run Storm for app S; app B stays off the labeled nodes.]
  24. Non-Exclusive Node Labels (YARN-3214, HDP 2.3). Configure non-exclusive labels. Schedule if free capacity. [Diagram: nodes labeled S run Spark for app S; app B is scheduled on them when they have free capacity.]
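The difference between the exclusive labels of slide 23 and the non-exclusive labels of slide 24 comes down to one placement rule. The sketch below is a toy model of that rule, not the YARN scheduler; the `Node` class and `can_place` function are hypothetical, and preemption of borrowed capacity is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: Optional[str]   # None means the node is unlabeled
    exclusive: bool        # only meaningful when label is set
    free_containers: int


def can_place(app_label: Optional[str], node: Node) -> bool:
    """Decide whether an app may get a container on this node (toy model)."""
    if node.free_containers <= 0:
        return False
    if app_label == node.label:
        return True  # the app asked for exactly this partition
    # An app with no label may borrow idle capacity, but only from a
    # non-exclusive partition.
    return app_label is None and node.label is not None and not node.exclusive


storm_node = Node(label="storm", exclusive=True, free_containers=4)
spark_node = Node(label="spark", exclusive=False, free_containers=4)
print(can_place(None, storm_node))  # app B on an exclusive partition
print(can_place(None, spark_node))  # app B on a non-exclusive partition
```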
  25. Working with Labels. Ambari YARN Guided Configuration: enable node labels. YARN CLI: create and assign labels. ResourceManager UI: view node labels in the cluster. Capacity Scheduler View: define workload management policy with labels.
      $ yarn rmadmin -addToClusterNodeLabels "spark(exclusive=false)"
      $ yarn cluster -list-node-labels
      $ yarn rmadmin -replaceLabelsOnNode node5=spark
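When the label commands on slide 25 are scripted, the parentheses in the label spec need shell quoting. This sketch builds the two `yarn rmadmin` invocations from the slide as argument lists and renders them as safely quoted shell strings. The helper names are illustrative, and nothing is executed against a cluster.

```python
import shlex
from typing import List


def add_label_cmd(label: str, exclusive: bool) -> List[str]:
    """Argument list for creating a cluster node label, as on the slide."""
    spec = f"{label}(exclusive={'true' if exclusive else 'false'})"
    return ["yarn", "rmadmin", "-addToClusterNodeLabels", spec]


def replace_labels_cmd(node: str, label: str) -> List[str]:
    """Argument list for assigning a label to a node, as on the slide."""
    return ["yarn", "rmadmin", "-replaceLabelsOnNode", f"{node}={label}"]


def to_shell(argv: List[str]) -> str:
    """Render an argument list as a shell command with safe quoting."""
    return " ".join(shlex.quote(arg) for arg in argv)


print(to_shell(add_label_cmd("spark", exclusive=False)))
print(to_shell(replace_labels_cmd("node5", "spark")))
```

Passing the argument lists directly to `subprocess.run` avoids the quoting problem entirely, because no shell parses the parentheses.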
  26. YARN: Next Step. Disk & network isolation. Isolation only: enforce equal sharing of disk and network I/O across containers running on a node. Currently in technical preview in HDP 2.3. Disk resource: local disk IOPS, not HDFS reads/writes. Network resource: outbound bandwidth only (Mbit/s). (YARN-2619, YARN-2140)
  27. Data Access Innovation: SQL, Spark, Stream Processing, Search
  28. Hive: Enterprise SQL at Hadoop Scale. Native transactions, delivered: Insert, Update, Delete. Performance, 100x faster: ORC File; Hive on Tez; Cost Based Optimizer; Vectorized SQL engine.
  29. Hive: Next Step. SQL enhancement: transactions (BEGIN, COMMIT, ROLLBACK); SQL 2011 Analytics. Performance: sub-second response (LLAP, HBase as metastore, etc.).
  30. Spark Features: HDP 2.3.x & Spark 1.3.1. Supported: Spark Core, MLlib, Spark on YARN, Kerberos, Ambari support. Tech Preview: SparkSQL*, Spark Streaming, DataFrame, Spark ML Pipeline API. Unsupported: GraphX, BlinkDB, Spark Standalone/Mesos.
  31. Spark and Hadoop: How Can We Do Better? Resource Management: YARN for multi-tenant, diverse workloads with predictable SLAs. Tiered Memory Storage: HDFS in-memory tier; external BlockStore for RDD cache. SparkSQL & Hive for SQL: interop with modern Metastore/HS2, optimized ORC support, advanced analytics (e.g. geospatial). Spark & NoSQL: deep integration with HBase via DataSources/Catalyst for predicate/aggregate pushdown. Connect the Dots, Algorithms to Use-Cases: higher-level ML abstractions (e.g. OneVsRest); validation, tuning, pipeline assembly; e.g. geospatial. [Diagram: YARN: Data Operating System, with Storage, Governance, Security, Operations, Resource Management.]
  32. Spark and Hadoop: How Can We Do Better? Ease of Use: Apache Zeppelin for interactive notebooks. Metadata & Governance: Apache Atlas for metadata and Apache Falcon support for Spark pipelines. Security & Operations: Apache Ranger managed authorization, and deployment/management via Apache Ambari. Deployable Anywhere: Linux, Windows, on-premises or cloud. Self-Service Spark in the Cloud: easy launch of data science clusters via Cloudbreak and Ambari for Azure, AWS, GCP, OpenStack, Docker. [Diagram: YARN: Data Operating System, with Storage, Governance, Security, Operations, Resource Management.]
  33. Platform Innovation for Data Access. An integrated, scalable platform for data access powered by HDP: limitless storage, deep analytics, real-time access.
  34. Security: End-to-End Security in Hadoop
  35. Hadoop Security Overview. Five Security Requirements: Authentication (Kerberos), Authorization, Audit, Encryption. HDP 2.3 security support: Ranger, HDFS.
  36. End-to-End Security: Fully Secure Flow. (1) Original request with user ID/password; (2) Knox authenticates user/password; (3) Knox gets service ticket (ST) for Hive; (4) Knox calls as proxy user; (5) Ranger AuthZ; (6) Hive creates MapReduce job using NN ST. Also shown: use Hive ST, submit query; Hive gets NameNode (NN) service ticket; client gets query result; Ranger syncs users/groups from LDAP. [Diagram: O/JDBC client, Apache Knox, HiveServer2, HDFS, Ranger, KDC and LDAP, connected over SSL and SASL.]
  37. Ranger: Central Security Administration. Table/column access control. Audit logging. Flexible definition. Control group/user permissions.
  38. Hadoop Management. Ambari: Hadoop for Everyone, 100% Open Source
  39. What's Apache Ambari? A 100% open source operational platform to provision, manage and monitor Hadoop clusters.
  40. Apache Ambari Mission. Easy operation at scale: large-scale cluster install, manage and monitor; efficient and scalable at scale. Easy to extend with community: innovate with community; integrate with enterprise software; accelerate new features and adoption. Centralized management for the whole Hadoop stack: access point for all Hadoop users, not just cluster management; ease of use.
  41. Ambari 2.1: HDP Stack High Availability (Ambari 2.0 vs. Ambari 2.1)
       Component                HDP Stack   Mode
       HDFS: NameNode           HDP 2.0+    Active/Standby
       YARN: ResourceManager    HDP 2.1+    Active/Standby
       HBase: HBaseMaster       HDP 2.1+    Multi-master
       Hive: HiveServer2        HDP 2.1+    Multi-instance
       Hive: Hive Metastore     HDP 2.1+    Multi-instance
       Hive: WebHCat Server     HDP 2.1+    Multi-instance
       Oozie: Oozie Server      HDP 2.1+    Multi-instance
       Storm: Nimbus Server     HDP 2.3     Multi-instance
       Ranger: AdminServer      HDP 2.3     Multi-instance
  42. Install Wizard
  43. Guided Configs for HDFS
  44. Guided Configs for YARN & MapReduce
  45. Enable Features in YARN
  46. Cluster Dashboard
  47. Service Dashboard
  48. Service Manage - HDFS
  49. Host Manage
  50. Monitor & Alert. Notifications: Email, SNMP, Script (new).
  51. User Views: HDFS File View. Files View: browse the HDFS file system.
  52. User Views: YARN CS, Tez. Capacity Scheduler View: browse + manage YARN queues. Tez View: view information related to Tez jobs that are executing on the cluster.
  53. User Views: Pig, Hive. Pig View: author and execute Pig scripts. Hive View: author, execute and debug Hive queries.
  54. Summary
  55. Open Enterprise Hadoop. Hadoop/YARN-powered data operating system: a 100% open source, multi-tenant data platform for any application, any data set, anywhere. Built on a centralized architecture of shared enterprise services: scalable tiered storage; resource and workload management; trusted data governance & metadata management; consistent operations; comprehensive security; developer APIs and tools. [Diagram: YARN: data operating system, with governance, security, operations and resource management; data access: batch, interactive, real-time; storage: commodity, appliance, cloud.]
  56. Thank you. Yifeng Jiang, Solutions Engineer, Hortonworks. @uprush