© 2016 MapR Technologies
Back to School – Hadoop Ecosystem Overview
Matt Miller, Solutions Engineer
September 2016
Roadmap
• What is Apache?
• Hadoop Timeline and Level Set
• Hadoop Suite of Tools
1. Hive
2. Sqoop
3. Pig
4. Oozie
5. HBase
6. Flume
7. Kafka
8. Drill
9. YARN
10. Zookeeper
• Use Cases
• Q&A
What is Apache?
• Non-profit organization (the Apache Software Foundation)
• Governs the development of open source “Projects”
• “Top Level” projects are the most prominent
• Features “committers” from all over the world
Hadoop Timeline
• 2003 – GFS white paper published
• 2004 – MapReduce white paper published
• 2006 – Hadoop is born: HDFS + MapReduce
• 2007–Present – Hadoop continuously evolves; new tools are released to improve usability and make it easier to adopt
• 2009 – Hadoop distributions start popping up
• 2016 – Organized chaos: new projects released every few months, and only the winners gain traction
What is Hadoop?
Distributed File System (HDFS) + Processing Engine (MapReduce)
What is MapReduce?
• Three-phase program built for distributed processing
– Map
– Shuffle/Sort
– Reduce
• Processing overhead associated with every MR job (~30 seconds)
• Heavy disk usage
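The three phases can be sketched in plain Python. This is a single-process toy word count, not the distributed engine: real MapReduce runs each phase across many nodes and spills intermediate data to disk.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Shuffle/Sort: group all emitted values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups}

lines = ["hadoop stores data", "hadoop processes data"]
result = reduce_phase(shuffle_sort(map_phase(lines)))
print(result)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

The shuffle is where the heavy disk usage on the slide comes from: in a real cluster, every map output is written out, sorted, and fetched over the network before any reducer can start.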
1.) Hive
• First SQL on Hadoop – HiveQL is the language
• Hadoop data warehousing tool
• Converts HiveQL into MapReduce jobs
• Bash, Java, and Python scripts can execute Hive commands
• Not ANSI SQL compliant, but VERY similar

Use Hive for long-running jobs -- not ad-hoc queries
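To see why Hive turns into MapReduce, consider what a GROUP BY compiles down to. A toy Python sketch (the table and column names here are made up for illustration): the map side emits (region, amount) pairs, and the reduce side sums per key.

```python
from collections import defaultdict

# Toy sketch of what Hive generates for a query like:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
]

def group_by_sum(rows, key, value):
    totals = defaultdict(int)
    for row in rows:                    # "map": emit (key, value) pairs
        totals[row[key]] += row[value]  # "reduce": sum per key
    return dict(totals)

print(group_by_sum(sales, "region", "amount"))  # {'east': 150, 'west': 250}
```

Every HiveQL statement pays the MR job-startup overhead mentioned earlier, which is exactly why the slide steers ad-hoc queries elsewhere.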
2.) Sqoop
• RDBMS connector for Hadoop
• Execute Sqoop scripts via the command line
• Sqoop can move schemas, tables, or the results of a SELECT statement
• Helps improve ETL or enable data warehouse offload

Use Sqoop anytime data needs to move to/from an RDBMS
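Sqoop itself runs from the command line against a real RDBMS, so here is only the pattern it automates, as a toy in-memory sqlite3 analogue: read rows over a database connection, write them out as delimited records (Sqoop does this in parallel over JDBC and lands the files in HDFS).

```python
import sqlite3

# Stand-in for the source RDBMS; table and columns are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, "east", 100.0), (2, "west", 250.0)])

def export_table(conn, table):
    """Toy 'sqoop import': return the table's rows as comma-delimited lines."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [",".join(str(col) for col in row) for row in rows]

print(export_table(conn, "sales"))  # ['1,east,100.0', '2,west,250.0']
```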
3.) Pig
• High-level coding language for processing data
• Language used to express data flows is called Pig Latin
• Pig turns data flows into a series of MR jobs
• Can run in a single JVM or on a Hadoop cluster
• User Defined Functions (UDFs) make Pig code easy to repurpose

Use Pig to speed up the development process
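A Pig Latin data flow is just a chain of relational steps. A toy Python equivalent of a LOAD → FILTER → GROUP → COUNT flow (the field names and the Pig script in the comments are invented for illustration):

```python
from collections import Counter

# Rough Pig Latin equivalent of the flow below:
#   logs   = LOAD 'logs' AS (user, status);
#   errors = FILTER logs BY status >= 500;
#   counts = FOREACH (GROUP errors BY user) GENERATE group, COUNT(errors);
logs = [("alice", 200), ("bob", 500), ("alice", 503), ("bob", 502)]

errors = [(user, status) for user, status in logs if status >= 500]  # FILTER
counts = Counter(user for user, _ in errors)                         # GROUP + COUNT

print(dict(counts))  # {'bob': 2, 'alice': 1}
```

Pig compiles each such flow into one or more MR jobs, so you write the pipeline once instead of hand-coding every map and reduce.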
4.) Oozie
• Workflow orchestration
• Schedule tasks to be completed based on time or completion of a previous task
• Used for automation
• Develop these workflows either in a GUI or in XML
– Hint: the GUI is much much MUCH simpler

Use Oozie when you need workflows
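The core idea, run an action only after the actions it depends on have finished, can be sketched as a tiny dependency resolver. The workflow below is hypothetical; real Oozie workflows are defined in XML and also support time-based (coordinator) triggers.

```python
# action -> list of prerequisite actions (assumed acyclic)
workflow = {
    "ingest": [],
    "transform": ["ingest"],
    "report": ["transform"],
    "archive": ["transform"],
}

def run_order(workflow):
    """Return an execution order that respects all dependencies."""
    done, order = set(), []
    while len(done) < len(workflow):
        for action, deps in workflow.items():
            if action not in done and all(d in done for d in deps):
                order.append(action)
                done.add(action)
    return order

order = run_order(workflow)
print(order)  # 'ingest' first, then 'transform', then the two leaves
```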
5.) HBase
• Database built on HDFS
• Meant for big and fast data
• HBase is a NoSQL database
– Multiple types of NoSQL databases: wide-column stores, document DBs, graph DBs, key-value stores
– HBase is a wide-column store

Use HBase when “real-time read/write access to very large datasets” is required
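The wide-column model can be pictured as a map of row key → {"family:qualifier": value}. A toy in-memory sketch (row and column names invented; real HBase adds cell versions, region servers, and persistence on HDFS):

```python
table = {}  # row key -> {"family:qualifier": value}

def put(row, column, value):
    """Write one cell."""
    table.setdefault(row, {})[column] = value

def get(row, column=None):
    """Fetch a whole row, or one cell -- the fast point lookup HBase is for."""
    return table.get(row, {}) if column is None else table[row][column]

put("player#42", "stats:wins", 17)
put("player#42", "stats:losses", 3)
put("player#99", "stats:wins", 5)

print(get("player#42", "stats:wins"))  # 17
print(get("player#42"))                # {'stats:wins': 17, 'stats:losses': 3}
```

Because rows are keyed and sparse, different rows can have entirely different columns, which is what "wide-column" means in practice.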
6.) Flume
• Meant for ingesting streams of data
• Runs on the same cluster and stores data in HDFS
– Also flexible enough to stream into HBase or Solr
• Flume PUSHES data to its destination
• Flume does NOT store data within itself

Use Flume when basic streaming is required
7.) Kafka
• …Also meant for ingesting streams of data
• Runs on its own cluster
• Kafka does not PUSH data to other places
– Other places pull from Kafka
• Kafka streams in the data, then PUBLISHES it on its cluster; multiple users can SUBSCRIBE to that data and each get their own copy

Use Kafka for advanced streaming
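The contrast with Flume's push model is the retained log plus per-consumer offsets: the broker keeps the stream, and each subscriber pulls from its own position, so everyone gets a full copy. A toy sketch of that mechanic (consumer names invented):

```python
log = []      # the topic's append-only log, retained by the "broker"
offsets = {}  # consumer name -> next position to read

def publish(event):
    log.append(event)

def poll(consumer):
    """Pull everything this consumer has not seen yet."""
    pos = offsets.get(consumer, 0)
    events = log[pos:]
    offsets[consumer] = len(log)
    return events

publish("score:100")
publish("score:250")

print(poll("analytics"))  # ['score:100', 'score:250']
print(poll("dashboard"))  # ['score:100', 'score:250'] -- its own full copy
print(poll("analytics"))  # [] -- already caught up to its offset
```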
8.) Drill
• Flexible SQL tool
• Works with many data formats and storage platforms
• Does not require transformations to the data
• For ad-hoc analytics and performant queries on LARGE data sets
• Scales to thousands of nodes

Use Drill for data exploration and performant SQL
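"Does not require transformations" means schema-on-read: you query raw records as they sit, and missing fields are simply absent rather than schema errors. A toy illustration of that idea over JSON (the records are invented; Drill does this at scale with real SQL over files, HBase, and more):

```python
import json

raw = '''[{"name": "alice", "city": "NYC", "age": 34},
          {"name": "bob", "age": 29},
          {"name": "carol", "city": "SF"}]'''

records = json.loads(raw)

# Roughly: SELECT name FROM records WHERE city = 'NYC'
# No table definition or ETL happened first; bob's missing
# "city" field just fails the predicate instead of erroring.
names = [r["name"] for r in records if r.get("city") == "NYC"]
print(names)  # ['alice']
```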
9.) YARN
• Yet Another Resource Negotiator
• Helps you allocate resources (and enforce usage quotas) across multiple groups/users

10.) ZooKeeper
• Coordinates the distribution of jobs
• Handles partial failures
• Provides synchronization of jobs

Use YARN for multitenancy
ALWAYS use ZooKeeper with Hadoop
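YARN-style multitenancy boils down to per-queue quotas: each team's queue has a capacity, and resource requests that would exceed it are refused. A toy sketch (queue names and a single abstract resource unit are invented; real YARN schedules containers of memory and vcores):

```python
capacity = {"analytics": 40, "etl": 60}   # quota per queue, e.g. GB of memory
used = {queue: 0 for queue in capacity}

def allocate(queue, amount):
    """Grant the request only if the queue stays within its quota."""
    if used[queue] + amount > capacity[queue]:
        return False
    used[queue] += amount
    return True

print(allocate("analytics", 30))  # True
print(allocate("analytics", 20))  # False -- would exceed the 40 quota
print(allocate("etl", 50))        # True -- etl's quota is separate
```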
Use Case 1: Expensive RDBMS
• Organization has 5 TB of sales data in an RDBMS ($$$)
• Currently 50 reports being generated regularly
• Largest report takes ~24 hours to generate
• Team only knows SQL

Solution components: HDFS, Sqoop, Hive, Hive/Drill
Use Case 2: Customer 360 Data Lake/Hub
• 50 TB of customer data
• Data consists of everything from ERP data to JSON data from a REST API
• Four different business units need access to the data, and they each have performance requirements
• Basic users need ad-hoc query capabilities
• Weekly jobs need to be kicked off during off hours

Solution components: HDFS, Drill, YARN, Oozie
Use Case 3: Online Video Game Support
• Stats need to be updated milliseconds after the game finishes
• Player needs to be able to randomly look up other players’ stats in less than a second
• System can never go down or lose information
• Management wants to save this data so analytics can be run on these datasets

Solution components: Kafka/Flume, HBase, HDFS
Advice for those getting started…
• Don’t try to hire a big data team; build from within
– MOTIVATED Linux and SQL people are enough to get started
• Target legacy RDBMS and move ~80% to Hadoop
– Quick win
– Instant validation and justification if you can cut costs and improve speed at the same time
• Have fun
Additional Resources
• Full List of Hadoop Ecosystem
• Books:
– Hadoop: The Definitive Guide
– Hadoop Application Architectures
• Free Training:
– Coursera and edX (my favorite is a Python specialization series)
– learn.mapr.com (free courses from 100 level to 400 level)
Q & A

Engage with us!
@mapr · maprtech · MapR · mapr-technologies