© 2016 MapR Technologies
Back to School – Hadoop Ecosystem Overview
Matt Miller, Solutions Engineer
September 2016
Roadmap
• What is Apache?
• Hadoop Timeline and Level Set
• Hadoop Suite of Tools
1. Hive
2. Sqoop
3. Pig
4. Oozie
5. HBase
6. Flume
7. Kafka
8. Drill
9. YARN
10. Zookeeper
• Use Cases
• Q&A
What is Apache?
• Non-profit organization (the Apache Software Foundation)
• Governs the development of open source “Projects”
• “Top Level” projects are the most prominent
• Features “committers” from all over the world
Hadoop Timeline
• 2003 – GFS white paper published
• 2004 – MapReduce white paper published
• 2006 – Hadoop is born: HDFS + MapReduce
• 2007–Present – Hadoop continuously evolves; new tools are released to improve usability and make it easier to adopt
• 2009 – Hadoop distributions start popping up
• 2016 – Organized chaos: new projects released every few months, and only the winners gain traction
What is Hadoop?
Distributed File System (HDFS) + Processing Engine (MapReduce)
What is MapReduce?
• Three-phase program built for distributed processing
– Map
– Shuffle/Sort
– Reduce
• Processing overhead associated with every MR job (~30 seconds)
• Heavy disk usage
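The three phases can be sketched in plain Python. This is a single-process toy word count, not the distributed engine: real MapReduce runs each phase across many nodes and spills intermediate data to disk.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Shuffle/Sort: group all emitted values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups}

lines = ["hadoop stores data", "hadoop processes data"]
result = reduce_phase(shuffle_sort(map_phase(lines)))
print(result)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

The shuffle is where the heavy disk usage on the slide comes from: in a real cluster, every map output is written out, sorted, and fetched over the network before any reducer can start.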
1.) Hive
• First SQL on Hadoop – HiveQL is the language
• Hadoop data warehousing tool
• Converts HiveQL into MapReduce jobs
• Bash, Java, and Python scripts can execute Hive commands
• Not ANSI SQL compliant, but VERY similar

Use Hive for long-running jobs -- not ad-hoc queries
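To see why Hive turns into MapReduce, consider what a GROUP BY compiles down to. A toy Python sketch (the table and column names here are made up for illustration): the map side emits (region, amount) pairs, and the reduce side sums per key.

```python
from collections import defaultdict

# Toy sketch of what Hive generates for a query like:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
]

def group_by_sum(rows, key, value):
    totals = defaultdict(int)
    for row in rows:                    # "map": emit (key, value) pairs
        totals[row[key]] += row[value]  # "reduce": sum per key
    return dict(totals)

print(group_by_sum(sales, "region", "amount"))  # {'east': 150, 'west': 250}
```

Every HiveQL statement pays the MR job-startup overhead mentioned earlier, which is exactly why the slide steers ad-hoc queries elsewhere.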
2.) Sqoop
• RDBMS connector for Hadoop
• Execute Sqoop scripts via the command line
• Sqoop can move schemas, tables, or the results of a SELECT statement
• Helps improve ETL or enable data warehouse offload

Use Sqoop anytime data needs to move to/from an RDBMS
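Sqoop itself runs from the command line against a real RDBMS, so here is only the pattern it automates, as a toy in-memory sqlite3 analogue: read rows over a database connection, write them out as delimited records (Sqoop does this in parallel over JDBC and lands the files in HDFS).

```python
import sqlite3

# Stand-in for the source RDBMS; table and columns are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, "east", 100.0), (2, "west", 250.0)])

def export_table(conn, table):
    """Toy 'sqoop import': return the table's rows as comma-delimited lines."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [",".join(str(col) for col in row) for row in rows]

print(export_table(conn, "sales"))  # ['1,east,100.0', '2,west,250.0']
```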
3.) Pig
• High-level coding language for processing data
• Language used to express data flows is called Pig Latin
• Pig turns data flows into a series of MR jobs
• Can run in a single JVM or on a Hadoop cluster
• User Defined Functions (UDFs) make Pig code easy to repurpose

Use Pig to speed up the development process
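A Pig Latin data flow is just a chain of relational steps. A toy Python equivalent of a LOAD → FILTER → GROUP → COUNT flow (the field names and the Pig script in the comments are invented for illustration):

```python
from collections import Counter

# Rough Pig Latin equivalent of the flow below:
#   logs   = LOAD 'logs' AS (user, status);
#   errors = FILTER logs BY status >= 500;
#   counts = FOREACH (GROUP errors BY user) GENERATE group, COUNT(errors);
logs = [("alice", 200), ("bob", 500), ("alice", 503), ("bob", 502)]

errors = [(user, status) for user, status in logs if status >= 500]  # FILTER
counts = Counter(user for user, _ in errors)                         # GROUP + COUNT

print(dict(counts))  # {'bob': 2, 'alice': 1}
```

Pig compiles each such flow into one or more MR jobs, so you write the pipeline once instead of hand-coding every map and reduce.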
4.) Oozie
• Workflow orchestration
• Schedule tasks to be completed based on time or completion of a previous task
• Used for automation
• Develop these workflows either in a GUI or in XML
– Hint: the GUI is much much MUCH simpler

Use Oozie when you need workflows
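The core idea, run an action only after the actions it depends on have finished, can be sketched as a tiny dependency resolver. The workflow below is hypothetical; real Oozie workflows are defined in XML and also support time-based (coordinator) triggers.

```python
# action -> list of prerequisite actions (assumed acyclic)
workflow = {
    "ingest": [],
    "transform": ["ingest"],
    "report": ["transform"],
    "archive": ["transform"],
}

def run_order(workflow):
    """Return an execution order that respects all dependencies."""
    done, order = set(), []
    while len(done) < len(workflow):
        for action, deps in workflow.items():
            if action not in done and all(d in done for d in deps):
                order.append(action)
                done.add(action)
    return order

order = run_order(workflow)
print(order)  # 'ingest' first, then 'transform', then the two leaves
```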
5.) HBase
• Database built on HDFS
• Meant for big and fast data
• HBase is a NoSQL database
– Multiple types of NoSQL databases: wide-column stores, document DBs, graph DBs, key-value stores
– HBase is a wide-column store

Use HBase when “real-time read/write access to very large datasets” is required
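The wide-column model can be pictured as a map of row key → {"family:qualifier": value}. A toy in-memory sketch (row and column names invented; real HBase adds cell versions, region servers, and persistence on HDFS):

```python
table = {}  # row key -> {"family:qualifier": value}

def put(row, column, value):
    """Write one cell."""
    table.setdefault(row, {})[column] = value

def get(row, column=None):
    """Fetch a whole row, or one cell -- the fast point lookup HBase is for."""
    return table.get(row, {}) if column is None else table[row][column]

put("player#42", "stats:wins", 17)
put("player#42", "stats:losses", 3)
put("player#99", "stats:wins", 5)

print(get("player#42", "stats:wins"))  # 17
print(get("player#42"))                # {'stats:wins': 17, 'stats:losses': 3}
```

Because rows are keyed and sparse, different rows can have entirely different columns, which is what "wide-column" means in practice.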
6.) Flume
• Meant for ingesting streams of data
• Runs on the same cluster and stores data in HDFS
– Also flexible enough to stream into HBase or Solr
• Flume PUSHES data to its destination
• Flume does NOT store data within itself

Use Flume when basic streaming is required
7.) Kafka
• …Also meant for ingesting streams of data
• Runs on its own cluster
• Kafka does not PUSH data to other places
– Other places pull from Kafka
• Kafka streams in the data, then PUBLISHES it on its cluster; multiple users can SUBSCRIBE to that data and each get their own copy

Use Kafka for advanced streaming
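The contrast with Flume's push model is the retained log plus per-consumer offsets: the broker keeps the stream, and each subscriber pulls from its own position, so everyone gets a full copy. A toy sketch of that mechanic (consumer names invented):

```python
log = []      # the topic's append-only log, retained by the "broker"
offsets = {}  # consumer name -> next position to read

def publish(event):
    log.append(event)

def poll(consumer):
    """Pull everything this consumer has not seen yet."""
    pos = offsets.get(consumer, 0)
    events = log[pos:]
    offsets[consumer] = len(log)
    return events

publish("score:100")
publish("score:250")

print(poll("analytics"))  # ['score:100', 'score:250']
print(poll("dashboard"))  # ['score:100', 'score:250'] -- its own full copy
print(poll("analytics"))  # [] -- already caught up to its offset
```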
8.) Drill
• Flexible SQL tool
• Works with many data formats and storage platforms
• Does not require transformations to the data
• For ad-hoc analytics and performant queries on LARGE data sets
• Scales to thousands of nodes

Use Drill for data exploration and performant SQL
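"Does not require transformations" means schema-on-read: you query raw records as they sit, and missing fields are simply absent rather than schema errors. A toy illustration of that idea over JSON (the records are invented; Drill does this at scale with real SQL over files, HBase, and more):

```python
import json

raw = '''[{"name": "alice", "city": "NYC", "age": 34},
          {"name": "bob", "age": 29},
          {"name": "carol", "city": "SF"}]'''

records = json.loads(raw)

# Roughly: SELECT name FROM records WHERE city = 'NYC'
# No table definition or ETL happened first; bob's missing
# "city" field just fails the predicate instead of erroring.
names = [r["name"] for r in records if r.get("city") == "NYC"]
print(names)  # ['alice']
```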
9.) YARN
• Yet Another Resource Negotiator
• Helps you allocate resources (and enforce usage quotas) across multiple groups/users

10.) ZooKeeper
• Coordinates the distribution of jobs
• Handles partial failures
• Provides synchronization of jobs

Use YARN for multitenancy
ALWAYS use ZooKeeper with Hadoop
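YARN-style multitenancy boils down to per-queue quotas: each team's queue has a capacity, and resource requests that would exceed it are refused. A toy sketch (queue names and a single abstract resource unit are invented; real YARN schedules containers of memory and vcores):

```python
capacity = {"analytics": 40, "etl": 60}   # quota per queue, e.g. GB of memory
used = {queue: 0 for queue in capacity}

def allocate(queue, amount):
    """Grant the request only if the queue stays within its quota."""
    if used[queue] + amount > capacity[queue]:
        return False
    used[queue] += amount
    return True

print(allocate("analytics", 30))  # True
print(allocate("analytics", 20))  # False -- would exceed the 40 quota
print(allocate("etl", 50))        # True -- etl's quota is separate
```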
Use Case 1: Expensive RDBMS
• Organization has 5 TB of sales data in an RDBMS ($$$)
• Currently 50 reports being generated regularly
• Largest report takes ~24 hours to generate
• Team only knows SQL

Solution components: HDFS, Sqoop, Hive, Hive/Drill
Use Case 2: Customer 360 Data Lake/Hub
• 50 TB of customer data
• Data consists of everything from ERP data to JSON data from a REST API
• Four different business units need access to the data, and they each have performance requirements
• Basic users need ad-hoc query capabilities
• Weekly jobs need to be kicked off during off hours

Solution components: HDFS, Drill, YARN, Oozie
Use Case 3: Online Video Game Support
• Stats need to be updated milliseconds after the game finishes
• Player needs to be able to randomly look up other players’ stats in less than a second
• System can never go down or lose information
• Management wants to save this data so analytics can be run on these datasets

Solution components: Kafka/Flume, HBase, HDFS
Advice for those getting started…
• Don’t try to hire a big data team; build from within
– MOTIVATED Linux and SQL people are enough to get started
• Target legacy RDBMS and move ~80% to Hadoop
– Quick win
– Instant validation and justification if you can cut costs and improve speed at the same time
• Have fun
Additional Resources
• Full List of Hadoop Ecosystem
• Books:
– Hadoop: The Definitive Guide
– Hadoop Application Architectures
• Free Training:
– Coursera and edX (my favorite is a Python specialization series)
– learn.mapr.com (free courses from 100 level to 400 level)
Q & A

Engage with us!
@mapr · maprtech · MapR · mapr-technologies