deck from DM Nov 2012
Hadoop on Azure
Lynn Langit
Practitioner, Author, Instructor
Nov 2012 – DevelopMentor / London
Hadoop = BigData?
• HUGE Hype factor in 2011 / 2012
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
• enables applications to work with thousands of nodes and petabytes of data
• was inspired by Google's MapReduce and Google File System (GFS) papers
Oracle Loader for Hadoop
SQL Server Connector for Hadoop
Flavors of NoSQL
Column Database
Wide, sparse column sets
RDBMS vs. Hadoop

                      Traditional RDBMS          Hadoop
Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
Access                Interactive and Batch      Batch – NOT Interactive
Updates               Read / Write many times    Write once, Read many times
Structure             Static Schema              Dynamic Schema
Integrity             High (ACID)                Low
Scaling               Nonlinear                  Linear
Query Response Time   Can be near immediate      Has latency (due to batch processing)
What about the cloud?
The reality…two pivots
Storage Methods
• SQL (RDBMS)
• Hadoop
Storage Locations
• On premises
• Cloud-hosted
Demo - Setting up Your Cluster
Cluster Allocation Process
Working with Hadoop on Azure
Tools / Languages
• MapReduce
  – Map (query/format)
  – Reduce (aggregate)
  – plug-in for Eclipse (Java)
  – JavaScript
  – C# Streaming
• Pig (ETL -- Java)
• Hive (HQL Query)
• HBase tables
• Others
  – Mahout (analyze)
  – R (analyze)
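The Map (query/format) and Reduce (aggregate) roles listed above can be illustrated with a small in-memory sketch. It is written in plain Python rather than against the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, chosen only for this illustration. It mimics what the framework does on a cluster, without any distribution.

```python
from collections import defaultdict


def map_phase(line):
    """Map: format one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce calls run on many nodes in parallel; here they run in one process so the data flow is easy to follow.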
Tasks – DBA vs. Hadoop on Azure

RDBMS                            Hadoop on Azure
Import Data                      Upload Data using FTP or import via Sqoop
Set up Security                  Set up Security
Scale Compute (up or out)        Add child nodes to the cluster
Perform a Backup                 Monitor and replace failed nodes
Restore a Database               n/a
Clean up data via ETL            Execute a PIG job
Create an Index – query tune     Write a HIVE query (HQL)
Join Tables Together             Run MapReduce
n/a                              Monitor and manage running MapReduce jobs
Schedule a Job                   Schedule a (Cron) Job
Run Database Maintenance         Monitor space and resources used
Send an Email from SQL Server    Set up resource threshold alerts
Manage License costs             Manage usage time charges
Demo - Basic Administration
Open Ports, Interactive, Remote…
Demo - Basic Administration
Connect via RDP
NameNode Utility – Top Level
NameNode Utility – Drill Down
Demo - Basic Administration
Configuring Upload from Azure
Using the Azure Storage Viewer
Configuring Upload from MarketPlace
Asking Questions = MapReduce
Samples
More Samples
Demo - MapReduce using Java
• WordCount example
Demo - MapReduce using C# Streaming
• WordCount example
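Streaming jobs follow a simple contract: the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The sketch below shows that contract for WordCount. It is written in Python rather than C#, operates on lists of lines instead of real stdin/stdout, and its function names (`mapper`, `reducer`, `run_local`) are hypothetical; it is not the code used in the demo.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; assumes input lines are sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate the sort that happens between the mapper and the reducer."""
    return list(reducer(sorted(mapper(lines))))
```

Running the pair locally like this is a cheap way to check the logic before submitting a job to the cluster.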
Demo - MapReduce using JavaScript
• WordCount example
Demo - Using HIVE
• WordCount example
Demo - Using HIVE
Monitoring Job Results
• In the portal
  – Main Console
    • Job icon (button) status summary
    • Job History
  – Interactive Console
    • JS quick feedback
    • JS detailed feedback (log)
• Using RDP
  – Map/Reduce tool
Demo – Monitoring Job Status
Download – ODBC for HIVE
• Includes add-in for Excel
Demo - Hadoop Connector to Excel
Connecting to PowerPivot
• Create an ODBC connection to HIVE
• Connect to ‘other data source’ in PowerPivot
Case Study - Klout
Real-World – Hadoop and…
Facebook runs on Hadoop & MySQL
Twitter runs on Hadoop (ran on FlockDb/graph)
Yahoo runs on Hadoop
LinkedIn runs on Hadoop & Voldemort
Klout runs Hadoop (on Azure) & HBase (Hive) & SQL Server SSAS BISM cubes
Hadoop To-Do List
BigData = Hadoop
• Use Hadoop when business needs designate
• Use other NoSQL if a better fit
Hadoop on the cloud
• Quick and cheap
• Specialized use cases
• Behavioral data
• dev, test, training environments
Hadoop access technologies
• Learn Map/Reduce
• Use HIVE via Excel
• Pay attention to Impala
The Changing Data Landscape
Hadoop
RDBMS
Other Services
www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe Teach a Kid (Ages 10 ++)
• Java or Microsoft SmallBasic
Toward Data Craftsmanship…
Follow me @LynnLangit
RSS my blog www.LynnLangit.com
Hire me
• To help build your BI/Big Data solution
• To teach your team next gen BI
• To learn more about using NoSQL solutions