60 minute webcast for DevelopMentor - Hadoop on Azure
Hadoop on Azure
BigData on the Azure platform
@LynnLangit
Hadoop = BigData?
• HUGE Hype factor in 2011 / 2012
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
• enables applications to work with thousands of nodes and petabytes of data
• was inspired by Google's MapReduce and Google File System (GFS) papers
Oracle Loader for Hadoop
SQL Server Connector for Hadoop
Flavors of NoSQL
Column Database
Wide, sparse column sets
RDBMS vs. Hadoop
Attribute | Traditional RDBMS | Hadoop
Data Size | Gigabytes (Terabytes) | Petabytes (Exabytes)
Access | Interactive and Batch | Batch – NOT Interactive
Updates | Read / Write many times | Write once, Read many times
Structure | Static Schema | Dynamic Schema
Integrity | High (ACID) | Low
Scaling | Nonlinear | Linear
Query Response Time | Can be near immediate | Has latency (due to batch processing)
What about the cloud?
The reality…two pivots
Storage Methods
• SQL (RDBMS)
• Hadoop
Storage Locations
• On premises
• Cloud-hosted
Demo - Setting up Your Cluster
Cluster Allocation Process
Working with Hadoop on Azure
Tools / Languages
• MapReduce
  • Map (query/format)
  • Reduce (aggregate)
  • plug-in for Eclipse (Java)
  • JavaScript
  • C# Streaming
• Pig (ETL -- Java)
• Hive (HQL Query)
• HBase tables
• Others
  • Mahout (analyze)
  • R (analyze)
Tasks – DBA vs. Hadoop on Azure
RDBMS | Hadoop on Azure
Import Data | Upload Data using FTP or import via Sqoop
Set up Security | Set up Security
Scale Compute (up or out) | Add child nodes to the cluster
Perform a Backup | Monitor and replace failed nodes
Restore a Database | n/a
Clean up data via ETL | Execute a PIG job
Create an Index – query tune | Write a HIVE query (HQL)
Join Tables Together | Run MapReduce
n/a | Monitor and manage running MapReduce jobs
Schedule a Job | Schedule a (Cron) Job
Run Database Maintenance | Monitor space and resources used
Send an Email from SQL Server | Set up resource threshold alerts
Manage License costs | Manage usage time charges
Demo - Basic Administration
Open Ports
Demo - Basic Administration
Connect via RDP
NameNode Utility – Top Level
NameNode Utility – Drill Down
Demo - Basic Administration
Configure connections to remote storage
Configuring Upload from AWS S3
Configuring Upload from Azure
Using the Azure Storage Viewer
Configuring Upload from DataMarket
Asking Questions = MapReduce
Samples
Demo - MapReduce using Java
• WordCount example using AWS S3 data
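As a rough illustration of what a WordCount job computes, here is a minimal in-memory sketch of the map, shuffle, and reduce steps in plain Java. It is not the demo's code and uses no Hadoop classes; the class name WordCountSketch, its method names, and the sample input are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only: no Hadoop classes are used here.
class WordCountSketch {

    // Map phase: turn one line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group emitted values by key (the framework does this between phases).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: aggregate all values seen for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps in sequence over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not data from the demo.
        System.out.println(wordCount(List.of("Hadoop on Azure", "Hadoop on the cloud")));
    }
}
```

In a real cluster the map and reduce steps run on different nodes and the shuffle moves data between them; the sketch keeps everything in one process so the data flow is easy to follow.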
Demo - MapReduce using C# Streaming
• WordCount example
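Streaming jobs follow a simple text contract: the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The demo uses C#; the sketch below shows the same contract in Java, simulated in memory. The class name StreamingSketch and its methods are hypothetical, and the explicit sort stands in for what the framework does between the two programs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simulates the streaming text contract in one process; not the demo's C# code.
class StreamingSketch {

    // Mapper: a real streaming mapper reads stdin; here the lines are passed in.
    // Emits one "word<TAB>1" line per word.
    static List<String> mapper(List<String> inputLines) {
        List<String> out = new ArrayList<>();
        for (String line : inputLines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(word + "\t1");
                }
            }
        }
        return out;
    }

    // Reducer: expects lines sorted by key and sums each run of identical keys.
    static List<String> reducer(List<String> sortedLines) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        int sum = 0;
        for (String line : sortedLines) {
            String[] parts = line.split("\t", 2);
            String key = parts[0];
            int value = Integer.parseInt(parts[1]);
            if (currentKey != null && !key.equals(currentKey)) {
                out.add(currentKey + "\t" + sum); // key changed: flush the previous total
                sum = 0;
            }
            currentKey = key;
            sum += value;
        }
        if (currentKey != null) {
            out.add(currentKey + "\t" + sum); // flush the last key
        }
        return out;
    }

    static List<String> run(List<String> inputLines) {
        List<String> mapped = mapper(inputLines);
        Collections.sort(mapped); // assumption: stands in for the framework's sort by key
        return reducer(mapped);
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        for (String line : run(List.of("b a", "a"))) {
            System.out.println(line);
        }
    }
}
```

Because the reducer only compares each key with the previous one, it never needs to hold all keys in memory, which is why the sorted-input guarantee matters.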
Demo - MapReduce using JavaScript
• WordCount example
Demo - Using HIVE
• WordCount example
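For comparison with the MapReduce versions, a word count in Hive is typically a single query. The sketch below only builds the HQL text in Java and prints it; it does not connect to a cluster. The table name docs and column name line are hypothetical, and the query relies on Hive's built-in split and explode functions. It is one plausible way to express the count, not necessarily the query used in the demo.

```java
// Builds the text of a word-count query; running it against Hive is out of scope here.
class HiveWordCountSketch {

    // table and column are caller-supplied names (hypothetical in this example).
    static String wordCountQuery(String table, String column) {
        return String.join("\n",
            "SELECT word, COUNT(*) AS cnt",
            "FROM (",
            // split() turns each line into an array of words; explode() yields one row per word
            "  SELECT explode(split(" + column + ", ' ')) AS word",
            "  FROM " + table,
            ") w",
            "GROUP BY word",
            "ORDER BY cnt DESC");
    }

    public static void main(String[] args) {
        System.out.println(wordCountQuery("docs", "line"));
    }
}
```

The inner query plays the role of the map step (one row per word) and GROUP BY with COUNT(*) plays the role of the reduce step.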
Demo - Using HIVE
Monitoring Job Results
• In the portal
  – Main Console
    • Job icon (button) status summary
    • Job History
  – Interactive Console
    • JS quick feedback
    • JS detailed feedback (log)
• Using RDP
  – Map/Reduce tool
Demo – Monitoring Job Status
Download – ODBC for HIVE
• Includes add-in for Excel
Demo - Hadoop Connector to Excel
Connecting to PowerPivot
• Create an ODBC connection to HIVE
• Connect to ‘other data source’ in PowerPivot
Real-World – Hadoop and…
Facebook runs on Hadoop & MySQL
Twitter runs on Hadoop (ran on FlockDB/graph)
Yahoo runs on Hadoop
LinkedIn runs on Hadoop & Voldemort
Klout runs Hadoop (on Azure) & HBase (Hive) & SQL Server SSAS BISM cubes
Hadoop To-Do List
BigData = Hadoop
• Use Hadoop when business needs dictate
Hadoop on the cloud
• Quick and cheap
• Specialized use cases
• Behavioral data
• dev, test, training environments
Hadoop access technologies
• Learn Map/Reduce
• Use HIVE via Excel
The Changing Data Landscape
[Diagram: Hadoop, RDBMS, Other Services]
TeachingKidsProgramming.org
Do a Recipe, Teach a Kid (Ages 10++)
SmallBasic or Java
Free Courseware (recipes)
Toward Data Craftsmanship…
Follow me @LynnLangit
RSS my blog www.LynnLangit.com
Hire me
• To help build your BI/Big Data solution
• To teach your team next gen BI
• To learn more about using NoSQL solutions