Big Data in the Microsoft Platform


Page 1: Big Data in the Microsoft Platform
Page 2: Big Data in the Microsoft Platform

Building Big Data Solutions in the Microsoft Platform

Jesus Rodriguez
Tellago, Inc., Tellago Studios

Page 3: Big Data in the Microsoft Platform

Big Data?

Page 4: Big Data in the Microsoft Platform

About Me…
• Hackerpreneur
• Co-Founder Tellago, Tellago Studios, Inc.
• Microsoft Architect Advisor
• Microsoft MVP
• Oracle ACE
• Speaker, Author
• http://weblogs.asp.net/gsusx
• http://jrodthoughts.com
• http://moesion.com

Page 5: Big Data in the Microsoft Platform

Agenda
• Big Data Overview
• MS HDInsight
  – Map Reduce
  – HDFS
  – Hive
  – Pig
  – Sqoop
• HDInsight Service
• The Hadoop Ecosystem
• The Future…

Page 6: Big Data in the Microsoft Platform

Big Data?

• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliché?

Page 7: Big Data in the Microsoft Platform

A Clue?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day

Page 8: Big Data in the Microsoft Platform

We Love Data!

Page 9: Big Data in the Microsoft Platform

But...

Page 10: Big Data in the Microsoft Platform

Processing Large Amounts of Data is Complicated....

Page 11: Big Data in the Microsoft Platform

Successful Big Data = Scalable Computing + Large Storage

Page 12: Big Data in the Microsoft Platform

A Trivial Model

Page 13: Big Data in the Microsoft Platform

Not So Fast....

Page 14: Big Data in the Microsoft Platform

Parallel Data Computing is Complicated

Page 15: Big Data in the Microsoft Platform

So Is Large Data Storage

Page 16: Big Data in the Microsoft Platform

Enter the World of Hadoop...

Page 17: Big Data in the Microsoft Platform

Hadoop Design Principles
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Shall Move to Data
• Simple Core, Modular and Extensible

Page 18: Big Data in the Microsoft Platform

Hadoop History
• 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4 TB of archives over 100 EC2s
• 2008: Web-scale deployments at Y!, Facebook, Last.fm
• April 2008: Yahoo does fastest sort of a TB, 3.5 min over 910 nodes
• May 2009:
  – Yahoo does fastest sort of a TB, 62 sec over 1460 nodes
  – Yahoo sorts a PB in 16.25 hours over 3658 nodes
• June 2009, Oct 2009: Hadoop Summit, Hadoop World
• September 2009: Doug Cutting joins Cloudera

Page 19: Big Data in the Microsoft Platform

Hadoop Ecosystem

[Diagram: components of the Hadoop ecosystem]
• HDFS (Hadoop Distributed File System)
• HBase (key-value store)
• MapReduce (Job Scheduling/Execution System), with Streaming/Pipes APIs
• Pig (Data Flow)
• Hive (SQL)
• Sqoop, connecting to RDBMS
• BI Reporting, ETL Tools
• Avro (Serialization)
• ZooKeeper (Coordination)

Page 20: Big Data in the Microsoft Platform

Microsoft & Hadoop

Page 21: Big Data in the Microsoft Platform

HDInsight

Page 22: Big Data in the Microsoft Platform

HDFS

Page 23: Big Data in the Microsoft Platform

HDFS Is…
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System

Page 24: Big Data in the Microsoft Platform

HDFS at a Glance

Block Size = 64 MB
Replication Factor = 3

Cost/GB is a few ¢/month vs $/month
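
As a rough illustration of the block size and replication factor above, the sketch below estimates how many blocks a file is split into and how much raw storage its replicas consume. The function name `hdfs_footprint` and the 1 GB example file are assumptions made for illustration; they are not part of Hadoop or HDInsight.

```python
import math

BLOCK_SIZE_MB = 64        # block size quoted on the slide
REPLICATION_FACTOR = 3    # replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block replicas, raw MB stored) for a single file.

    Hypothetical helper: it only does the arithmetic and does not talk
    to a cluster.
    """
    # A file is cut into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different nodes.
    replicas = blocks * replication
    # A partial last block only occupies the bytes it actually holds.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


# Example: a hypothetical 1 GB (1024 MB) file.
blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)
```

With these settings every logical gigabyte costs three gigabytes of disk, which is the trade made for redundancy on commodity hardware.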

Page 25: Big Data in the Microsoft Platform

HDInsight

HDFS Demo

Page 26: Big Data in the Microsoft Platform

Map Reduce

Page 27: Big Data in the Microsoft Platform

Map Reduce Is…
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop
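
To make the programming model concrete, here is a minimal single-process sketch of the classic word-count job. It only imitates the map, shuffle and reduce steps in plain Python; it is not the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["big data is big", "data is data"])
print(counts)
```

In a real cluster the map calls run in parallel next to the data blocks and the shuffle moves pairs across the network; the developer still writes only the map and reduce functions.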

Page 28: Big Data in the Microsoft Platform

Map Reduce At a Glance

Page 29: Big Data in the Microsoft Platform

HDInsight

Map Reduce Demo

Page 30: Big Data in the Microsoft Platform

Hive

Page 31: Big Data in the Microsoft Platform

Hive Is…
• A system for managing and querying structured data built on top of Hadoop
  – Map-Reduce for execution
  – HDFS for storage
  – Metadata on raw files
• Key Building Principles:
  – SQL as a familiar data warehousing tool
  – Extensibility – Types, Functions, Formats, Scripts
  – Scalability and Performance
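
The "metadata on raw files" idea can be sketched in a few lines: the data stays as delimited text, and a separately kept schema is applied only when the rows are read. The code below is an illustration in plain Python, not Hive itself; the schema, the sample lines, the tab delimiter and the function names are all assumptions.

```python
from collections import Counter

# Hypothetical table metadata: column names and types, stored apart from the data.
SCHEMA = [("user_id", int), ("page", str), ("seconds", float)]

# Hypothetical raw file contents: tab-delimited text with no embedded schema.
RAW_LINES = [
    "1\t/home\t3.5",
    "2\t/home\t1.0",
    "1\t/pricing\t12.25",
]


def read_rows(lines, schema, delimiter="\t"):
    """Apply the schema at read time, turning each raw line into a typed row."""
    for line in lines:
        fields = line.split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def views_per_page(rows):
    """Roughly what a `SELECT page, COUNT(*) ... GROUP BY page` query computes."""
    return Counter(row["page"] for row in rows)


result = views_per_page(read_rows(RAW_LINES, SCHEMA))
print(dict(result))
```

Because the schema lives outside the files, the same raw data can be reinterpreted by changing the metadata rather than rewriting the data.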

Page 32: Big Data in the Microsoft Platform

Hive Architecture

Page 33: Big Data in the Microsoft Platform

HDInsight

Hacking with Hive

Page 34: Big Data in the Microsoft Platform

Pig

Page 35: Big Data in the Microsoft Platform

Pig Is…
Apache Pig is a platform for analyzing large data sets that consists of a high-level language (PigLatin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

• Ease of programming

• Optimization opportunities

• Extensibility

• Built upon Hadoop

Page 36: Big Data in the Microsoft Platform

Pig Architecture

Parser (PigLatin → LogicalPlan)

Optimizer (LogicalPlan → LogicalPlan)

Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)

ExecutionEngine

Pig Context

Hadoop

Grunt (Interactive shell) PigServer (Java API)

Page 37: Big Data in the Microsoft Platform

HDInsight

Rocking Data Processing with Pig

Page 38: Big Data in the Microsoft Platform

Sqoop

Page 39: Big Data in the Microsoft Platform

Sqoop Is…
• Easy import of data from many databases to HDFS
• Generates code for use in MapReduce applications
• Integrates with Hive
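
One way to picture a parallel import is that the range of a numeric key column is divided into contiguous slices, and each map task reads one slice. The sketch below shows only that range arithmetic; it is a simplified illustration, not Sqoop's actual code, and the function name `split_ranges` and the example key range are assumptions.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices.

    Hypothetical helper: returns one (start, end) pair per non-empty slice,
    spreading any remainder over the first slices.
    """
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than rows: skip empty slices
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Example: 1000 rows with keys 1..1000 imported by 4 parallel tasks.
for start, end in split_ranges(1, 1000, 4):
    print(f"task reads rows WHERE id BETWEEN {start} AND {end}")
```

Each slice can be fetched independently, which is why an evenly distributed key column matters for balanced import tasks.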

Page 40: Big Data in the Microsoft Platform

Sqoop Architecture

Page 41: Big Data in the Microsoft Platform

HDInsight

Bulk Data Loading Using Sqoop

Page 42: Big Data in the Microsoft Platform

HDInsight Service

Page 43: Big Data in the Microsoft Platform

HDInsight Service Architecture

Page 44: Big Data in the Microsoft Platform

HDInsight

HDInsight Service Overview

Page 45: Big Data in the Microsoft Platform

Hadoop Considerations

Page 46: Big Data in the Microsoft Platform

Super Crowded Ecosystem

Page 47: Big Data in the Microsoft Platform

The Hadoop Ecosystem

Page 48: Big Data in the Microsoft Platform

Hadoop is not a silver bullet...

Page 49: Big Data in the Microsoft Platform

Some Challenges
• Hadoop doesn’t power big data applications
  – Not a transactional datastore. Slosh back and forth via ETL
• Processing latency
  – Non-incremental, must re-slurp entire dataset every pass
• Ad-Hoc queries
  – Bare metal interface, data import
• Graphs
  – Only a handful of graph problems amenable to MR

Page 50: Big Data in the Microsoft Platform

Beyond Hadoop
• Percolator (incremental processing): http://research.google.com/pubs/pub36726.html
• Dremel (ad-hoc analysis queries): http://research.google.com/pubs/pub36632.html
• Pregel (big graphs): http://dl.acm.org/citation.cfm?id=1807184

Page 51: Big Data in the Microsoft Platform

In the Meantime...

Page 52: Big Data in the Microsoft Platform

Takeaways
• Hadoop provides the foundation of big data solutions
• Computing and storage are the fundamental components of Hadoop
• HDInsight Server and Service are Microsoft’s distributions of Hadoop
• HDInsight is just one component of Microsoft’s BI strategy

Page 53: Big Data in the Microsoft Platform

http://www.tellagostudios.com
http://jrodthoughts.com
http://twitter.com/#!/jrodthoughts
http://weblogs.asp.net/gsusx