The Leader in Big Data Consulting

What You Should Know About Big Data

www.mammothdata.com | @mammothdataco

{CIO/CTO Breakfast Forum | Columbia}

Andrew C. Oliver, President & Founder

● @acoliver

● Programming since age 8

● Java since ~1997

● Founded POI project (currently hosted at Apache) with Marc Johnson ~2000

○ Former member Jakarta PMC

○ Emeritus member of Apache Software Foundation

● Joined JBoss ~2002

● Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)

● Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver

○ I make fanboys cry

Open Software Integrators

Founded Nov 2007 by Andrew C. Oliver (me) in Durham, NC

Based in Durham, NC

Office also in Chicago, IL

Operate nationally (and occasionally internationally)

Started out specializing in Java/Linux/Enterprise Scalability, now moved more towards NoSQL, Big Data

Professional Services (Consulting, Training, Strategy)

Overview

What is Big Data?

What is Hadoop?

But…

Where should you use Big Data technologies?

Market Segments for Hadoop

Where shouldn’t you use Big Data technologies?

How can you identify places to use this?

Why should you do this?

Alphabet soup

What is Big Data?

A marketing term for a set of technologies

mainly in the Hadoop ecosystem

Not a specific number of bytes or petabytes

What is Hadoop?

Core

HDFS - a distributed filesystem

YARN - a cluster manager

Map-Reduce implementation / API

Pig - a scripting/query language for map-reduce jobs

Hive - SQL and data warehousing infrastructure
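To make the Map-Reduce item above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using the classic word-count problem. It is plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel across machines with the data stored in HDFS.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored on a cluster.
    sample = ["big data is big", "hadoop stores big data"]
    print(word_count(sample))
```

The point of the split is that each map call only sees one line and each reduce call only sees one key, which is what lets a framework spread the work over many machines.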

But...

There is a larger ecosystem beyond this core...

Where Should You Use Big Data Technologies?

Unstructured Data

Lots of Data

High volume input

Data warehousing

Streams

Machine Learning / Decision Support

BI/Analytics

Market Segments

“New” market, new kinds of problems

Data Warehousing Market (MPP systems, Teradata, Netezza)

Machine Learning / Decision / BI

...growing… really fast

Where Shouldn’t You Use Big Data Technology?

With a few exceptions this isn’t your “operational” datastore

Cassandra sometimes

Clickstreams sort of

How Can You Identify Your Opportunities to Use Big Data Technology?

How To Find Uses

Some are obvious

Long-running queries

Questions your database can’t handle

Can you aggregate all the data you need to answer your questions?

Where are costs, such as licensing, a constraint?

Unifying disparate data streams

Why Should You Do This?

How much data have you thrown away, then found out it was useful?

Weblogs since 1996

What questions do you have?

How is your database doing with those long-running queries?

How much would it cost to expand your proprietary data warehouse?

Competitive advantage

Alphabet Soup

Core

Hadoop Distributed File System (HDFS) - distributed filesystem

Yet Another Resource Negotiator (YARN) - cluster manager

Pig - SQL on steroids, query language for map-reduce jobs

Hive / Impala - data warehousing / SQL frameworks

More

HBase / Cassandra - column family datastores (time series data especially)

Spark/Shark and Storm - in-memory (low latency) map-reduce-style processing, also streams

Oozie - workflow / job control

Ambari - admin/deployment tool (also Cloudera Manager)

Sqoop - ETL tool to extract/transform/load from your RDBMS

Flume - Enterprise Service Bus like tool for transporting data in/out

Mahout - Machine Learning / Decision making
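The list above notes that column family datastores such as HBase and Cassandra suit time series data. The following is a toy, in-memory sketch of why: each series is one wide row whose columns are kept sorted by timestamp, so a time-range query is a cheap slice rather than a join. This is not the HBase or Cassandra API. The class name ColumnFamilySketch, its put and scan methods, and the sensor readings are all hypothetical, and real systems additionally distribute rows across nodes and persist them to disk.

```python
from __future__ import annotations

import bisect
from collections import defaultdict


class ColumnFamilySketch:
    """Toy model: row key -> list of (timestamp, value) columns, kept sorted."""

    def __init__(self) -> None:
        self._rows: dict[str, list[tuple[int, float]]] = defaultdict(list)

    def put(self, row_key: str, timestamp: int, value: float) -> None:
        # Each reading becomes one column in a wide row; insort keeps
        # the row ordered by timestamp regardless of arrival order.
        bisect.insort(self._rows[row_key], (timestamp, value))

    def scan(self, row_key: str, start: int, end: int) -> list[tuple[int, float]]:
        """Return readings with start <= timestamp < end via two binary searches."""
        columns = self._rows.get(row_key, [])
        lo = bisect.bisect_left(columns, (start,))
        hi = bisect.bisect_left(columns, (end,))
        return columns[lo:hi]


if __name__ == "__main__":
    store = ColumnFamilySketch()
    # Hypothetical sensor readings, written out of order on purpose.
    store.put("sensor-1", 30, 2.5)
    store.put("sensor-1", 10, 1.0)
    store.put("sensor-1", 20, 1.5)
    print(store.scan("sensor-1", 10, 30))
```

The data is denormalized by design: everything needed to answer "what happened to this series between two times" lives in one row, which is the trade-off that makes this layout fast for series and awkward for complex, heavily related data.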

Conclusions

RDBMS may not scale to your needs

Your data may not map efficiently to tables

Column Family/Big Table - fast, scalable, denormalized, map reduce, good for series, not efficient for complex data

Hadoop is an ecosystem of different software packages mainly centered around HDFS and Map Reduce (But not exclusively)

Both expands our capabilities and disrupts old technologies

Not usually an operational datastore

Use this where you need it, most places create a basic POC and then deploy a competency center/platform then increase uses

There is a long list of alphabet soup and addons...

Thank you for attending!
