Big Data: the next frontier. Leonid Zhukov, Professor, Higher School of Economics. RVC Seminar, Moscow, 08/02/2013

Business of Big Data


Page 1: Business of Big Data

Big Data: the next frontier

Leonid Zhukov Professor Higher School of Economics

1

RVC Seminar, Moscow, 08/02/2013

Page 2: Business of Big Data

Big data

+ Graph of terms popularity

2

www.visibletechologies.com

Page 3: Business of Big Data

McKinsey, May 2011 3

www.mckinsey.com

Page 4: Business of Big Data

Headlines 4

Data driven business

Data democratization

Data scientists

Page 5: Business of Big Data

The White House

+ $200M initiative

+ NSF: core techniques

+ NIH: 1000 genomes

+ DOE: advanced computing

+ DOD: data to decisions

+ USGS: Earth system

5

www.whitehouse.gov

Page 6: Business of Big Data

Gartner Hype Cycle 6

www.gartner.com

Page 7: Business of Big Data

Market Forecast

+ Venture money invested (Reuters):

  + 2009: $1.1B

  + 2010: $1.53B

  + 2011: $2.47B

7

www.wikibon.com

+ Market forecasts:

  + IDC: 2015, $16.9B

  + Gartner: 2016, $55B
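
The venture figures on this slide imply a growth rate that is easy to work out. The sketch below only restates the slide's three numbers and computes year-over-year growth and the compound annual growth rate; the function names are illustrative and not taken from the talk.

```python
# Venture investment figures quoted on the slide (Reuters), in billions of USD.
venture_usd_bn = {2009: 1.1, 2010: 1.53, 2011: 2.47}


def yoy_growth(series):
    """Return {year: fractional growth over the previous year}."""
    years = sorted(series)
    return {cur: series[cur] / series[prev] - 1.0
            for prev, cur in zip(years, years[1:])}


def cagr(series):
    """Compound annual growth rate between the first and last year."""
    years = sorted(series)
    first, last = years[0], years[-1]
    return (series[last] / series[first]) ** (1.0 / (last - first)) - 1.0


for year, growth in yoy_growth(venture_usd_bn).items():
    print(f"{year}: {growth:+.1%} vs previous year")
print(f"CAGR 2009-2011: {cagr(venture_usd_bn):.1%}")
```

Under these figures, investment grew by roughly half per year on average over the two-year span.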

Page 8: Business of Big Data

Big Data Revenue 2012 8

+ Big Business:

  + IBM

  + HP

  + Oracle

  + Teradata

  + EMC

www.wikibon.com

Page 9: Business of Big Data

Big Data Vendors!

+ Hadoop:

  + Cloudera

  + MapR Technologies

  + HortonWorks

9

www.wikibon.com

Page 10: Business of Big Data

Forrester Wave 10

www.forrester.com

Page 11: Business of Big Data

What is big data 11

+ Big data:

+ “Data you can’t process by traditional tools”

+ “A phenomenon defined by the rapid acceleration in the expanding volume of high velocity, complex and diverse types of data.”

+ “Refers to a collection of tools, techniques and technologies for working with data productively, at any scale.”

Page 12: Business of Big Data

What is Big data 12

+ 3V

+ Volume: petabytes (1000TB) to exabytes (1000PB)

+ Variety: structured, semi-structured, unstructured

+ Velocity: Tb/s data streams

+ Requires distributed processing

+ Big data = storage + processing

+ Big data = Hadoop (not only)

Page 13: Business of Big Data

Big data Glossary

+ Hadoop, MapReduce, Hive, Pig, Cascading, HBase, Hypertable, Cassandra, Flume, Sqoop, MongoDB, Voldemort, Storm, Kafka, Drill, Dremel, Impala, ZooKeeper, Ambari, Oozie, YARN, Redis, Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R, Mahout, Weka

13

Page 14: Business of Big Data

How big is Big?

+ Google

  + 24 PB of data processed daily

+ Twitter

  + 340 million tweets daily

  + 1.6 billion search queries

  + 7 TB added daily

+ Facebook

  + 750 million users

  + 12 TB of content added daily

  + 2.7 billion “likes” and comments daily
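
To get a feel for these daily totals, it helps to convert them into sustained per-second rates. The sketch below does only that arithmetic on the slide's figures. It assumes decimal units (1 PB = 10^15 bytes) and a perfectly even load over the day, which real traffic does not have.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Decimal (SI) units; an assumption, the slide does not say which it uses.
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15


def per_second(daily_total):
    """Average rate per second for a quantity accumulated over one day."""
    return daily_total / SECONDS_PER_DAY


google_bytes_per_s = per_second(24 * PB)      # data processed
twitter_bytes_per_s = per_second(7 * TB)      # data added
facebook_bytes_per_s = per_second(12 * TB)    # content added
tweets_per_s = per_second(340e6)

print(f"Google:   {google_bytes_per_s / GB:.1f} GB/s processed")
print(f"Twitter:  {twitter_bytes_per_s / MB:.1f} MB/s added, "
      f"{tweets_per_s:.0f} tweets/s")
print(f"Facebook: {facebook_bytes_per_s / MB:.1f} MB/s added")
```

Even as averages, these rates are beyond what a single machine's disks can sustain, which is the motivation for the distributed storage and processing discussed on the following slides.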

14

Page 15: Business of Big Data

Sources of Big Data 15

www.ibm.com

Page 16: Business of Big Data

Supercomputing

+ National Labs, Universities, Military

+ Processing power, flops, MPI

+ Parallel computing:

+ Cray, IBM SP, SGI

+ Beowulf cluster (Linux commodity)

16

Page 17: Business of Big Data

New realities

+ Yahoo, AltaVista, Inktomi, Google

+ Consumer web companies:

+ web search (crawling, indexing)

+ advertising

+ email services

+ ecommerce

+ Commodity hardware

17

Page 18: Business of Big Data

Google 18

2003 2004

Page 19: Business of Big Data

GFS/HDFS

+ Distributed replicated data blocks (64 MB)

+ Master-slave architecture (Name Node, Data Nodes)

+ Not a general file system

+ Access via command line utils and API

+ Files can’t be modified once written
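
The block and replica idea can be sketched in a few lines. This is an illustration, not HDFS code: the 64 MB block size comes from the slide, the replication factor of 3 is an assumption (it is the usual HDFS default), and the round-robin placement is a deliberate simplification of the rack-aware policy a real Name Node applies. All function and node names are hypothetical.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as on the slide
REPLICATION = 3                 # assumption: the usual HDFS default


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return [(block_index, length_in_bytes), ...] for a file of file_size bytes."""
    count = math.ceil(file_size / block_size)
    return [(i, min(block_size, file_size - i * block_size))
            for i in range(count)]


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Map each block index to `replication` distinct data nodes (round-robin)."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        i: [data_nodes[(i + k) % len(data_nodes)] for k in range(replication)]
        for i in range(num_blocks)
    }


# A 200 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for index, length in blocks:
    print(index, length, placement[index])
```

In this picture the Name Node holds only the `placement` table (metadata), while the Data Nodes hold the block contents.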

19

Page 20: Business of Big Data

MapReduce 20

+ MapReduce programming model:

  + functional programming

  + like UNIX pipeline

+ Master-slave architecture

  + Master: divide, schedule, monitor work

  + Slave: actual processing

+ Scalable:

  + no file IO

  + no networking

  + no synchronization
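
The programming model is easiest to see on the classic word-count example. The sketch below runs in a single Python process and is not Hadoop code; it only shows the three phases (map, shuffle, reduce). The function names are illustrative. Note that the user-supplied `map_phase` and `reduce_phase` contain no file IO, networking or synchronization, which in a real cluster are handled by the framework.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


print(word_count(["big data", "big deal"]))
```

Because each call to `map_phase` and each call to `reduce_phase` is independent of the others, the master can hand them to different slave nodes without changing the user's code.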

Page 21: Business of Big Data

 Data movement 21

www.cloudera.com

+ store and process data on the same nodes

+ bring code to data, data “locality”
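
A minimal sketch of what "bring code to data" means for a scheduler: when assigning a task for a block, prefer a node that already stores a replica of that block, and fall back to a remote node only when no such node has a free slot. The greedy policy, data structures and node names here are assumptions for illustration, not the actual Hadoop scheduler.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring nodes that hold the block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of free task slots}
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free task slots left")
            node, is_local = remote[0], False  # block must travel over the network
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


plan = schedule(
    {"b0": ["n1", "n2"], "b1": ["n2", "n3"], "b2": ["n1", "n3"]},
    {"n1": 1, "n2": 1, "n3": 1},
)
print(plan)
```

In this example every task lands on a node that already stores its block, so no block is copied across the network before processing.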

Page 22: Business of Big Data

Hadoop

+ Doug Cutting

+ Search indexer - Lucene

+ Web crawler - Nutch

+ Hadoop

+ HDFS

+ MapReduce

22

Page 23: Business of Big Data

Yahoo!

+ 40,000 servers

+ 170PB storage

+ 1000+ active users

+ 5M+ monthly jobs

+ email spam filters

+ categorization, personalization

+ computational advertising

23

Page 24: Business of Big Data

Database: NoSQL Revolution

+ Needed:

+ fast read/write time

+ high concurrency

+ easy horizontal scalability

+ Flat data structure

+ Sacrificed:

+ DB Schema

+ SQL

+ Transactions
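
The trade-off above can be illustrated with a toy hash-partitioned key-value store: a flat key space with no schema, no SQL and no cross-key transactions, where each key lives on exactly one shard, so reads and writes touch a single node. This is a sketch under simplifying assumptions. The class name is hypothetical, the shards are plain in-process dictionaries, and the modulo placement is a stand-in for the consistent hashing that systems in the Dynamo family use to limit data movement when nodes are added.

```python
import hashlib


class ShardedKV:
    """Toy key-value store that spreads keys over a fixed number of shards."""

    def __init__(self, num_shards):
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        # Each dict stands in for one storage node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key, default=None):
        return self.shards[self._shard_for(key)].get(key, default)


store = ShardedKV(num_shards=4)
for user_id in range(1000):
    store.put(f"user:{user_id}", {"visits": user_id % 7})

print(store.get("user:42"))
print([len(shard) for shard in store.shards])  # keys per shard
```

Since no operation spans more than one shard, capacity grows by adding shards, which is the horizontal scalability the slide lists as a requirement.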

24

Page 25: Business of Big Data

NoSQL World 25

+ Key-value: Dynamo, Voldemort, Redis, Riak

+ Column (tabular): HBase, Hypertable, Cassandra

+ Document store: CouchDB, MongoDB

+ Graph: Neo4J, FlockDB

+ 120+ products (2012)

Page 26: Business of Big Data

Hadoop stack 26

www.hortonworks.com

Page 27: Business of Big Data

Hadoop tools

+ Pig

+ high level scripting language (PigLatin)

+ converts to MapReduce jobs

+ Hive

+ SQL-like queries on data in HDFS

+ converts to MapReduce jobs
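
To make "converts to MapReduce jobs" concrete, consider an aggregate query such as `SELECT country, SUM(amount) FROM orders GROUP BY country`. The sketch below shows, in plain Python, how such a query can be expressed as a map step, a sort-based shuffle and a reduce step. The `orders` table, its columns and the function names are hypothetical, and this is a conceptual illustration rather than the plan Hive or Pig would actually generate.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input rows standing in for a table stored in HDFS.
orders = [
    {"country": "RU", "amount": 10.0},
    {"country": "US", "amount": 7.5},
    {"country": "RU", "amount": 5.0},
]


def map_order(row):
    """Map: emit the GROUP BY column as key and the aggregated column as value."""
    yield row["country"], row["amount"]


def run_group_by_sum(rows):
    pairs = [pair for row in rows for pair in map_order(row)]
    pairs.sort(key=itemgetter(0))            # shuffle/sort: bring equal keys together
    return {
        key: sum(value for _, value in group)  # reduce: SUM() per key
        for key, group in groupby(pairs, key=itemgetter(0))
    }


print(run_group_by_sum(orders))
```

The analyst writes only the query; the map, shuffle and reduce steps are what the tool derives from it.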

27

Page 28: Business of Big Data

Hadoop data movement 28

www.cloudera.com

Page 29: Business of Big Data

Typical hadoop usage

+ Text mining

+ Pattern recognition

+ Recommendation systems (collaborative filtering)

+ Prediction models

+ Risk assessment

+ Sentiment analysis

+ Customer churn prediction

+ Customer segmentation

+ Point of Sale Transaction analysis

+ Data “sandbox”

29

Page 30: Business of Big Data

Application fields

+ Science: sensors, genome, weather, satellite, imaging

+ Engineering: log analytics, status feeds, network messages, spam filters

+ Product: financial, pharmaceutical, insurance, energy, retail, ecommerce, healthcare, telecom

+ Business: analytics, BI

30

Page 31: Business of Big Data

Business analytics

+ Analytic

+ Operational

31

www.datasciencecentral.com

Capture, analyze, learn from data

Page 32: Business of Big Data

Who uses Hadoop? 32

www.cloudera.com

Page 33: Business of Big Data

Why Hadoop? 33

www.thinkbiganalytics.com

Page 34: Business of Big Data

Cloudera

+ Enterprise support for Apache Hadoop

+ Founded 2008, funding $141 M

+ Employees: 230

+ Products:

  + CDH 4 (Cloudera's distribution of Hadoop)

  + Impala

  + Consulting and training

34

www.cloudera.com

Page 35: Business of Big Data

MapR

+ Founded 2009, funding $20M

+ MapR Technologies is engineering game-changing Map/Reduce related technologies

+ Products:

  + M3, M5, M7

  + NFS, no single point of failure

  + NOT open source!

35

www.mapr.com

Page 36: Business of Big Data

HortonWorks

+ Founded 2011

+ Yahoo spin-off

+ Products:

+ HDP distribution

+ tools

36

www.hortonworks.com

Page 37: Business of Big Data

Hadoop Ecosystem 37

www.datameer.com

Page 38: Business of Big Data

Big Data Landscape 38

www.bigdatalandscape.com

Page 39: Business of Big Data

Splunk

+ Founded 2003, raised $230M, IPO 2012, Market cap $3.35B

+ Machine logs analysis, operational intelligence

+ Collecting, searching, monitoring

39

www.splunk.com

Page 40: Business of Big Data

Datameer

+ Founded 2009, funding $17.8M

+ Big data:

+ Data integration

+ Data Analytics

+ Data Visualization


40

www.datameer.com

Page 41: Business of Big Data

Datasift

+ Founded 2010, funding $29.7M

+ Data platform for social web

+ Aggregate and filter data

41

www.datasift.com

Page 42: Business of Big Data

Infochimps

+ Founded 2009, funding $5.5M

+ Transitioned from data marketplace to big data platform

+ End-to-end big data solution, real time

42

www.infochimps.com

Page 43: Business of Big Data

Tableau software

+ Founded 2003, funding $15M

+ Big data analytics

+ Big data visualization

43

www.tableau.com

Page 44: Business of Big Data

Big data Startups 2012

+ Platfora, in memory BI on Hadoop

+ Sumologic, log file analysis

+ Hadapt, Hadoop + RDBMS

+ Metamarkets, patterns in data flow

+ DataStax, consulting, training

+ Karmasphere, BI, analytics on Hadoop

44

Page 45: Business of Big Data

Big data startups 2013!

+ 10gen, MongoDB

+ ClearStory, big data aggregation + analytics

+ Continuuity, Hadoop API

+ Parstream, database analytics

+ Zoomdata, data visualization

+ Climate corporation, predictive analytics

45

Page 46: Business of Big Data

Big data by industry 46

www.gartner.com

Page 47: Business of Big Data

Big data Processing 47

                     Batch processing    Interactive                 Stream
Query time           minutes to hours    milliseconds to seconds     continuous
Data volume          TB to PB            GB to PB                    continuous
Programming model    MapReduce           Queries                     DAG
Users                Developers          Analysts                    Developers
Open source          Hadoop MapReduce    Drill, Impala               Storm, Kafka

Page 48: Business of Big Data

New technologies

+ Real-time querying

+ Drill (based on Google Dremel)

+ Impala (Cloudera)

+ Data stream processing

+ Storm (Twitter), real time analytics

+ Kafka (LinkedIn), messaging system
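
Stream processing differs from batch processing in that results are updated as each event arrives, rather than after a whole data set has been read. The sketch below imitates that style with chained Python generators: a stage that extracts hashtags from messages feeds a stage that keeps counts over the most recent items. It is a single-process illustration of the idea, not Storm or Kafka code, and the stage names and the window size are assumptions.

```python
from collections import Counter, deque


def extract_hashtags(messages):
    """Processing stage: turn a stream of messages into a stream of hashtags."""
    for message in messages:
        for token in message.split():
            if token.startswith("#"):
                yield token.lower()


def rolling_counts(items, window):
    """Processing stage: after every item, emit counts over the last `window` items."""
    recent = deque()
    counts = Counter()
    for item in items:
        recent.append(item)
        counts[item] += 1
        if len(recent) > window:
            expired = recent.popleft()
            counts[expired] -= 1
            if counts[expired] == 0:
                del counts[expired]
        yield dict(counts)


# The list stands in for an unbounded source such as a message queue.
incoming = ["#BigData is here", "try #hadoop", "#bigdata and #storm"]
for snapshot in rolling_counts(extract_hashtags(incoming), window=2):
    print(snapshot)
```

Each stage consumes one item at a time and never needs the full input, so the same pipeline shape works on a source that never ends. Chaining stages this way is a linear special case of the DAG programming model used by stream processors.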

48

Page 49: Business of Big Data

Machine learning

+ Predictive analytics

+ Pattern discovery

+ Data mining

+ Tools:

+ Mahout

+ R

49

Page 50: Business of Big Data

Big data revolution

+ Google: GFS, MapReduce, BigTable

+ Yahoo: Hadoop

+ Amazon: DynamoDB

+ Facebook: Cassandra, HBase

+ Twitter: FlockDB, Storm

+ LinkedIn: Voldemort, Kafka

50

Page 51: Business of Big Data

Observations

+ Game changing technologies come from big companies

+ Open Source (!)

+ Start-up ecosystem

+ Less general, more specialized

+ Next step: big data analytics and visualization

51

Page 52: Business of Big Data

Data scientist

+ Machine Learning

+ Data Mining

+ Statistics

+ Software Engineering

+ Hadoop/MapReduce/HBase/Hive/Pig

+ Java, Python, C/C++, SQL

52

“By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”

Page 53: Business of Big Data

Big Data Products MindMap

53

www.garycrawford.co.uk

Page 54: Business of Big Data

Contacts

+ Leonid Zhukov, Ph.D.

+ School of Applied Mathematics and Information Science Higher School of Economics, NRU-HSE

+ [email protected]

+ www.leonidzhukov.ru

54