Big Data: the next frontier
Leonid Zhukov, Professor, Higher School of Economics
RVC Seminar, Moscow, 08/02/2013
Big data
+ Graph of terms popularity
www.visibletechologies.com
Headlines
Data driven business
Data democratization
Data scientists
The White House
+ $200M initiative
+ NSF: core techniques
+ NIH: 1000 genomes
+ DOE: advanced computing
+ DOD: data to decisions
+ USGS: Earth system
www.whitehouse.gov
Market Forecast
+ Venture money invested (Reuters):
+ 2009 - $1.1B
+ 2010 - $1.53B
+ 2011 - $2.47B
www.wikibon.com
+ Market forecasts:
+ IDC: 2015 - $16.9B
+ Gartner: 2016 - $55B
Big Data Revenue 2012
+ Big Business:
+ IBM
+ HP
+ Oracle
+ Teradata
+ EMC
www.wikibon.com
Big Data Vendors!
+ Hadoop:
+ Cloudera
+ MapR Technologies
+ HortonWorks
www.wikibon.com
What is big data
+ Big data:
+ “Data you can’t process by traditional tools”
+ “A phenomenon defined by the rapid acceleration in the expanding volume of high velocity, complex and diverse types of data.”
+ “Refers to a collection of tools, techniques and technologies for working with data productively, at any scale.”
What is Big data
+ 3V
+ Volume: petabytes (1000TB) to exabytes (1000PB)
+ Variety: structured, semi-structured, unstructured
+ Velocity: Tb/s data streams
+ Requires distributed processing
+ Big data = storage + processing
+ Big data = Hadoop (not only)
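A rough back-of-the-envelope sketch of why volume at this scale calls for distributed processing. The disk throughput of 100 MB/s and the cluster size of 1000 disks are assumptions chosen for illustration, not figures from the slides.

```python
# Assumed figures, for illustration only (not from the slides).
DISK_MB_PER_S = 100        # assumed sequential read rate of a single disk
PETABYTE_MB = 1000 ** 3    # 1 PB = 1000 TB = 10^9 MB (decimal units, as on the slide)


def scan_time_hours(data_mb: int, disks: int) -> float:
    """Hours needed to read data_mb once when it is spread evenly over `disks` disks."""
    seconds = data_mb / (DISK_MB_PER_S * disks)
    return seconds / 3600


single = scan_time_hours(PETABYTE_MB, 1)
cluster = scan_time_hours(PETABYTE_MB, 1000)
print(f"1 disk:     {single / 24:.0f} days")
print(f"1000 disks: {cluster:.1f} hours")
```

Under these assumptions a single disk needs about 116 days for one pass over a petabyte, while 1000 disks reading in parallel need under three hours.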
Big data Glossary
+ Hadoop, MapReduce, Hive, Pig, Cascading, HBase, Hypertable, Cassandra, Flume, Sqoop, Mongo, Voldemort, Storm, Kafka, Drill, Dremel, Impala, Zookeeper, Ambari, Oozie, Yarn, Redis, Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R, Mahout, Weka
How big is Big?
+ Google
+ 24 PB data processed daily
+ Twitter
+ 340 mln daily tweets
+ 1.6 bln search queries
+ 7 TB added daily
+ Facebook
+ 750 mln users
+ 12 TB daily content
+ 2.7 bln “likes” and comments daily
Supercomputing
+ National Labs, Universities, Military
+ Processing power, flops, MPI
+ Parallel computing:
+ Cray, IBM SP, SGI
+ Beowulf cluster (Linux commodity)
New realities
+ Yahoo, AltaVista, Inktomi, Google
+ Consumer web companies:
+ web search (crawling, indexing)
+ advertising
+ email services
+ ecommerce
+ Commodity hardware
Google
+ 2003: Google File System (GFS) paper
+ 2004: MapReduce paper
GFS/HDFS
+ Distributed replicated data blocks (64 MB)
+ Master-slave architecture (Name Node, Data Nodes)
+ Not a general file system
+ Access via command line utils and API
+ Files can’t be modified after they are written
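A minimal sketch of the block idea above: a file is cut into 64 MB blocks and each block is copied to several data nodes, with the name node keeping only the block-to-node map. The replication factor of 3, the node names, and the round-robin placement are assumptions for illustration; real placement policies are more involved (for example, rack awareness).

```python
import math
from itertools import cycle, islice

BLOCK_SIZE_MB = 64   # block size from the slide
REPLICATION = 3      # assumed replication factor


def place_blocks(file_size_mb, data_nodes, replication=REPLICATION):
    """Return {block_index: [nodes holding a replica]}, as a name node might record it."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for block in range(n_blocks):
        # Start at a different node for each block, then take
        # `replication` consecutive (and therefore distinct) nodes.
        start = block % len(data_nodes)
        placement[block] = list(islice(cycle(data_nodes), start, start + replication))
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical data node names
layout = place_blocks(200, nodes)      # a 200 MB file needs 4 blocks
for block, replicas in layout.items():
    print(block, replicas)
```

Losing any single node in this layout still leaves at least two copies of every block.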
MapReduce
+ MapReduce programming model:
+ functional programming
+ like UNIX pipeline
+ Master-slave architecture
+ Master: divide, schedule, monitor work
+ Slave: actual processing
+ Scalable (in user code):
+ no file IO
+ no networking
+ no synchronization
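The programming model can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce steps on a word count; the function names are hypothetical and none of Hadoop's actual API is used.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data", "big clusters process big data"])
print(counts)
```

The map and reduce functions are pure and touch no files, sockets, or locks, which is what lets a framework run many copies of them in parallel.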
Data movement
www.cloudera.com
+ store and process data on the same nodes
+ bring code to data, data “locality”
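A minimal sketch of locality-aware scheduling, assuming a scheduler that knows which nodes hold a replica of a block and how many free task slots each node has. The function and node names are hypothetical; real schedulers weigh more factors.

```python
def assign_task(block_replicas, free_slots):
    """Choose a node for a task that reads one block.

    block_replicas: nodes that hold a copy of the block.
    free_slots: {node: number of free task slots}, updated in place.
    Returns (node, is_local); node is None when the cluster is full.
    """
    # Prefer a node that already stores the block: the code moves, the data stays.
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True
    # Otherwise fall back to any free node; the block must cross the network.
    for node, slots in free_slots.items():
        if slots > 0:
            free_slots[node] -= 1
            return node, False
    return None, False


slots = {"dn1": 0, "dn2": 1, "dn3": 1}
first = assign_task(["dn1", "dn2"], slots)    # local slot available on dn2
second = assign_task(["dn1", "dn2"], slots)   # replicas busy, runs remotely
print(first, second)
```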
Hadoop
+ Doug Cutting
+ Search indexer - Lucene
+ Web crawler - Nutch
+ Hadoop
+ HDFS
+ MapReduce
Yahoo!
+ 40,000 servers
+ 170PB storage
+ 1000+ active users
+ 5M+ monthly jobs
+ email spam filters
+ categorization, personalization
+ computational advertising
Database NoSQL Revolution
+ Needed:
+ fast read/write time
+ high concurrency
+ easy horizontal scalability
+ Flat data structure
+ Sacrificed:
+ DB Schema
+ SQL
+ Transactions
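A minimal sketch of the trade-off above: a flat key-value store that scales horizontally by hashing each key to one of several shards, with no schema, no SQL, and no multi-key transactions. The class name and the modulo placement are assumptions for illustration; production systems typically use schemes such as consistent hashing so that adding a shard moves fewer keys.

```python
import hashlib


class ShardedKV:
    """Toy key-value store; each dict stands in for one server."""

    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKV(n_shards=4)
for i in range(100):
    store.put(f"user:{i}", {"visits": i})
print(store.get("user:42"))
print([len(shard) for shard in store.shards])
```

Each read or write touches exactly one shard, so throughput grows with the number of shards as long as keys spread evenly.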
NoSQL World
+ Key-value: Dynamo, Voldemort, Redis, Riak
+ Column (tabular): HBase, Hypertable, Cassandra
+ Document store: CouchDB, MongoDB
+ Graph: Neo4J, FlockDB
+ 120+ products (2012)
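To make the four families concrete, the same small piece of data can be laid out in each model using plain Python structures. This is a conceptual sketch only; the field names are made up and no product's API or storage format is implied.

```python
import json

# Key-value: the store sees an opaque value; the application parses it.
key_value = {"user:42": '{"name": "Ann", "city": "Moscow"}'}

# Column (tabular): row key -> column family -> column -> value.
column = {"42": {"profile": {"name": "Ann", "city": "Moscow"}}}

# Document: nested, schema-free records that can be queried by field.
document = {"_id": 42, "name": "Ann", "follows": [{"id": 7, "name": "Bob"}]}

# Graph: nodes plus typed edges between them.
nodes = {42: {"name": "Ann"}, 7: {"name": "Bob"}}
edges = [(42, "FOLLOWS", 7)]


def followed_by(user_id):
    """Graph-style traversal: names of everyone user_id follows."""
    return [nodes[dst]["name"] for src, rel, dst in edges
            if src == user_id and rel == "FOLLOWS"]


print(json.loads(key_value["user:42"])["city"])
print(column["42"]["profile"]["city"])
print(document["follows"][0]["name"])
print(followed_by(42))
```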
Hadoop tools
+ Pig
+ high level scripting language (PigLatin)
+ converts to MapReduce jobs
+ Hive
+ SQL-like queries on data in HDFS
+ converts to MapReduce jobs
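A sketch of what "converts to MapReduce jobs" means in principle: a grouped aggregation expressed as a mapper that emits the grouping column as the key and a reducer that sums the values. The query, table, and column names are hypothetical, and this is not the plan any particular tool generates.

```python
from collections import defaultdict

# Hypothetical query, for illustration:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;

sales = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 5},
    {"region": "north", "amount": 7},
]


def mapper(row):
    # GROUP BY column becomes the key, the aggregated column the value.
    yield row["region"], row["amount"]


def reducer(key, values):
    # SUM(...) becomes the reduce step for each key.
    return key, sum(values)


def run_job(rows):
    grouped = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())


totals = run_job(sales)
print(totals)
```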
Typical hadoop usage
+ Text mining
+ Pattern recognition
+ Recommendation systems (collaborative filtering)
+ Prediction models
+ Risk assessment
+ Sentiment analysis
+ Customer churn prediction
+ Customer segmentation
+ Point of Sale Transaction analysis
+ Data “sandbox”
Application fields
+ Science: sensors, genome, weather, satellite, imaging
+ Engineering: log analytics, status feeds, network messages, spam filters
+ Product: financial, pharmaceutical, insurance, energy, retail, ecommerce, healthcare, telecom
+ Business: analytics, BI
Business analytics
+ Analytic
+ Operational
www.datasciencecentral.com
Capture, analyze, learn from data
Why Hadoop?
www.thinkbiganalytics.com
Cloudera
+ Enterprise support for Apache Hadoop
+ Founded 2008, funding $141 M
+ Employees: 230
+ Products:
+ CDH 4 (Cloudera distribution of Hadoop)
+ Impala
+ Consulting and training
www.cloudera.com
MapR
+ Founded 2009, funding $20M
+ MapR Technologies is engineering game-changing Map/Reduce related technologies
+ Products:
+ M3, M5, M7
+ NFS, no single point of failure
+ NOT open source!
www.mapr.com
HortonWorks
+ Founded 2011
+ Yahoo spin-off
+ Products:
+ HDP distribution
+ tools
www.hortonworks.com
Big Data Landscape
www.bigdatalandscape.com
Splunk
+ Founded 2003, raised $230M, IPO 2012, Market cap $3.35B
+ Machine logs analysis, operational intelligence
+ Collecting, searching, monitoring
www.splunk.com
Datameer
+ Founded 2009, funding $17.8M
+ Big data:
+ Data integration
+ Data Analytics
+ Data Visualization
www.datameer.com
Datasift
+ Founded 2010, funding $29.7M
+ Data platform for social web
+ Aggregate and filter data
www.datasift.com
Infochimps
+ Founded 2009, funding $5.5M
+ Transitioned from data marketplace to big data platform
+ End-to-end big data solution, real time
www.infochimps.com
Tableau software
+ Founded 2003, funding $15M
+ Big data analytics
+ Big data visualization
www.tableau.com
Big data Startups 2012
+ Platfora, in memory BI on Hadoop
+ Sumologic, log file analysis
+ Hadapt, Hadoop + RDBMS
+ Metamarkets, patterns in data flow
+ DataStax, consulting, training
+ Karmasphere, BI, analytics on Hadoop
Big data startups 2013!
+ 10gen, MongoDB
+ ClearStory, big data aggregation + analytics
+ Continuuity, Hadoop API
+ Parstream, database analytics
+ Zoomdata, data visualization
+ Climate corporation, predictive analytics
Big data Processing

                    Batch processing    Interactive                Stream
Query time          minutes to hours    milliseconds to seconds    continuous
Data volume         TB to PB            GB to PB                   continuous
Programming model   MapReduce           Queries                    DAG
Users               Developers          Analysts                   Developers
Open Source         Hadoop MapReduce    Drill, Impala              Storm, Kafka
New technologies
+ Real time querying
+ Drill (based on Google Dremel)
+ Impala (Cloudera)
+ Data stream processing
+ Storm (Twitter), real time analytics
+ Kafka (LinkedIn), messaging system
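In contrast to batch jobs, stream processing updates its result as each event arrives. A minimal sketch, assuming time-ordered `(timestamp, key)` events and a sliding time window; the function name is hypothetical and no Storm or Kafka API is used.

```python
from collections import Counter, deque


def windowed_counts(events, window_seconds):
    """Yield (timestamp, counts) after every event.

    events: iterable of (timestamp, key) pairs in time order.
    counts covers only events newer than timestamp - window_seconds.
    """
    window = deque()
    counts = Counter()
    for ts, key in events:
        window.append((ts, key))
        counts[key] += 1
        # Expire events that have fallen out of the window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_key = window.popleft()
            counts[old_key] -= 1
            if counts[old_key] == 0:
                del counts[old_key]
        yield ts, dict(counts)


stream = [(0, "a"), (1, "b"), (2, "a"), (11, "a")]
snapshots = list(windowed_counts(stream, window_seconds=10))
for ts, snapshot in snapshots:
    print(ts, snapshot)
```

Because the function is a generator, it never needs the whole stream in memory and can run over an unbounded source.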
Machine learning
+ Predictive analytics
+ Patterns discovery
+ Data mining
+ Tools:
+ Mahout
+ R
Big data revolution
+ Google: GFS, MapReduce, BigTable
+ Yahoo: Hadoop
+ Amazon: DynamoDB
+ Facebook: Cassandra, HBase
+ Twitter: FlockDB, Storm
+ LinkedIn: Voldemort, Kafka
Observations
+ Game changing technologies come from big companies
+ Open Source (!)
+ Start-up ecosystem
+ Less general, more specialized
+ Next step: big data analytics and visualization
Data scientist
+ Machine Learning
+ Data Mining
+ Statistics
+ Software Engineering
+ Hadoop/MapReduce/HBase/Hive/Pig
+ Java, Python, C/C++, SQL
“By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
Big Data Products MindMap
www.garycrawford.co.uk
Contacts
+ Leonid Zhukov, Ph.D.
+ School of Applied Mathematics and Information Science Higher School of Economics, NRU-HSE
+ www.leonidzhukov.ru