
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

Real Time Big Data

InfoFarm Seminar 18/11/2015

About Me

About InfoFarm

Agenda

•  Typical Big Data Landscape
•  The need for Real Time Big Data
•  Real Time Data Ingestion
•  Tools for Real Time Big Data
    –  Apache Spark
    –  Apache Storm
    –  Search
•  Q&A
•  Lunch


A Typical Big Data Landscape

•  Data Silo

•  Batch environment

•  Periodical Analytics/statistics

•  Data Source for new systems

The need for Real Time Big Data

•  Obtaining analytical results faster
    –  Processing faster than once a day

•  Load evens out over the day

•  Past/Present/Future
    –  Alert for certain events
    –  Updating prediction models on-the-fly

•  Allow faster feedback to end users
    –  See results of your actions right away

Perfect fits for Real Time Processing

•  Anomaly Detection
    –  Abnormal readings of sensors
    –  Abnormal amounts of log files
    –  Fraud detection

•  Real Time updates to Recommender models
    –  Fast new recommendations in e-commerce
    –  Support for trending items
    –  Fast responses to events happening right now

•  Real Time updates of clustering models

•  Improving Classification based on current events

•  Can be run side-by-side with traditional historical models
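
To make the anomaly detection use case concrete, here is a minimal Python sketch of flagging abnormal sensor readings as they arrive. It is an illustration only, not something shown in the seminar: the names `RunningStats` and `detect_anomalies`, the 3-sigma threshold and the warm-up length are all assumptions. It keeps a running mean and variance (Welford's online algorithm), so no history has to be stored or reprocessed in batch.

```python
import math


class RunningStats:
    """Running mean and variance using Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until at least two values were seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect_anomalies(readings, threshold=3.0, warmup=5):
    """Return (index, value) for readings more than `threshold` standard
    deviations away from the running mean (hypothetical rule)."""
    stats = RunningStats()
    anomalies = []
    for i, x in enumerate(readings):
        std = stats.std()
        if stats.n >= warmup and std > 0 and abs(x - stats.mean) > threshold * std:
            anomalies.append((i, x))  # flagged readings do not update the stats
        else:
            stats.update(x)
    return anomalies


readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1, 19.7]
print(detect_anomalies(readings))  # prints [(6, 35.0)]
```

Leaving flagged readings out of the statistics is a design choice: it keeps a single spike from inflating the variance and hiding later anomalies, at the cost of adapting more slowly to a genuine shift in the signal.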


Ingestion Processing Output

Ingestion

Apache Kafka

•  Fast

•  Scalable

•  Durable

•  Distributed

Apache Kafka - Overview

•  Producers write messages to Kafka topics

•  Consumers process messages from a topic

•  Kafka runs on a cluster of servers, where each server is called a broker

Apache Kafka - Topics

•  Topics are split up into different partitions

•  Partitions are replicated across the cluster

•  Order of messages is guaranteed within a partition

•  Messages are stored for a configurable period of time

•  Producers decide which partition they write to

•  Consumers keep track of the offset of the messages they have read
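
The topic model above can be sketched in a few lines of plain Python. This is an in-memory toy, not the Kafka client API: the `Topic` and `Consumer` classes are hypothetical, and CRC32 is only a stand-in for whatever partitioner a real producer uses. It shows the three ideas from the slide: the producer picks the partition (here by hashing the key), each partition is an append-only log that preserves order, and the consumer tracks its own offset per partition.

```python
import zlib
from collections import defaultdict


class Topic:
    """Toy topic: a fixed number of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash: the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)


class Consumer:
    """Toy consumer that remembers the next offset to read per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self, partition, max_records=10):
        log = self.topic.partitions[partition]
        start = self.offsets[partition]
        records = log[start:start + max_records]
        self.offsets[partition] = start + len(records)
        return records


topic = Topic(num_partitions=3)
for i in range(5):
    topic.send("sensor-a", i)
    topic.send("sensor-b", i * 10)

consumer = Consumer(topic)
p = topic.partition_for("sensor-a")
print(consumer.poll(p, max_records=2))  # first records of that partition
print(consumer.poll(p))                 # continues from the stored offset
```

Because all messages with the same key go to one partition, a consumer sees them in the order they were written, while messages with different keys may be spread over several partitions and carry no ordering relative to each other.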

DEMO

Ingestion Processing Output

The Hadoop Ecosystem

•  HDFS (Distributed File System), Amazon S3, Local FS
•  YARN (Resource Management)
•  MapReduce
•  HBase (NoSQL)
•  Hive (Data Mart)
•  Pig (Scripting)
•  Sqoop (SQL Import/Export)
•  Mahout (Machine Learning)
•  …
•  Spark, Storm, …
•  Spark SQL
•  Spark MLlib

Apache Storm

Spouts

•  Source of streams into the topology
•  Can be reliable or unreliable

•  Support for:
    –  Kafka
    –  Kestrel
    –  RabbitMQ
    –  JMS
    –  Amazon Kinesis
    –  Build your own (e.g. Twitter)

Bolts

•  Where all the processing happens

•  Filtering, functions, aggregations, joins, database updates, …

•  Bolts subscribe to the streams of other components (spouts or other bolts)

•  Must ack every tuple they process
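
Why every tuple must be acked becomes clearer with a small sketch of how completion can be tracked. The Python below is a simplified illustration of the XOR bookkeeping idea described in Storm's reliability documentation, not Storm's actual code: the `Acker` class and its method names are hypothetical. Every tuple id is XOR-ed into a per-root value once when the tuple is emitted and once when it is acked, so the value returns to zero only when every tuple in the tree has been acked.

```python
class Acker:
    """Toy tracker: one XOR accumulator per spout (root) tuple."""

    def __init__(self):
        self.pending = {}  # root id -> XOR of ids of tuples still in flight

    def emitted(self, root_id, tuple_id):
        # Register a tuple that belongs to the tree of `root_id`.
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def acked(self, root_id, tuple_id):
        # XOR-ing the same id a second time cancels it out.
        self.pending[root_id] ^= tuple_id
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True  # the whole tuple tree has been processed
        return False


acker = Acker()
root = "sentence-1"
t1, t2, t3 = 0b001, 0b010, 0b100  # fixed ids for readability; normally random 64-bit

acker.emitted(root, t1)           # spout emits t1
acker.emitted(root, t2)           # a bolt emits t2 and t3, anchored to t1 ...
acker.emitted(root, t3)
print(acker.acked(root, t1))      # ... and only then acks t1 -> False
print(acker.acked(root, t2))      # False, t3 is still in flight
print(acker.acked(root, t3))      # True, nothing left pending
```

A bolt that forgets to ack leaves the accumulator non-zero forever, which in a real topology would eventually be treated as a timeout and a replay from a reliable spout.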

Parallelism

•  Spouts & Bolts actually run as multiple instances on different machines

•  Making sure that the correct message goes to the correct instance is up to the developer

Stream Groupings

•  Defines how a stream should be partitioned among the bolt's tasks

•  Some examples:
    –  Round Robin
    –  Based on key
    –  All
    –  Specific instance
    –  …
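
The four example groupings can be sketched as routing functions in plain Python. This is a simulation under assumptions, not the Storm API: the function names are made up for the illustration, and CRC32 stands in for whatever hash a real implementation uses. In Storm terms they correspond roughly to shuffle, fields, all and direct grouping.

```python
import itertools
import zlib


def round_robin_grouping(num_tasks):
    counter = itertools.count()
    return lambda tup: [next(counter) % num_tasks]  # spread tuples evenly


def key_grouping(num_tasks, field):
    def choose(tup):
        key = str(tup[field]).encode("utf-8")
        return [zlib.crc32(key) % num_tasks]  # same key -> same task
    return choose


def all_grouping(num_tasks):
    return lambda tup: list(range(num_tasks))  # every task gets a copy


def specific_instance_grouping(task_field):
    return lambda tup: [tup[task_field]]  # the producer names the target task


def route(tuples, choose, num_tasks):
    """Deliver each tuple to the task(s) selected by the grouping."""
    tasks = [[] for _ in range(num_tasks)]
    for tup in tuples:
        for task in choose(tup):
            tasks[task].append(tup)
    return tasks


words = [{"word": w} for w in ["a", "b", "a", "c", "a", "b"]]
print([len(t) for t in route(words, round_robin_grouping(3), 3)])  # prints [2, 2, 2]
print([len(t) for t in route(words, all_grouping(3), 3)])          # prints [6, 6, 6]
by_key = route(words, key_grouping(3, "word"), 3)
print([[t["word"] for t in task] for task in by_key])
```

Grouping by key is what makes a per-word counter correct when the counting bolt runs as several instances: every occurrence of a given word reaches the same instance, so no instance holds a partial count.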

Storm Ups and Downs

•  Really real time
•  Very Powerful
•  Built for performance

•  Very low level (comparable to MapReduce)

•  Trivial tasks can become hard (sorting, joins, …)

Spark Streaming

Spark Architecture

Spark Streaming Concepts

Spark Streaming Input

•  Kafka
•  Flume
•  Kinesis
•  Twitter
•  ZeroMQ
•  HDFS
•  TCP Sockets


Windowing

•  You can group multiple batches together into a sliding window.

•  E.g. all the events from the last 60 seconds
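
The windowing idea can be illustrated without Spark. The Python sketch below is a simulation, not the Spark Streaming API: `sliding_windows` is a hypothetical helper, and both the window length and the slide interval are expressed as a number of micro-batches. With 10-second batches, "the last 60 seconds" would be a window length of 6.

```python
from collections import deque


def sliding_windows(batches, window_length, slide_interval):
    """Yield the events of the last `window_length` batches,
    once every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            yield [event for b in window for event in b]


# Five micro-batches of events; one of them is empty.
batches = [[1], [2, 3], [], [4], [5, 6]]
for window in sliding_windows(batches, window_length=3, slide_interval=2):
    print(window)
# prints [1, 2, 3]
# prints [2, 3, 4]
```

Because the slide interval is shorter than the window length, consecutive windows overlap: events 2 and 3 are counted in both results.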

Spark Streaming Strengths

•  Works just like regular Spark processing: replace SparkContext with StreamingContext

•  Full integration with other Spark libraries (Spark SQL, Spark MLlib, …)

•  Ease of development

•  Scalable, fault-tolerant, …

Spark Streaming Example

Ingestion Processing Output

Getting to Your Data

Data output bottlenecks

•  Pig & Hive are quite slow

•  No visual feedback from results

•  Specific calculations (cubing) of metrics
    –  Reporting tools cannot handle the dimensions of the data

Elasticsearch

•  Document store (ideal for denormalized data)

•  Distributed
•  Highly Available

•  Open Source

•  Real Time (Inserts & Searches)


ES-Hadoop

Hive Integration

•  Writing to Elasticsearch from Hive

    CREATE EXTERNAL TABLE artists (
        id      BIGINT,
        name    STRING,
        links   STRUCT<url:STRING, picture:STRING>)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource' = 'radio/artists');

    -- insert data to Elasticsearch from another table called 'source'
    INSERT OVERWRITE TABLE artists
        SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
        FROM source s;

Hive Integration

•  Reading from Elasticsearch in Hive

    CREATE EXTERNAL TABLE artists (
        id      BIGINT,
        name    STRING,
        links   STRUCT<url:STRING, picture:STRING>)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource' = 'radio/artists',
                  'es.query' = '?q=me*');

    -- stream data from Elasticsearch
    SELECT * FROM artists;

Pig Integration

•  Writing to Elasticsearch from Pig

    -- load data from HDFS into Pig using a schema
    A = LOAD 'src/test/resources/artists.dat' USING PigStorage()
        AS (id:long, name, url:chararray, picture:chararray);
    -- transform data
    B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
    -- save the result to Elasticsearch
    STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();

Pig Integration

•  Reading from Elasticsearch in Pig

    -- execute Elasticsearch query and load data into Pig
    A = LOAD 'radio/artists'
        USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
    DUMP A;

Spark Integration

•  Writing to Elasticsearch from Spark

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.elasticsearch.spark._

    val conf = ...
    val sc = new SparkContext(conf)

    // Create RDD here

    rdd.saveToEs("spark/docs")

Spark Integration

•  Reading from Elasticsearch in Spark

    ...
    import org.elasticsearch.spark._

    ...
    val conf = ...
    val sc = new SparkContext(conf)

    sc.esRDD("radio/artists", "?q=me*")

Storm Integration

•  Writing to Elasticsearch from Storm

    import org.elasticsearch.storm.EsBolt;

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 10);
    builder.setBolt("es-bolt", new EsBolt("storm/docs"), 5)
           .shuffleGrouping("spout");

Storm Integration

•  Reading from Elasticsearch in Storm

    import org.elasticsearch.storm.EsSpout;

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("es-spout", new EsSpout("storm/docs", "?q=me*"), 5);
    builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");

Visualizing data

Kibana

•  Visualization tool on top of Elasticsearch

•  Allows ad-hoc querying & graphing

•  Support for real time updates

•  Create your own dashboards

Demo

Wrap Up

Ingestion Processing Output
