23
Simultaneous Analysis of Massive Data Streams in Real-Time and Batch Anjana Fernando Technical Lead WSO2

Simultaneous analysis of massive data streams in real time and batch

Embed Size (px)

DESCRIPTION

My WSO2Con Asia 2014 Talk

Citation preview

Page 1: Simultaneous analysis of massive data streams in real time and batch

Simultaneous Analysis of Massive Data Streams in Real-Time and Batch

Anjana Fernando

Technical Lead

WSO2

Page 2: Simultaneous analysis of massive data streams in real time and batch

Agenda

• How massive data streams created• How to receive• How to store• How to analyze, batch vs real-time• WSO2 Big Data solution• Demo

Page 3: Simultaneous analysis of massive data streams in real time and batch

Massive Data Streams -> Data Streams with Big Data

Page 4: Simultaneous analysis of massive data streams in real time and batch

What is Big Data?

❏ The 3 Vs❏ Velocity❏ Volume❏ Variety

Image Source: http://akrayasolutions.com/big-data/

Page 5: Simultaneous analysis of massive data streams in real time and batch

Where does it originate from?

• Machine logs• Social media• Archives• Traffic information• Weather data• Sensor data (IoT)

Page 6: Simultaneous analysis of massive data streams in real time and batch

What do I do with it?

Create intelligence..

• Should I take an umbrella to work today?• What is the best route to go back home?• What are the current market trends?• Are my servers running healthily?

Page 7: Simultaneous analysis of massive data streams in real time and batch

Protocols used to publish data..

• HTTP• MQTT • Zigbee• Thrift• Avro• ProtoBuf

Page 8: Simultaneous analysis of massive data streams in real time and batch

How to store the data?• Relational databases • Block data stores -> HDFS• Column oriented -> HBase -> Cassandra• Document based -> MongoDB -> CouchDB• In-Memory -> VoltDB

A

C P

Page 9: Simultaneous analysis of massive data streams in real time and batch

How to analyse data?

• Two options:

-> Batch processing: Schedule data processing jobs and receive the processed data later

-> Real-time processing: The queries are executed and the results are retrieved instantly

Page 10: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Batch processing -> Apache Hadoop: Map/Reduce processing system and a distributed file system

Page 11: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Batch processing - Data Warehouse -> Apache Hive - Hadoop based framework for working on large scale data stores with SQL-like queries

INSERT OVERWRITE TABLE UserTable SELECT userName, COUNT(DISTINCT orderID),SUM(quantity) FROM PhoneSalesTable WHERE version= "1.0.0" GROUP BY userName;

Page 12: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Batch processing - In-Memory Computing -> Apache Spark - Functional programming model, in-memory computing, claims 10x - 100x faster than Hadoop

Page 13: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Real-time processing - Stream Processing -> Apache Storm - Distributed and fault-tolerant

Spouts Bolts

Page 14: Simultaneous analysis of massive data streams in real time and batch

Analysing data..

• Real-time processing - Complex Event Processing -> WSO2 Siddhi:

Page 15: Simultaneous analysis of massive data streams in real time and batch
Page 16: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• Data Streams {

'name':'phone.retail.shop','version':'1.0.0','nickName': 'Phone_Retail_Shop','description': 'Phone Sales','metaData':[ {'name':'clientType','type':'STRING'}],'payloadData':[ {'name':'brand','type':'STRING'}, {'name':'quantity','type':'INT'}, {'name':'total','type':'INT'}, {'name':'user','type':'STRING'}]

}

The common stream format used in both CEP and BAM; The stream definition contains the stream name, version and other attributes that makes up the stream.

Page 17: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• WSO2 BAM-> Data Receiver - High performance binary format data publishing with Apache Thrift, shared with WSO2 CEP-> Data Storage - Cassandra for highly scalable data store-> Data Analyzer - Hive based batch processing

Page 18: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• WSO2 BAM..-> Activity Monitoring: Implemented using a custom indexing mechanism to instantly search for events of a specific activity in the system

Page 19: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• WSO2 BAM..-> Incremental Data Processing - Customized Hive to support incremental data processing:

@Incremental(name="salesAnalysis" , tables="PhoneSalesTable") SELECT brandname, Count(DISTINCT orderid), Sum(quantity) FROM phonesalestable WHERE version = "1.0.0" GROUP BY brandname;

Page 20: Simultaneous analysis of massive data streams in real time and batch

Big Data Architecture with WSO2..

• WSO2 CEP-> Same data receiver as BAM, where this is the point where the same event is sent to both servers, where BAM for batch processing and CEP for real-time processing of the same data streams-> Real-time in-memory processing, based on WSO2 Siddhi engine, with data adapters for receiving and sending event with different data types and transports, e.g. XML, JSON, Text, HTTP, JMS, SMTP

Page 21: Simultaneous analysis of massive data streams in real time and batch

Demo

Page 22: Simultaneous analysis of massive data streams in real time and batch

Questions?

Page 23: Simultaneous analysis of massive data streams in real time and batch

Thank you!