23
Simultaneous Analysis of Massive Data Streams in Real-Time and Batch Anjana Fernando Technical Lead WSO2

WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

  • Upload
    wso2

  • View
    452

  • Download
    2

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Simultaneous Analysis of Massive Data Streams in Real-Time and Batch

Anjana Fernando

Technical Lead

WSO2

Page 2: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Agenda

• How massive data streams created• How to receive• How to store• How to analyze, batch vs real-time• WSO2 Big Data solution• Demo

Page 3: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Massive Data Streams -> Data Streams with Big Data

Page 4: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

What is Big Data?

❏ The 3 Vs❏ Velocity❏ Volume❏ Variety

Image Source: http://akrayasolutions.com/big-data/

Page 5: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Where does it originate from?

• Machine logs• Social media• Archives• Traffic information• Weather data• Sensor data (IoT)

Page 6: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

What do I do with it?

Create intelligence..

• Should I take an umbrella to work today?• What is the best route to go back home?• What are the current market trends?• Are my servers running healthily?

Page 7: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Protocols used to publish data..

• HTTP• MQTT • Zigbee• Thrift• Avro• ProtoBuf

Page 8: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

How to store the data?• Relational databases • Block data stores -> HDFS• Column oriented -> HBase -> Cassandra• Document based -> MongoDB -> CouchDB• In-Memory -> VoltDB

A

C P

Page 9: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

How to analyse data?

• Two options:

-> Batch processing: Schedule data processing jobs and receive the processed data later

-> Real-time processing: The queries are executed and the results are retrieved instantly

Page 10: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Analysing data..

• Batch processing -> Apache Hadoop: Map/Reduce processing system and a distributed file system

Page 11: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Analysing data..

• Batch processing - Data Warehouse -> Apache Hive - Hadoop based framework for working on large scale data stores with SQL-like queries

INSERT OVERWRITE TABLE UserTable SELECT userName, COUNT(DISTINCT orderID),SUM(quantity) FROM PhoneSalesTable WHERE version= "1.0.0" GROUP BY userName;

Page 12: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Analysing data..

• Batch processing - In-Memory Computing -> Apache Spark - Functional programming model, in-memory computing, claims 10x - 100x faster than Hadoop

Page 13: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Analysing data..

• Real-time processing - Stream Processing -> Apache Storm - Distributed and fault-tolerant

Spouts Bolts

Page 14: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Analysing data..

• Real-time processing - Complex Event Processing -> WSO2 Siddhi:

Page 15: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch
Page 16: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Big Data Architecture with WSO2..

• Data Streams {

'name':'phone.retail.shop','version':'1.0.0','nickName': 'Phone_Retail_Shop','description': 'Phone Sales','metaData':[ {'name':'clientType','type':'STRING'}],'payloadData':[ {'name':'brand','type':'STRING'}, {'name':'quantity','type':'INT'}, {'name':'total','type':'INT'}, {'name':'user','type':'STRING'}]

}

The common stream format used in both CEP and BAM; The stream definition contains the stream name, version and other attributes that makes up the stream.

Page 17: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Big Data Architecture with WSO2..

• WSO2 BAM-> Data Receiver - High performance binary format data publishing with Apache Thrift, shared with WSO2 CEP-> Data Storage - Cassandra for highly scalable data store-> Data Analyzer - Hive based batch processing

Page 18: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Big Data Architecture with WSO2..

• WSO2 BAM..-> Activity Monitoring: Implemented using a custom indexing mechanism to instantly search for events of a specific activity in the system

Page 19: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Big Data Architecture with WSO2..

• WSO2 BAM..-> Incremental Data Processing - Customized Hive to support incremental data processing:

@Incremental(name="salesAnalysis" , tables="PhoneSalesTable") SELECT brandname, Count(DISTINCT orderid), Sum(quantity) FROM phonesalestable WHERE version = "1.0.0" GROUP BY brandname;

Page 20: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Big Data Architecture with WSO2..

• WSO2 CEP-> Same data receiver as BAM, where this is the point where the same event is sent to both servers, where BAM for batch processing and CEP for real-time processing of the same data streams-> Real-time in-memory processing, based on WSO2 Siddhi engine, with data adapters for receiving and sending event with different data types and transports, e.g. XML, JSON, Text, HTTP, JMS, SMTP

Page 21: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Demo

Page 22: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Questions?

Page 23: WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch

Thank you!