2015 © Trivadis
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
Blueprints for the analysis of social media
Java Lounge Zurich, May 2015
Guido Schmutz
Trivadis AG
May 2015 Blueprints for the analysis of social media
Guido Schmutz
§ Working for Trivadis for more than 18 years
§ Oracle ACE Director for Fusion Middleware and SOA
§ Co-author of several books
§ Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
§ Member of Trivadis Architecture Board
§ Technology Manager @ Trivadis
§ More than 25 years of software development experience
§ Contact: [email protected]
§ Blog: http://guidoschmutz.wordpress.com
§ Twitter: gschmutz
A little story of a “real-life” customer situation
A traditional system interacts with its clients and does its work
Implemented using legacy technologies (i.e. PL/SQL)
New requirement:
• Offer a notification service to notify customers when goods are shipped
• Subscription and notification over different channels
• Existing technology doesn’t fit
[Diagram: Logistic System on Oracle (DB, Logic in PL/SQL) serving Mobile Apps, Rich (Web) Client Apps and Sensors; events: ship, sort, schedule, delivery]
A little story of a “real-life” customer situation
Oracle Service Bus was already there
Rule Engine implemented in Java and invoked from the OSB message flow
Notification system informed via queue
Higher latency introduced (good enough in this case)
Events are “owned” by the traditional application (as well as the channels they are transported over) => you have to integrate with it in order to get the information!
[Diagram: the Logistic System (Oracle DB, Logic in PL/SQL, AQ) is extended with Oracle Service Bus (Filter), a Rule Engine (Java) and a Notification component (Logic in Java) reached via JMS; ship and delivery events pass the filter, and delivery events trigger a notification, e.g. by SMS]
Treat events as first-class citizens
Events belong to the “enterprise” and not to an individual system => a Catalog of Events, similar to a Catalog of Services/APIs!
Event (stream) processing can then be introduced, and latency reduced with it!
Stream/Event Processing?
Infrastructure for continuous data processing
Computational model can be as general as MapReduce but with the ability to produce low-latency results
Data collected continuously is naturally processed continuously
Also known as Event Processing / Complex Event Processing (CEP)
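The idea that continuously collected data is processed continuously can be sketched in a few lines of plain Python. This is only an illustration and not tied to any product; the function name `running_average` and the sample values are made up for the example.

```python
from typing import Iterable, Iterator


def running_average(events: Iterable[float]) -> Iterator[float]:
    """Yield an updated average after every event instead of waiting for a batch."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        # A result is available as soon as the event has arrived (low latency).
        yield total / count
```

Feeding it the readings 10.0, 20.0 and 30.0 yields a new average after each reading, whereas a batch job would produce one result only after all data has been collected.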
Agenda
1. Designing Stream/Event Processing Solutions
2. Implementing the Enterprise Event Bus (Unified Log)
3. Implementing Stream Processing
4. Unified Log (Event) Processing Architecture in Action
How to design a Stream Processing System?
It usually starts very simple … with just one data pipeline
[Diagram: Event Stream -> Collector -> Consumer]
New Event Stream sources are added …
[Diagram: 1st to nth Event Stream, each with its own Collector, all feeding one Consumer]
New Processors are interested in the events …
[Diagram: 1st to nth Event Stream, each with its own Collector, now feeding a Consumer and a 2nd Consumer]
… and the solution becomes the problem
[Diagram: 1st to nth Event Stream, each with its own Collector, wired to the 1st to nth Consumer]
[Diagram: the same situation with concrete names. Event streams: New Customers, Operational Logs, Click Stream, Meter Readings; collectors: CDC Collector, Log Collector, Click Stream Collector, Sensor Collector; consumers: Hadoop/Data Warehouse, Recommendation System, Log Search, Fraud Detection]
Decouple event streams from consumers
“Unified Log” / Enterprise Event Bus
Remember Enterprise Service Bus (ESB)?
[Diagram: Event Stream Sources (New Customers, Operational Logs, Click Stream, Meter Readings) with their collectors (CDC Collector, Log Collector, Click Stream Collector, Sensor Collector) write to the Unified Log; Event Stream Processors (Hadoop/Data Warehouse, Recommendation System, Log Search, Fraud Detection) read from it]
What is the idea of a Unified Log?
Unified Log – What is it?
By Unified Log, we do not mean this …
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-admin/images/date-button.gif HTTP/1.1" 200 111
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 200 13593
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114
137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809
… but this
• a structured log (records are numbered from 0 in the order they are written)
• also known as a commit log or journal
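A structured log of this kind can be sketched as a small in-memory class. This is a hypothetical illustration (the class `AppendOnlyLog` and its methods are invented for the example, not taken from any product): records are only ever appended, and each one gets the next sequential offset, starting at 0.

```python
from typing import List


class AppendOnlyLog:
    """Hypothetical in-memory log: records get sequential offsets starting at 0."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        """Append a record at the end and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[str]:
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]
```

There is deliberately no update or delete method: existing records never change, so a reader only needs to remember an offset to know where to continue.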
[Diagram: a log with records numbered 0 to 11; record 0 is the 1st record, the next record is written after record 11]
Central Unified Log for (real-time) subscription
Take all the organization’s data (events) and put it into a central log for subscription
Properties of the Unified Log:
• Unified: “Enterprise”, single deployment
• Append-Only: events are appended, no update in place => immutable
• Ordered: each event has an offset, which is unique within a shard
• Fast: should be able to handle thousands of messages / sec
• Distributed: lives on a cluster of machines
[Diagram: a Collector writes to the end of the log (records 0 to 11) while Consumer System A (time = 6) and Consumer System B (time = 10) read from it at their own positions]
Unified Log / Event Processing Architecture
Stream processing allows for computing feeds off other feeds
Derived feeds are no different from the original feeds they are computed from
Single deployment of the “Unified Log”, but logically different feeds
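Computing one feed off another can be illustrated with the meter readings used on this slide. The sketch below is an assumption-laden example: the shape of a reading, `(meter_id, "HH:MM:SS", value)`, and the function name `aggregate_by_minute` are invented, and a finite list stands in for the raw feed to keep it short.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# A raw reading is (meter_id, "HH:MM:SS", value); this shape is illustrative only.
Reading = Tuple[str, str, float]


def aggregate_by_minute(raw_feed: Iterable[Reading]) -> Dict[Tuple[str, str], float]:
    """Derive a 'meter by minute' feed by summing readings per meter and minute."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for meter_id, timestamp, value in raw_feed:
        minute = timestamp[:5]  # "11:59:07" -> "11:59"
        totals[(meter_id, minute)] += value
    return dict(totals)
```

The result could be written back to the log as a feed of its own, where further processors consume it exactly like a raw feed.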
[Diagram: processors Meter Readings Collector, Enrich / Transform, Aggregate by Minute and Persist, connected through the feeds Raw Meter Readings, Meter & Customer, Meter by Customer by Minute, Customer Aggregate by Minute and Meter by Minute; both Raw Meter Readings and Meter by Minute are persisted]
Agenda
1. Designing Stream Processing Solutions
2. Implementing the Enterprise Event Bus (Unified Log)
3. Implementing Stream Processing
4. Unified Log (Event) Processing Architecture in Action
Apache Kafka - Overview
• A distributed publish-subscribe messaging system
• Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …)
• Initially developed at LinkedIn, now part of Apache
• Does not follow JMS Standards and does not use JMS API
• Kafka maintains feeds of messages in topics
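How a topic is split into partitions, each an append-only sequence of its own, can be sketched in plain Python. This does not use the Kafka client API, and the details are assumptions made for illustration: CRC32 stands in for the real hash function, and the class `Topic` with its `publish` method is invented.

```python
import zlib
from typing import List, Tuple


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition (CRC32 is a stand-in hash for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """Hypothetical topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, message: str) -> Tuple[int, int]:
        """Append to the partition chosen by key; return (partition, offset)."""
        partition = choose_partition(key, len(self.partitions))
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1
```

Messages with the same key land in the same partition and keep their relative order there; across partitions no order is defined.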
[Diagram: Producers write to the Kafka Cluster, Consumers read from it]
[Diagram: Anatomy of a topic. Partition 0 (messages 0 to 12), Partition 1 (messages 0 to 9) and Partition 2 (messages 0 to 12), ordered from old to new; writes go to the new end]
http://kafka.apache.org/
Apache Kafka - Performance
Kafka at LinkedIn
• 10+ billion writes per day
• 172k messages per second (average)
• 55+ billion messages per day to real-time consumers
Up to 2 million writes/sec on 3 cheap machines
§ Using 3 producers on 3 different machines
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Apache Kafka - Partition offsets
Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
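The pointer tracking described above can be sketched as follows. This is a plain-Python illustration rather than the Kafka consumer API; the class `OffsetTracker` and its methods are hypothetical, and a list stands in for one partition.

```python
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition)


class OffsetTracker:
    """Hypothetical helper: remembers the next offset to read per (topic, partition)."""

    def __init__(self) -> None:
        self._next: Dict[TopicPartition, int] = {}

    def position(self, topic: str, partition: int) -> int:
        """Offset of the next message to read; 0 if nothing has been read yet."""
        return self._next.get((topic, partition), 0)

    def poll(self, topic: str, partition: int, log: List[str],
             max_records: int = 2) -> List[str]:
        """Read from the current position and advance the pointer past what was read."""
        start = self.position(topic, partition)
        batch = log[start:start + max_records]
        self._next[(topic, partition)] = start + len(batch)
        return batch
```

Because the pointer belongs to the consumer and not to the log, several consumers can read the same partition at different positions without affecting each other.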
[Diagram: Consumer group C1 reading from the partitions at its own offsets]
Unified Log Alternatives
• Amazon Kinesis (http://aws.amazon.com/kinesis/)
• Confluent (http://confluent.io/)
• Redis Pub/Sub (http://redis.io/topics/pubsub)
• Kestrel (http://robey.github.io/kestrel/)
• ZeroMQ (http://zeromq.org/)
• RabbitMQ (http://www.rabbitmq.com/)
• Oracle GoldenGate (http://bit.ly/g-gate)
• JMS compliant Server
  § Apache ActiveMQ (http://activemq.apache.org/)
  § WebLogic JMS (http://www.oracle.com/technetwork/middleware/weblogic/overview/index.html)
  § IBM WebSphere MQ (http://www-03.ibm.com/software/products/de/ibm-mq)
  § …
Apache Storm
A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
• Highly distributed real-time computation system
• Provides general primitives to do real-time computation
• Simplifies working with queues & workers
• Scalable and fault-tolerant
Originated at BackType, acquired by Twitter in 2011
Open sourced late 2011
Part of Apache Incubator since September 2013
https://storm.apache.org/
Apache Storm – Core concepts
Tuple
• Immutable set of key/value pairs
Stream
• An unbounded sequence of tuples that can be processed in parallel by Storm
Topology
• Wires data and functions via a DAG (directed acyclic graph)
• Executes on many machines, similar to a MR job in Hadoop
Spout
• Source of data streams (tuples)
• Can be run in “reliable” and “unreliable” mode
Bolt
• Consumes 1+ streams and produces new streams
• Complex operations often require multiple steps and thus multiple bolts
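How these concepts fit together can be shown with a toy pipeline in plain Python. It does not use the Storm API and runs in a single process; generators stand in for streams, and all names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are invented for the illustration.

```python
from typing import Dict, Iterable, Iterator

StormTuple = Dict[str, str]  # a tuple modelled as key/value pairs


def sentence_spout(sentences: Iterable[str]) -> Iterator[StormTuple]:
    """Spout: the source of a stream of tuples."""
    for sentence in sentences:
        yield {"sentence": sentence}


def split_bolt(stream: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Bolt: consumes the sentence stream and emits a new stream of words."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word.lower()}


def count_bolt(stream: Iterable[StormTuple]) -> Dict[str, int]:
    """Terminal bolt: consumes the word stream and keeps a count per word."""
    counts: Dict[str, int] = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
    return counts


def run_topology(sentences: Iterable[str]) -> Dict[str, int]:
    """Topology: wires spout -> split bolt -> count bolt (a linear DAG)."""
    return count_bolt(split_bolt(sentence_spout(sentences)))
```

In Storm itself each spout and bolt would run as many parallel tasks on a cluster; the sketch only shows how the pieces are wired.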
[Diagram: example topology with two Spouts (one of them the source of Stream B) and four Bolts: one subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, one subscribes to C & D and emits nothing; tuples (T) flow along the streams]
Stream Processing Alternatives
• Apache Samza (http://samza.incubator.apache.org)
• Apache S4 (http://incubator.apache.org/s4/)
• Apache Spark Streaming (http://spark.apache.org/streaming/)
• Google MillWheel (http://research.google.com/pubs/pub41378.html)
• Akka Streams (http://akka.io)
• Complex Event Processing
  § Esper (http://esper.codehaus.org/)
  § WSO2 Complex Event Processor (http://wso2.com/products/complex-event-processor/)
  § Oracle Event Processing (http://www.oracle.com/technetwork/middleware/complex-event-processing/overview/index.html)
  § TIBCO BusinessEvents & TIBCO StreamBase (http://www.tibco.com/products/event-processing/complex-event-processing)
  § IBM InfoSphere (http://www-01.ibm.com/software/data/infosphere/)
  § Microsoft StreamInsight (http://msdn.microsoft.com/de-ch/sqlserver/ee476990.aspx)
  § …
Agenda
1. Designing Stream Processing Solutions
2. Implementing the Enterprise Event Bus (Unified Log)
3. Implementing Stream Processing
4. Unified Log (Event) Processing Architecture in Action
Unified Log Processing Architecture in Trivadis CRA
[Diagram: Sensor Layer (Twitter Filter Stream, Tweets Consumer), Distribution Layer (Kafka, with the feeds Tweet, Filtered Tweets, Words, Count by Minute and Social Graph) and Speed Layer (Storm, with the steps Filter and Unify, Persist Tweet, Split Text, Remove Stopwords, Count over Time and Persist Graph); results are stored in Cassandra, Elasticsearch and Titan]
[Diagram: the same architecture with one Storm Topology shown in detail: a Kafka Spout reads from Kafka, Splitter and Word Remover bolts run in parallel and are connected via Shuffle and Fields groupings, and the result is written back to Kafka]
Storm Topology
[Diagram: the Twitter Spout emits the tweet “Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca”; Shuffle Grouping distributes tweets across three Sentence Splitter bolts, which emit the words barca, real, juve, bayern, fcb and barca]
[Diagram: two Word Counter bolts are added behind the Sentence Splitters; Fields Grouping on the word sends every occurrence of the same word (e.g. both “barca”) to the same Word Counter]
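The difference between the two groupings can be sketched as two routing functions that pick the target task for a tuple. This is a simplified plain-Python illustration, not Storm code: round-robin stands in for random shuffling, CRC32 stands in for the hash Storm applies to the grouping fields, and both function names are invented.

```python
import itertools
import zlib
from typing import Callable


def shuffle_grouping(num_tasks: int) -> Callable[[str], int]:
    """Spread tuples evenly over tasks (round-robin stands in for random shuffling)."""
    counter = itertools.count()
    return lambda _value: next(counter) % num_tasks


def fields_grouping(num_tasks: int) -> Callable[[str], int]:
    """Route by a hash of the field value, so equal values always hit the same task."""
    return lambda value: zlib.crc32(value.encode("utf-8")) % num_tasks
```

Shuffle grouping suits stateless steps such as splitting a sentence. Counting needs fields grouping, because a per-word counter is only correct if one task sees all occurrences of that word.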
[Diagram: each Word Counter increments a counter per incoming word (INCR barca, INCR real, INCR juve, INCR bayern, INCR fcb), resulting in real = 1, juve = 1, bayern = 1, fcb = 1 and barca = 2]
[Diagram: every 30 sec the Word Counters emit their counts (real = 1, juve = 1, bayern = 1, barca = 2, fcb = 1) via Global Grouping to a single Persist bolt, which issues INCR real 1, INCR juve 1, INCR barca 2, INCR bayern 1 and INCR fcb 1]
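Counting in memory and persisting only the per-window totals can be sketched as follows. It is a plain-Python illustration with invented class names (`WordCounter`, `Persist`); the timer that would trigger `flush` every 30 seconds is left out, and a dictionary stands in for the store.

```python
from typing import Dict, List, Tuple


class WordCounter:
    """Counts words in memory and hands the counts over when flushed."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def execute(self, word: str) -> None:
        """Called once per incoming word."""
        self.counts[word] = self.counts.get(word, 0) + 1

    def flush(self) -> List[Tuple[str, int]]:
        """Return (word, count) pairs for the elapsed window and reset the counter."""
        emitted = sorted(self.counts.items())
        self.counts = {}
        return emitted


class Persist:
    """Stand-in for the store: applies increments such as 'INCR barca 2'."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = {}

    def incr(self, word: str, amount: int) -> None:
        self.totals[word] = self.totals.get(word, 0) + amount
```

Two occurrences of “barca” within one window thus cost a single write of 2 instead of two writes of 1, which reduces the load on the store at the price of up to one window of delay.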
Elasticsearch
Open Source Search & Analytics engine
• Structured & unstructured data
• Real-time Analytics
• Percolator Index
• Analytics capabilities (facets)
• REST based
• Schema-free
• Distributed
• Lightweight
• Built on top of Apache Lucene
Kibana Dashboards
https://www.elastic.co/
Cassandra
• Developed at Facebook
• Open source distributed database management system
• Professional grade support from company called DataStax
• Main Features
  § Real-Time
  § Highly Distributed
  § Support for Multiple Data Centers
  § Highly Scalable
  § No Single Point of Failure
  § Fault Tolerant
  § Tunable Consistency
  § Cassandra Query Language (CQL)
http://www.datastax.com/
Table TWEET_COUNT
Sensor  Bucket
AFG10   MINUTE-2014/03/05  key    IBM      IBM     IBM     …  Oracle  Oracle  …
                           at     11:59    11:58   11:57   …  11:59   11:58   …
                           count  10       4       6       …  2       4       …
AFG10   HOUR-2014/03       key    IBM      IBM     IBM     …  Oracle  Oracle  …
                           at     5T11     5T10    5T09    …  5T11    5T10    …
                           count  148      108     111     …  29      41      …
AFG10   DAY-2014           key    IBM      IBM     IBM     …  Oracle  Oracle  …
                           at     5T       3T      2T      …  5T      4T      …
                           count  10100.2  9892.2  8987.4  …  879.8   912.3   …
GXK11   MINUTE-2014/03/5   key    NoSQL    NoSQL   NoSQL   …  Hadoop  Hadoop  …
                           at     11:59    11:03   11:04   …  11:01   11:02   …
                           count  5        9       12      …  2       1       …
Growth of a MINUTE bucket
24 h * 60 min * n keys = n * 1,440 columns
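How a tweet timestamp maps to a row of this table, and how wide a minute bucket can get, can be worked through in a few lines. The function names and the exact string formats are assumptions derived from the sample values shown on the slide (e.g. `MINUTE-2014/03/05` and `11:59`).

```python
from datetime import datetime
from typing import Tuple


def minute_bucket(sensor: str, ts: datetime) -> Tuple[str, str, str]:
    """Return (sensor, bucket, column time) for the per-minute counter row."""
    bucket = "MINUTE-" + ts.strftime("%Y/%m/%d")
    return sensor, bucket, ts.strftime("%H:%M")


def columns_per_day(num_keys: int) -> int:
    """Upper bound of columns in one minute bucket: 24 h * 60 min * n keys."""
    return 24 * 60 * num_keys
```

With 3 keys tracked for a sensor, one day's minute bucket holds at most 3 * 1,440 = 4,320 columns, so a row stays bounded because each day starts a new bucket.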
Titan Graph Database
Optimized to work against billions of nodes and edges
Works with several different distributed DBs
• Apache Cassandra
• Apache HBase
• Oracle BerkeleyDB
Supports concurrent users doing complex graph traversals
Integration with TinkerPop stack
Supports integration with search technologies such as Lucene and Elasticsearch
http://thinkaurelius.github.io/titan/
Property Graph
Node / Vertex
• can have zero or more edges connected to it
Edge
• connects two nodes
• may be directed or undirected
[Diagram: example property graph with the vertices User [id, name], Post [id, message, time] and Term [name, type] and the edges author, follow, uses, retweet and mention]
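The property graph model can be sketched with a small in-memory class. This is plain Python, not the Titan or TinkerPop API; the class `PropertyGraph` and its methods are invented, and the direction chosen for edges in the usage note (Post to User for author, Post to Term for uses) is an assumption.

```python
from typing import Dict, List, Tuple


class PropertyGraph:
    """Minimal in-memory property graph: vertices with properties, labelled directed edges."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Dict[str, str]] = {}
        self.edges: List[Tuple[str, str, str]] = []  # (from_id, label, to_id)

    def add_vertex(self, vertex_id: str, **properties: str) -> None:
        """Add a vertex with arbitrary key/value properties."""
        self.vertices[vertex_id] = dict(properties)

    def add_edge(self, from_id: str, label: str, to_id: str) -> None:
        """Connect two vertices with a labelled, directed edge."""
        self.edges.append((from_id, label, to_id))

    def out(self, vertex_id: str, label: str) -> List[str]:
        """Follow outgoing edges with the given label and return the target ids."""
        return [t for f, l, t in self.edges if f == vertex_id and l == label]
```

A Post vertex could, for example, be linked to its User with `add_edge("p1", "author", "u1")` and to a Term with `add_edge("p1", "uses", "t1")`; `out("p1", "uses")` then returns the terms used by that post.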
Titan Graph Database
Titan can integrate with distributed architectures in a few different ways
Remote Server
• Connects remotely to cluster
• Can scale size as far as cluster can
• Native Titan API
• Possible processing bottleneck
Remote Server with Rexster
• Put Rexster in front to allow RESTful access
• Connects remotely to cluster
• Can scale size as far as cluster can
• Possible processing bottleneck
Embedded
• TitanDB and Rexster run on each node in cluster
• Can run on same JVM
• Considerable performance/stability improvement
Tinkerpop Stack
Different components all built on each other
Provides abstraction
Blueprints underpins the stack, making it all DB agnostic
Blueprints implementations
• Neo4j
• Oracle NoSQL
• Titan
• FluxGraph
• Foundation DB
• MongoDB
• …
TinkerPop3 is on its way …
http://tinkerpop.incubator.apache.org/
Tinkerpop - Gremlin
Graph traversal scripting language
https://github.com/tinkerpop/gremlin/wiki
Tinkerpop - Rexster
Provides REST and binary protocols
Flexible extension model (e.g. ad-hoc Gremlin queries)
Server-side stored procedures (Gremlin)
Browser-based interface (Dog House)
Command-line tool for interacting with API
SPARQL plugin to work against sail graphs (OpenRDF)
https://github.com/tinkerpop/rexster/wiki
Keylines - Visualizing Graphs
Toolkit for visualizing graphs
Compatible with any modern browser
• HTML 5 or Flash (fall-back)
Compatible with any graph database
Powerful visualization features
Built-in social network analysis
http://keylines.com
Questions and answers...