Streaming computing: architectures and technologies


Some loose thoughts about the latest buzzwords: streaming computing, real-time processing, and in-memory computing.


Streaming Computing

Some thoughts and technology choices for event-driven processing

Natalino Busa - 29 Aug. 2013

Outline

● Concurrency
● Streaming computing
● Technologies
  ○ Gigaspaces
  ○ Storm
  ○ Akka
● Comparison matrix
● Opportunities

Algorithms: a tribute

Numbers and Algorithms:

9th century Persian Muslim mathematician Abu Abdullah Muhammad ibn Musa Al-Khwarizmi,

whose work built upon that of the 7th century Indian mathematician Brahmagupta.

We owe a lot to these guys!

Why do we need parallelism?

The data gets bigger,

a single core doesn't get much faster,

BUT

we get more cores in a chip.

More cores = more parallelism. We are happy now, right?

Moore’s law

Every 18 months, the number of CPU cores doubles.

Another interpretation:

Every 18 months, the number of idle CPU cores doubles.

More parallelism

We trade:

Time vs. (CPU, memory, I/O)

Modern applications

Scalability:

Vertical: concurrency (use all the cores, memory, and I/O of a given machine)

Horizontal: distribution (use all the machines in the cluster)

High availability: fault tolerance at all levels (local, distributed)

(the Terminator effect: you can stop it, but you can't kill it)

Streaming applications

Performance: efficient use of resources (CPU and memory, but also OS threads and sockets)

Asynchronous: event-driven, reacts to new data

Distributed: more machines = more performance; the algorithm is partitioned and/or replicated across the cluster

What to increase?

More CPU: It helps when there is

computation involved

More MEMORY: It helps when there is

more state to keep

More I/O: It helps when there are

more messages to transfer

Streaming or batch?

[Diagram: data flows from a source system, through our system (processing), to a target system.]

What differentiates streaming from batch?

● Granularity of Data
● Granularity of Processing

Granularity impacts throughput, latency, and the cost of the system!

The choice is yours

Scenario: 1000 events/sec (1 KB/event), running on 100 cores all day long.

BATCH (Hadoop): "Wait a day, then process."

1000 events/sec × 86,400 sec/day ≈ 86 M events ≈ 86 GB of data

Latency: 24 hours. Throughput: 1 update/day.

STREAMING (Akka): "Do not wait."

Process 1 KB of data each millisecond.

Latency: 1 ms. Throughput: 1000 updates/sec.

"Both are valid options. It depends on the application domain and the requirements/specs of the target and source systems."
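A back-of-the-envelope sketch of the same trade-off; the figures are only the assumed ones from the slide (1000 events/sec at 1 KB/event):

```scala
// Batch vs. streaming, using the slide's assumed figures: 1000 events/sec at 1 KB/event.
object BatchVsStreaming extends App {
  val eventsPerSec  = 1000L
  val eventSizeKB   = 1L
  val secondsPerDay = 24L * 60 * 60

  val eventsPerDay = eventsPerSec * secondsPerDay        // 86,400,000 events (~86 M)
  val dataPerDayGB = eventsPerDay * eventSizeKB / 1e6    // ~86.4 GB (decimal GB)

  println(s"Batch (Hadoop):   $eventsPerDay events/day, ~$dataPerDayGB GB/day, latency 24 h, 1 update/day")
  println(s"Streaming (Akka): $eventSizeKB KB per event, latency ~1 ms, $eventsPerSec updates/sec")
}
```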

Mapping it to existing applications

                          Granularity of Data    Granularity of Processing
Traditional DB systems    256 GB                 1 CPU
Big Data (Hadoop)         256 GB                 100 CPUs
Traditional mail server   1 KB                   1 CPU
Web application server    1 KB                   100 CPUs

Technologies: Gigaspaces

Technologies: Storm

● Topology
● Supervising
● Scaling
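A minimal sketch of what a Storm topology looks like, assuming the classic backtype.storm API of that era; the EventSpout and CountBolt classes, the field names, and the parallelism figures are all hypothetical, not from the original deck:

```scala
import backtype.storm.{Config, LocalCluster}
import backtype.storm.spout.SpoutOutputCollector
import backtype.storm.task.TopologyContext
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
import backtype.storm.tuple.{Fields, Tuple, Values}
import scala.util.Random

// Hypothetical spout: emits a random event id every few milliseconds.
class EventSpout extends BaseRichSpout {
  private var collector: SpoutOutputCollector = _

  override def open(conf: java.util.Map[_, _], context: TopologyContext,
                    collector: SpoutOutputCollector): Unit = this.collector = collector

  override def nextTuple(): Unit = {
    Thread.sleep(10)
    collector.emit(new Values(Integer.valueOf(Random.nextInt(100))))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("eventId"))
}

// Hypothetical bolt: keeps a running count per event id.
class CountBolt extends BaseBasicBolt {
  private val counts = scala.collection.mutable.Map.empty[Int, Long].withDefaultValue(0L)

  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    val id: Int = tuple.getInteger(0)
    counts(id) += 1
    collector.emit(new Values(Integer.valueOf(id), java.lang.Long.valueOf(counts(id))))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("eventId", "count"))
}

object EventCountTopology extends App {
  val builder = new TopologyBuilder
  builder.setSpout("events", new EventSpout, 2)                 // spout parallelism hint
  builder.setBolt("counts", new CountBolt, 4)
         .fieldsGrouping("events", new Fields("eventId"))       // partition the stream by event id

  val conf = new Config
  conf.setNumWorkers(2)                                         // coarse-grained JVM workers

  val cluster = new LocalCluster
  cluster.submitTopology("event-counts", conf, builder.createTopology())
  Thread.sleep(10000)
  cluster.shutdown()
}
```

Each spout/bolt executor runs inside a worker JVM, which is why the comparison later in the deck describes Storm's processing elements as coarse-grained JVMs.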

Technologies: Akka

● Supervising: tree of actors
● Topology (static and dynamic actors)
● Scaling and distributed processing
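A minimal classic-actors sketch of those three points, assuming Akka 2.x of that era; the Event, EventWorker, and EventSupervisor names and the partitioning scheme are hypothetical illustrations, not from the original deck:

```scala
import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Restart
import scala.concurrent.duration._

// Hypothetical event message.
case class Event(id: Long, payload: String)

// A lightweight worker actor: reacts to each incoming event and keeps its own state.
class EventWorker extends Actor {
  private var processed = 0L
  def receive = {
    case Event(id, payload) =>
      processed += 1
      // per-event logic would go here
  }
}

// A supervisor: builds a small static topology of workers and restarts them on failure.
class EventSupervisor extends Actor {
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: Exception => Restart   // fault tolerance: "you can stop it, but you can't kill it"
    }

  // Static topology: four workers; a dynamic topology would create and stop actors at runtime.
  private val workers =
    Vector.tabulate(4)(i => context.actorOf(Props[EventWorker], s"worker-$i"))

  def receive = {
    case e: Event =>
      workers((e.id % workers.size).toInt) forward e   // partition events by id
  }
}

object StreamingApp extends App {
  val system     = ActorSystem("streaming")
  val supervisor = system.actorOf(Props[EventSupervisor], "supervisor")

  (1L to 1000L).foreach(i => supervisor ! Event(i, s"payload-$i"))
}
```

Each actor is a small, lightweight processing element (far cheaper than a thread or a JVM), which is what makes the fine-grained, high-throughput claim in the comparison below plausible.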

Technology matrix

(rows: Granularity of Data; columns: Granularity of Processing)

                  Processing: Small    Processing: Big
Data: Small       Akka                 Akka, Gigaspaces
Data: Big         ?                    Storm

System end-to-end throughput

High (~10,000 events/sec): Akka
Medium (~100 events/sec): Storm / Gigaspaces
Low (~10 events/sec): scripting languages

Big Data in motion

All are distributed, fault-tolerant, streaming.

- Storm
  ++ multi-language
  -- not user/admin friendly
  -- slow supervising
  Processing elements are JVMs; ideal when data is coarse-grained.

- Akka
  ++ high throughput, fine-grained actors
  ++ dynamic topologies
  -- low-level, but high performance
  Processing elements are small and lightweight; ideal for millions of transactions per second.

- Gigaspaces
  ++ combines memory + application distribution
  -- framework API is not very flexible
  Processing elements are JVMs; ideal for an all-in-one solution, with little customization.

Opportunity: Lambda Architecture

Logic layer: Software as a Service, e.g. a real-time predictor

(from http://www.manning.com/marz/)
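As a rough illustration of the Lambda Architecture idea (a query merges a precomputed batch view with a real-time view maintained by the streaming layer), here is a minimal sketch; the view contents, names, and counting example are hypothetical, not taken from Marz's book beyond the general batch-view / real-time-view split:

```scala
// A toy serving layer in the Lambda style: query = batch view + real-time view.
object LambdaServing extends App {
  // Batch view: recomputed from the master dataset, e.g. once per day (hypothetical counts).
  val batchView: Map[String, Long] = Map("user-1" -> 100L, "user-2" -> 40L)

  // Real-time view: incremental counts from the streaming layer since the last batch run.
  val realtimeView: Map[String, Long] = Map("user-1" -> 3L, "user-3" -> 7L)

  // The query merges both views, so results stay fresh without waiting for the next batch.
  def query(key: String): Long =
    batchView.getOrElse(key, 0L) + realtimeView.getOrElse(key, 0L)

  println(query("user-1"))  // 103
  println(query("user-3"))  // 7
}
```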

Opportunity: Batch + Streaming

[Architecture diagram: batch computing and streaming computing, combined with in-memory distributed databases, data warehouses, and messaging buses; front-end services expose low-latency HTTP API services to an HTML5 client / responsive app via FETCH (refresh) and PUSH (SSE, notifications).]
