40
Styles and Patterns of Distributed Architectures PROGRAMMING LANGUAGES AND COMPILERS (LOG3210) SOFTWARE QUALITY (LOG8371) SOFTWARE TESTING (LOG3430) Cloud Computing (LOG8415) Course Map Software Design (LOG2410) OO Design (SOLID) Design Patterns Bad Design Design Quality Frameworks Pluggable Architectures SOA Multi-tier Cloud Big Data Event-driven 1 LOG8430 – ©M. Fokaefs et F. Guibault Ecosystems Microservices

LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Styles and Patterns of Distributed Architectures

PROGRAMMING LANGUAGES AND COMPILERS (LOG3210)

SOFTWARE QUALITY (LOG8371)

SOFTWARE TESTING (LOG3430)

Cloud Computing(LOG8415)

Course MapSoftware Design (LOG2410)

OO Design (SOLID)

Design Patterns

Bad Design

Design Quality

Frameworks

PluggableArchitectures

SOA

Multi-tier

Cloud

Big DataEvent-driven

1LOG8430 – ©M. Fokaefs et F. Guibault

Ecosystems

Microservices

Page 2: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

LOG8430E: Big Data Architectures – Data Ingestion

and AnalysisFall 2020

©M. Fokaefs et F. Guibault

Page 3: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Styles and Patterns of Distributed Architectures

SOFTWARE QUALITY (LOG8371)

SOFTWARE TESTING (LOG3430)

Cloud Computing(LOG8415)

TodaySoftware Design (LOG2410)

OO Design (SOLID)

Design Patterns

Bad Design

Design Quality

Frameworks

PluggableArchitectures

Cloud

Big Data

3LOG8430 – ©M. Fokaefs et F. Guibault

Ecosystems Multi-tier

Page 4: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Application Service Analysis Database

GUILoad

Balancer

Service Worker1

Service Worker2

Service Worker3

Spark Master2

Spark Worker3

Service Worker4

Kafka Cluster

Spark Master1

Spark Worker1

Service Worker2

NoSQL DB Cluster

(master)

NoSQL DB

(worker)

NoSQL DB

(worker)

Batch Analytics

Local Analytics

LOG8430 – ©M. Fokaefs et F. Guibault 4

Page 5: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Analyses and data in the new era

Access to data

• Services

• Encapsulation

• Management of data ingestion

• Message-oriented middleware

Data analysis

• Algorithms

• Big data analyses architectures

Big data systems

• NoSQL databases

LOG8430 – ©M. Fokaefs et F. Guibault 5

Page 6: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

A small problem…

• Suppose we have a “smart” building that can control its environmental conditions (temperature, humidity, luminosity). The sensors and the equipment come from different manufacturers and with different specifications. We have only one system for control and analysis that will accept the data.

• What are the implementation challenges for this system?

LOG8430 – ©M. Fokaefs et F. Guibault 6

Page 7: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

A small problem…

• Multiple resources, a single destination.• Multiple data formats.• Interoperability becomes difficult.

• For efficient analyses, we need enough data.• We cannot process every single element of information individually.• We need to wait to have a sufficient quantity of data.

• The sensors do not have much intelligence.• Should we implement a service per sensor or per type of sensor to accept the data?• Who is responsible for managing the data ingestion?

• Web services are not a sufficient solution (not wrong, but insufficient).• They require a direct connection between the producer and the consumer.• They (often) require a synchronous interaction.

LOG8430 – ©M. Fokaefs et F. Guibault 7

Page 8: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Message-oriented middleware

• In a message-oriented architecture, the modules are really decoupled and interoperable.

• The principal entity of this architecture is the message.

• The messages correspond to data that needs to be communicated between the modules.

• The modules are divided in producers and consumers of messages.

• The broker is the middleware

LOG8430 – ©M. Fokaefs et F. Guibault 8

Page 9: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Advanced Message Queuing Protocol (AMQP)

• AMQP is an ISO standard for the specification of message-oriented middleware.

• AMQP defines the entities of a message system and their method of communication.

• The entities according to AMQP:• Broker• Producer• Consumer• Exchange• Queue• Message

LOG8430 – ©M. Fokaefs et F. Guibault 9

Page 10: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

AMQP: Broker

• The broker implements the interoperability and it relieves the modules from a lot of responsibilities.

• The broker also allows for asynchronicity: the producer submits a message and continues its work; the consumer can receive the message once it is ready.

• The broker has the responsibility to route the messages between producers and consumers.

• The broker can also transform the message. This helps the processing of data in different formats.

• The broker is responsible to guarantee the security, the performance, the reliability and the scalability of the system.

LOG8430 – ©M. Fokaefs et F. Guibault 10

Page 11: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

AMQP: Producers and Exchanges

• The producers submits messages to the broker and in particular to the exchanges.

• The exchanges route the messages in queues according to an algorithm:• Direct: The exchange routes the message to a queue specified by a key.• “Fanout”: The exchange routes the message to all queues linked to the exchange.• “Topic”: The message is routed according to a key and an additional parameter to

allow for multicast.• “Header”: An exchange routes the message to a queue specified by more

parameters than just a key.

• Besides the algorithm, the exchanges also have properties:• Name• Persistence: The exchange survives a restart of the broker.• Automatic removal: when there are no queues linked to the exchange.• Additional arguments: used by extensions or plugins.

LOG8430 – ©M. Fokaefs et F. Guibault 11

Page 12: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

AMQP: Consumers and Queues

• The consumers consume the messages by registering to queues.

• The consumers can remove messages from the queue or push message to the queue.

• The queue functions as a temporary storage for the messages.

• The queues can be persistent or not. A persistent queue is stored on disk and it can survive from a restart of a broker.

• A queue can be exclusive to a connection.

• The queues are linked to exchanges by links.• The links represent rules according to which the exchanges guide the message

to queues.

LOG8430 – ©M. Fokaefs et F. Guibault 12

Page 13: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

AMQP: Messages

• The messages are defined by their data and their attributes.• The content (the data) of the message is optional and it is ignored by the

broker.• The message can be declared as persistent, which implies that the message

is stored on disk and it can survive from a restart.• The attributes of the message include:

• Type and encoding of the content.• Routing key• Persistent or not• Priority• Producer ID• Timestamp and expiration time of the message.

LOG8430 – ©M. Fokaefs et F. Guibault 13

Page 14: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Implementations of AMQP

LOG8430 – ©M. Fokaefs et F. Guibault 14

Page 15: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

RabbitMQ

• RabbitMQ is a direct implementation of the AMQP• All the elements discussed before are present in RabbitMQ.

LOG8430 – ©M. Fokaefs et F. Guibault 15

Page 16: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

RabbitMQ: An example

LOG8430 – ©M. Fokaefs et F. Guibault 16

Page 17: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

RabbitMQ: Types of exchanges

LOG8430 – ©M. Fokaefs et F. Guibault 17

Page 18: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Apache Kafka

• Kafka is defined as a distributed streaming service.

• The streams are like the queues but with a few small differences.

• The streams allow the broadcast of the data and not just the multicast.

• Kafka uses transmission by topic.• The consumers register to topics

and wait for data at real-time.

• All data is persistent on disk.

LOG8430 – ©M. Fokaefs et F. Guibault 18

Page 19: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Kafka: Topics

• The order of messages is dictated by the producers, but the consumers can control the order of retrieval.

• Thanks to topics and the grouping of consumers, Kafka offers the advantages of parallelism and broadcast.

• The message can be read by multiple consumers.

LOG8430 – ©M. Fokaefs et F. Guibault 19

Page 20: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Kafka: Basic architecture

• “Input systems” = Producers

• “Output systems” = Consumers

• “Kafka Producers” = Exchanges

• “Kafka Consumers” = Queues

• “Kafka” = Broker

• For Kafka, the exchanges and the queues are the equivalent of the stubs for the services.• They are local classes that will be

called by external systems.

LOG8430 – ©M. Fokaefs et F. Guibault 20

Page 21: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Kafka: ZooKeeper

• Kafka can work in clusters of servers/brokers.

• This helps the performance and the reliability.• The messages can be replicated

between brokers to increase fault tolerance.

• ZooKeeper is a discovery service that can manage the cluster.• The clients communicate with the

ZooKeeper service to connect to a broker.

LOG8430 – ©M. Fokaefs et F. Guibault 21

Page 22: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

RabbitMQ vs Kafka

RabbitMQ

• Smart broker/Simple consumers

• Reliability and decoupling are priorities.

• Multiple types of exchange are supported.

• Direct transmission or multicast.

• Increased security.

• High performance is possible but with significant resources.

Kafka

• Simple broker/Smart consumers

• Streaming allows the use of a single connection with an increased throughput.

• Only exchange by topic is supported.

• Broadcast transmission.

• Security can be a problem.

• Increased performance.LOG8430 – ©M. Fokaefs et F. Guibault 22

Page 23: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Message-oriented middleware

Advantages

• Flexibility

• Decoupling/Interoperability

• Reliability

• Security

• Scalability

Disadvantages

• Performance

• Infrastructure and architecture complexity

• Decoupling is not always desired.

LOG8430 – ©M. Fokaefs et F. Guibault 23

Page 24: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Analysis: MapReduce

• MapReduce is a programming model for the distributed analysis of big data.

• The model contains an algorithm, an implementation and an infrastructure.• In an implementation of MapReduce, the developer needs to define two

tasks:• A map task that represents the analysis job (e.g., a sort, an addition, a search etc.)• A reduce task that represents a summarization job, that will aggregate the results of

the individual map tasks.

• The tasks can be executed by multiple processing nodes and they require multiple storage nodes to facilitate the processing of data and to store the final results if necessary.• Distributed data systems, which are traditionally used, include GFS (Google File

System) and HDFS (Hadoop Distributed File System).

LOG8430 – ©M. Fokaefs et F. Guibault 24

Page 25: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

MapReduce: Process

• The data is defined and organized in pairs of key-value.

• The data is partitioned automatically or explicitly by the developer.

• By size, by geographic location or temporally.

• The data is partitioned among the workers by key.

• Copies of the tasks (map and reduce) are transmitted to each worker.

• The workers execute the map tasks by using input data according to the keys and they store the intermediate results.

• Other workers aggregate the map results by executing the reduce tasks.

• The results are stored in one or more files.

• The developers can process the files or give them to another MapReduce task.

LOG8430 – ©M. Fokaefs et F. Guibault 25

Page 26: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

MapReduce: Architecture

• The model follows a “master-slave” architecture.

• The master is responsible for the distribution of tasks.

• The workers can accept map tasks, as well as reduce tasks.

• The master is responsible to balance the workload between the workers.

• The master periodically checks the correct function of the workers. This guarantees the reliability of the system and the high availability of the service.

• The master from its side is a weak point.• Newer versions solve this problem.

• The distributed file system is also a “master-slave” architecture.

LOG8430 – ©M. Fokaefs et F. Guibault 26

Page 27: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

MapReduce: Advantages and Disadvantages

Advantages

• Efficiency thanks to parallelism.

• Availability.

• Fault tolerance.

• Scalability.

• Minimal mobility of data.• The processing of data can

happen in the server that stores the data.

Disadvantages

• “Region hotspotting”• The use of keys to partition and

share the data among workers can overcharge a particular worker.

• We can solve this situation by using a hash on the keys.

• Not all types of processing can be implemented by MapReduce.

LOG8430 – ©M. Fokaefs et F. Guibault 27

Page 28: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

MapReduce Implementations

LOG8430 – ©M. Fokaefs et F. Guibault 28

Page 29: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Hadoop: Basic architecture

• Hadoop is a direct implementation of MapReduce.

• It supports the same principal modules.

• The “Cluster” represents the physical (or virtual) infrastructure for data processing.

• YARN is the management system for the cluster.

• HDFS is the distributed file system.

• S3 is the physical storage where the HDFS resides.

• MapReduce is the implementation of the model.

LOG8430 – ©M. Fokaefs et F. Guibault 29

Page 30: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

YARN: Workflow

• A client submits a job for an application.

• The resource manager (RM) searches for available resources.

• The RM contacts the node manager (NM).

• The NM starts a container with the available resources.

• The container runs the application master.

LOG8430 – ©M. Fokaefs et F. Guibault 30

Page 31: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Hadoop: MapReduce architecture

• The containers contain the master and the workers.

• The application master starts the job and the job is responsible for defining the tasks (map and reduce).

• The tasks are deployed on the containers to be executed.

• The necessary resources are determined and provided by YARN.

LOG8430 – ©M. Fokaefs et F. Guibault 31

Page 32: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Spark: The platform

• Spark is another MapReduce implementation (Spark Core).

• It also works in a cluster of resources.

• It can work with multiple resource managers, but it provides its own manager as well.

• Spark provides multiple capacities, besides MapReduce.• Spark SQL, for the operations with

relational databases.• Spark Streaming• MLlib, for machine learning.• GraphX, for graph analytics.

LOG8430 – ©M. Fokaefs et F. Guibault 32

Page 33: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Spark: Architecture

• An execution of Spark is divided in two types of processes, a driverand executors.

• The driver is the job/application and it defines a context.

• The context specifies the job configuration.

• The job defines the tasks and starts the executors, one for each task.

LOG8430 – ©M. Fokaefs et F. Guibault 33

Page 34: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Spark: Resilient Distributed Dataset

• The RDDs explain the power of Spark.

• Although Spark can work with a file system like HDFS, thanks to RDDs, Spark works in memory.

• This makes it faster and more efficient than Hadoop. (100x faster)

• But, if the data is extremely large, the system is forced to use the disk. In any case, Spark is still much faster. (10x faster)

• The RDDs can be recovered in case of server failure.

• This increases fault tolerance.

LOG8430 – ©M. Fokaefs et F. Guibault 34

Page 35: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Hadoop vs Spark

Hadoop Spark

Performance o +

Usability/Portability + ++

Cost o -/+

Compatibility + +

Data processing - +

Fault tolerance + ++

Scalability ++ +

Security + o

LOG8430 – ©M. Fokaefs et F. Guibault 35

Page 36: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Streaming Analytics

• Data can be analysed in batch or in streams.

• The version of MapReduce that we saw analyses data in batch.• A large quantity of data enters the system and it is all analysed together

during one execution of the system.

• In case of streams, the connection between the data producer and the analysis system remains always open. The data is analysed in pieces as they come.

• This allows the analysis of data at real-time, which is important for systems like alarms or financial systems.

LOG8430 – ©M. Fokaefs et F. Guibault 36

Page 37: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Spark Streaming

• Spark streaming accepts data streams from multiple sources.

• At frequent intervals, Spark extracts data and divides it in small batches, which are inserted in the principal system.

• The data streams are represented as DStreams (direct streams), which are children of RDDs.

LOG8430 – ©M. Fokaefs et F. Guibault 37

Page 38: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Kafka Streaming

• For Kafka, each topic corresponds to a stream.

• Kafka uses threads to parallelise the processing of streams.

• Kafka stores the local state of each partition (stateful operation).

LOG8430 – ©M. Fokaefs et F. Guibault 38

Page 39: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Spark Streaming vs Kafka Streaming

• The definition of streams and the ingestion of data is time-based in Spark and event-based in Kafka.

• Spark is a “fake” stream analysis, since in reality it analyses the data in small batches.

• Kafka is more efficient and better in reacting to events.

• Spark is more stable and more portable and it is compatible with all Hadoop technologies.

LOG8430 – ©M. Fokaefs et F. Guibault 39

Page 40: LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Styles and Patterns of Distributed Architectures

SOFTWARE QUALITY (LOG8371)

SOFTWARE TESTING (LOG3430)

Cloud Computing(LOG8415)

Next timeSoftware Design (LOG2410)

OO Design (SOLID)

Design Patterns

Bad Design

Design Quality

Frameworks

PluggableArchitectures

Cloud

Big Data

40LOG8430 – ©M. Fokaefs et F. Guibault

Ecosystems Multi-tier