LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics

Styles and Patterns of Distributed Architectures

PROGRAMMING LANGUAGES AND COMPILERS (LOG3210)

SOFTWARE QUALITY (LOG8371)

SOFTWARE TESTING (LOG3430)

Cloud Computing(LOG8415)

Course MapSoftware Design (LOG2410)

OO Design (SOLID)

Design Patterns

Bad Design

Design Quality

Frameworks

PluggableArchitectures

SOA

Multi-tier

Cloud

Big DataEvent-driven

1LOG8430 – ©M. Fokaefs et F. Guibault

Ecosystems

Microservices

LOG8430E: Big Data Architectures – Data Ingestion

and AnalysisFall 2020

©M. Fokaefs et F. Guibault





TodaySoftware Design (LOG2410)

OO Design (SOLID)

Design Patterns

Bad Design

Design Quality

Frameworks


Cloud

Big Data


Ecosystems Multi-tier

Application Service Analysis Database

GUILoad

Balancer

Service Worker1

Service Worker2

Service Worker3

Spark Master2

Spark Worker3

Service Worker4

Kafka Cluster

Spark Master1

Spark Worker1

Service Worker2

NoSQL DB Cluster

(master)

NoSQL DB

(worker)

NoSQL DB

(worker)

Batch Analytics

Local Analytics

LOG8430 – ©M. Fokaefs et F. Guibault 4

Analyses and data in the new era

Access to data

• Services

• Encapsulation

• Management of data ingestion

• Message-oriented middleware

Data analysis

• Algorithms

• Big data analyses architectures

Big data systems

• NoSQL databases


A small problem…

• Suppose we have a “smart” building that can control its environmental conditions (temperature, humidity, luminosity). The sensors and the equipment come from different manufacturers and with different specifications. We have only one system for control and analysis that will accept the data.

• What are the implementation challenges for this system?


A small problem…

• Multiple resources, a single destination.• Multiple data formats.• Interoperability becomes difficult.

• For efficient analyses, we need enough data.• We cannot process every single element of information individually.• We need to wait to have a sufficient quantity of data.

• The sensors do not have much intelligence.• Should we implement a service per sensor or per type of sensor to accept the data?• Who is responsible for managing the data ingestion?

• Web services are not a sufficient solution (not wrong, but insufficient).• They require a direct connection between the producer and the consumer.• They (often) require a synchronous interaction.


Message-oriented middleware

• In a message-oriented architecture, the modules are really decoupled and interoperable.

• The principal entity of this architecture is the message.

• The messages correspond to data that needs to be communicated between the modules.

• The modules are divided in producers and consumers of messages.

• The broker is the middleware


Advanced Message Queuing Protocol (AMQP)

• AMQP is an ISO standard for the specification of message-oriented middleware.

• AMQP defines the entities of a message system and their method of communication.

• The entities according to AMQP:• Broker• Producer• Consumer• Exchange• Queue• Message


AMQP: Broker

• The broker implements the interoperability and it relieves the modules from a lot of responsibilities.

• The broker also allows for asynchronicity: the producer submits a message and continues its work; the consumer can receive the message once it is ready.

• The broker has the responsibility to route the messages between producers and consumers.

• The broker can also transform the message. This helps the processing of data in different formats.

• The broker is responsible to guarantee the security, the performance, the reliability and the scalability of the system.


AMQP: Producers and Exchanges

• The producers submits messages to the broker and in particular to the exchanges.

• The exchanges route the messages in queues according to an algorithm:• Direct: The exchange routes the message to a queue specified by a key.• “Fanout”: The exchange routes the message to all queues linked to the exchange.• “Topic”: The message is routed according to a key and an additional parameter to

allow for multicast.• “Header”: An exchange routes the message to a queue specified by more

parameters than just a key.

• Besides the algorithm, the exchanges also have properties:• Name• Persistence: The exchange survives a restart of the broker.• Automatic removal: when there are no queues linked to the exchange.• Additional arguments: used by extensions or plugins.


AMQP: Consumers and Queues

• The consumers consume the messages by registering to queues.

• The consumers can remove messages from the queue or push message to the queue.

• The queue functions as a temporary storage for the messages.

• The queues can be persistent or not. A persistent queue is stored on disk and it can survive from a restart of a broker.

• A queue can be exclusive to a connection.

• The queues are linked to exchanges by links.• The links represent rules according to which the exchanges guide the message

to queues.


AMQP: Messages

• The messages are defined by their data and their attributes.• The content (the data) of the message is optional and it is ignored by the

broker.• The message can be declared as persistent, which implies that the message

is stored on disk and it can survive from a restart.• The attributes of the message include:

• Type and encoding of the content.• Routing key• Persistent or not• Priority• Producer ID• Timestamp and expiration time of the message.


Implementations of AMQP


RabbitMQ

• RabbitMQ is a direct implementation of the AMQP• All the elements discussed before are present in RabbitMQ.


RabbitMQ: An example


RabbitMQ: Types of exchanges


Apache Kafka

• Kafka is defined as a distributed streaming service.

• The streams are like the queues but with a few small differences.

• The streams allow the broadcast of the data and not just the multicast.

• Kafka uses transmission by topic.• The consumers register to topics

and wait for data at real-time.

• All data is persistent on disk.


Kafka: Topics

• The order of messages is dictated by the producers, but the consumers can control the order of retrieval.

• Thanks to topics and the grouping of consumers, Kafka offers the advantages of parallelism and broadcast.

• The message can be read by multiple consumers.


Kafka: Basic architecture

• “Input systems” = Producers

• “Output systems” = Consumers

• “Kafka Producers” = Exchanges

• “Kafka Consumers” = Queues

• “Kafka” = Broker

• For Kafka, the exchanges and the queues are the equivalent of the stubs for the services.• They are local classes that will be

called by external systems.


Kafka: ZooKeeper

• Kafka can work in clusters of servers/brokers.

• This helps the performance and the reliability.• The messages can be replicated

between brokers to increase fault tolerance.

• ZooKeeper is a discovery service that can manage the cluster.• The clients communicate with the

ZooKeeper service to connect to a broker.


RabbitMQ vs Kafka

RabbitMQ

• Smart broker/Simple consumers

• Reliability and decoupling are priorities.

• Multiple types of exchange are supported.

• Direct transmission or multicast.

• Increased security.

• High performance is possible but with significant resources.

Kafka

• Simple broker/Smart consumers

• Streaming allows the use of a single connection with an increased throughput.

• Only exchange by topic is supported.

• Broadcast transmission.

• Security can be a problem.

• Increased performance.LOG8430 – ©M. Fokaefs et F. Guibault 22

Message-oriented middleware

Advantages

• Flexibility

• Decoupling/Interoperability

• Reliability

• Security

• Scalability

Disadvantages

• Performance

• Infrastructure and architecture complexity

• Decoupling is not always desired.


Analysis: MapReduce

• MapReduce is a programming model for the distributed analysis of big data.

• The model contains an algorithm, an implementation and an infrastructure.• In an implementation of MapReduce, the developer needs to define two

tasks:• A map task that represents the analysis job (e.g., a sort, an addition, a search etc.)• A reduce task that represents a summarization job, that will aggregate the results of

the individual map tasks.

• The tasks can be executed by multiple processing nodes and they require multiple storage nodes to facilitate the processing of data and to store the final results if necessary.• Distributed data systems, which are traditionally used, include GFS (Google File

System) and HDFS (Hadoop Distributed File System).


MapReduce: Process

• The data is defined and organized in pairs of key-value.

• The data is partitioned automatically or explicitly by the developer.

• By size, by geographic location or temporally.

• The data is partitioned among the workers by key.

• Copies of the tasks (map and reduce) are transmitted to each worker.

• The workers execute the map tasks by using input data according to the keys and they store the intermediate results.

• Other workers aggregate the map results by executing the reduce tasks.

• The results are stored in one or more files.

• The developers can process the files or give them to another MapReduce task.


MapReduce: Architecture

• The model follows a “master-slave” architecture.

• The master is responsible for the distribution of tasks.

• The workers can accept map tasks, as well as reduce tasks.

• The master is responsible to balance the workload between the workers.

• The master periodically checks the correct function of the workers. This guarantees the reliability of the system and the high availability of the service.

• The master from its side is a weak point.• Newer versions solve this problem.

• The distributed file system is also a “master-slave” architecture.


MapReduce: Advantages and Disadvantages

Advantages

• Efficiency thanks to parallelism.

• Availability.

• Fault tolerance.

• Scalability.

• Minimal mobility of data.• The processing of data can

happen in the server that stores the data.

Disadvantages

• “Region hotspotting”• The use of keys to partition and

share the data among workers can overcharge a particular worker.

• We can solve this situation by using a hash on the keys.

• Not all types of processing can be implemented by MapReduce.


MapReduce Implementations


Hadoop: Basic architecture

• Hadoop is a direct implementation of MapReduce.

• It supports the same principal modules.

• The “Cluster” represents the physical (or virtual) infrastructure for data processing.

• YARN is the management system for the cluster.

• HDFS is the distributed file system.

• S3 is the physical storage where the HDFS resides.

• MapReduce is the implementation of the model.


YARN: Workflow

• A client submits a job for an application.

• The resource manager (RM) searches for available resources.

• The RM contacts the node manager (NM).

• The NM starts a container with the available resources.

• The container runs the application master.


Hadoop: MapReduce architecture

• The containers contain the master and the workers.

• The application master starts the job and the job is responsible for defining the tasks (map and reduce).

• The tasks are deployed on the containers to be executed.

• The necessary resources are determined and provided by YARN.


Spark: The platform

• Spark is another MapReduce implementation (Spark Core).

• It also works in a cluster of resources.

• It can work with multiple resource managers, but it provides its own manager as well.

• Spark provides multiple capacities, besides MapReduce.• Spark SQL, for the operations with

relational databases.• Spark Streaming• MLlib, for machine learning.• GraphX, for graph analytics.


Spark: Architecture

• An execution of Spark is divided in two types of processes, a driverand executors.

• The driver is the job/application and it defines a context.

• The context specifies the job configuration.

• The job defines the tasks and starts the executors, one for each task.


Spark: Resilient Distributed Dataset

• The RDDs explain the power of Spark.

• Although Spark can work with a file system like HDFS, thanks to RDDs, Spark works in memory.

• This makes it faster and more efficient than Hadoop. (100x faster)

• But, if the data is extremely large, the system is forced to use the disk. In any case, Spark is still much faster. (10x faster)

• The RDDs can be recovered in case of server failure.

• This increases fault tolerance.


Hadoop vs Spark

Hadoop Spark

Performance o +

Usability/Portability + ++

Cost o -/+

Compatibility + +

Data processing - +

Fault tolerance + ++

Scalability ++ +

Security + o


Streaming Analytics

• Data can be analysed in batch or in streams.

• The version of MapReduce that we saw analyses data in batch.• A large quantity of data enters the system and it is all analysed together

during one execution of the system.

• In case of streams, the connection between the data producer and the analysis system remains always open. The data is analysed in pieces as they come.

• This allows the analysis of data at real-time, which is important for systems like alarms or financial systems.


Spark Streaming

• Spark streaming accepts data streams from multiple sources.

• At frequent intervals, Spark extracts data and divides it in small batches, which are inserted in the principal system.

• The data streams are represented as DStreams (direct streams), which are children of RDDs.


Kafka Streaming

• For Kafka, each topic corresponds to a stream.

• Kafka uses threads to parallelise the processing of streams.

• Kafka stores the local state of each partition (stateful operation).


Spark Streaming vs Kafka Streaming

• The definition of streams and the ingestion of data is time-based in Spark and event-based in Kafka.

• Spark is a “fake” stream analysis, since in reality it analyses the data in small batches.

• Kafka is more efficient and better in reacting to events.

• Spark is more stable and more portable and it is compatible with all Hadoop technologies.






Next timeSoftware Design (LOG2410)

OO Design (SOLID)

Design Patterns

Bad Design

Design Quality

Frameworks


Cloud

Big Data


Ecosystems Multi-tier

Documents

LOG8430 : Architectures des Megadonnées...Kafka: Topics •The order of messages is dictated by the producers, but the consumers can control the order of retrieval. •Thanks to topics