Distributed Systems - Uni Siegen

Roland WismullerBetriebssysteme / verteilte Systeme Distributed Systems (1/13) i

Roland Wismuller

Universitat Siegen

[email protected]

Tel.: 0271/740-4050, Buro: H-B 8404

Stand: March 29, 2021

Distributed Systems

Summer Term 2021

Roland WismullerBetriebssysteme / verteilte Systeme Distributed Systems (1/13) 2

Distributed SystemsSummer Term 2021

0 Organisation

About Myself


➥ Studies in Computer Science, Techn. Univ. Munich

➥ Ph.D. in 1994, state doctorate in 2001

➥ Since 2004 Prof. for Operating Systems and Distributed Systems

➥ Research: Monitoring, Analysis und Control of parallel and

distributed Systems

➥ Mentor for Bachelor Studies in Computer Science with secondary

field Mathematics

➥ E-mail: [email protected]

➥ Tel.: 0271/740-4050

➥ Room: H-B 8404

About the Chair ”‘Operating Systems / Distrib. Sys.”’


Andreas Hoffmannandreas.hoffmann@uni-...

0271/740-4047

H-B 8405

➥ E-assessment and e-labs

➥ IT security

➥ Web technologies

➥ Mobile applications

Damian Ludwigdamian.ludwig@uni-...

0271/740-2533

H-B 8402

➥ Capability systems

➥ Compilers

➥ Programming languages

Hawzhin Hozhabr Pourhawzhin.hozhabrpour@uni-...

0271/740-4038

H-B 8411

➥ Machine Learning

➥ Pattern recognition in car sensordata

➥ Anomaly detection

Teaching


Lectures/Labs

➥ Rechnernetze I, 5 LP (every summer term)

➥ Rechnernetze Praktikum, 5 LP (every winter term)

➥ Rechnernetze II, 5 LP (every summer term)

➥ Betriebssysteme I, 5 LP (every winter term)

➥ Parallel Processing, 5 LP (every winter term)

➥ Distributed Systems, 5 LP (every summer term)

Teaching ...


Project Groups

➥ e.g., recording and analyzing car sensor data

➥ e.g., outlier detection in car sensor data

Theses (Bachelor, Master)

➥ Topic areas: secure virtual machine, parallel computing, patternrecognition in sensor data, e-assessment, ...

Seminars

➥ Topic areas: IT security, programming languages, patternrecognition in sensor data, ...

➥ Procedure: block seminar

➥ 30 min. talk, 5000 word seminar paper

About the Lecture


➥ Lecture:

➥ digital: screen casts at moodle

➥ Q&A: Mon., 12:00 - 12:30 (or longer, if needed) via zoom

➥ Exercises:

➥ 2 hours (digital)

➥ Tue., 10:15-11:45, via zoom, starting 20.04.

➥ this zoom meeting will be recorded!

➥ includes programming exercises using Java

➥ Links to zoom meetings: see moodle

https://moodle.uni-siegen.de/course/view.php?id=512

https://moodle.uni-siegen.de/course/view.php?id=512

About the Lecture ...


Information, Slides and Announcements

➥ http://www.bs.informatik.uni-siegen.de/lehre/vs

➥ For printing: use print service of the Student Council!

➥ If necessary, updates/supplements shortly before the lecture

➥ look at the date!

➥ Exercise sheets will be put online as PDF

➥ please print and process them yourself!

http://www.bs.informatik.uni-siegen.de/lehre/vs

Examination


➥ Oral examination

➥ duration about 30 minutes

➥ Registration:

➥ first register at the campus management system (unisono)

➥ at least 1 week before the exam date

➥ then fix a date with my secretary

➥ at least 1 week before the exam date

➥ Mrs. Syska, regina.syska@uni-...

➥ cancellation is possible up to 7 days before the exam

➥ via unisono➥ please inform me, too!

Contents of the Lecture


➥ Introduction

➥ Middleware

➥ Distributed programming with Java RMI

➥ Name services

➥ Process management

➥ Time and global state

➥ Coordination

➥ Replication and consistency

➥ Distributed file systems

➥ Fault tolerance

Learning targets


➥ Understand the properties of distributed systems

➥ absence of a global state

➥ problems with synchronization and with consistency of

replicated data

➥ Understand the approaches to solve the problems

and be able to apply them to given challenges

➥ Distinguish architecture models for distributed systems as well as

different types and tasks of middleware

and be able to assess their usability for given problems

➥ Be able to develop simple distributed programs with Java RMI

Literature


➥ Andrew S. Tanenbaum, Marten van Steen. Verteilte Systeme,Grundlagen und Paradigmen. Pearson Studium, 2003.(English: Distributed Systems: Principles and Paradigms, 2ndEdition. Pearson Education, 2016. Available online.)

➥ Ulrike Hammerschall. Verteilte Systeme und Anwendungen. Pear-son Studium, 2005.

➥ George Coulouris, Jean Dollimore, Tim Kindberg. Verteilte Sys-teme, Konzepte und Design, 3. Auflage. Pearson Studium, 2002.(English: Distributed Systems: Concepts and Design, 5th Edition.Pearson Education, 2012.)

➥ Andrew S. Tanenbaum. Moderne Betriebssysteme, 2. Auflage.Pearson Studium, 2003.

➥ William Stallings. Betriebssysteme – Prinzipien und Umsetzung,4. Auflage. Pearson Studium, 2003.

http://barbie.uta.edu/~jli/Resources/MapReduce%26Hadoop/Distributed%20Systems%20Principles%20and%20Paradigms.pdf

Literature ...


➥ Jim Farley, William Crawford, David Flanagan. Java Enterprise in

a Nutshell. O’Reilly 2002.

➥ Cay S. Horstmann, Gary Cornell. Core Java 2, Band 2 –

Expertenwissen. Sun Microsystems Press / Addison Wesley,

2008.

➥ Robert Orfali, Dan Harkey. Client/Server-Programming with Java

and Corba. John Wiley & Sons, 1998.

➥ Torsten Langner. Verteilte Anwendungen mit Java. Markt +

Technik, 2002.



1 Introduction

1 Introduction ...


Contents

➥ What makes a distributed system?

➥ Software architecture

➥ Architecture models

➥ Cluster

Literature

➥ Hammerschall: 1

➥ Tanenbaum, van Steen: 1

➥ Colouris, Dollimore, Kindberg: 1, 2

➥ Stallings: 13.4

1 Introduction ...


1.1 What makes a distributed system?

work together to coordinate their actions by exchanging messages.In a distributed system, components located on different computers

G. Coulouris

A distributed system is one on which I can’t do any work because somemachine I’ve never heard of has crashed. L. Lamport

A distributed system is a set of independent computers that appear tothe user as a single, coherent system.

A. Tanenbaum

A distributed system is a collection of processors that neither sharemain memory nor a clock. A. Silberschatz

1.1 What makes a distributed system? ...


➥ A distributed system is a system

➥ in which hardware and software components are based on

networked computers, and

➥ communicate and coordinate their actions only via the

exchange of messages.

➥ The boundaries of the distributed system are defined by a com-

mon application

➥ Best known example: Internet

➥ communication via the standardized Internet protocols

➥ IP and TCP / UDP (☞ lecture Computer Networks)

➥ users can use services / applications, regardless of the present

location



What is a distributed application?

➥ Application that uses a distributed system to create a

self-contained functionality

➥ Application logic distributed among several, largely independent

components

➥ Components often executed on different machines

➥ Examples:

➥ simple internet applications (e.g. WWW, FTP, email)

➥ distributed information systems (e.g. flight booking)

➥ SW intensive, data centered, interactive, highly concurrent

➥ distributed embedded systems (e.g. in the car)

➥ distributed mobile applications (e.g. for handhelds)



A typical distributed system

Desktop

Desktop

WWWserver DataPrint

server

LANInternet

LAN

serverMail

Appli−cationserver

serverbase



Why distribution?

➥ Central, non-distributed applications are

➥ generally safer and more reliable

➥ generally more performant

➥ Main reason for distribution: sharing of resources

➥ Hardware resources (printer, scanner, ...)

➥ cost saving

➥ Data and information (file server, database, ...)

➥ information exchange, data consistency

➥ Functionality (centralization)

➥ error avoidance, reuse

1.2 Characteristics of distributed systems


➥ Resources (e.g. computers, data, users, ...) are distributed

➥ sometimes worldwide

➥ Cooperation via message exchange

➥ Concurrency

➥ but: parallel processing of a single request is not the primary

goal

➥ No global clock (more precisely: no global time)

➥ Distributed status information

➥ no uniquely determined global state

➥ Partial errors are possible (independent failures)

1.2 Characteristics of distributed systems ...


Parallel vs. distributed systems

➥ Parallel system:

➥ motivation: higher performance through parallel execution

➥ multiple tasks (processes/threads) working on one job

➥ tasks are fine-grained: frequent communication

➥ tasks work simultaneously (parallel)

➥ homogeneous hardware / OSs, regular network structure

➥ Distributed system:

➥ motivation: distributed resources (computers, data, users)

➥ multiple tasks (processes/threads) working on one or manyjobs

➥ tasks are coarse grained: communication less frequent

➥ tasks work synchronized (usually one after the other)

➥ inhomogeneous (processors, networks, OSs, ...)

1.3 Challenges and Goals of Distributed Systems


➥ Heterogeneity: computer hardware, networks, OSs,programming languages, implementations by differentdevelopers, ...

➥ solution: middleware➥ software layer that hides heterogeneity by providing a

unified programming model

➥ e.g. CORBA: distributed objects, remote method invocation

➥ e.g. web services: remote procedure calls (services)

➥ Openness: easy extensibility (with new services)

➥ requirements:

➥ key interfaces are published / standardized

➥ uniform communication mechanisms / protocols

➥ components must conform to standards

[Coulouris, 1.4]

1.3 Challenges and Goals of Distributed Systems ...


➥ Security

➥ information: confidentiality, integrity, availability

➥ esp. with mobile code

➥ users: authentication, authorization

➥ Scalability: number of resources or users can grow without

negative impact on performance and cost

➥ Error handling (partial errors)

➥ error detection (e.g. checksums)

➥ error masking (e.g. retransmission of a message)

➥ error tolerance (e.g. browser: “server not available”)

➥ recovery (of data) after errors

➥ redundancy (of hardware and software)

1.3 Challenges and Goals of Distributed Systems ...


➥ Concurrency

➥ synchronization, consistency of replicated data

➥ Transparency

➥ access∼: local and remote accesses identical}

network∼➥ location∼: no need to know the location

➥ mobility∼: transparent relocation of resources

➥ replication∼: transparent replication of resources

➥ concurrency∼: shared use of resources without disruptions

➥ error∼: hiding errors due to component failure

➥ performance∼: performance is largely independent of theload

➥ scaling∼: system scales without negative impact on users

1.4 Software Architecture


Types of Operating Systems for Distributed Systems

➥ Network operating system:

➥ traditional OS, extended by support for network applications

(API for sockets, RPC, ...)

➥ each computer has its own OS, but can use services of other

computers (file system, email, ssh, ...)

➥ the existence of the other computers is visible

➥ Distributed operating system:

➥ uniform OS for a network of computers

➥ transparent for the user

➥ requires cooperation of the OS kernels

➥ so far mainly research projects

1.4 Software Architecture ...


Typical layers in a distributed system

Applications

Pla

tfo

rm(s

)

Services (generic or application specific)

Computer and network hardware

Middleware

(Network) Operating system

API

[Coulouris, 2.2.1]

1.4 Software Architecture ...


Middleware

➥ Tasks:

➥ hiding of distribution and heterogeneity

➥ providing a common programming model / API

➥ provision of general services

➥ Functions e.g:

➥ communication services: remote method calls, group

communication, event notifications

➥ replication of shared data

➥ security services

➥ Examples: CORBA, EJB, .NET, Axis2 (Web Services), ...

(☞ Lecture Client/Server Programming)

1.5 Architectural Models


➥ An architecture model characterizes:

➥ roles of an application component within the distributedapplication

➥ relationships between application components

➥ Role defined by the type of process the component is running in:

➥ client process

➥ short-lived (for the duration of use by the user)

➥ acts as initiator of interprocess communication (IPC)

➥ server process

➥ lives ’unlimited’➥ acts as a service provider for an IPC

➥ peer process

➥ short-lived (for the duration of use by the user)

➥ acts as initiator and service provider

1.5 Architectural Models ...


Peer-to-Peer Model

➥ Collaboration of peer processes for a distributed activity

➥ each process manages a local part of the resources

➥ distributed coordination and synchronization of actions at

application level

Coordination code

Application

Coordination code

Application

Coordination code

Application

➥ E.g.: file sharing applications, routers, video conferences, ...



[Coulouris, 2.2.2]Client/Server Model

➥ Asymmetric model: Servers provide services that can be used by

(multiple) clients.

➥ servers usually manage resources (centralized)

Request

Reply

Client

Client

Server

Process Computer

RequestServer

Reply

Server can itselfact as a client

➥ Most common model for distributed applications (ca. 80 %)



Client/Server Model ...

➥ Usually concurrent requests from several client processes to theserver process

Start

Server

Client End

ReplyRequest Time

➥ Examples: file server, web server, database server, DNS server,...



Client/Server Model ...

➥ Usually concurrent requests from several client processes to theserver process

Client

Start

Server

Client End

ReplyRequest Time

➥ Examples: file server, web server, database server, DNS server,...



Variants of the client/server model

➥

Server

Server

ServerClient

Client

Cooperating servers

➥ Network of servers transparently processes a request

➥ Example: Domain Name Server (DNS)

➥ if server cannot determine address:request is transparentlyforwarded to another server

➥ Replicated servers

➥ replicas of server processesare provided

➥ transparent replicas (often in clusters)➥ requests are automatically distributed to the servers

➥ public replicas (e.g. mirror servers)

➥ goals: better performance, reliability



Variants of the client/server model ...

➥

Proxy

Client

Client

Server

Server

Proxy-Server / Caches

➥ proxy is a delegate forthe server

➥ task often is cachingof data / results

➥ e.g. web proxy

➥ Mobile code

➥ executable server code migrates to client on request

➥ code is executed by the client

➥ best-known example: JavaScript / Java applets in the WWW

➥ Mobile agents

➥ agent contains code and data, moves through the network andperforms actions on local resources



n-Tier Architectures

➥ Refinements of Client/Server Architecture

➥ Models for distributing an application to the nodes of a distributed

system

➥ Mainly used in information systems

➥ Tier (german: Schicht / Stufe) denotes an independent process

space within a distributed application

➥ process space can, but does not have to, correspond to a

physical host

➥ several process spaces on one computer are possible



The Tier Model

➥ Typical tasks in an information system:

➥ presentation – interface to the user

➥ application logic – actual functionality

➥ data storage – storage of data in a database

➥ The tier model determines:

➥ assignment of tasks to application components

➥ distribution of application components on tiers

➥ Architectures:

➥ 2-tier architectures

➥ 3-tier architectures

➥ 4-or-more-tier architectures



2-Tier Architecture

➥ Client and server tier

➥ No own tier for the application logic

(distribution between clientand server tier varies)

Client tier

Server tier

Presentation

Data storage

Application logic

➥ Advantage: simple, high performance

➥ Disadvantage: difficult to maintain, poorly scalable



3-Tier Architecture

Presentation

Application logic

Data storage

Client tier

Middle tier

Server tier

➥ Standard distribution model for simple web applications:

➥ client tier: web browser for display

➥ middle tier: web server with servlets / JSP / ASP

➥ server tier: database server

➥ Advantages: Application logic centrally administrable, scalable



4-or-more-Tier Architectures

➥ Difference to 3-tier architecture:

➥ application logic distributed across multiple tiers

➥ Motivation:

➥ minimization of complexity (divide and conquer)

➥ better protection of individual application parts

➥ reusability of components

➥ Many distributed information systems have 4-or-more-tier

architectures



Example: Typical Internet Application

Intranet

Internet

Web

Webclient

client

cationserver

serverData

Tier 1 Tier 3 Tier 4Tier 2

Appli−baseserver

Web



Example: Typical Internet Application

Intranet

Internet

Web

Webclient

client

cationserver

serverData

Tier 1 Tier 3 Tier 4Tier 2

Appli−baseserver

Web

DMZ

Fire

wal

l

Fire

wal

l

serverWeb



Thin and fat clients

➥ Characterizes complexity of the application component on the

client tier

➥ Ultra-thin client

➥ client tier only for presentation: pure display of dialogs

➥ presentation component: web browser

➥ only possible with 3-or-more-tier architectures

➥ Thin client

➥ client tier for presentation only: display of dialogs, preparation

of data for display

➥ Fat client

➥ parts of the application logic on the client tier

➥ usually with 2-tier architectures



Distinction from Enterprise Application Integration (EAI)

➥ EAI: integration of different applications

➥ communication, exchange of data

➥ Goals similar to distributed applications / middleware

➥ middleware is often used for EAI as well

➥ Differences:

➥ distributed applications: application components, high degree

of coupling, usually little heterogeneity

➥ EAI: complete applications, low degree of coupling, mostly

great heterogeneity (different technologies, systems,

programming languages, ...)

1.6 Cluster


➥ Cluster: group of networked

computers that acts as a unified

computing resource

➥ i.e. multicomputer system

➥ nodes usually standard PCs

or blade server

➥ Application mainly as high

performance server

➥ Motivation:

➥ (step-by-step) scalability

➥ high availability

➥ good price/performance ratio

[Stallings, 13.4]

1.6 Cluster ...


Uses for Clusters

➥ High availability (HA) clusters

➥ improved reliability

➥ when a node is faulty: services are migrated to other nodes

(failover)

➥ Load balancing cluster

➥ incoming requests are distributed to different nodes of the

cluster

➥ usually by a (redundant) central instance

➥ frequently with WWW or email servers

➥ High performance computing cluster

➥ cluster as parallel computer

1.6 Cluster ...


Cluster configurations

➥ Passive standby (no actual cluster)

➥ processing of all requests by primary server

➥ secondary server takes over tasks (only) in case of failure

➥ Active standby

➥ all servers process requests

➥ enables load balancing and improved reliability

➥ problem: access to data of other / failed server

➥ alternatives:

➥ replication of data (a lot of communication)

➥ shared hard disk system (usually mirrored disks or RAID

system for fail-safe operation)

1.6 Cluster ...


Active Standby Configurations

➥ Separate servers with data replication

➥ separate disks, data is continuously copied to secondary

servers

➥ Server with shared hard disks

➥ shared nothing cluster

➥ separate partitions for each server

➥ in case of server failure: reconfiguration of the partitions

➥ shared disc cluster

➥ simultaneous use by all servers

➥ requires lock manager software to lock files or records

1.7 Summary


➥ Distributed system

➥ HW and SW components on networked computers

➥ no shared memory, no global time

➥ motivation: use of distributed resources

➥ Challenges

➥ heterogeneity, openness, security, scalability

➥ error handling, concurrency, transparency

➥ Software architecture: middleware

➥ Architectural models:

➥ peer-to-peer, client/server

➥ n-tier models

➥ Cluster: high availability, load balancing



2 Middleware

2 Middleware ...


Content

➥ Communication in distributed systems

➥ Communication-oriented middleware

➥ Application-oriented middleware

Literature

➥ Hammerschall: Ch. 2, 6

➥ Tanenbaum, van Steen: Ch. 2

➥ Colouris, Dollimore, Kindberg: Ch. 4.4

2 Middleware ...


Netw.

Distributed application (DA)

Distributed system (DS)

DADA

DS nodeDS node

component component

➥ DA uses DS for communication between its components

➥ DSs generally only offer simple communication services

➥ direct use: network programming

➥ Middleware offers more intelligent interfaces

➥ hides details of network programming

2 Middleware ...


Netw.

DA

Middleware

DA

Middlewarecomponent

DS node DS node



component

Netw.



DADA

DS nodeDS node

component component

➥ DA uses DS for communication between its components

➥ DSs generally only offer simple communication services

➥ direct use: network programming

➥ Middleware offers more intelligent interfaces

➥ hides details of network programming

2 Middleware ...


➥ Middleware is the interface between distributed application and

distributed system

➥ Goal: hide distribution aspects from application

➥ transparency (☞ 1.3)

➥ Middleware can also provide additional services for applications

➥ huge differences in existing middleware

➥ Distinction:

➥ communication-oriented middleware (☞ 2.2)

➥ (only) abstraction from network programming

➥ application-oriented middleware (☞ 2.3)

➥ besides communication, the focus is on support of

distributed applications

2 Middleware ...


2.1 Communication in Distributed Systems

➥ Basis: interprocess communication (IPC)

➥ exchange of messages between processes (☞ BS I: 3.2)

➥ on the same or on different nodes

➥ e.g. via ports, mailboxes, streams, ...

➥ For distribution: network protocols (☞ RN I)

➥ relevant topics etc: addressing, reliability, guaranteed ordering,

timeouts, acknowledgements, marshalling

➥ Interface for network programming: sockets (☞ RN II)

➥ datagrams (UDP) and streams (TCP)

2.1 Communication in Distributed Systems ...


Synchronous Communication

➥ ReceiverSender

bloc

ked

reply

request

activ

e

Time

Sender and receiver block when

calling a send or receive operation

➥ receiver is waiting for a request

➥ sender is waiting for the reply

➥ Tight coupling between sender and

receivers

➥ advantage: easy to understand model

➥ disadvantage: strong dependency, especially in case of error

➥ Prerequisites:

➥ reliable and fast network connection

➥ receiver process is available



Asynchronous Communication

➥ Receiver

activ

e

Sender

activ

e

request

Time

Sender is not blocked, can continue

immediately after sending the message

➥ Incoming messages are buffered at the

receiver

➥ Answers are optional

➥ receiver can reply asynchronously to

the sender

➥ More complex implementation and use as with synchronous

communication, but usually more efficient

➥ Only loose coupling between the processes

➥ receiver does not have to be ready for reception

➥ less dependent in case of errors



Client/Server Communication

operation

replymessage

ServerClient

request

messagedeterminerequest

send answer

select object, if needed

execute

(wait)

(continue)

execute method

➥ Mostly synchronous: client blocked until response arrives

➥ Variants: asynchronous (non blocking), one way (without answer)



Client/Server Communication

getRequest()

sendReply()

do

Op

era

tion

() operation

replymessage

ServerClient

request

messagedeterminerequest

send answer

select object, if needed

execute

(wait)

(continue)

execute method

➥ Mostly synchronous: client blocked until response arrives

➥ Variants: asynchronous (non blocking), one way (without answer)



Client/Server Communication: Request/Response Protocol

➥ Typical operations:

➥ doOperation() – send request and wait for result

➥ getRequest() – wait for request

➥ sendReply() – send result

➥ Typical message structure:

messageTyperequestIDobjectReferencemethodIDarguments

request / reply ?unique ID of request (usually int)reference to remote object (if needed)method to be called (int / String)arguments (usually as Byte array)

➥ request ID + sender ID result in unique message ID

➥ e.g. to map an answer to its query



Client/Server Communication: Error Handling

➥ Request and/or response messages may be lost

➥ Client sets a timeout when sending a request

➥ after expiration, request is usually sent again

➥ after a few repetitions: termination with exception

➥ Server discards duplicate requests if request has already been /

is still being processed

➥ For lost response messages:

➥ idempotent operations can be executed again

➥ otherwise: save results of operations in a history

➥ for repeated request: only resend the result

➥ delete history entries when next request arrives; if

necessary confirmations for results can also be used

2.2 Communication-oriented Middleware


➥ Focus: provision of a communication infrastructure for distributed

applications

➥ Tasks:

➥ communication

➥ dealing with heterogeneity

➥ error handling

Application

Communication oriented

Operating system / distributed system

middleware

2.2.1 Tasks of the Middleware


Communication

➥ Provision of a middleware protocol

➥ Localization and identification of communication partners

➥ Integration with process and thread management

Transport protocol (e.g. TCP)

Middleware protocol

Application protocol

Lower layers of the protocol stack

2.2.1 Tasks of the Middleware ...


Heterogeneity

➥ Problem with data transmission:

➥ heterogeneity in distributed systems

➥ Heterogeneous hardware and operating systems

➥ different byte order

➥ little endian vs. big endian

➥ different character encoding

➥ e.g.. ASCII / Unicode / UTF-8 / EBCDIC (IBM Mainframes)

➥ Heterogeneous programming languages

➥ different representation of simple and complex data types in

the main memory



Heterogeneity: Solutions (☞ RN I)

➥ Use of generic, standardized data formats

➥ known to all communication partners and middleware

➥ platform-specific formats for middleware (e.g. CDR for

CORBA) or external formats, e.g. XML

➥ Heterogeneity of hardware and operating system

➥ is handled transparently for the applications by the middleware

➥ Heterogeneity of programming languages

➥ applications need to convert data to higher-level format and

back (marshaling / unmarshaling)

➥ necessary code is usually generated automatically

➥ client stub / server skeleton



Error Handling

➥ Possible errors due to distribution

➥ incorrect transmission (incl. loss of messages)

➥ handled by the protocols of the distributed system:

➥ checksums, CRC➥ retransmission of packets (e.g. TCP)

➥ failure of components (network, hardware, software)

➥ handled by middleware or application:

➥ acceptance of the error➥ retransmission of messages

➥ replication of components (error avoidance)

➥ controlled termination of the application

2.2 Communication-oriented Middleware ...


2.2.2 Programming Models

➥ Programming model defines two concepts:

➥ communication model

➥ synchronous vs. asynchronous

➥ programming paradigm

➥ object-oriented vs. procedural

➥ Three common programming models for middleware:

➥ message-oriented model (asynchronous / arbitrary)

➥ remote procedure call (synchronous / procedural)

➥ remote method invocation (synchronous / object-oriented)

2.2.2 Programming Models ...


Message-Oriented Model

➥ Sender puts message in receiver’s queue

Sender

Message

Message queue

Message

Receiver

➥ Receiver accepts message as soon as he is ready

➥ Extensive decoupling of transmitter and receiver

➥ No method or procedure calls

➥ data is packed and sent by the application

➥ no automatic reply message



Remote Procedure Call (RPC)

➥ Allows a client to call a procedure in a remote server process

...P(a) {

return b;}

y = P(x);Input parameters

processClient

processServer

Results

➥ Communication according to request / response principle

Remote Method Invocation (RMI)

➥ Allows an object to call methods of a remote object

➥ In principle very similar to RPC



Common Basic Concepts of Remote Calls

➥ Client and server are decoupled by interface definition

➥ defines names of calls, parameters and return values

➥ Introduction of client stubs and server skeletons as an access

interface

➥ are automatically generated from interface definition

➥ IDL compiler (IDL = interface definition language)

➥ are responsible for marshaling / unmarshaling

as well as for the actual communication

➥ realize access and location transparency



How Client Stub and Server Skeleton Work (RPC)

Client stub Server skeleton

P(a) {y=P(x)

...P(a) {

return b;}

; ;

Client process

return b;}

receive(m1);

client=sender(m1);

unpack argument xfrom message

y = P(x)

}

pack argument ainto message

send(Server, m1);

receive(Server, m2)

unpack result bfrom message

while (true) {

send(Client, m2);

pack result y

Server process

into message



Basis of RMI: The Proxy Pattern

➥ Client works with a deputy object (proxy) of the actual server

object

➥ proxy and server object implement the same interface

➥ client only knows / uses this interface

Client Proxy Object

Interface<<interface>>



Flow of a Remote Method Call

Proxy

Skeleton callsthe samemethod onthe object

Client−OS

Client

Network

Server−BS

Server

Skeleton

Server nodeClient node

Object

Status

Method

Same interfaceas real object

Interface

Client callsa method

Packed request is sent over the network(object ID, method name, parameters)



Creation of a Client/Server Program

Server

Client

Compiler

Compiler

Client stubs

IDL compiler

Server skel.

RuntimeRPC/RMI

Serverprocedures

Client

library

Interfacedescription

program

➥ Applies in principle to all realizations of remote calls



2.2.3 Middleware Technologies

➥ Realize (at least) one of the programming models

➥ rely on open standards / standardized interfaces

➥ Remote procedure call

➥ SUN RPC, DCE RPC, Web Services (☞ CSP: 7), ...

➥ Remote method invocation

➥ Java RMI (☞ 3), CORBA (☞ CSP: 3), ...

➥ Message-oriented middleware technologies

➥ MOM: message oriented middleware, messaging systems

➥ mainly for EAI

➥ Java Message Service, WebSphereMQ (MQSeries), ...



2.2.4 Message Oriented Middleware (MOM)

➥ Middleware technology for the message-oriented model

➥ In addition to message exchange also other services, especially

queue management

interfaceAccess

interfaceAccess

Sender ReceiverMessage queues

Message queuemanager

Protocol stack

Middleware protocol (proprietary)

2.2.4 Message Oriented Middleware (MOM) ...


Message Queue Infrastructure

➥ Access to queues is only possible locally

➥ local: same computer or same subnet

➥ Transport of messages across subnet boundaries by queue

administrators (routers)

Manager Manager

Manager

Sender Receiver

ReceiverSender



Variants of message exchange

➥ Point-to-point communication

➥ communication between two defined processes

➥ simplest model: asynchronous communication

➥ enhancement: request/reply model

➥ enables synchronous communication via asynchronousmiddleware

➥ Broadcast communication

➥ Message is sent to all reachable receivers

➥ one implementation: publish/subscribe model

➥ publishers publish messages/news on a topic

➥ subscribers subscriber to certain topics

➥ mediation via a broker



Example: Java Message Service

➥ Part of the Java Enterprise Edition (Java EE)

➥ Unified Java interface for MOM services

➥ Distinguishes two roles:

➥ JMS provider: the respective MOM server

➥ JMS client: sender or receiver of messages

➥ JMS supports:

➥ asynchronous point-to-point communication

➥ request/reply model

➥ publish/subscribe model

➥ JMS defines corresponding access objects and methods



2.2.5 Summary

➥ Tasks: Communication, dealing with heterogeneity, error handling

➥ Programming models:

➥ message-oriented model (asynchronous)

➥ basis: message queues

➥ refinements:➥ request/reply model (synchronous)

➥ publish/subscribe model (broadcast)

➥ remote procedure or method calls

➥ synchronous: request and response

➥ generated stubs for (un-)marshaling

2.3 Application-oriented Middleware


➥ Based on communication-oriented middleware

➥ Extends it by:

➥ runtime environment

➥ services

➥ component model

Runtime environment ServicesServices

Component model

Communication infrastructure

Operating system / distributed system

componentApplication

component componentApplication Application

2.3.1 Runtime environment


➥ Based on node operating systems of the distributed system

➥ Operating system (OS) manages processes, memory, I/O, ...

➥ provides basic functionality

➥ starting / stopping processes, scheduling, ...

➥ interprocess communication, synchronization, ...

➥ Runtime environment extends functionality of the OS:

➥ improved resource management

➥ e.g. concurrency, connection management

➥ improved availability

➥ improved security mechanisms

2.3.1 Runtime environment ...


Resource management

➥ Middleware goes beyond simple OS functionality

➥ e.g. independently managed main memory areas with

individual security criteria

➥ pooling of processes, threads, connections

➥ are created for stock and made available as required

➥ possible, since middleware is specific to certain classes of

applications

➥ Goal: improved performance, scalability and availability



Concurrency

➥ Concurrency in this context:

➥ isolated parallel processing of requests

➥ Concurrency can be implemented via processes or threads

➥ threads (lightweight processes): concurrent activities within

processes

➥ threads in the same process share all resources

➥ advantages and disadvantages:

➥ processes: high resource requirements, not well scalable,

good protection, with low concurrency

➥ threads: well scalable, no mutual protection, with high

concurrency



Concurrency ...

➥ Middleware takes over automatic generation / administration ofthreads in the case of concurrent orders, e.g.

➥ single-threaded

➥ only one thread, sequential processing

➥ thread-per-request

➥ a new thread is created for each request

➥ thread-per-session

➥ a new thread is created for each session (client)

➥ thread pool

➥ fixed number of threads, incoming requests are distributedautomatically

➥ saves thread generation costs➥ limits resource consumption



Connection management

➥ Connection here means: endpoints of communication channels

➥ occur at tier boundaries (between process spaces)

➥ e.g. client/server interface, database access

➥ are assigned to a process/thread, if in the active state

➥ require resources (memory, processor time)

➥ opening and closing connections is costly

➥ To save resources: pooling of connections

➥ connections are initialized to stock and placed in pool

➥ each thread/process receives a connection on demand

➥ after use: return connection to pool



Availability

➥ Requirement to the application,

but mainly implemented by the runtime environment

➥ Downtimes are caused by

➥ failure of a hardware or software component

➥ overload of a hardware or software component

➥ maintenance of a hardware or software component

➥ Frequent technology for ensuring availability: cluster

➥ replication of hardware and software

➥ cluster appears externally as one unit

➥ two types: fail-over cluster / load-balancing cluster



Security

➥ Distributed applications are vulnerable due to their distribution

➥ Middleware supports different security models

➥ Security requirements:

➥ authentication:

➥ proves the identity of the user / a component

➥ e.g. by password query (for users) or cryptographic

techniques and certificates (for components)

➥ authorization:

➥ definition of access rights for users to specific services

➥ or more fine grained: methods and attributes

➥ requires secure authentication



Security ...

➥ Security requirements ...:

➥ confidentiality

➥ information cannot be intercepted during transmission in

the network➥ technique: encryption

➥ integrity

➥ transmitted data cannot be changed without being noticed

➥ techniques: cryptographic checksum (message digest,

fingerprint), digital signature

➥ digital signature also ensures authenticity of the sender



Security ...

➥ Security mechanisms:

➥ encryption

➥ symmetric (e.g. IDEA, AES)➥ same key for encryption and decryption

➥ asymmetric (public key algorithms, e.g. RSA)➥ public key for encryption➥ private key for decrypting

➥ digital signature

➥ ensures integrity of a message and authenticity of thesender as well as nonrepudiation

➥ certificate➥ certifies that public key and person (or component) belong

together

2.3.2 Services


Name service (directory service) (☞ 4)

➥ Publication of available services

➥ in the intranet or Internet

➥ Assignment of names to references (addresses)

➥ name serves as a unique / unchangeable identifier

➥ the client can request the address of a service via its name

➥ address can change e.g. at restart

➥ goal: decoupling of client and server

➥ Examples: JNDI, RMI registry, CORBA interoperable naming

service, UDDI registry, LDAP server, ...

2.3.2 Services ...


Session management

➥ In interactive systems: each instance of a client is assigned itsown session

➥ deleted when logging out or closing the client

➥ Session stores all relevant data (in main memory)

➥ e.g. identification of the user, browser type, ”‘shopping cart”’, ...

➥ data stored in the server or in the client

➥ transient data: deleted at the end of the session

➥ persistent data: is written to a data carrier (database) at theend of the session.

➥ Middleware implements/supports the assignment of requests tosessions (often transparent)

➥ e.g. cookies, HTTP-sessions, session beans, ...

2.3.2 Services ...


Transaction management (☞ 7.4)

➥ Service for interactive, data-centric applications

➥ consistency / integrity of data is important

➥ this means that the entire (maybe distributed) dataset must

represent a valid state in itself

➥ Typical sequence in applications:

1. client requests data

2. client changes the data

3. client requests that the data be rewritten

➥ problem: steps 1-3 could be performed by two clients at the

same time

➥ Transaction management allows execution of a sequence ofactions as an atomic unit

2.3.2 Services ...


Persistence service

➥ Persistence: all measures for the permanent storage of main

memory data

➥ Persistence service: intelligent interface to the database

➥ integrated in middleware or as an independent component

➥ most important service for data-centered applications besides

transaction management

➥ Most common type: object-relational mapper (OR-Mapper)

➥ maps objects in the main memory to tables in a relational

database

➥ mapping rules are defined by application developers

2.3.2 Services ...


Persistence service ...

Var5Var4Var3Var2Var1Table B

Var1 Var2 Var3 Var4Table A

Var1Var2Var3

Object A Var1Var2

Object B

Var1Var2

Object C

OR mapper

Mai

n m

emor

yD

ata

base

2.3.3 Component model


➥ Components: “large” objects for structuring applications

➥ A component model defines:

➥ the term “component”

➥ structure and properties of the components

➥ mandatory and optional interfaces

➥ interface contracts

➥ how do components interact with each other and with the

runtime environment?

➥ component runtime environment

➥ management of the life cycle of components

➥ implicit provision of services: component only specifies its

requirements (e.g. persistence)

2.3.4 Middleware Technologies


➥ Object request broker (ORB)

➥ distributed objects, remote method calls

➥ variety of services, only basic runtime environment

➥ example: CORBA

➥ Application server

➥ focus: support of application logic (middle tier)

➥ services, runtime environment, and component model

➥ today only as part of a middleware platform

➥ Middleware platforms

➥ extension of application servers: support of all tiers

➥ distributed applications as well as EAI

➥ examples: Java EE/EJB, .NET/COM, CORBA 3.0/CCM

2.3.5 Summary


Application-oriented middleware

➥ Runtime environment

➥ resource management, availability, security

➥ Services

➥ name service, session management, transaction

management, persistence service

➥ Component model

➥ defintion of components, interface contracts, runtime

environment



3 Distributed Programming with Java RMI

3 Distributed Programming with Java RMI ...


Content

➥ Introduction

➥ Hello World with RMI

➥ RMI in detail

➥ classes and interfaces, stubs, name service, parameter

passing, factories, callbacks, ...

➥ Deployment: loading remote classes

➥ Java remote class loader and security manager



Literature

➥ WWW documentation and tutorials from Oracle

➥ http://docs.oracle.com/javase/6/docs/api/index.html?

java/rmi/package-summary.html

➥ http://www.oracle.com/technetwork/java/javase/tech/

index-jsp-136424.html

➥ Hammerschall: Ch.. 5.2

➥ Farley, Crawford, Flanagan: Ch. 3

➥ Horstmann, Cornell: Ch. 5

➥ Orfali, Harkey: Ch. 13

➥ Peter Ziesche: Nebenlaufige & verteilte Programmierung,

W3L-Verlag, 2005. Ch. 8

http://docs.oracle.com/javase/6/docs/api/index.html?java/rmi/package-summary.html

http://docs.oracle.com/javase/6/docs/api/index.html?java/rmi/package-summary.html

http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136424.html

http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136424.html

3.1 Introduction


➥ Java RMI is an integral part of Java

➥ allows use of remote objects

➥ Elements of Java RMI:

➥ remote object implementations

➥ client interfaces (stubs) to remote objects

➥ server skeletons for remote object implementations

➥ name service to locate objects in the network

➥ service for automatically creating (activating) objects

➥ communication protocol

➥ Java interfaces for the first five elements

➥ in the package java.rmi and its subpackages

3.1 Introduction ...


➥ Java RMI requires that all objects (i.e., client and server) are

programmed in Java.

➥ in contrast to, e.g., CORBA

➥ Advantage: seamless integration into the language

➥ use of remote objects is (almost!) identical to local objects

➥ including distributed garbage collection

➥ Integration of objects in other programming languages:

➥ “wrapping” in Java code via Java Native Interface (JNI)

➥ use of RMI/IIOP: interoperability with CORBA

➥ direct communication between RMI and CORBA objects



Distributed Objects

local reference

JVM 1 JVM 2 JVM 3

Node 1 Node 2

➥ Remote references can be used just like local references

➥ Objects can occur in client and server roles



Distributed Objects

remote referencelocal reference

JVM 1 JVM 2 JVM 3

Node 1 Node 2

➥ Remote references can be used just like local references

➥ Objects can occur in client and server roles



3.1.1 RMI ArchitectureR

MI s

yste

m

Client Server

Remote Remote

SkeletonStub

Remote reference

Stub / skeletonlayer

layer

RMI transport layer

referencemanager

referencemanager

3.1.1 RMI Architecture ...


Stub/Skeleton Layer

➥ Stub: local proxy object for the remote object

➥ Skeleton: receives calls and forwards them to the correct object

➥ Stub and skeleton classes are automatically generated from an

interface definition (Java interface)

➥ As of JDK 1.2: skeleton class is generic

➥ skeleton uses reflection mechanism of Java to call methods of

server object

➥ reflection allows you to query the method definitions of a class

and to generically call methods at runtime

➥ As of JDK 1.5: stub classes are created at runtime

➥ with the Java class Proxy



Remote Reference Layer

➥ Defines call semantics of RMI

➥ in JDK 1.1: unicast only, point-to-point

➥ call is routed to exactly one existing object

➥ as of JDK 1.2 also activatable objects

➥ object will be (re-)activated first, if necessary

➥ new object, state is restored from hard disk

➥ also possible: multicast semantics

➥ proxy sends request to a set of objects and returns the first

response

➥ Also: connection management, distributed garbage collection



Transport Layer

➥ Connections between JVMs

➥ basis: TCP/IP streams

➥ Proprietary protocol: Java Remote Method Protocol (JRMP)

➥ allows tunneling the connection via HTTP (due to firewalls)

➥ allows you to define your own socket factory, e.g. to use

Transport Layer Security (TLS or SSL)

➥ As of JKD 1.3 also RMI-IIOP

➥ uses IIOP (Internet Inter-ORB Protocol) from CORBA

➥ thus: direct interoperability with CORBA objects



3.1.2 RMI Services

➥ Name service: RMI Registry

➥ registers remote references to RMI objects under freely

selectable unique names

➥ a client can then get the corresponding reference for a name

➥ technical: registry sends serialized proxy object (client

stub) to the client.

➥ the location of the required class files may also be

transferred (see 3.4.1)

➥ RMI can also be used with other naming services, e.g. via

JNDI (Java Naming and Directory Interface)

3.1.2 RMI Services ...


➥ Object Activation Service

➥ usually: remote reference to RMI object is only valid as long as

the object exists

➥ if the server or the server JVM crashes: object references

become invalid➥ references change on restart!

➥ RMI Activation Service introduced with JDK 1.2

➥ starts server objects on request of a client

➥ server object must register an activation method with the

RMI Activation Daemon

3.1.2 RMI Services ...


➥ Distributed Garbage Collection

➥ automatic garbage collection of Java also works for remote

objects

➥ server-side JVM manages a list of remote references to

objects

➥ references are “leased” for a certain time

➥ reference counter of the object is decremented, if

➥ client deletes the reference (e.g., end of the lifetime of the

reference variable), or

➥ client does not renew the lease in time➥ reason: remote reference layer cannot explicitly “log off”

an object, if the client crashes➥ default setting: 10 min.

3.2 Hello World with Java RMI


Structure:

Client JVM Server JVM

class HelloClient {

s = h.sayHello();

Client class

...

Hello h;...

...

Server classclass HelloServer

String sayHello() {return "Hello World";

}...

Interface

interface Hello {String sayHello();

}

implements Hello {

3.2 Hello World with Java RMI ...


Development Process:

1. Design the interface for the server object

2. Implement the server class

3. Develop the server application to include the server object

4. Develop the client application with calls to the server object

5. Compile and start the system



Designing the Interface for the Server Object

➥ Specified as normal Java interface

➥ Must extend java.rmi.Remote

➥ no inheritance of operations, only marking as remote interface

➥ Each method must declare to raise the exception

java.rmi.RemoteException (or a base class of it)

➥ base class for all errors that may occur

➥ in the client, during transmission, in the server

➥ No restrictions compared to local interfaces

➥ but: semantic differences (parameter passing!)



Hello-World Interface

RemoteExceptionindicates error in theremote object orduring communication

Marker interface,contains no methods,marks interface asRMI interface

public interface Hello extends Remote {

String sayHello() throws RemoteException;

import java.rmi.RemoteException;

import java.rmi.Remote;

}



Implementing the Server Class

➥ A class that is to be usable remotely must:

➥ implement a given remote interface

➥ usually extend java.rmi.server.UnicastRemoteObject

➥ defines call semantics: point-to-point

➥ have a constructor that declares to throw a RemoteException

➥ creation of object must be done in a try-catch block

➥ Methods usually do not need to specify throws RemoteException

➥ because they don’t throw the exception themselves



Hello-World Server (1)

Remote method

import java.rmi.*;

import java.rmi.server.UnicastRemoteObject;

public class HelloServer extends UnicastRemoteObject

implements Hello {

super();

}

return "Hello World!";

}

public HelloServer() throws RemoteException {

public String sayHello() {



Development of the Server Application to Include the ServerObject

➥ Tasks:

➥ creating a server object

➥ registering the object with the name service

➥ under a specified public name

➥ Typically not a new class, but main method of the server class



Hello-World Server (2)

server object

Create the Register the server object

under the name "Hello−Server"

with the name server (RMI registry,local host, port 1099)

public static void main(String args[]) {

try {

HelloServer obj = new HelloServer();

catch (Exception e) {

System.out.println("Error: " + e.getMessage());

e.printStackTrace();

}

}

}

Naming.rebind("rmi://localhost/Hello−Server", obj);

}



Development of the Client Application with Calls to the ServerObject

➥ Client must first use the name service to get a reference to the

server object from the name service

➥ type cast to the correct type required

➥ Then: any method can be called

➥ no syntactical differences to local calls

➥ Note: client can get remote references in other ways as well

➥ e.g. as return value of a remote method



Hello-World Client

Get object referencefrom name server

Call the methodon the remote object

public static void main(String args[]) {public class HelloClient {

try {

import java.rmi.*;

(Hello)Naming.lookup("rmi://bspc02/Hello−Server");

String message = obj.sayHello();System.out.println(message);}catch (Exception e) {...

}}}

Hello obj =



Compiling and Starting the System

➥ Compiling the Java sources

➥ source files: Hello.java, HelloServer.java,

HelloClient.java

➥ invocation: javac *.java

➥ creates Hello.class, HelloServer.class,HelloClient.class

➥ Creating the client stub (proxy object)

➥ for clients up to JDK 1.4:

➥ invocation: rmic -v1.2 HelloServer

➥ creates HelloServer Stub.class

➥ as of JDK 1.5: client creates proxy class at runtime

➥ using java.lang.reflect.Proxy



Compiling and Starting the System ...

Client side Server sideup

to J

DK

1.4

HelloServer.java

HelloClient.class Hello.class Hello.class HelloServer.class

javac javac

HelloClient.java Hello.java

HelloServer_Stub.class

rmic



Compiling and Starting the System ...

➥ Starting the naming service

➥ invocation: rmiregistry [port]

➥ for security reasons, objects can only be registered on the

local host

➥ i.e. RMI registry must run on server computer

➥ standard port: 1099

➥ Starting the server

➥ invocation: java HelloServer

➥ Starting the client

➥ invocation: java HelloClient



Execution of the Example

HelloClient

rmiregistryTestFoo

Remoteobject

HelloServer

Clientcomputer

Servercomputer




Name and

(serialized)

Naming.rebind(...)

objectstubHelloClient

rmiregistryTestFoo

Remoteobject

HelloServer

Clientcomputer

Servercomputer




Hello server

Name and

(serialized)

Naming.rebind(...)

objectstubHelloClient

rmiregistryTestFoo

Remoteobject

HelloServer

Clientcomputer

Servercomputer




Naming.lookup(...)

NameHello server

HelloClient

rmiregistryTestFoo

Remoteobject

HelloServer

Clientcomputer

Servercomputer




(serialized)objectstub

Naming.lookup(...)

NameHello server

HelloClient

rmiregistryTestFoo

Remoteobject

HelloServer

Clientcomputer

Servercomputer




objectStub

(serialized)objectstub

Naming.lookup(...)

NameHello server

HelloClient

rmiregistryTestFoo

Remoteobject

HelloServer

Clientcomputer

Servercomputer




obj.sayHello()

objectStub

Hello server

HelloClient

rmiregistryTestFoo

Remoteobject

HelloServer

Clientcomputer

Servercomputer

3.3 RMI in Detail


3.3.1 Classes and Interfaces

uses

java.lang java.io

java.rmi.serverjava.rmi

HelloServer

RemoteServer

RemoteStub

HelloClient

RemoteObject

Object IOException

Remote<<interface>>

UnicastRemoteObject

RemoteException

Naming

3.3.1 Classes and Interfaces ...


Interface Remote

➥ Every remote object must implement this interface

➥ Does not provide methods, serves only as a marker

Class RemoteException

➥ Superclass for all exceptions that can be triggered by the RMI

system, for example, with

➥ communication errors (server not reachable, ...)

➥ (un-)marshalling errors

➥ protocol errors

➥ Each remote method must specify RemoteException (or a base

class of it) in the throws clause



Class RemoteObject

➥ Base class for all remote objects

➥ Redefines the methods equals, hashCode, and toString

➥ toStub() returns a reference to the stub object

➥ getRef() returns remote reference (= Java class)

➥ used by the stub to call methods

Class RemoteServer

➥ Base class for all server implementations

➥ UnicastRemoteObject, Activatable

➥ Method getClientHost(): host address of the client of the

current RMI call

➥ setLog() and getLog(): logging of RMI calls



Class UnicastRemoteObject

➥ Implements remote object with the following properties:

➥ references to the object are only valid as long as server

process (JVM) is still running

➥ client call is routed to exactly one object (via TCP connection),

no replication

➥ Constructor allows definition of port and socket factories

➥ so that e.g. connections via TLS/SSL can be realized

➥ Static method exportObject() makes object available via RMI

➥ Static method unexportObject() cancels availability

Class RemoteStub

➥ Base class for all client stubs



Class Naming

➥ Allows easy access to RMI registry

➥ Important methods:

➥ bind() / rebind(): registers object under given name

➥ lookup(): get object reference to a name

➥ Names are given in URL format

➥ also define the host and port of the RMI registry.

➥ structure of the URL:

Port of the RMI registryName of the registered Object

Host of RMI registryProtocol (always rmi)

rmi:// bspc02:1234/Hello

3.3 RMI in Detail ...


3.3.2 Special Features of Remote Classes

➥ Comparison of remote objects

➥ Comparison with == refers only to the stub objects

➥ Result is false, even if both stubs refer to the same remote

object

➥ comparison with equals() returns true if both stubs refer to

the same remote object

JVM1

stub1

stub2 stub1.equals(stub2)

stub1 != stub2

JVM2

objectRemote

3.3.2 Special Features of Remote Classes ...


➥ Method hashCode()

➥ used by container classes HashMap, HashSet and others

➥ Hash code is calculated only from the object identifier of the

remote object

➥ same remote object ⇒ same hash code

➥ but the content of the object is ignored

➥ consistent with behavior of equals()

➥ Cloning objects

➥ cloning of the remote object is not possible by calling clone()

on the stub

➥ cloning of stubs neither necessary nor meaningful



3.3.3 Parameter Passing

➥ Parameters passed to remote methods

➥ either via call-by-value

➥ or via call-by-reference

➥ The mechanism used depends on the type of the parameter

➥ Final decision may only be made at runtime!

➥ The return of the result follows the same rules as for parameter

passing

3.3.3 Parameter Passing ...


Parameter Passing for Local Methods

➥ Java supports two kinds of types:

➥ value types: simple data types

➥ boolean, byte, char, short, int, long, float, double

➥ are passed to local methods by value

➥ that is, the method receives a copy of the value

➥ reference types: classes (incl. String and arrays)

➥ are passed to local methods by reference

➥ that is, the method works on the original object

and can also change object if required



Parameter Passing for Remote Methods

➥ Value types: are always passed by value

➥ Reference types: dependent on the concrete object

➥ object can be serialized: call-by-value

➥ object belongs to a class that implements the Remote interface:

call-by-reference

➥ neither: error (java.rmi.MarshalException)

➥ both: ??! (this case is to be avoided!)

➥ decision is made only at runtime



Call-by-Value (Serializable Objects)

➥ Class must implement interface java.io.Serializable

➥ Serializable objects can be transferred over a network

➥ only the data is transferred, the code (class file) must be

available at the receiver!

➥ Default serialization of Java:

➥ all attributes of the object are serialized and transferred

➥ recursive procedure!

➥ prerequisite: all attributes and all base classes can be

serialized

➥ Application specific serialization is possible:

➥ implement the methods writeObject and readObject



Passing a Serializable Object

param

Original

Clientobject param Stub

objectServerobject

Independentcopy

<<create>>

op(param)

Skele−ton

Networkconnection

op(param)

m()

paramserialize

paramdeserialize



Call-by-Reference (Remote Objects)

➥ Class of the parameter object must implement an interface that

extends Remote

➥ parameter type must be this interface

➥ class is typically derived from UnicastRemoteObject

➥ A serialized stub object is transferred

➥ from JDK 1.5: stub class is created dynamically

➥ (up to JDK 1.4 stub class must be generated by rmic and be

available at the server)

➥ If the server calls methods on the parameter object:

➥ calls are routed to the original object using RMI



Passing a Remote Object

paramstub

Stubobject

paramClientobject

paramstub

Serverobject

<<create>>

op(paramStub)

Skele−ton

Networkconnection

op(param)

toStub(param)

paramStub

m()

serializeparamStub

paramStubdeserialize



Examples

➥ See WWW:

➥ Hello-World with call-by-value parameter

➥ Hello-World with call-by-reference parameter

http://www.bs.informatik.uni-siegen.de/web/wismueller/vl/gen/csp/code/RMI/BYVALUE.zip

http://www.bs.informatik.uni-siegen.de/web/wismueller/vl/gen/csp/code/RMI/BYREFERENCE.zip



Arrays and Container Objects

➥ Arrays and container objects (from the Java Collection

Framework, java.util) can be serialized

➥ i.e., they will be reinstantiated at the receiver

➥ To the elements of the array / container the same rules apply as

to simple parameters

➥ for mixed content: elements are passed by value or by

reference depending on their actual class



3.3.4 Remote Object References as Results

➥ Frequently: via RMI registry, the client receives a reference to a

remote object, which provides references to other objects

➥ the remote object may also create these objects on demand

(this is called factory object or factory class)

➥ Example: server for bank accounts

➥ registration of all account objects with RMI registry not useful

➥ instead: registration of a manager object that returns the

reference to the account object for a given account number

➥ if necessary, it can create a new object (from a database)

➥ Note: RMI does not allow remote object creation

➥ client cannot create objects on a remote host



3.3.5 Client Callbacks

➥ Frequently: server wants to make calls in the client

➥ e.g. progress bar, queries, ...

➥ For this: client object must be an RMI object

➥ pass this reference to the server method

➥ In some cases, you cannot inherit from UnicastRemoteObject,

e.g. for applets

➥ then: export the object using

UnicastRemoteObject.exportObject(obj,0);

➥ attention: when calling exportObject(obj), no dynamic stub

is created, even with JDK 1.5 and later

➥ Example code: see WWW (Hello-World with callback)

http://www.bs.informatik.uni-siegen.de/web/wismueller/vl/gen/csp/code/RMI/HELLO-CALLBACK.zip



3.3.6 RMI and Threads

➥ RMI does not specify how many threads are provided on the

server side for method calls

➥ only one thread, one thread per call, ..

➥ This means that several server methods can be active at the

same time

➥ requires correct synchronization (synchronized)!

➥ Client-side locking of a remote object using a synchronized block

is not possible

➥ only local stub is locked

➥ a lock must be implemented using methods of the remote

object if necessary

3.4 Deployment


➥ Deployment: distribution, transfer and installation of the

components of a distributed application

➥ specifically for RMI: which class file has to go where?

➥ Server, RMI registry and client need the class files for:

➥ the remote interface of the server

➥ all classes or interfaces that are used in the server interface

(recursively)

➥ (up to JDK 1.4 also the stub classes for all used remote

interfaces)

3.4 Deployment ...


➥ Client and server additionally need the class files for:

➥ their own implementation

➥ all classes of serializable objects that they receive

➥ as a parameter or a result of method calls

➥ (up to JDK 1.4 also the stub classes for all remote objects they

receive)

➥ Problems with static installation of class files for serialized

objects (and stubs):

➥ dependency between client and server

➥ method parameters, result objects

➥ change of classes requires new installation

➥ nullifies an advantage of distributed applications

3.4.1 Remote Class Loading in Java RMI


Class loader

➥ Class loaders are used for loading classes (and interfaces) at

runtime

➥ more exactly: for loading class files

➥ Each class is loaded only once

➥ Class loaders are Java objects themselves

➥ base class: java.lang.ClassLoader

➥ RMI uses its own class loader

➥ java.rmi.RMIClassLoader

3.4.1 Remote Class Loading in Java RMI ...


Remote Loading of Classes

➥ RMIClassLoader allows to load classes also from remote

computers

➥ via HTTP (web server) or FTP

➥ URL is defined via codebase property when the JVM is started

➥ Allows central storage of the necessary files

➥ “automatic” deployment

➥ Restrictions:

➥ all classes named in the client code must be available locally

➥ client must define its own security manager



Example

main()

<<interface>>BankServer

getDate(): ...getAmount(): ...getText(): ...

UnicastRemoteObject

BankServerImpl

<<interface>>

Serializable

Entry<<interface>>

BankClient

EntryImpl AccountImpl

getAccount(...): Account

{Remote}

getStatement(...): Entry

{Remote}Account

➥ The class files for BankServer, Account and Entry must beavailable locally at the client (BankClient)

➥ EntryImpl can be remotely loaded by client



Local and Remote Loadable Classes

➥ Local loading (via CLASSPATH) must be possible (for client and

server) for:

➥ all classes (and interfaces) named in the client/server code,

all classes mentioned by name in those classes, ..

➥ i.e. everything that is needed to compile the code

➥ Remotely loadable:

➥ stub classes of remote objects

➥ subclasses accessed only via polymorphism

➥ i.e. the code only uses a superclass or interface

➥ The RMI registry can load all required classes remotely



Example: Hello-World with Callback and Result Object

➥ Interfaces (see WWW):

public interface Hello extends Remote{

HelloObj getHello(AskUser ask) throws RemoteException;}

public interface AskUser extends Remote{

boolean ask(String question) throws RemoteException;}

public interface HelloObj{

void sayIt();}

http://www.bs.informatik.uni-siegen.de/web/wismueller/vl/gen/csp/code/RMI/HELLO-CALLBACK-REMOTE.zip



Example: How are the Classes Loaded?

➥ Interfaces Hello.class, AskUser.class, HelloObj.class

➥ must be available locally at the client

➥ can be loaded remotely by the RMI registry

➥ Implementation HelloObjImpl.class of HelloObj

➥ can be loaded remotely by the client

➥ is not required by RMI registry

➥ Stub classes for the two remote interfaces

➥ are usually generated dynamically (not loaded) as of JDK 1.5

➥ but can also be loaded remotely



Example: Necessary Changes in the Client

➥ Using the RMI security manager:

public static void main(String args[]) {

System.setSecurityManager(new RMISecurityManager());

➥ Definition of security policy: policy file:

grant {permission java.net.SocketPermission "myserver:1024-",

"connect,accept";

permission java.net.SocketPermission "www.bsvs.de:80",

"connect";};

➥ grants local classes (client!) the following permissions:

➥ connection to/from myserver on non-privileged ports:➥ RMI registry (1099), server and callback object (dyn.)

➥ connection to the web server (port 80)



Example: Deployment

➥ All classes to be loaded remotely are packed into one archive

➥ The archive is made available via a web server

➥ Start the server with a codebase, e.g:

➥ java -Djava.rmi.server.codebase="http://www.bsvs.de/

jars/HelloServer.jar" HelloServer

➥ the codebase property specifies the URL to the JVM underwhich the classes are to be loaded

➥ server passes codebase to RMI registry when registering theserver object

➥ RMI registry passes codebase to client

➥ Start of the client with specification of the policy file, e.g:

➥ java -Djava.security.policy=policy HelloClient



Procedure for Transferring Objects

locally]loaded[Class

true

true

false

false

false

[Class in

true

false

codebase

Determineclass loader

object

ClassNotFoundException

object

Useloadedclass

true Load classfrom

codebase*

Send object Receive serialized object

Serialize

Send

[Codebaseavailable]

[Classavailablelocally]

RAM]

Loadclasslocally

Deserialize

* must be specifiedexplicitly as of

JDK 1.7

3.4.2 Java Security Manager


➥ JVM can be equipped with a security manager if required

➥ for Java applications: System.setSecurityManager()

➥ for Java applets: by default

➥ Security manager checks, among other things, whether the

application is allowed to

➥ access a local file,

➥ establish a network connection,

➥ stop the JVM,

➥ create a class loader,

➥ read AWT events, ...

➥ Permissions are specified in a security policy

➥ if the specifications are violated: exception

3.4.2 Java Security Manager ...


Security Policy

➥ Assigns permissions to codes from specific sources

➥ Code source can be described by two properties:

➥ code location: URL where the code was loaded from

➥ certificates (for signed code)

➥ Permissions allow access to certain resources

➥ permissions are modeled by objects, but are usually specified

in the policy file

➥ e.g. FilePermission p =

new FilePermission("/tmp/*", "read,write");

➥ or permission java.io.FilePermission "/tmp/*",

"read,write";



Hierarchy of Permission Classes (JDK 1.2)

Permission

BasicPermission FilePermission SocketPermissionAllPermission

Property

Permission

Net

Permission

Reflected

Permission

Runtime

Permission

Logging

Permission

AWT

Permission

Security

Permission

Permission

Serializable

Permission

Auth

Audio

Permission

SQL

Permission

Use onlyfor testing!!!



Policy File

grant {permission java.net.SocketPermission "www.bsvs.de:80","connect";

};

grant codebase "file:" {permission java.io.FilePermission "/home/tom/-","read, write";

permission java.io.FilePermission "/bin/*", "execute";};

grant codebase "http://www.bsvs.de/jars/HelloServer.jar" {permission java.net.SocketPermission "localhost:1024-","listen, accept, connect";

};



Policy File ...

➥ All classes are allowed to:

➥ establish connections to www.bsvs.de, port 80

➥ Locally loaded classes may:

➥ read and write files in /home/tom or (recursively) a

subdirectory of it

➥ execute files in the /bin directory

➥ Classes loaded from http://www.bsvs.de/jars/

HelloServer.jar are allowed to:

➥ accept / establish network connections on / to the local

computer via non-privileged ports (1024 or higher)



Further Documentation

➥ General information on policy files::

http://docs.oracle.com/javase/8/docs/technotes/guides/

security/PolicyFiles.html

➥ Overview of the permission classes::

http://docs.oracle.com/javase/8/docs/technotes/guides/

security/permissions.html

➥ Java API documentation::

http://docs.oracle.com/javase/8/docs/api/

http://docs.oracle.com/javase/8/docs/technotes/guides/security/PolicyFiles.html

http://docs.oracle.com/javase/8/docs/technotes/guides/security/PolicyFiles.html

http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html

http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html

http://docs.oracle.com/javase/8/docs/api/



3.5 Summary

➥ RMI allows access to remote objects

➥ transparent, via proxy objects

➥ proxy classes are generated automatically

➥ usually at runtime

➥ Parameter passing semantics

➥ by value, if parameter object can be serialized

➥ by reference, if parameter object is an RMI object

➥ Classes can also be loaded remotely (security manager!)

➥ Name service: RMI registry

➥ Security: RMI over SSL is possible, but not ideal



4 Name Services

4 Name Services ...


Content

➥ Basics

➥ Example: JNDI

Literature

➥ Tanenbaum, van Steen: Ch. 4.1

➥ Farley, Crawford, Flanagan: Ch. 7

➥ http://docs.oracle.com/javase/tutorial/jndi/overview

http://docs.oracle.com/javase/tutorial/jndi/overview

4.1 Basics


Names, Addresses and IDs

➥ Name: character or bit sequence that refers to a unit

➥ unit: e.g. computer, printer, file, user, website, ...

➥ Address: name of the entry point of a unit

➥ entry point allows access to the unit

➥ several entry points per unit are possible

➥ entry point may change over time

➥ A position-independent name identifies a unit independently

from its entry point

➥ ID: name with the following properties:

➥ ID refers to at most one unit, unit has at most one ID

➥ ID always refers to the same unit (not reused)

4.1 Basics ...


Namespaces

➥ represented by directed, labelled graph

"/home/steen/mbox"

"/keys""/home/steen/keys"elke

maxsteen

home keys

mbox.twmrc

n0

n1

n2 n3 n4

n5

keys

n6 n7

➥ leaf node: named unit,

with information / status

if required

➥ inner node: directory node

➥ edges are labeled with names

➥ Units are named by paths in the graph:

Start node: < Label-1, Label-2, ... >

➥ absolute path: starting from root (of namespace)

➥ relative path: starting from any node

➥ Example: names in the UNIX file system

4.1 Basics ...


Aliasing and Linking

➥ Alias: alternative name for the same unit

➥ Possibilities for the realization of aliases:

➥ allow several absolute pathnames for one unit

➥ e.g. hard link in Unix

➥ a (special) leaf node stores pathname of the unit

➥ e.g. symbolic link in Unix

➥ Transparent linking of different namespaces:

➥ a (special) directory node stores the ID of a directory node in

another namespace

➥ e.g. mounted file system in Unix

4.1 Basics ...


Name Resolution

➥ Finding the node (or information) that corresponds to a name

➥ start at the start node

➥ look up first label in directory table

⇒ ID of the next node

➥ etc., until the path is completely processed

➥ Conclusion mechanism: determination of the start node

➥ usually implicit

➥ Global names: resolution independent of specific context

➥ Local names: resolution is context-dependent

➥ e.g. pathname relative to working directory in Unix

4.1 Basics ...


Implementation of Naming Services

➥ Typical operations:

➥ bind(name, address, attributes)

➥ lookup(name, attributes) ⇒ address, attributes

➥ unbind(name, address)

➥ In distributed systems:

➥ namespace is stored distributed (usually hierarchically)

➥ for high availability: additionally replicated storage

➥ Name resolution can be iterative or recursive

➥ iterative: Server responds with address of next server

➥ recursive: server requests even at next server

➥ Example: Domain Name System (☞ RN I, 11.1)

4.2 Example: JNDI


➥ JNDI: Java Naming and Directory Interface

➥ API for access to different name and directory services

➥ directory service also stores attributes of objects

RMI CORBA LDAP DNS

JNDI naming manager

Serviceprovider

JNDI SPI

JNDI API

RMI

Java application

registryCORBAnamingservice

LDAPserver server

DNS

4.2 Example: JNDI ...


➥ JDNI supports compound namespaces

➥ managed by various name or directory services

systemFile

Context(directory)

Initialcontext

LDAP

User objects

DNS

➥ Directories are called “contexts”

➥ objects are bound to names within a context



The Interface javax.naming.Context for Naming Contexts

➥ Important methods:

➥ bind(), rebind() : bind objects to names

➥ bind() throws exception if name already exists

➥ unbind() : remove names

➥ rename() : rename

➥ lookup() : resolve name to object

➥ listBindings() : list of all bindings

➥ createSubcontext() : create sub-context

➥ destroySubcontext() : delete sub-context



The Interface javax.naming.Context for Naming Contexts ...

➥ Implementation class InitialContext

➥ for initial context (depending on the concrete name service)

➥ Context iC = new InitialContext(properties);

➥ configuration via Properties object (Hashtable), among

others:

➥ "java.naming.factory.initial"

➥ factory for InitialContext

➥ "java.naming.provider.url"

➥ contact information for service provider

➥ "java.naming.security.principal" and

"java.naming.security.credentials"

➥ user name and password for authentication



Example: Accessing the RMI Registry

import javax.naming.*;...

Properties props = new Properties();props.put("java.naming.factory.initial","com.sun.jndi.rmi.registry.RegistryContextFactory");

props.put("java.naming.provider.url","rmi://localhost:1099");

Context ctx = new InitialContext(props);

obj = (Hello)ctx.lookup("Hello-Server");

message = obj.sayHello();



Example: Accessing a Local File System

import javax.naming.*;...

Properties props = new Properties();props.put("java.naming.factory.initial",

"com.sun.jndi.fscontext.RefFSContextFactory");Context ctx = new InitialContext(props);

for (int i=0; i<args.length-1; i++)ctx = (Context)ctx.lookup(args[i]);

NamingEnumeration<Binding> list= ctx.listBindings(args[args.length-1]);

while (list.hasMore()) {Binding b = list.next();System.out.println(b.getName()+": "+b.getClassName());

}



5 Process Management

5 Process Management ...


Contents

➥ Distributed process scheduling

➥ Code migration

Literature


➥ Stallings: Ch. 14.1



5.1 Distributed Process Scheduling

➥ Typical: middleware component that

➥ decides on which node a process is executed

➥ and probably migrates processes between nodes

➥ Gloals:

➥ balance the load between nodes

➥ maximize the system performance (average response time)

➥ also: minimize the communication between nodes

➥ meet special hardware / resource requirements

➥ Load: typically the length of the process queue (ready queue)

➥ sometimes resource consumption and communication volume

are considered, too

5.1 Distributed Process Scheduling ...


Approaches to distributed scheduling

➥ Static scheduling

➥ mapping of processes to nodes is defined before execution

➥ NP-complete, therefore heuristic methods

➥ Dynamic load balancing, two variants:

➥ execution location of a process is defined during creation and

is not changed later

➥ execution location of a process can be changed at runtime

(several times, if necessary)

➥ preemptive dynamic load balancing, process migration



5.1.1 Static Scheduling

➥ Procedure dependent on the structure / the modelling of a job

➥ jobs always consist of several processes

➥ differences in communication structure

➥ Examples:

➥ communicating processes: graph partitioning

➥ non-communicating tasks with dependencies: list scheduling

5.1.1 Static Scheduling ...


Scheduling through graph partitioning

➥ Given: process system with

➥ CPU / memory requirements

➥ specification of communication load

between each pair of processes

usually represented as a graph

64

81

43

2

3 23

551

2

4 2G

E

A B C

F

IH

D

➥ Wanted: partitioning of the graph in such a way that

➥ CPU and memory requirements are met for each node

➥ partitions are about the same size (load balancing)

➥ weighted sum of cut edges is minimal

➥ i.e. as little communication as possible between nodes

➥ NP-complete, therefore many heuristic procedures









= 30

1 2 3

64

81

43

2

3 23

551

2

4 2G

E

A B C

F

IH

D















= 28

1 2 3

64

81

43

2

3 23

551

2

4 2G

E

A B C

F

IH

D









List scheduling

➥ Tasks with dependencies, but without communication during

execution

➥ tasks work on results of other tasks

➥

D 6 E 6 F 4

G 4

A 6 B 5 C 4

1

211

333

41

Modelling

➥ program represented as a DAG

➥ nodes: tasks with execution times

➥ edges: communication with transfer

time



Method

➥ Create prioritized list of all tasks

➥ many different heuristics to determine the priorities, e.g.

according to:

➥ length of the longest path (without communication) from the

node to the end of the DAG (High Level First with EstimatedTime, HLFET).

➥ earliest possible start time (Earliest Task First, ETF)

➥ Process the list as follows:

➥ assign the first task to the node that allows the earliest start

time

➥ remove the task from the list

➥ Creation and processing of the list can also be interleaved



Example: List Scheduling with HLFET

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

➥ Assumption: local communication does not cost any time




6+5+4 = 15Static level (without comm.):

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3





6+5+4 = 15Static level (without comm.):

6+6+4 = 16

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16





A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413





A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFDEBCA





Schedule with 3 nodes:

0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFDEBCA





6A


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFDEBC





6A

C 4


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFDEB





6A

C 4

B 4


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFDE





6A

C 4

B 4

6E


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFDE





6A

C 4

B 4

6E


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFDE





6A

C 4

B 4 6E


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFDE





6A

C 4

B 4

6E


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFD





6A

C 4

B 4

6E D 5


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFD





6A

C 4

B 4

6E

D 5


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFD





6A

C 4

B 4

6E

D 5


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GFD





6A

C 4

B 4

6E

D 5


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: GF





6A

C 4

B 4

6E

D 5

4F


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List: G





6A

C 4

B 4

6E

D 5

4F

G 4


0 2 4 6 8 10 12 14 16

1

2

3

A 6 C 4

D 5 F 4

G 4

E 6

B 4

2

1 3

1 1

41

5 3

16

4

9 10 8

1413 List:




5.1.2 Dynamic Load Balancing

➥ Components of a load balancing system

➥ Information policy – when is load balancing triggered?

➥ on demand, periodically, in case of state changes, ...

➥ Transfer policy – under which condition is load shifted?

➥ often: decision with the help of threshold values

➥ Location policy – how is the receiver (or sender) found?

➥ polling of some nodes, broadcast, ...

➥ Selection policy – which tasks are moved?

➥ new tasks, long tasks, location-independent tasks, ...

5.1.2 Dynamic Load Balancing ...


Typical approaches to dynamic load balancing

➥ Sender initiated load balancing

➥ new process usually start on the local node

➥ if node is overloaded: determine load of other nodes and startprocess on low-loaded node

➥ e.g. ask randomly selected nodes for their load, sendprocess if load ≤ threshold, otherwise: next node

➥ disadvantage: additional work for already overloaded node!

➥ Receiver initiated load balancing

➥ when scheduling a process: check whether the node has stillenough work (processes)

➥ if not: ask other nodes for work

➥ Similar also for preemptive dynamic load balancing



[Tanenbaum/Steen, 3.4]5.2 Code Migration

➥ In distributed systems, in addition to data also programs aretransfered between nodes

➥ partly also during their execution

➥ Motivation: performance and flexibility

➥ preemptive dynamic load balancing

➥ optimization of communication (move code to data or highly

interactive code to client)

➥ increased availability (migration before system maintenance)

➥ use of special HW or SW resources

➥ use / evacuation of unused workstation computers

➥ avoid code installation on client machines (dynamic loading of

code from server)

5.2 Code Migration ...


Models for Code Migration

➥ Conceptual model: a process consists of three “segments”:

➥ code segment

➥ the executable program code of the process

➥ execution segment

➥ complete execution status of the process➥ virtual address space (data, heap, stack)➥ processor register (incl. instruction counter)

➥ process / thread control block

➥ resource segment

➥ contains references to external resources required by the

process➥ e.g. files, devices, other processes, mailboxes, ...



Models for Code Migration ...

➥ Weak mobility

➥ only the code segment is transferred

➥ including initialization data if necessary

➥ program is always started from initial state

➥ examples: Java applets, loading remote classes in Java

➥ Strong mobility

➥ code and execution segment are transferred

➥ migration of a process in execution

➥ examples: process migration, agents

➥ Sender- or receiver-initiated migration



Code Migration Issues and Solutions

➥ Security: target computer executes unknown code (e.g. applet)

➥ restricted environment (sandbox)

➥ signed code

➥ Heterogeneity: code and execution segment depend on CPU andoperating system

➥ use of virtual machines (e.g. JVM, XEN)

➥ migration points at which state can be stored and read in aportable way (possibly supported by compiler)

➥ Access to (local) resources

➥ remote access with a global reference

➥ move or copy the resource

➥ new binding to resource of the same type



[Stallings, 14.1]Process migration

➥ Migration of a process that is already running

➥ triggered by OS or the process itself

➥ mostly for dynamic load balancing

➥ Sometimes combined with checkpoint /restart function

➥ instead of transferring the status of the process, it can also be

stored persistently

➥ Design goals of migration procedures:

➥ low communication effort

➥ only short blocking of the migrated process

➥ no dependency on source computer after migration



Process Flow of a Process Migration

➥ Creating a new process on the target system

➥ Transfer the code and execution segment (process address

space, process control block), initialization of the target process

➥ required: identical CPU and OS or virtual machine

➥ Update all connections to other processes

➥ communication links, signals, ...

➥ during migration: buffering at source

➥ then: forwarding to target computer

➥ Delete the original process

➥ if necessary, retain a “shadow process” for redirected system

calls, e.g. file accesses



Transferring the process address space

➥ Eager (all): transfer the entire address space

➥ no traces of the process remain on source nodes

➥ very expensive for large address space (especially if not all

pages are used)

➥ often together with checkpoint/restart function

➥ Precopy : process continues to run on source node during

transfer

➥ to minimize time in which the process is blocked

➥ pages modified while the migration is in progress must be sent

again



Transferring the process address space ...

➥ Eager (dirty): transfer only modified pages that are in mainmemory

➥ all other pages are only transferred when accessed

➥ integration with virtual memory management

➥ motivation: quickly “flush” main memory of the source node

➥ source node may remain involved until the end of the process

➥ Copy-on-reference: transfer each page only when accessed

➥ variation of eager (dirty)

➥ lowest initial costs

➥ Flushing: move all pages to disk before migration

➥ after that: copy-on-reference

➥ advantage: main memory of the source node is relieved



6 Time and Global State

6 Time and Global State ...


➥ Synchronization of physical clocks

➥ Lamport’s happended before relation

➥ Logical clocks

➥ Global state

Literature

➥ Tanenbaum, van Steen: Kap. 5.1-5.3

➥ Colouris, Dollimore, Kindberg: Kap. 10

➥ Stallings: Kap 14.2



What is the difference between a distributed system and asingle/multiprocessor system?

➥ Single or multiprocessor system:

➥ concurrent processes: pseudo-parallel by time sharing or

truely parallel

➥ global time: all events in the processes can be ordered

unambiguously in terms of time

➥ global state: at any time a unique state of the system can be

determined

➥ Distributed system

➥ true parallelism

➥ no global time

➥ no unique global state



Concurrency vs. (true) parallelism

A B C Dsequential

A B C D A A AB BD D DCconcurrent

AB

CD

One time line, processes can be interrupted by others

parallel

Each node / process has its owntime line! Events in differentprocesses can truely happensimultaneously.

One time line, processes are not interrupted.

at any time: interleaved execution.

Example: 4 processes



Global Time

➥ In a single/multiprocessor system

➥ each event can (at least theoretically) be assigned a unique

time stamp of the same local clock

➥ for multiprocessor systems: synchronization at the shared

memory

➥ In distributed systems:

➥ many local clocks (one per node)

➥ exact synchronization of clocks is (on principle!) not possible

➥ ⇒ the sequence of events on different nodes can not (always)

be determined uniquely

➥ (cf. special theory of relativity)



An effect of distribution

➥ Preliminary remark: events in distributed systems

Process 1

Process 2 Time

receive the message

send a message local event

local events

➥ Scenario: two processes observe two other processes

Observer A

Observer B

Process 1

Process 2z

yx

z x y

x y z



An effect of distribution ...

➥ The observers may see the events in different order!

➥ Problem e.g., if the observers are replicated databases and the

events are database updates

➥ replicas are no longer consistent!

➥ Even from time stamps of (local) clocks it is not possible to

determine the order of events in a meaningful way

➥ Hence, in such cases:

➥ events with timestamps of logical clocks (☞ 6.3)

➥ logical clocks allow conclusions to be made about causal order



[Coulouris, 10.3]6.1 Synchronizing Physical Clocks

➥ Physical clock shows ’real’ time

➥ based on UTC (Universal Time Coordinated)

➥ Each computer has its own (physical) clock

➥ quartz oscillator with counter in HW and if necessary in SW

➥ Clocks usually differ from each other (offset)

➥ Offset changes over time: clock drift

➥ typ. 10−6 for quartz crystals, 10−13 for atomic clocks

➥ Goal of clock synchronization:

➥ keep the offset of the clocks under a given limit

➥ clock skew: maximum allowed deviation

6.1 Synchronizing Physical Clocks ...


Cristian’s Method

➥ Assumption: A and B want to synchronize their clocks with eachother

➥ B can also be a time server (e.g. with GPS clock)

➥ Protocol:

1. A sendsrequest to B

A

B

t0

➥ A must take the

runtime of the

reply message

into account

➥ estimate: runtime

= half the round

trip time

= (t1− t0)/2



Cristian’s Method



➥ Protocol:

t1

(t)

t

2. B reads time t andsends it to A


A

B

t0

➥ A must take the

runtime of the

reply message

into account


= half the round

trip time

= (t1− t0)/2



Cristian’s Method



➥ Protocol:

3. A sets its clockto t + (t1−t0)/2

t1

(t)

t



A

B

t0

➥ A must take the

runtime of the

reply message

into account


= half the round

trip time

= (t1− t0)/2



Cristian’s Method



➥ Protocol:

(interrupt−)latency

t1

(t)

t




A

B

t0

➥ A must take the

runtime of the

reply message

into account


= half the round

trip time

= (t1− t0)/2



Cristian’s Method



➥ Protocol:

transit timedifferent

t1

(t)

t




A

B

t0

➥ A must take the

runtime of the

reply message

into account


= half the round

trip time

= (t1− t0)/2



Cristian’s Method: Discussion

➥ Problem: runtimes of both messages may be different

➥ systematic differences (different paths / latencies)

➥ statistical fluctuations of the transit time

➥ Accuracy estimate, if minimum transit time (min) is known:

➥ B can have determined t at the earliest at time t0 + min, at

the latest at time t1 − min (measured with A’s clock)

➥ thus accuracy ± ((t1 − t0)/2 − min)

➥ To improve accuracy:

➥ execute the message exchange multiple times

➥ use the one with minimum round trip time



Changing the clock

➥ Turning back is problematic

➥ order / uniqueness of time stamps

➥ Non-monotonous “jumping” of the time also problematic

➥ Therefore: clock is generally adjusted slowly

➥ runs faster / slower, until clock skew has been compensated

Further protocols

➥ Berkeley algorithm: server calculates mean value of all clocks

➥ NTP (Network Time Protocol): hierarchy of time servers in the

Internet with periodic synchronization

➥ IEEE 1588: clock synchronization for automation systems



[Coulouris, 10.4]6.2 Lamport’s Happened-Before Relation

➥ In two cases, the order of events can also be determined without

a global clock:

➥ if the events are in the same process, local clock is sufficient

➥ the sending of a message is always before its reception

➥ Definition of the happened-before causality relation → (causality

relation)

➥ if events a, b are in the same process i and ti(a) < ti(b)(ti: time stamp with i’s clock), then a → b

➥ if a is the sending of a message and b its receipt, then a → b

➥ if a → b and b → c, then also a → c (transitivity)

➥ a → b means, that b may causally depend on a

6.2 Lamport’s Happened-Before Relation ...


Examples

Process 1

Process 2

Process 3

Process 4

hf

c

d

k

a

e

b

g

j

i

l

➥ Among others, we have here:

➥ b → i and a → h (events in the same process)

➥ c → d and e → f (sending / receiving a message)

➥ c → k and a → i (transitivity)

➥ g 6→ l and l 6→ g: l and g are concurrent (nebenlaufig)



[Coulouris, 10.4]6.3 Logical Clocks

➥ Physical clocks cannot be synchronized exactly

➥ therefore: unsuitable for determining the order in which eventsoccurred

➥ Logical clocks

➥ refer to the causal order of events (happened-before relation)

➥ no fixed relationship to real time

➥ In the following:

➥ Lamport timestamps

➥ are consistent with the happened-before relation

➥ vector timestamps

➥ allow sorting of events according to causality (i.e.happened-before relation)

6.3 Logical Clocks ...


Lamport Timestamps

➥ Lamport timestamps are natural numbers

➥ Each process i has a local counter Li, that is updated as follows:

➥ at (more precisely: before) each local event: Li = Li + 1

➥ in each message, the time stamp Li of the send event is also

sent

➥ at receipt of a message with time stamp t:Li = max(Li, t + 1)

➥ Lamport time stamps are consistent with the causality:

➥ a → b ⇒ L(a) < L(b), where L is the Lamport timestamp

in the respective process

➥ but the reversal does not apply!



Lamport Timestamps: Example

3

1

1

2

1

1 21

Process 1

Process 2

Process 3

Process 4l

c

d f h

k

a

e

b

g

j

i

1

2 3 4

64

5


➥ c → k and L(c) < L(k)

➥ g 6→ j and L(g) 6< L(j)

➥ g 6→ l, but still L(g) < L(l)




3

1

1

2

1

1 21

Process 1

Process 2

Process 3

Process 4l

c

d f h

k

a

e

b

g

j

i

1

2 3 4

64

5L = max(2, 1+1)3


➥ c → k and L(c) < L(k)

➥ g 6→ j and L(g) 6< L(j)





3

1

1

2

1

1 21

Process 1

Process 2

Process 3

Process 4l

c

d f h

k

a

e

b

g

j

i

1

2 3 4

64

5L = max(3, 1+1)3


➥ c → k and L(c) < L(k)

➥ g 6→ j and L(g) 6< L(j)





3

1

1

2

1

1 21

Process 1

Process 2

Process 3

Process 4l

c

d f h

k

a

e

b

g

j

i

1

2 3 4

64

5

L = max(2, 4+1)1


➥ c → k and L(c) < L(k)

➥ g 6→ j and L(g) 6< L(j)





3

1

1

2

1

1 21

Process 1

Process 2

Process 3

Process 4l

c

d f h

k

a

e

b

g

j

i

1

2 3 4

64

5


➥ c → k and L(c) < L(k)

➥ g 6→ j and L(g) 6< L(j)




Vector Timestamps

➥ Objective: timestamps that characterize causality

➥ a → b ⇔ V (a) < V (b), where V is the vector timestamp

in the respective process

➥ A vector clock in a system with N processes is a vector of Nintegers

➥ each process has its own vector Vi

➥ Vi[i]: number of events that have occurred so far in process i

➥ Vi[j], j 6= i: number of events in process j, of which i knows

➥ i.e. by which it could have been causally influenced



Vector Timestamps ...

➥ Update of Vi in process i:

➥ before any local event: Vi[i] = Vi[i] + 1

➥ Vi is included in every message sent

➥ when receiving a message with timestamp t:Vi[j] = max(Vi[j], t[j]) for all j = 1, 2, . . . , N

➥ Comparison of vector timestamps:

➥ V = V ′ ⇔ V [j] = V ′[j] for all j = 1, 2, . . . , N

➥ V ≤ V ′ ⇔ V [j] ≤ V ′[j] for all j = 1, 2, . . . , N

➥ V < V ′ ⇔ V ≤ V ′ ∧ V 6= V ′

➥ the relation < defines a partial order



Vector Timestamps: Example

Process 1

Process 2

Process 3

Process 4

f h

k

a

e

b

g

j

i

l

c

d

(1,0,0,0)

(0,0,1,0)

(0,0,0,1) (0,0,0,2) (0,0,0,3)

(0,1,2,0)

(0,1,3,1) (0,1,4,1)

(0,1,0,0) (0,2,0,0)

(2,1,4,1) (3,1,4,1)


➥ c → k and V (c) < V (k)

➥ g 6→ l and V (g) 6< V (l), as well as l 6→ g and V (l) 6< V (g)

➥ V (l) and V (g) not comparable ⇔ l and g concurrent




Process 1

Process 2

Process 3

Process 4

f h

k

a

e

b

g

j

i

l

c

d

(1,0,0,0)

(0,0,1,0)

(0,0,0,1) (0,0,0,2) (0,0,0,3)

(0,1,2,0)

(0,1,3,1) (0,1,4,1)

(0,1,0,0) (0,2,0,0)

(2,1,4,1) (3,1,4,1)2V = max((0,0,2,0), (0,1,0,0))


➥ c → k and V (c) < V (k)






Process 1

Process 2

Process 3

Process 4

f h

k

a

e

b

g

j

i

l

c

d

(1,0,0,0)

(0,0,1,0)

(0,0,0,1) (0,0,0,2) (0,0,0,3)

(0,1,2,0)

(0,1,3,1) (0,1,4,1)

(0,1,0,0) (0,2,0,0)

(2,1,4,1) (3,1,4,1)2V = max((0,1,3,0), (0,0,0,1))


➥ c → k and V (c) < V (k)






Process 1

Process 2

Process 3

Process 4

f h

k

a

e

b

g

j

i

l

c

d

(1,0,0,0)

(0,0,1,0)

(0,0,0,1) (0,0,0,2) (0,0,0,3)

(0,1,2,0)

(0,1,3,1) (0,1,4,1)

(0,1,0,0) (0,2,0,0)

(2,1,4,1) (3,1,4,1)

0V = max((2,0,0,0), (0,1,4,1))


➥ c → k and V (c) < V (k)






Process 1

Process 2

Process 3

Process 4

f h

k

a

e

b

g

j

i

l

c

d

(1,0,0,0)

(0,0,1,0)

(0,0,0,1) (0,0,0,2) (0,0,0,3)

(0,1,2,0)

(0,1,3,1) (0,1,4,1)

(0,1,0,0) (0,2,0,0)

(2,1,4,1) (3,1,4,1)


➥ c → k and V (c) < V (k)



6.4 Global State


A Motivating Example

➥ Scenario: peer-to-peer application, processes send requests to

each other

➥ Question: when can the application terminate?

➥ Answer: when no process is processing a request

6.4 Global State


A Motivating Example

➥ Scenario: peer-to-peer application, processes send requests to

each other

➥ Question: when can the application terminate?

➥ Wrong answer: when no process is processing a request

➥ reason: requests can still be on the way in messages!

idle idle

RequestProcess 1 Process 2

➥ Other applications: distributed garbage collection, distributed

deadlock detection, ...

6.4 Global State ...


➥ How can we determine the overall state of a distributed process

system?

➥ naıvely: union of the states of all processes (wrong!)

➥ Two aspects have to be considered:

➥ messages that are still in transit

➥ must be included in the state

➥ lack of global time

➥ a global state at time t cannot be defined!

➥ process states always refer to local (and thus different)

times➥ question: condition on local times? ⇒ consistent cuts



Consistent Cuts

➥ Objective: build a meaningful global state from local states (whichare not determined simultaneously)

➥ Processes are modeled by sequences of events:

Process 1

Process 2

Process 3

➥ Cut: consider a prefix of the event sequence in each process

➥ Consistent cut:

➥ if the cut contains the reception of a message, it also containsthe sending of this message



Consistent Cuts



Process 1

Process 2

Process 3

Cut Cut Cut Cut


➥ Consistent cut:




Consistent Cuts



Process 1

Process 2

Process 3

Inconsistent cutConsistent cuts


➥ Consistent cut:




The Snapshot Algorithm of Chandy and Lamport

➥ Determines online a “snapshot” of the global state

➥ i.e.: a consistent cut

➥ The global state consists of:

➥ the local states of all processes

➥ the status of all communication connections

➥ i.e. the messages in transmission

➥ Assumptions / properties:

➥ reliable message channels with sequence retention

➥ process graph is strongly connected

➥ each process can trigger a snapshot at any time

➥ the processes are not blocked during the algorithm



The Snapshot Algorithm of Chandy and Lamport ...

➥ When a process wants to initiate a snapshot:

➥ process first saves its local state

➥ then it sends a marker message over each outgoing channel

➥ When a process receives a marker message:

➥ if it has not yet saved its local state:

➥ it saves its local state

➥ and sends a marker over each outgoing channel

➥ else:

➥ for the channel where the marker was received, it saves all

messages that have been received since the local state

was saved➥ i.e., it records the status of the channel



The Snapshot Algorithm of Chandy and Lamport ...

➥ The algorithm terminates when each process has received a

marker message on each channel

➥ the determined consistent section is then (initially) stored in a

distributed way



Example for the algorithmof Chandy/Lamport

e

b

dc

a

P2

P1

P3




e

b

dc

a

P2

P1

P3

M

1. P1 initiates a snapshot, saves its state, and sends markers

M




e

b

dc

a

P2

P1

P3

2. P3 receives a marker from P1, saves its state, and sends markers

M

MM





e

b

dc

P2

P1

P3

3. P2 receives and processes a2. P3 receives a marker from P1, saves its state, and sends markers

M

MM





e

b

dc

P2

P1

P3

P2 receives the marker from P1, saves its state, and sends markers3. P2 receives and processes a

M

M


M

M





P2

P1

P3

4. P1, P2, P3 save the incoming messages, until all markers are received

e

c

b

d

P2 receives the marker from P1, saves its state, and sends markers3. P2 receives and processes a

M

M


M

M





P2

P1

P3

4. P1, P2, P3 save the incoming messages, until all markers are received

e

c

b

d

P2 receives the marker from P1, saves its state, and sends markers3. P2 receives and processes a2. P3 receives a marker from P1, saves its state, and sends markers1. P1 initiates a snapshot, saves its state, and sends markers



Sequence in the Example and Selected Cut

P1

P2

P3

db

e

a

c

displayed initial state

➥ The cut consists of the local states of P1, P2, P3 and the

messages b, c, d, e



Sequence in the Example and Selected Cut

P1

P2

P3

db

e

a

c

consistent cut determined by the algorithm

➥ The cut consists of the local states of P1, P2, P3 and the

messages b, c, d, e



7 Coordination

7 Coordination ...


Contents

➥ Election algorithms

➥ Mutual exclusion

➥ Group communication (multicast)

➥ Transactions

Literature

➥ Tanenbaum, van Steen: Kap. 5.4-5.6

➥ Colouris, Dollimore, Kindberg: Kap. 11, 12

➥ Stallings: Kap 14.3

7 Coordination ...


[Tanenbaum/Steen, 5.4]7.1 Election Algorithms

➥ In many distributed algorithms one arbitrary process must play anexceptional role

➥ e.g. central coordinator, initiator, ...

➥ Question: how to choose this process unambiguously?

➥ processes must be distinguishable, e.g. via a unique ID.

➥ then select e.g. the process with the highest ID

➥ Prerequisites / requirements:

➥ election can be initiated by multiple processes simultaneously

➥ e.g. after failure or recovery of a process

➥ after the election all processes must have the same result

➥ each process knows the IDs of all other processes, but doesnot know whether they are running or not

7.1 Election Algorithms ...


The Bully Algorithm

➥ A process P holds an election as follows:

➥ P sends an ELECTION message to all processes with a larger

ID

➥ if none of the processes reacts, P wins the election

➥ if a process responds: P loses the election

➥ When a process receives an ELECTION message:

➥ (message comes from a process with a lower ID)

➥ return an OK message

➥ hold an election of your own

➥ At some point, there is only one process left

➥ this wins the election and sends the result to all others



Bully Algorithm: Example

1

64

2 5

30

7

Previous Coordinatorhas crashed




Process 4 holds an election1

64

2 5

30

7


ELECTIONELECTION

ELECTION




Processes 5 and 6 reply,Process 4 terminates its election


64

2 5

30

7


OK

OK




Processes 5 and 6 simultaneouslyhold an election



64

2 5

30

7


ELECTION

ELE

CTI

ON

ELECTIO

N




Process 6 replies to 5Process 5 terminates its election




64

2 5

30

7


OK




Noone replied to the election ofprocess 6, thus, this process winsthe election and communicates theresult to all others

Process 6 replies to 5Process 5 terminates its election




64

2 5

30

7


COORDINATOR



A Ring Algorithm

➥ Assumption: processes form a logical ring, i.e. each process

knows its successors in the ring

➥ Messages are sent along the ring as follows:

➥ a process tries to send the message to its direct successor

➥ if this process is not active, the message will be sent to the

next process in the ring, etc.

➥ ELECTION messages contain a list of process IDs



A Ring Algorithm ...

➥ A process that initiates the election sends an ELECTION

message with its own ID along the ring

➥ When an ELECTION message is received by a process:

➥ if its own ID is not in the list of IDs:

➥ append the own ID to the list

➥ continue sending message along the ring

➥ else (message came back to the initiator):

➥ determine highest ID in the list

➥ send this ID in a COORDINATOR message along the ring



Ring Algorithm: Example

3

51

2 4

60

7

Previous coordinatorhas crashed




3

51

2 4

60

7


[2]

[5]

Processes 2 and 5 concurrently initiate an election




3

51

2 4

60

7


no answer

[2,3][2]

[5]





3

51

2 4

60

7


[5,6]

[2,3,4]

no answer

[2,3][2]

[5]





3

51

2 4

60

7


[5,6,0] [2,3,4,5]Eventually both processes gettheir ELECTION messagesback and send a COORDINATOR message(with identical contents!)

[5,6]

[2,3,4]

no answer

[2,3][2]

[5]


7 Coordination ...


7.2 Mutual Exclusion

➥ Here mainly: use / allocation of exclusive resources

➥ Requirements:

➥ safety: only one process can use the resource at any one time

➥ liveness: any process that requests the resource will

eventually get it

➥ fairness: access to resources in ’FIFO’ order

➥ Solution approaches:

➥ centralized server

➥ distributed algorithm with Lamport clock

➥ token ring algorithm

7.2 Mutual Exclusion ...


Centralized server

➥ An special coordinator process manages the resource and a

queue for waiting processes

➥ determined e.g. via an election algorithm

➥ Resource is requested by sending a message to the coordinator

➥ if resource is free: coordinator answers with OK

➥ otherwise: coordinator does not answer

➥ requesting process is blocked (waiting for reply)

➥ Resource is released by sending a message to the coordinator

➥ if processes wait: coordinator sends an OK to one of them

➥ Problem: processes cannot detect failure of the coordinator

➥ this could be done using negative replies and polling



A Distributed Algorithm (Ricart / Agrawala)

➥ Idea: a process that wants to have a resource asks all other

processes for their OK

➥ a process replies with OK, if

➥ it does not want the resource, or

➥ it wants the resource, but the other process has requested

it “earlier”

➥ Requires total order of request events

➥ order must be consistent with causality

➥ realizable e.g. via a time stamp (Lamport time, process ID)

with lexicographic order

➥ in the example of slide 206 this results in the event order:

b, c, a, e, g, d, j, f, l, h, i, k



A Distributed Algorithm (Ricart / Agrawala) ...

➥ To request a resource, a process sends the following message to

all other processes:

➥ resource ID

➥ time stamp T of the request

➥ pair: (current Lamport time, own process ID)

(the message must be delivered reliably)

➥ The process then waits until it receives an OK message from all

other processes

➥ After that it can use the resource (exclusively)




➥ Each process responds to request messages as follows:

➥ resource is not used and not requested by the process:

➥ return OK message

➥ resource is used by the process:

➥ do not send a reply

➥ put the request in a queue

➥ Resource id not used, but requested by the process:

➥ if T (incoming message) < T (own request):

➥ return OK message

➥ or else:➥ do not send a reply

➥ put the request in a queue




➥ When a process releases the resource:

➥ send an OK message to all processes in the queue

➥ delete the queue



Example for the Algorithm of Ricart / Agrawala

32

1

the resourceBoth P1 and P3 want




1. P1 sends request to all others

0Nok =Treq = 8,1

8,1 8,1

32

1







0Treq = 12,3Nok =

0Nok =Treq = 8,1

12,3

12,3

8,1 8,1

32

1





since it doesn,t want the resource3. P2 sends OK to P1 and P3,



1

1

Treq = 12,3Nok =

Nok =Treq = 8,1

OK

OK 12,3

8,1

32

1





4. P1 doesn’t send an OK to P3,since (12,3) > (8,1).P1 adds P3 to its queue




Queue: P3

1

1

Treq = 12,3Nok =

Nok =Treq = 8,1

8,1

32

1





(8,1) < (12,3)5. P3 sends OK to P1, since





Queue: P32

1Treq = 12,3Nok =

Nok =Treq = 8,1

OK

32

1





and uses the resource6. P1 received all OKs






Queue: P32

1Treq = 12,3Nok =

Nok =Treq = 8,1

P1 ownsthe resource

32

1





and sends an OK to P37. P1 releases the resource

and uses the resource6. P1 received all OKs






2Treq = 12,3Nok =

OK

the resourceP1 releases

32

1




A Token Ring Algorithm

➥ The processes form a logical ring

➥ A token circles in the ring

➥ authorization for (exclusive) use of the resource

➥ token is initially generated by one of the processes

➥ On arrival of the token: process checks whether it wants theresource

➥ if so:

➥ use the resource➥ after releasing the resource:

➥ pass token to successor in the ring

➥ else:

➥ pass token immediately to successor in the ring



Comparison of algorithms

➥ Centralized server:

➥ server is single point of failure and may be a performance

bottleneck

➥ clients cannot distinguish (without additional measures)

between server failure and occupied resource

➥ only little communication necessary

➥ Distributed algorithm:

➥ failure of any node is problematic

➥ any node can become a performance bottleneck

➥ high communication effort

➥ just a proof that a distributed, symmetrical algorithm is possible



Comparison of algorithms ...

➥ Token ring algorithm:

➥ problem: loss of the token (detection, re-creation)

➥ failure of nodes is problematic

➥ communication, even if resource is not used

Algorithm Messages per Delay before Problems

allocation allocation

centralized 3 2 server failure

distributed 2(n − 1) 2(n − 1) failure of any process

token ring 1 ...∞ 0 ... n − 1 lost token,

failure of any process

7 Coordination ...


[Coulouris, 4.5, 11.4]7.3 Group Communication (Multicast)

➥ In distributed systems, communication with a group of processes

(multicast) is often also important, e.g. for:

➥ fault tolerance based on replicated services

➥ service realized by group of servers

➥ all servers receive and process the requests

➥ finding of services (especially discovery / name services)

➥ multicast is a possible approach for this

➥ better performance through replicated data

➥ changes must be sent to all copies

➥ sending event notifications

➥ all subscribers receive the event

7.3 Group Communication (Multicast) ...


Questions / Problems

➥ Addressing the recipients

➥ explicit list of all recipients

➥ addressing a process group

➥ static / dynamic groups

➥ Reliability

➥ reasonable guarantees that messages will reach their

recipients

➥ Order

➥ adequate guarantees as to the order in which multicast

messages arrive at the various recipients



Reliability

➥ Unreliable multicast:

➥ some processes may not receive the message (e.g. due to

packet loss)

➥ Reliable multicast:

➥ apart from network and process failures, the message is

delivered to all processes in the group

➥ Atomic multicast:

➥ the message is (under all circumstances) received either by allprocesses of the group or by none of them

➥ required if all processes in the group must be kept consistent

(e.g., operations on replicated data)



Order

➥ Unordered

➥ receiving order is undefined

and can be different in different

processes

➥ FIFO order

➥ messages from the same sender

are received by all processes in

FIFO order

➥ i.e. introduction of sequence

numbers local to the sender

1 2 3 4

1 2 3 4



Order

➥ Unordered

➥ receiving order is undefined

and can be different in different

processes

➥ FIFO order

➥ messages from the same sender

are received by all processes in

FIFO order

➥ i.e. introduction of sequence

numbers local to the sender

1 2 3 4

1 2 3 4



Order ...

➥ Causal order

➥ if message m′ can causally

depend on m (m → m′), then all

processes receive m before m′

➥ i.e. introduction of vector time

stamps

➥ Total order

➥ all messages are received by all

processes in the same order

➥ i.e. introduction of globalsequence numbers

1 2 3 4

1 2 3 4



Order ...

➥ Causal order





stamps

➥ Total order




1 2 3 4

1 2 3 4



Order ...

➥ Causal order





stamps

➥ Total order




1 2 3 4

1 2 3 4



Order ...

➥ Causal order





stamps

➥ Total order




1 2 3 4

1 2 3 4

7 Coordination ...


7.4 Transactions

➥ Combining a sequence of atomic actions into a single unit

➥ atomic actions: read, change, write data

➥ Example: seat reservation

reserved

Transaction

Atomic actions

free seatsQuery list of

Choose a seatMark seat as

➥ Used not only in database systems

7.4 Transactions ...


Properties of Transactions: ACID

➥ Atomicity

➥ all-or-nothing principle: either all atomic actions are executed(correctly) or none at all

➥ Consistency

➥ a transaction always transfers a consistent state back toconsistent state

➥ Isolation

➥ concurrent transactions do not affect each other; the result isthe same as with sequential execution

➥ Durability

➥ at the (successful) end of the transaction all changes arestored permanently



Atomicity

Rollback

BeginTransaction

Crash

BeginTransaction

Transaction

Commit

Transaction

All changes arestored permanently

are undoneAll changes



Isolation

or

Two concurrent Permitted serializationstransactions

ZYX

CBACBA ZYX

ZYX CBA

➥ The result of the concurrent transactions corresponds to one of

the two serializations



Isolation Levels

➥ Complete isolation of (database) transactions often is too

restrictive / too little performant

➥ Therefore: SQL99 standard defines four isolation levels

➥ Goal: avoidance of unwanted phenomena

➥ dirty reads: a transaction can read data of another

transaction before they have been committed

➥ unrepeatable reads: when reading repeatedly, a transaction

can see commited changes of other transactions

➥ phantom reads: when reading repeatedly, a transaction can

see that other transactions have added or deleted records



Isolation Level According to ANSI/ISO-SQL99

Read Uncommitted

Read Committed

Repeatable Read

Serializable

PhantomReads

UnrepeatableReadsReads

DirtyPhenomenon

Isolationlevel

possible possible possible

possible

possible

possiblenot possible

not possible

not possible

not possible

not possible not possible

➥ Serializable corresponds to complete isolation



Nested Transactions

➥ Within a transaction, several subtransactions take place

➥ Higher-level transaction can run successfully to completion, even

if subtransaction was terminated with an error

➥ Abort of the higher-level transaction results in aborting all

subtransactions

➥ Example: booking of flight and hotel

➥ booking of the flight should be maintained, even if hotel

booking (in the first attempt) fails

➥ Nested transaction are supported by only a few transaction

services



Crash

Crash

Book flight

Flat transaction

Nested transactions

Begin Abort

Begin

Call subtransaction

Commit

Book hotel

Call subtransactionCall subtransaction

Book hotel

commit

Book flight

Abort

Book hotel

Begin BeginBegin Tentativecommit

Tentative



Distributed Transactions

➥ So far: data is stored at exactly one location

➥ Distributed transactions: data is stored distributed

➥ Realization of transactions on the individual data resources

(databases) is no longer sufficient

➥ distributed transaction management becomes necessary

➥ There is a generally accepted Open Group model for the

management of distributed transactions

➥ is implemented by most transaction services

➥ most important feature: 2-Phase-Commit



Model for Managing Distributed Transactions

Application

Resource manager(RM)

Transaktion manager(TM)

Management of distributed transactionsacross resource boundaries

Transaction management within theindividual data resources




1. Application requests start of a new transaction.TM internally initializes a new transaction.

Application



1begin




2. Transaction is active.Application can use the resources.

Application



2

1begin




3. Each RM used by the applilcationregisters with the TM for the transactio.n

Application



3 join

2

1begin




4. Aplication requires to commit or abortthe transaction.

Application



commit /rollback

43 join

2

1begin




5. TM requires RM to commit the changes:2−phase commit

Application



prepare5

commit /rollback

43 join

2

1begin




5. TM requires RM to commit the changes:2−phase commit

Application



commitprepare5

commit /rollback

43 join

2

1begin



2-Phase Commit

➥ Phase 1 (voting phase)

➥ TM asks all involved RM, if the commit would be successful

(“prepare”)

➥ each RM that answers “yes” prepares for the commit

➥ Phase 2 (finalization)

➥ if all RMs answered with “yes”:

➥ TM sends commit command to all RMs

➥ RM ultimately commits the data and sends an

acknowlegement to TM

➥ else:

➥ TM sends an abort command to all RMs



8 Replication and Consistency

8 Replication and Consistency ...


Contents

➥ Introduction, motivation

➥ Data-centered consistency models

➥ Client-centered consistency models

➥ Distribution protocols

➥ Consistency protocols

Literature

➥ Tanenbaum, van Steen: Kap. 6



8.1 Introduction and Motivation

➥ Replication: several (identical) copies of data objects are stored

in the distributed system

➥ processes can access an arbitrary copy

➥ Reasons for the replication:

➥ increase in availability and reliability

➥ if a replica is not available, use another one

➥ reading multiple replicas with majority vote

➥ increase in read performance

➥ for large systems: concurrent read access can be serviced

by different replicas

➥ with systems spread over a large area: access request is

sent to a replica in the vicinity

8.1 Introduction and Motivation ...


Central Problem of Replication: Consistency

➥ When data is changed, all replicas must be kept consistent

➥ Simplest option: all updates are done via totally ordered atomic

multicast

➥ high overhead when frequent updates occur

➥ in some replicas these may actually never be read

➥ totally ordered atomic multicast is very expensive with many /

widely dispersed replicas

➥ Strict consistency maintenance of replicas always deteriorates

performance and scalability

➥ Solution: weakened consistency requirements

➥ often only very weak demands, e.g. News, Web, ...

8.1 Introduction and Motivation ...


Consistency Models

➥ A consistency model determines the order in which the write

operations (updates) of the processes are “seen” by the other

processes

➥ Intuitive expectation: a read operation always returns the result of

the last write operation (strict consistency)

➥ problem: there is no global time

➥ pointless to speak of the “last” write operation

➥ therefore: other consistency models necessary

➥ Data-centric consistency models: view of the data storage

➥ Client-centric consistency models: view of one process

➥ assumption: (essentially) no update by multiple processes



8.2 Data Centric Consistency Models

➥ Model of a distributed data store:

Process(Client)

Process(Client)

Process(Client)

Distributed data storage

Local copy

Write andread accessees

➥ logical, shared data memory

➥ physically distributed and replicated across multiple nodes

8.2 Data Centric Consistency Models ...


Sequential Consistency

➥ A data store is sequentially consistent if the result of eachprogram execution is as if:

➥ the (read/write) operations of all processes are executed in a(random) sequential order,

➥ in which the operations of each individual process appear inthe order specified by the program.

➥ P1 P2 Pn

Switch can beshifted arbitrarilyafter eachoperation

Operations inProgram order

Data store

I.e. the execution of theoperations of the individualprocesses can beinterleaved arbitrarily

➥ Independent of time orclocks

➥ All processes see the accesses in the same order



Sequential Consistency: Examples

W(x)a

W(x)b

R(x)b

R(x)b

R(x)a

R(x)a

P1:

P4:

P3:

P2:

W(x)a

W(x)b

R(x)b R(x)a

P1:

P4:

P3:

P2:

R(x)bR(x)a

Allowed sequence: Forbidden Sequence:

➥ Notation:

➥ W(x)a : the value ’a’ is written into the variable ’x’

➥ R(x)a : variable ’x’ will be read, result is ’a’

➥ A possible sequential order of the left sequence:

➥ W2(x)b, R3(x)b, R4(x)b, W1(x)a, R3(x)a, R4(x)a



Sequential Consistency: Examples

W(x)b

R(x)b

R(x)b

W(x)a

R(x)a

R(x)a

P1:

P4:

P3:

P2:

W(x)a

W(x)b

R(x)b R(x)a

P1:

P4:

P3:

P2:

R(x)bR(x)a

Allowed sequence: Forbidden Sequence:

➥ Notation:

➥ W(x)a : the value ’a’ is written into the variable ’x’

➥ R(x)a : variable ’x’ will be read, result is ’a’

➥ A possible sequential order of the left sequence:

➥ W2(x)b, R3(x)b, R4(x)b, W1(x)a, R3(x)a, R4(x)a



Linearizability

➥ Stronger than sequential consistency

➥ Assumption: the nodes (processes) have synchronized clocks

➥ i.e. an approximation of a global time

➥ Operations have time stamps based on these clocks

➥ In comparison with sequential consistency additionally required:

➥ the sequential order of operations is consistent with theirtimestamps

➥ Complex implementation

➥ Used for formal verification of concurrent algorithms



Causal Consistency

➥ Weakening of sequential consistency

➥ (Only) write operations that are potentially causally dependent

must be visible to all processes in the same order

P1:

P4:

P3:

P2:

Not causally consistent:

W(x)a

R(x)a W(x)b

R(x)b R(x)a

R(x)a R(x)b

P1:

P4:

P3:

P2:

Causally, but not seq. consistent:

W(x)a

R(x)a

R(x)a

R(x)a

W(x)b

W(x)c

R(x)c R(x)b

R(x)b R(x)c



Weak Consistency

➥ In practice: access to shared resources is coordinated viasynchronization variables (SV)

➥ Then: weaker consistency requirements are sufficient:

➥ accesses to SVs are sequentially consistent

➥ an operation on a SV is not allowed until all previous writeaccesses to data have been completed everywhere

➥ no operation on data is allowed before all previous operationson SVs have been completed

P1: W(x)a S

Allowed event sequence: Invalid event sequence:

P1:

P2:

W(x)a W(x)b S

R(x)aS

W(x)b

R(x)b R(x)a SR(x)a R(x)b SP3:

P4:

P2: S R(x)b



Release Consistency (Freigabe-Konsistenz)

➥ Idea as with weak consistency, but distinction between acquire

and release operations (mutual exclusion!)

➥ before an operation on the data is performed all acquire-

operations of the process must be completed

➥ before the end of a release operation all operations of the

process on the data must be completed

➥ acquire / release operations of a process are seen everywhere

in the same order

P1:

Allowed event sequence:

P2:P3:

W(x)bW(x)aacq(L) rel(L)acq(L) R(x)b rel(L)

R(x)a



Comparison of models

Strict Absolute time sequence of all shared accesses

(physically not useful!)

Linearization All processes see all accesses in the same order.

Accesses are sorted by a (non-unique) global

timestamp.

Sequential All processes see all accesses in the same order.

Accesses sre not sorted by time.

Causal All processes see causally linked accesses in the

same order.

Weak Data is only reliably consistent after a synchro-

nization has been performed.

Release Data is made consistent when leaving the critical

region.



8.3 Client Centric Consistency Models

➥ In practice:

➥ clients are usually independent from each other

➥ changes to the data are mostly rare

➥ because of partitioning often no write/write conflicts

➥ e.g., DNS, WWW (Caches), ...

➥ Eventual consistency: all replicas will eventually becomeconsistent if no updates take place for a long time

➥ Problem if a client changes the replica it is accessing

➥ updates may not have arrived there yet

➥ client detects inconsistent behavior

➥ Solution: client-centric consistency models

➥ guarantee consistency for an individual client

➥ but not for concurrent accesses by multiple clients

8.3 Client Centric Consistency Models ...


Illustration of the problem

data base

replicated

Distributed and

to another replicaand (transparently) creates a connectionThe client moves to another location

Wide area network

Read and writeoperations

Mobile computer

Replicas must retainclient centric consistency

8.3 Client Centric Consistency Models ...


Monotonic Read

➥ Example for a client centric consistency model

➥ more: see Tanenbaum / van Steen, Ch. 6.3

➥ Rule: When a process reads the value of a variable x, everysubsequent read operation for x returns the same or a morerecent value

➥ Example: access to a mailbox at different locations

WS(x )1

WS(x ;x )1 2

R(x )1

R(x )2

WS(x )1

WS(x )2

R(x )1

R(x )2 WS(x ;x )1 2

With monotonic read

L1:

L2:

L1:

L2:

Without monotonic read:

L1/L2: local copies

WS(...) set of write operations

Write operations to x in L1

are now executed on x in L2



8.4 Distribution Protocols

➥ Question: where, when and by whom are replicas placed?

➥ permanent replicas

➥ server initiated replicas

➥ client initiated replicas

➥ Question: how are updates distributed (regardless of consistency

protocol, ☞ 8.5)?

➥ sending invalidations, status or operations

➥ pull or push protocols

➥ unicast or multicast

8.4 Distribution Protocols ...


Placing the Replicas

Client initiatedreplicas

Server initiatedreplicas

Permanentreplicas

Server initiated replicas

Client initiated replicas

Clients

➥ All three types can occur simultaneously



Permanent Replicas

➥ Initial set of replicas, static, mostly small

➥ Examples:

➥ replicated web site (transparent to client)

➥ mirroring (Mirroring, client deliberately chooses a replica)

Server Initiated Replicas

➥ Server creates additional replicas on demand (Push-Cache)

➥ e.g., for web hosting services

➥ Difficult: deciding when and where replicas will be created

➥ usually access counter for each file, additional information

about the origin of the requests (→ nearest server)



Client initiated Replicas

➥ Other term: Client Cache

➥ Client cache locally stores (frequently) used data

➥ Goal: improving access time

➥ Management of the cache is completely left to the client

➥ server doesn’t care about consistency

➥ Data is usually kept in the cache for a limited time only

➥ prevents use of extremely obsolete data

➥ Cache usually placed on client machines, or shared cache for

multiple clients in their proximity

➥ e.g., Web proxy caches



Forwarding Updates: What’s Being Sent?

➥ The new value of the data object

➥ good with high read/update ratio

➥ The update operation (active replication)

➥ saves bandwidth (operation with parameters is usually small)

➥ but more computing power required

➥ Just a notification (invalidation protocols)

➥ notification makes the copy of the data object invalid

➥ on next access a new copy will be requested

➥ requires very little network bandwidth

➥ good at low read/update ratio



Pull and Push Protocols

➥ Push: updates are distributed on the initiative of the server thatmade the change

➥ replicas don’t have to request updates

➥ common in permanent and server-initiated replicas

➥ when a relatively high degree of consistency is required

➥ at high read/update ratio

➥ problem: server must know all replicas

➥ Pull: replicas actively request data updates

➥ common with client caches

➥ at low read/update ratio

➥ disadvantage: higher response time for cache access

➥ Leases: mixed form: first push for some time, then pull later



Unicast vs. Multicast

➥ Unicast: send update individually to each replica server

➥ Multicast: send one message and leave the distribution to the

network (e.g. IP multicast)

➥ often much more efficient

➥ especially in LANs: hardware broadcast possible

➥ Multicast is useful for push protocols

➥ Unicast is better with pull protocols

➥ only a single client/server requests an update



8.5 Consistency Protocols

➥ Describe how replica servers coordinate with each other to

implement a specific consistency model

➥ Here specifically considered:

➥ consistency models that serialize operations globally

➥ e.g., sequential, weak and release consistency

➥ Two basic approaches:

➥ primary-based (primarbasierte) protocols

➥ write operations are always performed on a special copy

(primary copy)

➥ replicated-write protocols

➥ write operations go to multiple copies

8.5 Consistency Protocols ...


Primary-Based Protocols

➥ Read operations are possible on arbitrary (local) copies

➥ Write operations must be handled by the primary copy

➥ e.g., to realize a sequential consistency:

➥ the primary copy updates all other copies and waits foracknowledgements, only then it replies to the client

➥ problem: performance

➥ Remote-write protocols

➥ the writer forwards the operation to a fixed primary copy

➥ Local-write protocols

➥ writer must become primary copy before it can do the update

➥ i.e., the primary copy is migrated between servers

➥ good model also for mobile users



Remote Write Protocol: Workflow (Sequential Consistency)

Data storageread(x) val(x)

serverBackup

x

serverBackup

x

serverBackup

xx

Primaryserver for x

Client Client




(1) Write request

write(x)Data storage

read(x) val(x)

serverBackup

x

serverBackup

x

serverBackup

xx

Primaryserver for x

Client Client




write(x)

is forwarded to primary server(1) Write request


read(x) val(x)

serverBackup

x

serverBackup

x

serverBackup

xx

Primaryserver for x

Client Client




update(x)update(x)

update(x)

(2) Primary server updates all backups

write(x)



read(x) val(x)

serverBackup

x

serverBackup

x

serverBackup

xx

Primaryserver for x

Client Client




update ACK

update ACKupdate ACK

and waits for acknowledgements

update(x)update(x)

update(x)


write(x)



read(x) val(x)

serverBackup

x

serverBackup

x

serverBackup

xx

Primaryserver for x

Client Client




write ACK

(3) Acknowledge the end of the write operation

write ACKupdate ACK



update(x)update(x)

update(x)


write(x)



read(x) val(x)

serverBackup

x

serverBackup

x

serverBackup

xx

Primaryserver for x

Client Client



Local Write Protocol: Workflow (Release Consistency)


serverBackup

x

serverBackup

x

ClientClient

x

Primaryserver for x server

Backup

x




(1) Acquire lock;

acq


serverBackup

x

serverBackup

x

ClientClient

x


Backup

x




requestprimary

Move primary copy to new server(1) Acquire lock;

acq


serverBackup

x

serverBackup

x

ClientClient

x


Backup

x




serverBackup

x x

Primaryserver for x

ACK

requestprimary


acq


serverBackup

x

serverBackup

x

ClientClient





acq ACK

serverBackup

x x

Primaryserver for x

ACK

requestprimary


acq


serverBackup

x

serverBackup

x

ClientClient




write ACK

(3) Write operations are executed (only) on the local server

write(x)


acq ACK

serverBackup

x x

Primaryserver for x

ACK

requestprimary


acq


serverBackup

x

serverBackup

x

ClientClient




relwrite ACK


write(x)


acq ACK

update(x)update(x)

update(x)

(4) New primary server updates backups

serverBackup

x x

Primaryserver for x

ACK

requestprimary


acq


serverBackup

x

serverBackup

x

ClientClient




rel ACKrelwrite ACK


write(x)


acq ACK


update ACK


update(x)update(x)

update(x)

(4) New primary server updates backups

serverBackup

x x

Primaryserver for x

ACK

requestprimary


acq


serverBackup

x

serverBackup

x

ClientClient



Replicated Write Protocols

➥ Allow execution of write operations on (multiple) arbitrary replicas

➥ In the following, two approaches:

➥ active replication

➥ update operations are passed on to all copies

➥ requirement: globally unique sequence of operations

➥ using totally ordered multicast

➥ or via central sequencer process

➥ quorum-based protocols

➥ only a portion of the replicas needs to be modified

➥ however, also multiple copies need to be read



Problem With Replicated Object Calls

➥ What happens when a replicated object calls another?

Replicated object

the same call three timesthe method call

All replicas seethe same call

Client replicates Object C receives

A B2

B1

B3

C

➥ Solution: middelware that is aware of replication

➥ coordinator of B makes sure that only one call is sent to C and

its result is distributed to all replicas of B



Quorum-based Protocols

➥ Clients need the permission of multiple servers for writing and for

reading

➥ When writing: send the request to (at least) NW copies

➥ their servers must agree to the change

➥ data gets a new version number when changed

➥ condition: NW > N/2 (N = total number of copies)

➥ prevents write/write conflicts

➥ When reading: send the request to (at least) NR copies

➥ client selects the latest version (highest version number)

➥ condition: NR + NW > N

➥ ensures that in any case the latest version is read



Quorum-based Protocols: Examples

(N < N/2)W

N = 6 N = 7 N = 7 N = 6 N = 1 N = 12R W R RW W

correct correctWrite/write conflictsare possible

Read quorum

Write quorum

E F G H

A B C D

I J K L

E F G H

A B C D

I J K L

E F G H

A B C D

I J K L



8.6 Summary

➥ Replication due to availability and performance

➥ Problem: consistency of copies

➥ strictest model: sequential consistency

➥ waekenings: causal consistency, weak ∼, release ∼

➥ client-centric consistency models

➥ Implementation of replication and consistency:

➥ replication scheme: static, server initiated, client initiated

➥ distribution protocols

➥ type of update, push / pull, unicast / multicast

➥ consistency protocols

➥ primary based / replicated write protocols



9 Distributed File Systems

9 Distributed File Systems ...


Contents

➥ General

➥ Case study: NFS

Literature


➥ Colouris, Dollimore, Kindberg: Ch. 8



[Coulouris, 8.1-8.3]9.1 General

➥ Objective: support the sharing of information (files) in an intranet

➥ in the Internet: WWW

➥ Allows applications to access remote files in the same way as

local files

➥ similar (or even better) performance and reliability

➥ Allows operation of diskless nodes

➥ Examples:

➥ NFS (standard in the UNIX area)

➥ AFS (goal: scalability), CIFS (Windows), CODA, xFS, ...

9.1 General ...


Requirements

➥ Transparency: access, location, mobility, performance and scaling

transparency

➥ Concurrent file updates (e.g., locks)

➥ File replication (often: local caching)

➥ Heterogeneity of hardware and operating system

➥ Fault tolerance (especially in case of server failure)

➥ often: at-least-once semantics + idempotent operations

➥ advantageous: stateless server (easy reboot)

➥ Consistency (☞ 8)

➥ Security (access control, authentication, encryption)

➥ Efficiency

9.1 General ...


Model Architecture of a Distributed File System

Server computer

Client module

Net−work

RPC

Directory service

Flat file service

interface

Client computer

Application

program

Application

program

➥ Tasks of the client module:

➥ emulation of the file interface of the local OS

➥ if necessary caching of files or file sections

9.1 General ...


Model Architecture of a Distributed File System ...

➥ Flat file service:

➥ provides idempotent access operations to files

➥ e.g., read, write, create, remove, getAttributes, setAttributes

➥ no open / close, no implicit file pointer

➥ files are identified by UFIDs (Unique File IDs)

➥ (long) integer IDs, can serve as capabilities

➥ Directory service:

➥ maps file or path names to UFIDs

➥ if necessary first authenticates the client and verifies its

access rights

➥ services for creating, deleting and modifying directories



9.2 Case Study: NFS

➥ Introduced in 1984 by Sun

➥ Open, OS independent protocol

➥ Architecture:

Net−work

Client computer

Client

UNIX

NFSUNIXkernel NFS

server

Server computer

systemcalls

Virtual file system Virtual file system

systemsystemfile

UNIXfile

UNIX

Applicationprogram

Applicationprogram

Oth

erfil

e sy

s. NFSprotocol

9.2 Case Study: NFS ...


Access Control and Authentication

➥ NFS server is stateless (up to and including NFS3)

➥ UFID (file handle): essentially just the file system ID and i-node

➥ not a capability

➥ Thus, access rights are checked with each request

➥ by the RPC protocol

➥ Authentication usually only via user and group ID

➥ extremely insecure!

➥ More possibilities in NFS3:

➥ Diffie-Hellman key exchange (insecure)

➥ Kerberos

➥ NFS4: secure RPC (RPCSEC GSS)



Mount Service

➥ An NFS file system can be mounted in the local directory tree

Server 2

/ (root)

users

nfs

ann joejim

ClientServer 1

/ (root) / (root)

export

people students x staff

vmunix usr

jonbob

remote

mount

remote

mount

➥ Collaboration of mount command in the client with the mount

service of the NFS server

➥ on request, the mount service provides file handles of the ex-

ported directories (for name resolution)



Translation of Pathnames

➥ Iteratively (NFS3): for each directory one request to NFS server

➥ necessary because path can cross mount points

➥ inefficiency is mitigated by client caching

Automounter

➥ Goal: set up an NFS mount only when it is accessed

➥ better fault tolerance, load balancing is possible

➥ Automounter is local NFS server

➥ thereby it sees the lookup()-requests of the client

➥ On request: set up the NFS mount and create a symbolic link tothe mount point

➥ After prolonged inactivity: release the mount



Server Caching

➥ Traditional file caching in UNIX:

➥ buffer in main memory for most recently used disk blocks

➥ read ahead : sequential blocks are loaded into cachebeforehand

➥ delayed write: modified blocks only written back when space isneeded; additionally every 30s by sync

➥ Server caching in NFS: two modes

➥ write through: write requests are executed in the server cacheand immediately also on disk

➥ advantage: no data loss in case of server crash

➥ delayed write: modified data will remain in the cache until acommit operation is executed (i.e. file is closed)

➥ advantage: better performance if many write operations



Client Caching

➥ NFS client buffers the results of (among other things) read / writeand lookup operations in a local cache

➥ leads to consistency issues, since now multiple copies

➥ Client is responsible for maintaining consistency

➥ Timeliness of the cache entry is checked with each access

➥ for that: compare whether the modification timestamp in thecache matches the modification timestamp on the server

➥ in case of negative validation: cache entry is deleted

➥ if validation is successful: cache entry is considered current fora certain time (3 - 30 s) without further checks

➥ i.e. changes only become visible after a few seconds

➥ compromise between consistency and efficiency



Client Caching ...

➥ Treatment of write operations:

➥ file block is marked as dirty in the cache

➥ marked blocks are sent asynchronously to the server:

➥ when closing the file

➥ at a sync operation on client machine

➥ possibly more often by block-input/output (bio)-demons

➥ Bio-demons realize asynchronous operations for read ahead and

delayed write

➥ for performance optimization

➥ NFS does not guarantee real consistency of client caches



10 Distributed Shared Memory

10 Distributed Shared Memory ...


Contents

➥ Introduction

➥ Design alternatives

Literature

➥ Colouris, Dollimore, Kindberg: Kap. 16.1-16.3



➥ Goal: shared memory in distributed systems

➥ Basic technique considered here:

➥ page-based memory management on the nodes

➥ on demand: loading pages over the network

➥ if necessary replication of pages to increase performance

➥ Differentiation:

Applica−

Runtime

system

Hardware

tion

system

Operating

Applica−

Runtime

system

Hardware

tion

system

Operating

Applica−

Runtime

system

Hardware

tion

system

Operating

Applica−

Runtime

system

Hardware

tion

system

Operating

Applica−

Runtime

system

Hardware

tion

system

Operating

Applica−

Runtime

system

Hardware

tion

system

Operating

Hardware DSM: NUMA Shared Virtual Memory Middleware

Computer 1 Computer 2 Computer 1 Computer 1Computer 2 Computer 2



Design alternatives

➥ Structure of the shared memory:

➥ byte-oriented (distributed shared memory pages)

➥ object-oriented (distributed shared objects)

➥ e.g., Orca

➥ immutable data (distributed shared container)

➥ operations: read, add, remove

➥ e.g., Linda Tuple Space, JavaSpaces

➥ Granularity (for page-based methods):

➥ when changing a byte: transmission of entire page

➥ with large pages: more efficient communication, less

administrative effort, more false sharing



Design alternatives ...

➥ Consistency model: mostly sequential or release consistency

➥ Consistency protocol: usually local write protocol

➥ i.e., memory page migrated to accessing process

➥ with or without replication for read accesses

➥ client initiated replication, i.e., reader requests copy

➥ usually only one writer per page

➥ mostly invalidation protocols (with push model)

➥ update protocols only if write accesses can be buffered (e.g.

with release consistency)




➥ Management of copies

➥ mostly: at any time either multiple readers or one writer

➥ each page has an owner

➥ writer or one of the readers (last writer)

➥ manages a list of processes with copies of the page

➥ before write access: process requests current copy

➥ Finding the owner of a page:

➥ central manager

➥ manages owners, forwards requests

➥ fixed distribution

➥ fixed mapping: page → manager




➥ Finding the owner of a page ...:

➥ multicast instead of manager

➥ problem: concurrent requests

➥ solution: totally ordered multicast, vector time stamps

➥ dynamically distributed manager

➥ every process knows a likely owner

➥ this node forwards the request if necessary

➥ the likely owner is updated,

➥ when a process transfers the ownership property

➥ upon receipt of an invalidation message

➥ upon receipt of a requested page

➥ when a request is forwarded (to the requestor)




➥ Problems: e.g., thrashing, especially due to false sharing

➥ simple remedy:

➥ a page can be migrated again only after a certain period of

time

➥ TreadMarks: multiple writer protocol

➥ release consistency; when released, only the changed

parts of the page are transferred

➥ changes are then “merged”

➥ in case of conflicts: result is non-deterministic



11 Fault Tolerance

11 Fault Tolerance ...


Contents

➥ Introduction

➥ Process elasticity

➥ Reliable communication

➥ Recovery

Literature


11.1 Introduction


Concepts

➥ Failure: external incorrect behavior (system no longer keeps itspromises)

➥ Error: (unobserved) incorrect internal state

➥ Fault: physical defect (in HW or SW) causing the error

➥ fault can be transient, periodic or permanent

➥ Fault tolerance: system does not fail despite a fault

➥ Requirement for reliable systems:

➥ availability: p(system is working at time t)

➥ reliability: p(system is working in time interval ∆t)

➥ safety: no major damage if system fails

➥ maintainability: effort for “repair” after a failure



Failure models

Crash failure Server halts

Omission failure Server is not responding to requests

Receive omission Server doesn’t receive incoming requests

Send omission Server doesn’t send messages

Timing failure Response time is outside the specification

Response failure Server’s response is incorrect

Value failure Only the value of the answer is wrong

State transition f. Incorrect control flow in server

Byzantine failure Random answers at arbitrary time

➥ Further distinction: can the client detect the failure or not?



Failure masking through redundancy

➥ Fault tolerant system must hide faults from other processes

➥ Most important technique: redundancy

➥ information redundancy: additional “check bits” (e.g., CRC)

➥ time redundancy: repetition of faulty actions

➥ physical redundancy: important components are provided

multiple times

➥ Example: TMR, triple modular redundancy

➥ components are replicated three times

➥ majority decision for the results

➥ protects against failure of a replicated component



Example for TMR

A1

A2

A3

C1

C2

C3B3

B2

B1

V2 V5 V8

A CB

Without redundancy

With TMR

Voter



Example for TMR

V1

V3

V4

V9

V7

V6

A1

A2

A3

C1

C2

C3B3

B2

B1

V2 V5 V8

A CB

Without redundancy

With TMR

Voter

11.2 Process Elasticity


Objective: Protection Against Process Failure

➥ By replicating processes in groups

➥ message to the group is received by all members

➥ usually with totally ordered multicast

➥ Questions:

➥ organization of the groups?

➥ flat (symmetrical) vs. hierarchical (central coordinator)

➥ group administration, synchronous join / exit

➥ necessary number of replicas?

➥ k fault tolerant: failure of k processes can be tolerated

➥ for silent failures: ≥ k + 1 Processes

➥ for Byzantine failures: ≥ 2k + 1 processes

➥ agreement in faulty systems?

11.2 Process Elasticity ...


Agreement in faulty systems

➥ Agreement is impossible with unreliable communication

➥ two army problem

➥ Agreement of faulty processes with reliable communication

➥ Byzantine agreement problem (byzantinische Generale)

➥ agreement only possible if > 2

3of the processes work correctly

1 2

4

1: (1, 2, x, 4)

2: (1, 2, y, 4)

3: (1, 2, 3, 4)

4: (1, 2, z, 4)

(i, j, k, l)

(1, 2, y ,4)

(1, 2, x ,4)

4 gets:

(1, 2, x ,4)

(e, f, g, h)

(1, 2, z ,4)

2 gets:

(1, 2, z ,4)

(a, b, c, d)

(1, 2, y ,4)

1 gets:

from 1:

from 2:

from 3:

from 4:

2

4

41

4x

y

z3

information

1

2

2

1. Send information 2. Received

1

3. Send received informationto all other processes

11.3 Reliable Communication


Objective: Protection Against Communication Failures

➥ Point-to-point communication (☞ RN I)

➥ TCP masks omission failures, but not crash failures

➥ Client/server communication (☞ 2.1)

➥ possible failures:

➥ server not found➥ lost request

➥ server crash while processing the request

➥ lost reply

➥ client crash after sending the request

➥ Group communication (☞ 7.3)

➥ Distributed commit (☞ 7.4)

11.4 Recovery


Objective: System Recovery After an Error

➥ Forward error recovery: go to a correct new state

➥ Backward error recovery: go to a correct earlier state

➥ i.e. reset to a consistent cut

➥ regular backup to stable storage (checkpointing)

➥ Independent checkpointing

➥ processes save their state independently of each other

➥ problem: domino effect

Initial stateP 1

P 2

1

2

3

4

5

6

Checkpoint

Error

11.4 Recovery ...


➥ Coordinated checkpoints

➥ Chandy/Lamport algorithm (☞ 6.4

➥ alternatively: blocking 2 phase protocol

➥ problem: requires to reset all processes

➥ Local checkpoints with message logging

➥ goal: restore the crashed process to a state consistent with

the current state of the other processes

➥ reset to last checkpoint and restore the received messages

P

Q

R

Check−point

Crashm1

m2 m3

Restore

m2

m1

m3Duplicate



12 Summary, Important Topics

12 Summary, Important Topics ...


1. Introduction

➥ Definition of a distributed system

➥ Features / challenges of distributed systems

➥ Architecture models: client/server, n-tier

2. Middleware

➥ Tasks of the middleware

➥ Communication-oriented and application-oriented middleware

➥ Implementation of remote calls (proxy pattern)

3. Distributed Programming with Java RMI

➥ Approach to create an RMI application

➥ Programming of server and client



4. Name Services

5. Process Management

➥ Graph partitioning, list scheduling, code migration

6. Time and Global State

➥ Synchronization of physical clocks

➥ Lamport’s happended-before relation (causality relation)

➥ Lamport and vector clocks

➥ Consistent cuts, Chandy/Lamport algorithm



7. Coordination

➥ Election algorithms

➥ Mutual exclusion (centralized, Ricart/Agrawala, ring)

➥ Multicast (reliability, order)

➥ Transactions

8. Replication and Consistency

➥ Concept of consistency

➥ Sequential consistency, release consistency

➥ Consistency protocols (primary-based, quorum-based)



9. Distributed File Systems

10. Distributed Shared Memory

11. Fault Tolerance

➥ Failure models

➥ Physical redundancy, agreement

➥ Recovery

Documents

Distributed Systems - Uni Siegen