
Advanced OS Distributed

Version – 5/7/2023

These are the class notes for CS7460, Advanced OS Distributed.


Table of Contents

1 Introduction
  1.1 Principles of Distributed Operating Systems
  1.2 Trends
  1.3 John Carter's Vision
  1.4 Transparency
2 Focused Review of Distributed Systems
3 Networking
  3.1 UDP
  3.2 TCP
  3.3 Remote Procedure Calls
  3.4 Remote Object Invocation
4 Indirect Communication
  4.1 Bulletin
5 Threads
6 Agents
  6.1 FIPA
  6.2 JADE
7 Synchronization
  7.1 Logical Clocks
  7.2 Totally-Ordered Multicast Protocol
  7.3 Distributed Mutual Exclusion
  7.4 Transaction
  7.5 Election Algorithm
8 Consistency
  8.1 Consistency Model
  8.2 Consistency Order
  8.3 Strict Consistency
  8.4 Sequential Consistency
  8.5 Causal Consistency
  8.6 Processor (PRAM) FIFO Consistency
  8.7 Weak Consistency
  8.8 Release Consistency
  8.9 Entry Consistency
  8.10 Open-to-Close Consistency
9 Client Centric Consistency Models
  9.1 Eventual Consistency
  9.2 Monotonic Reads
  9.3 Monotonic Writes
  9.4 Read Your Writes
10 Implementation Issues of Consistency
  10.1 Push (Server-Initiated) vs. Pull (Client-Initiated)
  10.2 Update vs. Invalidate
  10.3 Master/Slave vs. Caching
  10.4 Peer-to-Peer
  10.5 Mirroring vs. Caching
  10.6 Quorum Protocols – for Propagating Update Messages
  10.7 Epidemic Protocols
  10.8 Gossip Protocols
  10.9 Replica Location
  10.10 Update Propagation
11 Papers
  11.1 Mobile Agents
  11.2 SETI@home
  11.3 Worms
  11.4 End-to-End Arguments in System Design
  11.5 Energy-Efficient ZebraNet
  11.6 Causally and Totally Ordered Communication
  11.7 Disconnected Operation in the Coda File System
  11.8 Bayou Flexible Propagation
  11.9 Summary Cache: A Scalable Cache Sharing Protocol
  11.10 Practical Byzantine Fault Tolerance
  11.11 HA-NFS
  11.12 Recovery
  11.13 LARD
  11.14 Energy
  11.15 Porcupine
12 Distributed File Systems/FINAL
  12.1 NFS
  12.2 AFS
  12.3 Sprite
  12.4 xFS
  12.5 NASD Active Disks
  12.6 File System Layer
13 Fault Tolerance
  13.1 Partial Failures
  13.2 Redundancy for F.T.
  13.3 Built-in Redundancy
14 Software Distributed Shared Memory (DSM) Systems
  14.1 IVY
  14.2 Munin
  14.3 TreadMarks
  14.4 Midway
  14.5 Shasta
15 Computer Systems Security
  15.1 Prevention, Detection, Reaction
  15.2 Authentication, Authorization, Auditing
  15.3 Trusted Computing Base
  15.4 Design Principles for Security
  15.5 Distributed System Issues
  15.6 Cryptography
  15.7 Authentication

Table of Tables
  Table 10-1: Comparison of Synchronization Variable Consistencies
  Table 9-1: Bayou vs. CODA

Table of Equations

Table of Figures

Table of Text
  Text 2-1: Design Tenets for Good Distributed Systems
  Text 7-1: Example of Client Synchronizing Time with Cristian's Algorithm
  Text 7-2: Transaction 1
  Text 7-3: Transaction 2


1 Introduction

1.1 Principles of Distributed Operating Systems

The principles of distributed operating systems include:

Group communication: send information to a set of distributed processes, or receive information from a set of processes.

Location and naming: a name determines how a resource is located; naming ties into a lot of different things.

Replication: replicate resources, files, and DNS servers for fault tolerance; this gets rid of bottlenecks. Transparency is important.

Distributed consistency management: the tighter the consistency, the lower the performance. Consistency plays an important role because of replication.

Scalability.

Fault tolerance and security: a disc goes down and you still get good performance; protection from hackers; how to build things that are robust when you can't trust the other computer.

File systems. Distributed object database systems.

Course logistics: a number of HWs over a collection of papers; a midterm and a final; a project to build something on distributed systems. [email protected] email list. http://www.cs.utah.edu/classes/cs7460.

Review topics: Lamport clocks, stateless servers, consistency models, distributed transactions, replication, distributed locks, ordered multicast.

1.2 Trends

New OS features in the last twenty years:
    running mobile code like Java, and the associated protection and security issues
    ease of system administration
    explosion of peripheral devices – how to throw processing onto them, what should be done on them
    virtual machines running on the same processor
    the types of applications have changed
    real-time devices, systems, or applications that the OS must support
    internationalization (languages)
    clustering cheap PCs together for fault tolerance – hot-swap ability
    cheap processing by racking processors together
    networking speed increases – GBit Ethernet, point-to-point multi-GBit communication
    raw network capacity on the Internet has exploded – disjunctive technology changes
    split services – load balancing, replication – how do things behave when something goes down
    the S/W infrastructure necessary to take advantage of the huge increase in processing and network capacity

1.3 John Carter's Vision

John Carter: make the OS more tenable (capable, defendable) by having components and self-configuration. There is a push towards things becoming more and more mobile, not just applications (Java) but also wireless: now, instead of post-wiring buildings/houses for networks, one can add a wireless network. Everyone has a cheap PC on their desktop; clustering is available. Peripherals sit in different boxes communicating with IP packets – network appliances (not the graphics card and monitor, because the bandwidth is too high). Virtualizing S/W for accessibility in different places, with simplifying inputs like a notepad computer instead of a laptop. The vision of Xerox PARC, pervasive computing – throw processors everywhere. The environment follows you throughout the house, tracking where you go.


Centralized OS – has control over all its own parts, even if there are multiple processors, all peripherals under control of a single OS, relatively easy to do – one piece of code written by a single group.

Network OS – bunch of machines tied together, each machine is independent, but it can call on its neighbors, existence of 2nd machine is really obvious. Remote resources.

Distributed OS – take a collection of machines and resources and, to the extent possible, make them a single entity: a shared file system where the physical location of the file system is not obvious. More and more things are integrated; as more resources are needed, they are added automatically. The SETI cycle stealer. Processor sharing: figure out who is underutilized and distribute resources accordingly. IBM says computing is a utility: don't pay for the service, pay on-demand costs. IBM is pushing 'Java', something that permits transparency.

1.4 Transparency

Transparency is a set of goals (p. 5):

Location transparency: the physical location of resources and file systems is hidden; names have indirection.

Migration (relocation) transparency: beneficial if a resource can move on demand, but the overhead of moving can be costly. Running parallel applications spread remotely; overcoming physical disconnects. In general this hasn't been too effective.

Replication transparency: tolerates failures and gets rid of bottlenecks; services are replicated without the user being aware of it. Consistency is a huge problem.

Concurrency transparency: multiple copies of resources, multiple users accessing the same resources at the same time, e.g. accessing a single file server w/o things falling apart.

Failure transparency: fault tolerance; H/W techniques for automatic failover.

Size transparency: scalability; this can often add overhead and may slow performance if there isn't a bottleneck.

2 Focused Review of Distributed Systems

Types include:

Client-Server: sample servers include file, database, web, mail, and print servers, "X", authentication, and name servers (DNS, Microsoft Active Directory). The problem is the bottleneck. Good for system administration and security; bad for scalability.

Peer-to-Peer: Peers can connect to all the peers they are interested in. There is really no server or there are as many servers as there are clients. Mount or share to get disks on other peers. Still need to know the machine name to get to a print server so not too “transparent.”

o – Administration
o ± Availability, i.e. replication of resources
o ± Scalability, depends on the distribution of resources
o – Security
o – Privacy
o – Progressive Complexity
o – Naming/Resource Description
o ± Adding/Removing Resources (going down causes a lot of search costs)
o + Cost

Advanced Peer-to-Peer: KaZaA, Mango. Files that are used a lot should distribute on multiple file servers and others are less distributed. Even though all the peers have their own disks, we say that logically there is a single file system. File A can be replicated in two locations. Everyone is cooperating to manage a file structure.

o + Transparency
o – Consistency
o ± Performance


o ± Scalability
o + Reliability
o – Storage: more metadata is needed to denote where files are located: a metadata structure
o + Mobility: location flexible
o ± Availability: Japanese culture is to turn machines off when one leaves the office

Clusters: Rack of blade servers. Multiple PCs, fans in a rack. At bottom is a network switch called a SAN. Inexpensive servers in a rack with a network device for supporting a lot of clients. Clusters work well as web servers. Some blades are for computer servers, some for data storage, some as gatekeepers. This is where most of research is going.

o + Consistency
o + Session
o + Robustness – need to handle failover
o + Complexity
o + Security
o + Homogeneous Administration – one disk image for everyone
o ± Scalability
o + Cost for computation
o Load balancing is key
o Performance of parallelism – striping data across multiple drives
o Web farm application

(Diagram: a router/load balancer in front of front-end servers FE1, FE2, and FE3.)

Scalability and Good Design

Centralization is good but often crushes you. Three areas are affected: services, data, and algorithms.
o Services: this creates a single point of failure. Think about whether this is a good service to centralize.
o Data: a centralized lock server on a section of the database can throttle everyone else.
o Algorithms: one has to be carefully aware of load balance. Small imbalances lead to large imbalances; static and dynamic load balancing.
Decentralization increases complexity.

Text 2-1: Design tenets for good distributed systems

No machine knows all.
Make decisions based on local information.
Avoid single points of failure.
Do not assume that there is a global clock.
Minimize IPC.
    o Caching and replication are very helpful here
    o Piggyback information on the back of packets
Minimize synchronizing IPC – sending something out and waiting for a response.
Offload work to clients.
Distribute the workload.

3 Networking

Raw IPC

UDP preferred:
    Sensor data – non-critical data
    Video data – dropping P frames is acceptable in a video stream; selective ACK for key frames in the video; some flow control can be added – do it at the higher level
    Audio data – clicking when something is dropped
    Wireless with a TCP sync control


    Real-time data
    Gaming data

TCP preferred:
    FTP
    Anything crucial – shutting down a nuclear reactor
    EMAIL
    http uses TCP as its underlying protocol
    telnet, ssh
    DNS could go on either side
    Real-time control
    Gaming control

3.1 UDP

UDP sockets, fire & forget datagrams on "ports", low overhead, no acknowledgement, no flow control (at the end points) or congestion control (at the routers in between); dropped packets are a problem, as is packet reordering. (Best effort, unreliable.)
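As a concrete illustration, a fire-and-forget UDP send using POSIX sockets might look like the sketch below; the destination address, port, and payload are illustrative assumptions, and no reply or acknowledgement is expected.

    /* Minimal sketch of a fire-and-forget UDP datagram send (POSIX sockets);
     * the destination address, port, and payload are illustrative. There is
     * no connection, no acknowledgement, and no retransmission. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);   /* datagram socket */
        if (s < 0) { perror("socket"); return 1; }

        struct sockaddr_in dest;
        memset(&dest, 0, sizeof dest);
        dest.sin_family = AF_INET;
        dest.sin_port = htons(9999);                      /* assumed port */
        inet_pton(AF_INET, "127.0.0.1", &dest.sin_addr);  /* assumed receiver */

        const char msg[] = "sensor reading 42";
        /* send and forget: no reply is expected; a dropped packet is simply lost */
        sendto(s, msg, sizeof msg, 0, (struct sockaddr *)&dest, sizeof dest);

        close(s);
        return 0;
    }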

3.2 TCP

TCP – connection oriented, delivery guarantees, reliable delivery. Has a window containing the number of in-flight unacknowledged packets. The window is used for both flow control (the receiver advertising available window space) and congestion control (slow start). Reorders packets on delivery. After a drop, slow start falls to half and then starts a linear ramp up; a second drop in steady state halves it again, or drops it down to one packet. Overhead for connection setup (3-way handshake) and tear-down (4-way tear-down, and the sender retains state for a few minutes for cleanup). Video streaming could cause unnecessary retransmissions.

3.3 Remote Procedure Calls

Stack on the client: parameters and variables.

(Diagram: SP points past the pushed arguments 2 and 1; the frame pointer marks the base of the frame.)

Client:
    main () {
        foo (1, 2);
    }

Server:
    int foo (a, b) {
        int x;
        int i;
        ...
    }

Passing the parameters recreates the stack on the server.

STUBS

CLIENT:
    main () {
        foo ();
    }

    The client stub for foo marshalls the arguments and sends them to the server: send [FOO, 1, 2]. Then it blocks.

SERVER:
    foo (a, b) {
        /* the real foo */
    }

    stub_foo () {
        int a, b;
        unmarshall the data into a and b;
        foo (a, b);
        marshall the data for return;
    }

    Disperse_work () {
        sets up the connection
        for (;;) {
            receive ();
            switch (msg type)
                case FOO: stub_foo (data);
        }
    }

IDL – the interface definition language compiler creates stubs. IDL file:

    int foo (int, int)
    void bar (char, string, int, int)

The client stub will marshall the two arguments, send a message to the server, wait for a response, unmarshall the return, and do something with the results.
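As a rough sketch of what such a stub pair does, the following C program simulates the client stub, the server stub, and the dispatch loop in a single process; the message layout, the procedure ID FOO, and the send_and_wait helper are illustrative assumptions, not the output of any real IDL compiler.

    /* Minimal single-process sketch of an RPC stub pair for int foo(int, int). */
    #include <stdint.h>
    #include <stdio.h>

    #define FOO 1    /* hypothetical procedure ID from the generated header */

    /* ---- "server" side: the real procedure and its stub ---- */

    static int real_foo(int a, int b) { return a + b; }   /* stand-in body */

    /* server stub: unmarshalls the arguments, calls the real foo,
     * marshalls the result into the reply buffer */
    static void stub_foo(const uint32_t *args, uint32_t *reply)
    {
        int a = (int)args[0], b = (int)args[1];
        *reply = (uint32_t)real_foo(a, b);
    }

    /* dispatcher: looks at the procedure ID and calls the matching stub
     * (stands in for the receive/switch loop in the notes) */
    static void dispatch(const uint32_t *req, uint32_t *reply)
    {
        switch (req[0]) {
        case FOO: stub_foo(req + 1, reply); break;
        default:  *reply = 0;               break;
        }
    }

    /* ---- "client" side ---- */

    /* pretend transport: a real RPC would send the request over a socket
     * and block; here it just hands the buffer to the dispatcher */
    static void send_and_wait(const uint32_t *req, uint32_t *reply)
    {
        dispatch(req, reply);
    }

    /* client stub with the same signature as the local foo */
    static int foo(int a, int b)
    {
        uint32_t req[3] = { FOO, (uint32_t)a, (uint32_t)b };
        uint32_t reply = 0;
        send_and_wait(req, &reply);          /* marshall, send, block */
        return (int)reply;                   /* unmarshall the return value */
    }

    int main(void)
    {
        printf("foo(1,2) via the stub returns %d\n", foo(1, 2));
        return 0;
    }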

The IDL compiler generates the client stub (files), the server stub (files), and a header. The header has the type signatures and IDs, where foo is 1 and bar is 2, etc. The header will define the clients; random numbers identify the servers. How do I find the server?

Finding the server is called binding. Binding to the server: find the address of the server and then the port number. There is a daemon running that says "I am the mail daemon and I am port 25," or "I am the random-ID server and I am listening on port 124." There is a port mapper that will respond.

The server first registers itself with the port mapper; request the service and get the port in return. We also need to find the machine. In http, the URL gives us the machine name. On what server is the blah-blah-blah server? In CORBA it is the request server; connect to the binding server to find where the other servers are. DCE is similar. Now I have an IP address and a port number and I can open up a socket.

To do this I need to know the location of the name server and the port server, and I need to know the port of the port mapper. There is some init code that connects to the name server and says "tell me which address is the server." The name server returns the address. Open a connection to that machine's well-known port mapper address. Then we get back an available port. Finally we open up a socket.

With http we can bypass this, since the address is well known with a default http port, or the :port is in the URL.

Get an MX record back from a DNS server to find who handles email for this network. This is a specific service instead of a generic binding service.

Some name services are smart: they can take a structured query, compare it against the database of registered entries, and make geographic-proximity decisions. People are working on improving the intelligence here. Internet caching systems are forcing DNS entries to look different depending on where you are coming from; there are proxies or copies sitting in Japan or other countries, and different DNS servers in different parts of the world have different entries. DNS cannot cope with frequent changes or fine-grained differences.


3.3.1 Difficulties with RPC

3.3.1.1 Marshalling Complex Data Structures

Client ------ Client Stub ---------------- Server Stub ------- Server

Passing tricky parameter types like pointers – a reference to "simple data":

    void foo (int *)

Call-by-value-return is what is done. We marshall up the contents of what the pointer points to. The server stub knows it has to allocate some memory, and it then passes a pointer to that data up to the server. The client stub has to look at the return values and store them into the memory that the pointer points to.

A reference to truly complex data: the pointer points to other structures, ad infinitum. If the IDL (interface definition language) is rich enough, you can specify what the data structure looks like. The data structure consists of everything reachable until the pointers reach NULL. The RPC layer can "pickle" or flatten the data structure into an array.

(Diagram: a pointer structure flattened into an array, with NULL markers where the pointers end.) The demarshaller turns it back into a data structure.

Typically we don't do this if it is an arbitrarily large data structure, like a database of which we only want to use a subset. Sun prohibits this in their RPC, which is the standard; Sun allows strings, however. Serializing or pickling the data into a transferable form on the network is the problem.
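A minimal sketch of this flattening, assuming a simple linked list stands in for the "complex data": the pickle step walks the pointers into a flat array (what a client stub would put on the wire) and the unpickle step rebuilds the pointer structure on the other side.

    /* Minimal sketch of "pickling" a linked list into a flat array and
     * rebuilding it; the list layout and array size are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { int value; struct node *next; };

    /* flatten: walk the pointers and copy the values into an array;
     * a count takes the place of the NULL terminator on the wire */
    static int pickle(const struct node *head, int *out, int max)
    {
        int n = 0;
        for (; head != NULL && n < max; head = head->next)
            out[n++] = head->value;
        return n;
    }

    /* rebuild the pointer structure from the flat array on the receiving side */
    static struct node *unpickle(const int *in, int n)
    {
        struct node *head = NULL, **tail = &head;
        for (int i = 0; i < n; i++) {
            struct node *p = malloc(sizeof *p);
            p->value = in[i];
            p->next = NULL;
            *tail = p;
            tail = &p->next;
        }
        return head;
    }

    int main(void)
    {
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        int wire[8];
        int n = pickle(&a, wire, 8);           /* what the client stub would send */
        struct node *copy = unpickle(wire, n); /* what the server stub rebuilds */
        for (struct node *p = copy; p; p = p->next)
            printf("%d ", p->value);
        printf("\n");
        return 0;
    }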

Distributed shared memory is a solution here: make the two processes share the same address space on different machines. Software at the memory layer can trap on unavailable memory and send a message to the other machine, which will return the chunk of memory pointed to.

Client and server stubs might be PC and IBM server and there will be different data formats between the machines. Sun RPC supports Sun data structures. One approach is to define a network format. Up to client and server stub to coerce the data into the appropriate form. Each machine only has to know one network format, but the machines will have to do excess work. Having a negotiation approach where data might be sent in different formats is complex. One will have to bootstrap from the handshake. Network format is the typical approach. Two PCs might go with an interchange format amongst themselves, but this is rarely done.

3.3.1.2 Error Handling

3.3.1.2.1 NFS

On a local procedure call that dies, the calling process dies. On a client machine, what do we want the semantics to be? NFS didn't require any changes, because all NFS calls are detected on the fly. The client still uses read/write. At a particular call in the system call stack, something similar is done in the RPC call.

There are a ton of applications that do not know how to deal with NFS problems so they just wait until the NFS is back up. If it hangs we keep trying indefinitely.

3.3.1.2.2 RPC


At least once – keep pounding on the server until the server sends back an acknowledgement that it has received the message. The problem here is one of idempotence with all the repeat requests. Idempotent: like multiplying by 1 – a repeat is unnecessary but harmless. A client sending a request to a server to add $50 is not idempotent. We also need to cope with no response at all: either go asynchronous or have a timeout.

At most once – I am not willing to tolerate the request happening twice. A server sending a message to an ATM machine to dispense $50: we don't want a person pulling the plug before each reply so that the machine keeps on giving away money. We need some way to hang, like an NFS server. What would be nice is that if one copy of a web server went away, a DNS server would bind in a duplicate server.

3.3.1.3 Security

Unrobust systems, e.g. NFS servers: subvert the machine name, cause the server to crash, and bring up your own machine with the same IP address. Any client can connect to an NFS server. The request passes the user ID of whoever is accessing the application; the server checks the file ID and says it is OK. Well, one can come in with a PC and be anyone.

What is needed is out-of-band authentication. One way to do this is public- and private-key encryption. One can have an authentication server to which each party can prove that it is who it claims to be.

client --------- authentication server --------- server
  |              (pass a certificate)              |
 stub ------------------------------------------- stub

3.3.2 Blocking

Asynchronous: the client does not block and continues to process as normal. When the server replies, it causes a client interrupt to process the incoming message.
Synchronous: blocking.

3.4 Remote Object Invocation

Objects consist of data, methods, and interfaces. A client proxy object connects over to a server skeleton. The client proxy object has the same interface but doesn't have the data or code, which is over on the server. Each of the proxy's methods looks like a client stub: any method can bundle up the request and send it over to the server skeleton.

3.4.1 The difficulty is references

Can't punt on references in a remote object system. There are two types of references: a local proxy reference has one form; a global proxy reference (a smart pointer) has a different form – I either point to a local object or to something somewhere else.

3.4.2 Location and binding (finding out what objects are available)

How do we find an instance? An object broker – a service that tells where an object is located, like a name server in CORBA.

3.4.3 Proxy instantiation

Dynamic vs. static instantiation.


3.4.4 Persistence

Once we have references, there is a unique large number instead of a pointer at its core. With objects it is easy to convert it to a dynamic local instance.

We can store an instance persistently.

(Diagram: the code for an object stored persistently; the data stored on disc.)

When I go off and follow the global pointers, I get the code for the object from the disc, and the data for the specific values of the instance in some data structure.

4 Indirect Communication

4.1 Bulletin

Bulletin board space, a shared data space. Anyone else out there can be waiting for data coming into this space. A system called Linda has been available for 15 years. Linda features typed tuples: (1, "foo", 6.0) ("NAME", "John", "Carter")

put (1, "foo", 6.0) – put a tuple into the data space
read (1, "foo", 6.0) – returns the data to you and leaves it in the data space
read (x:int, "foo", value:float) – returns data that matches on type and value
take (…) – same thing but removes the tuple from the data space

Linda allows producers and consumers. Producers and consumers have no idea who they are.
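A minimal in-memory sketch of this producer/consumer style, for one fixed tuple shape (int, string, float): the names put/take follow the operations above, but the fixed-size table and the non-blocking take are simplifying assumptions (a real Linda take would block until a matching tuple appears).

    /* Minimal in-memory sketch of a Linda-style tuple space for one tuple
     * shape (int, string, float); a real tuple space is typed, persistent,
     * and shared across machines. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_TUPLES 16

    struct tuple { int a; char tag[16]; double v; int used; };
    static struct tuple space[MAX_TUPLES];

    /* producer: put a tuple into the shared data space */
    static void put(int a, const char *tag, double v)
    {
        for (int i = 0; i < MAX_TUPLES; i++)
            if (!space[i].used) {
                space[i].a = a;
                snprintf(space[i].tag, sizeof space[i].tag, "%s", tag);
                space[i].v = v;
                space[i].used = 1;
                return;
            }
    }

    /* consumer: take a tuple matching the tag, removing it from the space;
     * returns 0 if no match (a real Linda "take" would block instead) */
    static int take(const char *tag, int *a, double *v)
    {
        for (int i = 0; i < MAX_TUPLES; i++)
            if (space[i].used && strcmp(space[i].tag, tag) == 0) {
                *a = space[i].a;
                *v = space[i].v;
                space[i].used = 0;
                return 1;
            }
        return 0;
    }

    int main(void)
    {
        int a; double v;
        put(1, "foo", 6.0);              /* producer knows nothing of the consumer */
        if (take("foo", &a, &v))         /* consumer matches on the tag */
            printf("took (%d, \"foo\", %.1f)\n", a, v);
        return 0;
    }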

Advantages
    Anyone can be a producer, consumer, or both.
    Persistent and asynchronous – no requirement that the producer be running at the same time as the consumer. Logging is very natural in this environment.
    Location transparency. Offloads several complex issues to the infrastructure, one of which is location.
    Replication transparency – numerous consumers for a single producer, where the producer is free.
    Anonymity – privacy of who published the information.
    Synchronization is handled by the tuple space: you can't consume until the producer has finished. This is an advantage of the data structure.

Disadvantages
    Less efficient for one client and one server – indirection inefficiencies.
    Anonymity/Security – data can be corrupted by some end points.
    The infrastructure is complex.
    Scalability is not obvious. Inefficiency with only one centralized database.
    The programming model is not a great fit for everything: I might know who I want to talk to, so why do I have to use an indirection space? Good for producers and consumers but not for everything else.
    Resource management – the space growing and shrinking.
    Complex data.

Implementation – the first choice is a centralized tuple server. Too much time going across the network.


A solution is that on each node there can be a proxy maintaining some of the tuple state so that some work can be done locally. There is a difficulty here with updating.

Tuples can be distributed via a tree. This gives us log(n) lookup times.

SETI at home – distributing multiple tasks to available resources

5 Threads

3 requests come in and a single-process response will block on I/O. User-level threads.

Threads avoid concurrency transparency if this would result in performance degradation (p. 137). A thread context is nothing more than the CPU context. Thread switching in user space avoids the cost of flushing the TLB and changing the MMU memory map for process switching. There is no need to switch to the kernel context.

Advantage of User-Level Threads
    Cheap to create and destroy threads
    Switching thread context takes only a few instructions: saving/restoring CPU registers

Disadvantage of User-Level Threads
    Blocking on I/O blocks the entire process, i.e. all threads
    Expensive to create and destroy at the kernel level
    Blocks the I/O process at the user level

6 Agents

One that acts or has the power or authority to act. One empowered to act for or represent another: an author’s agent. Agents have to react to the situation.

Software agents in the AI world – robots interacting with their environment. Agents form an open communication system across heterogeneous systems. An agent communicates with other agents, possesses resources of its own, and is driven by a set of its own objectives. A printer agent could order toner when the printer is out of toner, or send this request to another agent. Behavioral goals for agents:

Attain objectives using available resources and skills provided by others.

End-user perspective: agents are tools to which we can give tasks.

System perspective:
    Reactive: senses changes in its environment and acts accordingly
    Autonomous: has control
    Goal driven: is proactive
    Communicative: able to communicate with other agents
    Mobile: can travel from one host to another
    Learning
    Believable

Key Ideas:
    Autonomy – agents encapsulate some state and make decisions
        o Executes its plans, but not blindly
        o Maintains local state
        o Responds appropriately to unforeseen circumstances
    Reactiveness – cannot spend hours deliberating; must respond


    Reactivity – agents are situated in an environment and perceive that environment
    Pro-activeness – plan how to achieve a goal; generate subsidiary goals
    Social ability – an agent communication language
        o an autopilot controls activities

Multi-Agent Systems (MAS)
    Extract primary stories from scenarios
    Decompose into "autonomous agents" & services
    Allocate responsibilities
    Devise ontologies & languages: "meeting", "room", "day", "time", "participant", "organizer"
        (setup (meeting "planning" (when "wed 3 pm") (organizer "martin") (participants…
    Develop agent messages and protocols: request time, inform answer
    Implement the agent and its behaviors using JADE/FIPA – FIPA is a standards organization
    Create agent interfaces to external services and infrastructure – need to talk to web services or databases

MAS example – meeting planner:
    Personal Assistant – checks for food preferences
    Calendar Manager – assists the personal assistant
    Meeting Arranger
    Canceling meetings, checking meetings

6.1 FIPA

http://www.fipa.org – they defined an Agent Platform (AP), which handles the infrastructure for the agent environment and the key agents for management.

ACL – communicative acts, KQML; content language – the contents of the messages – ontologies; protocols – message patterns; the transport level.

Agent Platform:
    Agent Management System – looks up agents, responsible for the life cycle
    Directory Facilitator – advertises agents
    Other agents

These operate on a Message Transport System. There is also a provision to go to other platforms.

6.2 JADE

http://sharon.cselt.it/projects/jade

JADE is a software framework that simplifies the implementation of multi-agent systems and attempts to be very efficient. JADE is the middleware for building MAS (multi-agent systems).

It has scalability. A JADE agent platform can span multiple computers (hosts). It uses RMI, IIOP, or http to communicate with other agents on the platform. A container on a host includes multiple agents, and the containers span multiple hosts. The container is an environment providing for the execution of the agents. To start an agent, tell the container that it is responsible for starting that agent. There is one AMS and DF per host, but there can be multiple containers per host.

The remote agent console permits you to look at what's running. The life-cycle agent (AMS) tells agents when to start and stop. The goal of a performative is to tell you the kind of message being sent.

In Java, code saves its state, moves the code to another host, restores the state, and runs the code; it is not mobile in the stronger sense.

Separate threads in Java are not blocked on I/O. The agent framework allocates one thread per agent.


6.2.1 ACL example

INFORM
    sender: martin
    receiver: bob reed craig
    protocol: arrange_meeting
    conversation_id: c23
    reply_with: 275
    reply_by: wed 3pm
    language: lisp
    ontology: events
    content: (setup "planning" (length "1 hour")…

6.2.2 Ontology

Standard class structure stuff:
    Event: type
    Person: email, role
    Multiple: one of
    Room: features

6.2.3 Protocol (Agent)

PERFORMATIVE (things in parens are contents)
    Organizer PA – REQUEST (arrangeMeeting
    Assistant – QUERY (availParticipate
    Participant PA
    Participant Calendar
    Room Broker

6.2.4 Remote Agent Management GUI

This is an agent tool with a GUI.

6.2.5 JADE conclusion

Can create agents without all of the FIPA stuff. No need to implement the Agent Platform. No need to implement the agent-management ontology and functionalities. No need to implement message transport and parsing. Interaction protocols need only be extended via handle methods.

The CS6934 seminar has a lot of papers.

Disadvantages:
    Replication transparency is difficult when resources are limited. Unlike http server replication, agent replication is not automatic.
    Failure handling must be built in.

7 Synchronization

How do we make decisions that everyone agrees with in a distributed system? We cannot know what everyone else knows. How do we make common decisions, how do we synchronize our decisions? As a group we can agree that a whole bunch of decisions will be made atomically. Distributed synchronization and distributed state.


Quartz has a resonant frequency permitting oscillators with different frequencies driving a counter, which can generate interrupts to the OS to notify. Machines can be out of sync by some real delta. E.g. Make uses timestamps that can be maintained by different machines. More abstractly, e.g. every hour on the hour we do something on distributed systems that can mess up an algorithm by variations. Bursty networks are tough to keep sync, but two machines on an Ethernet are easier to sync.

m1 ------- m2 ---------- m3

d: the variance we can tolerate, e.g. 1 sec
p: clock skew (± ticks per unit time), the physical manufacturing skew of the clock – how many seconds a day our clock can drift

With d = 1 sec and p = ±10 seconds/day, two clocks drifting in opposite directions can diverge by up to 20 seconds a day, so we need to exchange info 20 times a day to ensure the tolerance is maintained.

Cristian's algorithm – some machine is a designated time server, e.g. m2 above. m1 and m3 send messages to m2, who replies with the time on his clock. The client puts a timestamp on the message, measures the RTT on the return, divides it by 2, and adds the result to the server's time.

Text 7-1: Example of Client Synchronizing Time with Cristian's Algorithm

Send time is 100, receive time is 140. The client increments the "T" from the time server by 20 (half the round-trip time).
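A minimal sketch of the client-side arithmetic, using the numbers from the example above; the server's reply T is an assumed value.

    /* Minimal sketch of Cristian's algorithm on the client side:
     * new_time = T_server + RTT/2, where RTT = receive_time - send_time. */
    #include <stdio.h>

    static long cristian_adjust(long t_server, long send_time, long receive_time)
    {
        long rtt = receive_time - send_time;
        return t_server + rtt / 2;
    }

    int main(void)
    {
        long send_time = 100, receive_time = 140;   /* numbers from the example */
        long t_server = 1000;                       /* assumed server reply */
        printf("adjusted clock = %ld\n",
               cristian_adjust(t_server, send_time, receive_time)); /* 1020 */
        return 0;
    }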

The problem is if there is high variance between sending and receiving. E.g. say we find that our clock is 10 ticks off; we don't want to do a sudden jump backwards, or timestamps on some files could end up in the future. The best way is to slow time down: instead of updating the clock every 10 ms, update it every 11 ms until the time is adjusted.

“If one finds oneself consistently fast or consistently slow to self-correct in the future” -- John CarterOf course this could be just a hot room or cold room. (NTP creates a drift file for this purpose)

Atomic clock – based on the oscillation of a particular Cesium atom; particular elements are known to have very constant oscillation rates. WWV out of Ft. Collins delivers short-wave clock signals derived from the atomic clock. The time server can have a WWV receiver.

Berkeley algorithm – the time server sends out its time to everyone else and gets a reply with what everyone else's time is. It averages everyone's times. On a second round the "T" server sends a message to every other client telling it how to adjust its time, and then it adjusts its own time.

NTP (Network Time Protocol) - has strata to weight the importance of different nodes. (Network Time daemon – sort of a “loud protocol”) Distributed algorithm that weights according to quality. Berkeley can have a problem with a very large peer group.

As you do more work to try to sync up more, the more work can add more skew, “Heisenberg problem of distributed clocks”. Order can be important like adding $10 to a bank account and then 1%. Need to agree on particular order, post facto – retroactive. Defer messages before processing to check for concurrency.

7.1 Logical Clocks

"happens before" – captures causality and concurrency (a partial order) in the Lamport sense.

causality – one event happens before another event
concurrency – the events happened at the same time, in terms of the messages that were exchanged

1. If A and B occur on the same machine and A precedes B, then A → B.
2. If A is the event of a message being sent by one process and B is the event of the same message being received, then A → B.
3. The → relation (A happens before B) is transitive.
----------------------


Vector Timestamps – each node maintains a vector of the logical times it has seen from the other nodes, e.g. (0, 0, 0). When I send a message I increment the entry for my own machine. If a vector comes in, I take the element-wise max and use that value to update my logical clock (timestamp), e.g. (0, 0, 1).

First entry – # of events on A of which I know
Second entry – # of events on B of which I know
Third entry – # of events on C of which I know
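A minimal sketch of this bookkeeping for three nodes A, B, and C; the fixed vector size and the node numbering are illustrative assumptions.

    /* Minimal sketch of vector timestamp maintenance for three nodes. */
    #include <stdio.h>

    #define N 3

    /* local event or message send on node `me`: bump my own entry */
    static void vc_tick(int clock[N], int me)
    {
        clock[me]++;
    }

    /* message receive: element-wise max of my clock and the incoming
     * timestamp, then count the receive as an event of my own */
    static void vc_receive(int clock[N], const int msg[N], int me)
    {
        for (int i = 0; i < N; i++)
            if (msg[i] > clock[i])
                clock[i] = msg[i];
        clock[me]++;
    }

    int main(void)
    {
        int a[N] = {0, 0, 0};           /* node A's vector clock */
        int from_c[N] = {0, 0, 1};      /* timestamp on a message sent by C */

        vc_tick(a, 0);                  /* A does a local send: (1,0,0) */
        vc_receive(a, from_c, 0);       /* A receives C's message: (2,0,1) */
        printf("A's clock: (%d,%d,%d)\n", a[0], a[1], a[2]);
        return 0;
    }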

-----------------------

Review of Lamport "happens before" relationships:

Given E1 tagged with "17" and E2 tagged with "19", is E1 → E2? NO – they could be independent events. The timestamps alone imply no causal relation.

Given E1 → E2, is T(E1) < T(E2)? THIS IS TRUE.

-----------------------

7.2 Totally-Ordered Multicast protocol

(Distributed Systems p. 255)

The sender multicasts its message to a group of processes with a timestamp of the sender's logical time. The receivers place the message in their local queues, ordered by timestamp, and multicast an acknowledgement to all other processes. Lamport clocks ensure that no two messages have the same timestamp, and the multicasting guarantees that all processes have the same queue. Only after receiving all the acknowledgements can a receiver process the entry at the head of its queue.

A and B both send. A's message gets to A's queue first; B's message gets to B's queue first; B's message gets to C's queue first, but A's should be delivered at C before B's.

Both multicast to everybody. By seeing the timestamps on the responses, we know the order. (We assume that on the wire messages don't get flipped: A-to-C messages arrive in order.)

A's message is timestamped 7, B's is timestamped 8. A's queue has A7; B's queue has B8; C's queue has B8. Everyone ACKs back. Then B's queue has A7, C's queue has A7, and A's queue has B8.

Double the multicast space: the first delivery sits around in a queue until all the ACKs are received back, then everyone can process the earliest member in the queue after the second "double multicast".
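A minimal single-process sketch of this hold-back queue: messages are kept sorted by (timestamp, sender) and the head is delivered only once every process has acknowledged it. The queue size and the tie-break on sender ID are illustrative assumptions.

    /* Minimal simulation of the totally-ordered multicast hold-back queue. */
    #include <stdio.h>

    #define NPROC 3
    #define MAXQ  16

    struct msg { int ts; int sender; int acks; };
    static struct msg queue[MAXQ];
    static int qlen = 0;

    /* insert a message keeping the queue sorted by (timestamp, sender) */
    static void enqueue(int ts, int sender)
    {
        int i = qlen++;
        while (i > 0 && (queue[i-1].ts > ts ||
                         (queue[i-1].ts == ts && queue[i-1].sender > sender))) {
            queue[i] = queue[i-1];
            i--;
        }
        queue[i].ts = ts; queue[i].sender = sender; queue[i].acks = 0;
    }

    /* record one acknowledgement for the message with this (ts, sender) */
    static void ack(int ts, int sender)
    {
        for (int i = 0; i < qlen; i++)
            if (queue[i].ts == ts && queue[i].sender == sender)
                queue[i].acks++;
    }

    /* deliver the head of the queue only when everyone has acknowledged it */
    static void try_deliver(void)
    {
        while (qlen > 0 && queue[0].acks >= NPROC) {
            printf("deliver message from P%d with timestamp %d\n",
                   queue[0].sender, queue[0].ts);
            for (int i = 1; i < qlen; i++)
                queue[i-1] = queue[i];
            qlen--;
        }
    }

    int main(void)
    {
        enqueue(8, 1);                 /* B's message happens to arrive first here */
        enqueue(7, 0);                 /* A's message has the lower timestamp */
        for (int p = 0; p < NPROC; p++) { ack(7, 0); ack(8, 1); }
        try_deliver();                 /* delivers A7 before B8 */
        return 0;
    }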

There is a question of how important causality is. Getting a PID is going to affect the timestamps of a later database transaction. Alternatively, the application knows a lot more about the proper order and can do things more efficiently here than the operating-system layer.


7.3 Distributed Mutual Exclusion

“Locking Protocols”

I want to guarantee that I am the only one operating on this piece of data. The simplest way is to have a centralized lock server. There is a queue that contains all requests and who has the lock. If we can't grab position, then the requestor doesn't get a reply yet.

Advantage/Disadvantage of a Lock Server
    + Simple
    - Single point of failure
    - Performance bottleneck
    - No cheap local reuse: even if the data is on my server I still have to go to the lock server, always sending lock requests. We would prefer performance to drop only when there are conflicts on the lock.

Decentralized

Whenever I don’t have the lock, I multicast to everyone that I want a lock. If multiple requests, I check the PID and if I was lower I get the lock otherwise I know the other person has it. Pick IP address with PID to make sure.

    - Ton of acknowledgements: you keep getting pestered about something that you don't care about
    - An end point goes down

Distributed token-based

Send a request to the node one thinks has the token. If he doesn’t have it he forwards the request to who he thinks has it…

    + simple
    + cheap local reuse
    + no performance bottleneck
    - single point of failure: if the machine with the token goes down, someone has to say "I think the token is lost" and run an election protocol to recreate the token
    + scales well

Tarjan's path compression analysis says the average # of hops is TWO. – Good
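A minimal simulation of the token-forwarding idea with probable-owner pointers and path compression; the node count and the initial owner chain are illustrative assumptions.

    /* Minimal simulation of token-based mutual exclusion with "probable owner"
     * forwarding: each node remembers who it last thought held the token, and
     * requests are forwarded along that chain, compressing the path as they go. */
    #include <stdio.h>

    #define NODES 4

    static int probable_owner[NODES] = {1, 2, 3, 3};  /* node 3 holds the token */
    static int has_token[NODES]      = {0, 0, 0, 1};

    /* forward the request along probable-owner pointers until the holder is
     * found, then hand the token over; nodes along the way now point at the
     * requester (path compression) */
    static void request_token(int requester)
    {
        int node = requester, hops = 0;
        while (!has_token[node]) {
            int next = probable_owner[node];
            probable_owner[node] = requester;   /* path compression */
            node = next;
            hops++;
        }
        has_token[node] = 0;
        has_token[requester] = 1;
        probable_owner[requester] = requester;
        printf("node %d got the token from node %d after %d hop(s)\n",
               requester, node, hops);
    }

    int main(void)
    {
        request_token(0);   /* forwarded 0 -> 1 -> 2 -> 3 */
        request_token(2);   /* now only one hop, thanks to the compressed path */
        return 0;
    }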

7.4 Transaction

7.4.1 Definition: ACID Test

Five operations:

BEGIN-TRANS END-TRANS ABORT READ WRITE

BEGIN_TRANS
    READ ACCT1.BAK
    READ ACCT2.BAK
    ACCT1.bak += 50
    ACCT2.bak -= 50
    WRITE ACCT1.bak
    WRITE ACCT2.bak
END_TRANS

Atomic – can’t see one of the operations without seeing the other. The transaction happens indivisibly, all or nothing.

Consistent – constraints maintained consistently. The transaction does not violate system invariants, i.e. the amount of money at a bank is constant even after an internal transfer.

Isolated – only I can see the changes made from within a transaction, (serializable – executing instructions within a transaction together) results become visible as if transactions ran serially. Concurrent transactions do not interfere with each other. Locks will prevent conflicting transactions from executing concurrently.

Durable – even if one has a failure—a committed transaction is committed. Once the transaction commits, i.e. writes a committed statement to disc, the changes are permanent.

7.4.2 Implementation

One true atomic operation logically commits all changes. One method is shadow copies of the data.

Shadow – create my own private copies of the inode blocks. When all the changes are done, make the old file inode point to the new file inode, which will have a lot of new file blocks. The operating system cleans up the fragmentation.

Write-ahead Log –

1) Assume private log and data

The log record format is: transaction identifier, variable modified, old value, new value.

    (BEGIN_TRANS 5)
    (5, ACCT1.BAK, 5000, 5050)
    (5, ACCT2.BAK, 5000, 4950)
    (COMMIT 5) – the ATOMIC operation that commits the transaction
    (ABORT) – if we ABORT, the statement goes back and erases the operations

If I am going to read a value, I need to check my private data in the log to see if I have made any tentative changes.

At the time I do a commit, this log has to be written to disc in case the computer goes down after a few operations; it then checks the log when it comes up to finish the COMMIT. Until (COMMIT 5) is written on disc the transaction is not committed. As soon as it hits stable storage, we have gone atomically to a COMMITted transaction. Did the COMMIT record hit the disc yet? A single disc write takes our transaction to a committed state. (A small sketch of this log appears after item 2 below.)

2) Writing operations in permanent data locations could require unwinding permanent changes
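A minimal sketch of approach 1), the private write-ahead log: records are appended to a log file, and the transaction becomes durable only when the COMMIT record reaches stable storage. The file name and record format are illustrative assumptions.

    /* Minimal sketch of a write-ahead log for one transaction. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static FILE *logf;

    static void log_begin(int tid)
    {
        fprintf(logf, "(BEGIN_TRANS %d)\n", tid);
    }

    static void log_write(int tid, const char *var, long oldv, long newv)
    {
        fprintf(logf, "(%d, %s, %ld, %ld)\n", tid, var, oldv, newv);
    }

    /* the single disk write that atomically commits the transaction */
    static void log_commit(int tid)
    {
        fprintf(logf, "(COMMIT %d)\n", tid);
        fflush(logf);
        fsync(fileno(logf));     /* force the COMMIT record to stable storage */
    }

    int main(void)
    {
        logf = fopen("txn.log", "a");        /* assumed log file name */
        if (!logf) { perror("txn.log"); return EXIT_FAILURE; }

        log_begin(5);
        log_write(5, "ACCT1.BAK", 5000, 5050);
        log_write(5, "ACCT2.BAK", 5000, 4950);
        log_commit(5);     /* after this point the transaction is committed */

        fclose(logf);
        return 0;
    }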

7.4.2.1 Two-Phase Commit

How do we get the commit to work over a distributed network? COORDINATOR + COHORTS.

COORDINATOR
1. WRITE "PREPARE" in the log.
2. Send "READY TO COMMIT?" to all COHORTS (multicast).
3. Collect the replies to 2) from everybody (cohort step 2 below). If I don't get all the replies, I write an abort. If all replies come in…
4. WRITE the LOCAL COMMIT record. At this point the transaction is committed despite failures.
5. Multicast "COMMITTED" to all COHORTS.

COHORT
1. After receiving 2) above, WRITE "READY" in the log.
2. REPLY "OK TO COMMIT".

If a cohort crashes after 2, then it queries others after it comes up to see state. If Coordinator crashes, when it comes back up and doesn’t see a COMMIT record it aborts transaction, even if “READY TO COMMIT” has been sent to everyone.
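A minimal single-process simulation of the two-phase commit decision: the coordinator collects votes and writes its COMMIT record (the atomic commit point) only if every cohort voted yes. The cohort count and the vote() stand-in are illustrative assumptions.

    /* Minimal simulation of the two-phase commit decision logic. */
    #include <stdio.h>

    #define COHORTS 3

    /* stand-in for sending "READY TO COMMIT?" and waiting for the reply */
    static int vote(int cohort)
    {
        printf("cohort %d: wrote READY to its log, replying OK TO COMMIT\n", cohort);
        return 1;                      /* 0 would mean "abort" or no reply */
    }

    static void two_phase_commit(void)
    {
        printf("coordinator: wrote PREPARE to log\n");

        int all_yes = 1;
        for (int c = 0; c < COHORTS; c++)       /* phase 1: collect votes */
            if (!vote(c))
                all_yes = 0;

        if (all_yes) {                          /* phase 2: decide and tell cohorts */
            printf("coordinator: wrote local COMMIT record (transaction committed)\n");
            printf("coordinator: multicast COMMITTED to all cohorts\n");
        } else {
            printf("coordinator: wrote ABORT record, multicast ABORT\n");
        }
    }

    int main(void)
    {
        two_phase_commit();
        return 0;
    }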

This is similar to Lamport multicasting to obtain total ordering, but the purpose here is to guarantee the atomicity of a transaction and the rest of ACID. Lamport ordered multicasting has synchronization as its objective.

7.4.2.2 Three-Phase Commit

No one is really using this.

7.4.3 Concurrency

Pessimistic Concurrency Control – uses locks; if I can't lock at this point I abort the transaction to avoid deadlock. Guaranteed to avoid deadlock. Write the transaction using TWO-PHASE LOCKING (a grab-all-the-locks phase, then process). One can canonically define the order for locking.

Optimistic Concurrency Control – give them the old, last-committed account balance even though the data is locked.

Text 7-2: Transaction 1

B_T
    RD A1.BAL
    RD A2.BAL
    A1.BAL += 50
    A2.BAL -= 50
    WT A1.BAL
    WT A2.BAL
E_T

Text 7-3: Transaction 2

B_T
    RD A1.BAL
    RD A3.BAL
    A1.BAL += 50
    A3.BAL -= 50
    WT A1.BAL
    WT A3.BAL
E_T


The second transaction is going to use the data before the first transaction commits. There are going to be timestamps on the database. When transaction 2 goes to commit and sees a changed timestamp, then it has to abort. I have to check a list of version timestamps on all data items that I have read, to make sure that nothing has changed before final commitment; otherwise abort. If transaction 1 modifies a value that transaction 2 read, and transaction 2 verifies first, then transaction 1 will abort.
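A minimal sketch of this commit-time validation step in optimistic concurrency control: the transaction remembers the version of every item it read and aborts if any of those versions changed before it commits. The item layout is an illustrative assumption.

    /* Minimal sketch of optimistic concurrency control validation. */
    #include <stdio.h>

    struct item { long balance; long version; };

    struct read_entry { struct item *it; long seen_version; };

    /* commit-time validation: true only if nothing we read has changed */
    static int validate(const struct read_entry *reads, int n)
    {
        for (int i = 0; i < n; i++)
            if (reads[i].it->version != reads[i].seen_version)
                return 0;              /* someone committed under us: abort */
        return 1;
    }

    int main(void)
    {
        struct item a1 = {5000, 7};

        /* transaction 2 reads A1 and records the version it saw */
        struct read_entry reads[1] = { { &a1, a1.version } };

        /* meanwhile transaction 1 commits a change to A1 */
        a1.balance += 50;
        a1.version++;

        printf(validate(reads, 1) ? "commit\n" : "abort and retry\n"); /* abort */
        return 0;
    }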

Concurrency is about ordering.
Consistency is about how data is kept consistent between machines.

Papers:
    End-to-end design issues
    ZebraNet

7.5 Election Algorithm

Bully algorithm

(Diagram: six nodes, numbered 1 through 6.)

If the leader disappears, someone can say "I think the leader has died." Say 6) goes down; a heartbeat or timeout mechanism tells someone that there is a problem. Anyone can call an election. Say 3) thinks something is wrong with the leader, so it calls an election and sends messages to everyone. 5) can also be sending messages to everyone. If I, 4), get a message from someone lower than me, 3), I speak up because I am higher, and whoever is highest among the responders broadcasts that it is the leader.
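A minimal simulation of the bully election: the caller contacts every higher-numbered node, and if any of them is alive that node takes over the election; otherwise the caller declares itself the leader. The liveness flags are illustrative assumptions.

    /* Minimal simulation of the bully election algorithm. */
    #include <stdio.h>

    #define NODES 6

    static int alive[NODES + 1] = {0, 1, 1, 1, 1, 1, 0};   /* node 6 is down */

    static int election(int caller)
    {
        /* ask everyone with a higher ID; the highest alive node wins */
        for (int id = NODES; id > caller; id--)
            if (alive[id])
                return election(id);   /* a higher node takes over the election */
        printf("node %d broadcasts: I am the leader\n", caller);
        return caller;
    }

    int main(void)
    {
        election(3);   /* node 3 notices the leader is gone and calls an election;
                        * node 5 ends up winning since 6 is down */
        return 0;
    }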

Issue – a network partition: what if 6) didn't crash but is only delayed by a network break? It comes back up and doesn't know that it isn't the leader anymore. 1) and 6) could think that 6) is the leader in one partition while the rest follow a new leader. How to fix this? Solution: an incarnation # can exist in a field of every message. One ends up with sub-quorums that may operate with autonomous leaders while the network is broken. Anything that is in a crucial file-system database may be problematic to distribute over a partitioned network; one runs into the consistency problem.

8 Consistency

Caching and replication are really important for improving performance of a distributed system. Keep data cached close to where it is used. If clients are using data faster than servers are changing the data, we can benefit from the proxy server. Replicated database servers increase reliability if one is not available due to weather circumstances in one location.

8.1 Consistency Model

A contract between the application and the storage system (hardware shared memory or software): "if you do this, I will do this." In release consistency, as long as you use the underlying system's synchronization – i.e. you put locks around your global accesses – the system guarantees consistency, and only then. In sequential consistency, whatever you do, things look like they are happening on a single processor; this is fairly strict and puts strong requirements on the implementation.


A contract between the PROGRAMMER & the storage (memory/data) system. Programmer: how much synchronization is required? If you provide locks etc., the SYSTEM is told when data must appear to be consistent.

The web has a very weak consistency model. The user does a 'reload' or erases caches when he knows or thinks a page is out of date. A file system must merge information from different crashed copies, or the last person to write sets the directory contents while someone else has their changes lost.

Strict consistency is not possible because we cannot make things happen in a distributed system according to a global clock order.

Sequential Consistency (Lamport): A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations were executed in SOME sequential order, and the operations of each individual processor appear in program order. Regardless of how many units read/write the same data, it guarantees that the data seen is the same as if everything had been run in some sequential order.

For example, let's say I have four processors. The initial condition is x = 0.

P1: x = 1
P2: x = 2
P3: reads x and gets back 2, then reads x and gets back 1
(P4: reads x and gets back 1, then reads x and gets back 2 – this is not sequentially possible on a single processor, since P3 and P4 conflict)
P4: Rx 0, Rx 2 is consistent

This is acceptable for P1 – P3.

x=2, Rx 2, x=1, Rx 1 is also sequentially consistent.

Dekker's algorithm works in a sequentially consistent system:

P1: FLAG1 = 1; if (!FLAG2) then <enter cs>
P2: FLAG2 = 1; if (!FLAG1) then <enter cs>

How to implement a system that does this? In hardware we have a number of caches sitting on a bus with memory.

Models are contracts; protocols are implementations of contracts.

(Diagram: processors P1, P2, P3, each with a cache ($), sit on a shared bus to memory; two of the caches hold copies of x = 1.)

Write-invalidate Protocols: Only allow one write at a time, first send out a bus operation to invalidate all copies in cache before the write and then to write over the data. All going through a central server, I shoot down all copies before making any changes. If I need to maintain copies for fault tolerance systems, an invalidation followed by a fault, blows up my requirement.

Directory Protocols: a H/W notion; a lot of the original ideas were driven by the H/W camp. For performance reasons we don't want everyone looking at every piece of data on the Ethernet. Instead we keep track of who has the data. In a H/W system we have four nodes:

(Diagram: four nodes; node 0 is the home node for addresses 0-1 MB, node 1 is the home for 1-2 MB, and so on. Each home directory records, for any given block, its state and a copy set. The block holding x is homed on one node and cached on others.)

If another node has x then its state is shared {0,3}; later x becomes exclusive on {3} after a request.

We maintain shared and exclusive directory access appropriately.

8.2 Consistency Order

Here we list the consistency protocols in weakening order; weaker protocols lead to better performance:

Sequential

Weak

Release
    Write-Invalidate Protocol: finish receiving invalidate ACKs before releasing
    Write-Update Protocol: finish updating everyone else before releasing

Unless I do an acquire, I am not required to see an old update from another processor. Eventually these global updates should propagate through anyway. <RELEASE> does forward out updates, but in a weak scenario.

DASH – Studied the feasibility of allowing caching of shared-writeable data using the distributed directory protocol to support cache coherence. The directory protocol uses cache invalidate. The directory keeps a summary on each memory line as to which clusters are caching it. The memory line is either in the uncached, shared, or dirty state.

MUNIN – John Carter studied DSM:
o Software release consistency
o multiple consistency protocols
o write-shared protocols
o update with timeout mechanism

One node doing a lot of writes works better with an invalidate protocol, because once the other nodes are invalidated they don't have to be invalidated again. Invalidation protocols have a much better worst-case scenario: when update protocols go bad they can go really bad, since updates can cause cataclysmic traffic even if everyone else is not using the data.

Lazy Release

At <RELEASE> the owner is not required to make everything consistent; we only need to do this at the <ACQUIRE> point. This is implemented with vector timestamps (as in TreadMarks). When a lock arrives, it carries everything that is causally related to the <ACQUIRE>; I compare this with my own vector timestamp. We are free to lazily push out changes at <RELEASE> or thereafter: "here is my timestamp, send me any changes you have that have a later vector timestamp." If there are multiple nodes that did a <RELEASE>, I have to send a message to all of those <RELEASE> nodes. Vector timestamps are associated with every page in the system.
+ Advantageous for producer and consumer applications.
- Disadvantageous because there are a lot of messages and timestamps to maintain.
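A minimal sketch of the vector-timestamp bookkeeping this implies, assuming a TreadMarks-like design: at an acquire, the acquirer ships its vector timestamp and the last releaser returns write notices for every interval the acquirer has not yet seen. The data layout and function names here are assumptions for illustration.

    # Vector timestamps: proc -> highest interval seen from that proc.
    # write_notices: {proc: {interval: [pages modified in that interval]}}
    def missing_write_notices(releaser_vt, acquirer_vt, write_notices):
        to_send = []
        for proc, last in releaser_vt.items():
            for interval in range(acquirer_vt.get(proc, 0) + 1, last + 1):
                to_send.extend(write_notices[proc][interval])   # pages the acquirer must invalidate
        return to_send

    def merge(vt_a, vt_b):
        # after the acquire, the acquirer knows the pointwise maximum
        return {p: max(vt_a.get(p, 0), vt_b.get(p, 0)) for p in set(vt_a) | set(vt_b)}

    notices = {"P1": {1: ["page3"], 2: ["page3", "page7"]}}
    print(missing_write_notices({"P1": 2}, {"P1": 1}, notices))   # ['page3', 'page7']
    print(merge({"P1": 2}, {"P2": 4}))                            # {'P1': 2, 'P2': 4}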

[Timeline: <ACQ>, writes to X and Y, then <REL>]


At this <REL> the node doesn't have to write the data out of its cache.

<ACQ> X: self-invalidate at the lock acquire, no need for an ACK.

Can be lazily sending out invalidates or updates. Still have to send out messages to check state at the <ACQUIRE>. Defer things to subsequent <ACQUIRE>

Entry

Causal - Sequential is stronger than causal, which is stronger than processor. Causal is not necessarily weaker than Weak, Release, Lazy Release or Entry.

Processor

Example: P1 writes X=1; P2 reads X and then writes Y=1; a third processor can see the new Y while still seeing the old value of X.

Processor consistency does not preserve transitivity or causality; writes from each processor are simply pipelined.

8.3 Strict Consistency

Any read on a data item x returns the value of the most recent (according to absolute time) write on x.

8.4 Sequential Consistency

Any execution order that would be valid if all the code ran interleaved on a single processor.

8.5 Causal Consistency

All processors see writes that are causally related in the same order:

The following is invalid:
P1: W(x)a
P2: R(x)a W(x)b
P3: R(x)b R(x)a    <- this is invalid
P4: R(x)a R(x)b
P5: R(x)c R(x)d
P6: R(x)d R(x)c
P7: W(x)c
P8: W(x)d

Non-causally related data may appear in different order on different processors unlike sequential consistency.

8.6 Processor (PRAM) FIFO Consistency


We see writes from the same processor in order, but writes from different processors may appear in any order. It doesn't guarantee causality.

The above case is valid here:
P1: W(x)a
P2: R(x)a W(x)b
P3: R(x)b R(x)a
P4: R(x)a R(x)b

8.7 Weak Consistency

P1: W(x)a W(x)b S
P2: R(x)a R(x)b S
P3: R(x)b R(x)a S

If we executed an "S" before the instructions in P3, we would be guaranteed to see R(x)b R(x)b. The execution of synchronization variables is sequentially consistent.

Table 8-1: Comparison of Synchronization Variable Consistencies

Weak Consistency
  On entry: finish external writes and finish internal writes (for other processors). Synchronize before entry into the critical section.
  On exit: finish external writes and finish internal writes.

Release Consistency
  On entry: finish external writes. Acquire before entry into the critical section.
  On exit: finish internal writes. Release the lock at exit of the critical section.

Entry Consistency
  On entry: finish external writes to the guarded data. Acquire the lock on the variable or group before entry. Finish exclusive access on the guarded shared data: other processors wanting to use the guarded data must request it from the locking processor via a message to get the latest updates.
  On exit: finish internal writes to the guarded data. Release the lock on the variable or group at exit.

This is a Distributed Shared Object

Directory-Based Consistency Protocol

Essentially there is one home for each datum.

Hardware based

Each byte of the global physical address space is associated with a particular home node.


You cannot just look at the top bits of the physical address. On that node there is memory that has a directory slot, or store, for every block of the global memory space it is home for:

Directory Entry: | Sharing State | Copy Set |
Sharing State:
  Uncached – no one has a copy
  Shared – multiple users have a copy
  Exclusive – only one user has a copy

Sequentially Consistent, Write Invalidate

Write invalidate protocol – request goes out to a node, which has exclusive control, tells node to go to shared and gets the copied data, but the previous processor can no longer change the data. Provides some global order.

What happens when two nodes want to modify the same location? Both send a request to the home node saying "I want to upgrade from shared mode to exclusive mode." Unlike the bus protocol, they do not see each other's message via multicast. Whoever arrives first gets exclusive control. In response the home node has to send a shoot-down message to every other node that has a copy. The other node, which also wanted exclusive access, has to throw away its data and send an ACK saying it has done so; then exclusive access can be granted to the first requester. Now the home node says "I am done processing one request" and processes the request for exclusive control from the other node; the current owner will invalidate its copy and ACK that it is giving up exclusive access.

This guarantees seq. consistent, write invalidate.
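A minimal sketch of the home-node bookkeeping described above (not the DASH hardware itself): one directory entry per block, with requests serialized at the home node, which is what provides the global order. Message handling is reduced to a `send` callback, an assumption for illustration.

    class DirectoryEntry:
        def __init__(self):
            self.state = "UNCACHED"           # UNCACHED, SHARED, or EXCLUSIVE
            self.copy_set = set()             # nodes currently holding a copy

        def read_request(self, node, send):
            if self.state == "EXCLUSIVE":
                (owner,) = self.copy_set
                send(owner, "downgrade-to-shared")    # owner supplies the latest data
            self.copy_set.add(node)
            self.state = "SHARED"
            send(node, "data")

        def write_request(self, node, send):
            # shoot down every other copy (each must ACK) before granting
            # exclusive access; serializing requests here gives the global order
            for other in self.copy_set - {node}:
                send(other, "invalidate")
            self.copy_set = {node}
            self.state = "EXCLUSIVE"
            send(node, "exclusive-granted")

    log = []
    send = lambda node, msg: log.append((node, msg))
    entry = DirectoryEntry()
    entry.read_request(0, send)
    entry.write_request(3, send)
    print(entry.state, entry.copy_set, log)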

If no one needs the cache line, the dirty data can sit there indefinitely until a request comes in to modify the data. If I run out of space I will send the data back to the home node, which effectively ends exclusive mode ("here's the data and its address, and I am evicting it"). If something more recent comes in I will update my cache.

If I am doing a write-back, flushing my write-buffer, you must send me an ACK so I can throw it out of my write buffer.

Software

In software essentially the home node is not fixed, so it can move around with the data. The transitive nature takes over, so there is a path to find out where the owner has gone.

Conflict

If two data items are on the same cache line, and two processors each want one of them, there is false sharing due to the granularity at which we maintain consistency:

P1 wants x, P2 wants y:   | ... | x | ... | y | ... |   <- one cache line

Every time someone wants to write, they get exclusive mode on the entire cache line. We end up repeating requests for data if one is writing and the other is reading.

If both want to write, we get a ping-pong effect in which each processor is making the other throw away its copy. There is a 4-way handshake: I want the data, throw the data away, ok, ok.

Must guarantee there is a global order on everything so there is not a lot of room for optimization. Great for System program, which knows that everything is consistent. Global ordering implies serialization (aggregation of instructions within a transaction) in addition to sequential ordering (of transactions).

Synchronization alternatives


[Timeline: data accesses ... Barrier ... Enter Critical Section [FENCE], writes to x and y, ..., Exit C.S.]

I don’t expect anyone to be looking at the intermediate results as I generate them. Only at “Exit C.S.” do all the operations become available atomically.

One can divide shared memory accesses into two types (besides reads/writes, or data order): accesses to data and accesses to synchronization variables. There are a lot of races on the synchronization variables; race conditions on the data are far less common. The reason there is a synchronization variable such as a Barrier is to prevent races on data.

Weak consistency and release consistency both exploit this observation. Synchronization tells us when order matters. We assume that Barrier, or Enter C.S. and Exit C.S., are the only important points; at Exit, I make sure everything gets pushed through.

 Access to sync data is seq. consistent
 No access to a sync variable is allowed to be "performed" until all previous normal accesses are "performed"
 No data access (R or W) is allowed to be performed until all preceding syncs are "performed"

In weak consistency, P3 can send a request for exclusive access and start modifying before it gets ACK from P2 when it enters critical section?

The underlying memory can be changing from the critical section in another processor's memory. The other person needs to do a SYNC to wait for the EXIT, so that all the critical data has been modified. If the critical section only had writes, this wouldn't buy much in a write-invalidate protocol. In a write-update protocol things work much better and we eliminate the ping-pong problem.

+ Overall we can eliminate latency.
- If we don't sync at the right time, things won't work. Often we don't think about locking the data when we are reading, but here we need to do this, otherwise we will see partially modified results (writers locking but readers not).
In weak consistency we can queue changes made by everyone else, but not respond to changes.

8.8 Release Consistency

What really matters is when I exit, so entering is not a problem: acquire on enter, and the release is not blocking. This cuts the overhead roughly in half. Now we don't have to wait for all the ACKs to come back before ENTERing a critical section, and still no one else is reading the data while I am writing. Release consistency is weaker than weak consistency: while I am doing the flush I can go further ahead, and I don't have to wait to enter the critical section until everything before has finished. So release consistency weakens weak consistency further.

 Access to sync data is seq. consistent
 No RELEASE can be "performed" until all previous normal accesses by the process are "performed"
 No data access (R or W) is allowed to be performed until all preceding ACQUIREs are "performed"

Before entering a critical section I do an ACQUIRE lock on the synchronization variable. This causes all writes, but not reads, in other processors to complete. Upon exiting a critical section, I do a RELEASE lock, which causes all writes in my process to complete (become visible to other processors), but I do not have to wait for reads in other processes to complete.

Overall I can enter a critical section faster because I only need to synchronize on external writes and I can exit faster because I only have to flush my writes. Release consistency is weaker than Weak consistency because in Weak consistency I cannot enter the critical section until all my processor writes finish as well and I cannot exit until I have synchronized on other process writes as well.

8.9 Entry Consistency

P1: Acq(Lx) W(x)a Acq(Ly) W(y)b Rel(Lx) Rel(Ly)
P2: Acq(Lx) R(x)a R(y)NIL
P3: Acq(Ly) R(y)b

Without acquiring locks on shared variables there is no guarantee that a processor will read anything but NIL.

Works well with simple objects like C++ objects: before I can enter a method on an object, I need to make sure the data for the object is up to date. Locks and mutexes are associated with data. For Java object accesses, I am only making sure that I have up-to-date information on my object: a request goes out to the world for up-to-date values on my object.

- if we don't lock a piece of data, I won't see an up-to-date value; I can't spin on global values
- acquiring a lock is costly, and acquiring a lock just for values that are not changing is costly
- all accesses have to be done under synchronization for this to work
- real implementations put this directly in methods

If processors P1–P4 are all using the same array, we can't decompose the locks since locking is object-based.

Sub-arrays try to work around this. Programming effort greater than just using RELEASE on four parts of the array.
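A minimal sketch of the entry-consistency idea under the assumptions above: the guarded data is brought up to date only when its lock is acquired, and nothing is pushed at release. The per-node `local_copy` dict stands in for each processor's memory and is purely illustrative.

    local_copy = {"P1": None, "P2": None}      # each node's cached view of the object

    class GuardedObject:
        def __init__(self):
            self.holder = None                 # node currently holding the lock

        def acquire(self, node):
            if self.holder is not None and self.holder != node:
                # pull the latest value from the last holder, and nothing else
                local_copy[node] = local_copy[self.holder]
            self.holder = node

        def release(self, node):
            pass                               # nothing is pushed at release time

    lock_x = GuardedObject()
    lock_x.acquire("P1"); local_copy["P1"] = "a"; lock_x.release("P1")
    lock_x.acquire("P2")                       # only now does P2 see the update
    print(local_copy["P2"])                    # 'a'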

8.10 Open-to-close consistency

OPEN – acquire an up-to-date snapshot of the data.
  (between OPEN and CLOSE: ignore the rest of the world, and the rest of the world doesn't see my changes until a CLOSE)
CLOSE – guarantee that subsequent OPENs see my changes (if any)

If I write a file and then read it on the same processor, I will see my change. Over a network this doesn't work too well because there is no caching.


AFS – I am going to ship the entire file to you when you OPEN it. If someone else OPENs the file for write, the server does not remember who it sent the file to: AFS doesn't maintain state. There is no sequential consistency, unlike the Unix or Sprite file systems. Here the last one to CLOSE the file wins.

CODA – does whole file caching, now I have copies of the shared file on my local disk. Instead some s/w on my laptop is using VFS to intercept the calls to work on local copies. When I reconnect to the server, CODA keeps track of fileserver changes – imposes “state” requirement on a file server. If someone else wrote the file, it updates the file to the laptop user. If there is a write-write conflict, it flags everyone and this has to be fixed manually. This is a CVS like consistency.

NFS – stateless, tries sequential consistency, but delays when everyone else sees changes for 30 seconds. This is time-based consistency. This is done for performance. Clients poll to see if what they have is up to date. A lot of web pages work the same way.

9 Client Centric Consistency Models

These are single process consistency models.

9.1 Eventual Consistency

“Data stores that are eventually consistent have the property that in the absence of updates, all replicas converge toward identical copies of each other.”1 This flexible scenario causes problems where updates are not necessarily seen in subsequent reads as in web page changes perhaps.

9.2 Monotonic Reads

"If a process reads the value of a data item x, any successive read operation on x by that process will always return that same value or a more recent value." The author gives the example that if one reads email at one time in one location and then at a future time in a different location, the old mail will still be in the mailbox even if there isn't newer mail; and if there is newer mail, the old mail should still be there.

9.3 Monotonic Writes

"A write operation by a process on a data item x is completed before any successive write operation on x by the same process." This is useful for maintaining consistency of a library, where we want the most up-to-date version to include all prior changes.
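A minimal sketch of how a client-side session could enforce the monotonic-read and monotonic-write guarantees with per-item version numbers; the Replica interface is an assumption for illustration, not Bayou's actual API.

    class Replica:
        """A toy replica: item -> (version, value).  Illustrative only."""
        def __init__(self):
            self.store = {}

        def get(self, item):
            return self.store.get(item, (0, None))

        def version(self, item):
            return self.get(item)[0]

        def put(self, item, value):
            v = self.version(item) + 1
            self.store[item] = (v, value)
            return v

    class Session:
        """Tracks what this client session has read and written."""
        def __init__(self):
            self.read_floor = {}    # item -> lowest version a later read may return
            self.write_floor = {}   # item -> version produced by our last write

        def read(self, replica, item):
            version, value = replica.get(item)
            floor = max(self.read_floor.get(item, 0), self.write_floor.get(item, 0))
            if version < floor:
                raise RuntimeError("replica too stale for this session; try another")
            self.read_floor[item] = version          # monotonic reads: never go backwards
            return value

        def write(self, replica, item, value):
            if replica.version(item) < self.write_floor.get(item, 0):
                raise RuntimeError("replica is missing this session's earlier writes")
            self.write_floor[item] = replica.put(item, value)   # monotonic writes

    s, r1, r2 = Session(), Replica(), Replica()
    s.write(r1, "mailbox", "msg1")
    print(s.read(r1, "mailbox"))        # 'msg1'
    try:
        s.read(r2, "mailbox")           # r2 has not seen the write yet
    except RuntimeError as e:
        print("blocked:", e)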

9.4 Read Your Writes

The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by that same process. For example, after I update my own web page I should not be served the stale version from a cache.

10 Implementation issues of Consistency

(P. 326 Chapter 6.4)

Consistency Models – semantic guarantees

1 Distributed Systems Principles and Paradigms, Andrew Tanenbaum, page 318


SEQUENTIAL – strong, like on a single processor
WEAK
RELEASE
LAZY RELEASE
ENTRY
CAUSAL – like sequential, but distinguishes between normal accesses and causally related ones
PROCESSOR (PRAM)

Implementation Strategy

10.1 PUSH (server-initiated) vs. PULL (client-initiated)

PUSHing data to different localities is good for distributed work and storage requirements to multiple servers. Once someone has created enough work to justify me updating or invalidating him, then I let him make a copy. Competitive algorithm.

PULL – if I am uncertain I have to poll the server if there are any changes.

LEASE – want to be able to grab exclusive access to data. If a client with an exclusive lock fails, how does the server recover? A lease is a lock with a timeout, e.g. 24 hours; I am not allowed to reflect changes unless I re-lock the data. With Push the server owns the data, with Pull the client owns it; Lease is in between. How to recover if the server goes down?

10.2 UPDATE vs. INVALIDATE

(One can also mix these for different memory regions on the same servers.) The update model pushes changes to all replicas.

Invalidate shoots down all cached copies. Invalidates better on casual use of data. One node performing lots of writes and no one else is doing anything, invalidates better. We only need to invalidate once and then we wait for the client to request updated data.

On the other hand if 10 nodes are all doing R/Ws there is performance benefits to update since the client doesn’t need to send a message to get the invalidated data.

In all shared memory and parallel hardware architectures, systems prefer the invalidate protocol.

UPDATE+ Update is essential in mirroring architecture where we must update to preserve fault tolerance. This guarantees duplication of data in case a node should go down.

10.3 MASTER/SLAVE vs. CACHING

Master/Slave – distribution between servers and clients: primary copies on servers, secondary copies on clients. Peer-to-Peer – the primary location of a copy can migrate around. Also STATEFUL vs. STATELESS.

Example: a file server and a bunch of clients. One node is designated master. All operations logically go through the master copy – this is the simplest version. All writes are sent to the master, ordered by the master, and responses are sent out to each source. In the simplest version we turn off caching so reads also go through the master. Sequential consistency comes free in this model where everything is happening on a single node.
- the master can be a bottleneck
- single point of failure


- latency for roundtrip transfer of data, even if master is unloaded, instead locality avoids network latency

The first improvement is to allow other nodes to cache data – client cache. In M/S updates only occur with writes to the MASTER.

 If the file is too big for the cache, use an intelligent caching protocol – LRU (least recently used) blocks
 Caching the file on a local disk just to get it off of the server
 Write-through caches – don't return from a write until the master responds and reflects the changes to everyone else, either by update or invalidate. This requires server state on who has the same data. Extra roundtrips and acknowledgements to maintain sequential consistency.
 Writes not sent through right away: exclusive-mode for a write state machine. If we don't want to reflect writes immediately, for performance we can go with a weaker consistency model. Example: an OPEN–CLOSE policy with the last writer winning.

+ Caching gives a limited amount of fault tolerance if files are only READ from the cache.
+ Read performance goes up tremendously. Reads tend to dominate writes in most applications, from 3 to 1 up to 10 to 1.
+ Good for disconnected operation: clients disconnect from the network, and Master-Slave has a good way to come back into operation when we reconnect to the Master – the global file system gets up to date on reconnection by pushing all changes through to the Master. Very amenable to disconnected operation.
- Would not work for a transaction system.
± If the Master goes down it will have to rebuild STATE from the clients when it comes back up.

10.4 PEER-to-PEER

Like Freenet or KaZaA. Freenet – data is sent around to whoever wants it. One of the nodes with the data is designated master, and only the owner is allowed to modify the data. Everyone has directory and state information; if anyone wants the data they will have to request it. Send a write request to the master and wait for ACKs to come back confirming that INVALIDATEs have been sent to everyone else. The former owner invalidates itself. If the INV doesn't reach the new owner, it will trace the data back through previous owners. In a tree, we will only know about a subset of the network. TARJAN PATH COMPRESSION.
+ A node can fail; after a request comes in, a timeout permits reassigning the master
+ In P-to-P any replica is ready to supply the data, while in M/S only the Master supplies it
+ Invalidates propagate through the tree

10.5 MIRRORING vs. CACHING

With mirroring we have replicas so that if one copy fails I can get to other copies (fault tolerance), and also to spread the load a bit. Caching puts copies nearer to where the data is being used, for performance improvement.

Caching - Trying to improve performance with a Proxy cache for local nodes. Cached copies can be dropped or thrown away at anytime. We can drop or throw replicas away silently.

Mirroring – copies are not there just for performance reasons:
 Want to maintain 'N' copies of the data for robustness or scalability
 Can't drop copies; need an update protocol for dealing with conflicts between mirrors
 Designate one copy as the primary copy; alternatively, any change gets sent to everyone. This leads to secondary copies.
 Some mirrors can be designated a Master for different files, turning this into M/S. Usually the # of mirrors is fixed by policy; typically the # of mirrors is not chosen dynamically.
+ Mirror FTP sites for performance and fault tolerance
+ Initial motivation was robustness
+ Later, local copies improved performance and cost of access
+ Eventual consistency can push updates around to mirrors

We can use both Caching and Mirroring at the same time.


10.6 QUORUM protocols – for propagation update messages

We want to guarantee that if I do a write and someone does a subsequent read, they will see the changes (an alternative to invalidate and update). Useful in mirroring with a static number of copies. Pick a number of nodes and define it to be the read quorum, and another number of nodes as the write quorum: the quorum is how many nodes I must contact to read or write the data. My write quorum could be 1 and my read quorum N, or vice versa. A simple choice is to make each quorum N/2 + 1; more generally, as long as the read and write quorum sizes sum to more than N, any read quorum overlaps any write quorum in at least one node. For example, with five copies, R can be 4 and W can be 2, or R can be 2 and W can be 4.
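A minimal sketch of quorum reads and writes under these rules, assuming version-numbered copies; the replica layout and the particular R, W, N values are illustrative choices satisfying R + W > N.

    import random

    N, R, W = 5, 3, 3                      # R + W > N, so read and write quorums overlap
    replicas = [{"version": 0, "value": None} for _ in range(N)]

    def write(value):
        quorum = random.sample(range(N), W)
        # learn the highest version in the write quorum, then install version + 1
        version = max(replicas[i]["version"] for i in quorum) + 1
        for i in quorum:
            replicas[i] = {"version": version, "value": value}

    def read():
        quorum = random.sample(range(N), R)
        # overlap with the last write quorum guarantees we see its version
        latest = max(quorum, key=lambda i: replicas[i]["version"])
        return replicas[latest]["value"]

    write("x=1")
    assert read() == "x=1"                 # the overlapping node carries the new value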

10.7 EPIDEMIC protocols

10.8 GOSSIP protocols

10.9 Replica Location

Hierarchy from smallest number to largest:
 Permanent replicas
 Server-initiated replicas
 Client-initiated replicas
 Clients

10.9.1 Permanent Replicas

Typically there is a small number of permanent replicas. Mirroring copies a web site to a limited number of servers, called mirror sites.

10.9.2 Server-initiated replicas

The server owning the data store may initiate additional replicas to improve performance. Servers create push caches to handle occasionally bursty traffic requirements. A server could dynamically create replicas of, say, web pages close to a server supporting a group of clients with that need.
- the hard part is deciding where and when replicas should be created or deleted

An algorithm to support server-initiated replicas

1. The server tracks access counts per file and the location of clients accessing the file
2. The server knows which other servers are closest to those clients
3. If the number of requests for a file drops below a threshold, the server deletes the file
4. We ensure that at least one copy remains
5. A server can reevaluate placement of files and do migration or replication

This is a good approach for web page distribution.
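A minimal sketch of the policy in the list above; the thresholds, the per-region access counters, and the class names are assumptions for illustration.

    DELETE_THRESHOLD, REPLICATE_THRESHOLD = 5, 50    # requests per evaluation period

    class ReplicaManager:
        def __init__(self):
            self.counts = {}            # (file, nearby_server) -> access count
            self.replicas = {}          # file -> set of servers holding a copy

        def record_access(self, file, nearby_server):
            key = (file, nearby_server)
            self.counts[key] = self.counts.get(key, 0) + 1

        def reevaluate(self, file, home_server):
            holders = self.replicas.setdefault(file, {home_server})
            for (f, server), count in list(self.counts.items()):
                if f != file:
                    continue
                if count > REPLICATE_THRESHOLD:
                    holders.add(server)                  # replicate near the demand
                elif count < DELETE_THRESHOLD and server in holders and len(holders) > 1:
                    holders.discard(server)              # but always keep at least one copy
            return holders

    mgr = ReplicaManager()
    for _ in range(60):
        mgr.record_access("index.html", "server-eu")
    print(mgr.reevaluate("index.html", home_server="server-us"))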


10.9.3 Client Initiated Replicas

Client caches. Multiple clients can share caches.

10.10 Update Propagation

Client caches and servers can initiate updates of its copies to ensure consistency. Invalidation protocols use less network bandwidth than update protocols. These work better when there are many update operations compared to read operations (p.331).

Push-based protocols
- have to keep track of all the clients
- don't work on stateless servers
+ multicasting can help here

Pull-based protocols
- the client has to poll the server each time it thinks it needs an up-to-date copy

Lease protocol: the server says it will push updates to a client for a certain time; afterwards the client polls for an update, which renews the lease.
+ If the server feels overloaded it can shorten the lease times to offload itself faster from pushing updates
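A minimal sketch of the lease bookkeeping on the server side, assuming a fixed lease length; the class and method names are illustrative, not a real file-server API.

    import time

    LEASE_SECONDS = 30          # assumed lease length; client must renew before it expires

    class LeaseTable:
        def __init__(self):
            self.leases = {}                       # client -> expiry time

        def grant(self, client):
            self.leases[client] = time.time() + LEASE_SECONDS
            return LEASE_SECONDS

        def holders(self):
            now = time.time()
            # expired leases need no callback: a crashed client is simply forgotten
            return [c for c, exp in self.leases.items() if exp > now]

    table = LeaseTable()
    table.grant("clientA")
    print(table.holders())      # push invalidations/updates only to live lease holders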

11 Papers

Title page with the name and author of the article, the date, and the class name; maybe choose a nice font.

Discussion Thoughts

White on blue looks good.

11.1 Mobile Agents

Agents are autonomous programs.

 Mobile agents move dynamically
 They make their own decisions about when to move
 They support heterogeneity
 They can interact with each other

ARPL routing protocol – any soldier wherever you are can put a query into the network.

Strengths
 Conservation of bandwidth – when retrieving multiple documents and running multiple queries, agents can travel to the end location and do the queries there, saving retrieval time
 Reduction in total computing time – RPC does a good job in fast networks; for slower networks agent benefits increase by only passing relevant information back
 Reduction in latency – intermediate results are not sent across the wire. RPC is faster when result sets are small; mobile agents can move closer to the producer of data depending on network conditions
 Disconnected operation – decide which links are slow and dynamically move to the other side. Most advantageous in fully wireless networks with much physical movement of devices
 Mobile devices have limited storage; new applications can be installed dynamically
 Persistent queries – always there on the client to be accessed


Load balancing – static vs. dynamic (workload re-allocated during execution), can move between nodes, heterogeneous environments, carry all application specific code with them. Due to overhead not suitable if goal is computation speed.

Dynamic deployment – agents can install themselves on other machines; security is a concern, and Java would be a better choice because of its "sandbox" environment (can't harm anything).

Reasons why not to use Mobile Agents
 No applications specifically require MAs
 Security
 Complexity of deployment (decision making, etc.)
 Rich server query languages are already available that can cause processing on the client
 Powerful mobile devices

Remaining concerns: Scalability, Reliability, Security

11.2 Seti@Home

Server side

Maintains database of all current units listed as “waiting”, “in progress”, or “done”

Client side
 Connects to the server to transfer a single work unit
 Processes the work unit nicely, using low-load periods to steal cycles
 When completed, reconnects to the server to return the current unit and request another

Seti@Home is client-server:
 Client software is custom written for each platform (over 50 supported)
 Client software must be manually downloaded and installed
 Client software must be manually upgraded to new version releases, which obsolete older versions

Why does it work?
 The data computed is independent
 The ratio of computation time to setup and network overhead is high
 Very little user maintenance

Where is it going? Distributed Java
o One-time install of client s/w
o Sandboxed security model of the JVM (like windows running under windows, so we can't damage anything)
o Automagical class file distribution
o Class distribution allows "any" problem to be implemented in this framework

Problems
 Security – the old adage of "security through obscurity"
 Total bandwidth – only the computation is distributed; there is a bottleneck on the server to analyze results
 Security – if one is running someone else's code, that code can do anything and access anything on the machine


11.3 Worms

The worm used Internet daemons and would replicate extra information until the disk became full. The worm tried to dump data into the spool directory, which was full, so the homogeneity prevented it from adding more info and the worm would crash. A denial of service (DOS) attack. Themes: security and privacy.

11.4 End-To-End Arguments in System Design

Purpose: guide the placement of functionality among parts of a distributed system
Problem: where does a function belong?
Assertion: some things can only be done completely and correctly at the application level

Ex.1 File Transfer
Failures caused by:
 errors caused by disk s/w, or mistakes in copying and buffering
 the processor or memory may have a transient fault
 the network can drop bits
 either host may crash at any time

Solution: end-to-end checksum/CRC

Does TCP help? We still need the end-to-end check.

Ex.2 Gateway ("flake-way") in the MIT network
One of their computers was flipping bytes, about one in every million. An end-to-end check would have caught the problem.

Security/Data Encryption – done end-to-end I handle my own keys, I am encrypting and decrypting, and I have authenticity: the person I am talking to is the intended one.

Duplicate suppression – end-to-end is necessary for correctness
FIFO delivery ordering – didn't get to this
Transaction management – acknowledgement for all packets; low-level ACKs would double the overhead, four messages for every request and reply instead of two

Managing the Tradeoffs – everything at high level means everyone has to build-in these features. Cost vs. benefit at each level. All high level leads to lots of reimplementing. All low level checks add costs for everyone, whether they want it or not.

Identifying the Ends – Argument has different results with different end points

A voice recording on disk should get perfect data, because an error will affect all replays. For live conversation, humans can handle the retry themselves.

11.5 Energy-Efficient ZebraNet

Sensor – a device that can sense the environment, compute on data, and communicate. 4–100 MHz CPU, 512 KB of memory, limited communication radius.


Node capability – discover the current location and measure the distance among nodes in an energy-efficient way.

This is an area of research: we need practically deployable devices, and we need platforms for building sensor systems, like TinyOS.

ZebraNet goals
 dynamic sensor networks
 GPS position recording, positions every 3 min
 operation for 1 year

Engineering design issues
 weight is inversely proportional to (battery) energy
 transmission range depends on energy
 storage – efficiency
 scaling -> an energy/connectivity tradeoff

Data gathering protocols – send data either directly or indirectly to the base station when it is active
o flooding protocol
o history-based protocol: based on probability and observing the zebras
Flood > History

Conclusion
The history-based protocol is better than flooding when considering energy efficiency, due to the redundancy in flooding. But flooding will be better in the case of distance.

11.6 Causally and Totally Ordered Communication

Causally ordered communication:

about how multicast is done in a process group

Totally ordered multicast: stricter protocol, all messages delivered in same order to all processes (p. 255)

CATOCS – atomic message delivery, stable – message reached all processors. Has only incidental ordering.

Prescriptive ordering – explicitly ordering constraints (by state level)

“Can’t say for sure” – hidden channel examples: changes to a shared database, real-time data
“Can’t say together” – can’t say that operations are happening together; this is a serious drawback
“Can’t say the whole story” – causal-based multicast does not tell the whole story. Partial ordering is not strong enough; causal operations on shared data are not preserved.

“Can’t say efficiently” – no efficiency gain over state level techniques

Netnews – expensive to preserve order.

Trading application – the claim is that causal multicast can deliver prices to traders.

Global Predicate Evaluation: deadlock detection, RPC deadlock, orphan detection

Transaction – CATOCS can’t enforce serializability for group operations (keeping things together within a single transaction to maintain proper sequential execution)


Replicated data consistency – CATOCS improves performance through asynchrony, trading off concurrency of updates to the replicated data. Example: a transactional file system using a read-any/write-all-available protocol.

Distributed R/T – the hidden communication channel problem: causal communication via monitored parameters escapes the message-passing system. Message delay caused by false causality is a problem – it can affect response time.

Scalability issue – buffering of messages does not scale well; buffering can be reduced only by introducing transmission delays.

Conclusion
Causal multicast is not adequate by itself to meet application-level semantic consistency. Better alternatives: state-level techniques (state-level clocks).

Need heartbeats to maintain group membership.

11.7 Disconnected Operation in the CODA File System

Work while file servers are inaccessible, through use of caching
Transparent operation through optimistic conflict resolution
Client-server model
Naming transparency and location transparency – every user sees a file in the same place
High availability – server replication (VSG – volume storage group vs. AVSG – available VSG)
Whole file caching

Optimistic consistency – open-to-close file sessions. Cache coherence using callbacks: the client registers a callback function, and the server sends a message to the client whenever anyone updates the file on the server.

Network partitions handled using version vectors.

Read-cache miss on open

Write – update – multicast RPC: all three servers will update their copy.

Venus states:
 Hoarding
 Emulation – disconnected from the server; all updates are logged in a per-volume replay log
 Reintegration – the log is parsed and files are locked; validation: conflict detection, disk-space and integrity checks; fetching: updating files from the client; commit: locks released

Conflict handling – an unresolved conflict is represented as a dangling symbolic link; application-specific resolvers (ASRs) are executed at the clients.

Is write-all and read-all-status good? When different replicas are far away, the latency seen from the client differs a lot. Emulation: the replay log can fill memory.

11.8 Bayou Flexible Propagation

High availability with fine-grain control. Lots of replicas with free independent update. User can specify idea of conflict. Server will try to keep eventual consistency.

Anti-entropy is based on the theory of epidemic algorithms – randomly infect another person by selecting his server. Through peer-to-peer reconciliation protocols, 'writes' will fully propagate with high probability.


Ex) with k = 3, only less than 2% of servers remain susceptible (don't know the rumor, which corresponds to the writes). The difference between CODA and BAYOU is that BAYOU has a database.

Bayou anti-entropy algorithm:
 Request information from R (R.V, R.CSN)
 Check the write-log truncation point; transfer a full db if needed log entries were already deleted
 Send anything that R does not yet know about, in two parts: committed writes and tentative writes
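A minimal sketch of one direction of this exchange, assuming writes are tagged with a per-server stamp and each server keeps a version vector; the data layout is an assumption, and the CSN/commit handling is omitted.

    def anti_entropy(sender_log, sender_vv, receiver_vv):
        """sender_log: list of writes, each {"server": s, "stamp": n, ...};
        a version vector maps server -> highest stamp seen from that server."""
        to_send = [w for w in sender_log
                   if w["stamp"] > receiver_vv.get(w["server"], 0)]
        # the receiver applies the writes and merges the version vectors
        merged = dict(receiver_vv)
        for s, n in sender_vv.items():
            merged[s] = max(merged.get(s, 0), n)
        return to_send, merged

    log = [{"server": "A", "stamp": 1, "op": "add meeting"},
           {"server": "A", "stamp": 2, "op": "move meeting"}]
    writes, new_vv = anti_entropy(log, {"A": 2}, {"A": 1})
    print(writes)       # only the stamp-2 write is propagated
    print(new_vv)       # {'A': 2}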

 Extension of anti-entropy through transportable media
 Per-session guarantees of eventual consistency on some database: monotonic reads, read your writes, monotonic writes, writes follow reads
 Lightweight server creation and retirement

Performance: an anti-entropy session only propagates writes unknown to the receiver. The bulk of anti-entropy execution time is spent on the network and other overheads.

Table 11-2: Bayou vs. CODA

            Bayou             CODA
style       peer-to-peer      master-slave (server-client)
method      anti-entropy      multicast
protocol    primary commit    replica write

Eventual consistency exists in routers. Group calendar management – eventuality based on connectivity if laptops are not connected to the network. Ad-hoc global networks are looking at this though Bayou is still just a research project.

Transferring entire database is not scalable. Bayou makes you keep the entire database at the client. This is not practical for a PDA. Can’t be used for a transaction database but instead a directory database.

11.9 Summary Cache: A Scalable Cache Sharing Protocol

Cache sharing increases the benefit of proxy caches: a bunch of proxies work together to reduce cache misses. ICP: on a cache miss, the client sends to the first proxy, which multicasts the miss to everyone else in the group if it also misses.

Bloom filters are used to store the summaries: take a vector of bits and apply hash functions to the URL to set/test positions in the table, with four bits per location. Inter-proxy messages are reduced by a factor of 24-60, network BW by 50%, with 35%+ CPU overhead. Bloom filters control the number of false positives (false hits), never false misses: a "no" definitely means no in a Bloom filter; a "yes" is questionable.
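A minimal Bloom-filter sketch of the summary idea: "no" is definite, "yes" may be a false hit. The filter size, number of hash functions, and hashing scheme are assumptions, not the parameters from the Summary Cache paper.

    import hashlib

    M, K = 8192, 4                      # bits in the filter, number of hash functions

    def _positions(url):
        digest = hashlib.sha256(url.encode()).digest()
        return [int.from_bytes(digest[4*i:4*i+4], "big") % M for i in range(K)]

    class BloomFilter:
        def __init__(self):
            self.bits = bytearray(M)    # one byte per bit, for simplicity

        def add(self, url):
            for p in _positions(url):
                self.bits[p] = 1

        def maybe_contains(self, url):
            # all K bits set -> "probably cached"; any bit clear -> definitely not
            return all(self.bits[p] for p in _positions(url))

    summary = BloomFilter()
    summary.add("http://example.com/a.html")
    print(summary.maybe_contains("http://example.com/a.html"))   # True
    print(summary.maybe_contains("http://example.com/b.html"))   # False (usually)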

11.10 Practical Byzantine Fault Tolerance

 A new replication algorithm
 Practical: works in asynchronous environments like the internet, where messages may be delayed, duplicated, or destroyed
 Only 3% worse than a standard (unreplicated) service
 Tolerates at most floor((n-1)/3) faulty nodes

Assumptions
 Nodes are connected by an unreliable network
 Byzantine failure model: faulty nodes behave arbitrarily
 Independent node failures
 The adversary cannot delay correct nodes indefinitely
 The adversary cannot produce a valid signature for a correct node or find two messages with the same digest

Byzantine Replica Pool Size Derivation
 There may be f faulty replicas that don't respond.
 There may be f good replicas that don't respond because of delay.
 Hence we may receive f faulty responses and have f non-responses, and thus must still have f+1 good responses in the worst case.
 Hence n = f + f + (f+1) = 3f+1. R is the minimum set of replicas: |R| = 3f+1, where f is the max # of faulty replicas.

Algorithm
1. The client sends a request to invoke a service operation to the primary
2. The primary multicasts the request to the backups
3. Replicas execute the request and send a reply to the client
4. The client waits for f+1 replies from different replicas with the same result; this is the result of the operation

If primary node fails, we change the view and the next node becomes the primary.
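A minimal sketch of just the client-side acceptance rule (step 4 above): a result is accepted once f+1 distinct replicas report the same value, since at most f of them can be faulty. The reply format here is an assumption, not the actual PBFT message format.

    from collections import Counter

    def accept_result(replies, f):
        """replies: list of (replica_id, result) pairs collected so far."""
        counts = Counter(result for _replica, result in set(replies))
        for result, votes in counts.items():
            if votes >= f + 1:
                return result        # enough matching replies from distinct replicas
        return None                  # keep waiting (or retransmit / trigger a view change)

    replies = [(0, "ok:42"), (1, "ok:42"), (3, "bad")]
    print(accept_result(replies, f=1))   # "ok:42"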

11.11 HA-NFS

Separates the NFS reliability problem into three sub-problems: server, disk, and network reliability.
Server failure toleration – dual-ported disks accessible to two servers, record information about volatile state on disk, periodic liveness-checking, don't maintain the other server's information
Disk failure toleration – mirror files on different disks; all copies of the same file are controlled by the same server
Network failure toleration – optional replication of network components – a 10M Ethernet and a 4M token ring network; the network load is distributed over the networks

Architecture – normal operation, take-over, reintegration. Normal operation – the servers monitor each other's liveness; if there is no reply, ping over the network, else send the request again over the SCSI bus; conservative detection to avoid race conditions: NFS, then network, then SCSI.

Take-over – impersonate the failed server: change the secondary network interface to the primary address of the failed server. A shortcoming of using the ARP protocol is that each client must implement the protocol correctly.

Reintegration – uses the secondary interface to send a reintegration request, reclaims its SCSI buses and disks, runs the log, and reconstructs the reply cache.

Network failure – a daemon on the client observes the status of each network; servers have their primary interfaces on different networks, which balances the load.

The performance gain is misleading because of logging, which is used in HA-NFS but not in the NFS it is compared against.

11.12 Recovery

Errors are inevitable: hardware, software, and people errors are intrinsic to the computing landscape, not solvable problems. Perfect systems and operators do not exist, especially in large networked environments. Operator error is consistently the leading cause of downtime.


Move away from performance and concentrate on availability. Focus on total cost of ownership: hw/sw costs are dwindling due to Moore's law and commodity PCs, while operations/sysadmin cost is 3.5 to 18x the hw/sw cost.

Emergency systems need to be tested online. Restart trees capture components' common restart requirements in a hierarchy; restart as little of the tree as possible to recover quickly. Reversible systems with Undo: rewind goes back to the pre-fault state, repair allows correction, and restart begins again.

Email is a not-rewindable problem; choose to notify the user of the inconsistent state. Defense in depth – independent modules provide backup defenses that improve reliability. Bricks: independent mechanisms to monitor nodes, isolate failed nodes, and insert faults. Incorporate failure injection into the HW architecture.

Challenges for ROC – constructing dependable networked services; designing for operators instead of just users; finer granularity of failure (instead of reboots, a finer-grained view of what is going wrong); performance monitoring of misbehaving modules.

11.13 LARD

WRR (weighted round-robin) doesn't account for locality. LARD forwards requests for the same target to the same back-end node: check whether some back-end has already served this target, otherwise assign it to the least-loaded node; if the chosen node's load is too high, send the request to another back-end server.

Tlow – the load below which a back-end is likely to have idle capacity (about 40% utilization on the machine)
Thigh – the load above which a node is likely to cause substantial delay in serving requests
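A minimal sketch of LARD-style dispatch using the Tlow/Thigh idea above; the reassignment test is a simplified variant, and the load units and server names are assumptions.

    T_LOW, T_HIGH = 25, 65
    load = {"s1": 0, "s2": 0, "s3": 0}          # back-end -> active connections
    assignment = {}                              # target (URL) -> back-end serving it

    def dispatch(target):
        node = assignment.get(target)
        if node is None:
            node = min(load, key=load.get)       # new target: least-loaded node
        elif load[node] > 2 * T_HIGH or (load[node] > T_HIGH and min(load.values()) < T_LOW):
            node = min(load, key=load.get)       # node overloaded: move the target
        assignment[target] = node
        load[node] += 1                          # connection handed off to 'node'
        return node

    print(dispatch("/index.html"))
    print(dispatch("/index.html"))               # same node again, for cache locality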

Replication (LARD/R) – keep a set of servers for each target; with excess load, add another server to the set.

WRR achieves good load balancing but has the worst throughput and the highest cache miss ratio since it ignores locality. Locality-based (LB) – the cache miss ratio goes down since the same node has served previous requests, but it doesn't distribute well and has bad load balancing, though it is better than WRR in these cases.

LARD (revised) has the best throughput on both traces. When the trace has less locality (requiring a larger cache size), LARD performs 2-4 times better than WRR.

LARD/R only slightly exceeds LARD on the original trace, due to the lack of high-frequency targets in the traces.

LB/GC does not provide a significant improvement over LB. Hashing scheme of LB provides an even partitioning of the server’s working set.

LARD throughput improves with CPU speed, whereas WRR's does not if it is already disk-bound.

WRR benefits from multiple disks.

TCP connection handoff – if not dealing with content can just forward, but LARD must check content.

11.14 Energy

In the context of hosting centers, energy management is key. Resource economy:
 Reclaim resources with negative return
 Allocate more resources to profitable services – the greedy part of the algorithm
 Reclaim overcommitted resources


Adjust for highest and best use

Algorithm conserves resource during periods of low demand.

System prototype: the monitoring module is a set of loadable kernel modules on FreeBSD, plus the Executive.

Would like to move to a decentralized executive to eliminate the single point of failure.

11.15 Porcupine

A scalable, highly available electronic mail server.

Data structures
 mailbox fragment
 mailbox fragment list (the user's list of fragments)
 user profile database (names, passwords)

Advantages and tradeoffs
 always possible to deliver or retrieve mail
 dynamic load balancing
 users' mail is replicated
 distribution of mail complicates delivery and retrieval: for nodes new to a user, the user's mailbox fragment list has to be updated, and on retrieval the load increases with the # of nodes

Three-round membership protocol (TRM): a coordinator defines the new membership.

The user map maps user names to their mailbox fragments. The new map is broadcast in the final round of the TRM.

Soft State Reconstruction.

Replication properties
 update anywhere
 eventual consistency
 an update totally overwrites previous copies

12 Distributed File Systems/FINAL (Chapter 10)

CODA
Bayou – mobile and disconnected system, consistent w/o a master copy
Summary caching – proxy servers – how to handle massive updates

12.1 NFS

NFS – one of the first client-server file systems. The importance of transparency, via VFS.

CLIENT

Application – Library - Syscall layer - Virtual FS Layer - NFS Client Stub - RPC - Local FS module – disc


SERVER

VFS – NFS Server Stub – RPC- Local FS – disc

Do a read through the local file system, which does a trap. The VFS parcels out the request to the appropriate file system.

12.1.1 Naming

Given a file name, how do I get to the actual file? If the entire name isn't already mapped to a complete inode, the lookup proceeds through caches segment by segment along the path. Part of naming is handled locally and part is handled at the other end of the system through RPC. Location independence – can I move the physical location of the server without telling anyone about it? This is difficult in NFS.

Server:/foo/bar        -  lacks location transparency
/~/porter/user/foo     ±  in-between location transparency, as "porter" suggests a remote location
/home/retrac           +  good location transparency

The client mounts a portion of a server’s directory tree. NFS permits nicely grafting any server sub-tree into any location of its own directory tree. The server must export and the client explicitly mounts that sub-tree. A mount point is a special file like a Unix socket, etc.

The client traverses down as part of its open file request until it gets to its mount point at which point it moves over to the server for continued processing. The fileID that is returned will have explicit information about the remote file server so on a subsequent open will not have traverse as much to open the same file.

Instead of the sys-admin explicitly mounting everything that someone could possibly need. Auto-mounting replaced this approach with designated points that if accessed could lookup the point and the rest of the path name to decide on the mount.

After an automount happens once, VFS can go directly to the explicit mount, since the directory now exists under the home directory: /home/retrac, /home/student.

Performance Caching – Sharing Semantics

"Unix semantics" – sequential consistency semantics. SPRITE enables caching at the server: I will keep a copy and the server knows if someone else has the directory or files shared. SPRITE is very stateful, which is costly and complicates recovery.

In NFS 3 the server doesn't maintain any state. The semantics are SESSION SEMANTICS: the server gives out block 10 but doesn't know who it just gave it to. A SESSION just means OPEN/CLOSE semantics: the beginning of a session is the OPEN of a file, and the end of the SESSION is the CLOSE of the file. NFS caches files on a block basis. Original NFS systems had a separate lock manager for handling this, but then lock manager recovery becomes complicated.

Every request coming through from the client has to be completely self-contained. Every request has a fileID: a read carries the fileID, the starting byte to read, and the # of bytes – the server can't remember my last location (the client NFS stub could remember this). The file handle has to be unique over time. There is no open-file table on the NFS 3 server. File handles can't be reused quickly: if someone else deletes that file and creates a new file, we don't want the new file to have the same handle – that would circumvent security checks, besides causing applications to use the wrong file. The fileID mapping has to be resilient and not reuse numbers.
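A minimal sketch of why such requests can be stateless, assuming a handle that carries (fsid, fileid, generation) and a read that names its own offset and count; the structures are illustrative, not the real NFSv3 XDR definitions.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FileHandle:
        fsid: int          # file system id
        fileid: int        # inode number
        generation: int    # bumped when the inode is reused, so stale handles fail

    @dataclass
    class ReadRequest:
        handle: FileHandle
        offset: int        # starting byte; the client, not the server, remembers position
        count: int         # number of bytes to read

    def server_read(inode_table, req):
        inode = inode_table.get((req.handle.fsid, req.handle.fileid))
        if inode is None or inode["generation"] != req.handle.generation:
            return "NFS3ERR_STALE"              # handle refers to a deleted/reused file
        return inode["data"][req.offset:req.offset + req.count]

    inodes = {(1, 42): {"generation": 7, "data": b"hello world"}}
    req = ReadRequest(FileHandle(1, 42, 7), offset=6, count=5)
    print(server_read(inodes, req))             # b'world'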


If there is a failure, most sysadmins configure the machine to lock up until the SERVER returns. A lot of legacy applications require this, though NFS could be configured to return failure codes.

12.2 AFS

AFS – whole file caching

12.3 Sprite

If on a distributed network you want sequentially consistent semantics: Sprite was a stateful server. NFS was stateless.

Serverless distributed file systems:
- xFS
- NASD / Active Disks
- Mango Medly / Khazana KidFS

12.4 xFS

Raw storage, ability to store blocks, metadata management. Need to understand the concept of a directory. Can do reads and writes over the network, client operations (caching). Works on top of machines who have storage distributed throughout the network.

P–D   P–D   P–D    (processors, each with a disk)

Lookup chain: (file, offset) -> directory -> manager map -> imap -> stripe group map -> disk (inode) -> stripe group map -> disk -> data block

12.4.1 Log-Structured File System

The log-structured file system, LFS, is the foundation of xFS. A conventional file system has super-blocks, inodes, and a directory system. Inodes contain all file information, including the location of the physical blocks that make up the files. Defraggers fix fragmentation, but the inode must still track the blocks.

Logically moving a file from one location to another is write-move, write-move; updating the metadata file structure this way is really slow. On top of the hierarchical file system on disk we have two logs:

 Log in kernel memory
 Log on disk

For each disk operation I perform, I append a record to the log: e.g. "starting to do a rename" (like a prepare record), ..., and eventually write one giant disk block of log records.

Background cleaner process is updating the Log on Disk so that one can grab information directly off the disk – this cleans up the log as the data is transferred to a permanent location on disk.

We can do metadata operations without having to write the Inode, and two other separate directory locations. Instead I can write three contiguous memory sections in a log.


The Inode points to first block, 2nd block, 3rd block… As we are doing a file lookup we need to check if there is newer information in the log than there is in disk. In that case we need to gather memory according to the memory log or disk log.

The logging is really useful for metadata operations, but the cleaner sends the real data to disk when time is available. (The metadata keeps the logs smaller since we might not have to write all the data into the log?)

12.4.2 Striped groups

Storage distributed across a group of servers (RAID File System). Want to make sure my striped group is N+1. For every four blocks in the file we have a parity block. We do a modulo 4 and distribute the blocks on each disk with a RAID like parity disk. XOR all of them together and the XOR goes into the parity disk. The whole idea of RAID is to spread files across multiple disk systems. Striping gives both performance and fault tolerance.

+ Need to only add 25% disk space for fault tolerance as opposed to 100% with mirroring.
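A minimal sketch of the N+1 parity idea: four data blocks XOR into one parity block, so any single lost block can be rebuilt from the other four. The block size is an arbitrary assumption.

    BLOCK = 4096

    def parity(blocks):
        p = bytearray(BLOCK)
        for b in blocks:
            for i in range(BLOCK):
                p[i] ^= b[i]          # XOR all blocks together, byte by byte
        return bytes(p)

    data = [bytes([d]) * BLOCK for d in (1, 2, 3, 4)]   # four data blocks
    p = parity(data)

    # lose block 2, then reconstruct it by XORing the survivors with the parity
    recovered = parity([data[0], data[1], data[3], p])
    assert recovered == data[2]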

There is a notion of these in xFS. There is a manager map that lets you recover, from a fileID, the manager process, i.e. the home node for that file. This manager is responsible for what a directory manager does: where on earth is the inode for this block, i.e. fileID -> node.

Imap – maps inode references to their location in the log (p. 631)

Log addresses are triples: (stripe group ID, segment ID, segment offset) for a particular block.

Need to find a particular group, need to know what stripe group, but need to know where in the log is it and then the offset. A request goes out to all servers requesting to feed me your block. Effectively we are spreading the file data across a number of nodes.

For this stripe group my log is located here, for another stripe group my log is elsewhere. More stripe groups permit us to add new nodes on the fly and delete nodes. Also can separate groups according to the performance of nodes that are similar in performance. Striping |u|v|w|x|☼parity| each distributed over four servers and there is one parity server.

RAW STORAGE – (RAID STRIPING) – some disks could be designated for pure raw storage without having caching and requiring metadata management.

METADATA MANAGEMENT – who is in which stripe group, in which stripe group do I want to put data.

CLIENT (CACHING) – where on the state groups is the file data. Take advantage of smart caching. If there is locality between same files being accessed on this pair of machines or another pair, I can check a nearby node if it has the file in cache already. Manager can tell node to go get data from a nearby disk since the manager knows to whom it has just handed out the data. Cooperative caching.

- writes still have to be handled properly
- makes the assumption of going to a single node and bursting data out
- in a SAN environment where the disk is really fast, there is no need to go to a cache over the network
- if one cache is really big, all the neighbors could be going to a single node's cache

12.5 NASD Active Disks

Network attached storage devices


How do netapp devices work? People spend a lot of money on NFS file servers; netapp makes money by bundling a PC with an NFS file server and a nice user interface: hit it with a web interface, do a little bit of setup, and now we have a file server. Now we are not going to buy a big Sun server, just one of these netapps, and everything is ready to go.

We have a network with clients and servers hanging off it. We also have storage devices attached directly to the network.

Server --+                     +-- D
         |  High-capacity      +-- D
Client --+     network         +-- D

Server – understands the file system layout
Disk – inode blocks

Client contacts server when it wants to open a file. Want to get the server out of the loop for as many disk operations as I can. The raw R/W request (RPC) goes to the disk itself where the data block is. Server says to client that this file is stored on this disk:

- this is a big security risk – I can go wreck that block – certificates fix this

NASD developers said I understand that this disk is storing that file. Here is the fileID for that file. The server verifies that the client has access rights to that file. Server also gives a cryptographic shared key that the client passes along to the attached storage device. Mom says I can and here is the note (valid certificate). I am going to return the disk blocks to you. Assume this is a high capacity network.

+ Big benefit is scalability
+ Server is out of the loop for data reads/writes but is in the loop for directory lookup

Moving to switched I/O backplanes instead of PCI. {Ability to do raw reads/writes out of buffer space}

Disk is not so dumb, but it has to know where it stored file information. PC on server handles certificates. Doesn’t know about any hierarchical file structure. Server involved with Name lookup. The disk does know for this file how to find the blocks.

The server knows about renames and deletes that go to the disks. There needs to be some way to do getattrs and shortcuts, maybe another protocol from disk to server.

12.5.1 Active Disks

Let the client download information.

SRPC – A little engine that lets you run Perl scripts to access disks. Can do a write to a file, for a portion smaller than the block. Need to do a Read-Modify-Write. Need to read the entire block across network, modify, and write block. But here we can do this on the disk-pc where we read the block locally, modify the block locally and write it back.

If something in the disk's cache is not up to date but no one else has modified it, we don't have to send it across; this gives some ability to save bandwidth.

Dave Patterson is pushing the notion of storage "bricks": a ton of disks in one enclosure, each doing autonomous evaluation of itself and its partners. With smarts on the disks, a disk can predict when it is going to fail and signal the main system to come and replace it before it dies.

12.6 File System Layer

Distributed data store – the layer the application hands distributed data storage off to. Examples include Oceanstore (fileID, objectID) and Khazana. The store is responsible for locating the data.

The File System Layer knows how to:
- Store data (persistently)
- Locate data
- Maintain consistency – either all copies up to date all the time (sequential consistency), or, when the storage layer is just for convenience (e.g., a distributed game where another burst of data will overwrite the data at any time), a much weaker consistency
- Provide redundancy / fault tolerance
- Synchronization (optional, depending on the consistency choice)

Oceanstore or Khazana servers are lying around with their own disks. An application comes along and does an "open/read G10" to one server:

App ----> S – D
     |--> S – D
     |--> S – D

Perform a hash function to decide where I should start looking, or use an epidemic (peer-to-peer) protocol for searching. It is up to the file system layer above to put some sense on the search. NFS decomposes into a storage layer, a metadata layer, and an application layer. These systems are targeted at the wide area, or at very large storage with a compatibility layer on top – a decomposed or distributed file system. A step beyond this is Freenet or KaZaA.
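A minimal sketch of the "hash to pick a starting server" idea, under the assumption of a simple modulo placement (real systems such as Oceanstore use more elaborate distributed routing):

    #include <stdint.h>

    /* FNV-1a string hash; any decent hash would do for this sketch. */
    static uint32_t fnv1a(const char *s) {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
        return h;
    }

    /* Every client computes the same starting server for a given object ID,
     * so a lookup can begin without asking a central directory. */
    int first_server_to_try(const char *object_id, int num_servers) {
        return (int)(fnv1a(object_id) % (uint32_t)num_servers);
    }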

Two different independent servers can request the same data with locking and talk to the data system. So an application can have a second choice of server to contact if one fails.

Can we build a layer that is useful for more than just file systems? Does it make sense to support a wide variety of applications with a single distributed data store layer? Otherwise every application has to build its own solution to redundancy, fault tolerance, etc. If the layer is close to what applications need, it enables better software reuse in the future.

13 Fault Tolerance

13.1 Partial Failures

An NFS server dies, but the client keeps working without telling the application about the fault, because NFS is designed to expect temporary partial failures. If DNS goes down, how does it affect the rest of the system? What about part of a stripe group going down? Build in safeguards to tolerate these failures.

Four qualities a fault-tolerant distributed system should have:
- Availability – what percentage of the time the service is available; how many days per year is the server down. This is the aggregate of service up time. (See the worked relation after this list.)
- Reliability – how long it runs before failing: the mean time between failures (MTBF). This tells how spread out the failures are; we need to build in features to handle them.
- Maintainability – how easy it is to fix the system or replace a disk drive. Storage cubes: we place a large number of disks in one box but can't get to the center disks, so we use striping or replication so those disks can be disconnected. Plotting % of dead nodes on x and % of data available on y, there is a critical point where routing is affected and the fraction of dead nodes causes availability to drop close to 0. The goal is hot-pluggable hardware. Ease of use has a second requirement: maintenance (even by technicians) should not cause unrelated faults.
- Safety – faults do not corrupt the system.
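A standard way to quantify the first two qualities (textbook definitions, not from the lecture): availability ≈ MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. For example, MTBF = 1000 hours and MTTR = 1 hour gives 1000/1001 ≈ 99.9% availability, which is roughly 8.8 hours of downtime per year.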

Other fault-tolerance qualities:
- During failures, want continued (perhaps degraded) performance.
- Security and performance are not compromised – failure-free performance is not significantly degraded.
- Failure containment – failures shouldn't cascade (related to Safety above); the design should not be brittle.
- Don't want it to be a resource hog – cost in dollars or time.
- Minimize complexity.
- Minimize loss of persistent (hard-to-regenerate) state.

What are some failures we need to concern ourselves with?
- Disk failure – not a head crash, but the disk controller corrupting a bit every 10 sectors; this is a Byzantine failure and very hard to debug. Or the disk controller starts randomly flipping bits.
- Network failure – instead of the router failing, packets get through with a few corrupted bits.
- Loss of power – any component might fail.
- Machine failures.
- Bus failures.
- A bug crashes the OS – or instead of going down, the OS starts deleting things.
- A software bug corrupts a component.
- Operator errors.

Fail-stop – the machine goes dead.
Byzantine – the machine keeps operating but does the wrong thing and generates bogus outputs; a virus or break-in is a Byzantine failure. A Byzantine failure is one where you can't trust anyone, not even your own generals. How can you be sure you have enough non-corrupt generals to eventually wage war?

Achieve consensus – the N nodes must agree on a fact ("he's just pining for the fjords" versus "Joe's dead"). With three nodes, two correct nodes can outvote a single faulty one.

Building fault-tolerant systems – techniques:
- Redundancy
- Transactions (applications cope with the failure)
- Checkpointing – roll back and try again
- Message logging – replay inputs

13.2 Redundancy for F.T.

13.2.1 Active Replication

Active replication (hot standby) – multiple instances running at the same time, all doing the same operations. In a distributed file system, we have two mirror servers receiving the same client input and a combiner unit comparing the results and sending them back to the client. In software this uses some form of group multicast; in hardware, two nodes listening (HA-NFS).

All processes & data replicated (N copies) and kept "in sync".
Any node can service a request (reply).
Instant recovery.

Examples: RAID mirroring; web sites do this for performance – most of the data in their server farms is read-only.

Keeping N resources all in lock-step. TANDEM makes computers like this.

- High resource use ($).
- Heterogeneity – all servers must be kept at the same speed so everyone isn't reduced to the lowest common performance.

- Correlated failures due to homogeneity.
+ Fast recovery time.
+ Exploit redundancy for performance – pick one set for writing, the other for reading.
± Simple recovery model, but it takes a lot to add a new box to the system.
- Some complications to "sync up" a new box.
- Poor scalability.

Multicast protocols to support this are expensive. You can build a central point that does this distribution in hardware, but it could become a bottleneck in terms of fault tolerance and performance – a central point of failure.

Tandem TMR – have N copies. If you care about Byzantine failures, you need at least 3 copies so that 2 out of 3 can out-vote a bad one; two nodes are good enough for fail-stop failures.

Replicating nodes covers failures of up to 1/3 of the nodes sending out incorrect outputs.
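A minimal 2-out-of-3 voter sketch for TMR (illustrative only, not Tandem's implementation):

    /* If at most one replica produces a bad (Byzantine) answer, the
     * majority value wins.  With only two replicas you can detect a
     * disagreement but cannot decide who is right, which is why two
     * replicas suffice only for fail-stop faults. */
    int tmr_vote(int a, int b, int c, int *out) {
        if (a == b || a == c) { *out = a; return 0; }
        if (b == c)           { *out = b; return 0; }
        return -1;   /* no majority: more than one replica misbehaved */
    }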

Cold backup just stores enough state to switch over.

13.2.2 Passive replication (backup units)

Track enough state at the master/primary so that a secondary (backup) can take over its operation within an acceptable time.

Periodically the master writes its state to the backup. Volatile memory either needs to be irrelevant or needs to be stored as well.

In this case the backup must send a heartbeat (ping) to verify the primary is still alive. A smart router can be told which node to send data to. The major difficulty is getting the backup to impersonate the primary.
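A hedged sketch of the backup's monitoring loop. primary_alive() stands in for a real probe (e.g., a ping RPC with a timeout) and is stubbed out here so the sketch compiles; the heartbeat period and the "3 missed beats" threshold are assumptions.

    #include <stdio.h>
    #include <unistd.h>

    #define HEARTBEAT_SECS 1
    #define MISSED_LIMIT   3

    static int primary_alive(void) { return 1; }   /* stub: replace with a real probe */
    static void take_over(void)    { printf("backup promoting itself to primary\n"); }

    void monitor_primary(void) {
        int missed = 0;
        for (;;) {
            if (primary_alive()) {
                missed = 0;
            } else if (++missed >= MISSED_LIMIT) {
                take_over();            /* start impersonating the primary */
                return;
            }
            sleep(HEARTBEAT_SECS);
        }
    }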

When building a system, how many failures do you want to be able to handle? Build for one failure or two?

+ During failure-free operation there is much lower load on the system than with active replication; you don't have to over-provision resources.

13.3 Built-in Redundancy

Build redundancy into a system so it tracks its own state. This may be useful for implementing passive replication: if a node fails, its state can be recovered on a neighbor node.

13.3.1 Checkpointing

The larger the system, the more likely that at least one node will fail. New machines are fairly unstable; over time they get to the point where they are not failing as often, about the time of hardware obsolescence. Checkpointing is built into the applications that need this reliability. What do we need to checkpoint?

- Address space
- Registers – anything that is part of the context: program counter, stack pointer
- Kernel structures – file info (what files are open, what their offsets are), parent/child info, kernel-mediated locks and sync structures, socket info
- Files – but generally files are on stable storage (just sync out any buffered data to the files)

For a single application over time:

P0 --------CP--------CP--------x  (failure)
                      |<--- rollback ---|

Consistent checkpoint: all messages recorded as received have also been recorded as sent, i.e., every message received in one process's checkpoint is seen as sent in the sender's checkpoint.

If the processes are waiting on each other, not much is lost by the other processes rolling back to their checkpoints. However, if the processes are fairly independent, then their processing is lost when they roll back.

Uncoordinated Checkpoints

P1 ----C-----^-----C
             |  (message)
P2 ----C-----+

How many checkpoints do I need to keep around? Depending on when messages are sent, rollbacks can cascade; since the checkpoint times are arbitrary, nothing prevents these problems.

Consistent checkpointing or coordinated checkpointing

If we synchronize the checkpointing, garbage collection is facilitated. Whenever I see an incoming message with a checkpoint interval number higher than my own, I adopt that number and take a checkpoint, and I know the sender has already checkpointed. This creates a causally consistent set of checkpoints.
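A hedged sketch of that rule. The message layout, the placeholder take_checkpoint() and deliver() helpers, and the single global interval counter are assumptions made for illustration:

    #include <stdio.h>

    static int my_interval = 0;

    static void take_checkpoint(void)       { printf("checkpoint at interval %d\n", my_interval); }
    static void deliver(const char *payload) { printf("deliver: %s\n", payload); }

    struct message { int sender_interval; const char *payload; };

    void on_send(struct message *m, const char *payload) {
        m->sender_interval = my_interval;     /* piggyback my interval number */
        m->payload = payload;
    }

    void on_receive(const struct message *m) {
        if (m->sender_interval > my_interval) {
            my_interval = m->sender_interval; /* adopt the higher number ...      */
            take_checkpoint();                /* ... and checkpoint before delivery */
        }
        deliver(m->payload);
    }

The effect is that a receive never ends up in a checkpoint that logically precedes the sender's checkpoint, which is the causal-consistency property described above.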

- There is a window of time in which everyone is doing nothing.
- If everyone tries to dump to the same file server for checkpointing, the bottleneck will cause problems; the coordinated checkpoint will hammer the stable storage server.

One solution is to snapshot only the changes between checkpoints: mark pages read-only, and after a write, mark the page dirty. Code space isn't going to change. The checkpoint structure becomes more complicated, since we now keep a table of dirty pages.

Rolling back effectively "unsends" messages. Message logging (below) is an alternative.
- Dealing with the "outside world" is particularly difficult to roll back:

Launching a weapon, sending a package, or dispensing money can't be rolled back. If we regenerate the same output on the screen, the user may be able to cope with it. Lots of I/O crushes checkpointing.

Copy-on-write can be costly, so we look at message logging instead.

Let's say it takes 45 minutes to checkpoint across 1000 processes. If I checkpoint every hour, I get only 1/4 usage from the machine – that is 300% overhead. Suppose instead I am willing to pay only 1% overhead; then I must checkpoint only about once every 75 hours, and if my MTBF is less than 75 hours I make no progress.
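To make the arithmetic explicit (the same numbers as above, just rearranged): with checkpoint cost C and checkpoint interval T, the fraction of useful time is (T − C)/T and the overhead relative to useful work is C/(T − C). With C = 45 min and T = 60 min, useful time is 15/60 = 1/4 and overhead is 45/15 = 300%. To push the overhead down to 1%, we need T − C = C/0.01 = 4500 min ≈ 75 hours of useful work per checkpoint, so an MTBF below roughly 75 hours means a failure arrives before the interval completes.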

The trade-off: tighten my checkpoint interval (checkpoint frequently) versus widen my interval (checkpoint rarely).

If we are willing to embed checkpointing code into the application, we know exactly what information to store, which reduces the amount of state to save; we know which state can be recreated on the fly, can reopen files manually, and can choose judiciously when to checkpoint so as to minimize storage.

Big problems with checkpointing:
- Large checkpoint windows – can lose a lot of work.
- Lots of I/O interaction with the outside world forces frequent checkpoints.
+ On top of a transaction system it works well.

13.3.2 Message Logging

The problem with checkpointing is messages that were received but are then "unsent" because the sending process rolls back. With message logging, every message gets put in a log before I send it.

With a failure and a rollback, I may get a message that was received but never sent. Here the OS can track the message, ACK it, and record the ACK in the log; after the rollback, the OS ignores the repeat send of the same message.

All of these systems, including checkpointing, depend on the assumption of deterministic execution. This is a problem with gettime(), unless we also roll back the logical clock when we restore a checkpoint; assume applications are independent of absolute time.

Message logging prevents redoing something that has already been done (by recording ACKs, for example) and prevents having to roll back any other process. Now rollback depends only on failures: only the node with the failure needs to roll back.

Synchronous Message Logging or Pessimistic Message Logging – always have enough info in the log to rollback or roll forward to where we are. There is a lot of I/O to stable storage. Problem is that we are not optimizing for the common case.

I do not deliver a message to a process until I have put the message on stable storage. This adds a big latency to every message: if between every send/receive I have to write something down, I have slowed down the failure-free case. The failure recovery semantics are really good, but logging adds latency to all messages.

Optimistic M.L.

I log to disk asynchronously: I send the message to the stable storage device and to the other process at the same time. On the sending side I buffer the messages I send out and then flush them to the stable storage device. The problem: if I fail at just the wrong time, there may be key messages that were received by someone else but never logged, so I can still get the domino effect.

Sender-based M.L. – this is the scheme described above.

Receiver-based M.L. – the receiver logs everything it receives; when it sees a repeated incoming message it discards the duplicate.
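A hedged sketch of the receiver-side duplicate filter, assuming each sender numbers its messages with a monotonically increasing sequence number (the per-sender table and its size are illustrative):

    #define MAX_SENDERS 64

    static long last_delivered[MAX_SENDERS];   /* highest seq seen per sender */

    /* Returns 1 if the message should be delivered, 0 if it is a replayed
     * duplicate from a sender that rolled back and re-sent. */
    int should_deliver(int sender, long seq) {
        if (seq <= last_delivered[sender])
            return 0;
        last_delivered[sender] = seq;
        return 1;
    }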

Checkpointing:
+ low overhead
+ simple
- domino-effect rollback
- coordination overhead
- major weakness with the outside world

Message logging:
- logging overhead
- failure-free performance cost
+ solves rollback problems
+ solves outside-world and coordination handling

14 Software Distributed Shared Memory (DSM) Systems

14.1 IVY

IVY – Kai Li.

MESI – Modified, Exclusive, Shared, Invalid states.

Contact the node that holds the exclusive copy.

Write – in local cache as exclusive

Modified/exclusive is a local operation.

Protocol to invalidate everyone else.
Protocol to transfer the data.

C-P    C-P    C-P    C-P
 |      |      |      |
 ---------------------------- (interconnect)

To detect an access to something I don't have access to, check the page table protection and valid bits: an illegal access causes a trap, and software in the kernel takes over. Either change the trap handler to handle protection violations, or install a signal handler that registers a callback when the violation happens – the faulting address is passed to the handler. Ivy uses signal handlers: the signal handler does in software what hardware coherence would have done.
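A minimal sketch of that user-level trap path: protect a page, let the faulting access raise SIGSEGV, inspect the faulting address in the handler, "fetch" the page (just a placeholder comment here), unprotect it, and return so the instruction re-executes. The real Ivy protocol layers the directory and ownership transfer on top of this; error handling is omitted.

    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long page_size;

    static void fault_handler(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        void *page = (void *)((uintptr_t)info->si_addr & ~((uintptr_t)page_size - 1));
        /* Here Ivy would ask the (believed) owner for a copy / write rights. */
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
    }

    int main(void) {
        page_size = sysconf(_SC_PAGESIZE);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        char *shared = mmap(NULL, page_size, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) return 1;

        shared[0] = 42;            /* faults, handler grants access, retried */
        printf("%d\n", shared[0]);
        return 0;
    }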

Ivy has a software directory recording who I think is the owner of each page (the coherency unit is a page, not a cache line). If the page is not local, send a request to that node: "I need a read-only copy of the data." Eventually we find the owner node, which may hold the data in exclusive or dirty mode.

There might be a process continuously receiving protocol messages and handling them (a handler routine). The handler changes the directory state to reflect that this page is now in shared mode, changes the page's protection to read-only, and ships a copy over to the requesting node, which then maps the page with the appropriate access. The OS restarts the faulting instruction; the second execution, now with read access, succeeds. The assumption is that only one thread runs at a time.

Ivy assumed that only the current owner of the data knows who else has the data. MESI assumes only one writer.

[Diagram: a matrix of data distributed across nodes; --- marks a memory page, F±F marks a falsely shared page where data belonging to two neighboring nodes lies on the same page.]

What happens on pages that have multiple accessors? Even if each processor only touches its own part of the data (i.e., no true sharing), data that crosses a page boundary means a neighbor node will suffer from false sharing, since data from both nodes lies on the same page. False sharing is a problem because the coherence granularity is a page instead of a cache line.
- false sharing – pages ping-pong between nodes
- lots of data copying between processors
- invoking expensive traps and signal handlers
- lots of communication
+ works well if there is not a lot of sharing, especially not a lot of read/write sharing

Cache update – every time a node modifies a falsely shared page, the changes have to be sent to the neighbor node. Update works better for producer/consumer patterns.

Invalidate works better when access is unpredictable; lots of locality also suggests that invalidate is better.

14.2 Munin

Munin: write-invalidate, migratory, and update protocols. Under sequential consistency, every write would have to be reflected everywhere else immediately; with false sharing, Munin instead sends only "diffs" between processors and marks the page writable.

for (...) {
    wait_at_barrier();   // push out any data needed by other nodes:
                         // looks like a Release (make local changes global)
                         // followed by an Acquire (make global changes local)
    do_work();
}

When we receive a message, we know its contents and that the sender has sent it, so there is implicit knowledge of synchronization: I know that you are at the barrier point.

The initial Munin didn't do eager release consistency, though there is some advantage to eagerness if bandwidth is available. DSM never really caught on with the scientific-application community; people throw in larger caches or better message passing instead.

14.3 Treadmarks

TreadMarks implements lazy release consistency.

It still had to maintain diffs on shared pages. Whenever there is a change, I don't forward the data; I keep the diffs around. This is a soft invalidate, since I only notify the sharing nodes.

He wanted one universal protocol: notify, then on "hey, I want updates," send the diffs. Sometimes it is more efficient to send the data along with the invalidate notice.

- At what time did each invalidate notice come in, and who has the latest data? Now we need vector timestamps to order everything (see the sketch after this list).
- Another problem is garbage collection: we want to flush old data out, but is this data safe to throw away?
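A minimal sketch of the vector timestamps referred to above, assuming one counter per node (TreadMarks' real interval/diff bookkeeping is richer than this):

    #define NNODES 8

    typedef struct { long c[NNODES]; } vtime;

    /* 1 if a happened before b, 0 otherwise (concurrent or after):
     * every entry of a is <= b and at least one is strictly less. */
    int vt_before(const vtime *a, const vtime *b) {
        int strictly_less = 0;
        for (int i = 0; i < NNODES; i++) {
            if (a->c[i] > b->c[i]) return 0;
            if (a->c[i] < b->c[i]) strictly_less = 1;
        }
        return strictly_less;
    }

    /* Component-wise max: what a node does when it learns another's time. */
    void vt_merge(vtime *into, const vtime *other) {
        for (int i = 0; i < NNODES; i++)
            if (other->c[i] > into->c[i]) into->c[i] = other->c[i];
    }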

John Carter thought this was overkill. They worked intensely on implementation issues.

14.4 MIDWAY

Midway implements entry consistency.

+ Works well when objects are well contained and you operate on individual records at a time.
- Fell apart when objects are not compartmentalized, like large arrays.

If the whole array is my object, I have to update all the elements of the array – that doesn't scale. If each element is an object, I have to manage and synchronize on every element individually.

It started looking like a message-passing system.

14.5 SHASTA

(From DEC.) By a person known for RC; the update protocol is the origin of release consistency.

(Write-Invalidate and Migratory are not RC)

The goal: run an application written for a shared-memory environment – in fact, take unmodified SMP binaries – on top of a software shared-memory system. Since we can't modify the source to add a special signal-handler-based scheme for detecting shared-memory accesses, they instrumented the binary instead.

They looked at the binary and had a binary modifier that rewrites each shared access; a load such as

    ld   r1, x

becomes

    push address(x)
    jsr  rdshell

Every single read and write had to be modified.

They played analysis games so that an access checked once need not be checked again later.
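A hedged C-level analogue of the check the rewriter inserts before each shared load: consult a per-block state table and only call into the protocol ("rdshell") on a miss. The block size, the table, and the stubbed fetch routine are illustrative placeholders, not Shasta's actual data structures.

    #include <stdint.h>

    #define BLOCK_SHIFT 6            /* 64-byte coherence blocks (assumed) */
    #define NBLOCKS     (1 << 20)

    enum state { INVALID = 0, SHARED, EXCLUSIVE };
    static unsigned char block_state[NBLOCKS];

    /* Stub for the protocol routine the rewritten code jumps to. */
    static void rdshell(void *addr) {
        uintptr_t blk = ((uintptr_t)addr >> BLOCK_SHIFT) % NBLOCKS;
        block_state[blk] = SHARED;    /* pretend we fetched a readable copy */
    }

    /* The inlined check that conceptually precedes every shared load. */
    static inline int read_check(void *addr) {
        uintptr_t blk = ((uintptr_t)addr >> BLOCK_SHIFT) % NBLOCKS;
        if (block_state[blk] == INVALID) {
            rdshell(addr);            /* miss: go get a readable copy */
            return 1;                 /* had to invoke the protocol   */
        }
        return 0;                     /* hit: the original load proceeds */
    }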

They were able to port Oracle to a cluster of servers this way.

15 Computer Systems Security

15.1 Prevention, Detection, Reaction

Prevention – prevent unauthorized access to secure data: confidentiality (no unauthorized reads), integrity (no unauthorized writes; keep user-level data secure), availability (resist denial-of-service attacks).

Detection – audit various features in the OS to track what is happening.
Reaction – respond to what is detected.

15.2 Authentication, Authorization, Auditing

The gold standard for security (Au): authentication, authorization, auditing.

Authentication – proving you are who you say you are, e.g., with a password. "I am the file service in CS; I think you are a workstation on this network."

Authorization – filtering out those who don't have access to a resource. Access control lists (ACLs) give finer-grained control over who has access to a particular resource. NT has more sophisticated controls on resources; Unix is entirely file-based control.

Auditing – trace what has been done to make sure everything is happening correctly; log files record what is going on. More auditing gives more ability to debug what happened, but costs more compute time, and there is a privacy issue.

15.3 Trusted Computing Base

Trusted Computing Base (TCB) – the set of all security mechanisms in a (distributed) computer system that enforce a security policy; it includes anything that, if compromised, will violate security. We want the security design to minimize the size of the TCB.

- Physical integrity of the machine (installing passive sniffers is bad)
- OS kernel, drivers, and libraries
- Filesystem
- Password file / directory server / NIS (yellow pages)
- Key system daemons (setuid root programs)
- C compiler

Need to make sure all of the above are not compromised to preserve the integrity of the system. Access libraries like DLLs could compromise security. If there are N components here there could be N! combinations to deal with.

Correctness of individual components (bug-free) – apply software engineering principles to individual components. We also need to protect the components themselves (protection of the TCB): getting the root password or write access to the password file enables all of these attacks, and setuid permits activity with another user's privileges.

Compromising the C compiler could add a backdoor to the system – a Trojan C compiler could insert something every time it is used: every time I build the OS kernel, it could plant a backdoor. It generates evil binaries.

Getting correct inputs.

Common ways people break into systems:
- Buffer overflows exploiting bugs in the TCB – being able to pass commands into a setuid-root program or script. The finger daemon had a 512-byte buffer on the stack; gets() reads a single line and stops only at end of line, so an attacker could write 536 bytes, overwrite the return address, and exec(/bin/sh), getting a shell with the finger daemon's (root) privileges. (See the sketch after this list.)

- Direct attacks on passwords:
  o Trojan horses attack passwords, for example with a mock password-entry screen.

Emails come in pretending to be benign and users invoke an operation on their own system. Some of these are social engineering, with messages like "I love you."

  o Brute-force decryption; dictionary attacks / easy-to-guess passwords; default passwords on shipped systems.

  o Social engineering – asking someone to change a password while monitoring them; typing in a password at an ATM at gunpoint.

  o Setuid-root / privileged-process trap doors.
  o + Prevention: systems that force users to choose better passwords.
  o + Smart cards – maintain a shared secret, a password, or a function: you type in a random set of digits and type back what the smart card tells you. There is also a private password one types into the card – double security. Keys can be suspended or deleted after excessive brute force; 3 or 4 screw-ups lock the account.

o + Biometric smart cards – thumb print, eye vessel scan, voice, physical dimensions of face or hand
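A hedged illustration of the bug class the fingerd exploit abused (referenced in the buffer-overflow item above): a fixed stack buffer filled by gets(), which never checks length, versus fgets(), which does. This is not the actual fingerd code, and gets() is so dangerous it was removed from the C standard.

    #include <stdio.h>

    void vulnerable(void) {
        char line[512];
        gets(line);               /* no bound: a longer input line overwrites
                                     the saved return address on the stack  */
        printf("%s\n", line);
    }

    void fixed(void) {
        char line[512];
        if (fgets(line, sizeof line, stdin))   /* bounded read */
            printf("%s", line);
    }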

Internet Worm – all it did was self-propagate and hide. It targeted VAXes and Sun-3s under 4.2 BSD, the dominant machines acting as major servers. A worm is a program that can run independently and self-propagate. It would look in /etc/passwd and apply crypt() to guess passwords; it also used .rhosts and .forward files to find other machines to log into, then used rsh with a compromised password and searched for other users' passwords. Another attack was to connect to SMTP – the sendmail daemon on port 25.

A bootstrap program pulled in the main program. (The author, Robert Morris, Jr., is now a professor at MIT.) If the worm saw someone listening on port 11357, it assumed the worm was already running there and killed itself off; inoculation therefore involved putting a small program on port 11357, which made the machine look immune. The worm would break in, upload itself, and search for other machines.

It would fork itself and kill the parent every so often, making it hard to detect, and blow itself away and reinfect every 12 hours. It took advantage of a bunch of vulnerabilities – fingerd, settings in sendmail, .rhosts trust open to everyone – and of the homogeneous environment of VAXes and Sun-3s.

Diversity has its advantages – different Unixes and Linuxes. The UofU firewall had a low-end VAX with an older BSD that was immune.

Virus – cannot exist by itself; it attaches itself to a benign program, modifies the benign program's behavior, and then self-propagates.

15.4 Design Principles for Security

Separation of policy (administrative level – who can do what) and mechanism (what implements the policy).

Avoid building policy assumptions into the mechanism. For example, in an ACL system, try not to presume the ACLs will be used in a particular way. Don't build security policy into a mechanism.

Principle of least privilege – "need to know": after authentication, give the minimal amount of privilege necessary to do the task. For example, setuid programs don't all need to run as root: the finger daemon doesn't need to run as root, it can run as an ordinary user.

Design principles:
- Separation of policy & mechanism – who can access the ACLs themselves is an example of mechanism impacting policy.
- Principle of least privilege – the ability to read a file is not the same as the ability to write it; finer-grained control over access. ACLs are better than the coarse granularity of Unix file permissions; they support least privilege through a finer-grained distribution of rights.

- Minimize & validate the TCB – trust as little as possible; reduce the amount of the operating system that one has to trust. The compiler can help prove properties of the code:

o "Proof-carrying code" – the program carries a proof that it makes no malicious calls; the loader has a checker that can reject the binary if the proof does not hold.

o "Typed Assembly Language (TAL)" – validates certain properties, aimed at buggy programs that could be attacked rather than at deliberately malicious code.

o Certifying compilers / sandboxing – even if I allow your agent to run on my machine, I want it to run in a sandbox so it can't touch my file system, though I may give it some scratch space.

o A Java applet can still crash a system with pop-ups or memory consumption. With a Java OS instead of just a JVM, we could kill applets to prevent this.

- Separation of privilege
- User acceptability
- Built-in self-audit

15.5 Distributed System Issues

We have to worry about trusting other computers and compromised links, not just users. On really secure systems we never trust anything physical, so we require user authorization all the time. In less secure systems we allow, e.g., printers to be used by machines based on physical location.

Network attacks:
- Snooping – I can see any information flowing, including private keys going back and forth. This is why sysadmins don't want to use telnet, where passwords go across the network in the clear.
- Replay attacks – during a key exchange, someone who is snooping records the entire conversation and then replays it.
- Man-in-the-middle / reflection attacks – two nodes are talking to each other and a third party intercepts the traffic and takes over: X pretends to be A to B and to be B to A, i.e., X spoofs A or B. A and B could each have SSL links to X, where everything is decrypted in the middle. Spoofing is pretending to be someone else.

15.6 Cryptography

Encryption:

Private-key (secret/symmetric) encryption: K(msg). Symmetric means the same key decrypts: K(K(msg)) → msg. DES is one algorithm.

Public Key

There are two keys. Nobody else knows the private key; everybody knows my public key. If I encode with my private key, everyone can read it (and knows it came from me). If someone sends to me encrypted with my public key, only I can decrypt it with my private key. RSA is one algorithm.
- Key distribution has to be secure
- More expensive than private-key encryption

Message digests (MD5): a computationally secure one-way hash. We have a hash function H and a message m that we want to hash: h = H(m).
- Given h, it is "impossible" to determine m.
- Weak collision property (requirement): given m and h, it is infeasible to find another m' such that H(m') = h.
- Strong collision property: given only H, it is infeasible to find any m and m' such that H(m) = H(m').

Send E(m : h); the receiver computes D(E(m : h)) = (m', h') and checks that H(m') = h'.

Technically we only need to encrypt the hash, not the message.

Digital signature – I want to be able to prove who signed a message. The signer appends a little something, similar to a hash, indicating that it was really me who sent it. If I just send out Pri(msg), there is no reliable way to tell whether the message was modified when it is received. So we end up sending Pri(msg, MessageDigest) or Pri(msg, Hash): the digest shows I knew the message, which ties the content to my private key. It is computationally infeasible to forge such a message as a spoof without the private key. Digital signatures combine the whole shebang and are a step beyond plain message digests.

{msg, Pri(H(msg))} where Pri (H(msg)) is the digital signature.
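A hedged, toy-sized sketch of exactly that structure: hash the message, "encrypt" the hash with the private key, and let the verifier recompute the hash and compare. The hash is a toy stand-in for MD5 and the key pair is textbook RSA with tiny numbers (n = 61·53 = 3233, e = 17, d = 2753) – illustrative only, nothing like real parameters.

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t toy_hash(const char *m) {          /* stand-in for MD5 */
        uint32_t h = 2166136261u;
        while (*m) { h ^= (uint8_t)*m++; h *= 16777619u; }
        return h % 3233;                               /* keep it below n */
    }

    static uint32_t powmod(uint32_t b, uint32_t e, uint32_t n) {
        uint64_t r = 1, base = b % n;
        while (e) { if (e & 1) r = r * base % n; base = base * base % n; e >>= 1; }
        return (uint32_t)r;
    }

    int main(void) {
        const char *msg = "pay Bob $5";
        uint32_t sig = powmod(toy_hash(msg), 2753, 3233);   /* sign: Pri(H(msg))  */
        int ok = powmod(sig, 17, 3233) == toy_hash(msg);    /* verify with Pub    */
        printf("signature %u verifies: %s\n", sig, ok ? "yes" : "no");
        return 0;
    }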

Certificates – try to reduce the vulnerability of public-key distribution; the big problem with message digests and signatures is getting the public keys out there. A certificate authority (CA) is used – ultimately everything depends on, e.g., Verisign's private key(s).

o Consists of my public key and identity, the CA's identity, and those fields signed with the CA's private key (verifiable with the CA's public key).

o Certificate = {PubA, A, CA, KpriCA(PubA, A)}
o Certificates have bits telling what they can be used for, and time (expiration) values.
o CAs maintain revocation lists.
o Also bits indicating how the certificate can be used and the level of trust.

Secret (private) key encryption – every pair of communicators must share a unique key, so with N users there must be on the order of N^2 keys: a key-scalability problem. Secret-key crypto is quicker at encryption and decryption than public-key. Used for privacy.

Public-key encryption – needs only 2N keys for N users, and is easier in that you don't have to exchange a shared secret each time. If a key in the public-key system is compromised, it affects everyone who relies on it, while with shared secret keys only one pair is compromised. Public-key operations are much more computationally expensive, so public key is used only at connection establishment, to create a session key; from then on we use symmetric (secret-key) encryption. Used for privacy.

Message digests – for integrity: a hash function such that, given a message and its hash, it is difficult to modify the message and keep the same hash. Sending |cleartext|MD| can still be corrupted by someone in the middle into |message'|MD'|. Encrypting the pair, E(|CT|MD5|), prevents the man-in-the-middle attack, because the attacker cannot produce a correct MD for a modified message.

Digital signatures – put a mark on a packet that only you could have generated: a combination of an MD with a private key on top of it. When I decrypt the signature with your public key and compare it with the MD of the message, I get a check both that you sent it and that the message hasn't changed, i.e., integrity: no one has modified the message along the way, and I know you sent it. They provide no privacy.

Certificates – rely on a trusted certificate authority: a combination of my identity and my public key, signed by the certificate authority. This makes public-key distribution much more manageable – you only need to get one copy of the CA's public key out. Delegation permits transitive relationships, with different CAs trusting each other. Various levels of rights can be conveyed by using different certificates.

15.7 Authentication

Smart cards – a private password goes onto the smart card. Biometrics don't work well for remote sites: how do we contact the remote machine so it can quiz me to find out who I am? There are replay attacks and man-in-the-middle attacks to worry about. A simple approach is challenge and response ("what's your mother's maiden name?"). Consider two honest systems A and B that share a secret key Kab:

A → B: "I am A", challenge Ca
B → A: Kab(Ca), challenge Cb
A → B: Kab(Cb)

If we allow unlimited inquiries, an attacker may be able to break the key encryption:

Issues:
1. An attacker can acquire a large corpus of cleartext–ciphertext pairs.
2. If the challenge space is small, an attacker can build a table of the responses to every challenge.
3. An attacker can pretend to be A while talking to B.

So I can do an SSH across the wire without a password.

Solutions to reflection and MIM attacks:
1. Separating challenges and responses helps against reflection attacks but not MIM.
2. Nonces – random values that keep one from replaying things; the random value is tracked.

An attacker (BAD) pretending to be A sends "A, Ca" to B; B tracks a nonce N for this session and replies Kab(Ca + N), Cb. To answer Cb, BAD opens a second session "B, X" and gets back Kab(X + N'). But BAD needs Kab(Cb + N), and N ≠ N', so the reflection fails.

Nine papers:
1. Proof-carrying code (Necula)
2/3. Legion / Globus (wide-area structure)
5. Porcupine (scalable mail service) – Tues
6. LARD (locality-aware request distribution) – Thur
7. Energy-aware clusters – Thur
8. Resilient Overlay Networks – Jeff – Tues
9. PAST – a resilient overlay across hundreds of thousands of nodes, so there are no hot spots; distributed networks.
