Upload
rinnocente
View
179
Download
0
Embed Size (px)
Citation preview
Nov 242000 rinnocente 1
Automatic load balancing and transparent process migration
Roberto InnocenteRoberto Innocenterinnocentehotmailcomrinnocentehotmailcom
November 242000November 242000Download postscript from Download postscript from mosixpsmosixps
or gzipped postscript from or gzipped postscript from mosixpsgzmosixpsgzor pdf from or pdf from mosixpdfmosixpdf
Nov 242000 rinnocente 2
Introduction
Purpose of this presentation is an introductionPurpose of this presentation is an introductionto the most important features of MOSIXto the most important features of MOSIX
-- automatic load balancingautomatic load balancing-- transparent process migrationtransparent process migrationgiving an overview of related projects and of thegiving an overview of related projects and of thedifferent possible implementations of thesedifferent possible implementations of these
mechanismsmechanisms
Nov 242000 rinnocente 3
Scheme of the presentation
11 Overview of some Distributed Operating Overview of some Distributed Operating SystemsSystems
22 Load balancing mechanismsLoad balancing mechanisms33 Process migration mechanismsProcess migration mechanisms44 focus on Mosixfocus on Mosix
Nov 242000 rinnocente 4
Overview of somedistributed OS
Nov 242000 rinnocente 5
Distributed OS
Sprite Sprite Charlotte Charlotte AccentAccent AmoebaAmoeba V System V System MOSIXMOSIX Condor (this is just a batch system but offering Condor (this is just a batch system but offering
job migration)job migration)
Nov 242000 rinnocente 6
Sprite (DouglisOusterhout 1987)
Experimental Network OS developed at UCBExperimental Network OS developed at UCBWas running on Suns and DecstationsWas running on Suns and DecstationsThe process migration in this system introduced theThe process migration in this system introduced thesimplifying concept of Home Node of a process (the simplifying concept of Home Node of a process (the originating node where it should still appear to run)originating node where it should still appear to run)After then this concept has been utilized by many otherAfter then this concept has been utilized by many otherimplementationsimplementations
of TclTk fameof TclTk fame
Nov 242000 rinnocente 7
Charlotte (Artsy Finkel 1989)
Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems
Nov 242000 rinnocente 8
Accent (Zayas 1987)
developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs
based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by
Zayas(1987)Zayas(1987) no load balancing no load balancing
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 2
Introduction
Purpose of this presentation is an introductionPurpose of this presentation is an introductionto the most important features of MOSIXto the most important features of MOSIX
-- automatic load balancingautomatic load balancing-- transparent process migrationtransparent process migrationgiving an overview of related projects and of thegiving an overview of related projects and of thedifferent possible implementations of thesedifferent possible implementations of these
mechanismsmechanisms
Nov 242000 rinnocente 3
Scheme of the presentation
11 Overview of some Distributed Operating Overview of some Distributed Operating SystemsSystems
22 Load balancing mechanismsLoad balancing mechanisms33 Process migration mechanismsProcess migration mechanisms44 focus on Mosixfocus on Mosix
Nov 242000 rinnocente 4
Overview of somedistributed OS
Nov 242000 rinnocente 5
Distributed OS
Sprite Sprite Charlotte Charlotte AccentAccent AmoebaAmoeba V System V System MOSIXMOSIX Condor (this is just a batch system but offering Condor (this is just a batch system but offering
job migration)job migration)
Nov 242000 rinnocente 6
Sprite (DouglisOusterhout 1987)
Experimental Network OS developed at UCBExperimental Network OS developed at UCBWas running on Suns and DecstationsWas running on Suns and DecstationsThe process migration in this system introduced theThe process migration in this system introduced thesimplifying concept of Home Node of a process (the simplifying concept of Home Node of a process (the originating node where it should still appear to run)originating node where it should still appear to run)After then this concept has been utilized by many otherAfter then this concept has been utilized by many otherimplementationsimplementations
of TclTk fameof TclTk fame
Nov 242000 rinnocente 7
Charlotte (Artsy Finkel 1989)
Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems
Nov 242000 rinnocente 8
Accent (Zayas 1987)
developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs
based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by
Zayas(1987)Zayas(1987) no load balancing no load balancing
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 3
Scheme of the presentation
11 Overview of some Distributed Operating Overview of some Distributed Operating SystemsSystems
22 Load balancing mechanismsLoad balancing mechanisms33 Process migration mechanismsProcess migration mechanisms44 focus on Mosixfocus on Mosix
Nov 242000 rinnocente 4
Overview of somedistributed OS
Nov 242000 rinnocente 5
Distributed OS
Sprite Sprite Charlotte Charlotte AccentAccent AmoebaAmoeba V System V System MOSIXMOSIX Condor (this is just a batch system but offering Condor (this is just a batch system but offering
job migration)job migration)
Nov 242000 rinnocente 6
Sprite (DouglisOusterhout 1987)
Experimental Network OS developed at UCBExperimental Network OS developed at UCBWas running on Suns and DecstationsWas running on Suns and DecstationsThe process migration in this system introduced theThe process migration in this system introduced thesimplifying concept of Home Node of a process (the simplifying concept of Home Node of a process (the originating node where it should still appear to run)originating node where it should still appear to run)After then this concept has been utilized by many otherAfter then this concept has been utilized by many otherimplementationsimplementations
of TclTk fameof TclTk fame
Nov 242000 rinnocente 7
Charlotte (Artsy Finkel 1989)
Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems
Nov 242000 rinnocente 8
Accent (Zayas 1987)
developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs
based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by
Zayas(1987)Zayas(1987) no load balancing no load balancing
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 4
Overview of somedistributed OS
Nov 242000 rinnocente 5
Distributed OS
Sprite Sprite Charlotte Charlotte AccentAccent AmoebaAmoeba V System V System MOSIXMOSIX Condor (this is just a batch system but offering Condor (this is just a batch system but offering
job migration)job migration)
Nov 242000 rinnocente 6
Sprite (DouglisOusterhout 1987)
Experimental Network OS developed at UCBExperimental Network OS developed at UCBWas running on Suns and DecstationsWas running on Suns and DecstationsThe process migration in this system introduced theThe process migration in this system introduced thesimplifying concept of Home Node of a process (the simplifying concept of Home Node of a process (the originating node where it should still appear to run)originating node where it should still appear to run)After then this concept has been utilized by many otherAfter then this concept has been utilized by many otherimplementationsimplementations
of TclTk fameof TclTk fame
Nov 242000 rinnocente 7
Charlotte (Artsy Finkel 1989)
Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems
Nov 242000 rinnocente 8
Accent (Zayas 1987)
developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs
based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by
Zayas(1987)Zayas(1987) no load balancing no load balancing
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 5
Distributed OS
Sprite Sprite Charlotte Charlotte AccentAccent AmoebaAmoeba V System V System MOSIXMOSIX Condor (this is just a batch system but offering Condor (this is just a batch system but offering
job migration)job migration)
Nov 242000 rinnocente 6
Sprite (DouglisOusterhout 1987)
Experimental Network OS developed at UCBExperimental Network OS developed at UCBWas running on Suns and DecstationsWas running on Suns and DecstationsThe process migration in this system introduced theThe process migration in this system introduced thesimplifying concept of Home Node of a process (the simplifying concept of Home Node of a process (the originating node where it should still appear to run)originating node where it should still appear to run)After then this concept has been utilized by many otherAfter then this concept has been utilized by many otherimplementationsimplementations
of TclTk fameof TclTk fame
Nov 242000 rinnocente 7
Charlotte (Artsy Finkel 1989)
Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems
Nov 242000 rinnocente 8
Accent (Zayas 1987)
developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs
based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by
Zayas(1987)Zayas(1987) no load balancing no load balancing
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 6
Sprite (DouglisOusterhout 1987)
Experimental Network OS developed at UCBExperimental Network OS developed at UCBWas running on Suns and DecstationsWas running on Suns and DecstationsThe process migration in this system introduced theThe process migration in this system introduced thesimplifying concept of Home Node of a process (the simplifying concept of Home Node of a process (the originating node where it should still appear to run)originating node where it should still appear to run)After then this concept has been utilized by many otherAfter then this concept has been utilized by many otherimplementationsimplementations
of TclTk fameof TclTk fame
Nov 242000 rinnocente 7
Charlotte (Artsy Finkel 1989)
Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems
Nov 242000 rinnocente 8
Accent (Zayas 1987)
developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs
based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by
Zayas(1987)Zayas(1987) no load balancing no load balancing
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 7
Charlotte (Artsy Finkel 1989)
Itrsquos a Distributed Operating System developedItrsquos a Distributed Operating System developedat the UoWM based on the message passingat the UoWM based on the message passingparadigmparadigmProvides a wellProvides a well--developed process migrationdeveloped process migrationmechanism that leaves no mechanism that leaves no residual dependenciesresidual dependencies(the migrated process can survive to the death of the(the migrated process can survive to the death of theoriginating node) and therefore is suitable also fororiginating node) and therefore is suitable also forfault tolerant systemsfault tolerant systems
Nov 242000 rinnocente 8
Accent (Zayas 1987)
developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs
based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by
Zayas(1987)Zayas(1987) no load balancing no load balancing
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 8
Accent (Zayas 1987)
developed at CMU as precursor of Machdeveloped at CMU as precursor of Mach a kernel not Unix compatible with IPCs a kernel not Unix compatible with IPCs
based on a ldquoportrdquo abstractionbased on a ldquoportrdquo abstraction process migration has been added by process migration has been added by
Zayas(1987)Zayas(1987) no load balancing no load balancing
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 9
Amoeba (Tanenbaum et al 1990)
Tanenbaumvan Renessevan Staveren Sharp Mullender Jansen vaTanenbaumvan Renessevan Staveren Sharp Mullender Jansen van n RossumRossum(Experiences with the Amoeba Distributed Operating System (Experiences with the Amoeba Distributed Operating System 1990)1990)
ZhuSteketeeZhuSteketee(An experimental study of Load Balancing on Amoeba (An experimental study of Load Balancing on Amoeba 1994)1994)
doesnrsquot support virtual memorydoesnrsquot support virtual memory was based on the was based on the processorprocessor--pool modelpool model that assumes the of that assumes the of
processors is more than the of processesprocessors is more than the of processes
of python fame of python fame
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 10
V system (Theimer 1987)
developed at Stanford microkernel based developed at Stanford microkernel based runs on a set of diskless wksruns on a set of diskless wks
V has support for both remote execution V has support for both remote execution and process migration the facilities were and process migration the facilities were added having in mind to utilize wks during added having in mind to utilize wks during idle periodsidle periods
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 11
MOSIX (Barak 19901995)
its roots are back in the mid rsquo80s its roots are back in the mid rsquo80s MOS multicomputer OS (Barak HUJI 1985)MOS multicomputer OS (Barak HUJI 1985) set of enhancements to BSDOS MOSIX rsquo90set of enhancements to BSDOS MOSIX rsquo90 port to LINUX (Mosix 7port to LINUX (Mosix 7thth edition) around rsquo95 edition) around rsquo95
the system call based mgmt has been the system call based mgmt has been transformed to a proc filesystem interfacetransformed to a proc filesystem interface
enhancements rsquo98 (memory ushering enhancements rsquo98 (memory ushering algorithm)algorithm)
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 12
Condor (LitzkowLivny 1988)
Condor is not a Distributed OSCondor is not a Distributed OSIt is a batch system whose purpose is the utilizationIt is a batch system whose purpose is the utilizationof the idle time of WKS and PCsof the idle time of WKS and PCsIt can migrate jobs from machine to machineIt can migrate jobs from machine to machinebut imposes restrictions on the jobs that can bebut imposes restrictions on the jobs that can bemigrated and the programs have to be linkedmigrated and the programs have to be linkedwith a special checkpoint librarywith a special checkpoint library(More details on (More details on httphpcsissaitbatchhttphpcsissaitbatch))
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 13
Load balancingmechanisms
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 14
Load balancing methods taxonomy
job initiation only (aka job placement aka remote job initiation only (aka job placement aka remote execution)execution) staticstatic dynamic (almost all batch systems can do this dynamic (almost all batch systems can do this
NQS and derivatives)NQS and derivatives) migration (Mosix some Distributed OS)migration (Mosix some Distributed OS) job initiation + migrationjob initiation + migration
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 15
Load balancing algorithms
Two main categoriesTwo main categories Non preNon pre--emptive emptive they are used only when a new they are used only when a new
job has to be started they decide where to run a job has to be started they decide where to run a new job (aka job placement many batch systems)new job (aka job placement many batch systems)
PrePre--emptive emptive jobs can be migrated during jobs can be migrated during execution therefore load balancing can be execution therefore load balancing can be dynamically applied duringdynamically applied during job execution job execution (Distributed OSCondor)(Distributed OSCondor)
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 16
Non pre-emptive load balancing (aka job placement)
centralized initiation centralized initiation there is only 1 load balancer in the there is only 1 load balancer in the system that is in charge of assigning new jobs to computers system that is in charge of assigning new jobs to computers according to the info it collectsaccording to the info it collects
distributed initiationdistributed initiation there is a load balancer on each there is a load balancer on each computer that xfers its load to the others anytime it detects computer that xfers its load to the others anytime it detects a load change the local balancer decides according to the a load change the local balancer decides according to the load info received where to start new jobsload info received where to start new jobs
random initiationrandom initiation when a new job arrives the local load when a new job arrives the local load balancer selects randomly to which computer to assign balancer selects randomly to which computer to assign new jobsnew jobs
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 17
Pre-emptive load balancing randomrandom when a computer becomes overloaded it select at random when a computer becomes overloaded it select at random
another computer to which to transfer a processanother computer to which to transfer a process centralcentral one load balancer regularly polls the other machine to obtain one load balancer regularly polls the other machine to obtain
load info If it detects a load imbalance it select a computer wload info If it detects a load imbalance it select a computer with a low ith a low load to migrate a process from an overloaded computerload to migrate a process from an overloaded computer
neighbourneighbour (BryantFinkel (BryantFinkel 1981) each computer sends a request to its 1981) each computer sends a request to its neighbours when the request is accepted a pair is constitutedneighbours when the request is accepted a pair is constitutedIf they If they detect a load imbalance between them they migrate some processesdetect a load imbalance between them they migrate some processesfrom one to another after that the pair is brokenfrom one to another after that the pair is broken
broadcastbroadcast server basedserver based distributeddistributed each node receives a measure of the load of other nodes
if it discovers a load imbalance it would try to migrate some of its processes to a less loaded node
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 18
Processor load measuresThe metric used to evaluate the load is extremely important in a loadbalancing system More than 20 different metrics are cited in only 1paper(Mehra amp Wah 1993)
std Unix of tasks in the run queue size of the free available memory rate of CPU context switches rate of system calls 1 minute load avg amount of CPU free time
but the conclusion of many studies indicates that the simple rqueue is a better measure than complex load criteria in most situations
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 19
Process migrationmechanisms
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 20
Userkernel level migration Kernel level process migration (Mosix)Kernel level process migration (Mosix)
no special requirements no need to recompileno special requirements no need to recompile transparent migrationtransparent migration
User level process migration (Condor)User level process migration (Condor) compilation with special libraries (libckpta)compilation with special libraries (libckpta) a snapshot of the running process is taken on signala snapshot of the running process is taken on signal the snapshot is restarted on another machinethe snapshot is restarted on another machine more details at more details at httphpcsissaitbatchhttphpcsissaitbatch
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 21
Process migration factors
transparencytransparencyhow much a process needs to be how much a process needs to be aware of the migrationaware of the migration
residual dependencyresidual dependency how much a process is how much a process is dependent on the source node after migrationdependent on the source node after migration
performanceperformancehow much the migration affects how much the migration affects performanceperformance
complexitycomplexity how much complex the migration is how much complex the migration is
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 22
Transparent
Location transparentLocation transparent Access transparent (or network)Access transparent (or network) Migration transparentMigration transparent Failure transparentFailure transparent
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 23
System call interposingIn many cases transparency is obtained interposing some codeIn many cases transparency is obtained interposing some codeat the system call level There are different ways to do itat the system call level There are different ways to do it Very easy on message based OSs in which a system call is Very easy on message based OSs in which a system call is
just a message to the kerneljust a message to the kernel A modified kernel in this case it can be completely A modified kernel in this case it can be completely
transparent to the process (MOSIX) reasonably efficient transparent to the process (MOSIX) reasonably efficient for some kind of processesfor some kind of processes
A modified system library in this case the programs need A modified system library in this case the programs need to be linked with a special library(Condor) reasonably to be linked with a special library(Condor) reasonably transparent transparent
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 24
Process migration parts
execution state migrationexecution state migration memory space migrationmemory space migration communication state migrationcommunication state migration
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 25
Execution state migration
Itrsquos not that difficult For the mostItrsquos not that difficult For the mostpart the code can be the same as that used forpart the code can be the same as that used forswapping In that case the destination is a diskswapping In that case the destination is a diskwhile for migration the destination is awhile for migration the destination is adifferent nodedifferent node
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 26
Communication statemigration
It is a difficult task and usually it is avoided forcing themigrated process to redirect networking operations to theHome Node where they are performedOn Amoeba each thread of a process has its own communication staOn Amoeba each thread of a process has its own communication statetewhich remembers the stage of the ongoing RPCs these states arewhich remembers the stage of the ongoing RPCs these states are kept inkept inthe kernelthe kernel
during migration any message received for a suspended process isduring migration any message received for a suspended process isrejected and a ldquoprocess migratingrdquo response is returnedrejected and a ldquoprocess migratingrdquo response is returned
sending machine will rexmit after some delaysending machine will rexmit after some delay sending machine using a special protocol(FLIP) will locate the sending machine using a special protocol(FLIP) will locate the
migrated processmigrated process the same technique is used for reply msgs to a migrating clientthe same technique is used for reply msgs to a migrating client
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 27
Migratable socketsUnfortunately TCPIP has no direct support for thisUnfortunately TCPIP has no direct support for thisResearch is ongoing on different solutionsResearch is ongoing on different solutionsA proposed userA proposed user--level implementation requireslevel implementation requires substitution of the socket librarysubstitution of the socket library relinking of programs with the new libraryrelinking of programs with the new library(Bubak Zbik et al 1999) (Bubak Zbik et al 1999) In this proposal communication is done between virtualIn this proposal communication is done between virtualaddresses (there are servers that keep a correspondence table beaddresses (there are servers that keep a correspondence table betweentweenvirtual and real addresses) When an endpoint migrates a stub rvirtual and real addresses) When an endpoint migrates a stub remains onemains onthe original node to respond with a ldquoprocess migratingrdquo msg Thethe original node to respond with a ldquoprocess migratingrdquo msg The otherotherendpoint then should consult the server for the new real addressendpoint then should consult the server for the new real address to contactto contact
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 28
Memory space migration total copytotal copy(CharlotteLOCUS) process is frozen and the entire memory space(CharlotteLOCUS) process is frozen and the entire memory space
is moved the process is then restarted on remote (this can reqis moved the process is then restarted on remote (this can require many uire many seconds and further it can transfer memory that will not be usedseconds and further it can transfer memory that will not be used))
prepre--copyingcopying (V SystemTheimer) allows the process to continue the (V SystemTheimer) allows the process to continue the execution while its memory space is being transferred to the taexecution while its memory space is being transferred to the targetThen V rgetThen V freezes the process on the source and refreezes the process on the source and re--copies the pages modified during the copies the pages modified during the transfer timetransfer time
lazylazy--copying or demand pagecopying or demand page(AccentZayas) pages are left on the source (AccentZayas) pages are left on the source machine and are transferred only at page faults This mechanism machine and are transferred only at page faults This mechanism can leave can leave dirty pages on all the nodes the process goes through A small vdirty pages on all the nodes the process goes through A small variation used ariation used by MOSIX is to freeze the process during the transfer of all theby MOSIX is to freeze the process during the transfer of all the dirty pages and dirty pages and then to transfer from the backing store the other pages when refthen to transfer from the backing store the other pages when referrederred
shared file servershared file server(Sprite) Sprite uses the network file system as a backing (Sprite) Sprite uses the network file system as a backing store for virtual memory Dirty pages are flushed on the networstore for virtual memory Dirty pages are flushed on the network file system k file system and on the target the process is restarted (itrsquos just aswapand on the target the process is restarted (itrsquos just aswap--outswapoutswap--in)in)
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 29
focus on Mosix
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 30
MOSIX important points
Probabilistic info disseminationProbabilistic info dissemination PPM (PrePPM (Pre--emptiveProcessMigration)emptiveProcessMigration) Dynamic load balancingDynamic load balancing Memory usheringMemory ushering Decentralized controlDecentralized control Efficient kernel communicationsEfficient kernel communications
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 31
Probabilistic info dissemination
each node at regular intervals sends info on each node at regular intervals sends info on its resources at a random subset of nodesits resources at a random subset of nodes
each node at the same time keeps a small each node at the same time keeps a small buffer with the most recently info receivedbuffer with the most recently info received
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 32
MOSIX load balancing algorithm
Mosix uses a preMosix uses a pre--emptive and distributed loademptive and distributed loadbalancing algorithm that is essentially basedbalancing algorithm that is essentially basedon a Unix load avg metric This means thaton a Unix load avg metric This means that
each node will try to find a less loaded node toeach node will try to find a less loaded node towhich it would try to migrate one of itswhich it would try to migrate one of itsprocessesprocessesWhen a node is low on memory a different algorithmWhen a node is low on memory a different algorithmprevails memory usheringprevails memory ushering
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 33
Mosix info variablesThese are the variables exchanged between Mosix nodes toThese are the variables exchanged between Mosix nodes tofeed the distributed load balancing algorithmfeed the distributed load balancing algorithm Load (based on a unit of a P IILoad (based on a unit of a P II--400Mhz)400Mhz) Number of processorsNumber of processors Speed of processorSpeed of processor Memory usedMemory used Raw Memory UsedRaw Memory Used Total MemoryTotal MemoryLook at Look at httphpcsissaitwalrusmjmhttphpcsissaitwalrusmjm for a java appletfor a java appletshowing these values over a Mosix clustershowing these values over a Mosix cluster
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 34
Memory ushering algorithm
Normally only the load balancing algorithm is activeNormally only the load balancing algorithm is activeWhen memory is under a threshold (paging) the memory When memory is under a threshold (paging) the memory ushering algorithm is started and prevails the load balancingushering algorithm is started and prevails the load balancingalgorithmalgorithmThis algorithm would try to migrate the smallest runningThis algorithm would try to migrate the smallest runningprocess to the node with more free memoryprocess to the node with more free memoryEventually processes are migrated between remote nodesEventually processes are migrated between remote nodes
to create sufficient space on a single nodeto create sufficient space on a single node
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 35
MOSIX network protocolsThese protocols are used to exchange load info amongThese protocols are used to exchange load info amongthe nodes and to negotiate and carry out process migrationsthe nodes and to negotiate and carry out process migrationsbetween nodes Mosix usesbetween nodes Mosix uses
UDP to txfer load infoUDP to txfer load info TCP to migrate processesTCP to migrate processes
These communications are not encrypted authenticatedThese communications are not encrypted authenticatedprotected therefore itrsquos better to use a private network for thprotected therefore itrsquos better to use a private network for theeMNPMNP(Unfortunately the sockets used by Mosix are not reported by (Unfortunately the sockets used by Mosix are not reported by
standard tools eg netstat)standard tools eg netstat)
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 36
MOSIX process migration
Itrsquos a Itrsquos a kernel level migrationkernel level migration (itrsquos done without (itrsquos done without the necessity of any modification to the programs the necessity of any modification to the programs because Mosix changes the linux kernel)because Mosix changes the linux kernel)
fully transparent fully transparent the process involved doesrsquont have any need to know about the migration
strong residual dependencythe migrated process is completely dependent on the deputy for system calls( and particularly IO and networking)
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 37
Mosix migration transparency
Mosix achieves complete transparency at the price Mosix achieves complete transparency at the price of some efficiency interposing to system calls a of some efficiency interposing to system calls a link layer that redirects their execution to the link layer that redirects their execution to the originating node of the process(originating node of the process(Unique Home Unique Home NodeNode) )
The skeleton of the process that remains on the The skeleton of the process that remains on the originating node to execute the system calls originating node to execute the system calls receive signals perform IO is called in Mosix the receive signals perform IO is called in Mosix the deputy deputy (what Condor calls shadow)(what Condor calls shadow)
The migrated process is called the The migrated process is called the remoteremote
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 38
MOSIX deputyremotemechanism
Deputy
Link layer
Remote
Link layerMNP(MosixNetworkProtocol)
UHN(UniqueHomeNode)
Runningnode
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 39
Mosix performance1 While a CPU bound process obtains clear advantages from Mosix
being very little penalized by process migration IO bound and network bound processes suffer big drawbacks and are usually notgood candidates for migration
A first solution addressing only part of the problems has been the development of a daemon to support remote initiation (MExecMPmake) of processes With the help of this daemon processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec) thus avoiding the creation of a bottle-neck single Home Node where all the IO would be performed Gmake is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indication in the MExec tarball)
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 40
Mosix performance2A more general solution to the IO bound processesA more general solution to the IO bound processesperformance problem under migration is anyway envisionedperformance problem under migration is anyway envisionedto be the use of a to be the use of a cache coherent global file systemcache coherent global file system cache coherentcache coherent clients keep their caches synchronized or only 1 clients keep their caches synchronized or only 1
cache at the server node is keptcache at the server node is kept global every file can be seen from every node with the same name every file can be seen from every node with the same name If such a file system can be used then IO operations can be exeIf such a file system can be used then IO operations can be executedcuteddirectly from the running node without resorting to redirect thedirectly from the running node without resorting to redirect them to them to theUnique Home Node Mosix recently has incorporated the possibilitUnique Home Node Mosix recently has incorporated the possibility toy todeclare that a file system is a Direct File System Accessdeclare that a file system is a Direct File System AccessFor such file systems Mosix will perform IO operation directly For such file systems Mosix will perform IO operation directly on the on the running noderunning node
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 41
MOSIX dfsa (direct file system access)
Deputy
Link layer
Remote
Link layerMNP (MosixNetworkProtocol)
DFSA access
UHN(UniqueHome Node)
Runningnode
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS
Nov 242000 rinnocente 42
Cache coherent global distributed file systems
To be a MOSIX DFSA itrsquos not sufficient to be a cacheTo be a MOSIX DFSA itrsquos not sufficient to be a cache--coherentcoherentglobal file system In fact some additional functions (not availglobal file system In fact some additional functions (not available in stdable in std
Linux superLinux super--blocks) must be available at the superblocks) must be available at the super--block levelblock level identifycomparereconstructidentifycomparereconstructA prototype DFSA that is just a small software layer over the stA prototype DFSA that is just a small software layer over the std Linux d Linux file system is file system is MFS (Mosix File System) provided directly by the MOSIX provided directly by the MOSIX team Other possibile candidate DFSA areteam Other possibile candidate DFSA are GFS GFS (GlobalFileSystem)(GlobalFileSystem) xFSxFS (Serverless File System)(Serverless File System)It seems now stable the integration of Mosix with GFSIt seems now stable the integration of Mosix with GFS