
1

Multiprocessor Scheduling

Module 3.1

For a good summary on multiprocessor and real-time scheduling,

visit: http://www.cs.uah.edu/~weisskop/osnotes_html/M8.html

2

Classifications of Multiprocessor Systems

• Loosely coupled multiprocessors, or clusters
– Each processor has its own memory and I/O channels

• Functionally specialized processors
– Such as an I/O processor
– e.g., an Nvidia GPGPU
– Controlled by a master processor

• Tightly coupled multiprocessing (MCMP)
– Processors share main memory
– Controlled by the operating system
– More economical than clusters

3

Types of parallelism

• Bit-level parallelism

• Instruction-level parallelism

• Data parallelism

• Task parallelism (our focus)

4

Synchronization Granularity

• Refers to the frequency of synchronization among processes in the system

• Five classes exist
– Independent (SI is not applicable)
– Very coarse (2000 < SI < 1M)
– Coarse (200 < SI < 2000)
– Medium (20 < SI < 200)
– Fine (SI < 20)

SI is the synchronization interval, and is measured in instructions.
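The class boundaries above can be turned into a small lookup. A minimal sketch; how to treat an SI falling exactly on a boundary (20, 200, 2000) is my assumption, since the slide leaves it ambiguous:

```python
def classify_granularity(si):
    """Map a synchronization interval SI (in instructions) to its
    granularity class, per the boundaries on this slide.
    Boundaries are treated as half-open intervals (an assumption)."""
    if si is None:
        return "independent"   # SI is not applicable
    if si < 20:
        return "fine"
    if si < 200:
        return "medium"
    if si < 2000:
        return "coarse"
    return "very coarse"       # 2000 < SI < 1M

print(classify_granularity(50))    # medium
print(classify_granularity(None))  # independent
```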

5

Wikipedia on Fine-grained, coarse-grained, and embarrassing parallelism

• Applications are often classified according to how often their subtasks need to synchronize or communicate with each other.

• An application exhibits fine-grained parallelism if its subtasks must communicate many times per second;

• it exhibits coarse-grained parallelism if they do not communicate many times per second,

• and it is embarrassingly parallel if they rarely or never have to communicate. Embarrassingly parallel applications are considered the easiest to parallelize.

6

Independent Parallelism

• Multiple unrelated processes

• Separate application or job, e.g. spreadsheet, word processor, etc.

• No synchronization

• When more than one processor is available
– The average response time to users is lower

7

Coarse and Very Coarse-Grained Parallelism

• Very coarse: distributed processing across network nodes to form a single computing environment

• Coarse: Synchronization among processes at a very gross level

• Good for concurrent processes running on a multiprogrammed uniprocessor
– Can be supported on a multiprocessor with little change

8

Medium-Grained Parallelism

• Parallel processing or multitasking within a single application

• Single application is a collection of threads

• Threads usually interact frequently, leading to medium-level synchronization

9

Fine-Grained Parallelism

• Highly parallel applications

• Synchronization every few instructions (on very short events)

• Fills the gap between ILP (instruction-level parallelism) and medium-grained parallelism

• Can be found in small inner loops
– Use of MPI and OpenMP programming models

• The OS should not intervene; this is usually handled in hardware

• In practice, this is a very specialized and fragmented area

10

Scheduling

• Scheduling on a multiprocessor involves three interrelated design issues:
– Assignment of processes to processors
– Use of multiprogramming on individual processors
• Makes sense for processes (coarse-grained)
• May not be good for threads (medium-grained)
– Actual dispatching of a process
• What scheduling policy should we use: FCFS, RR, etc.?

Sometimes a very sophisticated policy becomes counterproductive.

11

Assignment of Processes to Processors

• Treat processors as a pooled resource and assign processes to processors on demand

• Or permanently assign each process to a processor
– Dedicate a short-term queue to each processor
– Less overhead: each processor does its own scheduling on its own queue
– Disadvantage: a processor could sit idle (with an empty queue) while another processor has a backlog
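The idle-while-backlogged drawback can be seen in a toy simulation (the workload numbers are invented for illustration): with static per-processor queues, one queue drains while the other still holds work, and no rebalancing occurs.

```python
from collections import deque

# Hypothetical static assignment: CPU 0's queue drains early while
# CPU 1 still has a backlog; permanent assignment cannot rebalance.
queues = [deque([2]), deque([3, 3, 3])]   # remaining burst times per CPU

idle_time = [0, 0]
for t in range(9):                        # 9 unit-time ticks
    for cpu, q in enumerate(queues):
        if q:
            q[0] -= 1                     # run head-of-queue for one tick
            if q[0] == 0:
                q.popleft()               # job finished
        else:
            idle_time[cpu] += 1           # idles while the other CPU works

print(idle_time)   # CPU 0 idles for 7 ticks; CPU 1 never idles
```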

12

Assignment of Processes to Processors

• Global queue
– Schedule to any available processor
– During its lifetime, a process may run on different processors at different times
– In an SMP architecture, context switching can be done at small cost

• Master/slave architecture
– Key kernel functions always run on a particular processor
– The master is responsible for scheduling
– A slave sends service requests to the master
– Synchronization is simplified
– Disadvantages:
• Failure of the master brings down the whole system
• The master can become a performance bottleneck

13

Assignment of Processes to Processors

• Peer architecture
– The operating system can execute on any processor
– Each processor does self-scheduling from a pool of available processes
– Complicates the operating system:
• Must make sure two processors do not choose the same process
• Needs lots of synchronization

14

Process Scheduling in Today’s SMP

• M/M/M/K queueing system
– A single queue holds all processes
– Multiple queues may be used for priorities
– All queues feed into the common pool of processors

• The specific scheduling discipline is less important with more than one processor
– A simple FCFS discipline with static priorities may suffice for a multiprocessor system
– Illustrated by the graph on p. 460
– In conclusion, the specific scheduling discipline matters much less with SMP than with a uniprocessor
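FCFS dispatch of a single queue onto a pool of processors can be sketched in a few lines (the burst times are hypothetical, and all jobs are assumed to arrive at t = 0):

```python
import heapq

def fcfs_completion_times(bursts, num_cpus):
    """Dispatch jobs in FCFS order to whichever processor frees up
    first; return each job's completion time (all arrive at t = 0)."""
    free_at = [0] * num_cpus          # time at which each CPU is next free
    heapq.heapify(free_at)
    done = []
    for burst in bursts:              # FCFS: queue order only, no priorities
        start = heapq.heappop(free_at)  # earliest-available processor
        finish = start + burst
        done.append(finish)
        heapq.heappush(free_at, finish)
    return done

print(fcfs_completion_times([4, 2, 3, 1], num_cpus=2))   # [4, 2, 5, 5]
```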

15

Threads

• A thread executes separately from the rest of its process

• An application can be a set of threads that cooperate and execute concurrently in the same address space

• Running threads on separate processors can yield a dramatic gain in performance

16

Multiprocessor Thread Scheduling – Four General Approaches (1/2)

• Load sharing
– Processes are not assigned to a particular processor

• Gang scheduling
– A set of related threads is scheduled to run on a set of processors at the same time

17

Multiprocessor Thread Scheduling – Four General Approaches (2/2)

• Dedicated processor assignment
– Threads are assigned to a specific processor (each thread runs on its own processor)
– When the program terminates, the processors are returned to the pool of available processors

• Dynamic scheduling
– The number of threads can be altered during the course of execution

18

Load Sharing

• Load is distributed evenly across the processors

• No centralized scheduler required
– The OS runs on every processor to select the next thread

• Uses a global queue of ready threads
– Usually an FCFS policy
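A load-sharing sketch using Python's thread-safe `queue.Queue` as the global ready queue. Here the "processors" are plain threads and the "ready threads" are numbered work items; this is an illustration of the structure, not an OS scheduler:

```python
import queue
import threading

# One global FCFS ready queue shared by all "processors".
ready = queue.Queue()
for job in range(6):
    ready.put(job)

completed = []
lock = threading.Lock()

def processor():
    """Scheduler loop run by every processor: pull the next ready
    item from the global queue until it is empty."""
    while True:
        try:
            job = ready.get_nowait()   # mutual exclusion is inside Queue
        except queue.Empty:
            return
        with lock:
            completed.append(job)      # "run" the job

cpus = [threading.Thread(target=processor) for _ in range(3)]
for c in cpus:
    c.start()
for c in cpus:
    c.join()

print(sorted(completed))   # all six jobs completed; pull order varies
```

Note that `Queue` serializes access internally, which is exactly the central-queue mutual exclusion the next slide lists as a bottleneck.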

19

Disadvantages of Load Sharing

• The central queue needs mutual exclusion
– It may become a bottleneck when more than one processor looks for work at the same time
– A noticeable problem when there are many processors

• A preempted thread is unlikely to resume execution on the same processor
– Cache use is less efficient

• If all threads are in the global queue, the threads of a given program will not gain access to the processors at the same time
– Performance is compromised if coordination among the threads is high

20

Gang Scheduling

• Simultaneous scheduling of threads that make up a single process

• Useful for applications where performance severely degrades when any part of the application is not running

• Threads often need to synchronize with each other
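A toy gang schedule can be laid out as one time slice per gang, with every thread of a gang running together across the processors (the gang sizes and the 4-CPU count are invented for illustration):

```python
# Hypothetical gangs: each maps to its number of related threads.
gangs = {"A": 4, "B": 2, "C": 3}
num_cpus = 4

schedule = []
for name, nthreads in gangs.items():
    # All of a gang's threads occupy CPUs in the same time slice, so
    # they can synchronize without blocking on a descheduled peer.
    # Leftover CPUs idle for that slice, a known cost of gang scheduling.
    slice_ = [f"{name}{i}" for i in range(nthreads)]
    slice_ += ["idle"] * (num_cpus - nthreads)
    schedule.append(slice_)

for t, slice_ in enumerate(schedule):
    print(t, slice_)
```

Slice 1 shows the trade-off directly: gang B has only two threads, so two of the four processors idle for that slice.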

21

Advantages of Gang Scheduling

• Closely related threads execute in parallel

• Synchronization blocking is reduced

• Less context switching

• Scheduling overhead is reduced, since a single synchronization action (e.g., one signal()) may affect many threads

22

Scheduling Groups – two time-slicing divisions

23

Dedicated Processor Assignment – Affinitization (1/2)

• When an application is scheduled, each of its threads is assigned a processor that remains dedicated (affinitized) to that thread until the application runs to completion

• No multiprogramming of processors, i.e., one processor per thread, and no other thread on that processor

• Some processors may sit idle when their threads block

• Eliminates context switches; with certain types of applications, this can save enough time to compensate for the possible idle-time penalty

• In a highly parallel environment with hundreds of processors, processor utilization is not a major issue, but performance is; the total avoidance of switching can result in substantial speedup

24

Dedicated Processor Assignment – Affinitization (2/2)

• In dedicated assignment, when the number of threads exceeds the number of available processors, efficiency drops
– Due to thread preemption, context switching, suspension of other threads, and cache pollution

• Notice that both gang scheduling and dedicated assignment are more concerned with allocation issues than scheduling issues

• The important question becomes "How many processors should a process be assigned?" rather than "How shall I choose the next process?"

25

Dynamic Scheduling

• The number of threads within a process is variable

• The work of scheduling is shared between the OS and the application

• When a job originates, it requests a certain number of processors; the OS grants some or all of the request, based on the number of processors currently available

• The application itself can then decide which threads run when, and on which processors; this requires language support, as provided by thread libraries

• When processors become free due to the termination of threads or processes, the OS can allocate them as needed to satisfy pending requests

• Simulations have shown that this approach is superior to both gang scheduling and dedicated assignment
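The OS side of this protocol, granting some or all of a request, queueing the remainder, and handing processors back as they free up, can be sketched as follows (the class and method names are hypothetical, chosen for illustration):

```python
class ProcessorPool:
    """Toy OS-side allocator for dynamic scheduling."""

    def __init__(self, total):
        self.free = total
        self.pending = []                  # FCFS list of (job, still_wanted)

    def request(self, job, wanted):
        """Grant some or all of the request; queue the shortfall."""
        granted = min(wanted, self.free)
        self.free -= granted
        if granted < wanted:
            self.pending.append((job, wanted - granted))
        return granted

    def release(self, count):
        """Processors freed by finished threads go to pending requests."""
        self.free += count
        while self.pending and self.free:
            job, need = self.pending.pop(0)
            grant = min(need, self.free)
            self.free -= grant
            if grant < need:               # still short: stay at queue head
                self.pending.insert(0, (job, need - grant))

pool = ProcessorPool(total=8)
print(pool.request("A", 6))   # 6: fully granted
print(pool.request("B", 4))   # 2: partially granted, 2 left pending
pool.release(3)               # some of A's threads finish; B gets its 2
print(pool.free)              # 1
```

Which of its granted processors each of B's threads then runs on is the application's decision, matching the OS/application split described above.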