
Page 1: Multiprocessor scheduling

Embedded and Networked Systems
Contact: [email protected] (appointment by email)
IN4343 Real-Time Systems, Lecture 10

Page 2:

Sources

Book (for interested students)
• Multiprocessor Scheduling for Real-Time Systems (Embedded Systems), 2015 Edition
  • By Sanjoy Baruah, Marko Bertogna, Giorgio Buttazzo
  • Available at https://www.springer.com/gp/book/9783319086958

Papers (for interested students)
• A survey of hard real-time scheduling for multiprocessor systems
  • By Robert I. Davis and Alan Burns, University of York, U.K.
  • Available at https://dl.acm.org/citation.cfm?id=1978814
• Global Scheduling Not Required: Simple, Near-Optimal Multiprocessor Real-Time Scheduling with Semi-Partitioned Reservations
  • By Bjorn Brandenburg and M. Gül, Max Planck Institute, Germany

Slides (for interested students)
• Giorgio Buttazzo:
  • On multiprocessor systems (part 1): http://retis.sssup.it/~giorgio/slides/cbsd/mc1-intro-6p.pdf
  • On multiprocessor systems (part 2): http://retis.sssup.it/~giorgio/slides/cbsd/mc2-sched-6p.pdf
• Alessandra Melani:
  • On global scheduling (part 1): http://retis.sssup.it/~giorgio/slides/cbsd/Melani1-global.pdf
  • On global scheduling (part 2): http://retis.sssup.it/~giorgio/slides/cbsd/Melani2-RTA-DAG.pdf
  • On the OpenMP real-time task model: http://retis.sssup.it/~giorgio/slides/cbsd/Melani3-openMP.pdf
  • On semi-partitioning algorithms: http://retis.sssup.it/~giorgio/slides/cbsd/Melani4-semipar.pdf
• Emanuele Ruffaldi:
  • On parallel programming - OpenMP (part 1): http://retis.sssup.it/~giorgio/slides/cbsd/Ruffaldi-omp1.pdf
  • On parallel programming - OpenMP (part 2): http://retis.sssup.it/~giorgio/slides/cbsd/Ruffaldi-omp2.pdf
  • On parallel programming - GPU (part 1): http://retis.sssup.it/~giorgio/slides/cbsd/Ruffaldi-gpu1.pdf
  • On parallel programming - GPU (part 2): http://retis.sssup.it/~giorgio/slides/cbsd/Ruffaldi-gpu2.pdf

Page 3:

Sources

Sources of some of the slides used in this lecture:
• From Giorgio Buttazzo's website: http://retis.sssup.it/~giorgio/rts-MECS.html
  • And the CBSD course: http://retis.sssup.it/~giorgio/CBSD.html
• From Alessandra Melani (in the CBSD course, 2016): http://retis.sssup.it/~giorgio/slides/cbsd/Melani1-global.pdf
• From Marko Bertogna (UniMoRe): http://algo.ing.unimo.it/people/marko/
  • Source: http://algo.ing.unimo.it/people/marko/presentations/Multiproc_course.zip

Page 4:

Moore’s Law: the number of transistors per chip doubles every 24 months.

Page 5:

The limit

At the launch of the Pentium 4, Intel expected single-core chips to scale up to 10 GHz using gates below 90 nm. However, the fastest Pentium 4 never exceeded 4 GHz.

Why did that happen?

The leakage current negatively affects the static power consumption of the chip.

Recall:
• Dynamic power (Pd): consumed during operation
• Static power (Ps): consumed even when the circuit is off (caused by leakage current)

As devices scale down in size, the gate oxide thickness decreases, resulting in a larger leakage current.

Page 6:

The limit

If processor performance had kept improving by increasing the clock frequency, chip temperatures would have reached levels beyond the capability of current cooling systems.

Page 7:

However, future real-time systems require a lot of computational power.

Page 8:

Switching to multicore systems

• Industrial trend A: consolidation
  • Port the existing functionality of multiple processors onto a single multicore platform
  • Example from the automotive industry: ECU consolidation
  • Why is it helpful?
    • It reduces hardware cost and communication delays

• Industrial trend B: parallel programming techniques
  • By parallelizing the application code, each code segment can run on a different core in parallel; hence, tasks will have shorter response times
  • Main challenges:
    • How to split the code into parallel segments that can be executed simultaneously?
    • What to do with data or control-flow dependencies?
    • How to allocate such segments to different cores?
    • How to analyze the worst-case execution time of the applications on a multicore platform?

How to exploit multicores in real-time systems?

Page 9:

Industry’s challenges

Problem 1: Parallelizing legacy code implies a tremendous cost and effort due to:

• re-designing the application

• re-writing the source code

• updating the operating system

• writing new documentation

• testing the system

• software certification

Problem 2: Executing tasks on a multicore platform causes many conflicts/interferences on software and hardware resources

• Multiple tasks (running on multiple cores) may want to access the memory at the same time
• Tasks may evict each other's cache blocks in the shared caches
• The WCET of a task will depend on the set of co-running tasks on the platform (this was not the case for single-processor systems)

Page 10:

WCET in multicore

• Test by Lockheed Martin Space Systems on an 8-core platform

The WCET increases because of the competition among cores in using shared resources.

• Main memory

• Memory-bus

• Last-level cache

• I/O devices

Page 11:

Types of memory

Page 12:

Cache in multicore systems

Possible solution: partition the last-level cache between the tasks

Issue: reducing the cache size available to a task increases its WCET and hence the task's utilization!

[Figure: L3 cache partitioned among n = 40 tasks on 4 cores]

Any solution idea?

The diagram is from the paper “WCET(m) Estimation in Multi-Core Systems using Single Core Equivalence”, Renato Mancuso, Rodolfo Pellizzoni, Marco Caccamo, Lui Sha, Heechul Yun.

Page 13:

Cache in multicore systems

Possible solution: partition the last-level cache between the tasks

Issue: reducing the cache size available to a task increases its WCET

A nice direction is to
• use non-preemptive scheduling on the cores (to avoid cache conflicts in L1 and L2)
• partition the shared L3 cache by the number of cores (rather than by the number of tasks, which is usually much larger than the number of cores)

[Figure: L3 cache partitioned per core instead of per task (n = 40 tasks on 4 cores)]

Page 14:

Memory banks

• To reduce memory conflicts, the DRAM is divided into banks:

Page 15:

Main memory and I/O conflicts

Still, when cores concurrently access the main memory, DRAM accesses have to be queued, causing a significant slowdown:

A similar problem occurs when tasks running in different cores request to access I/O devices at the same time:

Page 16:

How bad can memory conflicts be?

• Diffbank: Core0 -> Bank0, Core1-3 -> Bank1-3
• Samebank: all cores -> Bank0

Test on an Intel Xeon

Page 17:

Types of multicore systems

ARM's MPCore (identical cores)
• 4 identical ARMv6 cores

STI's Cell Processor (heterogeneous cores)
• One Power Processor Element (PPE)
• 8 Synergistic Processing Elements (SPEs)

Page 18:

Example: ARM big.LITTLE architecture

There are many open scheduling problems when it comes to heterogeneous platforms. In this course, we focus on identical cores.

Page 19:

Parallel real-time tasks

Sequential task

Page 20:

Parallel programming

• Existing parallel programming models:
  • OpenMP
  • MPI
  • IBM's X10
  • Intel's TBB (abstraction for C++)
  • Sun's Fortress
  • Cray's Chapel
  • Cilk (Cilk++)
  • Codeplay's Sieve C++
  • RapidMind Development Platform

Page 21:

Task models for parallel real-time tasks

• Representing parallel code requires a more complex structure, such as a graph (usually a directed acyclic graph, a.k.a. DAG):
  • In a DAG, connections that close a cycle are forbidden
  • DAGs can be conditioned (e.g., by an if-then-else block)
  • OR nodes represent conditional statements (only one successor is executed)
  • AND nodes represent parallel computations (all successors must be executed)

Page 22:

Structured parallelism

• Fork-Join graphs (a special type of DAG)
  • After a fork node, all immediate successors must be executed (the order does not matter).
  • A join node is executed only after all of its immediate predecessors have completed.
  • The nested fork-join model can have nested forks and joins.
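The fork-join semantics above can be sketched in code; a minimal illustration (the segment names and costs below are hypothetical, not from the slides):

```python
# A minimal fork-join sketch: a fork node spawns parallel segments, and the
# join point runs only after every immediate predecessor has completed.
from concurrent.futures import ThreadPoolExecutor

def segment(name, cost):
    # Stand-in for a code segment with execution time `cost`.
    return (name, cost)

def fork_join():
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Fork: all immediate successors of the fork node may run in parallel.
        futures = [pool.submit(segment, f"s{j}", j) for j in (1, 2, 3)]
        # Join: wait until all of them are done before proceeding.
        results = [f.result() for f in futures]
    return sorted(results)

print(fork_join())  # [('s1', 1), ('s2', 2), ('s3', 3)]
```

The final sort makes the result independent of the (arbitrary) completion order of the parallel segments.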

Page 23:

Assumptions and parameters

• Arrival pattern
  • Periodic (activations separated by exactly one period T)
  • Sporadic (minimum inter-arrival time T)
  • Aperiodic (no inter-arrival bound exists)

• Is preemption allowed at arbitrary times?

• Is task migration allowed?

Task parameters: $\tau_i = (\{c_{i,1}, c_{i,2}, \dots, c_{i,m_i}\}, D_i, T_i)$

[Figure: example segment graph with nodes S1–S5]

Page 24:

Example (the numbers represent the execution times of the segments)

Assume we have an infinite number of cores: what is the shortest possible WCRT for this task?

Tasks with intra-task precedence constraints are harder to schedule! (Even with an infinite number of cores, we cannot finish this task earlier than 7 units of time.)

[Figure: a schedule of the segments on Core1-Core3]

Page 25:

Important factors

Task parameters: $\tau_i = (\{c_{i,1}, c_{i,2}, \dots, c_{i,m_i}\}, D_i, T_i)$

• Sequential computation time (volume): $C_i^s = \sum_{j=1}^{m_i} c_{i,j}$
  • CPU utilization (of a parallel task): $U_i = C_i^s / T_i$
  • If $C_i^s \le D_i$ holds, the task is schedulable on a single-core system.
• Critical path length $C_i^p$: the length of the longest path in the graph.

[Figure: example DAG with segment execution times 1, 2, 3, 4, 1, 5, 8, 3, 1]

What is the length of the critical path in this graph?

If $\sum U_i > m$, then the task set is certainly NOT schedulable on a multicore platform with $m$ cores.
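The volume and the critical path length of a DAG task can be computed in one pass over the graph in topological order; a minimal sketch, using a small hypothetical graph rather than the one in the figure:

```python
# Longest (critical) path of a DAG of segments, computed in topological order.
from graphlib import TopologicalSorter

costs = {"s1": 2, "s2": 4, "s3": 8, "s4": 1}          # execution time per segment
preds = {"s1": [], "s2": ["s1"], "s3": ["s1"], "s4": ["s2", "s3"]}

def critical_path_length(costs, preds):
    finish = {}  # earliest finish time of each segment with unlimited cores
    for node in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds[node]), default=0)
        finish[node] = start + costs[node]
    return max(finish.values())

volume = sum(costs.values())                      # sequential computation time C_i^s
print(volume, critical_path_length(costs, preds)) # 15 and 2+8+1 = 11
```

With unlimited cores the task can never finish earlier than the critical path length, which is exactly why it lower-bounds the WCRT.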

Page 26:

Important factors (continued): the answer to the question "What is the length of the critical path in this graph?" is 16.

Page 27:

Important factors (continued)

What would be a necessary schedulability test for a DAG task?

$C_i^p \le D_i$

(the critical path must fit within the relative deadline, even with infinitely many cores)
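The two necessary (but not sufficient) conditions for a DAG task set, the critical path fitting in the deadline and the total utilization not exceeding the core count, can be checked mechanically; a small sketch with hypothetical task values:

```python
# Necessary (not sufficient) schedulability checks for DAG tasks on m cores:
#   per task: critical path fits in the deadline (C_p <= D)
#   per set:  total utilization does not exceed the core count (sum U_i <= m)

def necessary_test(tasks, m):
    # tasks: list of dicts with volume C_s, critical path C_p, deadline D, period T
    if any(t["C_p"] > t["D"] for t in tasks):
        return False
    return sum(t["C_s"] / t["T"] for t in tasks) <= m

tasks = [
    {"C_s": 16, "C_p": 11, "D": 12, "T": 12},  # hypothetical DAG task
    {"C_s": 6,  "C_p": 6,  "D": 10, "T": 10},
]
print(necessary_test(tasks, m=2))  # True: both conditions hold
```

Failing either check proves infeasibility; passing both proves nothing, since interference between tasks is not modeled.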

Page 28:

Multiprocessor models

• Identical
  Processors are of the same type and have the same speed. Each task has the same WCET on every processor.
• Uniform
  Processors are of the same type but may have different speeds. Task WCETs are smaller on faster processors.
• Heterogeneous
  Processors can be of different types. The WCET of a task depends on both the processor type and the task itself.

Page 29:

Real-time scheduling for multicore platforms

Our assumptions:

• Identical multicore systems (all cores are the same)
• The WCET of each task is a sound upper bound on the actual execution time of the task for any possible set of co-running tasks and any scenario
• Unless mentioned explicitly, we assume that each task has one sequential code segment

Page 30:

Classification of multiprocessor scheduling algorithms

Partitioned scheduling: tasks cannot migrate between cores
Semi-partitioned scheduling: some of the tasks can migrate between cores
Global scheduling: any task is allowed to migrate between cores
Cluster scheduling: some of the tasks can migrate between some pre-specified cores

Fixed-priority scheduling: each task has a fixed priority
Job-level fixed-priority (JLFP) scheduling: each job has a fixed priority (e.g., EDF and FP are both JLFP)
Job-level dynamic-priority (JLDP) scheduling: each job has a varying priority

Moving down each list provides better schedulability, but at a higher overhead.

Page 31:

Partitioned scheduling

• Each processor manages its own ready queue
• The processor for each task is determined offline
• The processor cannot be changed at runtime

[Figure: per-processor ready queues, with one running task per core]

Page 32:

Global scheduling

• The system manages a single queue of ready tasks
• The processor is determined at runtime
• During execution, a task can migrate to another processor

[Figure: a single global ready queue feeding all cores]

Page 33:

Global scheduling

[Figure: a global queue, ordered according to a given policy, feeding Core1-Core3]

• The first m tasks are scheduled upon the m cores
• When a task completes, the next one in the queue is scheduled on the available core
• When a higher-priority task arrives, it preempts the task with the lowest priority among the executing ones

As a result, tasks may MIGRATE between cores!

Page 34:

Exam example: global rate monotonic

• What would be the schedule of these tasks on a multicore platform with 3 cores?

τ1: C1 = 1, T1 = 2
τ2: C2 = 3, T2 = 3
τ3: C3 = 5, T3 = 6
τ4: C4 = 4, T4 = 6

[Figure: the global rate-monotonic schedule on 3 cores over [0, 6); τ4 misses its deadline at t = 6]
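The schedule above can be reproduced with a unit-time simulation of global preemptive rate monotonic (a sketch of my own, not code from the course): at every slot the m highest-priority ready jobs run, with a smaller period meaning higher priority and ties broken by task index.

```python
# Unit-time simulation of global preemptive rate-monotonic scheduling on m cores.

def global_rm(tasks, m, horizon):
    # tasks: {name: (C, T)} with implicit deadlines D = T
    jobs = []          # active jobs: [remaining, absolute deadline, priority, name]
    misses = set()
    for t in range(horizon):
        for name, (C, T) in tasks.items():
            if t % T == 0:                       # periodic release
                jobs.append([C, t + T, (T, name), name])
        jobs.sort(key=lambda j: j[2])            # RM priority: smaller T first
        for job in jobs[:m]:                     # the m highest-priority jobs run
            job[0] -= 1
        jobs = [j for j in jobs if j[0] > 0]     # drop completed jobs
        for j in jobs:
            if j[1] == t + 1:                    # deadline reached with work left
                misses.add(j[3])
        jobs = [j for j in jobs if j[1] > t + 1]
    return misses

tasks = {"tau1": (1, 2), "tau2": (3, 3), "tau3": (5, 6), "tau4": (4, 6)}
print(global_rm(tasks, m=3, horizon=6))  # {'tau4'}
```

Under global RM with this task set, only τ4 still has work left at its deadline, matching the deadline miss shown on the slide.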

Page 35:

Exam example: global rate monotonic

Same task set (τ1: C=1, T=2; τ2: C=3, T=3; τ3: C=5, T=6; τ4: C=4, T=6).

Is this task set feasible? Yes.

[Figure: a schedule on 3 cores over [0, 6) in which every job, including τ4's, meets its deadline]

Page 36:

Exam example: other global scheduling policies

Task set: τ1 (C1=1, T1=2), τ2 (C2=3, T2=3), τ3 (C3=5, T3=6), τ4 (C4=4, T4=6).

• Global fixed-priority with priorities: P1 < P2 < P4 < P3
• Global non-preemptive fixed-priority with priorities: P2 < P1 < P3 < P4 (deadline miss for τ3)
• Global EDF (ties in deadlines are broken by task index): deadline miss for τ4
• Global EDF (if there is a tie in deadlines, τ4 wins! God knows why :D): no deadline miss

[Figures: the corresponding schedules on 3 cores over [0, 6)]

Page 37:

Exam example: other global scheduling policies

Task set: τ1 (C1=1, T1=2), τ2 (C2=3, T2=3), τ3 (C3=5, T3=6), τ4 (C4=4, T4=6).

[Figure: a given job-level fixed-priority schedule on 3 cores over [0, 6)]

What should each job's priority be to generate this schedule?

P1,1 < P2,1 < P3,1 < P1,2 < P2,2 < P4,1 < P1,3

• Job J1,2 must be able to preempt J4,1; hence, its priority should be higher than P4,1.
• Job J1,3 must NOT be able to preempt any job; hence, its priority should be the lowest.

Page 38:

Exam example: global scheduling of DAGs

Task set of two DAG tasks:
τ1: T1 = 3, with segments s1,1, s1,2, s1,3
τ2: T2 = 6, with segments s2,1, s2,2, s2,3, s2,4

Schedule using global fixed-priority: P1 < P2

[Figure: the two DAGs with their segment execution times, and the resulting schedule of the segments on 3 cores over [0, 6)]

Page 39:

Global scheduling

• Work-conserving scheduler
  • No processor is ever idled when a task is ready to execute.
• Non-work-conserving scheduler
  • A processor may be left idle even if there are ready jobs in the system
  • (open research area) :D

[Figure: a global ready queue feeding Core1-Core3]

Page 40:

Global scheduling: advantages and disadvantages

Advantages:
• Allows parallel execution
• Load balancing between the cores (being able to dispatch jobs to idle cores)
• Easier re-scheduling (dynamic loads, selective shutdown, etc.)
• Lower average response time (a known result from queueing theory)
• More efficient reclaiming and overload management

Disadvantages / open issues:
• Number of preemptions
• Migration cost: can be mitigated by proper hardware (e.g., MPCore's Direct Data Intervention)
• Scheduling overheads
• Few schedulability tests; further research needed
• No job-level fixed-priority scheduling algorithm is optimal

Page 41:

1. Overheads

Global scheduling: disadvantages

Obvious overheads: task migration between cores.

Why does task migration have a big impact on a task's execution time?

[Figure: per-core L1 caches, a shared L2 cache, and main memory (RAM); τ6's data sits in the caches of the core it runs on]

1. τ6 is running, so it gradually loads its data into L2 and then L1.
2. τ3 arrives and enters the ready queue. Since it has higher priority than τ6, it preempts τ6.

Page 42:

(continued)

3. τ2 finishes, and τ6 resumes its execution on the second core.

Page 43:

(continued)

4. τ3 tries to access its data. It may evict τ6's data!

Page 44:

(continued)

5. When τ6 tries to access its data, it will get cache misses, so it has to reload the data all over again!

A migrated task needs to load a lot of data into the cache.
As you can see, the cache becomes a big source of unpredictability.
Co-running tasks affect each other's execution time.

Page 45:

1. Overheads

Global scheduling: disadvantages

Obvious overheads: task migration between cores.
Non-obvious overheads: large scheduling overhead.

Whenever the task on a core completes, it calls the scheduler function, so multiple scheduler instances can run at the same time if multiple tasks finish at the same time. Since each scheduler instance wants to access the global ready queue, and since that queue is a global variable, it must be protected by semaphores or locks. Consequently, scheduler functions called by different cores can frequently block each other!

[Figure: all cores contend for the single global ready queue]
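The serialization described above can be sketched as a shared priority queue guarded by a single lock that every core's scheduler invocation must acquire (a minimal illustration, not the slides' implementation; names are hypothetical):

```python
# A global ready queue shared by all cores: every operation takes the same
# lock, so concurrent scheduler invocations on different cores serialize here.
import heapq
import threading

class GlobalReadyQueue:
    def __init__(self):
        self._heap = []                  # (priority, task name); min = highest priority
        self._lock = threading.Lock()    # protects the shared queue

    def push(self, priority, task):
        with self._lock:                 # other cores block here meanwhile
            heapq.heappush(self._heap, (priority, task))

    def pop_highest(self):
        with self._lock:
            return heapq.heappop(self._heap)[1] if self._heap else None

q = GlobalReadyQueue()
for prio, name in [(3, "tau3"), (1, "tau1"), (2, "tau2")]:
    q.push(prio, name)
print(q.pop_highest())  # 'tau1': correct policy, but the lock is the bottleneck
```

Partitioned scheduling avoids exactly this contention, since each core only ever touches its own queue.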

Page 46:

2. Dhall’s effect

Global scheduling: disadvantages

The lower bound on the utilization of a task set that is not schedulable by any work-conserving global scheduling algorithm on a multiprocessor system with m cores is 1.

Namely: regardless of the number of cores in the system, we may be unable to find a feasible schedule for a task set even if its utilization is only slightly above 1.

Page 47:

Dhall's effect
(applies to any global work-conserving policy, including global EDF)

Example: m processors, n = m + 1 tasks

τ_1 … τ_m: C_i = 1, T_i = D_i = T, φ_i = 0, U_i ~ 0 (m light tasks)
τ_{m+1}:   C_i = T, T_i = D_i = T, φ_i = ε, U_i ~ 1 (1 heavy task)

As T → ∞, the total utilization U → 1.

[Figure: under any work-conserving global policy, the m light tasks occupy all m cores at time ε; the heavy task τ_{m+1}, released at ε, starts only after they finish and misses its deadline at T + ε]

This task set is feasible, e.g., by partitioned scheduling or by a non-work-conserving scheduler.
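Dhall's effect can be reproduced with a unit-time sketch of global EDF; the numbers below (m = 2 cores, two light tasks with C = 1 and one heavy task with C = D = T = 10, all released together, ties broken by task index) are a hypothetical instance of the pattern in the table above:

```python
# Unit-time sketch of Dhall's effect under global EDF over one period:
# total utilization is only 0.1 + 0.1 + 1.0 = 1.2, yet G-EDF on 2 cores fails.

def gedf_one_period(jobs, m, horizon):
    # jobs: {name: [remaining, absolute deadline]}, one job per task
    for t in range(horizon):
        ready = sorted((kv for kv in jobs.items() if kv[1][0] > 0),
                       key=lambda kv: (kv[1][1], kv[0]))   # EDF, ties by index
        for name, job in ready[:m]:                        # m cores run in parallel
            job[0] -= 1
    return {name for name, (rem, dl) in jobs.items() if rem > 0}

jobs = {"tau1": [1, 10], "tau2": [1, 10], "tau3": [10, 10]}  # tau3 is the heavy task
print(gedf_one_period(jobs, m=2, horizon=10))  # {'tau3'}
```

Both light tasks occupy both cores during the first slot, so the heavy task can accumulate at most 9 of its 10 units by the deadline. Partitioning instead (heavy task alone on one core, both light tasks on the other) meets every deadline.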

Page 48:

Negative results about global scheduling

• Weak theoretical framework
  • Unknown critical instant
• Global EDF is not optimal
• No global job-level fixed-priority (G-JLFP) scheduler is optimal
• Optimal algorithms exist only for sequential implicit-deadline tasks
  • Examples: PFair and RUN. These algorithms incur a large number of preemptions.
• Many sufficient tests exist (most of them incomparable)

Page 49:

Partitioned Scheduling

• The scheduling problem reduces to:

Bin-packing problem (NP-hard in the strong sense; various heuristics are used: FirstFit, NextFit, BestFit, FFDU, BFDD, etc.)
+
Uniprocessor scheduling problem (well known: EDF with U ≤ 1, RM with response-time analysis, ...)

[Figure: tasks t1-t5 packed onto the cores]

Page 50:

Possible partitioning choices

• Partition by information-sharing requirements
• Partition by functionality
• Use the least possible number of processors, or run at the lowest possible frequency
  • Depends on considerations like fault tolerance, power consumption, temperature, etc.
• Partition to increase schedulability

The first three approaches might not be good for schedulability; real-time systems research has focused extensively on the last one.

Page 51:

Classic partitioning algorithms for real-time systems

Partitioning problem:
Given a set of tasks τ = {τ1, τ2, …, τn} and a multiprocessor platform with m processors, find an assignment from tasks to processors such that each task is assigned to one and only one processor.

It is a bin-packing problem!

Classic solutions:
1. Select a fitness criterion
   • Example: task utilization U_i = C_i / T_i, or task density α_i = C_i / D_i
2. Decide how you want to sort the tasks (decreasing, increasing, or random)
3. Decide what the fitness evaluation method is (i.e., how you will reject an assignment)
4. Use any of the following fitting policies to assign tasks to processors:
   • First fit (FF)
   • Best fit (BF)
   • Worst fit (WF)
   • Random fit (RF)
   • Next fit (NF)
   • …


Partitioning heuristics

• First fit (FF): place each item in the first bin that can contain it.

• Best fit (BF): place each item in the bin that leaves the smallest empty space, i.e., the fullest bin that can still contain it; otherwise start a new bin.

• Worst fit (WF): place each item in the used bin with the largest empty space; otherwise start a new bin.

• Next fit (NF): place each item in the same bin as the last item; if it does not fit, start a new bin.
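As a concrete sketch, the four policies fit in a few lines of Python (illustrative code; the function name, capacity parameter, and tolerance are my own choices, not from the lecture):

```python
def pack(items, policy, cap=1.0, eps=1e-9):
    """Assign item sizes (e.g. task utilizations) to bins of capacity cap."""
    bins = []                # current load of each bin
    last = 0                 # index of the most recently used bin (for NF)
    for u in items:
        fits = [i for i, load in enumerate(bins) if load + u <= cap + eps]
        if policy == "FF":   # first bin that can contain the item
            target = fits[0] if fits else None
        elif policy == "BF": # fullest bin that can still contain it
            target = max(fits, key=lambda i: bins[i]) if fits else None
        elif policy == "WF": # emptiest used bin that can contain it
            target = min(fits, key=lambda i: bins[i]) if fits else None
        else:                # "NF": only try the bin used for the last item
            target = last if bins and bins[last] + u <= cap + eps else None
        if target is None:   # nothing fits: open a new bin
            bins.append(u)
            last = len(bins) - 1
        else:
            bins[target] += u
            last = target
    return bins
```

For the sequence 0.6, 0.5, 0.6, 0.5 the policies already diverge: `pack([0.6, 0.5, 0.6, 0.5], "FF")` needs three bins, while `"NF"` needs four, which is why NF is usually the weakest policy.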


Comparison

• Suppose the current situation is represented in blue, the latest item was put in bin 2, and a new item of size 2 arrives:


Exam example
Fitness policy: first fit
Fitness criterion: task utilization
Sorting: decreasing
Fitness evaluation: U ≤ 1 (preemptive EDF will be used to schedule the tasks assigned to a core)

𝑼𝟏   𝑼𝟐   𝑼𝟑   𝑼𝟒   𝑼𝟓   𝑼𝟔   𝑼𝟕   𝑼𝟖   𝑼𝟗
0.9  0.8  0.5  0.4  0.2  0.2  0.2  0.1  0.1

1. What are the partitions created by the first-fit policy?
   • Proc1: {𝑈1, 𝑈8} → U = 1.0
   • Proc2: {𝑈2, 𝑈5} → U = 1.0
   • Proc3: {𝑈3, 𝑈4, 𝑈9} → U = 1.0
   • Proc4: {𝑈6, 𝑈7} → U = 0.4
   • Proc5: unused
   Total utilization: U = 3.4

2. What is the minimum number of partitions? 4 (since ⌈3.4⌉ = 4)
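The exam example can be checked mechanically. This is an illustrative sketch (helper name and tolerance are mine) applying first fit with the U ≤ 1 test to the utilizations, which are already sorted in decreasing order:

```python
def first_fit(utils, cap=1.0, eps=1e-9):
    """First fit: each utilization goes to the first processor where U stays <= cap."""
    procs = []                      # one list of utilizations per processor
    for u in utils:
        for p in procs:
            if sum(p) + u <= cap + eps:
                p.append(u)
                break
        else:                       # no processor fits: open a new one
            procs.append([u])
    return procs

utils = [0.9, 0.8, 0.5, 0.4, 0.2, 0.2, 0.2, 0.1, 0.1]   # decreasing order
parts = first_fit(utils)
# → [[0.9, 0.1], [0.8, 0.2], [0.5, 0.4, 0.1], [0.2, 0.2]]
```

Four processors are used, and no fewer can work, since the total utilization is 3.4 and ⌈3.4⌉ = 4.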


First-fit algorithm

The first-fit algorithm achieves an approximation factor of 2: the number of bins it uses is no more than twice the optimal number of bins.

The reason: it is impossible for two bins to be at most half full, because that would imply that at some point exactly one bin was at most half full and a new one was opened for an item of size at most V/2 (where V is the size of a bin) — but first fit would have placed that item in the half-full bin. Hence all used bins except at most one are more than half full, which bounds the number of bins by twice the optimum.
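The bound can be probed empirically. The sketch below (my own code, not a proof) packs random instances with first fit and checks that the number of bins never exceeds twice the trivial lower bound ⌈total size / V⌉, which in turn lower-bounds the optimum:

```python
import math
import random

def first_fit_bins(items, V):
    """Return the number of bins first fit uses for the given item sizes."""
    bins = []
    for s in items:
        for i, load in enumerate(bins):
            if load + s <= V:       # first bin with enough room
                bins[i] = load + s
                break
        else:
            bins.append(s)          # open a new bin
    return len(bins)

random.seed(1)
for _ in range(1000):
    items = [random.uniform(0.01, 1.0) for _ in range(random.randint(1, 30))]
    used = first_fit_bins(items, 1.0)
    lower = math.ceil(sum(items))   # no packing can use fewer bins than this
    assert used <= 2 * lower        # the factor-2 bound holds on every trial
```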


Observations

• NF performs poorly, since it does not exploit the empty space in the previous bins.

• FF improves performance by exploiting the empty space available in all the used bins.

• BF tends to fill the used bins as much as possible.

• WF tends to balance the load among the used bins.

However, the performance of each algorithm strongly depends on the input sequence.


Lopez utilization bound for partitioned EDF (with first-fit policy)

The basic bound: any task set with total utilization

  𝑈 ≤ (𝑚 + 1) / 2

is schedulable by partitioned EDF with the first-fit policy.

A refined bound: if 𝑛 > 𝛽 ⋅ 𝑚 and ∀𝑖, 𝑈𝑖 ≤ 𝑈𝑚𝑎𝑥, then the task set is schedulable by EDF+FF if

  𝑈 ≤ (𝛽 ⋅ 𝑚 + 1) / (𝛽 + 1),   where 𝛽 = ⌊1 / 𝑈𝑚𝑎𝑥⌋

𝑈𝑚𝑎𝑥 is the maximum task utilization among all tasks; 𝛽 is the maximum number of tasks with utilization 𝑈𝑚𝑎𝑥 that fit into one processor.
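The two bounds are easy to evaluate; a hedged helper (function and parameter names are mine):

```python
from math import floor

def lopez_bound(m, n=None, u_max=1.0):
    """Utilization bound for partitioned EDF + first fit on m processors."""
    beta = floor(1.0 / u_max)            # tasks of utilization u_max per core
    if n is not None and n > beta * m:
        return (beta * m + 1) / (beta + 1)   # refined bound
    return (m + 1) / 2                       # basic bound
```

For example, with m = 4 cores and every task utilization at most 0.5, β = 2, so any task set with total utilization up to (2·4 + 1)/3 = 3 is guaranteed schedulable (given n > 8 tasks) — noticeably better than the basic (4 + 1)/2 = 2.5.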


Global vs. partitioned scheduling


Global vs. partitioned

• There are task sets that are schedulable only with a global scheduler

• Example: 𝜏1 = (𝐶1 = 1, 𝑇1 = 2); 𝜏2 = (2, 3); 𝜏3 = (2, 3)

It is impossible to find a feasible schedule using the partitioned approach: we cannot schedule any two of the tasks together on one core, since 1/2 + 2/3 > 1 and 2/3 + 2/3 > 1, yet there are three tasks and only two cores.

A feasible schedule exists using global scheduling, for example (over the hyperperiod of 6):
• Processor 1: 𝜏1 in [0, 1], 𝜏2 in [1, 4], 𝜏1 in [4, 5], 𝜏2 in [5, 6]
• Processor 2: 𝜏3 in [0, 2], 𝜏1 in [2, 3], 𝜏3 in [3, 5]

This counterexample is also valid for a global FP algorithm when priorities follow p2 < p1 < p3.
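The impossibility claim is small enough to verify by brute force (an illustrative sketch, not lecture code):

```python
from itertools import product

tasks = [(1, 2), (2, 3), (2, 3)]                 # (C_i, T_i)
feasible = []
for assignment in product([0, 1], repeat=3):     # a core index per task
    loads = [0.0, 0.0]
    for (c, t), core in zip(tasks, assignment):
        loads[core] += c / t                     # utilization on that core
    if all(u <= 1.0 for u in loads):             # EDF test per core
        feasible.append(assignment)

# feasible stays empty: every assignment overloads some core, although the
# total utilization 1/2 + 2/3 + 2/3 = 11/6 < 2 leaves room for a global schedule.
```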


Global vs. partitioned

• There are task sets that are schedulable only with a partitioned scheduler

• Example for 2 cores (assume that each core is scheduled with EDF):

  𝝉𝒊   𝑪𝒊   𝑻𝒊   𝑼𝒊
  1    4    6    0.67
  2    7    12   0.58
  3    4    12   0.33
  4    10   24   0.42

All 4! = 24 global priority assignments lead to a deadline miss, while partitioned EDF succeeds: assign {𝜏1, 𝜏3} to one core and {𝜏2, 𝜏4} to the other; each core then has exactly U = 1.
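The claim can be checked with a small discrete-time simulator of global preemptive fixed-priority scheduling (unit quantum, synchronous release, implicit deadlines; this is my own sketch, not lecture code):

```python
from itertools import permutations

TASKS = [(4, 6), (7, 12), (4, 12), (10, 24)]    # (C_i, T_i), D_i = T_i

def gfp_misses(tasks, prio, m=2, horizon=24):
    """True if global FP with the given task priority order misses a deadline."""
    rem = [0] * len(tasks)                      # remaining work of current jobs
    for t in range(horizon):
        for i, (C, T) in enumerate(tasks):
            if t % T == 0:
                if rem[i] > 0:                  # previous job not finished
                    return True
                rem[i] = C                      # release a new job
        ready = [i for i in prio if rem[i] > 0]
        for i in ready[:m]:                     # m highest-priority jobs run
            rem[i] -= 1
    return any(r > 0 for r in rem)              # jobs due at the hyperperiod

# The slide's claim: every one of the 4! priority orders fails on 2 cores.
all_miss = all(gfp_misses(TASKS, p) for p in permutations(range(4)))
```

Every order misses within the hyperperiod of 24, while the partition {𝜏1, 𝜏3} / {𝜏2, 𝜏4} gives each core exactly U = 1, which uniprocessor EDF handles.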


Global vs. partitioned

Partitioned scheduling:
+ Supported by the automotive industry (e.g., AUTOSAR)
+ No migrations
+ Isolation between cores
+ Mature scheduling framework
+ Low scheduling overhead (no need to access a global ready queue)
× Cannot exploit unused capacity
× Rescheduling not convenient
× NP-hard allocation

Global scheduling:
+ Allows parallel execution
+ Automatic load balancing
+ Lower avg. response time
+ Easier re-scheduling
+ More efficient reclaiming and overload management
+ Generally lower number of preemptions
× Migration costs
× Inter-core synchronization
× Loss of cache affinity
× Weak scheduling framework


Semi-partitioned scheduling

• Tasks are statically allocated to processors, if possible.

• Remaining tasks are split into chunks (subtasks), which are allocated to different processors.


Semi-partitioned scheduling

Note that the subtasks are not independent, but are subject to a precedence constraint: the first chunk 𝜏51 must complete before the second chunk 𝜏52 may start.

This precedence must be managed! This can be done, for example, by assigning an offset to the second segment and a tighter deadline to the first segment of the task that must be split.
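A minimal sketch of that idea (the proportional deadline split is one possible choice; names and structure are illustrative, not from the lecture):

```python
def split_task(C, D, c1):
    """Split a task (C, D) into chunks of c1 and C - c1 time units.

    The first chunk gets a deadline proportional to its share of the work;
    the second chunk is released at that deadline (used as an offset), so it
    can never start before the first chunk is guaranteed to have finished.
    """
    d1 = D * c1 / C                                       # tighter deadline
    chunk1 = {"C": c1,     "offset": 0.0, "D": d1}
    chunk2 = {"C": C - c1, "offset": d1,  "D": D - d1}    # relative deadline
    return chunk1, chunk2

# Example: a task with C = 6, D = 12 split into 4 + 2 time units:
# chunk 1 must finish by time 8; chunk 2 is released at offset 8 with
# 4 time units left before the original deadline.
```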


Clustered scheduling

• A task can only migrate within a predefined subset of processors (cluster).


Example: global EDF and global RM are far from being optimal

• The schedulability (utilization) bound of global EDF and global RM is equal to 1, independently of the number m of available processors — this is known as Dhall's effect.
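The construction behind this result is the classic Dhall example: m light tasks (C = 2ε, T = 1) plus one heavy task (C = 1, T = 1 + ε). Under global EDF the light jobs have the earlier deadlines and occupy all m cores first, leaving the heavy job only 1 − ε time units for 1 unit of work, so it misses, even though the total utilization approaches 1. A small numeric illustration (function name is mine):

```python
def dhall_utilization(m, eps):
    """Total utilization of m light tasks (2*eps, 1) plus one heavy task (1, 1+eps)."""
    return m * 2 * eps + 1 / (1 + eps)

# With m = 8 and eps = 0.001 the total utilization is about 1.015 — far
# below the capacity m = 8, yet global EDF misses the heavy task's deadline.
```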


Example: partitioned EDF (or RM) could schedule the same task set as the previous one without a deadline miss


Any G-JLFP scheduler is not optimal

• Two processors, three tasks, 𝑇𝑖 = 15, 𝐶𝑖 = 10

Any job-level fixed-priority scheduler is not optimal:
• Synchronous release time
• One of the three jobs is scheduled last under any JLFP policy, so it cannot start before time 10
• Deadline miss unavoidable: that job finishes at time 20 > 15!


Any G-JLFP scheduler is not optimal

• Two processors, three tasks, 𝑇𝑖 = 15, 𝐶𝑖 = 10

This task set is nevertheless feasible, as the schedule shows: allow one job to migrate, e.g., 𝜏1 runs on processor 1 in [0, 10]; 𝜏2 runs on processor 1 in [10, 15] and on processor 2 in [0, 5]; 𝜏3 runs on processor 2 in [5, 15].
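The feasible schedule is an instance of the classical wrap-around (McNaughton) construction, sketched below for the stated parameters (C = 10, D = 15, two processors); the slide itself does not name the algorithm:

```python
def mcnaughton(costs, D):
    """Lay jobs back to back and cut every D time units (wrap-around rule)."""
    schedule, t, proc = [], 0.0, 0
    for job, c in enumerate(costs):
        while c > 0:
            run = min(c, D - t)                       # fill the current CPU
            schedule.append((job, proc, t, t + run))  # (job, cpu, start, end)
            c -= run
            t += run
            if t >= D:                                # CPU full: wrap around
                t, proc = 0.0, proc + 1
    return schedule

# For C = [10, 10, 10] and D = 15: job 0 on CPU 0 in [0, 10]; job 1 on
# CPU 0 in [10, 15] and on CPU 1 in [0, 5]; job 2 on CPU 1 in [5, 15].
```

Job 1's two pieces, [10, 15] on CPU 0 and [0, 5] on CPU 1, never overlap in time, so the job simply migrates once; the total work 30 exactly fills 2 × 15.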


More examples for “Next Fit”
• Suppose the current situation is represented in blue. The size of the new item is 2.

• If the latest item was put in bin 4, the new item goes to bin 5, since it does not fit in bin 4.

• If the latest item was put in bin 1, the new item goes to bin 1, since it still fits.

• If the latest item was put in bin 3, the new item goes to bin 3, since it still fits.