1 | 69 Embedded and Networked Systems
[email protected] (appointment by email)
IN4343 Real-Time Systems, Lecture 10
Multiprocessor scheduling
Sources

Book (for interested students)
• Multiprocessor Scheduling for Real-Time Systems (Embedded Systems), 2015 Edition
• By Sanjoy Baruah, Marko Bertogna, Giorgio Buttazzo
• Available at https://www.springer.com/gp/book/9783319086958

Papers (for interested students)
• A survey of hard real-time scheduling for multiprocessor systems
  • By Robert I. Davis and Alan Burns, University of York, U.K.
  • Available at https://dl.acm.org/citation.cfm?id=1978814
• Global Scheduling Not Required: Simple, Near-Optimal Multiprocessor Real-Time Scheduling with Semi-Partitioned Reservations
  • By Bjorn Brandenburg and M. Gül, Max Planck Institute, Germany

Slides (for interested students)
• Giorgio Buttazzo:
  • On multiprocessor systems (part 1): http://retis.sssup.it/~giorgio/slides/cbsd/mc1-intro-6p.pdf
  • On multiprocessor systems (part 2): http://retis.sssup.it/~giorgio/slides/cbsd/mc2-sched-6p.pdf
• Alessandra Melani:
  • On global scheduling (part 1): http://retis.sssup.it/~giorgio/slides/cbsd/Melani1-global.pdf
  • On global scheduling (part 2): http://retis.sssup.it/~giorgio/slides/cbsd/Melani2-RTA-DAG.pdf
  • On the OpenMP real-time task model: http://retis.sssup.it/~giorgio/slides/cbsd/Melani3-openMP.pdf
  • On semi-partitioning algorithms: http://retis.sssup.it/~giorgio/slides/cbsd/Melani4-semipar.pdf
• Emanuele Ruffaldi:
  • On parallel programming - OpenMP (part 1): http://retis.sssup.it/~giorgio/slides/cbsd/Ruffaldi-omp1.pdf
  • On parallel programming - OpenMP (part 2): http://retis.sssup.it/~giorgio/slides/cbsd/Ruffaldi-omp2.pdf
  • On parallel programming - GPU (part 1): http://retis.sssup.it/~giorgio/slides/cbsd/Ruffaldi-gpu1.pdf
  • On parallel programming - GPU (part 2): http://retis.sssup.it/~giorgio/slides/cbsd/Ruffaldi-gpu2.pdf
Sources

Sources of some of the slides used in this lecture:
• From Giorgio Buttazzo's website: http://retis.sssup.it/~giorgio/rts-MECS.html
  • And the CBSD course: http://retis.sssup.it/~giorgio/CBSD.html
• From Alessandra Melani (in the CBSD course 2016): http://retis.sssup.it/~giorgio/slides/cbsd/Melani1-global.pdf
• From Marko Bertogna (UniMore): http://algo.ing.unimo.it/people/marko/
  • Source: http://algo.ing.unimo.it/people/marko/presentations/Multiproc_course.zip
Moore's Law: the number of transistors per chip doubles every 24 months.
The limit

At the launch of the Pentium 4, Intel expected single-core chips to scale up to 10 GHz using gates below 90 nm. However, the fastest Pentium 4 never exceeded 4 GHz. Why did that happen?

Recall:
• Dynamic power (Pd): consumed during operation.
• Static power (Ps): consumed when the circuit is off (caused by leakage current).

As devices scale down in size, the gate oxide thickness decreases, resulting in a larger leakage current. The leakage current negatively affects the static power of the chip.
The limit

Had processor performance kept improving by increasing the clock frequency, the chip temperature would have reached levels beyond the capability of current cooling systems.
However, future real-time systems require a lot of computational power.
Switching to multicore systems

How to exploit multicores in real-time systems?

• Industrial trend A: by consolidation
  • Port the existing functionalities of multiple processors onto a multicore platform.
  • Example from the automotive industry: ECU consolidation.
  • Why is it helpful? To reduce hardware cost and communication delays.
• Industrial trend B: by parallel programming techniques
  • By parallelizing the application code, each code segment can run on a different core in parallel; hence, tasks will have shorter response times.
  • Main challenges:
    • How to split the code into parallel segments that can be executed simultaneously?
    • What to do with data or control-flow dependencies?
    • How to allocate such segments to different cores?
    • How to analyze the worst-case execution time of the applications on a multicore platform?
Industry's challenges

Problem 1: Parallelizing legacy code implies a tremendous cost and effort due to:
• re-designing the application
• re-writing the source code
• updating the operating system
• writing new documentation
• testing the system
• software certification

Problem 2: Executing tasks on a multicore platform causes a lot of conflicts/interferences on software and hardware resources:
• Multiple tasks (running on multiple cores) may want to access the memory at the same time.
• Tasks may evict each other's cache blocks in the shared caches.
• The WCET of a task will depend on the set of co-running tasks on the platform (this was not the case for single-processor systems).
WCET in multicore

• Test by Lockheed Martin Space Systems on an 8-core platform.

The WCET increases because of the competition among cores for shared resources:
• main memory
• memory bus
• last-level cache
• I/O devices
Types of memory
Cache in multicore systems

Possible solution: partition the last-level cache between the tasks.
Issue: reducing the cache size of a task increases its WCET, and hence the task's utilization!

[Figure: L3 cache partitioned among n = 40 tasks on 4 cores]

Any solution idea? A nice direction is to:
• use non-preemptive scheduling on the cores (to avoid cache conflicts in L1 and L2);
• partition the shared L3 cache by the number of cores (rather than by the number of tasks, which is usually much larger than the number of cores).

Diagram from the paper "WCET(m) Estimation in Multi-Core Systems using Single Core Equivalence", Renato Mancuso, Rodolfo Pellizzoni, Marco Caccamo, Lui Sha, Heechul Yun.
Memory banks

• To reduce memory conflicts, the DRAM is divided into banks.
Main memory and I/O conflicts

Still, when cores concurrently access the main memory, DRAM accesses have to be queued, causing a significant slowdown. A similar problem occurs when tasks running on different cores request access to I/O devices at the same time.
How bad can memory conflicts be?

Test on an Intel Xeon:
• Diffbank: Core 0 -> Bank 0; Cores 1-3 -> Banks 1-3
• Samebank: all cores -> Bank 0
Types of multicore systems

• ARM's MPCore (identical cores): 4 identical ARMv6 cores.
• STI's Cell Processor (heterogeneous cores): one Power Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs).
Example: ARM big.LITTLE architecture

There are many open scheduling problems when it comes to heterogeneous platforms. In this course, we focus on identical cores.
Parallel real-time tasks
Sequential task
Parallel programming

• Existing parallel programming models:
• OpenMP
• MPI
• IBM’s X10
• Intel’s TBB (abstraction for C++)
• Sun’s Fortress
• Cray’s Chapel
• Cilk (Cilk++)
• Codeplay’s Sieve C++
• Rapidmind Development Platform
Task models for parallel real-time tasks

• Representing parallel code requires a more complex structure, such as a graph (usually a directed acyclic graph, a.k.a. DAG).
• In a DAG, back edges (i.e., cycles) are forbidden.
• DAGs can be conditional (e.g., because of an if-then-else block):
  • OR nodes represent conditional statements (only one branch is executed);
  • AND nodes represent parallel computations (all branches must be executed).
Structured parallelism

• Fork-join graphs (a special type of DAG):
  • After a fork node, all immediate successors must be executed (the order does not matter).
  • A join node is executed only after all its immediate predecessors have completed.
  • The nested fork-join model can have nested forks and joins.
Assumptions and parameters

• Arrival pattern:
  • Periodic (activations exactly separated by a period T)
  • Sporadic (minimum inter-arrival time T)
  • Aperiodic (no inter-arrival bound exists)
• Is preemption allowed at arbitrary times?
• Is task migration allowed?

Task parameters: τi = ({ci,1, ci,2, …, ci,mi}, Di, Ti)

[Figure: a DAG task with segments S1-S5]
Example

The number in each node represents the execution time of that segment. Assume we have an infinite number of cores: what is the shortest possible WCRT for this task?

Tasks with intra-task precedence constraints are harder to schedule! Even with an infinite number of cores, we cannot finish this task earlier than 7 units of time.

[Figure: a schedule of the task's segments on cores 1-3]
Important factors

Task parameters: τi = ({ci,1, ci,2, …, ci,mi}, Di, Ti)

• Sequential computation time (volume): C_i^s = Σ_{j=1}^{m_i} c_{i,j}
• CPU utilization (of a parallel task): U_i = C_i^s / T_i
• If C_i^s ≤ D_i holds, the task is schedulable on a single-core system.
• If Σ U_i > m, then the task set is certainly NOT schedulable on a multicore platform with m cores.
• Critical path length C_i^p: the length of the longest path in the graph.

[Figure: a DAG with node execution times 1, 2, 3, 4, 1, 5, 8, 3, 1]

What is the length of the critical path in this graph? Answer: 16.

What would be a necessary schedulability test for a DAG task? C_i^p ≤ D_i.
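The two quantities above are easy to compute mechanically. The sketch below does so for a small hypothetical fork-join DAG (not the one in the figure): volume is the sum of all segment WCETs, and the critical path is the longest path found over a topological order.

```python
# Volume (C^s) and critical path length (C^p) of a DAG task; a sketch on a
# hypothetical DAG. Nodes map to segment WCETs; edges encode precedence.

def volume(wcet):
    """C^s: sum of all segment WCETs."""
    return sum(wcet.values())

def critical_path(wcet, edges):
    """C^p: longest path length, computed over a topological order."""
    succ = {v: [] for v in wcet}
    indeg = {v: 0 for v in wcet}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Kahn's algorithm for a topological order
    order, stack = [], [v for v in wcet if indeg[v] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    longest = {v: wcet[v] for v in wcet}  # longest path ending at v
    for u in order:
        for v in succ[u]:
            longest[v] = max(longest[v], longest[u] + wcet[v])
    return max(longest.values())

# Hypothetical fork-join task: s1 forks into s2 and s3, which join at s4.
wcet = {"s1": 1, "s2": 5, "s3": 2, "s4": 1}
edges = [("s1", "s2"), ("s1", "s3"), ("s2", "s4"), ("s3", "s4")]
print(volume(wcet))                # -> 9 (single-core schedulable only if 9 <= D)
print(critical_path(wcet, edges))  # -> 7 (necessary test: C^p <= D)
```

Both necessary tests from the slide follow directly: C^s ≤ D for a single core, and C^p ≤ D for any number of cores.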
Multiprocessor models

• Identical: processors are of the same type and have the same speed. Each task has the same WCET on each processor.
• Uniform: processors are of the same type but may have different speeds. Task WCETs are smaller on faster processors.
• Heterogeneous: processors can be of different types. The WCET of a task depends on both the processor type and the task itself.
Real-time scheduling for multicore platforms

Our assumptions:
• Identical multicore systems (all cores are the same).
• The WCET of each task is a sound upper bound on its actual execution time, for any possible set of co-running tasks and any scenario.
• Unless mentioned explicitly, we assume that each task has one sequential code segment.
Classification of multiprocessor scheduling algorithms

• Partitioned scheduling: tasks cannot migrate between cores.
• Semi-partitioned scheduling: some of the tasks can migrate between cores.
• Global scheduling: any task is allowed to migrate between cores.
• Clustered scheduling: tasks can migrate only among some pre-specified cores.

• Fixed-priority (FP) scheduling: each task has a fixed priority.
• Job-level fixed-priority (JLFP) scheduling: each job has a fixed priority; e.g., EDF and FP are both JLFP.
• Job-level dynamic-priority (JLDP) scheduling: each job has a varying priority. JLDP provides better schedulability, but with a higher overhead.
Partitioned scheduling

• Each processor manages its own ready queue.
• The processor for each task is determined offline.
• The processor cannot be changed at runtime.

[Figure: per-core ready queues, each feeding one running task]
Global scheduling

• The system manages a single queue of ready tasks.
• The processor is determined at runtime.
• During execution, a task can migrate to another processor.

[Figure: a single global ready queue feeding all cores]
Global scheduling

[Figure: a global queue, ordered according to a given policy, feeding cores 1-3; t1, t2, t3 running, t4 and t5 waiting]

• The first m tasks are scheduled upon the m cores.
• When a task completes, the next one in the queue is scheduled on the available core.
• When a higher-priority task arrives, it preempts the task with the lowest priority among the executing ones.

As a result, tasks may MIGRATE between cores!
Exam example: global rate monotonic

• What would be the schedule of these tasks on a multicore platform with 3 cores?

τi | Ci | Ti
τ1 | 1 | 2
τ2 | 3 | 3
τ3 | 5 | 6
τ4 | 4 | 6

[Gantt chart: under global RM, τ1, τ2, and τ3 keep the three cores busy; τ4 receives only 3 of its 4 units by t = 6 and misses its deadline.]

Is this task set feasible? Yes.

[Gantt chart: a feasible schedule of the same tasks on 3 cores with no deadline miss.]
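The global-RM schedule above can be reproduced with a small discrete-time simulator. This is a sketch under simplifying assumptions: unit-time scheduling decisions, synchronous release at t = 0, implicit deadlines (D = T), and RM priority with ties broken by task index.

```python
# Discrete-time simulation of global preemptive rate-monotonic scheduling
# on m identical cores; a sketch, not a general-purpose analysis tool.

def simulate_global_rm(tasks, m, horizon):
    """tasks: list of (C, T) with implicit deadlines (D = T).
    Returns the set of (0-based) task indices that miss a deadline."""
    n = len(tasks)
    rem = [0] * n          # remaining execution of each task's current job
    misses = set()
    for t in range(horizon):
        for i, (C, T) in enumerate(tasks):
            if t % T == 0:        # new job release (and deadline of the old one)
                if rem[i] > 0:    # previous job unfinished at its deadline
                    misses.add(i)
                rem[i] = C
        # run the m highest-priority ready jobs (RM order, index breaks ties)
        ready = sorted((i for i in range(n) if rem[i] > 0),
                       key=lambda i: (tasks[i][1], i))
        for i in ready[:m]:
            rem[i] -= 1
    return misses

tasks = [(1, 2), (3, 3), (5, 6), (4, 6)]    # tau_1 .. tau_4 from the slide
print(simulate_global_rm(tasks, m=3, horizon=12))   # -> {3}: tau_4 misses
```

Tracing the simulation by hand confirms the slide: τ4 executes only at t = 1, 3, and 5, so one unit is still pending at its deadline t = 6.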
Exam example: other global scheduling policies

(Same task set: τ1 = (C = 1, T = 2), τ2 = (3, 3), τ3 = (5, 6), τ4 = (4, 6).)

• Global fixed-priority with priorities P1 < P2 < P4 < P3: [Gantt chart] deadline miss.
• Global non-preemptive fixed-priority with priorities P2 < P1 < P3 < P4: [Gantt chart] deadline miss for τ3.
• Global EDF (ties in deadlines are broken by task index): [Gantt chart] deadline miss for τ4.
• Global EDF (if there is a tie in deadlines, τ4 wins! God knows why :D): [Gantt chart] no deadline miss.
Exam example: other global scheduling policies

τi | Ci | Ti
τ1 | 1 | 2
τ2 | 3 | 3
τ3 | 5 | 6
τ4 | 4 | 6

[Gantt chart: a given schedule of the tasks on 3 cores over [0, 6)]

What should each job's priority be to generate this schedule?

P1,1 < P2,1 < P3,1 < P1,2 < P2,2 < P4,1 < P1,3

• Job J1,2 must be able to preempt J4,1; hence, its priority should be higher than P4,1.
• Job J1,3 must NOT be able to preempt any job; hence, its priority should be the lowest.
Exam example: global scheduling of DAGs

τi | Ti
τ1 | 3
τ2 | 6

Schedule using global fixed-priority with P1 < P2:
• τ1 is a DAG with segments s1,1, s1,2, s1,3.
• τ2 is a DAG with segments s2,1, s2,2, s2,3, s2,4.

[Figure: the two DAGs with their segment execution times, and the resulting schedule of the segments on 3 cores over [0, 6)]
Global scheduling

• Work-conserving scheduler: no processor is ever idled when a task is ready to execute.
• Non-work-conserving scheduler: a processor may be left idle even if there are ready jobs in the system (open research area) :D
Global scheduling: advantages

✓ Allows parallel execution
✓ Load balancing between the cores (being able to dispatch jobs on idle cores)
✓ Easier re-scheduling (dynamic loads, selective shutdown, etc.)
✓ Lower average response time (known result from queueing theory)
✓ More efficient reclaiming and overload management
× Number of preemptions
× Migration cost: can be mitigated by proper hardware (e.g., MPCore's Direct Data Intervention)
× Scheduling overheads
× Few schedulability tests (further research needed)
× No job-level fixed-priority scheduling algorithm is optimal
Global scheduling: disadvantages

1. Overheads

Obvious overheads: task migration between cores. Why does task migration have a big impact on a task's execution time?

[Figure: m cores C1 … Cm with private L1 caches, a shared L2 cache, and main memory (RAM)]

1. τ6 is running, so it gradually loads its data into L2 and then L1.
2. τ3 arrives and enters the ready queue. Since it has a higher priority than τ6, it preempts τ6.
3. τ2 finishes, and τ6 resumes its execution on the second core.
4. τ3 accesses its own data; it may evict the data of τ6!
5. When τ6 tries to access its data, it gets cache misses, so it has to reload the data again!

A migrated task needs to load a lot of data into the cache. As you see, the cache becomes a big source of unpredictability: co-running tasks affect each other's execution time.
Global scheduling: disadvantages

1. Overheads

Obvious overheads: task migration between cores.

Non-obvious overheads: large scheduling overhead. Whenever the task on a core completes, it calls the scheduler function, so multiple scheduler functions can run at the same time if multiple tasks finish at the same time. Since each scheduler function wants to access the global ready queue, and since that queue is a global variable, it must be protected by semaphores or locks. Consequently, scheduler functions called by different cores can frequently block each other!
2. Dhall's effect

Global scheduling: disadvantages

The lower bound on the total utilization of task sets that are not schedulable by any work-conserving global scheduling algorithm on a multiprocessor system with m cores is 1. Namely: regardless of the number of cores in the system, we may not be able to find a feasible schedule for a task set even if its utilization is just about 1.
Dhall's effect
(on any work-conserving global policy, including global EDF)

Example: m processors, n = m + 1 tasks: m light tasks (U ≈ 0) and 1 heavy task (U ≈ 1).

τi | Ci | Ti | Di | φi | Ui
1 | 1 | T | T | 0 | ≈0
2 | 1 | T | T | 0 | ≈0
… | 1 | T | T | 0 | ≈0
m | 1 | T | T | 0 | ≈0
m+1 | T | T | T | ε | ≈1

T → ∞ ⇒ U → 1

[Gantt chart: at t = 0 the m light tasks occupy all m cores, so the heavy task τm+1, released at ε, cannot start before t = 1 and misses its deadline at T + ε. A second chart shows a feasible schedule of the same tasks.]

This task set is feasible under partitioned scheduling or under a non-work-conserving scheduler.
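The arithmetic behind Dhall's example can be checked numerically. This is a sketch with hypothetical parameter values; the function names are illustrative, not from any library.

```python
# Dhall's effect, numerically: m light tasks (C = 1, period T) plus one heavy
# task (C = T, period T, released at eps). A sketch with hypothetical values.

def total_utilization(m, T):
    """Total utilization of the m + 1 tasks: tends to 1 as T grows."""
    light = m * (1.0 / T)   # each light task contributes 1/T
    heavy = T / T           # the heavy task contributes U = 1
    return light + heavy

def heavy_task_meets_deadline(m, T, eps):
    """Under any work-conserving global scheduler with synchronous light
    releases at t = 0, all m cores are busy during [0, 1), so the heavy task
    (released at eps < 1) gets at most T + eps - 1 units of service before
    its absolute deadline T + eps, while it needs T units."""
    available = T + eps - 1
    return available >= T

for T in (10, 1000, 10**6):
    print(T, total_utilization(m=4, T=T), heavy_task_meets_deadline(4, T, 0.5))
```

As T grows, the printed utilization approaches 1 while the heavy task keeps missing its deadline, which is exactly the slide's point: the miss is independent of the number of cores.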
Negative results about global scheduling

• Weak theoretical framework: the critical instant is unknown.
• Global EDF is not optimal.
• No global job-level fixed-priority (G-JLFP) scheduler is optimal.
• Optimal algorithms exist only for sequential implicit-deadline tasks.
  • Examples: PFair and RUN. These algorithms have a large number of preemptions.
• There are many sufficient tests (most of them incomparable).
Partitioned scheduling

• The scheduling problem reduces to: a bin-packing problem + a uniprocessor scheduling problem.
  • Bin packing: NP-hard in the strong sense. Various heuristics are used: First Fit, Next Fit, Best Fit, FFDU, BFDD, etc.
  • Uniprocessor scheduling: well known, e.g., EDF (U ≤ 1) or RM (response-time analysis).
Possible partitioning choices

• Partition by information-sharing requirements.
• Partition by functionality.
• Use the least possible number of processors, or run at the lowest possible frequency (depends on considerations like fault tolerance, power consumption, temperature, etc.).

These approaches might not be good for schedulability.

• Partition to increase schedulability.

Real-time systems research has focused on this last option extensively.
Classic partitioning algorithms for real-time systems

Partitioning problem: given a set of tasks τ = {τ1, τ2, …, τn} and a multiprocessor platform with m processors, find an assignment of tasks to processors such that each task is assigned to one and only one processor. It is a bin-packing problem!

Classic solutions:
1. Select a fitness criterion, e.g., task utilization Ui = Ci/Ti or task density αi = Ci/Di.
2. Decide how you want to sort the tasks (decreasing, increasing, or random).
3. Decide what the fitness evaluation method is (how you will reject an assignment).
4. Use one of the following fitting policies to assign tasks to processors:
• First fit (FF)
• Best fit (BF)
• Worst fit (WF)
• Random fit (RF)
• Next fit (NF)
• …
Partitioning heuristics

• First fit (FF): place each item in the first bin that can contain it.
• Best fit (BF): place each item in the bin with the smallest empty space.
• Worst fit (WF): place each item in the used bin with the largest empty space; otherwise, start a new bin.
• Next fit (NF): place each item in the same bin as the last item; if it does not fit, start a new bin.
Comparison
• Suppose the current situation is represented in blue, the latest item was put in bin 2, and a new item of size 2 arrives:
Exam example

Fitness policy: first fit. Fitness criterion: task utilization. Sorting: decreasing. Fitness evaluation: U ≤ 1 (preemptive EDF will be used to schedule the tasks assigned to a core).

U1-U9 = 0.9, 0.8, 0.5, 0.4, 0.2, 0.2, 0.2, 0.1, 0.1 (total U = 3.4)

1. What are the partitions created by the first-fit policy?
[Figure: Proc1 = {0.9, 0.1} (U = 1.0), Proc2 = {0.8, 0.2} (U = 1.0), Proc3 = {0.5, 0.4, 0.1} (U = 1.0), Proc4 = {0.2, 0.2} (U = 0.4)]

2. What is the minimum number of partitions? 4
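The first-fit-decreasing walk-through above is easy to mechanize. A minimal sketch, assuming per-core capacity U ≤ 1 (as with preemptive EDF on each core) and a small tolerance to guard against floating-point rounding:

```python
# First-fit decreasing partitioning by utilization; reproduces the exam
# example. The eps tolerance is an implementation detail added here to
# avoid rejecting sums like 0.9 + 0.1 due to float rounding.

def first_fit_decreasing(utils, capacity=1.0, eps=1e-9):
    """Place each utilization in the first core where it still fits."""
    bins = []
    for u in sorted(utils, reverse=True):
        for b in bins:
            if sum(b) + u <= capacity + eps:
                b.append(u)
                break
        else:
            bins.append([u])        # no existing core fits: open a new one
    return bins

utils = [0.9, 0.8, 0.5, 0.4, 0.2, 0.2, 0.2, 0.1, 0.1]
print(first_fit_decreasing(utils))
# -> [[0.9, 0.1], [0.8, 0.2], [0.5, 0.4, 0.1], [0.2, 0.2]]
```

Four cores are used, matching the lower bound ⌈3.4⌉ = 4 from the total utilization.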
First-fit algorithm

The first-fit algorithm achieves an approximation factor of 2; namely, the number of bins used by this algorithm is no more than twice the optimal number of bins.

Intuition: it is impossible for 2 bins to each be at most half full. Such a situation would imply that, at some point, exactly one bin was at most half full and a new bin was opened to accommodate an item of size at most V/2 (where V is the size of a bin), even though that item would have fit in the existing half-full bin.
Observations

The performance of each algorithm strongly depends on the input sequence; however:
• NF has poor performance, since it does not exploit the empty space in the previous bins.
• FF improves the performance by exploiting the empty space available in all the used bins.
• BF tends to fill the used bins as much as possible.
• WF tends to balance the load among the used bins.
Lopez utilization bound for partitioned EDF (with first-fit policy)

U_EDF+FF ≤ (m + 1) / 2

A refined bound: if n > β·m and ∀i, Ui ≤ Umax, then the task set is schedulable by EDF+FF if

U_EDF+FF ≤ (β·m + 1) / (β + 1), where β = ⌊1 / Umax⌋

β is the maximum number of tasks with utilization Umax that fit into one processor; Umax is the maximum task utilization among all tasks.
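The refined bound is a one-liner to evaluate. A sketch, with example values chosen here for illustration:

```python
# Lopez utilization bound for partitioned EDF with first-fit, as stated on
# the slide: U <= (beta*m + 1)/(beta + 1), with beta = floor(1/U_max).
import math

def lopez_edf_ff_bound(m, u_max):
    beta = math.floor(1.0 / u_max)
    return (beta * m + 1) / (beta + 1)

# With U_max = 1 (beta = 1), the bound reduces to (m + 1)/2:
print(lopez_edf_ff_bound(m=4, u_max=1.0))   # -> 2.5
# A smaller maximum task utilization gives a better (larger) bound:
print(lopez_edf_ff_bound(m=4, u_max=0.25))  # -> 3.4
```

This shows why the bound improves when all tasks are light: more tasks of utilization Umax fit per processor, so β grows and the bound approaches m.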
Global vs. partitioned scheduling
Global vs. partitioned

• There are task sets that are schedulable only with a global scheduler.
• Example: τ1 = (C1 = 1, T1 = 2); τ2 = (2, 3); τ3 = (2, 3).

It is impossible to find a feasible schedule using the partitioned approach: no two of these tasks can be scheduled together on one core, since, e.g., 1/2 + 2/3 > 1 (and 2/3 + 2/3 > 1).

[Gantt chart: a feasible schedule of the three tasks under global scheduling on two processors.]

This counterexample is also valid for a global FP algorithm when the priorities follow p2 < p1 < p3.
Global vs. partitioned

• There are task sets that are schedulable only with a partitioned scheduler.
• Example for 2 cores (assume that each core is scheduled with EDF):

τi | Ci | Ti | Ui
1 | 4 | 6 | ≈0.67
2 | 7 | 12 | ≈0.58
3 | 4 | 12 | ≈0.33
4 | 10 | 24 | ≈0.42

All 4! = 24 global priority assignments lead to a deadline miss.
Global vs. partitioned

Partitioned scheduling:
✓ Supported by the automotive industry (e.g., AUTOSAR)
✓ No migrations
✓ Isolation between cores
✓ Mature scheduling framework
✓ Low scheduling overhead (no need to access a global ready queue)
× Cannot exploit unused capacity
× Rescheduling not convenient
× NP-hard allocation

Global scheduling:
✓ Allows parallel execution
✓ Automatic load balancing
✓ Lower average response time
✓ Easier re-scheduling
✓ More efficient reclaiming and overload management
✓ Generally lower number of preemptions
× Migration costs
× Inter-core synchronization
× Loss of cache affinity
× Weak scheduling framework
Semi-partitioned scheduling
• Tasks are statically allocated to processors, if possible.
• Remaining tasks are split into chunks (subtasks), which are allocated to different processors.
Semi-partitioned scheduling

Note that the subtasks are not independent: they are subject to a precedence constraint, and this precedence must be managed!

This can be done, for example, by assigning an offset to the second segment and a tighter deadline to the first segment of the task that must be split.

[Figure: τ5 split into two subtasks, τ5¹ and τ5², placed on different cores]
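One simple way to realize the offset/tighter-deadline idea is to split the original deadline in proportion to each chunk's share of the WCET. This is only a sketch of one possible policy (the proportional split is an illustrative choice, not the only one used in the literature):

```python
# Managing the precedence between two chunks of a split task: give chunk 1
# an artificial deadline proportional to its WCET share, and release chunk 2
# with an offset equal to that deadline. A sketch of one possible policy.

def split_task(C, D, c1):
    """Split a task with WCET C and relative deadline D into chunks of
    c1 and C - c1. Returns (D1, offset2, D2), all relative to the
    original release time."""
    assert 0 < c1 < C <= D
    D1 = D * c1 / C      # tighter artificial deadline for chunk 1
    offset2 = D1         # chunk 2 may only start after chunk 1's deadline
    D2 = D - D1          # chunk 2's deadline within [offset2, D]
    return D1, offset2, D2

print(split_task(C=6, D=12, c1=2))   # -> (4.0, 4.0, 8.0)
```

By construction, if each chunk meets its artificial deadline on its own core, chunk 2 never starts before chunk 1 has finished, and the whole task completes by the original deadline D.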
Clustered scheduling
• A task can only migrate within a predefined subset of processors (cluster).
Example: global EDF and global RM are far from being optimal
• The schedulability bound of global-EDF and global-RM is equal to 1, independently of the number m of available processors.
Example: partitioned EDF (or RM) could schedule the same task set as the previous one without a deadline miss.
No G-JLFP scheduler is optimal

• Two processors, three tasks, Ti = 15, Ci = 10.
• No job-level fixed-priority scheduler is optimal:
  • With a synchronous release, one of the three jobs is scheduled last under any JLFP policy.
  • That job can only run in [10, 20), so a deadline miss is unavoidable!
No G-JLFP scheduler is optimal

• Two processors, three tasks, Ti = 15, Ci = 10.

This task set is feasible, as you see:

[Gantt chart: e.g., τ1 runs on core 1 in [0, 10); τ2 runs on core 2 in [0, 5) and on core 1 in [10, 15); τ3 runs on core 2 in [5, 15). All three jobs meet the deadline at t = 15.]
More examples for "Next Fit"

• Suppose the current situation is represented in blue. The size of the new item is 2.
• If the latest item was put in bin 4, the new item goes to bin 5, since it does not fit in bin 4.
• If the latest item was put in bin 1, the new item goes to bin 1, since it still fits.
• If the latest item was put in bin 3, the new item goes to bin 3, since it still fits.