dissecting linux schedulers & implementing our toy cfs_scheduler simulation

03 Feb, 2025

Abstract

Linux scheduling involves multiple things like which tasks to pick, how to pick, how to maximize the CPU utilisation etc. Over the time scheduling algorithms evolved from FIFO, RR to EDF and CFS. These algorithms performs better in different type of tasks i.e. real time, hard limit tasks. An algorithm that is better in scheduling real time tasks is not necessary that it will be able to schedule a hard limit task. This article is the study of what Scheduling heuristics are there to take in mind before checking any scheduling algorithm, how different schedulers evolved and how they are being in use. At the end we will run some simulations with different kind of tasks in our toy cfs schedular simulation.

TL;DR

Linux schedulars evolved over time from FIFO, RR to O(1) and CFS
Each schedular behave differently for different type of tasks
We would explore evolution of schedulars, checking schedulable tasks
We will see Dahll's effect, Uniprocessors and Multiprocessors behaviour
We simulated a set of tasks in our toy cfs_schedular

1.1 Task scheduling

Whenever we have a set of tasks $τ$ and to check if this set is schedulable, we need to check if it exceeds CPU utilisation or not.
$τ = (C_{i}, T_{i}, D_{i})$ where
$C_{i}$ is the cpu time that $τ_{i}$ will take to complete its exection
$T_{i}$ is the time interval after which $τ_{i}$ 's second instance will be released
$D_{i}$ is the deadline before which that $τ_{i}$ should complete its execution
Note:- $D_{i} = T_{i}$ is a common assumption for real time systems

1.1.1 For Uniprocessor:-

For uniprocessors or single processors there should exsist a function $σ (t)$ which maps $t$ to a set of tasks containing runnable and idle tasks
$σ (t) : t \to T \cup (T_{i d l e})$

If $σ (t) \in T$ , it means that at time t, a specific task is being executed
If $σ (t) = T_{i d l e}$ , it means the system is idle at time t which mean no task is being run

1.1.2 For Multiprocessor:-

Let say we have multiprocessors setup with $M$ CPUs. For multiprocessor $σ (t)$ should map a matrix of tasks to $M$ CPUs to time $t$ as:-
$σ (t)$ : $T \in (T \cup (T_{i d l e})^{M}$

Note:- Hence, $σ (t)$ is a scheduling function in the context of CPU scheduling which determines which task is assigned to execution at any given time t

In dual core processor $σ (t)$ for at time $t$ will look like:-
$σ (t) : = [\begin{matrix} T_{1} \\ T_{2} \end{matrix}], M = 2$

For $M > 1$ there is two types of scheduling that we will look at:- Global scheduling, Partitioned Scheduling

1.2 Schedulable Task Set

For N periodic tasks, the task set is schedulable if
$\sum (\binom{C_{i}}{T_{i}}) \leq 1$
which essentially means that summing up all the utilisation of CPU per task $T_{i}$ is still not maxium which is 100% of CPU capacity.

similarly, for multiprocessor this equation becomes
$\sum (\binom{C_{i}}{T_{i}}) \leq M$

which essentially means that summing up all the capacity used by the task set $T$ it should not exceed the total CPU capacity available.

1.2.1 Dhall's Effect

Even if we have $\sum \leq M$ then also it is possible that task $T_{i}$ is not schedulable because of its high utilisation behaviour that is also called hard limit task.

Let's take an example of $(M + 1)$ tasks where $M : = (ϵ, T - 1, T - 1)$ and $(M + 1)^{t h} : = (T, T, T)$ then CPU utilisation upper bound can be found as:-
$\sum \frac{C_{i}}{T_{i}} : = M \frac{ϵ}{T - 1} + \frac{T}{T}$
$: = M \frac{ϵ}{T - 1} + 1$
Now, as $ϵ \to 0$ which is possible for real time tasks then $\sum \frac{C_{i}}{T_{i}} \to 1$

In this case EDF (earliest deadline first) and RM (rate monotonic) scheduling algorithm can not be used because they are not meant for hard tasks.

Hence, even if utilisation is less than what we want but it doesn't guarantee if task set is schedulable. Therefore, algorithms like EDF and RM are applicable only for soft tasks.

1.3 Constant Bandwidth Server (CBS)

Constant bandwidth server is used for hard limit tasks where it ensures that tasks meet their deadlines by allocating a constant bandwidth allocation to them.

It allocates capacity that a task can use and periodic interval after which task's instance can run. $T : = (Q_{i}, P_{i})$ . In other words task $τ_{i}$ is allocated a maximum runtime of $Q_{i}$ after every periodic interval of $P_{i}$

If task $τ_{i}$ wants to run for more than allcated $Q_{i}$ every $P_{i}$ then algorithm throttles/decreases its priority.

$\sum_{j} \frac{Q_{j}}{P_{j}} \leq 1$ where $τ_{i}$ is guranteed to receive $Q_{i}$ every $P_{i}$ interval.

Each task's dynamic deadline is maintained and throttled according to the history and allocated bandwidth. For $M$ cpus/cores with global scheduling if $\sum_{j} \frac{Q_{j}}{P_{j}} \leq M$ then CBS guarantees the scheduling of tasks from a task set.

1.3.1 CBS Algorithm - 1

Each task $τ_{i}$ starts with a dynamic deadline $d_{i}^{s}$ that are initialised as 0 when task is created.
When task wakes up, check if current deadline can be used:-
$d_{i}^{s} > t$ and $\frac{q_{i}}{d_{i}^{s} - t} < \frac{Q_{i}}{P_{i}}$
If not, $d_{i}^{s} = t + P_{i}$ and $q_{i} = Q_{i}$

This ensures that when a task wakes up, it will get its share of cpu after the alloted wait time. But if task wants to run for more time than allocated then its priority will be throttled.

CBS algorithm is generally used in RTOS (real time OS) or real time tasks where they have to meet the deadline or the deadline is critical.

1.4 Complete Fair Schedular (CFS)

CFS is an algorithm which can be used in general to schedule the tasks instead of CBS which is favorable for real time tasks. CFS relies on the hypothesis that it can give a fair share to each task in a task set.
CFS uses few terms/parameters as stated below:-

weight:- a priority or niceness in UNIX systems that is given to each task
timeslice:- an interval given to each task which tells that in this time task $τ_{i}$ has to run atleast once
vruntime:- runtime divided by thread's weight that essentially means that high priority thread will have less vruntime
$v r u n t i m e = a c t u a l_r u n t i m e * \frac{w e i g h t_o f_d e f a u l t_t a s k}{w e i g h t_o f_t a s k}$
High priority task's rate of increase of vruntime is slower compared to lower priority (less weight) task
runqueue:- CFS relies on runqueues which are nothing but red-black tree with ascending vruntime. CFS picks lower most task when scheduling a new task. Which means high priority task will be executed more as compared to low priority task

1.4.1 Our toy CFS simulation

We made a simulation where our cfs picks up the tasks from a runnable queue (priority min queue according to vruntime). At $t = 0$ every task has some vruntime, some random values are given instead of zeros, and vruntime increases like:-

at $t = 0$ push all tasks into the runnable queue having both CPU and IO tasks
have one vruntime calculation function which calculate task vruntime according to the task's priority:-
$v r u n t i m e + = \frac{e x e c u t e d_t i m e * N I C E_0_L O A D}{w e i g h t}$
$w e i g h t = \frac{N I C E_0_L O A D}{p r i o r i t y + 1}$
which means higher the task's priority (lower value of priority variable) will give higher the value of weight
heigher the value of weight lower rate of increase in vruntime of that task

CFS simluation
GLOABAL NICE_0_LOAD = 1024

FUNCTION weight_function(priority):
    return NICE_0_LOAD / (priorit + 1)
    
FUNCTION run_io_bound(process, iowait, queue):
    sleep_thread(iowait)
    weight = weight_function(process.priority)
    process.vruntime += (iowait * NICE_0_LOAD) / weight
    executed_time = 1
    process.cpu_burst_time -= executed_time
    process.vruntime += (executed_time * NICE_0_LOAD) / weight
    if process.cpu_burst_time > 0:
        queue.push(process)
    
FUNCTION run_cpu_bound(process, timeslice, queue):
    executed_time = min(timeslice, process.cpu_burst_time)
    process.cpu_burst_time -= executed_time
    weight = weight_function(process.priority)
    process.vruntime += (executed_time * NICE_0_LOAD) / weight
    
    if process.cpu_burst_time > 0:
        queue.push(process)

LOOP task in queue:
    if task is cpu_bound:
        run_cpu_task(task, 1, queue)
    else
        run_io_task(task, 10, queue)

1.4.2 Some simulations:-

First simulation is of processes with different vruntime, priority etc but all processes are of IO_BOUND nature whereas second one contains all CPU_BOUND processes.

We can clearly see that IO tasks are waiting for their iowait time and getting penalised. More priority IO task is getting more penalised and scheduling will get delayed whereas in CPU bound tasks, Process 1 got finalised quickly because it has the highest priroirty and got penalised the least because it started with least vruntime and highest priority. So, resulted in more frequent scheduling. Screenshot 1 Screenshot 2

References

https://github.com/singhdevhub-lovepreet/cfs-schedular-cpp https://retis.sssup.it/luca/AKP/Slides/dl_internals.pdf https://en.wikipedia.org/wiki/Completely_Fair_Scheduler https://wxdublin.gitbooks.io/deep-into-linux-and-beyond/content/cfs_internals.html