About Me

With proven experience in Web Development, Quality Assurance, and Tech Support, and an education concentrating in application development, I have a well-rounded skill set to accomplish almost any challenge placed before me.

What I can help you do:
1 Analysis and Design: From business requirement collection to system and database design.
2 Web Development: Sales-driven sites or a web application UI.
3 Application Development: An information system backend or a standalone application.
Research Paper

The final project for this course consisted of a written research paper focused on a specific aspect of an operating system. I chose to compare three different process schedulers used in different versions of the Linux kernel.


Process Scheduling in Linux

Overview

Since 2001, Linux kernel developers have made several attempts to improve the performance of the process scheduler. The first scheduler examined here is the priority-based scheduler used in the 2.4 kernel. Its limited means of gathering information to set process priorities and its lack of interactivity mechanisms are only two of its downfalls. The scheduling function's simplistic design, though effective, becomes inefficient as the number of active processes on the system grows. It is not hard to see why the 2.6 kernel series used a newly designed scheduler, until the release of 2.6.23 when it was once again redesigned. (Aas, 2005)

The multilevel queue scheduler used in the 2.6 kernel up to the 2.6.22 release was an attempt to fix all the pitfalls of the previous scheduler. With a total of 140 priority levels, mechanisms for judging the interactivity of a process, and a scheduler that operates in constant time regardless of the number of processes active on the system, it is hard to believe that it was totally redesigned for the 2.6.23 kernel release. (Bovet & Cesati, 2000)

With the release of the 2.6.23 Linux kernel, the "Completely Fair Scheduler" was introduced. It was designed to be flexible, fair, and efficient. While it supports a broad range of new features, from process grouping to modular scheduling classes, its real beauty lies in the concept of "fairness". The scheduler no longer separates processes strictly by priority; instead, it attempts to give each process a "fair share" of CPU time based on its weight relative to the total load on the system. However, fairness only applies to normal processes; real-time processes are still scheduled much as they were in the previous scheduler. (Kumar, 2008)

Introduction

The Linux operating system has been around for many years, and thousands of programmers from different backgrounds and with a wide range of experience contribute to its success on a daily basis. With new releases being published monthly, if not weekly, it is no surprise that the Linux kernel is a well-oiled machine. There have been three recent attempts to improve process scheduling in Linux. The first is the priority-based scheduler, examined here as of the 2.4.31 kernel, the last release to use that scheduling method. The second, last used in the 2.6.22.19 release of the kernel, is a priority-based multilevel feedback queue. The third, currently in mainstream use, is dubbed the Completely Fair Scheduler, or CFS, and is found in releases 2.6.23 and up. Though the CFS went through several changes after it was first released, the 2.6.25.9 kernel will be examined to give a more current view of the scheduler.

Judging whether the design of a process scheduler is successful is not an easy task. Criteria must be set to weigh each scheduler against the others and identify their strengths and weaknesses. Examining the overhead incurred in the system, along with throughput, response time, and turnaround time, gives a good indication of whether a scheduler can schedule the system's work efficiently. (Silberschatz, Gagne & Galvin)

Five key aspects of each scheduler will be examined against the previously mentioned criteria. The first is the structure of each active task to be scheduled, which is reviewed to identify how process information is stored in the Linux kernel. Time slice calculation is examined next, as the length of the time slice has a direct effect on the performance of the system. Priorities also play a crucial role in performance, so the use and calculation of process priorities is considered as well. The way in which the scheduler's run queue holds the processes currently active on the system also affects performance, so its mechanisms are examined in terms of use and efficiency. Finally, the overall structure and layout of the scheduler is analysed, as it determines how the previously mentioned aspects come together.

Task Structure

In the 2.4 Linux kernel there are only a handful of elements used by the scheduler to determine the amount of CPU usage and the order in which processes are executed. To keep track of the time left in a process's time slice, a counter is used; it represents the number of clock ticks that may pass before the time slice is over. To assist with the calculation of the time slice, each process has a nice value, ranging from -20 to 19, where a nice value of -20 yields a longer time slice and 19 yields considerably less runtime. The scheduling policy is kept to determine which algorithm the scheduler will use when executing the process. A real-time process with a policy of SCHED_FIFO uses the First-In-First-Out algorithm and simply runs with an unlimited time slice until it yields the CPU, finishes processing, or is preempted by a process with a higher priority. A process with a policy of SCHED_RR, on the other hand, is assigned a time slice and is moved to the end of the queue by the scheduler once that slice has expired. The last policy, SCHED_OTHER, indicates a non-real-time process. Real-time processes also carry an rt_priority value from 0 to 99, with higher values taking precedence. Finally, the task structure contains next_task and prev_task pointers linking it into the doubly-linked list used as the run queue. Figure 1 shows the key elements of the task_struct in the 2.4.31 Linux scheduler.

struct task_struct{
     . . .
     long counter;
     long nice;
     unsigned long policy;
     . . .
     struct task_struct *next_task, *prev_task;
     . . .
     unsigned long rt_priority;
};
Figure 1 - Key elements of the /include/linux/sched.h
task structure in the 2.4.31 Kernel.

Like the task structure in the 2.4 kernel, the task structure in the 2.6 scheduler contains the state of the process, its policy, time slice, and real-time priority. It also holds the process's priority (derived from the -20 to 19 nice value), its static priority (100 to 139, used for the priority run queues), and a normal priority that accounts for interactivity. In addition to the new priority values, sleep_avg and sched_time are contained within the structure for interactivity calculations. The key elements of the task structure can be found in figure 2.

struct task_struct {
     . . .
     int prio, static_prio, normal_prio;
     struct prio_array *array;
     . . .
     unsigned long sleep_avg;
     unsigned long long timestamp, last_ran;
     unsigned long long sched_time; /* sched_clock time spent running */
     . . .
     unsigned int policy;
     . . .
     unsigned int time_slice, first_time_slice;
     . . .
     unsigned int rt_priority;
     . . .
};
Figure 2 - Key elements of the /include/linux/sched.h task structure in the 2.6.22.19 Kernel

The task structure in the new CFS scheduler contains the same priority and policy information as the previous scheduler but introduces two key structures, sched_class and sched_entity. The class of the process determines the set of mechanisms used to manipulate that process within the scheduler as well as in other parts of the system. At this point only classes for real-time processes (a policy of SCHED_FIFO or SCHED_RR) and normal processes (a policy of SCHED_OTHER or SCHED_BATCH) have been included in the kernel. An entity (the sched_entity structure shown in figure 3) contains the information for a process, or group of processes, that is to be scheduled. Its runtime, sleep time, virtual runtime, a flag indicating whether it is currently on the run queue, and its node in the red-black tree are included in this structure. Though using different implementations for different policies can become complex, it allows system developers to fine-tune their systems to their needs more easily than in previous kernels. Figure 4 shows the important elements of the task structure.

struct sched_entity {
     struct load_weight load; /* for load-balancing */
     struct rb_node run_node;
     unsigned int on_rq;

     u64 exec_start;
     u64 sum_exec_runtime;
     u64 vruntime;
     u64 prev_sum_exec_runtime;
     . . .
};
Figure 3 - Key elements of the /include/linux/sched.h
entity structure in the 2.6.25.9 Kernel.



struct task_struct {
     . . .
     int prio, static_prio, normal_prio;
     const struct sched_class *sched_class;
     struct sched_entity se;
     struct sched_rt_entity rt;
     . . .
     unsigned int policy;
     . . .
     struct list_head tasks;
     . . .
     unsigned int rt_priority;
     . . .
};
Figure 4 - Key elements of the /include/linux/sched.h
task structure in the 2.6.25.9 Kernel.

Time slices

Being able to calculate an appropriate time slice is a key aspect of any scheduler. If it produces a time slice that is too long, processes which require very little CPU time are often kept waiting. On the other hand, if the time slice is too short, the system is constantly switching between tasks, incurring the overhead of rescheduling and context switches. One major problem is that each process requires a different amount of execution time and there is no way to determine this amount accurately.

There are two pieces of information the 2.4 kernel uses to determine a process's time slice. The first is the "nice" value assigned to that process; the default nice value for all processes is 0, and it can be changed through various system calls. The second is the frequency of the system timer, the HZ variable. On an x86 system the timer is usually set to 100, i.e. a tick every 10ms. Figure 5 shows the calculation of the counter. Keep in mind that the counter is not an amount of time, but rather a number of ticks of the system timer.
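Figure 5 is not reproduced here. As a purely illustrative stand-in, the sketch below models the counter refill with a formula chosen to reproduce the 10ms-to-210ms range quoted in the next paragraph; the constant names and the exact expression are assumptions, not the kernel's own macros.

#include <stdio.h>

#define HZ          100              /* timer frequency on a typical x86 system */
#define MS_PER_TICK (1000 / HZ)      /* 10 ms per tick when HZ = 100 */

/* Hypothetical refill of a process's counter from its nice value, chosen so
 * that nice 19 gets 1 tick (10 ms) and nice -20 gets 21 ticks (210 ms). */
static long nice_to_ticks(long nice)
{
     return ((20 - nice) >> 1) + 1;
}

int main(void)
{
     long samples[] = { -20, 0, 19 };

     for (int i = 0; i < 3; i++)
          printf("nice %3ld -> %2ld ticks (%ld ms)\n",
                 samples[i], nice_to_ticks(samples[i]),
                 nice_to_ticks(samples[i]) * MS_PER_TICK);
     return 0;
}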

This gives time slices ranging from 10ms for a process with a nice value of 19 to 210ms for a process with a nice value of -20. As stated before, such short time slices incur considerable overhead with CPU-intensive workloads, as the system must perform context switches repeatedly even when only a few large processes are left to run.

Like the 2.4 kernel, the early releases of the 2.6 kernel use the nice value to help determine the length of a process's time slice. Two different calculations are used depending on the process's static priority, which is derived from the nice value. Assume the system has a default time slice of 100ms and the typical 140 priority levels, 40 of which are user-level priorities. Figure 6 shows the calculation for a process with a nice value of -1 to -20, or a static priority of 119 to 100:
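The original figure is likewise not reproduced. The sketch below models the two-branch scaling described above, assuming a 100ms base slice, a quadrupled base for negative-nice tasks, and an assumed 5ms floor; the function names are illustrative rather than the kernel's own.

#define MAX_PRIO       140      /* 0-99 real time, 100-139 normal */
#define MAX_USER_PRIO  40       /* the 40 nice-level priorities */
#define DEF_TIMESLICE  100      /* base time slice, in ms */
#define MIN_TIMESLICE  5        /* assumed floor, in ms */

static unsigned int scale_prio(unsigned int base, int static_prio)
{
     unsigned int slice = base * (MAX_PRIO - static_prio) / (MAX_USER_PRIO / 2);
     return slice > MIN_TIMESLICE ? slice : MIN_TIMESLICE;
}

/* Negative-nice tasks (static priority below 120) scale from a quadrupled
 * base, giving roughly 800 ms at nice -20 down to about 5 ms at nice 19. */
static unsigned int task_timeslice_ms(int static_prio)
{
     if (static_prio < 120)
          return scale_prio(DEF_TIMESLICE * 4, static_prio);
     return scale_prio(DEF_TIMESLICE, static_prio);
}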


The Completely Fair Scheduler takes a totally different approach to the amount of time a task has to execute. It calculates the time slice based on the number of running processes and each process's weight compared to the load of the system. It first calculates a "period", the length of time needed for every task with a policy of SCHED_OTHER or SCHED_BATCH to run once. This is done using the __sched_period() function defined in sched_fair.c. The idea is that if there are five or fewer processes needing the CPU, 20ms should be enough time for each of them to run; if the system becomes busier, the period is extended by 4ms for every process on the system. The next two pieces of information needed to determine the time slice are the weight of the process and the weight of the run queue. The process's weight is determined by its static priority, looked up in a static array of weight values ranging from 15 for a process with a nice value of 19 to 88761 for a process with a nice value of -20. Together with the sum of all process weights on the run queue, an appropriate amount of CPU time can be given to each process. (Hailperin) Figure 8 shows the calculation of a process's time slice:
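Figure 8 is not included here, but a minimal sketch of the period-and-share arithmetic just described is shown below; the function names are illustrative and values are kept in nanoseconds as the kernel does. With the weights 335, 1024 and 3121 from the example that follows (summing to 4480) and a 20ms period, it yields roughly 1.5ms, 4.6ms and 13.9ms, matching the table.

typedef unsigned long long u64;

/* The scheduling period: 20 ms covers up to five runnable tasks, after
 * which it grows by 4 ms per runnable task. */
static u64 sched_period_ns(unsigned long nr_running)
{
     if (nr_running <= 5)
          return 20000000ULL;
     return (u64)nr_running * 4000000ULL;
}

/* A task's slice is its weight's share of the period. */
static u64 sched_slice_ns(unsigned long weight, unsigned long rq_weight,
                          unsigned long nr_running)
{
     return sched_period_ns(nr_running) * weight / rq_weight;
}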


For example, with the given processes:

Process       P1        P2        P3
Nice           5         0        -5
Weight       335      1024      3121
Time slice   1.5 ms    4.5 ms    14 ms

As the example shows, at that moment, with only those three processes on the run queue, each was assigned a time slice based not only on its own nice value but in proportion to the nice values of all the other processes. Obviously, as processes enter and leave the system, these amounts will vary. With such short time slices, most would assume that the overhead of constant context switches would degrade the system's performance; however, through careful coding, developers were able to largely eliminate this overhead. (Molnar, 2007)

As in the previous schedulers, processes with a policy of SCHED_RR use a default time slice of 100ms, and SCHED_FIFO processes execute as if there were no limit on their CPU usage at all. Moving from a fixed time slice based on a process's own nice value to one based on its weight relative to all the other processes on the system is a major development for the Linux kernel.

Priorities

Selecting the process that should run next is a critical aspect of the scheduler. Though in the 2.4 kernel there is very little information available to make this decision, the scheduler uses a simple calculation in the goodness() function to determine the best candidate. It takes the sum of the time remaining in the time slice, the value of 20 less the nice value, and one additional credit if the process shares its address space with the currently running process (avoiding a memory-context switch). A bonus of 1000 is given to real-time processes with a policy of SCHED_FIFO or SCHED_RR to ensure that they run before all other processes. This calculation unfortunately depends largely on the program's nice value, which is not the best basis for scheduling processes.
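As a rough, simplified model of that goodness-style weighting (the real 2.4 function also handles yielding and SMP CPU affinity, and the types below are illustrative stand-ins):

#define SCHED_OTHER 0
#define SCHED_FIFO  1
#define SCHED_RR    2

struct task {
     int policy;               /* SCHED_OTHER, SCHED_FIFO or SCHED_RR */
     long counter;             /* ticks left in the current time slice */
     long nice;                /* -20 .. 19 */
     unsigned long rt_priority;
     const void *mm;           /* address space, for the shared-memory credit */
};

static int goodness(const struct task *p, const void *current_mm)
{
     int weight;

     if (p->policy != SCHED_OTHER)       /* real-time task: always wins */
          return 1000 + (int)p->rt_priority;

     weight = (int)p->counter;           /* time left in its slice */
     if (weight == 0)
          return 0;                      /* slice used up, not a candidate */
     if (p->mm == current_mm)            /* shares the running address space */
          weight += 1;                   /* avoids a memory-context switch */
     weight += 20 - (int)p->nice;        /* nicer tasks score lower */
     return weight;
}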

Having 100 real-time priority levels and 40 normal priority levels allows the scheduler in the 2.6 Linux kernel to easily determine the next process to run. Normal processes are placed in their run queue based on their nice value, -20 to 19, giving the 40 normal priority run queues. Real-time processes hold their initial priority for their whole lifetime unless it is changed through system calls, but normal processes have their priority adjusted throughout their execution. A process can have its priority level raised or lowered by up to five levels, based on the scheduler's rating of how interactive it is. The introduction of these interactivity credits was one of the most significant developments in the scheduler. Using two fairly complex calculations to determine whether a process is interactive, the scheduler can place an interactive process back in its queue so that it runs again once the other processes at that priority level have used their time slices. This gives the system a more responsive feel for the user and at the same time increases resource utilization by keeping interactive tasks ready to run as soon as their IO completes.
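A minimal sketch of that dynamic adjustment, assuming the bonus is derived from the task's accumulated sleep_avg and capped at five levels either way; the constants and names here are illustrative, not the exact kernel macros.

#define MAX_RT_PRIO    100                /* priorities 0-99 are real time */
#define MAX_PRIO       140
#define MAX_BONUS      10                 /* at most +/-5 levels of adjustment */
#define MAX_SLEEP_AVG  1000000000ULL      /* assumed ceiling on sleep_avg (ns) */

/* Interactive (frequently sleeping) tasks earn a bonus that raises their
 * dynamic priority by up to 5 levels; CPU hogs are pushed down by up to 5. */
static int effective_prio(int static_prio, unsigned long long sleep_avg)
{
     int bonus = (int)(sleep_avg * MAX_BONUS / MAX_SLEEP_AVG) - MAX_BONUS / 2;
     int prio  = static_prio - bonus;

     if (prio < MAX_RT_PRIO)
          prio = MAX_RT_PRIO;
     if (prio > MAX_PRIO - 1)
          prio = MAX_PRIO - 1;
     return prio;
}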

For real-time processes in the CFS, the scheduler first checks whether there are any on the run queue to be executed, stepping through the array from the highest priority downwards and executing them in the same way as the original 2.6 kernel scheduler. The major difference is seen when the scheduler executes a normal process, known in the scheduler as a "fair" process. As mentioned earlier, each process is given a share of runtime based on the number of other processes and the total weight of all the processes on the run queue. The CFS prioritizes its tasks by the amount of time each has run scaled by its weight, not by the raw amount of time it spent on the CPU. Figure 9 shows the calculation of virtual runtime in the Completely Fair Scheduler:
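Figure 9 is not reproduced here. Conceptually, virtual runtime accumulates actual CPU time scaled inversely by the task's weight, using the nice-0 weight of 1024 as the reference, so heavier tasks advance more slowly. A minimal sketch:

#define NICE_0_WEIGHT 1024ULL    /* weight of a nice-0 task in the static table */

/* Advance a task's virtual runtime after it has run for delta_exec_ns.
 * A task heavier than nice 0 accrues virtual time more slowly, so the
 * smallest vruntime always marks the task owed the most CPU. */
static unsigned long long update_vruntime(unsigned long long vruntime,
                                          unsigned long long delta_exec_ns,
                                          unsigned long long weight)
{
     return vruntime + delta_exec_ns * NICE_0_WEIGHT / weight;
}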

Run queues

Another key aspect of any scheduler is the technique used to store the processes that are waiting on the run queue. The technique must be efficient, as delays here can hinder the performance of the whole system.

The 2.4 kernel uses a doubly-linked list to hold all the processes available to be scheduled. The scheduler examines all of the processes to find the best one to execute based on the goodness value mentioned earlier, stepping through them one by one using the next_task pointer declared within each task structure. Though a doubly-linked list is efficient to traverse, it is not kept sorted, and therefore every process must be analysed on each invocation of the scheduler. The more active processes there are on the system, the longer it takes to calculate the goodness values and find the process best fit to run.

The 2.6 kernel takes a more complex approach to organizing processes in its run queue. Each CPU has a master run queue that holds all the active processes for that CPU. This structure holds two arrays: one for tasks waiting to use their assigned time slices (active) and one for tasks that have used up their time slices (expired). Each array is further organized into 140 doubly-linked lists, one for every priority level in the system, holding the processes of that priority. Once a process has expired its time slice, it is moved to the expired array. Once the active array is empty, the two arrays are switched and the process starts again. Figure 10 shows a graphical representation of the run queue.

To reduce any overhead from analysing priority levels where there are no processes present, the scheduler keeps a 140-bit bitmap in each of the active and expired arrays to indicate whether there are any processes to be executed at each priority level. Using the bitmap, the scheduler can quickly identify the highest priority level with a runnable task and skip over empty levels.
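A simplified model of this per-CPU arrangement follows; the structures and helper are illustrative stand-ins for the kernel's prio_array, list_head and bitmap helpers.

#include <limits.h>

#define MAX_PRIO      140
#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

struct list_head { struct list_head *next, *prev; };

/* One priority array: a doubly-linked list per priority level plus a
 * bitmap marking which levels currently hold runnable tasks. */
struct prio_array {
     unsigned int     nr_active;           /* total tasks in this array */
     unsigned long    bitmap[(MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];
     struct list_head queue[MAX_PRIO];     /* one list per priority level */
};

struct runqueue {
     struct prio_array *active;            /* tasks still holding a time slice */
     struct prio_array *expired;           /* tasks whose slice has run out */
     struct prio_array  arrays[2];
};

/* The kernel scans the bitmap a word at a time (only a handful of words for
 * 140 bits), so the lookup is effectively constant time; the plain bit loop
 * below just shows the idea. */
static int first_active_prio(const struct prio_array *a)
{
     for (unsigned int prio = 0; prio < MAX_PRIO; prio++)
          if (a->bitmap[prio / BITS_PER_LONG] & (1UL << (prio % BITS_PER_LONG)))
               return (int)prio;
     return -1;                            /* array is empty */
}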

The CFS scheduler in the later releases of the 2.6 Linux kernel uses a red-black tree to sort the processes. The tree is ordered by virtual runtime, with the lowest value always at the left-most node, which allows the scheduler to quickly determine which process needs the most attention. The sched_entity within the task structure contains an rb_node structure that holds the information needed to manipulate the tree: pointers to its parent node, the node to its left, and the node to its right, along with the color of the node. System developers use this information to build the mechanisms that insert, delete, and reposition nodes in the tree, meaning the mechanisms can be adjusted if needed. When a process has finished its turn on the CPU it is put back into the tree at the appropriate position depending on how long it ran. For example, figure 11 gives a snapshot of a red-black tree of processes with their virtual runtimes; it indicates that P5 would be the next to execute, as it is the left-most node in the tree.

If the newly calculated virtual runtime for P5 were 200ms, figure 12 shows how the tree would be rebalanced so that P3 becomes the next process to be scheduled.
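In code terms, picking the next fair task reduces to walking to the leftmost node; a small illustrative sketch follows. The kernel's rb_node packs the colour into the parent pointer and caches the leftmost node, but the selection logic amounts to this walk.

/* Minimal red-black tree node and "pick next" walk. */
struct rb_node {
     struct rb_node *parent;
     struct rb_node *left;
     struct rb_node *right;
     int is_red;                /* node colour */
};

/* The task with the smallest virtual runtime is always the leftmost node. */
static struct rb_node *pick_leftmost(struct rb_node *root)
{
     if (!root)
          return 0;             /* empty tree: nothing to schedule */
     while (root->left)
          root = root->left;
     return root;
}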



The Scheduler

The structure of the scheduler function also plays a critical role in the performance of the system. If it is not properly designed it can increase process wait and turn-around time and decrease the throughput of the system.

The 2.4 scheduler has two major downfalls that can lead to unacceptable system performance. After moving any round-robin tasks to the end of the run queue, the scheduler inspects every task in the linked-list run queue, using the goodness() function, to decide which process should be scheduled. Once the best task has been found, it checks whether any time remains in that task's counter; if not, it must recalculate the counters for every task in the run queue. This means that at some point the scheduler may need to cycle through all of the tasks on the run queue twice. With five or six processes this may not seem like an issue, but as the number of processes increases, cycling through the run queue even once can add a substantial amount of time to each process's wait and turnaround time, and throughput drops dramatically.
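A sketch of that double pass, reusing the struct task, goodness() and nice_to_ticks() stand-ins from the earlier sketches and modelling the run queue as a plain array for clarity:

/* Select the next task 2.4-style: score every runnable task with
 * goodness(); if every counter has hit zero, refill them all and retry. */
static struct task *schedule_next(struct task *tasks, int n,
                                  const void *current_mm)
{
     struct task *best;
     int best_weight;

     if (n == 0)
          return 0;

retry:
     best = 0;
     best_weight = 0;
     for (int i = 0; i < n; i++) {           /* first O(n) pass: score everyone */
          int w = goodness(&tasks[i], current_mm);
          if (w > best_weight) {
               best_weight = w;
               best = &tasks[i];
          }
     }

     if (best_weight == 0) {                 /* every slice is used up */
          for (int i = 0; i < n; i++)        /* second O(n) pass: refill counters */
               tasks[i].counter = (tasks[i].counter >> 1)
                                  + nice_to_ticks(tasks[i].nice);
          goto retry;
     }
     return best;
}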

The early releases of the 2.6 Linux kernel scheduler solved these problems by tracking a process's progress with the active and expired arrays. There is no longer any need to cycle through a linked-list run queue to find a process that needs to be scheduled. When a process expires its time slice, the time slice is recalculated immediately and the process is placed in the expired array. By keeping a count of all the processes in the active array, the scheduler can quickly identify when it needs to switch the active and expired arrays. Once the scheduler has the appropriate array to work with, the use of a bitmap speeds things up even more by avoiding the need to examine every element of the array. This means the scheduler can hold as many processes as it needs without drastically affecting the performance of the system.
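The array switch itself is nothing more than a constant-time pointer swap; a sketch reusing the runqueue model from the earlier run-queue sketch:

/* When the active array runs dry the two arrays simply swap roles; no
 * task is moved or re-sorted, so the cost is constant. */
static void maybe_switch_arrays(struct runqueue *rq)
{
     if (rq->active->nr_active == 0) {
          struct prio_array *tmp = rq->active;

          rq->active  = rq->expired;
          rq->expired = tmp;
     }
}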

The CFS scheduler handles real-time processes in a similar way. Once a real-time process has expired its time slice, it is marked to be rescheduled and placed back at the appropriate position in the run queue array, and when a real-time process is to be scheduled next, the scheduler simply uses the bitmap to find the first runnable task in the queue. For normal processes, however, the process is removed from the red-black tree and re-inserted at the proper position after its virtual runtime has been updated. When it comes time to schedule a new process, the scheduler chooses the left-most node of the tree, as it has the lowest virtual runtime of all the processes. This continuous put-back and pick-next design offers a tremendous performance gain over the previous schedulers, as the processes never need to be sorted in bulk and no arrays need to be switched.

Conclusion

As this comparison shows, years of hard work and continuous testing and tuning have paid off with respect to process scheduling in Linux. Both server and desktop computing have benefited from the theories and mechanisms introduced with each new scheduler, from the priority-based scheduler in the 2.4 Linux kernel to the Completely Fair Scheduler found in the latest releases.

The main downfall of the 2.4 kernel's process scheduler is the delay that occurs when time slices need to be recalculated. This overhead may not have a significant impact with only a handful of processes, but it gets considerably worse as more processes are added to the system. The same is true when the scheduler is choosing the next task to execute: because a process's "goodness" changes, the scheduler must recalculate it for every process each time it schedules another one. The task structure also does not provide sufficient information about the process or its history to determine its priority well. A priority based merely on the process's nice value, the time left in its counter, and whether it shares the running process's address space is far from ideal; the structure contains no information about sleep time or CPU usage that would allow for some form of interactivity-based priority. Not only does the way the priorities are calculated restrict performance, but using a linked list to store the processes and their priorities creates additional overhead, since the scheduler must cycle through the whole list when choosing the next process to execute. The scheduler was not a complete failure: the narrow range of short time slices allowed it to keep response and turnaround times low and throughput reasonably high. Still, it was clearly in need of improvement, and the redesign in the 2.6 series of the Linux kernel aimed to correct these shortcomings.

One of the few aspects of the 2.6 kernel scheduler that hinders performance is the wide range of large time slices used for processes. With a range from a few milliseconds up to 800ms, processes with large time slices may increase the wait times of lower-priority processes with very small time slices. This can force IO-bound processes to wait behind CPU hogs even though they only need a fraction of their assigned time slice, which in turn decreases the responsiveness of the system. Though the interactivity mechanisms in the scheduler help avoid this issue, it takes time for the scheduler to gather statistics on such processes. The first major improvement of the scheduler was the way it handles priority and time slice calculation: when a process is created or expires its time slice, both are recalculated immediately, avoiding long delays as the number of processes on the system grows. The efficiency of sorting and searching the prioritized processes also saves precious time throughout the operation of the scheduler. The use of active and expired arrays avoids any unnecessary manipulation of a linked list, and a bitmap keeps track of which priorities contain active processes, avoiding checks of empty priority levels. Though the scheduler is fast and efficient, the kernel developers knew that further change was necessary to improve its performance.

A new approach to process scheduling was introduced to Linux with the development of the Completely Fair Scheduler. Rather than determining the amount of CPU time each task is allocated based solely on its own attributes, the allocation is based on all of the processes active on the system. One potential downfall of the CFS is that, within the period in which all processes are to execute, the allocated time slices can be very short and result in context-switch overhead; however, with careful coding, developers were able to reduce this overhead and yield better results. These short time slices also reduce the response time of the processes. To allow efficient sorting of tasks by their virtual runtimes, the developers opted for a red-black tree. By design, insertion and deletion of nodes in a red-black tree are fast, which is crucial given the ongoing creation and termination of processes in a system. Though searching for a particular node in a red-black tree takes longer, this does not affect the performance of the scheduler, as it always chooses the leftmost node in the tree.

Though the schedulers found in the 2.4 and early 2.6 kernels were able to schedule the system's load efficiently, it is clear that the new CFS process scheduler will be prominent in upcoming releases of the Linux kernel. As in the past, there will be many more changes to its design as developers fine-tune it to be ever more efficient.

Works Cited

Aas, J. (2005, February 17). Understanding the Linux 2.6.8.1 CPU Scheduler
Silicon Graphics, Inc. http://joshaas.net/linux/linux_cpu_scheduler.pdf

Aivazian, T. (2001, April 24). Linux Kernel 2.4 Internals
http://linux-int.carnet.hr/knjige/LinuxkernelInternals2_4.pdf

Bovet, D. P., & Cesati, M. (2000, October). Understanding the Linux Kernel
O’Reilly & Associates, Inc. http://oreilly.com/catalog/linuxkernel/chapter/ch10.html

Corbet (2006, June 22). Trees II: red-black trees
LWN.net. http://lwn.net/Articles/184495/

Hailperin, M. Changes in the Linux Scheduler as of 2.6.23.
http://gustavus.edu/+max/os-book/updates/CFS.html

Kumar, A. (2008, January 08). Multiprocessing with the Completely Fair Scheduler
IBM. http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

Linux 2.6.25.9 Source Documentation. rbtree.txt
http://lxr.linux.no/linux+v2.6.25.9/Documentation/rbtree.txt

Linux 2.6.25.9 Source Documentation. Sched-design-CFS.txt
http://lxr.linux.no/linux+v2.6.25.9/Documentation/scheduler/sched-design-CFS.txt

Molnar, I. (2007, August 2). Email to Nick Piggin: lmbench ctxsw regression with CFS
http://kerneltrap.org/node/14055