Introduction
In recent years, the adoption of the Linux operating system into the desktop and server markets has increased at a nearly exponential rate. Linux has begun to fill many of the roles previously occupied by other operating systems. Linux is now being installed on many servers and other utility machines performing such tasks as CGI or CAD. These systems, as well as many modern desktop systems, typically contain multiple processors or processors containing multiple cores. In a recent paper (Lindsley, Rick. The New 2.6 Scheduler, Linux Journal, March 2004.), Rick Lindsley examines the flaws in the 2.4 Linux kernel scheduler, and the differences between scheduler in the 2.4 series kernels, and the 2.6 series.
In the 2.6 series of kernels, there are some fundamental differences in the way the scheduler works, when compared to the 2.4 series predecessors. This is a result of Linux’s increasing usage in symmetric multi-processing (SMP) environments. Due to certain architectural flaws, the 2.4 series scheduler was unable to scale well to SMP environments.
The 2.4 Series Scheduler
To properly understand the problem which faced the kernel engineers, one needs to have an understanding of the operation of the scheduler in the 2.4 series kernels.
The scheduler of the 2.4 series kernels was of exceptionally simple design. The processes are all stored on a single list, termed the ‘tasklist’. The tasklist is an unsorted linked list containing process references. Each process reference contains relevant process description and control information, as well as what is termed the ‘goodness rating’. The goodness rating determined the priority of a process and was the criteria by which a process was selected for running.
There were four main factors influencing the goodness rating:
- The primary ‘goodness’ factor was the number of allocated clock ticks remaining to a process. Upon being given processor resources, a process is allocated a certain number of clock cycles for which it is given control of the resource. If a process gives up the processor to wait for I/O, or is interrupted involuntarily, the process is awarded with increased ‘goodness’ when it is again ready to be processed.
- By using a certain system call, processes could advise the scheduler that they would like to remain on the current processor, even if another should open up first. This is called CPU affinity. This is done because it is expensive to move large amounts of memory between processes.
- User priority can be set by a root user using the ‘nice’ command. This allows a process’ priority to be elevated or lowered by a large range.
- Whether or not the process is a real time process influenced the priority greatly. This ensured that real time tasks could consistently reach their deadlines.
In order to chose a process to run, the scheduler would examine the tasklist, visiting each process within the list, and select the process with the highest goodness rating. Because the list was not ordered in any way, this operation was always O(n) in the best case.
This system worked fairly well for single-cored single-processor machines, which were common during the majority of the life cycle of the 2.4 series kernels. In 2001 when work began on the 2.5 series of experimental kernels, the kernel engineering community began to notice that their kernel was now being run on a large percentage of multi-processor or multi-cored machines, and that the 2.4 series scheduler did not scale well to meet the increased complexities.
The main issue with the 2.4 scheduler lay in that the tasklist was guarded by a simple spinlock. In a multi-processing system, there were multiple instances of scheduling threads running, each choosing processes to be run upon it’s own processor.
Because of the O(n) nature of the tasklist, this led to increased overhead as systems became busier, and because there existed only one tasklist, only one processor’s scheduler could search it at a time. Often, several processors would be in the process of polling the tasklist for the highest priority task, only to find that a previous task they had chosen had been given out by a different instance of the scheduler, causing the processes to have to begin checking the list again from the beginning.
The 2.6 Series Scheduler
It was quite apparent to the kernel maintainers that something needed to be done with the scheduler. It was decided that the scheduler would be redesigned into an O(1) design.
The primary difference between the 2.4 scheduler and the 2.6 scheduler lays in the fact that while the 2.4 scheduler contains a single active tasklist containing all processes, the 2.6 scheduler creates a separate queue for each of the 140 different possible priorities on each processor.
As a process is created, or moves out of the Running state, it’s priority is determined, and it is placed into the queue representing it’s priority on the CPU it has just finished on. The CPU, in order to determine which process to take next, merely performs a bit mask operation on the status bits of the 140 different queues, and takes a process out of the highest priority non-empty queue.
This results in nearly O(1) operation. It does, however, lead to a few questions about load balancing. The scheme for ensuring that each processor is kept busy lies in that each processor, at 200ms intervals, checks to see if any other processor is out of balance. If another processor is out of balance, then some processes are shifted to the processor with the lighter load. If a processor finds itself idle, it will poll at 1ms intervals until it finds another processor from which it can obtain new processes.
Conclusions and Critique
It is obvious to anyone involved in Computer Science that at least the near future of computing lies in multiprocessing. The 2.4 series scheduler’s inability to scale well to the demands of the more complex SMP environments was severely hurting Linux’s performance, creating a need which desperately needed to be addressed. The new 2.6 series scheduler has been designed in such a way that it is far better suited to the SMP environment than it’s predecessor. The performance benefits of the new scheduler are even apparent on uniprocessor systems; due to the O(1) nature of the new scheduler, the amount overhead has been drastically reduced.
Lindsley’s article provided a unique look into the mind of a kernel engineer. I was most impressed by the simplistic, elegant solution which was provided to the scheduler problem. This article does leave me with one question, however. If the original scheduler was O(n), why did they settle for that? The scheduler is an extremely critical component of the operating system. I would have expected that long before the SMP issue someone would have implemented this or a similar solution.
Regardless of my own curiosity, this article was concise, and well written, with intent to be understandable for both the typical user as well as the kernel engineer, yielding an educational look into a very important topic in Computer Science.