Commit | Line | Data |
---|---|---|
dd3629b5 TG |
1 | High resolution timers and dynamic ticks design notes |
2 | ----------------------------------------------------- | |
3 | ||
4 | Further information can be found in the paper of the OLS 2006 talk "hrtimers | |
5 | and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can | |
6 | be found on the OLS website: | |
7 | http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf | |
8 | ||
9 | The slides to this talk are available from: | |
10 | http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf | |
11 | ||
12 | The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the | |
13 | changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the | |
14 | design of the Linux time(r) system before hrtimers and other building blocks | |
15 | got merged into mainline. | |
16 | ||
17 | Note: the paper and the slides talk about "clock event source", while we | |
18 | have switched to the name "clock event devices" in the meantime. | |
19 | ||
20 | The design contains the following basic building blocks: | |
21 | ||
22 | - hrtimer base infrastructure | |
23 | - timeofday and clock source management | |
24 | - clock event management | |
25 | - high resolution timer functionality | |
26 | - dynamic ticks | |
27 | ||
28 | ||
29 | hrtimer base infrastructure | |
30 | --------------------------- | |
31 | ||
32 | The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of | |
395cf969 | 33 | the base implementation are covered in Documentation/timers/hrtimers.txt. See |
dd3629b5 TG |
34 | also figure #2 (OLS slides p. 15). |
35 | ||
36 | The main differences from the timer wheel, which holds the armed timer_list | |
37 | type timers, are: | |
38 | - time ordered enqueueing into a rb-tree | |
39 | - independent of ticks (the processing is based on nanoseconds) | |
40 | ||
41 | ||
42 | timeofday and clock source management | |
43 | ------------------------------------- | |
44 | ||
45 | John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of | |
46 | code out of the architecture-specific areas into a generic management | |
47 | framework, as illustrated in figure #3 (OLS slides p. 18). The architecture | |
48 | specific portion is reduced to the low level hardware details of the clock | |
49 | sources, which are registered in the framework and selected on a quality based | |
50 | decision. The low level code provides hardware setup and readout routines and | |
51 | initializes data structures, which are used by the generic time keeping code to | |
52 | convert the clock ticks to nanosecond based time values. All other time keeping | |
53 | related functionality is moved into the generic code. The GTOD base patch got | |
54 | merged into the 2.6.18 kernel. | |
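
The conversion of clock ticks to nanosecond based values described above is
typically done with a multiplication and a shift rather than a division. The
following is a minimal sketch of that idea, assuming a counter running at a
fixed frequency. The demo_* names are illustrative only and are not the
kernel's data structures or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified clock source description (illustration only). */
struct demo_clocksource {
	uint32_t mult;   /* nanoseconds per cycle, scaled by 2^shift */
	uint32_t shift;  /* scale exponent */
};

/*
 * Choose mult so that (cycles * mult) >> shift approximates
 * cycles * 1e9 / freq_hz. The result is rounded to the nearest integer.
 */
static struct demo_clocksource demo_clocksource_init(uint32_t freq_hz,
						     uint32_t shift)
{
	struct demo_clocksource cs;
	uint64_t mult;

	assert(freq_hz != 0 && shift < 32);
	mult = (((uint64_t)1000000000u << shift) + freq_hz / 2) / freq_hz;
	assert(mult <= UINT32_MAX);

	cs.mult = (uint32_t)mult;
	cs.shift = shift;
	return cs;
}

/*
 * Convert a cycle count to nanoseconds. The product can overflow 64 bits
 * for very large cycle counts, so callers are assumed to pass deltas.
 */
static uint64_t demo_cycles_to_ns(const struct demo_clocksource *cs,
				  uint64_t cycles)
{
	return (cycles * cs->mult) >> cs->shift;
}
```

With an assumed 1 MHz counter and a shift of 10, mult becomes 1024000 and one
cycle converts to 1000 ns.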
55 | ||
56 | Further information about the Generic Time Of Day framework is available in the | |
57 | OLS 2005 Proceedings Volume 1: | |
58 | http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf | |
59 | ||
60 | The paper "We Are Not Getting Any Younger: A New Approach to Time and | |
61 | Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan. | |
62 | ||
63 | Figure #3 (OLS slides p.18) illustrates the transformation. | |
64 | ||
65 | ||
66 | clock event management | |
67 | ---------------------- | |
68 | ||
69 | While clock sources provide read access to the monotonically increasing time | |
70 | value, clock event devices are used to schedule the next event | |
71 | interrupt(s). The next event is currently defined to be periodic, with its | |
72 | period defined at compile time. The setup and selection of the event device | |
73 | for various event driven functionalities is hardwired into the architecture | |
74 | dependent code. This results in duplicated code across all architectures and | |
75 | makes it extremely difficult to change the configuration of the system to use | |
76 | event interrupt devices other than those already built into the | |
77 | architecture. Another implication of the current design is that it is necessary | |
78 | to touch all the architecture-specific implementations in order to provide new | |
79 | functionality like high resolution timers or dynamic ticks. | |
80 | ||
81 | The clock events subsystem tries to address this problem by providing a generic | |
82 | solution to manage clock event devices and their usage for the various clock | |
83 | event driven kernel functionalities. The goal of the clock event subsystem is | |
84 | to minimize the clock event related architecture dependent code to the pure | |
85 | hardware related handling and to allow easy addition and utilization of new | |
86 | clock event devices. It also minimizes the duplicated code across the | |
87 | architectures as it provides generic functionality down to the interrupt | |
88 | service handler, which is almost inherently hardware dependent. | |
89 | ||
90 | Clock event devices are registered either by the architecture dependent boot | |
91 | code or at module insertion time. Each clock event device fills a data | |
92 | structure with clock-specific property parameters and callback functions. The | |
93 | clock event management decides, by using the specified property parameters, the | |
94 | set of system functions a clock event device will be used to support. This | |
95 | includes the distinction of per-CPU and per-system global event devices. | |
96 | ||
97 | System-level global event devices are used for the Linux periodic tick. Per-CPU | |
98 | event devices are used to provide local CPU functionality such as process | |
99 | accounting, profiling, and high resolution timers. | |
100 | ||
b40b5162 | 101 | The management layer assigns one or more of the following functions to a clock |
dd3629b5 TG |
102 | event device: |
103 | - system global periodic tick (jiffies update) | |
104 | - cpu local update_process_times | |
105 | - cpu local profiling | |
106 | - cpu local next event interrupt (non periodic mode) | |
107 | ||
108 | The clock event device delegates the selection of those timer interrupt related | |
109 | functions completely to the management layer. The clock management layer stores | |
110 | a function pointer in the device description structure, which has to be called | |
111 | from the hardware level handler. This removes a lot of duplicated code from the | |
112 | architecture specific timer interrupt handlers and hands the control over the | |
113 | clock event devices and the assignment of timer interrupt related functionality | |
114 | to the core code. | |
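
The division of labour described above can be sketched as follows: the device
describes its properties, the management layer stores a handler in the device
structure, and the hardware level interrupt code only calls that handler. This
is a simplified model under stated assumptions. The demo_* names, the feature
flags and the counters are hypothetical and do not reproduce the kernel's
definitions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FEAT_PERIODIC 0x1u  /* device can fire at a fixed period */
#define DEMO_FEAT_ONESHOT  0x2u  /* device can be programmed per event */

/* Hypothetical, simplified device description (illustration only). */
struct demo_clock_event_device {
	const char *name;
	unsigned int features;
	int rating;
	/* Installed by the management layer, called by the hardware handler. */
	void (*event_handler)(struct demo_clock_event_device *dev);
	unsigned long ticks;           /* demo bookkeeping: periodic events */
	unsigned long oneshot_events;  /* demo bookkeeping: next event interrupts */
};

static void demo_handle_periodic(struct demo_clock_event_device *dev)
{
	dev->ticks++;
}

static void demo_handle_oneshot(struct demo_clock_event_device *dev)
{
	dev->oneshot_events++;
}

/* Management layer: every registered device starts in periodic mode. */
static void demo_register_device(struct demo_clock_event_device *dev)
{
	assert(dev != NULL);
	dev->event_handler = demo_handle_periodic;
}

/* Management layer: switch to next event mode if the device supports it. */
static int demo_switch_to_oneshot(struct demo_clock_event_device *dev)
{
	if (!(dev->features & DEMO_FEAT_ONESHOT))
		return -1;
	dev->event_handler = demo_handle_oneshot;
	return 0;
}

/* Architecture code: acknowledge the hardware, then defer to the core. */
static void demo_timer_interrupt(struct demo_clock_event_device *dev)
{
	if (dev->event_handler)
		dev->event_handler(dev);
}
```

The interrupt code never decides what the interrupt is used for. Changing the
assigned function only requires the management layer to swap the handler.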
115 | ||
116 | The clock event layer API is rather small. Aside from the clock event device | |
117 | registration interface it provides functions to schedule the next event | |
118 | interrupt, clock event device notification service and support for suspend and | |
119 | resume. | |
120 | ||
121 | The framework adds about 700 lines of code which results in a 2KB increase of | |
122 | the kernel binary size. The conversion of i386 removes about 100 lines of | |
123 | code. The binary size decrease is in the range of 400 bytes. We believe that the | |
124 | increase of flexibility and the avoidance of duplicated code across | |
125 | architectures justifies the slight increase of the binary size. | |
126 | ||
127 | The conversion of an architecture has no functional impact, but makes it possible to | |
d9195881 | 128 | utilize the high resolution and dynamic tick functionalities without any change |
dd3629b5 TG |
129 | to the clock event device and timer interrupt code. After the conversion the |
130 | enabling of high resolution timers and dynamic ticks is simply provided by | |
131 | adding the kernel/time/Kconfig file to the architecture specific Kconfig and | |
132 | adding the dynamic tick specific calls to the idle routine (a total of 3 lines | |
133 | added to the idle function and the Kconfig file). | |
134 | ||
135 | Figure #4 (OLS slides p.20) illustrates the transformation. | |
136 | ||
137 | ||
138 | high resolution timer functionality | |
139 | ----------------------------------- | |
140 | ||
141 | During system boot it is not possible to use the high resolution timer | |
142 | functionality; making it possible would be difficult and would serve no | |
143 | useful function. The initialization of the clock event device framework, the | |
144 | clock source framework (GTOD) and hrtimers itself has to be done and | |
145 | appropriate clock sources and clock event devices have to be registered before | |
146 | the high resolution functionality can work. Up to the point where hrtimers are | |
147 | initialized, the system works in the usual low resolution periodic mode. The | |
148 | clock source and the clock event device layers provide notification functions | |
149 | which inform hrtimers about availability of new hardware. hrtimers validates | |
150 | the usability of the registered clock sources and clock event devices before | |
151 | switching to high resolution mode. This also ensures that a kernel which is | |
152 | configured for high resolution timers can run on a system which lacks the | |
153 | necessary hardware support. | |
154 | ||
155 | The high resolution timer code does not support SMP machines which have only | |
156 | global clock event devices. The support of such hardware would involve IPI | |
157 | calls when an interrupt happens. The overhead would be much larger than the | |
158 | benefit. This is the reason why we currently disable high resolution and | |
159 | dynamic ticks on i386 SMP systems which stop the local APIC in C3 power | |
160 | state. A workaround is available as an idea, but the problem has not been | |
161 | tackled yet. | |
162 | ||
163 | The time ordered insertion of timers provides all the infrastructure to decide | |
164 | whether the event device has to be reprogrammed when a timer is added. The | |
165 | decision is made per timer base and synchronized across per-cpu timer bases in | |
166 | a support function. The design allows the system to utilize separate per-CPU | |
167 | clock event devices for the per-CPU timer bases, but currently only one | |
168 | reprogrammable clock event device per-CPU is utilized. | |
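
The reprogramming decision follows directly from the time ordering: the event
device only needs to be touched when the newly added timer becomes the first
one to expire. The sketch below illustrates this with a sorted singly linked
list standing in for the red-black tree, which keeps the example short but
makes insertion O(n) instead of O(log n). All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_timer {
	uint64_t expires_ns;      /* absolute expiry time in nanoseconds */
	struct demo_timer *next;
};

struct demo_timer_base {
	struct demo_timer *first; /* earliest expiring timer, or NULL */
};

/*
 * Insert the timer in expiry order. Returns true when the new timer is now
 * the earliest one in the base, i.e. when the caller would have to
 * reprogram the clock event device.
 */
static bool demo_enqueue(struct demo_timer_base *base, struct demo_timer *t)
{
	struct demo_timer **link;

	assert(base != NULL && t != NULL);

	/* Walk past all timers that expire no later than the new one. */
	link = &base->first;
	while (*link != NULL && (*link)->expires_ns <= t->expires_ns)
		link = &(*link)->next;

	t->next = *link;
	*link = t;

	return base->first == t;
}
```

Adding timers that expire at 500 ns, 900 ns and 100 ns in that order would
request reprogramming for the first and the third, but not for the second.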
169 | ||
170 | When the timer interrupt happens, the next event interrupt handler is called | |
171 | from the clock event distribution code and moves expired timers from the | |
172 | red-black tree to a separate doubly linked list and invokes the softirq | |
173 | handler. An additional mode field in the hrtimer structure allows the system to | |
174 | execute callback functions directly from the next event interrupt handler. This | |
175 | is restricted to code which can safely be executed in the hard interrupt | |
176 | context. This applies, for example, to the common case of a wakeup function as | |
177 | used by nanosleep. The advantage of executing the handler in the interrupt | |
178 | context is the avoidance of up to two context switches - from the interrupted | |
179 | context to the softirq and to the task which is woken up by the expired | |
180 | timer. | |
181 | ||
182 | Once a system has switched to high resolution mode, the periodic tick is | |
183 | switched off. This disables the per system global periodic clock event device - | |
184 | e.g. the PIT on i386 SMP systems. | |
185 | ||
186 | The periodic tick functionality is provided by a per-cpu hrtimer. The callback | |
187 | function is executed in the next event interrupt context and updates jiffies | |
188 | and calls update_process_times and profiling. The implementation of the hrtimer | |
189 | based periodic tick is designed to be extended with dynamic tick functionality. | |
190 | This makes it possible to use a single clock event device to schedule high | |
191 | resolution timer and periodic events (jiffies tick, profiling, process accounting) on UP | |
192 | systems. This has been proven to work with the PIT on i386 and the Incrementer | |
193 | on PPC. | |
194 | ||
195 | The softirq for running the hrtimer queues and executing the callbacks has been | |
196 | separated from the tick bound timer softirq to allow accurate delivery of high | |
197 | resolution timer signals which are used by itimer and POSIX interval | |
198 | timers. The execution of this softirq can still be delayed by other softirqs, | |
199 | but the overall latencies have been significantly improved by this separation. | |
200 | ||
201 | Figure #5 (OLS slides p.22) illustrates the transformation. | |
202 | ||
203 | ||
204 | dynamic ticks | |
205 | ------------- | |
206 | ||
207 | Dynamic ticks are the logical consequence of the hrtimer based periodic tick | |
208 | replacement (sched_tick). The functionality of the sched_tick hrtimer is | |
209 | extended by three functions: | |
210 | ||
211 | - hrtimer_stop_sched_tick | |
212 | - hrtimer_restart_sched_tick | |
213 | - hrtimer_update_jiffies | |
214 | ||
215 | hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code | |
216 | evaluates the next scheduled timer event (from both hrtimers and the timer | |
217 | wheel) and, if the next event is further away than the next tick, it | |
218 | reprograms the sched_tick to this future event, to allow longer idle sleeps | |
219 | without worthless interruption by the periodic tick. The function is also | |
220 | called when an interrupt happens during the idle period, which does not cause a | |
221 | reschedule. The call is necessary as the interrupt handler might have armed a | |
222 | new timer whose expiry time is before the time which was identified as the | |
223 | nearest event in the previous call to hrtimer_stop_sched_tick. | |
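
The decision made on idle entry can be sketched as a pure function: take the
earlier of the next hrtimer and the next timer wheel expiry, and only defer
the tick when that event lies beyond the next regular tick. This is a
simplified model, not the actual implementation. The function name and its
parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the absolute time (in nanoseconds) for which the sched_tick should
 * be programmed when the CPU goes idle.
 */
static uint64_t demo_idle_next_wakeup(uint64_t now_ns,
				      uint64_t tick_period_ns,
				      uint64_t next_hrtimer_ns,
				      uint64_t next_wheel_ns)
{
	uint64_t next_tick_ns;
	uint64_t next_event_ns;

	assert(tick_period_ns > 0);

	next_tick_ns = now_ns + tick_period_ns;
	next_event_ns = next_hrtimer_ns < next_wheel_ns ?
			next_hrtimer_ns : next_wheel_ns;

	/* Only skip ticks when nothing is due before the next regular tick. */
	return next_event_ns > next_tick_ns ? next_event_ns : next_tick_ns;
}
```

Because an interrupt during idle may arm an earlier timer, the same
evaluation has to be repeated after such an interrupt with the updated expiry
values.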
224 | ||
225 | hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before | |
226 | it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick, | |
227 | which is kept active until the next call to hrtimer_stop_sched_tick(). | |
228 | ||
229 | hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens | |
230 | in the idle period to make sure that jiffies are up to date and the interrupt | |
231 | handler does not have to deal with a possibly stale jiffy value. | |
232 | ||
233 | The dynamic tick feature provides statistical values which are exported to | |
234 | userspace via /proc/stat and can be made available for enhanced power | |
235 | management control. | |
236 | ||
237 | The implementation leaves room for further development like full tickless | |
238 | systems, where the time slice is controlled by the scheduler, variable | |
239 | frequency profiling, and a complete removal of jiffies in the future. | |
240 | ||
241 | ||
242 | Aside from the current initial submission of i386 support, the patchset has been | |
243 | extended to x86_64 and ARM already. Initial (work in progress) support is also | |
244 | available for MIPS and PowerPC. | |
245 | ||
246 | Thomas, Ingo | |
247 | ||
248 | ||
249 |