Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | Review Checklist for RCU Patches |
2 | ||
3 | ||
4 | This document contains a checklist for producing and reviewing patches | |
5 | that make use of RCU. Violating any of the rules listed below will | |
6 | result in the same sorts of problems that leaving out a locking primitive | |
7 | would cause. This list is based on experiences reviewing such patches | |
8 | over a rather long period of time, but improvements are always welcome! | |
9 | ||
10 | 0. Is RCU being applied to a read-mostly situation? If the data | |
4c54005c PM |
11 | structure is updated more than about 10% of the time, then you |
12 | should strongly consider some other approach, unless detailed | |
13 | performance measurements show that RCU is nonetheless the right | |
14 | tool for the job. Yes, RCU does reduce read-side overhead by | |
15 | increasing write-side overhead, which is exactly why normal uses | |
16 | of RCU will do much more reading than updating. | |
1da177e4 | 17 | |
32300751 PM |
18 | Another exception is where performance is not an issue, and RCU |
19 | provides a simpler implementation. An example of this situation | |
20 | is the dynamic NMI code in the Linux 2.6 kernel, at least on | |
21 | architectures where NMIs are rare. | |
22 | ||
23 | Yet another exception is where the low real-time latency of RCU's | |
24 | read-side primitives is critically important. | |
1da177e4 LT |
25 | |
26 | 1. Does the update code have proper mutual exclusion? | |
27 | ||
28 | RCU does allow -readers- to run (almost) naked, but -writers- must | |
29 | still use some sort of mutual exclusion, such as: | |
30 | ||
31 | a. locking, | |
32 | b. atomic operations, or | |
33 | c. restricting updates to a single task. | |
34 | ||
35 | If you choose #b, be prepared to describe how you have handled | |
36 | memory barriers on weakly ordered machines (pretty much all of | |
4c54005c PM |
37 | them -- even x86 allows later loads to be reordered to precede |
38 | earlier stores), and be prepared to explain why this added | |
39 | complexity is worthwhile. If you choose #c, be prepared to | |
40 | explain how this single task does not become a major bottleneck on | |
41 | big multiprocessor machines (for example, if the task is updating | |
42 | information relating to itself that other tasks can read, there | |
43 | by definition can be no bottleneck). | |
1da177e4 LT |
44 | |
45 | 2. Do the RCU read-side critical sections make proper use of | |
46 | rcu_read_lock() and friends? These primitives are needed | |
32300751 PM |
47 | to prevent grace periods from ending prematurely, which |
48 | could result in data being unceremoniously freed out from | |
49 | under your read-side code, which can greatly increase the | |
50 | actuarial risk of your kernel. | |
1da177e4 | 51 | |
dd81eca8 | 52 | As a rough rule of thumb, any dereference of an RCU-protected |
4c54005c PM |
53 | pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(), |
54 | rcu_read_lock_sched(), or by the appropriate update-side lock. | |
55 | Disabling of preemption can serve as rcu_read_lock_sched(), but | |
56 | is less readable. | |
dd81eca8 | 57 | |
1da177e4 LT |
58 | 3. Does the update code tolerate concurrent accesses? |
59 | ||
60 | The whole point of RCU is to permit readers to run without | |
61 | any locks or atomic operations. This means that readers will | |
62 | be running while updates are in progress. There are a number | |
63 | of ways to handle this concurrency, depending on the situation: | |
64 | ||
32300751 | 65 | a. Use the RCU variants of the list and hlist update |
4c54005c PM |
66 | primitives to add, remove, and replace elements on |
67 | an RCU-protected list. Alternatively, use the other | |
68 | RCU-protected data structures that have been added to | |
69 | the Linux kernel. | |
32300751 PM |
70 | |
71 | This is almost always the best approach. | |
72 | ||
73 | b. Proceed as in (a) above, but also maintain per-element | |
74 | locks (that are acquired by both readers and writers) | |
75 | that guard per-element state. Of course, fields that | |
4c54005c PM |
76 | the readers refrain from accessing can be guarded by |
77 | some other lock acquired only by updaters, if desired. | |
32300751 PM |
78 | |
79 | This works quite well, also. | |
80 | ||
81 | c. Make updates appear atomic to readers. For example, | |
4c54005c PM |
82 | pointer updates to properly aligned fields will |
83 | appear atomic, as will individual atomic primitives. | |
84 | Sequences of perations performed under a lock will -not- | |
85 | appear to be atomic to RCU readers, nor will sequences | |
86 | of multiple atomic primitives. | |
1da177e4 | 87 | |
32300751 | 88 | This can work, but is starting to get a bit tricky. |
1da177e4 | 89 | |
32300751 | 90 | d. Carefully order the updates and the reads so that |
1da177e4 LT |
91 | readers see valid data at all phases of the update. |
92 | This is often more difficult than it sounds, especially | |
93 | given modern CPUs' tendency to reorder memory references. | |
94 | One must usually liberally sprinkle memory barriers | |
95 | (smp_wmb(), smp_rmb(), smp_mb()) through the code, | |
96 | making it difficult to understand and to test. | |
97 | ||
98 | It is usually better to group the changing data into | |
99 | a separate structure, so that the change may be made | |
100 | to appear atomic by updating a pointer to reference | |
101 | a new structure containing updated values. | |
102 | ||
103 | 4. Weakly ordered CPUs pose special challenges. Almost all CPUs | |
4c54005c PM |
104 | are weakly ordered -- even x86 CPUs allow later loads to be |
105 | reordered to precede earlier stores. RCU code must take all of | |
106 | the following measures to prevent memory-corruption problems: | |
1da177e4 LT |
107 | |
108 | a. Readers must maintain proper ordering of their memory | |
109 | accesses. The rcu_dereference() primitive ensures that | |
110 | the CPU picks up the pointer before it picks up the data | |
111 | that the pointer points to. This really is necessary | |
112 | on Alpha CPUs. If you don't believe me, see: | |
113 | ||
114 | http://www.openvms.compaq.com/wizard/wiz_2637.html | |
115 | ||
116 | The rcu_dereference() primitive is also an excellent | |
117 | documentation aid, letting the person reading the code | |
118 | know exactly which pointers are protected by RCU. | |
4c54005c PM |
119 | Please note that compilers can also reorder code, and |
120 | they are becoming increasingly aggressive about doing | |
121 | just that. The rcu_dereference() primitive therefore | |
122 | also prevents destructive compiler optimizations. | |
123 | ||
124 | The rcu_dereference() primitive is used by the | |
125 | various "_rcu()" list-traversal primitives, such | |
126 | as the list_for_each_entry_rcu(). Note that it is | |
127 | perfectly legal (if redundant) for update-side code to | |
128 | use rcu_dereference() and the "_rcu()" list-traversal | |
129 | primitives. This is particularly useful in code that | |
c598a070 PM |
130 | is common to readers and updaters. However, lockdep |
131 | will complain if you access rcu_dereference() outside | |
132 | of an RCU read-side critical section. See lockdep.txt | |
133 | to learn what to do about this. | |
134 | ||
135 | Of course, neither rcu_dereference() nor the "_rcu()" | |
136 | list-traversal primitives can substitute for a good | |
137 | concurrency design coordinating among multiple updaters. | |
1da177e4 | 138 | |
a83f1fe2 PM |
139 | b. If the list macros are being used, the list_add_tail_rcu() |
140 | and list_add_rcu() primitives must be used in order | |
141 | to prevent weakly ordered machines from misordering | |
142 | structure initialization and pointer planting. | |
1da177e4 | 143 | Similarly, if the hlist macros are being used, the |
a83f1fe2 | 144 | hlist_add_head_rcu() primitive is required. |
1da177e4 | 145 | |
a83f1fe2 PM |
146 | c. If the list macros are being used, the list_del_rcu() |
147 | primitive must be used to keep list_del()'s pointer | |
148 | poisoning from inflicting toxic effects on concurrent | |
149 | readers. Similarly, if the hlist macros are being used, | |
150 | the hlist_del_rcu() primitive is required. | |
151 | ||
4c54005c PM |
152 | The list_replace_rcu() and hlist_replace_rcu() primitives |
153 | may be used to replace an old structure with a new one | |
154 | in their respective types of RCU-protected lists. | |
155 | ||
156 | d. Rules similar to (4b) and (4c) apply to the "hlist_nulls" | |
157 | type of RCU-protected linked lists. | |
a83f1fe2 | 158 | |
4c54005c | 159 | e. Updates must ensure that initialization of a given |
1da177e4 LT |
160 | structure happens before pointers to that structure are |
161 | publicized. Use the rcu_assign_pointer() primitive | |
162 | when publicizing a pointer to a structure that can | |
163 | be traversed by an RCU read-side critical section. | |
164 | ||
32300751 PM |
165 | 5. If call_rcu(), or a related primitive such as call_rcu_bh() or |
166 | call_rcu_sched(), is used, the callback function must be | |
167 | written to be called from softirq context. In particular, | |
168 | it cannot block. | |
1da177e4 | 169 | |
a83f1fe2 | 170 | 6. Since synchronize_rcu() can block, it cannot be called from |
4c54005c PM |
171 | any sort of irq context. The same rule applies for |
172 | synchronize_rcu_bh(), synchronize_sched(), synchronize_srcu(), | |
173 | synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(), | |
174 | synchronize_sched_expedite(), and synchronize_srcu_expedited(). | |
175 | ||
176 | The expedited forms of these primitives have the same semantics | |
177 | as the non-expedited forms, but expediting is both expensive | |
178 | and unfriendly to real-time workloads. Use of the expedited | |
179 | primitives should be restricted to rare configuration-change | |
180 | operations that would not normally be undertaken while a real-time | |
181 | workload is running. | |
182 | ||
183 | 7. If the updater uses call_rcu() or synchronize_rcu(), then the | |
184 | corresponding readers must use rcu_read_lock() and | |
185 | rcu_read_unlock(). If the updater uses call_rcu_bh() or | |
186 | synchronize_rcu_bh(), then the corresponding readers must | |
187 | use rcu_read_lock_bh() and rcu_read_unlock_bh(). If the | |
188 | updater uses call_rcu_sched() or synchronize_sched(), then | |
189 | the corresponding readers must disable preemption, possibly | |
190 | by calling rcu_read_lock_sched() and rcu_read_unlock_sched(). | |
191 | If the updater uses synchronize_srcu(), the the corresponding | |
192 | readers must use srcu_read_lock() and srcu_read_unlock(), | |
193 | and with the same srcu_struct. The rules for the expedited | |
194 | primitives are the same as for their non-expedited counterparts. | |
195 | Mixing things up will result in confusion and broken kernels. | |
1da177e4 LT |
196 | |
197 | One exception to this rule: rcu_read_lock() and rcu_read_unlock() | |
198 | may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() | |
199 | in cases where local bottom halves are already known to be | |
200 | disabled, for example, in irq or softirq context. Commenting | |
201 | such cases is a must, of course! And the jury is still out on | |
202 | whether the increased speed is worth it. | |
203 | ||
32300751 PM |
204 | 8. Although synchronize_rcu() is slower than is call_rcu(), it |
205 | usually results in simpler code. So, unless update performance | |
206 | is critically important or the updaters cannot block, | |
165d6c78 PM |
207 | synchronize_rcu() should be used in preference to call_rcu(). |
208 | ||
209 | An especially important property of the synchronize_rcu() | |
210 | primitive is that it automatically self-limits: if grace periods | |
211 | are delayed for whatever reason, then the synchronize_rcu() | |
212 | primitive will correspondingly delay updates. In contrast, | |
213 | code using call_rcu() should explicitly limit update rate in | |
214 | cases where grace periods are delayed, as failing to do so can | |
215 | result in excessive realtime latencies or even OOM conditions. | |
216 | ||
217 | Ways of gaining this self-limiting property when using call_rcu() | |
218 | include: | |
219 | ||
220 | a. Keeping a count of the number of data-structure elements | |
5cc6517a PM |
221 | used by the RCU-protected data structure, including |
222 | those waiting for a grace period to elapse. Enforce a | |
223 | limit on this number, stalling updates as needed to allow | |
224 | previously deferred frees to complete. Alternatively, | |
225 | limit only the number awaiting deferred free rather than | |
226 | the total number of elements. | |
227 | ||
228 | One way to stall the updates is to acquire the update-side | |
229 | mutex. (Don't try this with a spinlock -- other CPUs | |
230 | spinning on the lock could prevent the grace period | |
231 | from ever ending.) Another way to stall the updates | |
232 | is for the updates to use a wrapper function around | |
233 | the memory allocator, so that this wrapper function | |
234 | simulates OOM when there is too much memory awaiting an | |
235 | RCU grace period. There are of course many other | |
236 | variations on this theme. | |
165d6c78 PM |
237 | |
238 | b. Limiting update rate. For example, if updates occur only | |
239 | once per hour, then no explicit rate limiting is required, | |
240 | unless your system is already badly broken. The dcache | |
241 | subsystem takes this approach -- updates are guarded | |
242 | by a global lock, limiting their rate. | |
243 | ||
244 | c. Trusted update -- if updates can only be done manually by | |
245 | superuser or some other trusted user, then it might not | |
246 | be necessary to automatically limit them. The theory | |
247 | here is that superuser already has lots of ways to crash | |
248 | the machine. | |
249 | ||
250 | d. Use call_rcu_bh() rather than call_rcu(), in order to take | |
251 | advantage of call_rcu_bh()'s faster grace periods. | |
252 | ||
253 | e. Periodically invoke synchronize_rcu(), permitting a limited | |
254 | number of updates per grace period. | |
1da177e4 | 255 | |
4c54005c PM |
256 | The same cautions apply to call_rcu_bh() and call_rcu_sched(). |
257 | ||
1da177e4 | 258 | 9. All RCU list-traversal primitives, which include |
34d7c2b3 | 259 | rcu_dereference(), list_for_each_entry_rcu(), |
1da177e4 | 260 | list_for_each_continue_rcu(), and list_for_each_safe_rcu(), |
32300751 PM |
261 | must be either within an RCU read-side critical section or |
262 | must be protected by appropriate update-side locks. RCU | |
1da177e4 LT |
263 | read-side critical sections are delimited by rcu_read_lock() |
264 | and rcu_read_unlock(), or by similar primitives such as | |
c598a070 PM |
265 | rcu_read_lock_bh() and rcu_read_unlock_bh(), in which case |
266 | the matching rcu_dereference() primitive must be used in order | |
267 | to keep lockdep happy, in this case, rcu_dereference_bh(). | |
1da177e4 | 268 | |
32300751 PM |
269 | The reason that it is permissible to use RCU list-traversal |
270 | primitives when the update-side lock is held is that doing so | |
271 | can be quite helpful in reducing code bloat when common code is | |
50aec002 PM |
272 | shared between readers and updaters. Additional primitives |
273 | are provided for this case, as discussed in lockdep.txt. | |
1da177e4 LT |
274 | |
275 | 10. Conversely, if you are in an RCU read-side critical section, | |
32300751 PM |
276 | and you don't hold the appropriate update-side lock, you -must- |
277 | use the "_rcu()" variants of the list macros. Failing to do so | |
4c54005c PM |
278 | will break Alpha, cause aggressive compilers to generate bad code, |
279 | and confuse people trying to read your code. | |
a83f1fe2 PM |
280 | |
281 | 11. Note that synchronize_rcu() -only- guarantees to wait until | |
282 | all currently executing rcu_read_lock()-protected RCU read-side | |
283 | critical sections complete. It does -not- necessarily guarantee | |
284 | that all currently running interrupts, NMIs, preempt_disable() | |
285 | code, or idle loops will complete. Therefore, if you do not have | |
286 | rcu_read_lock()-protected read-side critical sections, do -not- | |
287 | use synchronize_rcu(). | |
288 | ||
4c54005c PM |
289 | Similarly, disabling preemption is not an acceptable substitute |
290 | for rcu_read_lock(). Code that attempts to use preemption | |
291 | disabling where it should be using rcu_read_lock() will break | |
292 | in real-time kernel builds. | |
293 | ||
294 | If you want to wait for interrupt handlers, NMI handlers, and | |
295 | code under the influence of preempt_disable(), you instead | |
296 | need to use synchronize_irq() or synchronize_sched(). | |
d19720a9 PM |
297 | |
298 | 12. Any lock acquired by an RCU callback must be acquired elsewhere | |
240ebbf8 PM |
299 | with softirq disabled, e.g., via spin_lock_irqsave(), |
300 | spin_lock_bh(), etc. Failing to disable irq on a given | |
4c54005c PM |
301 | acquisition of that lock will result in deadlock as soon as |
302 | the RCU softirq handler happens to run your RCU callback while | |
303 | interrupting that acquisition's critical section. | |
621934ee | 304 | |
ef48bd24 PM |
305 | 13. RCU callbacks can be and are executed in parallel. In many cases, |
306 | the callback code simply wrappers around kfree(), so that this | |
307 | is not an issue (or, more accurately, to the extent that it is | |
308 | an issue, the memory-allocator locking handles it). However, | |
309 | if the callbacks do manipulate a shared data structure, they | |
310 | must use whatever locking or other synchronization is required | |
311 | to safely access and/or modify that data structure. | |
312 | ||
32300751 PM |
313 | RCU callbacks are -usually- executed on the same CPU that executed |
314 | the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(), | |
315 | but are by -no- means guaranteed to be. For example, if a given | |
316 | CPU goes offline while having an RCU callback pending, then that | |
317 | RCU callback will execute on some surviving CPU. (If this was | |
318 | not the case, a self-spawning RCU callback would prevent the | |
319 | victim CPU from ever going offline.) | |
320 | ||
c598a070 PM |
321 | 14. SRCU (srcu_read_lock(), srcu_read_unlock(), srcu_dereference(), |
322 | synchronize_srcu(), and synchronize_srcu_expedited()) may only | |
323 | be invoked from process context. Unlike other forms of RCU, it | |
324 | -is- permissible to block in an SRCU read-side critical section | |
325 | (demarked by srcu_read_lock() and srcu_read_unlock()), hence the | |
326 | "SRCU": "sleepable RCU". Please note that if you don't need | |
327 | to sleep in read-side critical sections, you should be using | |
328 | RCU rather than SRCU, because RCU is almost always faster and | |
329 | easier to use than is SRCU. | |
621934ee PM |
330 | |
331 | Also unlike other forms of RCU, explicit initialization | |
332 | and cleanup is required via init_srcu_struct() and | |
333 | cleanup_srcu_struct(). These are passed a "struct srcu_struct" | |
334 | that defines the scope of a given SRCU domain. Once initialized, | |
335 | the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock() | |
4c54005c PM |
336 | synchronize_srcu(), and synchronize_srcu_expedited(). A given |
337 | synchronize_srcu() waits only for SRCU read-side critical | |
338 | sections governed by srcu_read_lock() and srcu_read_unlock() | |
339 | calls that have been passed the same srcu_struct. This property | |
340 | is what makes sleeping read-side critical sections tolerable -- | |
341 | a given subsystem delays only its own updates, not those of other | |
342 | subsystems using SRCU. Therefore, SRCU is less prone to OOM the | |
343 | system than RCU would be if RCU's read-side critical sections | |
344 | were permitted to sleep. | |
621934ee PM |
345 | |
346 | The ability to sleep in read-side critical sections does not | |
347 | come for free. First, corresponding srcu_read_lock() and | |
348 | srcu_read_unlock() calls must be passed the same srcu_struct. | |
349 | Second, grace-period-detection overhead is amortized only | |
350 | over those updates sharing a given srcu_struct, rather than | |
351 | being globally amortized as they are for other forms of RCU. | |
352 | Therefore, SRCU should be used in preference to rw_semaphore | |
353 | only in extremely read-intensive situations, or in situations | |
354 | requiring SRCU's read-side deadlock immunity or low read-side | |
355 | realtime latency. | |
356 | ||
50aec002 PM |
357 | Note that, rcu_assign_pointer() relates to SRCU just as they do |
358 | to other forms of RCU. | |
0612ea00 PM |
359 | |
360 | 15. The whole point of call_rcu(), synchronize_rcu(), and friends | |
361 | is to wait until all pre-existing readers have finished before | |
362 | carrying out some otherwise-destructive operation. It is | |
363 | therefore critically important to -first- remove any path | |
364 | that readers can follow that could be affected by the | |
365 | destructive operation, and -only- -then- invoke call_rcu(), | |
366 | synchronize_rcu(), or friends. | |
367 | ||
4c54005c PM |
368 | Because these primitives only wait for pre-existing readers, it |
369 | is the caller's responsibility to guarantee that any subsequent | |
370 | readers will execute safely. | |
240ebbf8 | 371 | |
4c54005c PM |
372 | 16. The various RCU read-side primitives do -not- necessarily contain |
373 | memory barriers. You should therefore plan for the CPU | |
374 | and the compiler to freely reorder code into and out of RCU | |
375 | read-side critical sections. It is the responsibility of the | |
376 | RCU update-side primitives to deal with this. | |
84483ea4 PM |
377 | |
378 | 17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and | |
379 | the __rcu sparse checks to validate your RCU code. These | |
380 | can help find problems as follows: | |
381 | ||
382 | CONFIG_PROVE_RCU: check that accesses to RCU-protected data | |
383 | structures are carried out under the proper RCU | |
384 | read-side critical section, while holding the right | |
385 | combination of locks, or whatever other conditions | |
386 | are appropriate. | |
387 | ||
388 | CONFIG_DEBUG_OBJECTS_RCU_HEAD: check that you don't pass the | |
389 | same object to call_rcu() (or friends) before an RCU | |
390 | grace period has elapsed since the last time that you | |
391 | passed that same object to call_rcu() (or friends). | |
392 | ||
393 | __rcu sparse checks: tag the pointer to the RCU-protected data | |
394 | structure with __rcu, and sparse will warn you if you | |
395 | access that pointer without the services of one of the | |
396 | variants of rcu_dereference(). | |
397 | ||
398 | These debugging aids can help you find problems that are | |
399 | otherwise extremely difficult to spot. |