Commit | Line | Data |
---|---|---|
81da251a MD |
1 | '\" t |
2 | .\" Copyright 2015-2023 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> | |
41478a42 | 3 | .\" |
81da251a | 4 | .\" SPDX-License-Identifier: Linux-man-pages-copyleft |
41478a42 | 5 | .\" |
81da251a | 6 | .TH rseq 2 (date) "Linux man-pages (unreleased)" |
41478a42 | 7 | .SH NAME |
81da251a | 8 | rseq \- restartable sequences system call |
41478a42 MD |
9 | .SH SYNOPSIS |
10 | .nf | |
81da251a MD |
11 | .PP |
12 | .BR "#include <linux/rseq.h>" \ | |
13 | " /* Definition of " RSEQ_* " constants and rseq types */" | |
14 | .BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */" | |
15 | .B #include <unistd.h> | |
16 | .PP | |
17 | .BI "int syscall(SYS_rseq, struct rseq *_Nullable " rseq ", uint32_t " rseq_len \ | |
18 | ", int " flags ", uint32_t " sig "); | |
19 | .fi | |
20 | .PP | |
21 | .IR Note : | |
22 | glibc provides no wrapper for | |
23 | .BR rseq (), | |
24 | necessitating the use of | |
25 | .BR syscall (2). | |
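Because there is no wrapper, callers usually define a small helper around
syscall(2). The following is a minimal sketch, not a definitive
implementation: the names sys_rseq() and rseq_syscall_present() are
hypothetical, and the probe assumes that a NULL area with a length of 0 is
always rejected by a kernel that implements the call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/rseq.h>

/* Hypothetical helper: forwards its arguments to the raw system call. */
static int sys_rseq(struct rseq *rseq, uint32_t rseq_len, int flags,
                    uint32_t sig)
{
    return (int) syscall(SYS_rseq, rseq, rseq_len, flags, sig);
}

/*
 * Hypothetical probe: issue a deliberately invalid request (NULL area,
 * length 0) and treat any error other than ENOSYS as a sign that the
 * kernel implements rseq().  Nothing is registered by this call.
 */
static bool rseq_syscall_present(void)
{
    if (sys_rseq(NULL, 0, 0, 0) == 0)
        return true;    /* not expected for an invalid request */
    return errno != ENOSYS;
}
```

A program would typically call rseq_syscall_present() once at startup and
select its fall-back path when it returns false.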
41478a42 | 26 | .SH DESCRIPTION |
81da251a | 27 | .PP |
841a0f9b MD |
28 | The |
29 | .BR rseq () | |
30 | ABI accelerates specific user-space operations by registering a | |
81da251a MD |
31 | per-thread data structure shared between kernel and user-space. |
32 | This data structure can be read from or written to by user-space to skip | |
841a0f9b | 33 | otherwise expensive system calls. |
81da251a | 34 | .PP |
da5633b4 MD |
35 | A restartable sequence is a sequence of instructions guaranteed to be executed |
36 | atomically with respect to other threads and signal handlers on the current | |
81da251a MD |
37 | CPU. |
38 | If its execution does not complete atomically, the kernel changes the | |
39 | execution flow by jumping to an abort handler defined by user-space for | |
40 | that restartable sequence. | |
41 | .PP | |
da5633b4 | 42 | Using restartable sequences requires registering a |
81da251a MD |
43 | rseq ABI per-thread data structure ( |
44 | .B struct rseq | |
45 | ) through the | |
da5633b4 | 46 | .BR rseq () |
81da251a MD |
47 | system call. |
48 | Only one rseq ABI can be registered per thread, so user-space libraries | |
49 | and applications must follow a user-space ABI defining how to share this | |
50 | resource. | |
51 | The ABI defining how to share this resource between applications and | |
52 | libraries is defined by the C library. | |
841a0f9b MD |
53 | Allocation of the per-thread rseq ABI and its registration to the kernel |
54 | is handled by glibc since version 2.35. | |
81da251a | 55 | .PP |
841a0f9b | 56 | The rseq ABI per-thread data structure contains a |
da5633b4 | 57 | .I rseq_cs |
81da251a MD |
58 | field which points to the currently executing critical section. |
59 | For each thread, a single rseq critical section can run at any given | |
60 | point. | |
61 | Each critical section needs to be implemented in assembly. | |
62 | .PP | |
41478a42 MD |
63 | The |
64 | .BR rseq () | |
65 | ABI accelerates user-space operations on per-cpu data by defining a | |
66 | shared data structure ABI between each user-space thread and the kernel. | |
81da251a | 67 | .PP |
41478a42 MD |
68 | It allows user-space to perform update operations on per-cpu data |
69 | without requiring heavy-weight atomic operations. | |
81da251a | 70 | .PP |
41478a42 | 71 | The term CPU used in this documentation refers to a hardware execution |
81da251a MD |
72 | context. |
73 | For instance, each CPU number returned by | |
da5633b4 | 74 | .BR sched_getcpu () |
81da251a MD |
75 | is a CPU. |
76 | The current CPU refers to the CPU on which the registered thread is | |
da5633b4 | 77 | running. |
81da251a | 78 | .PP |
41478a42 | 79 | Restartable sequences are atomic with respect to preemption (making them |
81da251a MD |
80 | atomic with respect to other threads running on the same CPU), |
81 | as well as signal delivery (user-space execution contexts nested over | |
82 | the same thread). | |
83 | They either complete atomically with respect to preemption on the | |
84 | current CPU and signal delivery, or they are aborted. | |
85 | .PP | |
da5633b4 | 86 | Restartable sequences are suited for update operations on per-cpu data. |
81da251a | 87 | .PP |
da5633b4 | 88 | Restartable sequences can be used on data structures shared between threads |
81da251a MD |
89 | within a process, |
90 | and on data structures shared between threads across different | |
91 | processes. | |
41478a42 | 92 | .PP |
81da251a MD |
93 | Some examples of operations that can be accelerated or improved by this ABI: |
94 | .IP \(bu 3 | |
41478a42 | 95 | Memory allocator per-cpu free-lists, |
81da251a | 96 | .IP \(bu 3 |
41478a42 | 97 | Querying the current CPU number, |
81da251a | 98 | .IP \(bu 3 |
41478a42 | 99 | Incrementing per-CPU counters, |
81da251a | 100 | .IP \(bu 3 |
41478a42 | 101 | Modifying data protected by per-CPU spinlocks, |
81da251a | 102 | .IP \(bu 3 |
41478a42 | 103 | Inserting/removing elements in per-CPU linked-lists, |
81da251a | 104 | .IP \(bu 3 |
41478a42 | 105 | Writing/reading per-CPU ring buffer content, |
81da251a MD |
106 | .IP \(bu 3 |
107 | Accurately reading performance monitoring unit counters with respect to | |
108 | thread migration. | |
41478a42 | 109 | .PP |
81da251a MD |
110 | Restartable sequences must not perform system calls. |
111 | Doing so may result in termination of the process by a segmentation | |
112 | fault. | |
41478a42 MD |
113 | .PP |
114 | The | |
115 | .I rseq | |
116 | argument is a pointer to the thread-local rseq structure to be shared | |
6a78527e | 117 | between kernel and user-space. |
41478a42 | 118 | .PP |
841a0f9b | 119 | The structure |
41478a42 | 120 | .B struct rseq |
81da251a MD |
121 | is an extensible structure. |
122 | Additional feature fields can be added in future kernel versions. | |
123 | Its layout is as follows: | |
41478a42 MD |
124 | .TP |
125 | .B Structure alignment | |
81da251a MD |
126 | This structure is aligned on either a 32-byte boundary, |
127 | or on the alignment value returned by | |
128 | .I getauxval( | |
129 | .B AT_RSEQ_ALIGN | |
130 | ) | |
841a0f9b | 131 | if the structure size differs from 32 bytes. |
41478a42 MD |
132 | .TP |
133 | .B Structure size | |
81da251a MD |
134 | This structure size needs to be at least 32 bytes. |
135 | It can be either 32 bytes, | |
136 | or it needs to be large enough to hold the result of | |
137 | .I getauxval( | |
138 | .B AT_RSEQ_FEATURE_SIZE | |
139 | ) . | |
841a0f9b | 140 | Its size is passed as the rseq_len argument to the rseq system call. |
81da251a | 141 | .RS |
da5633b4 | 142 | .PP |
da5633b4 MD |
143 | .EX |
144 | struct rseq { | |
145 | __u32 cpu_id_start; | |
146 | __u32 cpu_id; | |
147 | union { | |
148 | /* Edited out for conciseness. [...] */ | |
149 | } rseq_cs; | |
150 | __u32 flags; | |
841a0f9b MD |
151 | __u32 node_id; |
152 | __u32 mm_cid; | |
da5633b4 MD |
153 | } __attribute__((aligned(32))); |
154 | .EE | |
81da251a | 155 | .RE |
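The size and alignment rules above can be sketched as follows. This is an
illustration under stated assumptions: the helper names and
RSEQ_ORIG_SIZE are hypothetical, and the fallback values 27 and 28 for
AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN are only used when the installed
headers do not define them.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/auxv.h>

#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27  /* assumption: value from the kernel headers */
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN 28         /* assumption: value from the kernel headers */
#endif

#define RSEQ_ORIG_SIZE 32        /* hypothetical name for the 32-byte layout */

struct rseq_area_layout {
    size_t size;    /* bytes to pass as rseq_len */
    size_t align;   /* required alignment of the area */
};

/* Pick 32/32, or the auxiliary-vector values when the kernel wants more. */
static struct rseq_area_layout rseq_area_layout(void)
{
    struct rseq_area_layout l = { RSEQ_ORIG_SIZE, RSEQ_ORIG_SIZE };
    unsigned long feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
    unsigned long align = getauxval(AT_RSEQ_ALIGN);

    if (feature_size > RSEQ_ORIG_SIZE) {
        l.size = feature_size;
        if (align != 0)
            l.align = align;
    }
    return l;
}

/* Allocate a zeroed area; the allocation is padded to a multiple of align. */
static void *rseq_area_alloc(struct rseq_area_layout l)
{
    size_t padded = (l.size + l.align - 1) / l.align * l.align;
    void *p = aligned_alloc(l.align, padded);

    if (p != NULL)
        memset(p, 0, padded);
    return p;
}
```

getauxval(3) returns 0 for entries the kernel does not provide, so on
kernels without the extensible layout this sketch falls back to the
original 32-byte size and alignment.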
41478a42 MD |
156 | .TP |
157 | .B Fields | |
81da251a | 158 | .RS |
41478a42 | 159 | .I cpu_id_start |
81da251a | 160 | .RS |
841a0f9b | 161 | Always-updated value of the CPU number on which the registered thread is |
81da251a MD |
162 | running. |
163 | Its value is guaranteed to always be a possible CPU number, | |
164 | even when rseq is not registered. | |
165 | Its value should always be confirmed by reading the cpu_id field before | |
166 | user-space performs any side-effect | |
167 | (e.g. storing to memory). | |
168 | .PP | |
841a0f9b | 169 | This field is always guaranteed to hold a valid CPU number in the range |
81da251a MD |
170 | [ 0 .. nr_possible_cpus - 1 ]. |
171 | It can therefore be loaded by user-space and used as an offset in | |
172 | per-cpu data structures without having to check whether its value is | |
173 | within the valid bounds compared to the number of possible CPUs in the | |
174 | system. | |
175 | .PP | |
176 | Initialized by user-space to a possible CPU number (e.g., 0), | |
177 | updated by the kernel for threads registered with rseq. | |
178 | .PP | |
6a78527e MD |
179 | For user-space applications executed on a kernel without rseq support, |
180 | the cpu_id_start field stays initialized at 0, which is indeed a valid | |
81da251a MD |
181 | CPU number. |
182 | It is therefore valid to use it as an offset in per-cpu data structures, | |
183 | and only validate whether it's actually the current CPU number by | |
184 | comparing it with the cpu_id field within the rseq critical section. | |
185 | If the kernel does not provide rseq support, that cpu_id field stays | |
186 | initialized at -1, | |
187 | so the comparison always fails, as intended. | |
188 | .PP | |
841a0f9b | 189 | This field should only be read by the thread which registered this data |
81da251a MD |
190 | structure. |
191 | Aligned on 32-bit. | |
192 | .PP | |
da5633b4 MD |
193 | It is up to user-space to implement a fall-back mechanism for scenarios where |
194 | rseq is not available. | |
81da251a MD |
195 | .RE |
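The index-then-validate pattern described for cpu_id_start can be sketched
in C as shown below. This only illustrates the order of the loads: a real
user must perform the comparison and the store inside an assembly critical
section, which plain C cannot express. The structure rseq_cpu_fields and
both helper names are hypothetical; the structure mirrors only the first
two fields of struct rseq.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the first two fields of struct rseq. */
struct rseq_cpu_fields {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
};

/*
 * Load cpu_id_start and use it directly as an index into per-cpu data.
 * No bounds check is done, relying on the guarantee that the field always
 * holds a possible CPU number.  The array must have one slot per possible
 * CPU.
 */
static long *pick_slot(const struct rseq_cpu_fields *r, long *percpu,
                       uint32_t *cpu)
{
    *cpu = __atomic_load_n(&r->cpu_id_start, __ATOMIC_RELAXED);
    return &percpu[*cpu];
}

/*
 * Confirm the choice against cpu_id before any side effect.  When rseq is
 * not registered, cpu_id stays at -1 and this check always fails.
 */
static bool still_on_cpu(const struct rseq_cpu_fields *r, uint32_t cpu)
{
    return __atomic_load_n(&r->cpu_id, __ATOMIC_RELAXED) == cpu;
}
```

When still_on_cpu() returns false, the caller takes its fall-back path
instead of storing to the selected slot.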
196 | .PP | |
41478a42 | 197 | .I cpu_id |
81da251a | 198 | .RS |
841a0f9b | 199 | Always-updated value of the CPU number on which the registered thread is |
81da251a MD |
200 | running. |
201 | Initialized by user-space to -1, | |
202 | updated by the kernel for threads registered with rseq. | |
203 | .PP | |
841a0f9b | 204 | This field should only be read by the thread which registered this data |
81da251a MD |
205 | structure. |
206 | Aligned on 32-bit. | |
207 | .RE | |
208 | .PP | |
41478a42 | 209 | .I rseq_cs |
81da251a MD |
210 | .RS |
211 | The rseq_cs field is a pointer to a | |
212 | .B struct rseq_cs . | |
213 | It is NULL when no rseq assembly block critical section is active for | |
214 | the registered thread. | |
215 | Setting it to point to a critical section descriptor ( | |
216 | .B struct rseq_cs | |
217 | ) marks the beginning of the critical section. | |
218 | .PP | |
da5633b4 | 219 | Initialized by user-space to NULL. |
81da251a | 220 | .PP |
da5633b4 MD |
221 | Updated by user-space, which sets the address of the currently |
222 | active rseq_cs at the beginning of the assembly instruction sequence | |
81da251a MD |
223 | block, |
224 | and set to NULL by the kernel when it restarts an assembly instruction | |
225 | sequence block, | |
226 | as well as when the kernel detects that it is preempting or delivering a | |
227 | signal outside of the range targeted by the rseq_cs. | |
228 | Also needs to be set to NULL by user-space before reclaiming memory that | |
229 | contains the targeted | |
230 | .B struct rseq_cs . | |
231 | .PP | |
da5633b4 | 232 | Read and set by the kernel. |
81da251a | 233 | .PP |
841a0f9b | 234 | This field should only be updated by the thread which registered this |
81da251a MD |
235 | data structure. |
236 | Aligned on 64-bit. | |
237 | .RE | |
238 | .PP | |
41478a42 | 239 | .I flags |
81da251a MD |
240 | .RS |
241 | Flags indicating the restart behavior for the registered thread. | |
242 | This is mainly used for debugging purposes. | |
243 | Can be a combination of: | |
244 | .TP | |
245 | .B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | |
246 | Inhibit instruction sequence block restart on preemption for this | |
247 | thread. | |
248 | This flag is deprecated since kernel 6.1. | |
249 | .TP | |
250 | .B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | |
251 | Inhibit instruction sequence block restart on signal delivery for this | |
252 | thread. | |
253 | This flag is deprecated since kernel 6.1. | |
841a0f9b | 254 | .TP |
81da251a MD |
255 | .B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE |
256 | Inhibit instruction sequence block restart on migration for this thread. | |
257 | This flag is deprecated since kernel 6.1. | |
258 | .PP | |
259 | Initialized by user-space, used by the kernel. | |
260 | .RE | |
261 | .PP | |
841a0f9b | 262 | .I node_id |
81da251a | 263 | .RS |
841a0f9b | 264 | Always-updated value of the current NUMA node ID. |
81da251a | 265 | .PP |
841a0f9b | 266 | Initialized by user-space to 0. |
81da251a MD |
267 | .PP |
268 | Updated by the kernel. | |
269 | Read by user-space with single-copy atomicity semantics. | |
270 | This field should only be read by the thread which registered | |
271 | this data structure. | |
272 | Aligned on 32-bit. | |
273 | .RE | |
274 | .PP | |
841a0f9b | 275 | .I mm_cid |
81da251a MD |
276 | .RS |
277 | Contains the current thread's concurrency ID | |
278 | (allocated uniquely within a memory map). | |
279 | .PP | |
280 | Updated by the kernel. | |
281 | Read by user-space with single-copy atomicity semantics. | |
282 | This field should only be read by the thread which registered this data | |
283 | structure. | |
284 | Aligned on 32-bit. | |
285 | .PP | |
286 | This concurrency ID is within the range of possible CPUs, | |
287 | and is temporarily (and uniquely) assigned while threads are actively | |
288 | running within a memory map. | |
289 | If a memory map has fewer threads than cores, | |
290 | or is limited to run on few cores concurrently through sched affinity or | |
291 | cgroup cpusets, | |
292 | the concurrency IDs will be values close to 0, | |
293 | thus allowing efficient use of user-space memory for per-cpu data | |
294 | structures. | |
295 | .RE | |
296 | .RE | |
297 | .RE | |
41478a42 MD |
298 | .PP |
299 | The layout of | |
300 | .B struct rseq_cs | |
301 | version 0 is as follows: | |
302 | .TP | |
303 | .B Structure alignment | |
6a78527e | 304 | This structure is aligned on a 32-byte boundary. |
41478a42 MD |
305 | .TP |
306 | .B Structure size | |
307 | This structure has a fixed size of 32 bytes. | |
81da251a | 308 | .RS |
da5633b4 MD |
309 | .EX |
310 | struct rseq_cs { | |
311 | __u32 version; | |
312 | __u32 flags; | |
313 | __u64 start_ip; | |
314 | __u64 post_commit_offset; | |
315 | __u64 abort_ip; | |
316 | } __attribute__((aligned(32))); | |
317 | .EE | |
81da251a MD |
318 | .RE |
319 | .PP | |
41478a42 | 320 | .B Fields |
81da251a | 321 | .RS |
41478a42 | 322 | .I version |
81da251a MD |
323 | .RS |
324 | Version of this structure. | |
325 | Should be initialized to 0. | |
326 | .RE | |
327 | .PP | |
41478a42 | 328 | .I flags |
81da251a MD |
329 | .RS |
330 | Flags indicating the restart behavior of this structure. | |
331 | Can be a combination of: | |
332 | .TP | |
333 | .B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | |
334 | Inhibit instruction sequence block restart on preemption for this | |
335 | critical section. | |
336 | This flag is deprecated since kernel 6.1. | |
41478a42 | 337 | .TP |
81da251a MD |
338 | .B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL |
339 | Inhibit instruction sequence block restart on signal delivery for this | |
340 | critical section. | |
341 | This flag is deprecated since kernel 6.1. | |
342 | .TP | |
343 | .B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | |
344 | Inhibit instruction sequence block restart on migration for this | |
345 | critical section. | |
346 | This flag is deprecated since kernel 6.1. | |
347 | .RE | |
348 | .PP | |
41478a42 | 349 | .I start_ip |
81da251a | 350 | .RS |
41478a42 MD |
351 | Instruction pointer address of the first instruction of the sequence of |
352 | consecutive assembly instructions. | |
81da251a MD |
353 | .RE |
354 | .PP | |
41478a42 | 355 | .I post_commit_offset |
81da251a | 356 | .RS |
41478a42 MD |
357 | Offset (from start_ip address) of the address after the last instruction |
358 | of the sequence of consecutive assembly instructions. | |
81da251a MD |
359 | .RE |
360 | .PP | |
41478a42 | 361 | .I abort_ip |
81da251a | 362 | .RS |
41478a42 MD |
363 | Instruction pointer address where to move the execution flow in case of |
364 | abort of the sequence of consecutive assembly instructions. | |
81da251a MD |
365 | .RE |
366 | .RE | |
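The relationship between start_ip, post_commit_offset, and abort_ip can be
sketched with plain integers. The structure below is a local copy of the
layout shown above under the hypothetical name rseq_cs_v0; the helpers
make_rseq_cs() and ip_in_rseq_cs() are also hypothetical, and the range
check is an assumption about how an address is matched against a
descriptor, not a statement of the kernel's exact code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copy of the version 0 layout (hypothetical type name). */
struct rseq_cs_v0 {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
} __attribute__((aligned(32)));

_Static_assert(sizeof(struct rseq_cs_v0) == 32, "descriptor must be 32 bytes");

/*
 * Build a descriptor from three addresses: the first instruction, the
 * address just past the last instruction, and the abort handler.
 */
static struct rseq_cs_v0 make_rseq_cs(uint64_t start, uint64_t post_commit,
                                      uint64_t abort_addr)
{
    struct rseq_cs_v0 cs = {
        .version = 0,
        .flags = 0,
        .start_ip = start,
        .post_commit_offset = post_commit - start,
        .abort_ip = abort_addr,
    };
    return cs;
}

/*
 * True when ip lies in [start_ip, start_ip + post_commit_offset).
 * Unsigned wrap-around makes addresses below start_ip compare as out of
 * range.
 */
static bool ip_in_rseq_cs(const struct rseq_cs_v0 *cs, uint64_t ip)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}
```

In real use the three addresses come from labels in the assembly block
that implements the critical section.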
41478a42 MD |
367 | .PP |
368 | The | |
369 | .I rseq_len | |
370 | argument is the size of the | |
81da251a | 371 | .B struct rseq |
41478a42 | 372 | to register. |
41478a42 MD |
373 | .PP |
374 | The | |
375 | .I flags | |
376 | argument is 0 for registration, and | |
81da251a | 377 | .B RSEQ_FLAG_UNREGISTER |
41478a42 | 378 | for unregistration. |
41478a42 MD |
379 | .PP |
380 | The | |
381 | .I sig | |
382 | argument is the 32-bit signature to be expected before the abort | |
383 | handler code. | |
41478a42 MD |
384 | .PP |
385 | A single library per process should keep the rseq structure in a | |
841a0f9b | 386 | per-thread data structure. |
41478a42 MD |
387 | The |
388 | .I cpu_id | |
389 | field should be initialized to -1, and the | |
390 | .I cpu_id_start | |
391 | field should be initialized to a possible CPU value (typically 0). | |
41478a42 MD |
392 | .PP |
393 | Each thread is responsible for registering and unregistering its rseq | |
81da251a MD |
394 | structure. |
395 | No more than one rseq structure address can be registered per thread at | |
396 | a given time. | |
6a78527e | 397 | .PP |
da5633b4 MD |
398 | The memory of an rseq object must be reclaimed only after either an |
399 | explicit rseq unregistration is performed or after the thread exits. | |
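The registration, unregistration, and reclaim order described above can be
sketched as follows. This is a minimal sketch, not a definitive
implementation: rseq_lifecycle() and EXAMPLE_RSEQ_SIG are hypothetical
names, and the signature value is an arbitrary example. On systems where
the C library has already registered rseq for the thread, the registration
step is expected to fail, and the sketch reports that error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/rseq.h>

#define EXAMPLE_RSEQ_SIG 0x53053053u  /* assumption: arbitrary example value */

/* Returns 0 on success, or a negative errno value on failure. */
static int rseq_lifecycle(void)
{
    struct rseq *area = aligned_alloc(32, sizeof(*area));
    int err;

    if (area == NULL)
        return -ENOMEM;

    /* Zero the area, then apply the documented initial values. */
    memset(area, 0, sizeof(*area));
    area->cpu_id_start = 0;
    area->cpu_id = (uint32_t) -1;

    if (syscall(SYS_rseq, area, (uint32_t) sizeof(*area), 0,
                EXAMPLE_RSEQ_SIG) != 0) {
        err = errno;
        free(area);     /* never registered, so safe to reclaim */
        return -err;
    }

    /* ... the thread would run its restartable sequences here ... */

    if (syscall(SYS_rseq, area, (uint32_t) sizeof(*area),
                RSEQ_FLAG_UNREGISTER, EXAMPLE_RSEQ_SIG) != 0)
        return -errno;  /* still registered: do not free the area */

    free(area);         /* reclaim only after unregistration */
    return 0;
}
```

The area is deliberately leaked when unregistration fails, since the
kernel may still write to it.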
41478a42 MD |
400 | .PP |
401 | In a typical usage scenario, the thread registering the rseq | |
81da251a MD |
402 | structure will be performing loads and stores from/to that structure. |
403 | It is however also allowed to read that structure from other threads. | |
41478a42 | 404 | The rseq field updates performed by the kernel provide relaxed atomicity |
81da251a MD |
405 | semantics (atomic store, without memory ordering), |
406 | which guarantee that other threads performing relaxed atomic reads | |
407 | (atomic load, without memory ordering) of the cpu number fields will | |
408 | always observe a consistent value. | |
409 | .PP | |
41478a42 | 410 | .SH RETURN VALUE |
81da251a MD |
411 | A return value of 0 indicates success. |
412 | On error, \-1 is returned, and | |
41478a42 MD |
413 | .I errno |
414 | is set to indicate the error. | |
81da251a | 415 | .PP |
41478a42 MD |
416 | .SH ERRORS |
417 | .TP | |
418 | .B EINVAL | |
419 | Either | |
420 | .I flags | |
421 | contains an invalid value, or | |
422 | .I rseq | |
423 | contains an address which is not appropriately aligned, or | |
424 | .I rseq_len | |
da5633b4 | 425 | contains an incorrect size. |
41478a42 MD |
426 | .TP |
427 | .B ENOSYS | |
428 | The | |
429 | .BR rseq () | |
430 | system call is not implemented by this kernel. | |
431 | .TP | |
432 | .B EFAULT | |
433 | .I rseq | |
434 | is an invalid address. | |
435 | .TP | |
436 | .B EBUSY | |
437 | A restartable sequence is already registered for this thread. | |
438 | .TP | |
439 | .B EPERM | |
440 | The | |
441 | .I sig | |
442 | argument on unregistration does not match the signature received | |
443 | on registration. | |
81da251a | 444 | .PP |
41478a42 MD |
445 | .SH VERSIONS |
446 | The | |
447 | .BR rseq () | |
448 | system call was added in Linux 4.18. | |
81da251a MD |
449 | .PP |
450 | .SH STANDARDS | |
41478a42 MD |
451 | .BR rseq () |
452 | is Linux-specific. | |
81da251a | 453 | .PP |
41478a42 MD |
454 | .SH SEE ALSO |
455 | .BR sched_getcpu (3) , | |
841a0f9b MD |
456 | .BR membarrier (2) , |
457 | .BR getauxval (3) |