Commit | Line | Data |
---|---|---|
f2d7b530 | 1 | .\" SPDX-FileCopyrightText: 2015-2023 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> |
41478a42 | 2 | .\" |
81da251a | 3 | .\" SPDX-License-Identifier: Linux-man-pages-copyleft |
41478a42 | 4 | .\" |
81da251a | 5 | .TH rseq 2 (date) "Linux man-pages (unreleased)" |
41478a42 | 6 | .SH NAME |
81da251a | 7 | rseq \- restartable sequences system call |
82ee9b47 MD |
8 | .SH LIBRARY |
9 | Standard C library | |
10 | .RI ( libc ", " \-lc ) | |
41478a42 MD |
11 | .SH SYNOPSIS |
12 | .nf | |
81da251a | 13 | .PP |
82ee9b47 MD |
14 | .BR "#include <linux/rseq.h>" " /* Definition of " RSEQ_* " constants */" |
15 | .BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */" | |
81da251a MD |
16 | .B #include <unistd.h> |
17 | .PP | |
c7df66c5 | 18 | .BI "int syscall(SYS_rseq, struct rseq *" rseq ", uint32_t " rseq_len , |
82ee9b47 | 19 | .BI " int " flags ", uint32_t " sig ); |
81da251a MD |
20 | .fi |
21 | .PP | |
22 | .IR Note : | |
23 | glibc provides no wrapper for | |
24 | .BR rseq (), | |
25 | necessitating the use of | |
26 | .BR syscall (2). | |
41478a42 | 27 | .SH DESCRIPTION |
841a0f9b MD |
28 | The |
29 | .BR rseq () | |
30 | ABI accelerates specific user-space operations by registering a | |
81da251a MD |
31 | per-thread data structure shared between kernel and user-space. |
32 | This data structure can be read from or written to by user-space to skip | |
841a0f9b | 33 | otherwise expensive system calls. |
81da251a | 34 | .PP |
82ee9b47 MD |
35 | A restartable sequence is a sequence of instructions |
36 | guaranteed to be executed atomically with respect to | |
37 | other threads and signal handlers on the current CPU. | |
38 | If its execution does not complete atomically, | |
39 | the kernel changes the execution flow by jumping to an abort handler | |
40 | defined by user-space for that restartable sequence. | |
81da251a | 41 | .PP |
da5633b4 | 42 | Using restartable sequences requires registering a |
82ee9b47 MD |
43 | .BR rseq () |
44 | ABI per-thread data structure | |
45 | .RB ( "struct rseq" ) | |
46 | through the | |
da5633b4 | 47 | .BR rseq () |
81da251a | 48 | system call. |
82ee9b47 MD |
49 | Only one |
50 | .BR rseq () | |
51 | ABI can be registered per thread, so user-space libraries and | |
52 | applications must follow a user-space ABI defining how to share this | |
81da251a MD |
53 | resource. |
54 | The ABI defining how to share this resource between applications and | |
55 | libraries is defined by the C library. | |
82ee9b47 MD |
56 | Allocation of the per-thread |
57 | .BR rseq () | |
58 | ABI and its registration with the kernel are handled by glibc since version | |
59 | 2.35. | |
81da251a | 60 | .PP |
82ee9b47 MD |
61 | The |
62 | .BR rseq () | |
63 | ABI per-thread data structure contains a | |
da5633b4 | 64 | .I rseq_cs |
81da251a MD |
65 | field which points to the currently executing critical section. |
66 | For each thread, a single rseq critical section can run at any given | |
67 | point. | |
68 | Each critical section needs to be implemented in assembly. | |
69 | .PP | |
41478a42 MD |
70 | The |
71 | .BR rseq () | |
72 | ABI accelerates user-space operations on per-cpu data by defining a | |
73 | shared data structure ABI between each user-space thread and the kernel. | |
81da251a | 74 | .PP |
41478a42 MD |
75 | It allows user-space to perform update operations on per-cpu data |
76 | without requiring heavy-weight atomic operations. | |
81da251a | 77 | .PP |
41478a42 | 78 | The term CPU used in this documentation refers to a hardware execution |
81da251a MD |
79 | context. |
80 | For instance, each CPU number returned by | |
da5633b4 | 81 | .BR sched_getcpu () |
81da251a MD |
82 | is a CPU. |
83 | The current CPU refers to the CPU on which the registered thread is | |
da5633b4 | 84 | running. |
81da251a | 85 | .PP |
41478a42 | 86 | Restartable sequences are atomic with respect to preemption (making them |
81da251a MD |
87 | atomic with respect to other threads running on the same CPU), |
88 | as well as signal delivery (user-space execution contexts nested over | |
89 | the same thread). | |
90 | They either complete atomically with respect to preemption on the | |
91 | current CPU and signal delivery, or they are aborted. | |
92 | .PP | |
da5633b4 | 93 | Restartable sequences are suited for update operations on per-cpu data. |
81da251a | 94 | .PP |
da5633b4 | 95 | Restartable sequences can be used on data structures shared between threads |
81da251a MD |
96 | within a process, |
97 | and on data structures shared between threads across different | |
98 | processes. | |
41478a42 | 99 | .PP |
81da251a MD |
100 | Some examples of operations that can be accelerated or improved by this ABI: |
101 | .IP \(bu 3 | |
41478a42 | 102 | Memory allocator per-cpu free-lists, |
81da251a | 103 | .IP \(bu 3 |
41478a42 | 104 | Querying the current CPU number, |
81da251a | 105 | .IP \(bu 3 |
41478a42 | 106 | Incrementing per-CPU counters, |
81da251a | 107 | .IP \(bu 3 |
41478a42 | 108 | Modifying data protected by per-CPU spinlocks, |
81da251a | 109 | .IP \(bu 3 |
41478a42 | 110 | Inserting/removing elements in per-CPU linked-lists, |
81da251a | 111 | .IP \(bu 3 |
41478a42 | 112 | Writing/reading per-CPU ring buffer content, |
81da251a MD |
113 | .IP \(bu 3 |
114 | Accurately reading performance monitoring unit counters with respect to | |
115 | thread migration. | |
41478a42 | 116 | .PP |
81da251a MD |
117 | Restartable sequences must not perform system calls. |
118 | Doing so may result in termination of the process by a segmentation | |
119 | fault. | |
41478a42 MD |
120 | .PP |
121 | The | |
122 | .I rseq | |
82ee9b47 MD |
123 | argument is a pointer to the thread-local |
124 | .B struct rseq | |
125 | to be shared between kernel and user-space. | |
41478a42 | 126 | .PP |
841a0f9b | 127 | The structure |
41478a42 | 128 | .B struct rseq |
81da251a MD |
129 | is an extensible structure. |
130 | Additional feature fields can be added in future kernel versions. | |
131 | Its layout is as follows: | |
41478a42 MD |
132 | .TP |
133 | .B Structure alignment | |
81da251a MD |
134 | This structure is aligned on either a 32-byte boundary, |
135 | or on the alignment value returned by | |
82ee9b47 MD |
136 | .BR getauxval (3) |
137 | invoked with | |
81da251a | 138 | .B AT_RSEQ_ALIGN |
841a0f9b | 139 | if the structure size differs from 32 bytes. |
41478a42 MD |
140 | .TP |
141 | .B Structure size | |
81da251a MD |
142 | This structure size needs to be at least 32 bytes. |
143 | It can be either 32 bytes, | |
144 | or it needs to be large enough to hold the result of | |
82ee9b47 MD |
145 | .BR getauxval (3) |
146 | invoked with | |
147 | .BR AT_RSEQ_FEATURE_SIZE . | |
148 | Its size is passed as parameter to the | |
149 | .BR rseq () | |
150 | system call. | |
151 | .in +4n | |
152 | .IP | |
da5633b4 | 153 | .EX |
82ee9b47 MD |
154 | #include <linux/rseq.h> |
155 | ||
da5633b4 MD |
156 | struct rseq { |
157 | __u32 cpu_id_start; | |
158 | __u32 cpu_id; | |
159 | union { | |
82ee9b47 | 160 | /* ... */ |
da5633b4 MD |
161 | } rseq_cs; |
162 | __u32 flags; | |
841a0f9b MD |
163 | __u32 node_id; |
164 | __u32 mm_cid; | |
da5633b4 MD |
165 | } __attribute__((aligned(32))); |
166 | .EE | |
82ee9b47 | 167 | .in |
41478a42 MD |
168 | .TP |
169 | .B Fields | |
81da251a | 170 | .RS |
82ee9b47 | 171 | .TP |
41478a42 | 172 | .I cpu_id_start |
841a0f9b | 173 | Always-updated value of the CPU number on which the registered thread is |
81da251a MD |
174 | running. |
175 | Its value is guaranteed to always be a possible CPU number, | |
82ee9b47 MD |
176 | even when |
177 | .BR rseq () | |
178 | is not registered. | |
81da251a MD |
179 | Its value should always be confirmed by reading the cpu_id field before |
180 | user-space performs any side-effect | |
181 | (e.g. storing to memory). | |
82ee9b47 | 182 | .IP |
841a0f9b | 183 | This field is always guaranteed to hold a valid CPU number in the range |
81da251a | 184 | [ 0 .. nr_possible_cpus - 1 ]. |
82ee9b47 MD |
185 | It can therefore be loaded by user-space |
186 | and used as an offset in per-cpu data structures | |
187 | without having to check whether its value is within the valid bounds | |
188 | compared to the number of possible CPUs in the system. | |
189 | .IP | |
81da251a | 190 | Initialized by user-space to a possible CPU number (e.g., 0), |
82ee9b47 MD |
191 | updated by the kernel for threads registered with |
192 | .BR rseq (). | |
193 | .IP | |
194 | For user-space applications executed on a kernel without | |
195 | .BR rseq () | |
196 | support, | |
197 | the cpu_id_start field stays initialized at 0, | |
198 | which is indeed a valid CPU number. | |
81da251a MD |
199 | It is therefore valid to use it as an offset in per-cpu data structures, |
200 | and only validate whether it's actually the current CPU number by | |
201 | comparing it with the cpu_id field within the rseq critical section. | |
82ee9b47 MD |
202 | If the kernel does not provide |
203 | .BR rseq () | |
204 | support, that cpu_id field stays initialized at -1, | |
81da251a | 205 | so the comparison always fails, as intended. |
82ee9b47 | 206 | .IP |
841a0f9b | 207 | This field should only be read by the thread which registered this data |
81da251a MD |
208 | structure. |
209 | Aligned on 32-bit. | |
82ee9b47 MD |
210 | .IP |
211 | It is up to user space to implement a fall-back mechanism for scenarios where | |
212 | .BR rseq () | |
213 | is not available. | |
214 | .TP | |
41478a42 | 215 | .I cpu_id |
841a0f9b | 216 | Always-updated value of the CPU number on which the registered thread is |
81da251a MD |
217 | running. |
218 | Initialized by user-space to -1, | |
82ee9b47 MD |
219 | updated by the kernel for threads registered with |
220 | .BR rseq (). | |
221 | .IP | |
841a0f9b | 222 | This field should only be read by the thread which registered this data |
81da251a MD |
223 | structure. |
224 | Aligned on 32-bit. | |
82ee9b47 | 225 | .TP |
41478a42 | 226 | .I rseq_cs |
81da251a | 227 | The rseq_cs field is a pointer to a |
82ee9b47 | 228 | .BR "struct rseq_cs" . |
81da251a MD |
229 | It is NULL when no rseq assembly block critical section is active for |
230 | the registered thread. | |
82ee9b47 MD |
231 | Setting it to point to a critical section descriptor |
232 | .RB ( "struct rseq_cs" ) | |
233 | marks the beginning of the critical section. | |
234 | .IP | |
da5633b4 | 235 | Initialized by user-space to NULL. |
82ee9b47 | 236 | .IP |
da5633b4 MD |
237 | Updated by user-space, which sets the address of the currently |
238 | active rseq_cs at the beginning of an assembly instruction sequence | |
81da251a MD |
239 | block, |
240 | and set to NULL by the kernel when it restarts an assembly instruction | |
241 | sequence block, | |
242 | as well as when the kernel detects that it is preempting or delivering a | |
243 | signal outside of the range targeted by the rseq_cs. | |
244 | Also needs to be set to NULL by user-space before reclaiming memory that | |
245 | contains the targeted | |
82ee9b47 MD |
246 | .BR "struct rseq_cs" . |
247 | .IP | |
da5633b4 | 248 | Read and set by the kernel. |
82ee9b47 | 249 | .IP |
841a0f9b | 250 | This field should only be updated by the thread which registered this |
81da251a MD |
251 | data structure. |
252 | Aligned on 64-bit. | |
82ee9b47 | 253 | .TP |
41478a42 | 254 | .I flags |
81da251a MD |
255 | Flags indicating the restart behavior for the registered thread. |
256 | This is mainly used for debugging purposes. | |
257 | Can be a combination of: | |
82ee9b47 | 258 | .RS |
81da251a MD |
259 | .TP |
260 | .B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | |
261 | Inhibit instruction sequence block restart on preemption for this | |
262 | thread. | |
82ee9b47 | 263 | This flag is deprecated since Linux 6.1. |
81da251a MD |
264 | .TP |
265 | .B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | |
266 | Inhibit instruction sequence block restart on signal delivery for this | |
267 | thread. | |
82ee9b47 | 268 | This flag is deprecated since Linux 6.1. |
841a0f9b | 269 | .TP |
81da251a MD |
270 | .B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE |
271 | Inhibit instruction sequence block restart on migration for this thread. | |
82ee9b47 | 272 | This flag is deprecated since Linux 6.1. |
81da251a | 273 | .RE |
82ee9b47 MD |
274 | .IP |
275 | Initialized by user-space, used by the kernel. | |
276 | .TP | |
841a0f9b MD |
277 | .I node_id |
278 | Always-updated value of the current NUMA node ID. | |
82ee9b47 | 279 | .IP |
841a0f9b | 280 | Initialized by user-space to 0. |
82ee9b47 | 281 | .IP |
81da251a MD |
282 | Updated by the kernel. |
283 | Read by user-space with single-copy atomicity semantics. | |
284 | This field should only be read by the thread which registered | |
285 | this data structure. | |
286 | Aligned on 32-bit. | |
82ee9b47 | 287 | .TP |
841a0f9b | 288 | .I mm_cid |
81da251a MD |
289 | Contains the current thread's concurrency ID |
290 | (allocated uniquely within a memory map). | |
82ee9b47 | 291 | .IP |
81da251a MD |
292 | Updated by the kernel. |
293 | Read by user-space with single-copy atomicity semantics. | |
294 | This field should only be read by the thread which registered this data | |
295 | structure. | |
296 | Aligned on 32-bit. | |
82ee9b47 | 297 | .IP |
81da251a MD |
298 | This concurrency ID is within the range of possible CPUs, |
299 | and is temporarily (and uniquely) assigned while threads are actively | |
300 | running within a memory map. | |
301 | If a memory map has fewer threads than cores, | |
302 | or is limited to run on few cores concurrently through sched affinity or | |
303 | cgroup cpusets, | |
304 | the concurrency IDs will be values close to 0, | |
305 | thus allowing efficient use of user-space memory for per-cpu data | |
306 | structures. | |
307 | .RE | |
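The initialization convention described above for the cpu_id_start and cpu_id fields can be illustrated with a minimal C sketch. The struct rseq_cpu_fields type, the RSEQ_CPU_FIELDS_INIT macro, and both helper functions are hypothetical names introduced here; they stand in for the first two fields of struct rseq so that the sketch does not depend on kernel headers. In real use the comparison is performed inside the assembly critical section, so this only shows why the check fails when rseq is unavailable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the first two fields of struct rseq. */
struct rseq_cpu_fields {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
};

/* Values expected before registration: cpu_id_start holds a possible
 * CPU number (0), cpu_id holds -1 (uninitialized). */
#define RSEQ_CPU_FIELDS_INIT { .cpu_id_start = 0, .cpu_id = (uint32_t)-1 }

/* Index that is always safe to use as an offset in per-CPU data. */
uint32_t rseq_slot_index(const struct rseq_cpu_fields *f)
{
    return f->cpu_id_start;
}

/* True only if idx matches cpu_id. With the initial values above
 * (no rseq support) this is always false, so the caller must take
 * its fall-back path. */
bool rseq_index_confirmed(const struct rseq_cpu_fields *f, uint32_t idx)
{
    return f->cpu_id == idx;
}
```

A caller would load the index once, use it to locate the per-CPU slot, and only commit a store after the confirmation succeeds.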
41478a42 MD |
308 | .PP |
309 | The layout of | |
310 | .B struct rseq_cs | |
311 | version 0 is as follows: | |
312 | .TP | |
313 | .B Structure alignment | |
6a78527e | 314 | This structure is aligned on a 32-byte boundary. |
41478a42 MD |
315 | .TP |
316 | .B Structure size | |
317 | This structure has a fixed size of 32 bytes. | |
82ee9b47 MD |
318 | .in +4n |
319 | .IP | |
da5633b4 | 320 | .EX |
82ee9b47 MD |
321 | #include <linux/rseq.h> |
322 | ||
da5633b4 MD |
323 | struct rseq_cs { |
324 | __u32 version; | |
325 | __u32 flags; | |
326 | __u64 start_ip; | |
327 | __u64 post_commit_offset; | |
328 | __u64 abort_ip; | |
329 | } __attribute__((aligned(32))); | |
330 | .EE | |
82ee9b47 MD |
331 | .in |
332 | .TP | |
41478a42 | 333 | .B Fields |
81da251a | 334 | .RS |
82ee9b47 | 335 | .TP |
41478a42 | 336 | .I version |
81da251a MD |
337 | Version of this structure. |
338 | Should be initialized to 0. | |
82ee9b47 | 339 | .TP |
41478a42 | 340 | .I flags |
81da251a MD |
341 | .RS |
342 | Flags indicating the restart behavior of this structure. | |
343 | Can be a combination of: | |
344 | .TP | |
345 | .B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | |
346 | Inhibit instruction sequence block restart on preemption for this | |
347 | critical section. | |
82ee9b47 | 348 | This flag is deprecated since Linux 6.1. |
41478a42 | 349 | .TP |
81da251a MD |
350 | .B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL |
351 | Inhibit instruction sequence block restart on signal delivery for this | |
352 | critical section. | |
82ee9b47 | 353 | This flag is deprecated since Linux 6.1. |
81da251a MD |
354 | .TP |
355 | .B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | |
356 | Inhibit instruction sequence block restart on migration for this | |
357 | critical section. | |
82ee9b47 | 358 | This flag is deprecated since Linux 6.1. |
81da251a | 359 | .RE |
82ee9b47 | 360 | .TP |
41478a42 MD |
361 | .I start_ip |
362 | Instruction pointer address of the first instruction of the sequence of | |
363 | consecutive assembly instructions. | |
82ee9b47 | 364 | .TP |
41478a42 MD |
365 | .I post_commit_offset |
366 | Offset (from start_ip address) of the address after the last instruction | |
367 | of the sequence of consecutive assembly instructions. | |
82ee9b47 | 368 | .TP |
41478a42 MD |
369 | .I abort_ip |
370 | Instruction pointer address where to move the execution flow in case of | |
371 | abort of the sequence of consecutive assembly instructions. | |
81da251a | 372 | .RE |
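The relationship between start_ip and post_commit_offset can be sketched in C as a range test. This is an illustration under assumptions: struct rseq_cs_v0 is a hypothetical mirror of the layout shown above, ip_in_rseq_cs() is a name introduced here, and the half-open range [start_ip, start_ip + post_commit_offset) is how the field descriptions are read in this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the struct rseq_cs version 0 layout. */
struct rseq_cs_v0 {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
} __attribute__((aligned(32)));

/* True if ip lies in [start_ip, start_ip + post_commit_offset).
 * For ip < start_ip the unsigned subtraction wraps to a very large
 * value, so a single comparison covers both bounds. */
bool ip_in_rseq_cs(uint64_t ip, const struct rseq_cs_v0 *cs)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}
```

An instruction pointer equal to start_ip + post_commit_offset is treated as outside the critical section, which matches the description of post_commit_offset as the address after the last instruction.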
41478a42 MD |
373 | .PP |
374 | The | |
375 | .I rseq_len | |
376 | argument is the size of the | |
81da251a | 377 | .B struct rseq |
41478a42 | 378 | to register. |
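The size and alignment rules for struct rseq can be turned into a small C sketch that computes the values to allocate with and to pass as rseq_len. The function names and RSEQ_ORIG_SIZE are hypothetical; the fallback definitions of AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN are only used when the installed headers do not provide them, and getauxval() returning 0 is treated here as "not advertised by the kernel".

```c
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval() */

/* Fallbacks for older headers; values as in the Linux UAPI
 * <linux/auxvec.h>. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN 28
#endif

#define RSEQ_ORIG_SIZE 32   /* hypothetical name for the 32-byte layout */

/* Size to allocate: at least 32 bytes, or large enough to hold the
 * feature size advertised by the kernel. */
size_t rseq_alloc_size(void)
{
    unsigned long feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);

    return feature_size > RSEQ_ORIG_SIZE ? feature_size : RSEQ_ORIG_SIZE;
}

/* Alignment to use: 32 bytes for the 32-byte layout, otherwise the
 * value advertised by the kernel (32 if none is advertised). */
size_t rseq_alloc_align(void)
{
    unsigned long align = getauxval(AT_RSEQ_ALIGN);

    if (rseq_alloc_size() == RSEQ_ORIG_SIZE || align == 0)
        return RSEQ_ORIG_SIZE;
    return align;
}
```

The result of rseq_alloc_size() would be the value passed as rseq_len, and the memory would come from an aligned allocator using rseq_alloc_align().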
41478a42 MD |
379 | .PP |
380 | The | |
381 | .I flags | |
382 | argument is 0 for registration, and | |
81da251a | 383 | .B RSEQ_FLAG_UNREGISTER |
41478a42 | 384 | for unregistration. |
41478a42 MD |
385 | .PP |
386 | The | |
387 | .I sig | |
388 | argument is the 32-bit signature to be expected before the abort | |
389 | handler code. | |
41478a42 | 390 | .PP |
82ee9b47 MD |
391 | A single library per process should keep the |
392 | .B struct rseq | |
393 | in a per-thread data structure. | |
41478a42 MD |
394 | The |
395 | .I cpu_id | |
396 | field should be initialized to -1, and the | |
397 | .I cpu_id_start | |
398 | field should be initialized to a possible CPU value (typically 0). | |
41478a42 | 399 | .PP |
82ee9b47 MD |
400 | Each thread is responsible for registering and unregistering its |
401 | .BR "struct rseq" . | |
402 | No more than one | |
403 | .B struct rseq | |
404 | address can be registered per thread at a given time. | |
6a78527e | 405 | .PP |
82ee9b47 MD |
406 | Reclaim of |
407 | .B struct rseq | |
408 | object's memory must only be done after either an explicit rseq | |
409 | unregistration is performed or after the thread exits. | |
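The registration and unregistration life cycle described above can be sketched in C as follows. This is a minimal illustration, not a recommended implementation: DEMO_RSEQ_SIG, demo_rseq, and demo_rseq_register_cycle() are hypothetical names, and the signature value is an arbitrary choice for the sketch. Since glibc 2.35 and later register rseq for each thread themselves, the registration attempt is expected to fail on such systems (for example with EBUSY or EINVAL), which the function reports as a negative errno value.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdint.h>
#include <linux/rseq.h>    /* struct rseq, RSEQ_FLAG_UNREGISTER */
#include <sys/syscall.h>   /* SYS_rseq */
#include <unistd.h>        /* syscall() */

#define DEMO_RSEQ_SIG 0x53053053u   /* hypothetical signature value */

/* Per-thread area. It must stay valid until it is unregistered or the
 * thread exits. cpu_id starts at -1, cpu_id_start at 0. */
static __thread struct rseq demo_rseq = {
    .cpu_id_start = 0,
    .cpu_id = (uint32_t)-1,
};

/* Register the calling thread's area, then unregister it again.
 * Returns 0 on success or a negative errno value. */
int demo_rseq_register_cycle(void)
{
#ifdef SYS_rseq
    if (syscall(SYS_rseq, &demo_rseq, sizeof(demo_rseq), 0,
                DEMO_RSEQ_SIG) != 0)
        return -errno;

    /* While registered, the kernel keeps cpu_id and cpu_id_start
     * up to date for this thread. */

    if (syscall(SYS_rseq, &demo_rseq, sizeof(demo_rseq),
                RSEQ_FLAG_UNREGISTER, DEMO_RSEQ_SIG) != 0)
        return -errno;
    return 0;
#else
    return -ENOSYS;   /* headers predate the rseq system call */
#endif
}
```

Unregistration passes the same address, length, and signature that were used for registration; only after it succeeds (or the thread exits) may the memory be reclaimed.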
41478a42 | 410 | .PP |
82ee9b47 MD |
411 | In a typical usage scenario, the thread registering the |
412 | .B struct rseq | |
413 | will be performing loads and stores from/to that structure. | |
81da251a | 414 | It is however also allowed to read that structure from other threads. |
82ee9b47 MD |
415 | The |
416 | .B struct rseq | |
417 | field updates performed by the kernel provide relaxed atomicity | |
81da251a MD |
418 | semantics (atomic store, without memory ordering), |
419 | which guarantee that other threads performing relaxed atomic reads | |
420 | (atomic load, without memory ordering) of the cpu number fields will | |
421 | always observe a consistent value. | |
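A read of the CPU number field from another thread can be sketched in C with a relaxed atomic load. The struct rseq_cpu_view type and rseq_peek_cpu_id() are hypothetical names standing in for the CPU number fields of struct rseq, and the __atomic_load_n builtin assumes a GCC-compatible compiler.

```c
#include <stdint.h>

/* Hypothetical stand-in for the CPU number fields of struct rseq. */
struct rseq_cpu_view {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
};

/* Relaxed atomic load: yields a consistent 32-bit value but implies
 * no ordering with respect to other memory accesses. */
uint32_t rseq_peek_cpu_id(const struct rseq_cpu_view *v)
{
    return __atomic_load_n(&v->cpu_id, __ATOMIC_RELAXED);
}
```

The value observed this way may already be stale when it is used, because the owning thread can migrate at any time.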
41478a42 | 422 | .SH RETURN VALUE |
81da251a MD |
423 | A return value of 0 indicates success. |
424 | On error, \-1 is returned, and | |
41478a42 MD |
425 | .I errno |
426 | is set to indicate the error. | |
41478a42 MD |
427 | .SH ERRORS |
428 | .TP | |
429 | .B EINVAL | |
430 | Either | |
431 | .I flags | |
432 | contains an invalid value, or | |
433 | .I rseq | |
434 | contains an address which is not appropriately aligned, or | |
435 | .I rseq_len | |
da5633b4 | 436 | contains an incorrect size. |
41478a42 MD |
437 | .TP |
438 | .B ENOSYS | |
439 | The | |
440 | .BR rseq () | |
441 | system call is not implemented by this kernel. | |
442 | .TP | |
443 | .B EFAULT | |
444 | .I rseq | |
445 | is an invalid address. | |
446 | .TP | |
447 | .B EBUSY | |
448 | A restartable sequence is already registered for this thread. | |
449 | .TP | |
450 | .B EPERM | |
451 | The | |
452 | .I sig | |
453 | argument on unregistration does not match the signature received | |
454 | on registration. | |
41478a42 MD |
455 | .SH VERSIONS |
456 | The | |
457 | .BR rseq () | |
458 | system call was added in Linux 4.18. | |
81da251a | 459 | .SH STANDARDS |
41478a42 MD |
460 | .BR rseq () |
461 | is Linux-specific. | |
41478a42 MD |
462 | .SH SEE ALSO |
463 | .BR sched_getcpu (3) , | |
841a0f9b MD |
464 | .BR membarrier (2) , |
465 | .BR getauxval (3) |