1 | CPUSETS |
2 | ------- | |
3 | ||
4 | Copyright (C) 2004 BULL SA. | |
5 | Written by Simon.Derr@bull.net | |
6 | ||
7 | Portions Copyright (c) 2004 Silicon Graphics, Inc. | |
8 | Modified by Paul Jackson <pj@sgi.com> | |
9 | ||
10 | CONTENTS: | |
11 | ========= | |
12 | ||
13 | 1. Cpusets | |
14 | 1.1 What are cpusets ? | |
15 | 1.2 Why are cpusets needed ? | |
16 | 1.3 How are cpusets implemented ? | |
17 | 1.4 How do I use cpusets ? | |
18 | 2. Usage Examples and Syntax | |
19 | 2.1 Basic Usage | |
20 | 2.2 Adding/removing cpus | |
21 | 2.3 Setting flags | |
22 | 2.4 Attaching processes | |
23 | 3. Questions | |
24 | 4. Contact | |
25 | ||
26 | 1. Cpusets | |
27 | ========== | |
28 | ||
29 | 1.1 What are cpusets ? | |
30 | ---------------------- | |
31 | ||
32 | Cpusets provide a mechanism for assigning a set of CPUs and Memory | |
33 | Nodes to a set of tasks. | |
34 | ||
35 | Cpusets constrain the CPU and Memory placement of tasks to only | |
36 | the resources within a task's current cpuset. They form a nested | |
37 | hierarchy visible in a virtual file system. These are the essential | |
38 | hooks, beyond what is already present, required to manage dynamic | |
39 | job placement on large systems. | |
40 | ||
41 | Each task has a pointer to a cpuset. Multiple tasks may reference | |
42 | the same cpuset. Requests by a task, using the sched_setaffinity(2) | |
43 | system call to include CPUs in its CPU affinity mask, and using the | |
44 | mbind(2) and set_mempolicy(2) system calls to include Memory Nodes | |
45 | in its memory policy, are both filtered through that task's cpuset, | |
46 | filtering out any CPUs or Memory Nodes not in that cpuset. The | |
47 | scheduler will not schedule a task on a CPU that is not allowed in | |
48 | its cpus_allowed vector, and the kernel page allocator will not | |
49 | allocate a page on a node that is not allowed in the requesting task's | |
50 | mems_allowed vector. | |
51 | ||
52 | If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct | |
53 | ancestor or descendant, may share any of the same CPUs or Memory Nodes. | |
54 | A cpuset that is cpu exclusive has a sched domain associated with it. |
55 | The sched domain consists of all cpus in the current cpuset that are not | |
56 | part of any exclusive child cpusets. | |
57 | This ensures that the scheduler load balancing code only balances | |
58 | against the cpus that are in the sched domain as defined above and not | |
59 | all of the cpus in the system. This removes any overhead due to | |
60 | load balancing code trying to pull tasks outside of the cpu exclusive | |
61 | cpuset only to be prevented by the tasks' cpus_allowed mask. | |
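
As a rough illustration of the rule above, the CPUs of such a sched
domain can be modelled with bitmask arithmetic. This is only a sketch:
the function name sched_domain_mask and the example masks are made up
for this illustration and are not kernel interfaces.

```shell
# Hypothetical helper, not a kernel interface: compute the CPU mask of
# the sched domain of a cpu exclusive cpuset.
# $1 = CPU mask of the cpuset, remaining args = masks of its exclusive
# child cpusets.
sched_domain_mask() {
    domain=$1
    shift
    for child in "$@"; do
        # Drop every CPU that belongs to an exclusive child cpuset.
        domain=$(( $domain & ~$child ))
    done
    printf '0x%x\n' "$domain"
}

# Parent holds CPUs 0-7 (0xff); exclusive children hold CPUs 0-1 and 4-5.
# What remains for the parent's sched domain is CPUs 2, 3, 6 and 7.
sched_domain_mask 0xff 0x03 0x30
```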
62 | |
63 | User level code may create and destroy cpusets by name in the cpuset | |
64 | virtual file system, manage the attributes and permissions of these | |
65 | cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | |
66 | specify and query to which cpuset a task is assigned, and list the | |
67 | task pids assigned to a cpuset. | |
68 | ||
69 | ||
70 | 1.2 Why are cpusets needed ? | |
71 | ---------------------------- | |
72 | ||
73 | The management of large computer systems, with many processors (CPUs), | |
74 | complex memory cache hierarchies and multiple Memory Nodes having | |
75 | non-uniform access times (NUMA) presents additional challenges for | |
76 | the efficient scheduling and memory placement of processes. | |
77 | ||
78 | Frequently more modest sized systems can be operated with adequate | |
79 | efficiency just by letting the operating system automatically share | |
80 | the available CPU and Memory resources amongst the requesting tasks. | |
81 | ||
82 | But larger systems, which benefit more from careful processor and | |
83 | memory placement to reduce memory access times and contention, | |
84 | and which typically represent a larger investment for the customer, | |
85 | can benefit from explicitly placing jobs on properly sized subsets of | |
86 | the system. | |
87 | ||
88 | This can be especially valuable on: | |
89 | ||
90 | * Web Servers running multiple instances of the same web application, | |
91 | * Servers running different applications (for instance, a web server | |
92 | and a database), | |
93 | * NUMA systems running large HPC applications with demanding | |
94 | performance characteristics, or | |
95 | * Servers running orthogonal workloads, such as RT applications |
96 | requiring low latency and HPC applications that are throughput | |
97 | sensitive, for which cpu_exclusive cpusets are useful. | |
98 | |
99 | These subsets, or "soft partitions", must be able to be dynamically | |
100 | adjusted, as the job mix changes, without impacting other concurrently | |
101 | executing jobs. | |
102 | ||
103 | The kernel cpuset patch provides the minimum essential kernel | |
104 | mechanisms required to efficiently implement such subsets. It | |
105 | leverages existing CPU and Memory Placement facilities in the Linux | |
106 | kernel to avoid any additional impact on the critical scheduler or | |
107 | memory allocator code. | |
108 | ||
109 | ||
110 | 1.3 How are cpusets implemented ? | |
111 | --------------------------------- | |
112 | ||
113 | Cpusets provide a Linux kernel (2.6.7 and above) mechanism to constrain | |
114 | which CPUs and Memory Nodes are used by a process or set of processes. | |
115 | ||
116 | The Linux kernel already has a pair of mechanisms to specify on which | |
117 | CPUs a task may be scheduled (sched_setaffinity) and on which Memory | |
118 | Nodes it may obtain memory (mbind, set_mempolicy). | |
119 | ||
120 | Cpusets extend these two mechanisms as follows: | |
121 | ||
122 | - Cpusets are sets of allowed CPUs and Memory Nodes, known to the | |
123 | kernel. | |
124 | - Each task in the system is attached to a cpuset, via a pointer | |
125 | in the task structure to a reference counted cpuset structure. | |
126 | - Calls to sched_setaffinity are filtered to just those CPUs | |
127 | allowed in that task's cpuset. | |
128 | - Calls to mbind and set_mempolicy are filtered to just | |
129 | those Memory Nodes allowed in that task's cpuset. | |
130 | - The root cpuset contains all the system's CPUs and Memory | |
131 | Nodes. | |
132 | - For any cpuset, one can define child cpusets containing a subset | |
133 | of the parent's CPU and Memory Node resources. | |
134 | - The hierarchy of cpusets can be mounted at /dev/cpuset, for | |
135 | browsing and manipulation from user space. | |
136 | - A cpuset may be marked exclusive, which ensures that no other | |
137 | cpuset (except direct ancestors and descendants) may contain | |
138 | any overlapping CPUs or Memory Nodes. | |
139 | In addition, a cpu_exclusive cpuset is associated with a sched |
140 | domain. | |
141 | - You can list all the tasks (by pid) attached to any cpuset. |
142 | ||
143 | The implementation of cpusets requires a few, simple hooks | |
144 | into the rest of the kernel, none in performance critical paths: | |
145 | ||
146 | - in init/main.c, to initialize the root cpuset at system boot. | |
147 | - in fork and exit, to attach and detach a task from its cpuset. | |
148 | - in sched_setaffinity, to mask the requested CPUs by what's | |
149 | allowed in that task's cpuset. | |
150 | - in sched.c migrate_all_tasks(), to keep migrating tasks within | |
151 | the CPUs allowed by their cpuset, if possible. | |
152 | - in sched.c, a new API partition_sched_domains for handling |
153 | sched domain changes associated with cpu_exclusive cpusets | |
154 | and related changes in both sched.c and arch/ia64/kernel/domain.c | |
155 | - in the mbind and set_mempolicy system calls, to mask the requested |
156 | Memory Nodes by what's allowed in that task's cpuset. | |
157 | - in page_alloc, to restrict memory to allowed nodes. | |
158 | - in vmscan.c, to restrict page recovery to the current cpuset. | |
159 | ||
160 | In addition a new file system, of type "cpuset" may be mounted, | |
161 | typically at /dev/cpuset, to enable browsing and modifying the cpusets | |
162 | presently known to the kernel. No new system calls are added for | |
163 | cpusets - all support for querying and modifying cpusets is via | |
164 | this cpuset file system. | |
165 | ||
166 | Each task under /proc has an added file named 'cpuset', displaying | |
167 | the cpuset name, as the path relative to the root of the cpuset file | |
168 | system. | |
169 | ||
170 | The /proc/<pid>/status file for each task has two added lines, | |
171 | displaying the task's cpus_allowed (on which CPUs it may be scheduled) | |
172 | and mems_allowed (on which Memory Nodes it may obtain memory), | |
173 | in the format seen in the following example: | |
174 | ||
175 | Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff | |
176 | Mems_allowed: ffffffff,ffffffff | |
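
To make the mask format above easier to read, the following sketch
turns such a value into a list of CPU (or Memory Node) numbers. It
assumes the value is a comma-separated sequence of 32-bit hexadecimal
words with the most significant word first; the helper name
mask_to_list is hypothetical.

```shell
# Hypothetical helper: print the bit numbers set in a mask written as
# comma-separated 32-bit hex words, most significant word first.
mask_to_list() {
    words=$(printf '%s' "$1" | tr ',' ' ')
    # Count the words so that the position of each word is known.
    n=0
    for w in $words; do n=$(( $n + 1 )); done
    for w in $words; do
        n=$(( $n - 1 ))     # index of this word, counted from the low end
        bit=0
        while [ "$bit" -lt 32 ]; do
            if [ $(( (0x$w >> $bit) & 1 )) -eq 1 ]; then
                echo $(( $n * 32 + $bit ))
            fi
            bit=$(( $bit + 1 ))
        done
    done | sort -n | paste -s -d ' ' -
}

# A mask with only bits 2 and 3 set, i.e. CPUs 2 and 3.
mask_to_list 00000000,0000000c

# On a live system the value could be taken from the status file, e.g.:
#   mask_to_list "$(awk '/^Cpus_allowed:/ {print $2}' /proc/self/status)"
```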
177 | ||
178 | Each cpuset is represented by a directory in the cpuset file system | |
179 | containing the following files describing that cpuset: | |
180 | ||
181 | - cpus: list of CPUs in that cpuset | |
182 | - mems: list of Memory Nodes in that cpuset | |
183 | - cpu_exclusive flag: is cpu placement exclusive? | |
184 | - mem_exclusive flag: is memory placement exclusive? | |
185 | - tasks: list of tasks (by pid) attached to that cpuset | |
186 | ||
187 | New cpusets are created using the mkdir system call or shell | |
188 | command. The properties of a cpuset, such as its flags, allowed | |
189 | CPUs and Memory Nodes, and attached tasks, are modified by writing | |
190 | to the appropriate file in that cpuset's directory, as listed above. | |
191 | ||
192 | The named hierarchical structure of nested cpusets allows partitioning | |
193 | a large system into nested, dynamically changeable, "soft-partitions". | |
194 | ||
195 | The attachment of each task, automatically inherited at fork by any | |
196 | children of that task, to a cpuset allows organizing the work load | |
197 | on a system into related sets of tasks such that each set is constrained | |
198 | to using the CPUs and Memory Nodes of a particular cpuset. A task | |
199 | may be re-attached to any other cpuset, if allowed by the permissions | |
200 | on the necessary cpuset file system directories. | |
201 | ||
202 | Such management of a system "in the large" integrates smoothly with | |
203 | the detailed placement done on individual tasks and memory regions | |
204 | using the sched_setaffinity, mbind and set_mempolicy system calls. | |
205 | ||
206 | The following rules apply to each cpuset: | |
207 | ||
208 | - Its CPUs and Memory Nodes must be a subset of its parent's. | |
209 | - It can only be marked exclusive if its parent is. | |
210 | - If it is cpu or mem exclusive, its CPUs or Memory Nodes may not overlap any sibling's. | |
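
The subset and overlap rules can be pictured as two bitmask tests. The
helpers below are a sketch for illustration only; is_subset and
overlaps are made-up names, and the kernel performs these checks itself
when a cpuset file is written.

```shell
# Hypothetical helpers operating on CPU or Memory Node bitmasks.

# Succeeds if every bit of mask $1 is also set in mask $2
# (a child stays within its parent).
is_subset() {
    [ $(( $1 & ~$2 )) -eq 0 ]
}

# Succeeds if masks $1 and $2 share at least one bit
# (two siblings overlap).
overlaps() {
    [ $(( $1 & $2 )) -ne 0 ]
}

parent=0xff; a=0x0f; b=0x3c
is_subset "$a" "$parent" && echo "a fits inside its parent"
overlaps "$a" "$b" && echo "a and b overlap, so neither may be exclusive"
```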
211 | ||
212 | These rules, and the natural hierarchy of cpusets, enable efficient | |
213 | enforcement of the exclusive guarantee, without having to scan all | |
214 | cpusets every time any of them change to ensure nothing overlaps a | |
215 | an exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | |
216 | to represent the cpuset hierarchy provides for a familiar permission | |
217 | and name space for cpusets, with a minimum of additional kernel code. | |
218 | ||
219 | 1.4 How do I use cpusets ? | |
220 | -------------------------- | |
221 | ||
222 | In order to minimize the impact of cpusets on critical kernel | |
223 | code, such as the scheduler, and due to the fact that the kernel | |
224 | does not support one task updating the memory placement of another | |
225 | task directly, the impact on a task of changing its cpuset CPU | |
226 | or Memory Node placement, or of changing to which cpuset a task | |
227 | is attached, is subtle. | |
228 | ||
229 | If a cpuset has its Memory Nodes modified, then for each task attached | |
230 | to that cpuset, the next time that the kernel attempts to allocate | |
231 | a page of memory for that task, the kernel will notice the change | |
232 | in the task's cpuset, and update its per-task memory placement to | |
233 | remain within the new cpuset's memory placement. If the task was using | |
234 | mempolicy MPOL_BIND, and the nodes to which it was bound overlap with | |
235 | its new cpuset, then the task will continue to use whatever subset | |
236 | of MPOL_BIND nodes are still allowed in the new cpuset. If the task | |
237 | was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed | |
238 | in the new cpuset, then the task will be essentially treated as if it | |
239 | was MPOL_BIND bound to the new cpuset (even though its numa placement, | |
240 | as queried by get_mempolicy(), doesn't change). If a task is moved | |
241 | from one cpuset to another, then the kernel will adjust the task's | |
242 | memory placement, as above, the next time that the kernel attempts | |
243 | to allocate a page of memory for that task. | |
244 | ||
245 | If a cpuset has its CPUs modified, then each task using that | |
246 | cpuset does _not_ change its behavior automatically. In order to | |
247 | minimize the impact on the critical scheduling code in the kernel, | |
248 | tasks will continue to use their prior CPU placement until they | |
249 | are rebound to their cpuset, by rewriting their pid to the 'tasks' | |
250 | file of their cpuset. If a task had been bound to some subset of its | |
251 | cpuset using the sched_setaffinity() call, and if any of that subset | |
252 | is still allowed in its new cpuset settings, then the task will be | |
253 | restricted to the intersection of the CPUs it was allowed on before, | |
254 | and its new cpuset CPU placement. If, on the other hand, there is | |
255 | no overlap between a task's prior placement and its new cpuset CPU | |
256 | placement, then the task will be allowed to run on any CPU allowed | |
257 | in its new cpuset. If a task is moved from one cpuset to another, | |
258 | its CPU placement is updated in the same way as if the task's pid were | |
259 | rewritten to the 'tasks' file of its current cpuset. | |
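
The rebinding rule described in this paragraph can be modelled with
bitmasks as follows. This is a sketch of the rule only, not of the
kernel code; the function name rebound_cpus is hypothetical.

```shell
# Hypothetical model of the rebinding rule, using CPU bitmasks.
# $1 = CPUs the task was allowed on before
# $2 = CPUs of the cpuset it is (re)bound to
rebound_cpus() {
    both=$(( $1 & $2 ))
    if [ "$both" -ne 0 ]; then
        # Some of the prior CPUs are still allowed: keep the overlap.
        printf '0x%x\n' "$both"
    else
        # No overlap: the task may run on any CPU of the cpuset.
        printf '0x%x\n' $(( $2 ))
    fi
}

rebound_cpus 0x0c 0x3c   # task on CPUs 2-3, cpuset holds CPUs 2-5
rebound_cpus 0x03 0x30   # task on CPUs 0-1, cpuset holds CPUs 4-5
```

In the first call the task stays on CPUs 2-3; in the second it ends up
on CPUs 4-5, since nothing of its prior placement remains allowed.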
260 | ||
261 | In summary, the memory placement of a task whose cpuset is changed is | |
262 | updated by the kernel, on the next allocation of a page for that task, | |
263 | but the processor placement is not updated, until that task's pid is | |
264 | rewritten to the 'tasks' file of its cpuset. This is done to avoid | |
265 | impacting the scheduler code in the kernel with a check for changes | |
266 | in a task's processor placement. | |
267 | ||
268 | There is an exception to the above. If hotplug functionality is used | |
269 | to remove all the CPUs that are currently assigned to a cpuset, | |
270 | then the kernel will automatically update the cpus_allowed of all | |
271 | tasks attached to CPUs in that cpuset to allow all CPUs. When memory |
272 | hotplug functionality for removing Memory Nodes is available, a |
273 | similar exception is expected to apply there as well. In general, | |
274 | the kernel prefers to violate cpuset placement, over starving a task | |
275 | that has had all its allowed CPUs or Memory Nodes taken offline. User | |
276 | code should reconfigure cpusets to only refer to online CPUs and Memory | |
277 | Nodes when using hotplug to add or remove such resources. | |
278 | ||
279 | There is a second exception to the above. GFP_ATOMIC requests are | |
280 | kernel internal allocations that must be satisfied, immediately. | |
281 | The kernel may drop some request, in rare cases even panic, if a | |
282 | GFP_ATOMIC alloc fails. If the request cannot be satisfied within | |
283 | the current task's cpuset, then we relax the cpuset, and look for | |
284 | memory anywhere we can find it. It's better to violate the cpuset | |
285 | than stress the kernel. | |
286 | ||
287 | To start a new job that is to be contained within a cpuset, the steps are: | |
288 | ||
289 | 1) mkdir /dev/cpuset | |
290 | 2) mount -t cpuset none /dev/cpuset | |
291 | 3) Create the new cpuset by doing mkdir's and write's (or echo's) in | |
292 | the /dev/cpuset virtual file system. | |
293 | 4) Start a task that will be the "founding father" of the new job. | |
294 | 5) Attach that task to the new cpuset by writing its pid to the | |
295 | /dev/cpuset tasks file for that cpuset. | |
296 | 6) fork, exec or clone the job tasks from this founding father task. | |
297 | ||
298 | For example, the following sequence of commands will set up a cpuset | |
299 | named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | |
300 | and then start a subshell 'sh' in that cpuset: | |
301 | ||
302 | mount -t cpuset none /dev/cpuset | |
303 | cd /dev/cpuset | |
304 | mkdir Charlie | |
305 | cd Charlie | |
306 | /bin/echo 2-3 > cpus | |
307 | /bin/echo 1 > mems | |
308 | /bin/echo $$ > tasks | |
309 | sh | |
310 | # The subshell 'sh' is now running in cpuset Charlie | |
311 | # The next line should display '/Charlie' | |
312 | cat /proc/self/cpuset | |
313 | ||
314 | In the case that a change of cpuset includes wanting to move already | |
315 | allocated memory pages, consider further the work of IWAMOTO | |
316 | Toshihiro <iwamoto@valinux.co.jp> for page remapping and memory | |
317 | hotremoval, which can be found at: | |
318 | ||
319 | http://people.valinux.co.jp/~iwamoto/mh.html | |
320 | ||
321 | The integration of cpusets with such memory migration is not yet | |
322 | available. | |
323 | ||
324 | In the future, a C library interface to cpusets will likely be | |
325 | available. For now, the only way to query or modify cpusets is | |
326 | via the cpuset file system, using the various cd, mkdir, echo, cat, | |
327 | rmdir commands from the shell, or their equivalent from C. | |
328 | ||
329 | The sched_setaffinity calls can also be done at the shell prompt using | |
330 | SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | |
331 | calls can be done at the shell prompt using the numactl command | |
332 | (part of Andi Kleen's numa package). | |
333 | ||
334 | 2. Usage Examples and Syntax | |
335 | ============================ | |
336 | ||
337 | 2.1 Basic Usage | |
338 | --------------- | |
339 | ||
340 | Creating, modifying and using cpusets can be done through the cpuset | |
341 | virtual filesystem. | |
342 | ||
343 | To mount it, type: | |
344 | # mount -t cpuset none /dev/cpuset | |
345 | ||
346 | Then under /dev/cpuset you can find a tree that corresponds to the | |
347 | tree of the cpusets in the system. For instance, /dev/cpuset | |
348 | is the cpuset that holds the whole system. | |
349 | ||
350 | If you want to create a new cpuset under /dev/cpuset: | |
351 | # cd /dev/cpuset | |
352 | # mkdir my_cpuset | |
353 | ||
354 | Now you want to do something with this cpuset. | |
355 | # cd my_cpuset | |
356 | ||
357 | In this directory you can find several files: | |
358 | # ls | |
359 | cpus cpu_exclusive mems mem_exclusive tasks | |
360 | ||
361 | Reading them will give you information about the state of this cpuset: | |
362 | the CPUs and Memory Nodes it can use, the processes that are using | |
363 | it, its properties. By writing to these files you can manipulate | |
364 | the cpuset. | |
365 | ||
366 | Set some flags: | |
367 | # /bin/echo 1 > cpu_exclusive | |
368 | ||
369 | Add some cpus: | |
370 | # /bin/echo 0-7 > cpus | |
371 | ||
372 | Now attach your shell to this cpuset: | |
373 | # /bin/echo $$ > tasks | |
374 | ||
375 | You can also create cpusets inside your cpuset by using mkdir in this | |
376 | directory. | |
377 | # mkdir my_sub_cs | |
378 | ||
379 | To remove a cpuset, just use rmdir: | |
380 | # rmdir my_sub_cs | |
381 | This will fail if the cpuset is in use (has cpusets inside, or has | |
382 | processes attached). | |
383 | ||
384 | 2.2 Adding/removing cpus | |
385 | ------------------------ | |
386 | ||
387 | This is the syntax to use when writing in the cpus or mems files | |
388 | in cpuset directories: | |
389 | ||
390 | # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | |
391 | # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | |
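
To check what a given list means before writing it, a list can be
expanded into individual numbers. The sketch below handles the two
forms shown above; accepting a mix of both in one list (such as
"0-2,5") is an assumption made here, and expand_list is a hypothetical
name.

```shell
# Hypothetical helper: expand a list such as "1-4" or "1,2,3,4" into
# the individual CPU or Memory Node numbers it names.
expand_list() {
    out=
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;   # a range "lo-hi"
            *)   lo=$part; hi=$part ;;             # a single number
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            out="$out $i"
            i=$(( $i + 1 ))
        done
    done
    echo "${out# }"
}

# Both forms name the same four cpus.
expand_list 1-4
expand_list 1,2,3,4
```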
392 | ||
393 | 2.3 Setting flags | |
394 | ----------------- | |
395 | ||
396 | The syntax is very simple: | |
397 | ||
398 | # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | |
399 | # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | |
400 | ||
401 | 2.4 Attaching processes | |
402 | ----------------------- | |
403 | ||
404 | # /bin/echo PID > tasks | |
405 | ||
406 | Note that it is PID, not PIDs. You can only attach ONE task at a time. | |
407 | If you have several tasks to attach, you have to do it one after another: | |
408 | ||
409 | # /bin/echo PID1 > tasks | |
410 | # /bin/echo PID2 > tasks | |
411 | ... | |
412 | # /bin/echo PIDn > tasks | |
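
The one-pid-per-write rule lends itself to a small loop. The following
is a minimal sketch; attach_tasks is a hypothetical helper name, and
the cpuset directory in the usage comment is only an example.

```shell
# Hypothetical helper: attach several tasks to one cpuset, issuing one
# write per pid so that each attachment reports its own result.
# $1 = cpuset directory, remaining arguments = pids
attach_tasks() {
    dir=$1
    shift
    failed=0
    for pid in "$@"; do
        if ! /bin/echo "$pid" > "$dir/tasks"; then
            echo "failed to attach pid $pid" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Usage, assuming the cpuset my_cpuset exists:
#   attach_tasks /dev/cpuset/my_cpuset 1234 1235 1236
```

The helper returns non-zero if any single attachment failed, and names
the failing pid on stderr.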
413 | ||
414 | ||
415 | 3. Questions | |
416 | ============ | |
417 | ||
418 | Q: what's up with this '/bin/echo' ? | |
419 | A: bash's builtin 'echo' command does not check calls to write() against | |
420 | errors. If you use it in the cpuset file system, you won't be | |
421 | able to tell whether a command succeeded or failed. | |
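
One way to make use of this is to test the exit status of /bin/echo
after each write. The sketch below assumes that /bin/echo exits with a
non-zero status when the write fails; write_checked is a hypothetical
helper name.

```shell
# Hypothetical helper: write a value to a cpuset file and report
# whether the write succeeded.
# $1 = value, $2 = file
write_checked() {
    if /bin/echo "$1" > "$2"; then
        echo "ok: wrote '$1' to $2"
    else
        echo "error: could not write '$1' to $2" >&2
        return 1
    fi
}

# Usage, assuming the cpuset my_cpuset exists:
#   write_checked 1 /dev/cpuset/my_cpuset/cpu_exclusive
```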
422 | ||
423 | Q: When I write several pids on one line, only the first one really gets attached ! | |
424 | A: We can only return one error code per call to write(). So you should | |
425 | put only ONE pid in each write. | |
426 | ||
427 | 4. Contact | |
428 | ========== | |
429 | ||
430 | Web: http://www.bullopensource.org/cpuset |