Commit | Line | Data |
---|---|---|
65731578 TH |
1 | |
2 | Cgroup unified hierarchy | |
3 | ||
4 | April, 2014 Tejun Heo <tj@kernel.org> | |
5 | ||
6 | This document describes the changes made by unified hierarchy and | |
7 | their rationales. It will eventually be merged into the main cgroup | |
8 | documentation. | |
9 | ||
10 | CONTENTS | |
11 | ||
12 | 1. Background | |
13 | 2. Basic Operation | |
14 | 2-1. Mounting | |
15 | 2-2. cgroup.subtree_control | |
16 | 2-3. cgroup.controllers | |
17 | 3. Structural Constraints | |
18 | 3-1. Top-down | |
19 | 3-2. No internal tasks | |
20 | 4. Other Changes | |
21 | 4-1. [Un]populated Notification | |
22 | 4-2. Other Core Changes | |
23 | 4-3. Per-Controller Changes | |
24 | 4-3-1. blkio | |
25 | 4-3-2. cpuset | |
26 | 4-3-3. memory | |
27 | 5. Planned Changes | |
28 | 5-1. CAP for resource control | |
29 | ||
30 | ||
31 | 1. Background | |
32 | ||
33 | cgroup allows an arbitrary number of hierarchies and each hierarchy | |
34 | can host any number of controllers. While this seems to provide a | |
35 | high level of flexibility, it isn't very useful in practice. | |
36 | ||
37 | For example, as there is only one instance of each controller, utility | |
38 | type controllers such as freezer which can be useful in all | |
39 | hierarchies can only be used in one. The issue is exacerbated by the | |
40 | fact that controllers can't be moved around once hierarchies are | |
41 | populated. Another issue is that all controllers bound to a hierarchy | |
42 | are forced to have exactly the same view of the hierarchy. It isn't | |
43 | possible to vary the granularity depending on the specific controller. | |
44 | ||
45 | In practice, these issues heavily limit which controllers can be put | |
46 | on the same hierarchy and most configurations resort to putting each | |
47 | controller on its own hierarchy. Only closely related ones, such as | |
48 | the cpu and cpuacct controllers, make sense to put on the same | |
49 | hierarchy. This often means that userland ends up managing multiple | |
50 | similar hierarchies repeating the same steps on each hierarchy | |
51 | whenever a hierarchy management operation is necessary. | |
52 | ||
53 | Unfortunately, support for multiple hierarchies comes at a steep cost. | |
54 | Internal implementation in cgroup core proper is dazzlingly | |
55 | complicated but more importantly the support for multiple hierarchies | |
56 | restricts how cgroup is used in general and what controllers can do. | |
57 | ||
58 | There's no limit on how many hierarchies there may be, which means | |
59 | that a task's cgroup membership can't be described in finite length. | |
60 | The key may contain any varying number of entries and is unlimited in | |
61 | length, which makes it highly awkward to handle and leads to the addition | |
62 | of controllers which exist only to identify membership, which in turn | |
63 | exacerbates the original problem. | |
64 | ||
65 | Also, as a controller can't have any expectation regarding what shape | |
66 | of hierarchies other controllers would be on, each controller has to | |
67 | assume that all other controllers are operating on completely | |
68 | orthogonal hierarchies. This makes it impossible, or at least very | |
69 | cumbersome, for controllers to cooperate with each other. | |
70 | ||
71 | In most use cases, putting controllers on hierarchies which are | |
72 | completely orthogonal to each other isn't necessary. What usually is | |
73 | called for is the ability to have differing levels of granularity | |
74 | depending on the specific controller. In other words, the hierarchy may | |
75 | be collapsed from leaf towards root when viewed from specific | |
76 | controllers. For example, a given configuration might not care about | |
77 | how memory is distributed beyond a certain level while still wanting | |
78 | to control how CPU cycles are distributed. | |
79 | ||
80 | Unified hierarchy is the next version of the cgroup interface. It aims to | |
81 | address the aforementioned issues by having more structure while | |
82 | retaining enough flexibility for most use cases. Various other | |
83 | general and controller-specific interface issues are also addressed in | |
84 | the process. | |
85 | ||
86 | ||
87 | 2. Basic Operation | |
88 | ||
89 | 2-1. Mounting | |
90 | ||
91 | Currently, unified hierarchy can be mounted with the following mount | |
92 | command. Note that this is still under development and scheduled to | |
93 | change soon. | |
94 | ||
95 | mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT | |
96 | ||
97 | All controllers which are not bound to other hierarchies are | |
98 | automatically bound to unified hierarchy and show up at the root of | |
99 | it. Controllers which are enabled only in the root of unified | |
100 | hierarchy can be bound to other hierarchies at any time. This allows | |
101 | mixing unified hierarchy with the traditional multiple hierarchies in | |
102 | a fully backward compatible way. | |
103 | ||
104 | ||
105 | 2-2. cgroup.subtree_control | |
106 | ||
107 | All cgroups on unified hierarchy have a "cgroup.subtree_control" file | |
108 | which governs which controllers are enabled on the children of the | |
109 | cgroup. Let's assume a hierarchy like the following. | |
110 | ||
111 | root - A - B - C | |
112 | \ D | |
113 | ||
114 | root's "cgroup.subtree_control" file determines which controllers are | |
115 | enabled on A, A's on B, and B's on C and D. This coincides with the | |
116 | fact that controllers on the immediate sub-level are used to | |
117 | distribute the resources of the parent. In fact, it's natural to | |
118 | assume that resource control knobs of a child belong to its parent. | |
119 | Enabling a controller in a "cgroup.subtree_control" file declares that | |
120 | distribution of the respective resources of the cgroup will be | |
121 | controlled. Note that this means that controller enable states are | |
122 | shared among siblings. | |
123 | ||
124 | When read, the file contains a space-separated list of currently | |
125 | enabled controllers. A write to the file should contain a | |
126 | space-separated list of controllers with '+' or '-' prefixed (without | |
127 | the quotes). Controllers prefixed with '+' are enabled and '-' | |
128 | disabled. If a controller is listed multiple times, the last entry | |
129 | wins. The specific operations are executed atomically - either all | |
130 | succeed or all fail. | |
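
The write semantics described above can be modeled in a few lines of Python. This is only a sketch of the rules as stated in the text (prefix parsing, last entry wins, all-or-nothing application), not the kernel implementation. The function name and the `available` parameter are hypothetical. Rejecting a '+' entry for a controller that is not available is an assumption based on the description of "cgroup.controllers".

```python
def apply_subtree_control(enabled, available, write):
    """Model one write to "cgroup.subtree_control".

    enabled   -- set of controllers currently enabled in the file
    available -- set of controllers that may be enabled (assumed to be
                 the content of "cgroup.controllers")
    write     -- the string written, e.g. "+memory -cpu"

    Returns the new enabled set, or raises ValueError and changes nothing.
    """
    requested = {}
    for token in write.split():
        if len(token) < 2 or token[0] not in "+-":
            raise ValueError("bad token: %r" % token)
        # A later entry for the same controller overrides an earlier one.
        requested[token[1:]] = (token[0] == "+")

    # Validate everything first so the write is applied as a whole or
    # not at all.
    for name, enable in requested.items():
        if enable and name not in available:
            raise ValueError("controller not available: %s" % name)

    result = set(enabled)
    for name, enable in requested.items():
        if enable:
            result.add(name)
        else:
            # Disabling a controller which isn't enabled is a no-op here.
            result.discard(name)
    return result
```

For instance, writing "+memory -memory" leaves memory disabled because the last entry wins, and a write that names one unavailable controller leaves the enabled set untouched.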
131 | ||
132 | ||
133 | 2-3. cgroup.controllers | |
134 | ||
135 | Read-only "cgroup.controllers" file contains a space-separated list of | |
136 | controllers which can be enabled in the cgroup's | |
137 | "cgroup.subtree_control" file. | |
138 | ||
139 | In the root cgroup, this lists controllers which are not bound to | |
140 | other hierarchies and the content changes as controllers are bound to | |
141 | and unbound from other hierarchies. | |
142 | ||
143 | In non-root cgroups, the content of this file equals that of the | |
144 | parent's "cgroup.subtree_control" file as only controllers enabled | |
145 | from the parent can be used in its children. | |
146 | ||
147 | ||
148 | 3. Structural Constraints | |
149 | ||
150 | 3-1. Top-down | |
151 | ||
152 | As it doesn't make sense to nest control of an uncontrolled resource, | |
153 | all non-root "cgroup.subtree_control" files can only contain | |
154 | controllers which are enabled in the parent's "cgroup.subtree_control" | |
155 | file. A controller can be enabled only if the parent has the | |
156 | controller enabled and a controller can't be disabled if one or more | |
157 | children have it enabled. | |
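
As an illustration of the top-down rule, the following Python sketch models only the two checks stated above. The class and method names are hypothetical, and the root is assumed to accept any controller, because the model doesn't track which controllers are bound to other hierarchies.

```python
class Cgroup:
    """Toy model of the top-down constraint on "cgroup.subtree_control"."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.subtree_control = set()
        if parent is not None:
            parent.children.append(self)

    def enable(self, controller):
        # A non-root cgroup may enable only what its parent has enabled.
        if (self.parent is not None
                and controller not in self.parent.subtree_control):
            raise ValueError("%s: parent doesn't have %s enabled"
                             % (self.name, controller))
        self.subtree_control.add(controller)

    def disable(self, controller):
        # A controller can't be disabled while any child has it enabled.
        for child in self.children:
            if controller in child.subtree_control:
                raise ValueError("%s: child %s still has %s enabled"
                                 % (self.name, child.name, controller))
        self.subtree_control.discard(controller)
```

In this model, enabling has to proceed from the root towards the leaves and disabling has to proceed in the opposite direction.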
158 | ||
159 | ||
160 | 3-2. No internal tasks | |
161 | ||
162 | One long-standing issue that cgroup faces is the competition between | |
163 | tasks belonging to the parent cgroup and its children cgroups. This | |
164 | is inherently nasty as two different types of entities compete and | |
165 | there is no agreed-upon obvious way to handle it. Different | |
166 | controllers are doing different things. | |
167 | ||
168 | The cpu controller considers tasks and cgroups as equivalents and maps | |
169 | nice levels to cgroup weights. This works for some cases but falls | |
170 | flat when children should be allocated specific ratios of CPU cycles | |
171 | and the number of internal tasks fluctuates - the ratios constantly | |
172 | change as the number of competing entities fluctuates. There also are | |
173 | other issues. The mapping from nice level to weight isn't obvious or | |
174 | universal, and there are various other knobs which simply aren't | |
175 | available for tasks. | |
176 | ||
177 | The blkio controller implicitly creates a hidden leaf node for each | |
178 | cgroup to host the tasks. The hidden leaf has its own copies of all | |
179 | the knobs with "leaf_" prefixed. While this allows equivalent control | |
180 | over internal tasks, it comes with serious drawbacks. It always adds an | |
181 | extra layer of nesting which may not be necessary, makes the interface | |
182 | messy and significantly complicates the implementation. | |
183 | ||
184 | The memory controller currently doesn't have a way to control what | |
185 | happens between internal tasks and child cgroups and the behavior is | |
186 | not clearly defined. There have been attempts to add ad-hoc behaviors | |
187 | and knobs to tailor the behavior to specific workloads. Continuing | |
188 | this direction will lead to problems which will be extremely difficult | |
189 | to resolve in the long term. | |
190 | ||
191 | Multiple controllers struggle with internal tasks and have come up with | |
192 | different ways to deal with them; unfortunately, all the approaches in | |
193 | use now are severely flawed and, furthermore, the widely different | |
194 | behaviors make cgroup as a whole highly inconsistent. | |
195 | ||
196 | It is clear that this is something which needs to be addressed from | |
197 | cgroup core proper in a uniform way so that controllers don't need to | |
198 | worry about it and cgroup as a whole shows a consistent and logical | |
199 | behavior. To achieve that, unified hierarchy enforces the following | |
200 | structural constraint: | |
201 | ||
202 | Except for the root, only cgroups which don't contain any task may | |
203 | have controllers enabled in their "cgroup.subtree_control" files. | |
204 | ||
205 | Combined with other properties, this guarantees that, when a | |
206 | controller is looking at the part of the hierarchy which has it | |
207 | enabled, tasks are always only on the leaves. This rules out | |
208 | situations where child cgroups compete against internal tasks of the | |
209 | parent. | |
210 | ||
211 | There are two things to note. Firstly, the root cgroup is exempt from | |
212 | the restriction. Root contains tasks and anonymous resource | |
213 | consumption which can't be associated with any other cgroup and | |
214 | requires special treatment from most controllers. How resource | |
215 | consumption in the root cgroup is governed is up to each controller. | |
216 | ||
217 | Secondly, the restriction doesn't take effect if there is no enabled | |
218 | controller in the cgroup's "cgroup.subtree_control" file. This is | |
219 | important as otherwise it wouldn't be possible to create children of a | |
220 | populated cgroup. To control resource distribution of a cgroup, the | |
221 | cgroup must create children and transfer all its tasks to the children | |
222 | before enabling controllers in its "cgroup.subtree_control" file. | |
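
The procedure in the last paragraph (create children, move all tasks into them, then enable controllers) can be sketched with a toy model. Everything here is hypothetical naming. The model covers only the no-internal-tasks rule and leaves out the top-down check. Rejecting attachment to a non-root cgroup that already has controllers enabled is an assumption that follows from the constraint as stated.

```python
class Node:
    """Toy model of the "no internal tasks" constraint."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.procs = set()
        self.subtree_control = set()

    def mkdir(self, name):
        child = Node(name, self)
        self.children[name] = child
        return child

    def attach(self, pid):
        # Non-root cgroups with enabled controllers can't host tasks.
        if self.parent is not None and self.subtree_control:
            raise ValueError("%s has controllers enabled" % self.name)
        self.procs.add(pid)

    def enable(self, controller):
        # Non-root cgroups which contain tasks can't enable controllers.
        if self.parent is not None and self.procs:
            raise ValueError("%s still contains tasks" % self.name)
        self.subtree_control.add(controller)


def move_tasks_then_enable(node, controller, leaf_name="leaf"):
    """Create a child, migrate every task into it, then enable."""
    leaf = node.mkdir(leaf_name)
    for pid in sorted(node.procs):   # sorted() copies, so mutation is safe
        leaf.attach(pid)
        node.procs.discard(pid)
    node.enable(controller)
    return leaf
```

The root node is exempt in both checks, which mirrors the first note above.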
223 | ||
224 | ||
225 | 4. Other Changes | |
226 | ||
227 | 4-1. [Un]populated Notification | |
228 | ||
229 | cgroup users often need a way to determine when a cgroup's | |
230 | subhierarchy becomes empty so that it can be cleaned up. cgroup | |
231 | currently provides release_agent for it; unfortunately, this mechanism | |
232 | is riddled with issues. | |
233 | ||
234 | - It delivers events by forking and execing a userland binary | |
235 | specified as the release_agent. This is a long deprecated method of | |
236 | notification delivery. It's extremely heavy, slow and cumbersome to | |
237 | integrate with larger infrastructure. | |
238 | ||
239 | - There is a single monitoring point at the root. There's no way to | |
240 | delegate management of a subtree. | |
241 | ||
242 | - The event isn't recursive. It triggers when a cgroup doesn't have | |
243 | any tasks or child cgroups. Events for internal nodes trigger only | |
244 | after all children are removed. This again makes it impossible to | |
245 | delegate management of a subtree. | |
246 | ||
247 | - Events are filtered from the kernel side. A "notify_on_release" | |
248 | file is used to subscribe to or suppress release events. This is | |
249 | unnecessarily complicated and probably done this way because event | |
250 | delivery itself was expensive. | |
251 | ||
252 | Unified hierarchy implements an interface file "cgroup.populated" | |
253 | which can be used to monitor whether the cgroup's subhierarchy has | |
254 | tasks in it or not. Its value is 0 if there is no task in the cgroup | |
255 | and its descendants; otherwise, 1. poll and [id]notify events are | |
256 | triggered when the value changes. | |
257 | ||
258 | This is significantly lighter and simpler and trivially allows | |
259 | delegating management of a subhierarchy - subhierarchy monitoring can | |
260 | block further propagation simply by putting itself or another process | |
261 | in the subhierarchy and monitoring events that it's interested in from | |
262 | there without interfering with monitoring higher in the tree. | |
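
The recursive meaning of "cgroup.populated" can be modeled as follows. This is a sketch over a hypothetical in-memory tree (dicts with "procs" and "children" keys), not code that reads the real interface file.

```python
def populated(cgroup):
    """Return 1 if the cgroup or any descendant contains a task, else 0.

    cgroup is a dict of the form {"procs": [...], "children": [...]}.
    """
    if cgroup["procs"]:
        return 1
    for child in cgroup["children"]:
        if populated(child):
            return 1
    return 0
```

In this model, a manager process that lives in a subhierarchy keeps the value at 1 for every ancestor, so events from deeper in that subhierarchy do not change what monitors higher in the tree observe.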
263 | ||
264 | In unified hierarchy, the release_agent mechanism is no longer | |
265 | supported and the interface files "release_agent" and | |
266 | "notify_on_release" do not exist. | |
267 | ||
268 | ||
269 | 4-2. Other Core Changes | |
270 | ||
271 | - None of the mount options is allowed. | |
272 | ||
273 | - remount is disallowed. | |
274 | ||
275 | - rename(2) is disallowed. | |
276 | ||
277 | - The "tasks" file is removed. Everything should be at process | |
278 | granularity. Use the "cgroup.procs" file instead. | |
279 | ||
280 | - The "cgroup.procs" file is not sorted. pids will be unique unless | |
281 | they got recycled in-between reads. | |
282 | ||
283 | - The "cgroup.clone_children" file is removed. | |
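
Because "cgroup.procs" is no longer sorted, userland that wants an ordered list has to sort it itself. A minimal sketch, assuming the file content is a whitespace-separated list of decimal pids; the function name is hypothetical. Deduplication is included only as a defensive step for callers that merge the output of several reads.

```python
def parse_procs(text):
    """Turn the content of a "cgroup.procs" read into a sorted pid list."""
    return sorted({int(field) for field in text.split()})
```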
284 | ||
285 | ||
286 | 4-3. Per-Controller Changes | |
287 | ||
288 | 4-3-1. blkio | |
289 | ||
290 | - blk-throttle becomes properly hierarchical. | |
291 | ||
292 | ||
293 | 4-3-2. cpuset | |
294 | ||
295 | - Tasks are kept in empty cpusets after hotplug and take on the masks | |
296 | of the nearest non-empty ancestor, instead of being moved to it. | |
297 | ||
298 | - A task can be moved into an empty cpuset, and again it takes on the | |
299 | masks of the nearest non-empty ancestor. | |
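
The "nearest non-empty ancestor" rule can be illustrated with a small model. The data layout (dicts with "cpus" and "parent" keys) is hypothetical and only the CPU mask is modeled; the same walk would apply to the memory node mask.

```python
def effective_cpus(cpuset):
    """Return the CPU mask that tasks in this cpuset effectively use.

    cpuset is a dict of the form {"cpus": frozenset, "parent": dict or None}.
    An empty mask falls back to the nearest ancestor with a non-empty one.
    """
    node = cpuset
    while node is not None:
        if node["cpus"]:
            return node["cpus"]
        node = node["parent"]
    return frozenset()
```

Tasks stay where they are in this model; only the mask they run with changes when a cpuset becomes empty.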
300 | ||
301 | ||
302 | 4-3-3. memory | |
303 | ||
304 | - use_hierarchy is on by default and the cgroup file for the flag is | |
305 | not created. | |
306 | ||
307 | ||
308 | 5. Planned Changes | |
309 | ||
310 | 5-1. CAP for resource control | |
311 | ||
312 | Unified hierarchy will require one of the capabilities(7), which is | |
313 | yet to be decided, for all resource control related knobs. Process | |
314 | organization operations - creation of sub-cgroups and migration of | |
315 | processes in sub-hierarchies - may be delegated by changing the | |
316 | ownership and/or permissions on the cgroup directory and | |
317 | "cgroup.procs" interface file; however, all operations which affect | |
318 | resource control - writes to a "cgroup.subtree_control" file or any | |
319 | controller-specific knobs - will require an explicit CAP privilege. | |
320 | ||
321 | This, in part, is to prevent the cgroup interface from being | |
322 | inadvertently promoted to a programmable API used by non-privileged | |
323 | binaries. cgroup exposes various aspects of the system in ways which | |
324 | aren't properly abstracted for direct consumption by regular programs. | |
325 | This is an administration interface much closer to sysctl knobs than | |
326 | system calls. Even the basic access model, being filesystem path | |
327 | based, isn't suitable for direct consumption. There's no way to | |
328 | access "my cgroup" in a race-free way or make multiple operations | |
329 | atomic against migration to another cgroup. | |
330 | ||
331 | Another aspect is that, for better or for worse, the cgroup interface | |
332 | goes through far less scrutiny than regular interfaces for | |
333 | unprivileged userland. The upside is that cgroup is able to expose | |
334 | useful features which may not be suitable for general consumption in a | |
335 | reasonable time frame. It provides a relatively short path between | |
336 | internal details and userland-visible interface. Of course, this | |
337 | shortcut comes with high risk. We go through what we go through for | |
338 | general kernel APIs for good reasons. It may end up leaking internal | |
339 | details in a way which can exert significant pain by locking the | |
340 | kernel into a contract that can't be maintained in a reasonable | |
341 | manner. | |
342 | ||
343 | Also, due to the specific nature, cgroup and its controllers don't | |
344 | tend to attract attention from a wide scope of developers. cgroup's | |
345 | short history is already fraught with severely mis-designed | |
346 | interfaces, unnecessary commitments to and exposing of internal | |
347 | details, and broken and dangerous implementations of various features. | |
348 | ||
349 | Keeping cgroup as an administration interface is both advantageous for | |
350 | its role and imperative given its nature. Some of the cgroup features | |
351 | may make sense for unprivileged access. If deemed justified, those | |
352 | must be further abstracted and implemented as a different interface, | |
353 | be it a system call or process-private filesystem, and survive through | |
354 | the scrutiny that any interface for general consumption is required to | |
355 | go through. | |
356 | ||
357 | Requiring CAP is not a complete solution but should serve as a | |
358 | significant deterrent against spraying cgroup usages in non-privileged | |
359 | programs. |