Commit | Line | Data |
---|---|---|
7063fbf2 | 1 | |
6c28f2c0 | 2 | configfs - Userspace-driven kernel object configuration. |
7063fbf2 JB | 3 | |
4 | Joel Becker <joel.becker@oracle.com> | |
5 | ||
6 | Updated: 31 March 2005 | |
7 | ||
8 | Copyright (c) 2005 Oracle Corporation, | |
9 | Joel Becker <joel.becker@oracle.com> | |
10 | ||
11 | ||
12 | [What is configfs?] | |
13 | ||
14 | configfs is a ram-based filesystem that provides the converse of | |
15 | sysfs's functionality. Where sysfs is a filesystem-based view of | |
16 | kernel objects, configfs is a filesystem-based manager of kernel | |
17 | objects, or config_items. | |
18 | ||
19 | With sysfs, an object is created in kernel (for example, when a device | |
20 | is discovered) and it is registered with sysfs. Its attributes then | |
21 | appear in sysfs, allowing userspace to read the attributes via | |
22 | readdir(3)/read(2). It may allow some attributes to be modified via | |
23 | write(2). The important point is that the object is created and | |
24 | destroyed in kernel, the kernel controls the lifecycle of the sysfs | |
25 | representation, and sysfs is merely a window on all this. | |
26 | ||
27 | A configfs config_item is created via an explicit userspace operation: | |
28 | mkdir(2). It is destroyed via rmdir(2). The attributes appear at | |
29 | mkdir(2) time, and can be read or modified via read(2) and write(2). | |
30 | As with sysfs, readdir(3) queries the list of items and/or attributes. | |
31 | symlink(2) can be used to group items together. Unlike sysfs, the | |
32 | lifetime of the representation is completely driven by userspace. The | |
33 | kernel modules backing the items must respond to this. | |
34 | ||
35 | Both sysfs and configfs can and should exist together on the same | |
36 | system. One is not a replacement for the other. | |
37 | ||
38 | [Using configfs] | |
39 | ||
40 | configfs can be compiled as a module or into the kernel. You can access | |
41 | it by doing | |
42 | ||
43 | mount -t configfs none /config | |
44 | ||
45 | The configfs tree will be empty unless client modules are also loaded. | |
46 | These are modules that register their item types with configfs as | |
47 | subsystems. Once a client subsystem is loaded, it will appear as a | |
48 | subdirectory (or more than one) under /config. Like sysfs, the | |
49 | configfs tree is always there, whether mounted on /config or not. | |
50 | ||
51 | An item is created via mkdir(2). The item's attributes will also | |
52 | appear at this time. readdir(3) can determine what the attributes are, | |
53 | read(2) can query their default values, and write(2) can store new | |
54 | values. Like sysfs, attributes should be ASCII text files, preferably | |
55 | with only one value per file. The same efficiency caveats from sysfs | |
56 | apply. Don't mix more than one attribute in one attribute file. | |
57 | ||
58 | Like sysfs, configfs expects write(2) to store the entire buffer at | |
59 | once. When writing to configfs attributes, userspace processes should | |
60 | first read the entire file, modify the portions they wish to change, and | |
61 | then write the entire buffer back. Attribute files have a maximum size | |
62 | of one page (PAGE_SIZE, 4096 on i386). | |
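The read, modify, write-back cycle described above can be sketched from userspace as follows. This is a minimal sketch, not part of configfs: the helper name attr_replace() and the ATTR_MAX constant are assumptions made for illustration, and the 4096-byte limit simply mirrors the i386 PAGE_SIZE mentioned above.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ATTR_MAX 4096	/* assumed page size; the real limit is PAGE_SIZE */

/*
 * Hypothetical helper: replace the first occurrence of `from` with `to`
 * in the attribute file at `path`. Returns 0 on success, -1 on error.
 */
int attr_replace(const char *path, const char *from, const char *to)
{
	char cur[ATTR_MAX + 1], out[ATTR_MAX + 1];
	ssize_t len;
	char *hit;
	int fd, n;

	/* Step 1: read the entire attribute. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	len = read(fd, cur, ATTR_MAX);
	close(fd);
	if (len < 0)
		return -1;
	cur[len] = '\0';

	/* Step 2: modify the copy in memory. */
	hit = strstr(cur, from);
	if (!hit)
		return -1;
	n = snprintf(out, sizeof(out), "%.*s%s%s",
		     (int)(hit - cur), cur, to, hit + strlen(from));
	if (n < 0 || n > ATTR_MAX)
		return -1;	/* result would not fit in one page */

	/* Step 3: write the whole buffer back with a single write(2). */
	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;
	len = write(fd, out, (size_t)n);
	close(fd);
	return len == n ? 0 : -1;
}
```

The point of the sketch is the single write(2) at the end: the new contents are assembled completely in memory first, so the attribute never sees a partial update.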
63 | ||
64 | When an item needs to be destroyed, remove it with rmdir(2). An | |
65 | item cannot be destroyed if any other item has a link to it (via | |
66 | symlink(2)). Links can be removed via unlink(2). | |
67 | ||
68 | [Configuring FakeNBD: an Example] | |
69 | ||
70 | Imagine there's a Network Block Device (NBD) driver that allows you to | |
71 | access remote block devices. Call it FakeNBD. FakeNBD uses configfs | |
72 | for its configuration. Obviously, there will be a nice program that | |
73 | sysadmins use to configure FakeNBD, but somehow that program has to tell | |
74 | the driver about it. Here's where configfs comes in. | |
75 | ||
76 | When the FakeNBD driver is loaded, it registers itself with configfs. | |
77 | readdir(3) sees this just fine: | |
78 | ||
79 | # ls /config | |
80 | fakenbd | |
81 | ||
82 | A fakenbd connection can be created with mkdir(2). The name is | |
83 | arbitrary, but likely the tool will make some use of the name. Perhaps | |
84 | it is a uuid or a disk name: | |
85 | ||
86 | # mkdir /config/fakenbd/disk1 | |
87 | # ls /config/fakenbd/disk1 | |
88 | target device rw | |
89 | ||
90 | The target attribute contains the IP address of the server FakeNBD will | |
91 | connect to. The device attribute is the device on the server. | |
92 | Predictably, the rw attribute determines whether the connection is | |
93 | read-only or read-write. | |
94 | ||
95 | # echo 10.0.0.1 > /config/fakenbd/disk1/target | |
96 | # echo /dev/sda1 > /config/fakenbd/disk1/device | |
97 | # echo 1 > /config/fakenbd/disk1/rw | |
98 | ||
99 | That's it. That's all there is. Now the device is configured, via the | |
100 | shell no less. | |
101 | ||
102 | [Coding With configfs] | |
103 | ||
104 | Every object in configfs is a config_item. A config_item reflects an | |
105 | object in the subsystem. It has attributes that match values on that | |
106 | object. configfs handles the filesystem representation of that object | |
107 | and its attributes, allowing the subsystem to ignore all but the | |
108 | basic show/store interaction. | |
109 | ||
110 | Items are created and destroyed inside a config_group. A group is a | |
111 | collection of items that share the same attributes and operations. | |
112 | Items are created by mkdir(2) and removed by rmdir(2), but configfs | |
113 | handles that. The group has a set of operations to perform these tasks. | |
114 | ||
115 | A subsystem is the top level of a client module. During initialization, | |
116 | the client module registers the subsystem with configfs, and the subsystem | |
117 | appears as a directory at the top of the configfs filesystem. A | |
118 | subsystem is also a config_group, and can do everything a config_group | |
119 | can. | |
120 | ||
121 | [struct config_item] | |
122 | ||
123 | struct config_item { | |
124 | char *ci_name; | |
125 | char ci_namebuf[UOBJ_NAME_LEN]; | |
126 | struct kref ci_kref; | |
127 | struct list_head ci_entry; | |
128 | struct config_item *ci_parent; | |
129 | struct config_group *ci_group; | |
130 | struct config_item_type *ci_type; | |
131 | struct dentry *ci_dentry; | |
132 | }; | |
133 | ||
134 | void config_item_init(struct config_item *); | |
135 | void config_item_init_type_name(struct config_item *, | |
136 | const char *name, | |
137 | struct config_item_type *type); | |
138 | struct config_item *config_item_get(struct config_item *); | |
139 | void config_item_put(struct config_item *); | |
140 | ||
141 | Generally, struct config_item is embedded in a container structure, a | |
142 | structure that actually represents what the subsystem is doing. The | |
143 | config_item portion of that structure is how the object interacts with | |
144 | configfs. | |
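The embedding pattern can be modeled in userspace as below. This is only a sketch under stated assumptions: struct config_item here is a one-field stand-in for the kernel structure, container_of() is re-created with offsetof(), and struct fakenbd_conn and to_fakenbd_conn() are hypothetical names borrowed from the FakeNBD example.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct config_item; only one field is modeled. */
struct config_item {
	const char *ci_name;
};

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical container: the object the subsystem actually cares about. */
struct fakenbd_conn {
	int rw;				/* subsystem-private state */
	struct config_item item;	/* how configfs sees the object */
};

/* Recover the container from the config_item that configfs hands back. */
struct fakenbd_conn *to_fakenbd_conn(struct config_item *item)
{
	return item ? container_of(item, struct fakenbd_conn, item) : NULL;
}
```

Every callback receives only the config_item pointer, so a conversion helper of this shape is typically the first line of each method.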
145 | ||
146 | Whether statically defined in a source file or created by a parent | |
147 | config_group, a config_item must have one of the _init() functions | |
148 | called on it. This initializes the reference count and sets up the | |
149 | appropriate fields. | |
150 | ||
151 | All users of a config_item should have a reference on it via | |
152 | config_item_get(), and drop the reference when they are done via | |
153 | config_item_put(). | |
154 | ||
155 | By itself, a config_item cannot do much more than appear in configfs. | |
156 | Usually a subsystem wants the item to display and/or store attributes, | |
157 | among other things. For that, it needs a type. | |
158 | ||
159 | [struct config_item_type] | |
160 | ||
161 | struct configfs_item_operations { | |
162 | void (*release)(struct config_item *); | |
7063fbf2 JB | 163 | int (*allow_link)(struct config_item *src, |
164 | struct config_item *target); | |
165 | int (*drop_link)(struct config_item *src, | |
166 | struct config_item *target); | |
167 | }; | |
168 | ||
169 | struct config_item_type { | |
170 | struct module *ct_owner; | |
171 | struct configfs_item_operations *ct_item_ops; | |
172 | struct configfs_group_operations *ct_group_ops; | |
173 | struct configfs_attribute **ct_attrs; | |
174 | }; | |
175 | ||
176 | The most basic function of a config_item_type is to define what | |
177 | operations can be performed on a config_item. All items that have been | |
178 | allocated dynamically will need to provide the ct_item_ops->release() | |
179 | method. This method is called when the config_item's reference count | |
51798222 | 180 | reaches zero. |
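The relationship between the reference count and ->release() can be illustrated with a small userspace model. Nothing here is the kernel implementation: struct model_item, item_init(), item_get() and item_put() are hypothetical stand-ins for config_item and its helpers, a plain int replaces struct kref, and the model is not thread-safe.

```c
#include <stdlib.h>

struct model_item;

/* Stand-in for configfs_item_operations; only release() is modeled. */
struct model_item_ops {
	void (*release)(struct model_item *item);
};

/* Stand-in for struct config_item: a plain counter replaces struct kref. */
struct model_item {
	int refcount;
	const struct model_item_ops *ops;
};

int released;	/* counts release() calls, for demonstration only */

void item_init(struct model_item *item, const struct model_item_ops *ops)
{
	item->refcount = 1;	/* initialization starts the count at one */
	item->ops = ops;
}

struct model_item *item_get(struct model_item *item)
{
	item->refcount++;
	return item;
}

void item_put(struct model_item *item)
{
	/* The final put triggers release(), which frees the container. */
	if (--item->refcount == 0 && item->ops && item->ops->release)
		item->ops->release(item);
}

/* Hypothetical dynamically allocated container. */
struct simple_child {
	struct model_item item;	/* first member, so the pointers coincide */
	int storeme;
};

void simple_child_release(struct model_item *item)
{
	released++;
	free((struct simple_child *)item);
}

const struct model_item_ops simple_child_ops = {
	.release = simple_child_release,
};

struct simple_child *simple_child_new(void)
{
	struct simple_child *c = calloc(1, sizeof(*c));

	if (c)
		item_init(&c->item, &simple_child_ops);
	return c;
}
```

A statically defined item has no memory to free, which is why only dynamically allocated items need a release() method.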
7063fbf2 JB | 181 | |
182 | [struct configfs_attribute] | |
183 | ||
184 | struct configfs_attribute { | |
185 | char *ca_name; | |
186 | struct module *ca_owner; | |
43947514 | 187 | umode_t ca_mode; |
51798222 CH | 188 | ssize_t (*show)(struct config_item *, char *); |
189 | ssize_t (*store)(struct config_item *, const char *, size_t); | |
7063fbf2 JB | 190 | }; |
191 | ||
192 | When a config_item wants an attribute to appear as a file in the item's | |
193 | configfs directory, it must define a configfs_attribute describing it. | |
194 | It then adds the attribute to the NULL-terminated array | |
195 | config_item_type->ct_attrs. When the item appears in configfs, the | |
196 | attribute file will appear with the configfs_attribute->ca_name | |
197 | filename. configfs_attribute->ca_mode specifies the file permissions. | |
198 | ||
51798222 CH | 199 | If an attribute is readable and provides a ->show method, that method will |
200 | be called whenever userspace asks for a read(2) on the attribute. If an | |
201 | attribute is writable and provides a ->store method, that method will be | |
202 | called whenever userspace asks for a write(2) on the attribute. | |
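A pair of show/store methods for a single integer attribute might look roughly like the userspace model below. It is a sketch under several assumptions: the functions take the hypothetical struct simple_child directly instead of a config_item, PAGE_BUF stands in for the page-sized buffer, and the store side assumes the buffer it receives is NUL-terminated.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define PAGE_BUF 4096	/* assumed size of the buffer passed to show */

/* Hypothetical object with one integer attribute. */
struct simple_child {
	int storeme;
};

/* show: format the current value into the page-sized buffer. */
ssize_t storeme_show(const struct simple_child *c, char *page)
{
	return snprintf(page, PAGE_BUF, "%d\n", c->storeme);
}

/* store: parse the whole buffer; reject anything that is not one integer. */
ssize_t storeme_store(struct simple_child *c, const char *page, size_t count)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(page, &end, 10);
	if (end == page || errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return -EINVAL;
	if (*end == '\n')
		end++;		/* tolerate the newline that echo appends */
	if (*end != '\0')
		return -EINVAL;

	c->storeme = (int)val;
	return (ssize_t)count;	/* report the entire buffer as consumed */
}
```

Returning the full count from store matches the expectation that write(2) stores the entire buffer at once. A negative errno leaves the old value untouched.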
7063fbf2 JB | 203 | |
204 | [struct config_group] | |
205 | ||
4ae0edc2 | 206 | A config_item cannot live in a vacuum. The only way one can be created |
7063fbf2 JB | 207 | is via mkdir(2) on a config_group. This will trigger creation of a |
208 | child item. | |
209 | ||
210 | struct config_group { | |
211 | struct config_item cg_item; | |
212 | struct list_head cg_children; | |
213 | struct configfs_subsystem *cg_subsys; | |
214 | struct config_group **default_groups; | |
215 | }; | |
216 | ||
217 | void config_group_init(struct config_group *group); | |
218 | void config_group_init_type_name(struct config_group *group, | |
219 | const char *name, | |
220 | struct config_item_type *type); | |
221 | ||
222 | ||
223 | The config_group structure contains a config_item. Properly configuring | |
224 | that item means that a group can behave as an item in its own right. | |
225 | However, it can do more: it can create child items or groups. This is | |
226 | accomplished via the group operations specified on the group's | |
227 | config_item_type. | |
228 | ||
229 | struct configfs_group_operations { | |
f89ab861 JB | 230 | struct config_item *(*make_item)(struct config_group *group, |
231 | const char *name); | |
232 | struct config_group *(*make_group)(struct config_group *group, | |
233 | const char *name); | |
7063fbf2 | 234 | int (*commit_item)(struct config_item *item); |
299894cc JB | 235 | void (*disconnect_notify)(struct config_group *group, |
236 | struct config_item *item); | |
7063fbf2 JB | 237 | void (*drop_item)(struct config_group *group, |
238 | struct config_item *item); | |
239 | }; | |
240 | ||
241 | A group creates child items by providing the | |
242 | ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new | |
243 | config_item (or more likely, its container structure), initializes it, | |
244 | and returns it to configfs. Configfs will then populate the filesystem | |
245 | tree to reflect the new item. | |
246 | ||
247 | If the subsystem wants the child to be a group itself, the subsystem | |
248 | provides ct_group_ops->make_group(). Everything else behaves the same, | |
249 | using the group _init() functions on the group. | |
250 | ||
251 | Finally, when userspace calls rmdir(2) on the item or group, | |
252 | ct_group_ops->drop_item() is called. As a config_group is also a | |
53cb4726 | 253 | config_item, a separate drop_group() method is not necessary. |
7063fbf2 JB | 254 | The subsystem must config_item_put() the reference that was initialized |
255 | upon item allocation. If a subsystem has no work to do, it may omit | |
256 | the ct_group_ops->drop_item() method, and configfs will call | |
257 | config_item_put() on the item on behalf of the subsystem. | |
258 | ||
259 | IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2) | |
260 | is called, configfs WILL remove the item from the filesystem tree | |
261 | (assuming that it has no children to keep it busy). The subsystem is | |
262 | responsible for responding to this. If the subsystem has references to | |
263 | the item in other threads, the memory is safe. It may take some time | |
264 | for the item to actually disappear from the subsystem's usage. But it | |
265 | is gone from configfs. | |
266 | ||
299894cc JB | 267 | When drop_item() is called, the item's linkage has already been torn |
268 | down. It no longer has a reference on its parent and has no place in | |
269 | the item hierarchy. If a client needs to do some cleanup before this | |
270 | teardown happens, the subsystem can implement the | |
271 | ct_group_ops->disconnect_notify() method. The method is called after | |
272 | configfs has removed the item from the filesystem view but before the | |
273 | item is removed from its parent group. Like drop_item(), | |
274 | disconnect_notify() is void and cannot fail. Client subsystems should | |
275 | not drop any references here, as they still must do it in drop_item(). | |
276 | ||
7063fbf2 JB | 277 | A config_group cannot be removed while it still has child items. This |
278 | is implemented in the configfs rmdir(2) code. ->drop_item() will not be | |
279 | called, as the item has not been dropped. rmdir(2) will fail, as the | |
280 | directory is not empty. | |
281 | ||
282 | [struct configfs_subsystem] | |
283 | ||
4ae0edc2 | 284 | A subsystem must register itself, usually at module_init time. This |
7063fbf2 JB | 285 | tells configfs to make the subsystem appear in the file tree. |
286 | ||
287 | struct configfs_subsystem { | |
288 | struct config_group su_group; | |
e6bd07ae | 289 | struct mutex su_mutex; |
7063fbf2 JB | 290 | }; |
291 | ||
292 | int configfs_register_subsystem(struct configfs_subsystem *subsys); | |
293 | void configfs_unregister_subsystem(struct configfs_subsystem *subsys); | |
294 | ||
e6bd07ae | 295 | A subsystem consists of a toplevel config_group and a mutex. |
7063fbf2 JB | 296 | The group is where child config_items are created. For a subsystem, |
297 | this group is usually defined statically. Before calling | |
298 | configfs_register_subsystem(), the subsystem must have initialized the | |
299 | group via the usual group _init() functions, and it must also have | |
e6bd07ae | 300 | initialized the mutex. |
7063fbf2 JB | 301 | When the register call returns, the subsystem is live, and it |
302 | will be visible via configfs. At that point, mkdir(2) can be called and | |
303 | the subsystem must be ready for it. | |
304 | ||
305 | [An Example] | |
306 | ||
307 | The best example of these basic concepts is the simple_children | |
51798222 CH | 308 | subsystem/group and the simple_child item in |
309 | samples/configfs/configfs_sample.c. It shows a trivial object displaying | |
310 | and storing an attribute, and a simple group creating and destroying | |
311 | these children. | |
7063fbf2 | 312 | |
e6bd07ae | 313 | [Hierarchy Navigation and the Subsystem Mutex] |
7063fbf2 JB | 314 | |
315 | There is an extra bonus that configfs provides. The config_groups and | |
316 | config_items are arranged in a hierarchy due to the fact that they | |
317 | appear in a filesystem. A subsystem is NEVER to touch the filesystem | |
318 | parts, but the subsystem might be interested in this hierarchy. For | |
319 | this reason, the hierarchy is mirrored via the config_group->cg_children | |
320 | and config_item->ci_parent structure members. | |
321 | ||
322 | A subsystem can navigate the cg_children list and the ci_parent pointer | |
323 | to see the tree created by the subsystem. This can race with configfs' | |
e6bd07ae | 324 | management of the hierarchy, so configfs uses the subsystem mutex to |
7063fbf2 JB | 325 | protect modifications. Whenever a subsystem wants to navigate the |
326 | hierarchy, it must do so under the protection of the subsystem | |
e6bd07ae | 327 | mutex. |
7063fbf2 | 328 | |
e6bd07ae | 329 | A subsystem will be prevented from acquiring the mutex while a newly |
7063fbf2 | 330 | allocated item has not been linked into this hierarchy. Similarly, it |
e6bd07ae | 331 | will not be able to acquire the mutex while a dropping item has not |
7063fbf2 JB | 332 | yet been unlinked. This means that an item's ci_parent pointer will |
333 | never be NULL while the item is in configfs, and that an item will only | |
334 | be in its parent's cg_children list for the same duration. This allows | |
335 | a subsystem to trust ci_parent and cg_children while they hold the | |
e6bd07ae | 336 | mutex. |
7063fbf2 JB | 337 | |
338 | [Item Aggregation Via symlink(2)] | |
339 | ||
340 | configfs provides a simple group via the group->item parent/child | |
341 | relationship. Often, however, a larger environment requires aggregation | |
342 | outside of the parent/child connection. This is implemented via | |
343 | symlink(2). | |
344 | ||
345 | A config_item may provide the ct_item_ops->allow_link() and | |
346 | ct_item_ops->drop_link() methods. If the ->allow_link() method exists, | |
347 | symlink(2) may be called with the config_item as the source of the link. | |
348 | These links are only allowed between configfs config_items. Any | |
349 | symlink(2) attempt outside the configfs filesystem will be denied. | |
350 | ||
351 | When symlink(2) is called, the source config_item's ->allow_link() | |
352 | method is called with itself and a target item. If the source item | |
353 | allows linking to target item, it returns 0. A source item may wish to | |
354 | reject a link if it only wants links to a certain type of object (say, | |
355 | in its own subsystem). | |
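The accept-or-reject decision made in ->allow_link() can be modeled as below. This is a userspace sketch, not kernel code: struct link_item and struct link_item_type are stand-ins that carry only the fields used here, the two type objects are hypothetical, and -EPERM is an error code chosen for illustration.

```c
#include <errno.h>

/* Stand-ins for the kernel structures; only the fields used here exist. */
struct link_item_type {
	const char *name;
};

struct link_item {
	const struct link_item_type *ci_type;
};

/* Two hypothetical item types belonging to different subsystems. */
const struct link_item_type region_type = { "region" };
const struct link_item_type other_type  = { "other" };

/*
 * allow_link model: accept only targets of the source's own type.
 * Returns 0 to permit the symlink, a negative errno to refuse it.
 */
int region_allow_link(struct link_item *src, struct link_item *target)
{
	if (target->ci_type != src->ci_type)
		return -EPERM;	/* error code chosen for illustration */
	return 0;
}
```

Comparing type pointers is one simple way to restrict links to objects of the same kind. A subsystem could apply any other test it likes before returning 0.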
356 | ||
357 | When unlink(2) is called on the symbolic link, the source item is | |
358 | notified via the ->drop_link() method. Like the ->drop_item() method, | |
359 | this is a void function and cannot return failure. The subsystem is | |
360 | responsible for responding to the change. | |
361 | ||
362 | A config_item cannot be removed while it links to any other item, nor | |
363 | can it be removed while an item links to it. Dangling symlinks are not | |
364 | allowed in configfs. | |
365 | ||
366 | [Automatically Created Subgroups] | |
367 | ||
368 | A new config_group may want to have two types of child config_items. | |
369 | While this could be codified by magic names in ->make_item(), it is much | |
370 | more explicit to have a method whereby userspace sees this divergence. | |
371 | ||
372 | Rather than have a group where some items behave differently than | |
373 | others, configfs provides a method whereby one or many subgroups are | |
374 | automatically created inside the parent at its creation. Thus, | |
48cc7ec9 | 375 | mkdir("parent") results in "parent", "parent/subgroup1", up through |
7063fbf2 JB | 376 | "parent/subgroup1". Items of type 1 can now be created in |
377 | "parent/subgroup1", and items of type N can be created in | |
378 | "parent/subgroupN". | |
379 | ||
380 | These automatic subgroups, or default groups, do not preclude other | |
381 | children of the parent group. If ct_group_ops->make_group() exists, | |
382 | other child groups can be created on the parent group directly. | |
383 | ||
384 | A configfs subsystem specifies default groups by filling in the | |
385 | NULL-terminated array default_groups on the config_group structure. | |
386 | Each group in that array is populated in the configfs tree at the same | |
387 | time as the parent group. Similarly, they are removed at the same time | |
388 | as the parent. No extra notification is provided. When a ->drop_item() | |
389 | method call notifies the subsystem the parent group is going away, it | |
390 | also means every default group child associated with that parent group. | |
391 | ||
392 | As a consequence of this, default_groups cannot be removed directly via | |
393 | rmdir(2). They also are not considered when rmdir(2) on the parent | |
394 | group is checking for children. | |
395 | ||
25985edc | 396 | [Dependent Subsystems] |
631d1feb JB | 397 | |
398 | Sometimes other drivers depend on particular configfs items. For | |
399 | example, ocfs2 mounts depend on a heartbeat region item. If that | |
400 | region item is removed with rmdir(2), the ocfs2 mount must BUG or go | |
401 | readonly. Not happy. | |
402 | ||
403 | configfs provides two additional API calls: configfs_depend_item() and | |
404 | configfs_undepend_item(). A client driver can call | |
405 | configfs_depend_item() on an existing item to tell configfs that it is | |
406 | depended on. configfs will then return -EBUSY from rmdir(2) for that | |
407 | item. When the item is no longer depended on, the client driver calls | |
408 | configfs_undepend_item() on it. | |
409 | ||
410 | These APIs cannot be called underneath any configfs callbacks, as | |
411 | they will conflict. They can block and allocate. A client driver | |
412 | probably shouldn't be calling them of its own gumption. Rather it should | |
413 | be providing an API that external subsystems call. | |
414 | ||
415 | How does this work? Imagine the ocfs2 mount process. When it mounts, | |
416 | it asks for a heartbeat region item. This is done via a call into the | |
417 | heartbeat code. Inside the heartbeat code, the region item is looked | |
418 | up. Here, the heartbeat code calls configfs_depend_item(). If it | |
419 | succeeds, then heartbeat knows the region is safe to give to ocfs2. | |
420 | If it fails, it was being torn down anyway, and heartbeat can gracefully | |
421 | pass up an error. | |
422 | ||
7063fbf2 JB | 423 | [Committable Items] |
424 | ||
425 | NOTE: Committable items are currently unimplemented. | |
426 | ||
427 | Some config_items cannot have a valid initial state. That is, no | |
428 | default values can be specified for the item's attributes such that the | |
429 | item can do its work. Userspace must configure one or more attributes, | |
430 | after which the subsystem can start whatever entity this item | |
431 | represents. | |
432 | ||
433 | Consider the FakeNBD device from above. Without a target address *and* | |
434 | a target device, the subsystem has no idea what block device to import. | |
435 | The simple example assumes that the subsystem merely waits until all the | |
436 | appropriate attributes are configured, and then connects. This will, | |
437 | indeed, work, but now every attribute store must check if the attributes | |
438 | are initialized. Every attribute store must fire off the connection if | |
439 | that condition is met. | |
440 | ||
441 | Far better would be an explicit action notifying the subsystem that the | |
442 | config_item is ready to go. More importantly, an explicit action allows | |
3f6dee9b | 443 | the subsystem to provide feedback as to whether the attributes are |
7063fbf2 JB | 444 | initialized in a way that makes sense. configfs provides this as |
445 | committable items. | |
446 | ||
447 | configfs still uses only normal filesystem operations. An item is | |
448 | committed via rename(2). The item is moved from a directory where it | |
449 | can be modified to a directory where it cannot. | |
450 | ||
451 | Any group that provides the ct_group_ops->commit_item() method has | |
452 | committable items. When this group appears in configfs, mkdir(2) will | |
453 | not work directly in the group. Instead, the group will have two | |
454 | subdirectories: "live" and "pending". The "live" directory does not | |
455 | support mkdir(2) or rmdir(2) either. It only allows rename(2). The | |
456 | "pending" directory does allow mkdir(2) and rmdir(2). An item is | |
457 | created in the "pending" directory. Its attributes can be modified at | |
458 | will. Userspace commits the item by renaming it into the "live" | |
d6bc8ac9 | 459 | directory. At this point, the subsystem receives the ->commit_item() |
7063fbf2 JB | 460 | callback. If all required attributes are filled to satisfaction, the |
461 | method returns zero and the item is moved to the "live" directory. | |
462 | ||
463 | As rmdir(2) does not work in the "live" directory, an item must be | |
464 | shut down, or "uncommitted". Again, this is done via rename(2), this | |
465 | time from the "live" directory back to the "pending" one. The subsystem | |
466 | is notified by the ct_group_ops->uncommit_object() method. | |
467 | ||
468 |