Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume. This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...). The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth). This new
implementation uses a single data structure to avoid this degradation
with depth. Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are very much still in the EXPERIMENTAL state. Please
do not yet rely on them in production. But do experiment and offer us
feedback. Different use cases will have different performance
characteristics, for example due to fragmentation of the data volume.

If you find this software is not performing as expected please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata are under
development.

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly. End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of new
  virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device and a
data device. If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots). If you have
less sharing than average you'll need a larger-than-average metadata device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size, but round it up
to 2MB if the answer is smaller. If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.

The largest size supported is 16GB; if the device is larger,
a warning will be issued and the excess space will not be used.
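
As a rough worked example of the guideline above (the sizes are purely
illustrative), a 1TiB data device carved into 64KB blocks needs about
768MB of metadata:

    # Illustrative only; substitute your own sizes (both in bytes).
    data_dev_size=$((1024 * 1024 * 1024 * 1024))    # 1TiB data device
    data_block_size=$((64 * 1024))                  # 64KB data blocks
    echo $((48 * data_dev_size / data_block_size))  # 805306368 bytes, ~768MB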

Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space. (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
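
A resize might therefore look something like the following sketch (the
new length of 41943040 sectors is only an example, chosen to match a
grown data device):

    dmsetup suspend pool
    dmsetup reload pool --table "0 41943040 thin-pool $metadata_dev \
        $data_dev $data_block_size $low_water_mark"
    dmsetup resume pool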

Using an existing pool device
-----------------------------

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
            $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors. People
primarily interested in thin provisioning may want to use a value such
as 1024 (512KB). People doing lots of snapshotting may want a smaller value
such as 128 (64KB). If you are not zeroing newly-allocated data,
a larger $data_block_size in the region of 256000 (128MB) is suggested.
$data_block_size must be the same for the lifetime of the
metadata device.

$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
will be triggered which a userspace daemon should catch, allowing it to
extend the pool device. Only one such event will be sent.
Resuming a device with a new table itself triggers an event so the
userspace daemon can use this to detect a situation where a new table
already exceeds the threshold.
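
As an illustration of how these two values relate (the figures are made
up), with $data_block_size set to 128 (64KB) a $low_water_mark of 16384
blocks asks for an event once less than 1GB of the data device remains
unallocated:

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev 128 16384"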

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

To create a new thinly-provisioned volume you must send a message to an
active pool device, /dev/mapper/pool in this example.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

Here '0' is an identifier for the volume, a 24-bit number. It's up
to the caller to allocate and manage these identifiers. If the
identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

Thinly-provisioned volumes are activated using the 'thin' target:

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

The last parameter is the identifier for the thinp device.
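
Once active, the volume behaves like any other block device; as a purely
illustrative example, you could put a filesystem on it and mount it:

    mkfs.ext4 /dev/mapper/thin
    mount /dev/mapper/thin /mnt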

Internal snapshots
------------------

i) Creating an internal snapshot.

Snapshots are created with another message to the pool.

N.B. If the origin device that you wish to snapshot is active, you
must suspend it before creating the snapshot to avoid corruption.
This is NOT enforced at the moment, so please be careful!

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

Here '1' is the identifier for the volume, a 24-bit number. '0' is the
identifier for the origin device.

ii) Using an internal snapshot.

Once created, the user doesn't have to worry about any connection
between the origin and the snapshot. Indeed the snapshot is no
different from any other thinly-provisioned device and can be
snapshotted itself via the same method. It's perfectly legal to
have only one of them active, and there's no ordering requirement on
activating or removing them both. (This differs from conventional
device-mapper snapshots.)

Activate it exactly the same way as any other thinly-provisioned volume:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"

External snapshots
------------------

You can use an external _read only_ device as an origin for a
thinly-provisioned volume. Any read to an unprovisioned area of the
thin device will be passed through to the origin. Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.

i) Creating a snapshot of an external device

This is the same as creating a thin device.
You don't mention the origin at this stage.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

Append an extra parameter to the thin target specifying the origin:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

N.B. All descendants (internal snapshots) of this snapshot require the
same extra origin parameter.

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
        <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:

    skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.

    ignore_discard: Disable discard support.

    no_discard_passdown: Don't pass discards down to the underlying
        data device, but just remove the mapping.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.
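
    By way of illustration only (the device names and figures are
    invented), a table line selecting 64KB blocks, a 2GB low water mark
    and two of the feature arguments could look like:

        0 20971520 thin-pool /dev/sdc1 /dev/sdd1 128 32768 2 skip_block_zeroing no_discard_passdown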

ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>

    transaction id:
        A 64-bit number used by userspace to help synchronise with metadata
        from volume managers.

    used data blocks / total data blocks:
        If the number of free blocks drops below the pool's low water mark a
        dm event will be sent to userspace. This event is edge-triggered and
        it will occur only once after each resume so volume manager writers
        should register for the event and then check the target's status.

    held metadata root:
        The location, in sectors, of the metadata root that has been
        'held' for userspace read access. '-' indicates there is no
        held root. This feature is not yet implemented so '-' is
        always returned.
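
    A hypothetical status line, shown purely as an illustration of the
    field order (the numbers are invented), might be reported as:

        # dmsetup status pool
        0 20971520 thin-pool 1 137/4161 20480/163840 -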

iii) Messages

    create_thin <dev id>

        Create a new thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.

    create_snap <dev id> <origin id>

        Create a new snapshot of another thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.
        <origin id> is the identifier of the thinly-provisioned device
        of which the new device will be a snapshot.

    delete <dev id>

        Deletes a thin device. Irreversible.

    set_transaction_id <current id> <new id>

        Userland volume managers, such as LVM, need a way to
        synchronise their external metadata with the internal metadata
        of the pool target. The thin-pool target offers to store an
        arbitrary 64-bit transaction id and return it on the target's
        status line. To avoid races you must provide what you think
        the current transaction id is when you change it with this
        compare-and-swap message.
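
    As with create_thin and create_snap in the cookbook above, these
    messages are sent with dmsetup; for example (the identifiers are
    illustrative):

        dmsetup message /dev/mapper/pool 0 "delete 1"
        dmsetup message /dev/mapper/pool 0 "set_transaction_id 0 1"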

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id> [<external origin dev>]

    pool dev:
        the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
        the internal device identifier of the device to be
        activated.

    external origin dev:
        an optional block device outside the pool to be treated as a
        read-only snapshot origin: reads to unprovisioned areas of the
        thin target will be mapped to this device.

The pool doesn't store any size against the thin devices. If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end. If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.

If you wish to reduce the size of your thin device and potentially
regain some space then send the 'trim' message to the pool.

ii) Status

    <nr mapped sectors> <highest mapped sector>
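
    As a purely illustrative example, a thin device with a single 64KB
    (128-sector) block provisioned at the start of the device would
    report 128 mapped sectors and a highest mapped sector of 127:

        # dmsetup status thin
        0 2097152 thin 128 127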