Commit | Line | Data |
---|---|---|
5db53f3e JE |
1 | |
2 | The LogFS Flash Filesystem | |
3 | ========================== | |
4 | ||
5 | Specification | |
6 | ============= | |
7 | ||
8 | Superblocks | |
9 | ----------- | |
10 | ||
11 | Two superblocks exist at the beginning and end of the filesystem. | |
12 | Each superblock is 256 Bytes large, with another 3840 Bytes reserved | |
13 | for future purposes, making a total of 4096 Bytes. | |
14 | ||
15 | Superblock locations may differ for MTD and block devices. On MTD the | |
16 | first non-bad block contains a superblock in the first 4096 Bytes and | |
17 | the last non-bad block contains a superblock in the last 4096 Bytes. | |
18 | On block devices, the first 4096 Bytes of the device contain the first | |
19 | superblock and the last aligned 4096 Byte-block contains the second | |
20 | superblock. | |
21 | ||
22 | For the most part, the superblocks can be considered read-only. They | |
23 | are written only to correct errors detected within the superblocks, | |
24 | move the journal and change the filesystem parameters through tunefs. | |
25 | As a result, the superblock does not contain any fields that require | |
26 | constant updates, like the amount of free space, etc. | |
27 | ||
28 | Segments | |
29 | -------- | |
30 | ||
31 | The space in the device is split up into equal-sized segments. | |
32 | Segments are the primary write unit of LogFS. Within each segments, | |
33 | writes happen from front (low addresses) to back (high addresses. If | |
34 | only a partial segment has been written, the segment number, the | |
35 | current position within and optionally a write buffer are stored in | |
36 | the journal. | |
37 | ||
38 | Segments are erased as a whole. Therefore Garbage Collection may be | |
39 | required to completely free a segment before doing so. | |
40 | ||
41 | Journal | |
42 | -------- | |
43 | ||
44 | The journal contains all global information about the filesystem that | |
45 | is subject to frequent change. At mount time, it has to be scanned | |
46 | for the most recent commit entry, which contains a list of pointers to | |
47 | all currently valid entries. | |
48 | ||
49 | Object Store | |
50 | ------------ | |
51 | ||
52 | All space except for the superblocks and journal is part of the object | |
53 | store. Each segment contains a segment header and a number of | |
54 | objects, each consisting of the object header and the payload. | |
55 | Objects are either inodes, directory entries (dentries), file data | |
56 | blocks or indirect blocks. | |
57 | ||
58 | Levels | |
59 | ------ | |
60 | ||
61 | Garbage collection (GC) may fail if all data is written | |
a8cd4561 | 62 | indiscriminately. One requirement of GC is that data is separated |
5db53f3e JE |
63 | roughly according to the distance between the tree root and the data. |
64 | Effectively that means all file data is on level 0, indirect blocks | |
65 | are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks, | |
66 | respectively. Inode file data is on level 6 for the inodes and 7-11 | |
67 | for indirect blocks. | |
68 | ||
69 | Each segment contains objects of a single level only. As a result, | |
a8cd4561 | 70 | each level requires its own separate segment to be open for writing. |
5db53f3e JE |
71 | |
72 | Inode File | |
73 | ---------- | |
74 | ||
75 | All inodes are stored in a special file, the inode file. Single | |
76 | exception is the inode file's inode (master inode) which for obvious | |
77 | reasons is stored in the journal instead. Instead of data blocks, the | |
78 | leaf nodes of the inode files are inodes. | |
79 | ||
80 | Aliases | |
81 | ------- | |
82 | ||
83 | Writes in LogFS are done by means of a wandering tree. A naïve | |
84 | implementation would require that for each write or a block, all | |
85 | parent blocks are written as well, since the block pointers have | |
86 | changed. Such an implementation would not be very efficient. | |
87 | ||
88 | In LogFS, the block pointer changes are cached in the journal by means | |
89 | of alias entries. Each alias consists of its logical address - inode | |
90 | number, block index, level and child number (index into block) - and | |
91 | the changed data. Any 8-byte word can be changes in this manner. | |
92 | ||
93 | Currently aliases are used for block pointers, file size, file used | |
94 | bytes and the height of an inodes indirect tree. | |
95 | ||
96 | Segment Aliases | |
97 | --------------- | |
98 | ||
99 | Related to regular aliases, these are used to handle bad blocks. | |
100 | Initially, bad blocks are handled by moving the affected segment | |
101 | content to a spare segment and noting this move in the journal with a | |
102 | segment alias, a simple (to, from) tupel. GC will later empty this | |
103 | segment and the alias can be removed again. This is used on MTD only. | |
104 | ||
105 | Vim | |
106 | --- | |
107 | ||
108 | By cleverly predicting the life time of data, it is possible to | |
a8cd4561 | 109 | separate long-living data from short-living data and thereby reduce |
5db53f3e | 110 | the GC overhead later. Each type of distinc life expectency (vim) can |
a8cd4561 | 111 | have a separate segment open for writing. Each (level, vim) tupel can |
5db53f3e JE |
112 | be open just once. If an open segment with unknown vim is encountered |
113 | at mount time, it is closed and ignored henceforth. | |
114 | ||
115 | Indirect Tree | |
116 | ------------- | |
117 | ||
118 | Inodes in LogFS are similar to FFS-style filesystems with direct and | |
119 | indirect block pointers. One difference is that LogFS uses a single | |
120 | indirect pointer that can be either a 1x, 2x, etc. indirect pointer. | |
121 | A height field in the inode defines the height of the indirect tree | |
122 | and thereby the indirection of the pointer. | |
123 | ||
124 | Another difference is the addressing of indirect blocks. In LogFS, | |
125 | the first 16 pointers in the first indirect block are left empty, | |
126 | corresponding to the 16 direct pointers in the inode. In ext2 (maybe | |
127 | others as well) the first pointer in the first indirect block | |
128 | corresponds to logical block 12, skipping the 12 direct pointers. | |
129 | So where ext2 is using arithmetic to better utilize space, LogFS keeps | |
130 | arithmetic simple and uses compression to save space. | |
131 | ||
132 | Compression | |
133 | ----------- | |
134 | ||
135 | Both file data and metadata can be compressed. Compression for file | |
136 | data can be enabled with chattr +c and disabled with chattr -c. Doing | |
137 | so has no effect on existing data, but new data will be stored | |
138 | accordingly. New inodes will inherit the compression flag of the | |
139 | parent directory. | |
140 | ||
141 | Metadata is always compressed. However, the space accounting ignores | |
142 | this and charges for the uncompressed size. Failing to do so could | |
143 | result in GC failures when, after moving some data, indirect blocks | |
144 | compress worse than previously. Even on a 100% full medium, GC may | |
145 | not consume any extra space, so the compression gains are lost space | |
146 | to the user. | |
147 | ||
148 | However, they are not lost space to the filesystem internals. By | |
149 | cheating the user for those bytes, the filesystem gained some slack | |
150 | space and GC will run less often and faster. | |
151 | ||
152 | Garbage Collection and Wear Leveling | |
153 | ------------------------------------ | |
154 | ||
155 | Garbage collection is invoked whenever the number of free segments | |
156 | falls below a threshold. The best (known) candidate is picked based | |
157 | on the least amount of valid data contained in the segment. All | |
158 | remaining valid data is copied elsewhere, thereby invalidating it. | |
159 | ||
160 | The GC code also checks for aliases and writes then back if their | |
161 | number gets too large. | |
162 | ||
163 | Wear leveling is done by occasionally picking a suboptimal segment for | |
164 | garbage collection. If a stale segments erase count is significantly | |
165 | lower than the active segments' erase counts, it will be picked. Wear | |
166 | leveling is rate limited, so it will never monopolize the device for | |
167 | more than one segment worth at a time. | |
168 | ||
169 | Values for "occasionally", "significantly lower" are compile time | |
170 | constants. | |
171 | ||
172 | Hashed directories | |
173 | ------------------ | |
174 | ||
175 | To satisfy efficient lookup(), directory entries are hashed and | |
176 | located based on the hash. In order to both support large directories | |
177 | and not be overly inefficient for small directories, several hash | |
178 | tables of increasing size are used. For each table, the hash value | |
179 | modulo the table size gives the table index. | |
180 | ||
181 | Tables sizes are chosen to limit the number of indirect blocks with a | |
182 | fully populated table to 0, 1, 2 or 3 respectively. So the first | |
183 | table contains 16 entries, the second 512-16, etc. | |
184 | ||
185 | The last table is special in several ways. First its size depends on | |
186 | the effective 32bit limit on telldir/seekdir cookies. Since logfs | |
187 | uses the upper half of the address space for indirect blocks, the size | |
188 | is limited to 2^31. Secondly the table contains hash buckets with 16 | |
189 | entries each. | |
190 | ||
191 | Using single-entry buckets would result in birthday "attacks". At | |
192 | just 2^16 used entries, hash collisions would be likely (P >= 0.5). | |
193 | My math skills are insufficient to do the combinatorics for the 17x | |
194 | collisions necessary to overflow a bucket, but testing showed that in | |
195 | 10,000 runs the lowest directory fill before a bucket overflow was | |
196 | 188,057,130 entries with an average of 315,149,915 entries. So for | |
197 | directory sizes of up to a million, bucket overflows should be | |
198 | virtually impossible under normal circumstances. | |
199 | ||
200 | With carefully chosen filenames, it is obviously possible to cause an | |
201 | overflow with just 21 entries (4 higher tables + 16 entries + 1). So | |
202 | there may be a security concern if a malicious user has write access | |
203 | to a directory. | |
204 | ||
205 | Open For Discussion | |
206 | =================== | |
207 | ||
208 | Device Address Space | |
209 | -------------------- | |
210 | ||
211 | A device address space is used for caching. Both block devices and | |
212 | MTD provide functions to either read a single page or write a segment. | |
213 | Partial segments may be written for data integrity, but where possible | |
214 | complete segments are written for performance on simple block device | |
215 | flash media. | |
216 | ||
217 | Meta Inodes | |
218 | ----------- | |
219 | ||
220 | Inodes are stored in the inode file, which is just a regular file for | |
221 | most purposes. At umount time, however, the inode file needs to | |
222 | remain open until all dirty inodes are written. So | |
223 | generic_shutdown_super() may not close this inode, but shouldn't | |
224 | complain about remaining inodes due to the inode file either. Same | |
225 | goes for mapping inode of the device address space. | |
226 | ||
227 | Currently logfs uses a hack that essentially copies part of fs/inode.c | |
228 | code over. A general solution would be preferred. | |
229 | ||
230 | Indirect block mapping | |
231 | ---------------------- | |
232 | ||
233 | With compression, the block device (or mapping inode) cannot be used | |
234 | to cache indirect blocks. Some other place is required. Currently | |
235 | logfs uses the top half of each inode's address space. The low 8TB | |
236 | (on 32bit) are filled with file data, the high 8TB are used for | |
237 | indirect blocks. | |
238 | ||
239 | One problem is that 16TB files created on 64bit systems actually have | |
240 | data in the top 8TB. But files >16TB would cause problems anyway, so | |
241 | only the limit has changed. |