Commit | Line | Data |
---|---|---|
e6cf5df1 | 1 | Introduction |
2 | ============ | |
3 | ||
4 | Having looked at the linux mtd/nand driver and more specific at nand_ecc.c | |
5 | I felt there was room for optimisation. I bashed the code for a few hours | |
6 | performing tricks like table lookup removing superfluous code etc. | |
7 | After that the speed was increased by 35-40%. | |
8 | Still I was not too happy as I felt there was additional room for improvement. | |
9 | ||
10 | Bad! I was hooked. | |
11 | I decided to annotate my steps in this file. Perhaps it is useful to someone | |
12 | or someone learns something from it. | |
13 | ||
14 | ||
15 | The problem | |
16 | =========== | |
17 | ||
18 | NAND flash (at least SLC one) typically has sectors of 256 bytes. | |
19 | However NAND flash is not extremely reliable so some error detection | |
20 | (and sometimes correction) is needed. | |
21 | ||
22 | This is done by means of a Hamming code. I'll try to explain it in | |
23 | laymans terms (and apologies to all the pro's in the field in case I do | |
24 | not use the right terminology, my coding theory class was almost 30 | |
25 | years ago, and I must admit it was not one of my favourites). | |
26 | ||
27 | As I said before the ecc calculation is performed on sectors of 256 | |
28 | bytes. This is done by calculating several parity bits over the rows and | |
29 | columns. The parity used is even parity which means that the parity bit = 1 | |
30 | if the data over which the parity is calculated is 1 and the parity bit = 0 | |
31 | if the data over which the parity is calculated is 0. So the total | |
32 | number of bits over the data over which the parity is calculated + the | |
33 | parity bit is even. (see wikipedia if you can't follow this). | |
34 | Parity is often calculated by means of an exclusive or operation, | |
35 | sometimes also referred to as xor. In C the operator for xor is ^ | |
36 | ||
37 | Back to ecc. | |
38 | Let's give a small figure: | |
39 | ||
40 | byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 | |
41 | byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 | |
42 | byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 | |
43 | byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 | |
44 | byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 | |
45 | .... | |
46 | byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 | |
47 | byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 | |
48 | cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 | |
49 | cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 | |
50 | cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 | |
51 | ||
52 | This figure represents a sector of 256 bytes. | |
19f59460 | 53 | cp is my abbreviation for column parity, rp for row parity. |
e6cf5df1 | 54 | |
55 | Let's start to explain column parity. | |
56 | cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. | |
57 | so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. | |
58 | Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. | |
59 | cp2 is the parity over bit0, bit1, bit4 and bit5 | |
60 | cp3 is the parity over bit2, bit3, bit6 and bit7. | |
61 | cp4 is the parity over bit0, bit1, bit2 and bit3. | |
62 | cp5 is the parity over bit4, bit5, bit6 and bit7. | |
63 | Note that each of cp0 .. cp5 is exactly one bit. | |
64 | ||
65 | Row parity actually works almost the same. | |
66 | rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) | |
67 | rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) | |
68 | rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... | |
69 | (so handle two bytes, then skip 2 bytes). | |
70 | rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) | |
71 | for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. | |
72 | so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) | |
73 | and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. | |
74 | The story now becomes quite boring. I guess you get the idea. | |
75 | rp6 covers 8 bytes then skips 8 etc | |
76 | rp7 skips 8 bytes then covers 8 etc | |
77 | rp8 covers 16 bytes then skips 16 etc | |
78 | rp9 skips 16 bytes then covers 16 etc | |
79 | rp10 covers 32 bytes then skips 32 etc | |
80 | rp11 skips 32 bytes then covers 32 etc | |
81 | rp12 covers 64 bytes then skips 64 etc | |
82 | rp13 skips 64 bytes then covers 64 etc | |
83 | rp14 covers 128 bytes then skips 128 | |
84 | rp15 skips 128 bytes then covers 128 | |
85 | ||
86 | In the end the parity bits are grouped together in three bytes as | |
87 | follows: | |
88 | ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 | |
89 | ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 | |
90 | ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 | |
91 | ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 | |
92 | ||
93 | I detected after writing this that ST application note AN1823 | |
0ea6e611 | 94 | (http://www.st.com/stonline/) gives a much |
e6cf5df1 | 95 | nicer picture.(but they use line parity as term where I use row parity) |
96 | Oh well, I'm graphically challenged, so suffer with me for a moment :-) | |
97 | And I could not reuse the ST picture anyway for copyright reasons. | |
98 | ||
99 | ||
100 | Attempt 0 | |
101 | ========= | |
102 | ||
103 | Implementing the parity calculation is pretty simple. | |
104 | In C pseudocode: | |
105 | for (i = 0; i < 256; i++) | |
106 | { | |
107 | if (i & 0x01) | |
108 | rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; | |
109 | else | |
110 | rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; | |
111 | if (i & 0x02) | |
112 | rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; | |
113 | else | |
114 | rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; | |
115 | if (i & 0x04) | |
116 | rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; | |
117 | else | |
118 | rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; | |
119 | if (i & 0x08) | |
120 | rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; | |
121 | else | |
122 | rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; | |
123 | if (i & 0x10) | |
124 | rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; | |
125 | else | |
126 | rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; | |
127 | if (i & 0x20) | |
128 | rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; | |
129 | else | |
130 | rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; | |
131 | if (i & 0x40) | |
132 | rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; | |
133 | else | |
134 | rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; | |
135 | if (i & 0x80) | |
136 | rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; | |
137 | else | |
138 | rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; | |
139 | cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; | |
140 | cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; | |
141 | cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; | |
142 | cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 | |
143 | cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 | |
144 | cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 | |
145 | } | |
146 | ||
147 | ||
148 | Analysis 0 | |
149 | ========== | |
150 | ||
151 | C does have bitwise operators but not really operators to do the above | |
152 | efficiently (and most hardware has no such instructions either). | |
153 | Therefore without implementing this it was clear that the code above was | |
154 | not going to bring me a Nobel prize :-) | |
155 | ||
156 | Fortunately the exclusive or operation is commutative, so we can combine | |
157 | the values in any order. So instead of calculating all the bits | |
158 | individually, let us try to rearrange things. | |
159 | For the column parity this is easy. We can just xor the bytes and in the | |
160 | end filter out the relevant bits. This is pretty nice as it will bring | |
161 | all cp calculation out of the if loop. | |
162 | ||
163 | Similarly we can first xor the bytes for the various rows. | |
164 | This leads to: | |
165 | ||
166 | ||
167 | Attempt 1 | |
168 | ========= | |
169 | ||
170 | const char parity[256] = { | |
171 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | |
172 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | |
173 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | |
174 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | |
175 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | |
176 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | |
177 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | |
178 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | |
179 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | |
180 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | |
181 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | |
182 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | |
183 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | |
184 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | |
185 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | |
186 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 | |
187 | }; | |
188 | ||
189 | void ecc1(const unsigned char *buf, unsigned char *code) | |
190 | { | |
191 | int i; | |
192 | const unsigned char *bp = buf; | |
193 | unsigned char cur; | |
194 | unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | |
195 | unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | |
196 | unsigned char par; | |
197 | ||
198 | par = 0; | |
199 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | |
200 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | |
201 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | |
202 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | |
203 | ||
204 | for (i = 0; i < 256; i++) | |
205 | { | |
206 | cur = *bp++; | |
207 | par ^= cur; | |
208 | if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; | |
209 | if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; | |
210 | if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; | |
211 | if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; | |
212 | if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; | |
213 | if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; | |
214 | if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; | |
215 | if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; | |
216 | } | |
217 | code[0] = | |
218 | (parity[rp7] << 7) | | |
219 | (parity[rp6] << 6) | | |
220 | (parity[rp5] << 5) | | |
221 | (parity[rp4] << 4) | | |
222 | (parity[rp3] << 3) | | |
223 | (parity[rp2] << 2) | | |
224 | (parity[rp1] << 1) | | |
225 | (parity[rp0]); | |
226 | code[1] = | |
227 | (parity[rp15] << 7) | | |
228 | (parity[rp14] << 6) | | |
229 | (parity[rp13] << 5) | | |
230 | (parity[rp12] << 4) | | |
231 | (parity[rp11] << 3) | | |
232 | (parity[rp10] << 2) | | |
233 | (parity[rp9] << 1) | | |
234 | (parity[rp8]); | |
235 | code[2] = | |
236 | (parity[par & 0xf0] << 7) | | |
237 | (parity[par & 0x0f] << 6) | | |
238 | (parity[par & 0xcc] << 5) | | |
239 | (parity[par & 0x33] << 4) | | |
240 | (parity[par & 0xaa] << 3) | | |
241 | (parity[par & 0x55] << 2); | |
242 | code[0] = ~code[0]; | |
243 | code[1] = ~code[1]; | |
244 | code[2] = ~code[2]; | |
245 | } | |
246 | ||
247 | Still pretty straightforward. The last three invert statements are there to | |
248 | give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash | |
249 | all data is 0xff, so the checksum then matches. | |
250 | ||
251 | I also introduced the parity lookup. I expected this to be the fastest | |
252 | way to calculate the parity, but I will investigate alternatives later | |
253 | on. | |
254 | ||
255 | ||
256 | Analysis 1 | |
257 | ========== | |
258 | ||
259 | The code works, but is not terribly efficient. On my system it took | |
260 | almost 4 times as much time as the linux driver code. But hey, if it was | |
261 | *that* easy this would have been done long before. | |
262 | No pain. no gain. | |
263 | ||
264 | Fortunately there is plenty of room for improvement. | |
265 | ||
266 | In step 1 we moved from bit-wise calculation to byte-wise calculation. | |
267 | However in C we can also use the unsigned long data type and virtually | |
268 | every modern microprocessor supports 32 bit operations, so why not try | |
269 | to write our code in such a way that we process data in 32 bit chunks. | |
270 | ||
271 | Of course this means some modification as the row parity is byte by | |
272 | byte. A quick analysis: | |
273 | for the column parity we use the par variable. When extending to 32 bits | |
274 | we can in the end easily calculate p0 and p1 from it. | |
275 | (because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 | |
276 | respectively) | |
277 | also rp2 and rp3 can be easily retrieved from par as rp3 covers the | |
278 | first two bytes and rp2 the last two bytes. | |
279 | ||
280 | Note that of course now the loop is executed only 64 times (256/4). | |
281 | And note that care must taken wrt byte ordering. The way bytes are | |
282 | ordered in a long is machine dependent, and might affect us. | |
283 | Anyway, if there is an issue: this code is developed on x86 (to be | |
284 | precise: a DELL PC with a D920 Intel CPU) | |
285 | ||
286 | And of course the performance might depend on alignment, but I expect | |
287 | that the I/O buffers in the nand driver are aligned properly (and | |
288 | otherwise that should be fixed to get maximum performance). | |
289 | ||
290 | Let's give it a try... | |
291 | ||
292 | ||
293 | Attempt 2 | |
294 | ========= | |
295 | ||
296 | extern const char parity[256]; | |
297 | ||
298 | void ecc2(const unsigned char *buf, unsigned char *code) | |
299 | { | |
300 | int i; | |
301 | const unsigned long *bp = (unsigned long *)buf; | |
302 | unsigned long cur; | |
303 | unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | |
304 | unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | |
305 | unsigned long par; | |
306 | ||
307 | par = 0; | |
308 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | |
309 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | |
310 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | |
311 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | |
312 | ||
313 | for (i = 0; i < 64; i++) | |
314 | { | |
315 | cur = *bp++; | |
316 | par ^= cur; | |
317 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | |
318 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | |
319 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | |
320 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | |
321 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | |
322 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | |
323 | } | |
324 | /* | |
325 | we need to adapt the code generation for the fact that rp vars are now | |
326 | long; also the column parity calculation needs to be changed. | |
327 | we'll bring rp4 to 15 back to single byte entities by shifting and | |
328 | xoring | |
329 | */ | |
330 | rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; | |
331 | rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; | |
332 | rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; | |
333 | rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; | |
334 | rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; | |
335 | rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; | |
336 | rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; | |
337 | rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; | |
338 | rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; | |
339 | rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; | |
340 | rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; | |
341 | rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; | |
342 | rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; | |
343 | rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; | |
344 | par ^= (par >> 16); | |
345 | rp1 = (par >> 8); rp1 &= 0xff; | |
346 | rp0 = (par & 0xff); | |
347 | par ^= (par >> 8); par &= 0xff; | |
348 | ||
349 | code[0] = | |
350 | (parity[rp7] << 7) | | |
351 | (parity[rp6] << 6) | | |
352 | (parity[rp5] << 5) | | |
353 | (parity[rp4] << 4) | | |
354 | (parity[rp3] << 3) | | |
355 | (parity[rp2] << 2) | | |
356 | (parity[rp1] << 1) | | |
357 | (parity[rp0]); | |
358 | code[1] = | |
359 | (parity[rp15] << 7) | | |
360 | (parity[rp14] << 6) | | |
361 | (parity[rp13] << 5) | | |
362 | (parity[rp12] << 4) | | |
363 | (parity[rp11] << 3) | | |
364 | (parity[rp10] << 2) | | |
365 | (parity[rp9] << 1) | | |
366 | (parity[rp8]); | |
367 | code[2] = | |
368 | (parity[par & 0xf0] << 7) | | |
369 | (parity[par & 0x0f] << 6) | | |
370 | (parity[par & 0xcc] << 5) | | |
371 | (parity[par & 0x33] << 4) | | |
372 | (parity[par & 0xaa] << 3) | | |
373 | (parity[par & 0x55] << 2); | |
374 | code[0] = ~code[0]; | |
375 | code[1] = ~code[1]; | |
376 | code[2] = ~code[2]; | |
377 | } | |
378 | ||
379 | The parity array is not shown any more. Note also that for these | |
380 | examples I kinda deviated from my regular programming style by allowing | |
381 | multiple statements on a line, not using { } in then and else blocks | |
382 | with only a single statement and by using operators like ^= | |
383 | ||
384 | ||
385 | Analysis 2 | |
386 | ========== | |
387 | ||
388 | The code (of course) works, and hurray: we are a little bit faster than | |
389 | the linux driver code (about 15%). But wait, don't cheer too quickly. | |
390 | THere is more to be gained. | |
391 | If we look at e.g. rp14 and rp15 we see that we either xor our data with | |
392 | rp14 or with rp15. However we also have par which goes over all data. | |
393 | This means there is no need to calculate rp14 as it can be calculated from | |
394 | rp15 through rp14 = par ^ rp15; | |
395 | (or if desired we can avoid calculating rp15 and calculate it from | |
396 | rp14). That is why some places refer to inverse parity. | |
397 | Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. | |
398 | Effectively this means we can eliminate the else clause from the if | |
399 | statements. Also we can optimise the calculation in the end a little bit | |
400 | by going from long to byte first. Actually we can even avoid the table | |
401 | lookups | |
402 | ||
403 | Attempt 3 | |
404 | ========= | |
405 | ||
406 | Odd replaced: | |
407 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | |
408 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | |
409 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | |
410 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | |
411 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | |
412 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | |
413 | with | |
414 | if (i & 0x01) rp5 ^= cur; | |
415 | if (i & 0x02) rp7 ^= cur; | |
416 | if (i & 0x04) rp9 ^= cur; | |
417 | if (i & 0x08) rp11 ^= cur; | |
418 | if (i & 0x10) rp13 ^= cur; | |
419 | if (i & 0x20) rp15 ^= cur; | |
420 | ||
421 | and outside the loop added: | |
422 | rp4 = par ^ rp5; | |
423 | rp6 = par ^ rp7; | |
424 | rp8 = par ^ rp9; | |
425 | rp10 = par ^ rp11; | |
426 | rp12 = par ^ rp13; | |
427 | rp14 = par ^ rp15; | |
428 | ||
429 | And after that the code takes about 30% more time, although the number of | |
430 | statements is reduced. This is also reflected in the assembly code. | |
431 | ||
432 | ||
433 | Analysis 3 | |
434 | ========== | |
435 | ||
436 | Very weird. Guess it has to do with caching or instruction parallellism | |
437 | or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting | |
438 | observation was that this one is only 30% slower (according to time) | |
439 | executing the code as my 3Ghz D920 processor. | |
440 | ||
441 | Well, it was expected not to be easy so maybe instead move to a | |
442 | different track: let's move back to the code from attempt2 and do some | |
443 | loop unrolling. This will eliminate a few if statements. I'll try | |
444 | different amounts of unrolling to see what works best. | |
445 | ||
446 | ||
447 | Attempt 4 | |
448 | ========= | |
449 | ||
450 | Unrolled the loop 1, 2, 3 and 4 times. | |
451 | For 4 the code starts with: | |
452 | ||
453 | for (i = 0; i < 4; i++) | |
454 | { | |
455 | cur = *bp++; | |
456 | par ^= cur; | |
457 | rp4 ^= cur; | |
458 | rp6 ^= cur; | |
459 | rp8 ^= cur; | |
460 | rp10 ^= cur; | |
461 | if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; | |
462 | if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; | |
463 | cur = *bp++; | |
464 | par ^= cur; | |
465 | rp5 ^= cur; | |
466 | rp6 ^= cur; | |
467 | ... | |
468 | ||
469 | ||
470 | Analysis 4 | |
471 | ========== | |
472 | ||
473 | Unrolling once gains about 15% | |
474 | Unrolling twice keeps the gain at about 15% | |
475 | Unrolling three times gives a gain of 30% compared to attempt 2. | |
476 | Unrolling four times gives a marginal improvement compared to unrolling | |
477 | three times. | |
478 | ||
479 | I decided to proceed with a four time unrolled loop anyway. It was my gut | |
480 | feeling that in the next steps I would obtain additional gain from it. | |
481 | ||
482 | The next step was triggered by the fact that par contains the xor of all | |
483 | bytes and rp4 and rp5 each contain the xor of half of the bytes. | |
484 | So in effect par = rp4 ^ rp5. But as xor is commutative we can also say | |
485 | that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can | |
486 | eliminate rp5 (or rp4, but I already foresaw another optimisation). | |
487 | The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. | |
488 | ||
489 | ||
490 | Attempt 5 | |
491 | ========= | |
492 | ||
493 | Effectively so all odd digit rp assignments in the loop were removed. | |
494 | This included the else clause of the if statements. | |
495 | Of course after the loop we need to correct things by adding code like: | |
496 | rp5 = par ^ rp4; | |
497 | Also the initial assignments (rp5 = 0; etc) could be removed. | |
498 | Along the line I also removed the initialisation of rp0/1/2/3. | |
499 | ||
500 | ||
501 | Analysis 5 | |
502 | ========== | |
503 | ||
504 | Measurements showed this was a good move. The run-time roughly halved | |
505 | compared with attempt 4 with 4 times unrolled, and we only require 1/3rd | |
506 | of the processor time compared to the current code in the linux kernel. | |
507 | ||
508 | However, still I thought there was more. I didn't like all the if | |
509 | statements. Why not keep a running parity and only keep the last if | |
510 | statement. Time for yet another version! | |
511 | ||
512 | ||
513 | Attempt 6 | |
514 | ========= | |
515 | ||
516 | THe code within the for loop was changed to: | |
517 | ||
518 | for (i = 0; i < 4; i++) | |
519 | { | |
520 | cur = *bp++; tmppar = cur; rp4 ^= cur; | |
521 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | |
522 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | |
523 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | |
524 | ||
525 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | |
526 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | |
527 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | |
528 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | |
529 | ||
530 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; | |
531 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; | |
532 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; | |
533 | cur = *bp++; tmppar ^= cur; rp8 ^= cur; | |
534 | ||
535 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | |
536 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | |
537 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | |
538 | cur = *bp++; tmppar ^= cur; | |
539 | ||
540 | par ^= tmppar; | |
541 | if ((i & 0x1) == 0) rp12 ^= tmppar; | |
542 | if ((i & 0x2) == 0) rp14 ^= tmppar; | |
543 | } | |
544 | ||
545 | As you can see tmppar is used to accumulate the parity within a for | |
546 | iteration. In the last 3 statements is is added to par and, if needed, | |
547 | to rp12 and rp14. | |
548 | ||
549 | While making the changes I also found that I could exploit that tmppar | |
550 | contains the running parity for this iteration. So instead of having: | |
551 | rp4 ^= cur; rp6 = cur; | |
552 | I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next | |
553 | statement. A similar change was done for rp8 and rp10 | |
554 | ||
555 | ||
556 | Analysis 6 | |
557 | ========== | |
558 | ||
559 | Measuring this code again showed big gain. When executing the original | |
560 | linux code 1 million times, this took about 1 second on my system. | |
561 | (using time to measure the performance). After this iteration I was back | |
562 | to 0.075 sec. Actually I had to decide to start measuring over 10 | |
19f59460 | 563 | million iterations in order not to lose too much accuracy. This one |
e6cf5df1 | 564 | definitely seemed to be the jackpot! |
565 | ||
566 | There is a little bit more room for improvement though. There are three | |
567 | places with statements: | |
568 | rp4 ^= cur; rp6 ^= cur; | |
569 | It seems more efficient to also maintain a variable rp4_6 in the while | |
570 | loop; This eliminates 3 statements per loop. Of course after the loop we | |
571 | need to correct by adding: | |
572 | rp4 ^= rp4_6; | |
573 | rp6 ^= rp4_6 | |
19f59460 ML |
574 | Furthermore there are 4 sequential assignments to rp8. This can be |
575 | encoded slightly more efficiently by saving tmppar before those 4 lines | |
e6cf5df1 | 576 | and later do rp8 = rp8 ^ tmppar ^ notrp8; |
577 | (where notrp8 is the value of rp8 before those 4 lines). | |
578 | Again a use of the commutative property of xor. | |
579 | Time for a new test! | |
580 | ||
581 | ||
582 | Attempt 7 | |
583 | ========= | |
584 | ||
585 | The new code now looks like: | |
586 | ||
587 | for (i = 0; i < 4; i++) | |
588 | { | |
589 | cur = *bp++; tmppar = cur; rp4 ^= cur; | |
590 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | |
591 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | |
592 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | |
593 | ||
594 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | |
595 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | |
596 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | |
597 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | |
598 | ||
599 | notrp8 = tmppar; | |
600 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | |
601 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | |
602 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | |
603 | cur = *bp++; tmppar ^= cur; | |
604 | rp8 = rp8 ^ tmppar ^ notrp8; | |
605 | ||
606 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | |
607 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | |
608 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | |
609 | cur = *bp++; tmppar ^= cur; | |
610 | ||
611 | par ^= tmppar; | |
612 | if ((i & 0x1) == 0) rp12 ^= tmppar; | |
613 | if ((i & 0x2) == 0) rp14 ^= tmppar; | |
614 | } | |
615 | rp4 ^= rp4_6; | |
616 | rp6 ^= rp4_6; | |
617 | ||
618 | ||
619 | Not a big change, but every penny counts :-) | |
620 | ||
621 | ||
622 | Analysis 7 | |
623 | ========== | |
624 | ||
19f59460 | 625 | Actually this made things worse. Not very much, but I don't want to move |
e6cf5df1 | 626 | into the wrong direction. Maybe something to investigate later. Could |
627 | have to do with caching again. | |
628 | ||
629 | Guess that is what there is to win within the loop. Maybe unrolling one | |
630 | more time will help. I'll keep the optimisations from 7 for now. | |
631 | ||
632 | ||
633 | Attempt 8 | |
634 | ========= | |
635 | ||
636 | Unrolled the loop one more time. | |
637 | ||
638 | ||
639 | Analysis 8 | |
640 | ========== | |
641 | ||
642 | This makes things worse. Let's stick with attempt 6 and continue from there. | |
643 | Although it seems that the code within the loop cannot be optimised | |
644 | further there is still room to optimize the generation of the ecc codes. | |
19f59460 | 645 | We can simply calculate the total parity. If this is 0 then rp4 = rp5 |
e6cf5df1 | 646 | etc. If the parity is 1, then rp4 = !rp5; |
647 | But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits | |
648 | in the result byte and then do something like | |
649 | code[0] |= (code[0] << 1); | |
650 | Lets test this. | |
651 | ||
652 | ||
653 | Attempt 9 | |
654 | ========= | |
655 | ||
656 | Changed the code but again this slightly degrades performance. Tried all | |
657 | kind of other things, like having dedicated parity arrays to avoid the | |
658 | shift after parity[rp7] << 7; No gain. | |
659 | Change the lookup using the parity array by using shift operators (e.g. | |
660 | replace parity[rp7] << 7 with: | |
661 | rp7 ^= (rp7 << 4); | |
662 | rp7 ^= (rp7 << 2); | |
663 | rp7 ^= (rp7 << 1); | |
664 | rp7 &= 0x80; | |
665 | No gain. | |
666 | ||
667 | The only marginal change was inverting the parity bits, so we can remove | |
668 | the last three invert statements. | |
669 | ||
670 | Ah well, pity this does not deliver more. Then again 10 million | |
671 | iterations using the linux driver code takes between 13 and 13.5 | |
672 | seconds, whereas my code now takes about 0.73 seconds for those 10 | |
673 | million iterations. So basically I've improved the performance by a | |
674 | factor 18 on my system. Not that bad. Of course on different hardware | |
675 | you will get different results. No warranties! | |
676 | ||
677 | But of course there is no such thing as a free lunch. The codesize almost | |
678 | tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. | |
679 | ||
680 | ||
681 | Correcting errors | |
682 | ================= | |
683 | ||
684 | For correcting errors I again used the ST application note as a starter, | |
685 | but I also peeked at the existing code. | |
686 | The algorithm itself is pretty straightforward. Just xor the given and | |
687 | the calculated ecc. If all bytes are 0 there is no problem. If 11 bits | |
688 | are 1 we have one correctable bit error. If there is 1 bit 1, we have an | |
689 | error in the given ecc code. | |
690 | It proved to be fastest to do some table lookups. Performance gain | |
691 | introduced by this is about a factor 2 on my system when a repair had to | |
692 | be done, and 1% or so if no repair had to be done. | |
693 | Code size increased from 330 bytes to 686 bytes for this function. | |
694 | (gcc 4.2, -O3) | |
695 | ||
696 | ||
697 | Conclusion | |
698 | ========== | |
699 | ||
700 | The gain when calculating the ecc is tremendous. Om my development hardware | |
701 | a speedup of a factor of 18 for ecc calculation was achieved. On a test on an | |
702 | embedded system with a MIPS core a factor 7 was obtained. | |
703 | On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor | |
704 | 5 (big endian mode, gcc 4.1.2, -O3) | |
705 | For correction not much gain could be obtained (as bitflips are rare). Then | |
706 | again there are also much less cycles spent there. | |
707 | ||
708 | It seems there is not much more gain possible in this, at least when | |
709 | programmed in C. Of course it might be possible to squeeze something more | |
710 | out of it with an assembler program, but due to pipeline behaviour etc | |
711 | this is very tricky (at least for intel hw). | |
712 | ||
713 | Author: Frans Meulenbroeks | |
714 | Copyright (C) 2008 Koninklijke Philips Electronics NV. |