Commit | Line | Data |
---|---|---|
d533f671 | 1 | Below is the original README file from the descore.shar package. |
1da177e4 LT |
2 | ------------------------------------------------------------------------------ |
3 | ||
4 | des - fast & portable DES encryption & decryption. | |
5 | Copyright (C) 1992 Dana L. How | |
6 | ||
7 | This program is free software; you can redistribute it and/or modify | |
8 | it under the terms of the GNU Library General Public License as published by | |
9 | the Free Software Foundation; either version 2 of the License, or | |
10 | (at your option) any later version. | |
11 | ||
12 | This program is distributed in the hope that it will be useful, | |
13 | but WITHOUT ANY WARRANTY; without even the implied warranty of | |
14 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
15 | GNU Library General Public License for more details. | |
16 | ||
17 | You should have received a copy of the GNU Library General Public License | |
18 | along with this program; if not, write to the Free Software | |
19 | Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. | |
20 | ||
21 | Author's address: how@isl.stanford.edu | |
22 | ||
23 | $Id: README,v 1.15 1992/05/20 00:25:32 how E $ | |
24 | ||
25 | ||
26 | ==>> To compile after untarring/unsharring, just `make' <<== | |
27 | ||
28 | ||
29 | This package was designed with the following goals: | |
30 | 1. Highest possible encryption/decryption PERFORMANCE. | |
31 | 2. PORTABILITY to any byte-addressable host with a 32bit unsigned C type | |
32 | 3. Plug-compatible replacement for KERBEROS's low-level routines. | |
33 | ||
34 | This second release includes a number of performance enhancements for | |
35 | register-starved machines. My discussions with Richard Outerbridge, | |
36 | 71755.204@compuserve.com, sparked a number of these enhancements. | |
37 | ||
38 | To more rapidly understand the code in this package, inspect desSmallFips.i | |
39 | (created by typing `make') BEFORE you tackle desCode.h. The latter is set | |
40 | up in a parameterized fashion so it can easily be modified by speed-daemon | |
41 | hackers in pursuit of that last microsecond. You will find it more | |
42 | illuminating to inspect one specific implementation, | |
43 | and then move on to the common abstract skeleton with this one in mind. | |
44 | ||
45 | ||
46 | performance comparison to other available des code which i could | |
47 | compile on a SPARCStation 1 (cc -O4, gcc -O2): | |
48 | ||
49 | this code (byte-order independent): | |
50 | 30us per encryption (options: 64k tables, no IP/FP) | |
51 | 33us per encryption (options: 64k tables, FIPS standard bit ordering) | |
52 | 45us per encryption (options: 2k tables, no IP/FP) | |
53 | 48us per encryption (options: 2k tables, FIPS standard bit ordering) | |
54 | 275us to set a new key (uses 1k of key tables) | |
55 | this has the quickest encryption/decryption routines i've seen. | |
56 | since i was interested in fast des filters rather than crypt(3) | |
57 | and password cracking, i haven't really bothered yet to speed up | |
58 | the key setting routine. also, i have no interest in re-implementing | |
59 | all the other junk in the mit kerberos des library, so i've just | |
60 | provided my routines with little stub interfaces so they can be | |
61 | used as drop-in replacements with mit's code or any of the mit- | |
62 | compatible packages below. (note that the first two timings above | |
63 | are highly variable because of cache effects). | |
64 | ||
65 | kerberos des replacement from australia (version 1.95): | |
66 | 53us per encryption (uses 2k of tables) | |
67 | 96us to set a new key (uses 2.25k of key tables) | |
68 | so despite the author's inclusion of some of the performance | |
69 | improvements i had suggested to him, this package's | |
70 | encryption/decryption is still slower on the sparc and 68000. | |
71 | more specifically, 19-40% slower on the 68020 and 11-35% slower | |
72 | on the sparc, depending on the compiler; | |
73 | in full gory detail (ALT_ECB is a libdes variant): | |
74 | compiler machine desCore libdes ALT_ECB slower by | |
75 | gcc 2.1 -O2 Sun 3/110 304 uS 369.5uS 461.8uS 22% | |
76 | cc -O1 Sun 3/110 336 uS 436.6uS 399.3uS 19% | |
77 | cc -O2 Sun 3/110 360 uS 532.4uS 505.1uS 40% | |
78 | cc -O4 Sun 3/110 365 uS 532.3uS 505.3uS 38% | |
79 | gcc 2.1 -O2 Sun 4/50 48 uS 53.4uS 57.5uS 11% | |
80 | cc -O2 Sun 4/50 48 uS 64.6uS 64.7uS 35% | |
81 | cc -O4 Sun 4/50 48 uS 64.7uS 64.9uS 35% | |
82 | (my time measurements are not as accurate as his). | |
83 | the comments in my first release of desCore on version 1.92: | |
84 | 68us per encryption (uses 2k of tables) | |
85 | 96us to set a new key (uses 2.25k of key tables) | |
86 | this is a very nice package which implements the most important | |
87 | of the optimizations which i did in my encryption routines. | |
88 | it's a bit weak on common low-level optimizations which is why | |
89 | it's 39%-106% slower. because he was interested in fast crypt(3) and | |
90 | password-cracking applications, he also used the same ideas to | |
91 | speed up the key-setting routines with impressive results. | |
92 | (at some point i may do the same in my package). he also implements | |
93 | the rest of the mit des library. | |
94 | (code from eay@psych.psy.uq.oz.au via comp.sources.misc) | |
95 | ||
96 | fast crypt(3) package from denmark: | |
97 | the des routine here is buried inside a loop to do the | |
98 | crypt function and i didn't feel like ripping it out and measuring | |
99 | performance. his code takes 26 sparc instructions to compute one | |
100 | des iteration; above, Quick (64k) takes 21 and Small (2k) takes 37. | |
101 | he claims to use 280k of tables but the iteration calculation seems | |
102 | to use only 128k. his tables and code are machine independent. | |
103 | (code from glad@daimi.aau.dk via alt.sources or comp.sources.misc) | |
104 | ||
105 | swedish reimplementation of Kerberos des library | |
106 | 108us per encryption (uses 34k worth of tables) | |
107 | 134us to set a new key (uses 32k of key tables to get this speed!) | |
108 | the tables used seem to be machine-independent; | |
109 | he seems to have included a lot of special case code | |
110 | so that, e.g., `long' loads can be used instead of 4 `char' loads | |
111 | when the machine's architecture allows it. | |
112 | (code obtained from chalmers.se:pub/des) | |
113 | ||
114 | crack 3.3c package from england: | |
115 | as in crypt above, the des routine is buried in a loop. it's | |
116 | also very modified for crypt. his iteration code uses 16k | |
117 | of tables and appears to be slow. | |
118 | (code obtained from aem@aber.ac.uk via alt.sources or comp.sources.misc) | |
119 | ||
120 | ``highly optimized'' and tweaked Kerberos/Athena code (byte-order dependent): | |
121 | 165us per encryption (uses 6k worth of tables) | |
122 | 478us to set a new key (uses <1k of key tables) | |
123 | so despite the comments in this code, it was possible to get | |
124 | faster code AND smaller tables, as well as making the tables | |
125 | machine-independent. | |
126 | (code obtained from prep.ai.mit.edu) | |
127 | ||
128 | UC Berkeley code (depends on machine-endedness): | |
129 | 226us per encryption | |
130 | 10848us to set a new key | |
131 | table sizes are unclear, but they don't look very small | |
132 | (code obtained from wuarchive.wustl.edu) | |
133 | ||
134 | ||
135 | motivation and history | |
136 | ||
137 | a while ago i wanted some des routines and the routines documented on sun's | |
138 | man pages either didn't exist or dumped core. i had heard of kerberos, | |
139 | and knew that it used des, so i figured i'd use its routines. but once | |
140 | i got it and looked at the code, it really set off a lot of pet peeves - | |
141 | it was too convoluted, the code had been written without taking | |
142 | advantage of the regular structure of operations such as IP, E, and FP | |
143 | (i.e. the author didn't sit down and think before coding), | |
144 | it was excessively slow, the author had attempted to clarify the code | |
145 | by adding MORE statements to make the data movement more `consistent' | |
146 | instead of simplifying his implementation and cutting down on all data | |
147 | movement (in particular, his use of L1, R1, L2, R2), and it was full of | |
148 | idiotic `tweaks' for particular machines which failed to deliver significant | |
149 | speedups but which did obfuscate everything. so i took the test data | |
150 | from his verification program and rewrote everything else. | |
151 | ||
152 | a while later i ran across the great crypt(3) package mentioned above. | |
153 | the fact that this guy was computing 2 sboxes per table lookup rather | |
154 | than one (and using a MUCH larger table in the process) emboldened me to | |
155 | do the same - it was a trivial change from which i had been scared away | |
156 | by the larger table size. in his case he didn't realize you don't need to keep | |
157 | the working data in TWO forms, one for easy use of half the sboxes in | |
158 | indexing, the other for easy use of the other half; instead you can keep | |
159 | it in the form for the first half and use a simple rotate to get the other | |
160 | half. this means i have (almost) half the data manipulation and half | |
161 | the table size. in fairness though he might be encoding something particular | |
162 | to crypt(3) in his tables - i didn't check. | |
163 | ||
164 | i'm glad that i implemented it the way i did, because this C version is | |
165 | portable (the ifdef's are performance enhancements) and it is faster | |
166 | than versions hand-written in assembly for the sparc! | |
167 | ||
168 | ||
169 | porting notes | |
170 | ||
171 | one thing i did not want to do was write an enormous mess | |
172 | which depended on endedness and other machine quirks, | |
173 | and which necessarily produced different code and different lookup tables | |
174 | for different machines. see the kerberos code for an example | |
175 | of what i didn't want to do; all their endedness-specific `optimizations' | |
176 | obfuscate the code and in the end were slower than a simpler machine | |
177 | independent approach. however, there are always some portability | |
178 | considerations of some kind, and i have included some options | |
179 | for varying numbers of register variables. | |
180 | perhaps some will still regard the result as a mess! | |
181 | ||
182 | 1) i assume everything is byte addressable, although i don't actually | |
183 | depend on the byte order, and that bytes are 8 bits. | |
184 | i assume word pointers can be freely cast to and from char pointers. | |
185 | note that 99% of C programs make these assumptions. | |
186 | i always use unsigned char's if the high bit could be set. | |
187 | 2) the typedef `word' means a 32 bit unsigned integral type. | |
188 | if `unsigned long' is not 32 bits, change the typedef in desCore.h. | |
189 | i assume sizeof(word) == 4 EVERYWHERE. | |
190 | ||
191 | the (worst-case) cost of my NOT doing endedness-specific optimizations | |
192 | in the data loading and storing code surrounding the key iterations | |
193 | is less than 12%. also, there is the added benefit that | |
194 | the input and output work areas do not need to be word-aligned. | |
195 | ||
196 | ||
197 | OPTIONAL performance optimizations | |
198 | ||
199 | 1) you should define one of `i386,' `vax,' `mc68000,' or `sparc,' | |
200 | whichever one is closest to the capabilities of your machine. | |
201 | see the start of desCode.h to see exactly what this selection implies. | |
202 | note that if you select the wrong one, the des code will still work; | |
203 | these are just performance tweaks. | |
204 | 2) for those with functional `asm' keywords: you should change the | |
205 | ROR and ROL macros to use machine rotate instructions if you have them. | |
206 | this will save 2 instructions and a temporary per use, | |
207 | or about 32 to 40 instructions per en/decryption. | |
208 | note that gcc is smart enough to translate the ROL/R macros into | |
209 | machine rotates! | |
210 | ||
211 | these optimizations are all rather persnickety, yet with them you should | |
212 | be able to get performance equal to assembly-coding, except that: | |
213 | 1) with the lack of a bit rotate operator in C, rotates have to be synthesized | |
214 | from shifts. so access to `asm' will speed things up if your machine | |
215 | has rotates, as explained above in (3) (not necessary if you use gcc). | |
216 | 2) if your machine has less than 12 32-bit registers i doubt your compiler will | |
217 | generate good code. | |
218 | `i386' tries to configure the code for a 386 by only declaring 3 registers | |
219 | (it appears that gcc can use ebx, esi and edi to hold register variables). | |
220 | however, if you like assembly coding, the 386 does have 7 32-bit registers, | |
221 | and if you use ALL of them, use `scaled by 8' address modes with displacement | |
222 | and other tricks, you can get reasonable routines for DesQuickCore... with | |
223 | about 250 instructions apiece. For DesSmall... it will help to rearrange | |
224 | des_keymap, i.e., now the sbox # is the high part of the index and | |
225 | the 6 bits of data is the low part; it helps to exchange these. | |
226 | since i have no way to conveniently test it i have not provided my | |
227 | shoehorned 386 version. note that with this release of desCore, gcc is able | |
228 | to put everything in registers(!), and generate about 370 instructions apiece | |
229 | for the DesQuickCore... routines! | |
230 | ||
231 | coding notes | |
232 | ||
233 | the en/decryption routines each use 6 necessary register variables, | |
234 | with 4 being actively used at once during the inner iterations. | |
235 | if you don't have 4 register variables get a new machine. | |
236 | up to 8 more registers are used to hold constants in some configurations. | |
237 | ||
238 | i assume that the use of a constant is more expensive than using a register: | |
239 | a) additionally, i have tried to put the larger constants in registers. | |
240 | registering priority was by the following: | |
241 | anything more than 12 bits (bad for RISC and CISC) | |
242 | greater than 127 in value (can't use movq or byte immediate on CISC) | |
243 | 9-127 (may not be able to use CISC shift immediate or add/sub quick), | |
244 | 1-8 were never registered, being the cheapest constants. | |
245 | b) the compiler may be too stupid to realize table and table+256 should | |
246 | be assigned to different constant registers and instead repetitively | |
247 | do the arithmetic, so i assign these to explicit `m' register variables | |
248 | when possible and helpful. | |
249 | ||
250 | i assume that indexing is cheaper or equivalent to auto increment/decrement, | |
251 | where the index is 7 bits unsigned or smaller. | |
252 | this assumption is reversed for 68k and vax. | |
253 | ||
254 | i assume that addresses can be cheaply formed from two registers, | |
255 | or from a register and a small constant. | |
256 | for the 68000, the `two registers and small offset' form is used sparingly. | |
257 | all index scaling is done explicitly - no hidden shifts by log2(sizeof). | |
258 | ||
259 | the code is written so that even a dumb compiler | |
260 | should never need more than one hidden temporary, | |
261 | increasing the chance that everything will fit in the registers. | |
262 | KEEP THIS MORE SUBTLE POINT IN MIND IF YOU REWRITE ANYTHING. | |
263 | (actually, there are some code fragments now which do require two temps, | |
264 | but fixing it would either break the structure of the macros or | |
265 | require declaring another temporary). | |
266 | ||
267 | ||
268 | special efficient data format | |
269 | ||
270 | bits are manipulated in this arrangement most of the time (S7 S5 S3 S1): | |
271 | 003130292827xxxx242322212019xxxx161514131211xxxx080706050403xxxx | |
272 | (the x bits are still there, i'm just emphasizing where the S boxes are). | |
273 | bits are rotated left 4 when computing S6 S4 S2 S0: | |
274 | 282726252423xxxx201918171615xxxx121110090807xxxx040302010031xxxx | |
275 | the rightmost two bits are usually cleared so the lower byte can be used | |
276 | as an index into an sbox mapping table. the next two x'd bits are set | |
277 | to various values to access different parts of the tables. | |
278 | ||
279 | ||
280 | how to use the routines | |
281 | ||
282 | datatypes: | |
283 | pointer to 8 byte area of type DesData | |
284 | used to hold keys and input/output blocks to des. | |
285 | ||
286 | pointer to 128 byte area of type DesKeys | |
287 | used to hold full 768-bit key. | |
288 | must be long-aligned. | |
289 | ||
290 | DesQuickInit() | |
291 | call this before using any other routine with `Quick' in its name. | |
292 | it generates the special 64k table these routines need. | |
293 | DesQuickDone() | |
294 | frees this table | |
295 | ||
296 | DesMethod(m, k) | |
297 | m points to a 128byte block, k points to an 8 byte des key | |
298 | which must have odd parity (or -1 is returned) and which must | |
299 | not be a (semi-)weak key (or -2 is returned). | |
300 | normally DesMethod() returns 0. | |
301 | m is filled in from k so that when one of the routines below | |
302 | is called with m, the routine will act like standard des | |
303 | en/decryption with the key k. if you use DesMethod, | |
304 | you supply a standard 56bit key; however, if you fill in | |
305 | m yourself, you will get a 768bit key - but then it won't | |
306 | be standard. it's 768bits not 1024 because the least significant | |
307 | two bits of each byte are not used. note that these two bits | |
308 | will be set to magic constants which speed up the encryption/decryption | |
309 | on some machines. and yes, each byte controls | |
310 | a specific sbox during a specific iteration. | |
311 | you really shouldn't use the 768bit format directly; i should | |
312 | provide a routine that converts 128 6-bit bytes (specified in | |
313 | S-box mapping order or something) into the right format for you. | |
314 | this would entail some byte concatenation and rotation. | |
315 | ||
316 | Des{Small|Quick}{Fips|Core}{Encrypt|Decrypt}(d, m, s) | |
317 | performs des on the 8 bytes at s into the 8 bytes at d. (d,s: char *). | |
318 | uses m as a 768bit key as explained above. | |
319 | the Encrypt|Decrypt choice is obvious. | |
320 | Fips|Core determines whether a completely standard FIPS initial | |
321 | and final permutation is done; if not, then the data is loaded | |
322 | and stored in a nonstandard bit order (FIPS w/o IP/FP). | |
323 | Fips slows down Quick by 10%, Small by 9%. | |
324 | Small|Quick determines whether you use the normal routine | |
325 | or the crazy quick one which gobbles up 64k more of memory. | |
326 | Small is 50% slower then Quick, but Quick needs 32 times as much | |
327 | memory. Quick is included for programs that do nothing but DES, | |
328 | e.g., encryption filters, etc. | |
329 | ||
330 | ||
331 | Getting it to compile on your machine | |
332 | ||
333 | there are no machine-dependencies in the code (see porting), | |
334 | except perhaps the `now()' macro in desTest.c. | |
335 | ALL generated tables are machine independent. | |
336 | you should edit the Makefile with the appropriate optimization flags | |
337 | for your compiler (MAX optimization). | |
338 | ||
339 | ||
340 | Speeding up kerberos (and/or its des library) | |
341 | ||
342 | note that i have included a kerberos-compatible interface in desUtil.c | |
343 | through the functions des_key_sched() and des_ecb_encrypt(). | |
344 | to use these with kerberos or kerberos-compatible code put desCore.a | |
345 | ahead of the kerberos-compatible library on your linker's command line. | |
346 | you should not need to #include desCore.h; just include the header | |
347 | file provided with the kerberos library. | |
348 | ||
349 | Other uses | |
350 | ||
351 | the macros in desCode.h would be very useful for putting inline des | |
352 | functions in more complicated encryption routines. |