Commit | Line | Data |
---|---|---|
9f803664 KC |
1 | # Kernel Self-Protection |
2 | ||
3 | Kernel self-protection is the design and implementation of systems and | |
4 | structures within the Linux kernel to protect against security flaws in | |
5 | the kernel itself. This covers a wide range of issues, including removing | |
6 | entire classes of bugs, blocking security flaw exploitation methods, | |
7 | and actively detecting attack attempts. Not all topics are explored in | |
8 | this document, but it should serve as a reasonable starting point and | |
9 | answer any frequently asked questions. (Patches welcome, of course!) | |
10 | ||
11 | In the worst-case scenario, we assume an unprivileged local attacker | |
12 | has arbitrary read and write access to the kernel's memory. In many | |
13 | cases, bugs being exploited will not provide this level of access, | |
14 | but with systems in place that defend against the worst case we'll | |
15 | cover the more limited cases as well. A higher bar, and one that should | |
16 | still be kept in mind, is protecting the kernel against a _privileged_ | |
17 | local attacker, since the root user has access to a vastly increased | |
18 | attack surface. (Especially when they have the ability to load arbitrary | |
19 | kernel modules.) | |
20 | ||
21 | The goals for successful self-protection systems would be that they | |
22 | are effective, on by default, require no opt-in by developers, have no | |
23 | performance impact, do not impede kernel debugging, and have tests. It | |
24 | is uncommon that all these goals can be met, but it is worth explicitly | |
25 | mentioning them, since these aspects need to be explored, dealt with, | |
26 | and/or accepted. | |
27 | ||
28 | ||
29 | ## Attack Surface Reduction | |
30 | ||
31 | The most fundamental defense against security exploits is to reduce the | |
32 | areas of the kernel that can be used to redirect execution. This includes | |
33 | limiting the exposed APIs available to userspace, making in-kernel | |
34 | APIs hard to use incorrectly, minimizing the areas of writable kernel | |
35 | memory, etc. | |
36 | ||
37 | ### Strict kernel memory permissions | |
38 | ||
39 | When all of kernel memory is writable, it becomes trivial for attacks | |
40 | to redirect execution flow. To reduce the availability of these targets | |
41 | the kernel needs to protect its memory with a tight set of permissions. | |
42 | ||
43 | #### Executable code and read-only data must not be writable | |
44 | ||
45 | Any areas of the kernel with executable memory must not be writable. | |
46 | While this obviously includes the kernel text itself, we must consider | |
47 | all additional places too: kernel modules, JIT memory, etc. (There are | |
48 | temporary exceptions to this rule to support things like instruction | |
49 | alternatives, breakpoints, kprobes, etc. If these must exist in a | |
50 | kernel, they are implemented in a way where the memory is temporarily | |
51 | made writable during the update, and then returned to the original | |
52 | permissions.) | |
53 | ||
54 | In support of this are (the poorly named) CONFIG_DEBUG_RODATA and | |
55 | CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not | |
56 | writable, data is not executable, and read-only data is neither writable | |
57 | nor executable. | |
58 | ||
59 | #### Function pointers and sensitive variables must not be writable | |
60 | ||
61 | Vast areas of kernel memory contain function pointers that are looked | |
62 | up by the kernel and used to continue execution (e.g. descriptor/vector | |
63 | tables, file/network/etc operation structures, etc). The number of these | |
64 | variables must be reduced to an absolute minimum. | |
65 | ||
66 | Many such variables can be made read-only by setting them "const" | |
67 | so that they live in the .rodata section instead of the .data section | |
68 | of the kernel, gaining the protection of the kernel's strict memory | |
69 | permissions as described above. | |
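
As a minimal sketch of the "const" approach, the following userspace C fragment declares an operations table as `static const`. The names `sketch_ops`, `sketch_open`, `sketch_close`, and `sketch_use` are hypothetical and chosen only for illustration; they are not kernel interfaces.

```c
/* Hypothetical operations table, loosely modeled on kernel "ops" structures. */
struct sketch_ops {
    int (*open)(int id);
    int (*close)(int id);
};

static int sketch_open(int id)  { return id >= 0 ? 0 : -1; }
static int sketch_close(int id) { return id >= 0 ? 0 : -1; }

/*
 * Declaring the table const lets the toolchain place it in .rodata,
 * so the function pointers share the read-only protection of that section.
 */
static const struct sketch_ops sketch_default_ops = {
    .open  = sketch_open,
    .close = sketch_close,
};

/* Callers take a pointer-to-const, so writes through it fail to compile. */
static int sketch_use(const struct sketch_ops *ops, int id)
{
    if (ops->open(id) != 0)
        return -1;
    return ops->close(id);
}
```

An assignment such as `sketch_default_ops.open = NULL;` is rejected at compile time, and a write through a cast pointer would fault wherever .rodata is actually mapped read-only.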
70 | ||
71 | For variables that are initialized once at __init time, these can | |
72 | be marked with the (new and under development) __ro_after_init | |
73 | attribute. | |
74 | ||
75 | What remains are variables that are updated rarely (e.g. GDT). These | |
76 | will need another infrastructure (similar to the temporary exceptions | |
77 | made to kernel code mentioned above) that allow them to spend the rest | |
78 | of their lifetime read-only. (For example, when being updated, only the | |
79 | CPU thread performing the update would be given uninterruptible write | |
80 | access to the memory.) | |
81 | ||
82 | #### Segregation of kernel memory from userspace memory | |
83 | ||
84 | The kernel must never execute userspace memory. The kernel must also never | |
85 | access userspace memory without explicit expectation to do so. These | |
86 | rules can be enforced either by support of hardware-based restrictions | |
87 | (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains). | |
88 | By blocking userspace memory in this way, execution and data parsing | |
89 | cannot be passed to trivially-controlled userspace memory, forcing | |
90 | attacks to operate entirely in kernel memory. | |
91 | ||
92 | ### Reduced access to syscalls | |
93 | ||
94 | One trivial way to eliminate many syscalls for 64-bit systems is building | |
95 | without CONFIG_COMPAT. However, this is rarely a feasible scenario. | |
96 | ||
97 | The "seccomp" system provides an opt-in feature made available to | |
98 | userspace, which provides a way to reduce the number of kernel entry | |
99 | points available to a running process. This limits the breadth of kernel | |
100 | code that can be reached, possibly reducing the availability of a given | |
101 | bug to an attack. | |
102 | ||
103 | An area of improvement would be creating viable ways to keep access to | |
104 | things like compat, user namespaces, BPF creation, and perf limited only | |
105 | to trusted processes. This would keep the scope of kernel entry points | |
106 | restricted to the more regular set normally available to unprivileged | |
107 | userspace. | |
108 | ||
109 | ### Restricting access to kernel modules | |
110 | ||
111 | The kernel should never allow an unprivileged user the ability to | |
112 | load specific kernel modules, since that would provide a facility to | |
113 | unexpectedly extend the available attack surface. (The on-demand loading | |
114 | of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is | |
115 | considered "expected" here, though additional consideration should be | |
116 | given even to these.) For example, loading a filesystem module via an | |
117 | unprivileged socket API is nonsense: only the root or physically local | |
118 | user should trigger filesystem module loading. (And even this can be up | |
119 | for debate in some scenarios.) | |
120 | ||
121 | To protect against even privileged users, systems may need to either | |
122 | disable module loading entirely (e.g. monolithic kernel builds or | |
123 | modules_disabled sysctl), or provide signed modules (e.g. | |
124 | CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having | |
125 | root load arbitrary kernel code via the module loader interface. | |
126 | ||
127 | ||
128 | ## Memory integrity | |
129 | ||
130 | There are many memory structures in the kernel that are regularly abused | |
131 | to gain execution control during an attack. By far the most commonly | |
132 | understood is that of the stack buffer overflow in which the return | |
133 | address stored on the stack is overwritten. Many other examples of this | |
134 | kind of attack exist, and protections exist to defend against them. | |
135 | ||
136 | ### Stack buffer overflow | |
137 | ||
138 | The classic stack buffer overflow involves writing past the expected end | |
139 | of a variable stored on the stack, ultimately writing a controlled value | |
140 | to the stack frame's stored return address. The most widely used defense | |
141 | is the presence of a stack canary between the stack variables and the | |
142 | return address (CONFIG_CC_STACKPROTECTOR), which is verified just before | |
143 | the function returns. Other defenses include things like shadow stacks. | |
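
The canary check can be illustrated with a userspace simulation. This is only a sketch: a real stack protector is emitted by the compiler, while here a hypothetical `struct sketch_frame` stands in for a stack frame and the secret is passed in by the caller. All `sketch_` names are assumptions made for the example.

```c
#include <stdbool.h>
#include <string.h>

/* Simulated frame: a local buffer followed by the canary word. */
struct sketch_frame {
    char buf[16];
    unsigned long canary;
};

/* "Function entry": clear the buffer and store the secret after it. */
static void sketch_frame_enter(struct sketch_frame *f, unsigned long secret)
{
    memset(f->buf, 0, sizeof(f->buf));
    f->canary = secret;
}

/*
 * Deliberately unchecked copy that treats the frame as raw bytes.
 * The caller must keep len <= sizeof(struct sketch_frame).
 */
static void sketch_unchecked_copy(struct sketch_frame *f, const char *src,
                                  size_t len)
{
    memcpy((char *)f, src, len);
}

/* "Function exit": the frame is trusted only if the canary is unchanged. */
static bool sketch_frame_intact(const struct sketch_frame *f,
                                unsigned long secret)
{
    return f->canary == secret;
}
```

A copy of 16 bytes leaves the canary alone, while a copy of 17 bytes clobbers its first byte and makes `sketch_frame_intact()` return false, which is the point at which a real kernel would refuse to return.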
144 | ||
145 | ### Stack depth overflow | |
146 | ||
147 | A less well understood attack is using a bug that triggers the | |
148 | kernel to consume stack memory with deep function calls or large stack | |
149 | allocations. With this attack it is possible to write beyond the end of | |
150 | the kernel's preallocated stack space and into sensitive structures. Two | |
151 | important changes need to be made for better protections: moving the | |
152 | sensitive thread_info structure elsewhere, and adding a faulting memory | |
153 | hole at the bottom of the stack to catch these overflows. | |
154 | ||
155 | ### Heap memory integrity | |
156 | ||
157 | The structures used to track heap free lists can be sanity-checked during | |
158 | allocation and freeing to make sure they aren't being used to manipulate | |
159 | other memory areas. | |
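
One way to picture such a sanity check is a doubly linked free list whose removal routine verifies that the neighbours still point back at the entry before unlinking it. The sketch below is a userspace illustration under that assumption; `sketch_node` and the `sketch_list_*` helpers are hypothetical names, not the kernel's list API.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_node {
    struct sketch_node *next, *prev;
};

/* An empty circular list points at itself. */
static void sketch_list_init(struct sketch_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert n directly after head. */
static void sketch_list_add(struct sketch_node *n, struct sketch_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Unlink n only if both neighbours still point back at it. A corrupted
 * next/prev pointer is reported instead of being followed for a write.
 */
static bool sketch_list_del_checked(struct sketch_node *n)
{
    if (n->next == NULL || n->prev == NULL)
        return false;               /* already removed */
    if (n->prev->next != n || n->next->prev != n)
        return false;               /* list corruption detected */
    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->next = NULL;
    n->prev = NULL;
    return true;
}
```

Without the check, an attacker who controls `n->next` and `n->prev` gets two pointer writes to locations of their choosing during the unlink.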
160 | ||
161 | ### Counter integrity | |
162 | ||
163 | Many places in the kernel use atomic counters to track object references | |
164 | or perform similar lifetime management. When these counters can be made | |
165 | to wrap (over or under) this traditionally exposes a use-after-free | |
166 | flaw. By trapping atomic wrapping, this class of bug vanishes. | |
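
A minimal userspace sketch of a counter that cannot wrap is shown below, using C11 atomics. It saturates and refuses further changes rather than trapping, which is one possible policy and an assumption of this example; `sketch_ref` and its helpers are hypothetical names.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint refs;
} sketch_ref;

/*
 * Take a reference. Fails if the object is already dead (0) or the
 * counter is saturated (UINT_MAX), so it can never wrap back to 0.
 */
static bool sketch_ref_inc(sketch_ref *r)
{
    unsigned int old = atomic_load(&r->refs);

    do {
        if (old == 0 || old == UINT_MAX)
            return false;
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
    return true;
}

/*
 * Drop a reference. Returns true only for the caller that moved the
 * counter from 1 to 0. An underflow (0) or a saturated counter is refused.
 */
static bool sketch_ref_dec_and_test(sketch_ref *r)
{
    unsigned int old = atomic_load(&r->refs);

    do {
        if (old == 0 || old == UINT_MAX)
            return false;
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));
    return old == 1;
}
```

A saturated counter leaks the object instead of freeing it early, trading a possible use-after-free for a memory leak.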
167 | ||
168 | ### Size calculation overflow detection | |
169 | ||
170 | Similar to counter overflow, integer overflows (usually size calculations) | |
171 | need to be detected at runtime to kill this class of bug, which | |
172 | traditionally leads to being able to write past the end of kernel buffers. | |
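
For size calculations, the runtime check can be as simple as refusing a multiplication whose result would not fit in `size_t`. The following is a userspace sketch; `sketch_mul_overflows` and `sketch_alloc_array` are hypothetical helpers, and `malloc()` stands in for a kernel allocator.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Compute n * size into *out. Returns true (and leaves *out untouched)
 * if the product would exceed SIZE_MAX.
 */
static bool sketch_mul_overflows(size_t n, size_t size, size_t *out)
{
    if (size != 0 && n > SIZE_MAX / size)
        return true;
    *out = n * size;
    return false;
}

/* Allocate an array of n elements, failing cleanly on overflow. */
static void *sketch_alloc_array(size_t n, size_t size)
{
    size_t bytes;

    if (sketch_mul_overflows(n, size, &bytes))
        return NULL;
    return malloc(bytes);
}
```

Without the check, a wrapped product yields a buffer much smaller than the caller expects, and the subsequent writes run past its end.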
173 | ||
174 | ||
175 | ## Statistical defenses | |
176 | ||
177 | While many protections can be considered deterministic (e.g. read-only | |
178 | memory cannot be written to), some protections provide only statistical | |
179 | defense, in that an attack must gather enough information about a | |
180 | running system to overcome the defense. While not perfect, these do | |
181 | provide meaningful defenses. | |
182 | ||
183 | ### Canaries, blinding, and other secrets | |
184 | ||
185 | It should be noted that things like the stack canary discussed earlier | |
186 | are technically statistical defenses, since they rely on a (leakable) | |
187 | secret value. | |
188 | ||
189 | Blinding literal values for things like JITs, where the executable | |
190 | contents may be partially under the control of userspace, needs a similar | |
191 | secret value. | |
192 | ||
193 | It is critical that the secret values used must be separate (e.g. | |
194 | different canary per stack) and high entropy (e.g. is the RNG actually | |
195 | working?) in order to maximize their success. | |
196 | ||
197 | ### Kernel Address Space Layout Randomization (KASLR) | |
198 | ||
199 | Since the location of kernel memory is almost always instrumental in | |
200 | mounting a successful attack, making the location non-deterministic | |
201 | raises the difficulty of an exploit. (Note that this in turn makes | |
202 | the value of leaks higher, since they may be used to discover desired | |
203 | memory locations.) | |
204 | ||
205 | #### Text and module base | |
206 | ||
207 | By relocating the physical and virtual base address of the kernel at | |
208 | boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be | |
209 | frustrated. Additionally, offsetting the module loading base address | |
210 | means that even systems that load the same set of modules in the same | |
211 | order every boot will not share a common base address with the rest of | |
212 | the kernel text. | |
213 | ||
214 | #### Stack base | |
215 | ||
216 | If the base address of the kernel stack is not the same between processes, | |
217 | or even not the same between syscalls, targets on or beyond the stack | |
218 | become more difficult to locate. | |
219 | ||
220 | #### Dynamic memory base | |
221 | ||
222 | Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up | |
223 | being relatively deterministic in layout due to the order of early-boot | |
224 | initializations. If the base address of these areas is not the same | |
225 | between boots, targeting them is frustrated, requiring a leak specific | |
226 | to the region. | |
227 | ||
228 | ||
229 | ## Preventing Leaks | |
230 | ||
231 | Since the locations of sensitive structures are the primary target for | |
232 | attacks, it is important to defend against leaks of both kernel memory | |
233 | addresses and kernel memory contents (since they may contain kernel | |
234 | addresses or other sensitive things like canary values). | |
235 | ||
236 | ### Unique identifiers | |
237 | ||
238 | Kernel memory addresses must never be used as identifiers exposed to | |
239 | userspace. Instead, use an atomic counter, an idr, or similar unique | |
240 | identifier. | |
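
A sketch of the atomic counter option, written as userspace C with C11 atomics: each object receives an identifier from a monotonically increasing counter, so nothing derived from its address needs to be exposed. The names `sketch_object`, `sketch_next_id`, and `sketch_object_init` are hypothetical.

```c
#include <stdatomic.h>

struct sketch_object {
    unsigned long id;   /* safe to expose: carries no address information */
    int payload;
};

/* Zero-initialized by static storage duration; the first id handed out is 1. */
static atomic_ulong sketch_next_id;

static void sketch_object_init(struct sketch_object *obj, int payload)
{
    obj->id = atomic_fetch_add(&sketch_next_id, 1) + 1;
    obj->payload = payload;
}
```

The identifier reveals only creation order. If even that is considered sensitive, a different allocation scheme would be needed.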
241 | ||
242 | ### Memory initialization | |
243 | ||
244 | Memory copied to userspace must always be fully initialized. If not | |
245 | explicitly memset(), this will require changes to the compiler to make | |
246 | sure structure holes are cleared. | |
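
The explicit memset() case can be sketched as follows. The hypothetical `struct sketch_reply` has padding between its two members on common ABIs, and assigning the members alone would leave those bytes holding whatever was previously in memory.

```c
#include <stddef.h>
#include <string.h>

/* On common ABIs there are padding bytes between tag and value. */
struct sketch_reply {
    char tag;
    long value;
};

/*
 * Clear the whole structure, padding included, before filling in the
 * members, so that a later copy to userspace carries no stale bytes.
 */
static void sketch_fill_reply(struct sketch_reply *out, char tag, long value)
{
    memset(out, 0, sizeof(*out));
    out->tag = tag;
    out->value = value;
}
```

The structure would then be handed to the copy routine as a whole; the important part is that the memset() covers `sizeof(*out)` and not just the named members.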
247 | ||
248 | ### Memory poisoning | |
249 | ||
250 | When releasing memory, it is best to poison the contents (clear stack on | |
251 | syscall return, wipe heap memory on a free), to avoid reuse attacks that | |
252 | rely on the old contents of memory. This frustrates many uninitialized | |
253 | variable attacks, stack info leaks, heap info leaks, and use-after-free | |
254 | attacks. | |
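
A userspace sketch of poisoning on free is shown below. The poison byte is an arbitrary value picked for the example, `free()` stands in for a kernel allocator, and the `sketch_` names are hypothetical. The volatile pointer is there so the compiler does not discard the writes as dead stores ahead of the free.

```c
#include <stdlib.h>

#define SKETCH_POISON 0x6b  /* arbitrary illustrative poison byte */

/* Overwrite n bytes at p with the poison pattern. */
static void sketch_poison(void *p, size_t n)
{
    volatile unsigned char *bytes = p;

    while (n--)
        *bytes++ = SKETCH_POISON;
}

/* Poison the buffer before handing it back to the allocator. */
static void sketch_free_poisoned(void *p, size_t n)
{
    if (p == NULL)
        return;
    sketch_poison(p, n);
    free(p);
}
```

Any later reuse of the memory then sees the poison pattern rather than the previous owner's pointers or secrets.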
255 | ||
256 | ### Destination tracking | |
257 | ||
258 | To help kill classes of bugs that result in kernel addresses being | |
259 | written to userspace, the destination of writes needs to be tracked. If | |
260 | the buffer is destined for userspace (e.g. seq_file backed /proc files), | |
261 | it should automatically censor sensitive values. |
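
As a rough sketch of the idea, a formatting helper can take the destination into account and print zeros in place of the address whenever the output is headed for userspace. The function `sketch_format_addr` and its `to_user` flag are hypothetical; a real implementation would derive the destination from the buffer itself rather than trust a caller-supplied flag.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format addr as fixed-width hex into buf. When the output is destined
 * for userspace, the real value is replaced with zero.
 */
static int sketch_format_addr(char *buf, size_t len, const void *addr,
                              bool to_user)
{
    uintptr_t shown = to_user ? 0 : (uintptr_t)addr;

    return snprintf(buf, len, "%0*" PRIxPTR,
                    (int)(2 * sizeof(void *)), shown);
}
```

Keeping the width fixed means the censored output has the same shape as the real one, so existing parsers of such files keep working.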