Commit | Line | Data |
---|---|---|
8e0aa6d4 MS |
1 | |
2 | Firmware-Assisted Dump | |
3 | ------------------------ | |
4 | July 2011 | |
5 | ||
6 | The goal of firmware-assisted dump is to enable the dump of | |
7 | a crashed system, and to do so from a fully-reset system, and | |
8 | to minimize the total elapsed time until the system is back | |
9 | in production use. | |
10 | ||
11 | - Firmware-assisted dump (fadump) infrastructure is intended to replace | |
12 | the existing phyp assisted dump. | |
13 | - Fadump uses the same firmware interfaces and memory reservation model | |
14 | as phyp assisted dump. | |
15 | - Unlike phyp dump, fadump exports the memory dump through /proc/vmcore | |
16 | in the ELF format in the same way as kdump. This helps us reuse the | |
17 | kdump infrastructure for dump capture and filtering. | |
18 | - Unlike phyp dump, userspace tools do not need to refer to any sysfs | |
19 | interface while reading /proc/vmcore. | |
20 | - Unlike phyp dump, fadump allows the user to release all the memory reserved | |
21 | for dump with a single operation: echo 1 > /sys/kernel/fadump_release_mem. | |
22 | - Once enabled through the kernel boot parameter, fadump can be | |
23 | started/stopped through /sys/kernel/fadump_registered interface (see | |
24 | sysfs files section below) and can be easily integrated with kdump | |
25 | service start/stop init scripts. | |
26 | ||
27 | Compared with kdump or other strategies, firmware-assisted | |
28 | dump offers several strong, practical advantages: | |
29 | ||
30 | -- Unlike kdump, the system has been reset, and loaded | |
31 | with a fresh copy of the kernel. In particular, | |
32 | PCI and I/O devices have been reinitialized and are | |
33 | in a clean, consistent state. | |
34 | -- Once the dump is copied out, the memory that held the dump | |
35 | is immediately available to the running kernel. Therefore, | |
36 | unlike kdump, fadump doesn't need a second reboot to get the | |
37 | system back to the production configuration. | |
38 | ||
39 | The above can only be accomplished by coordination with, | |
40 | and assistance from the Power firmware. The procedure is | |
41 | as follows: | |
42 | ||
43 | -- The first kernel registers the sections of memory with the | |
44 | Power firmware for dump preservation during OS initialization. | |
45 | These registered sections of memory are reserved by the first | |
46 | kernel during early boot. | |
47 | ||
48 | -- When a system crashes, the Power firmware will save | |
49 | the low memory (boot memory, sized as the larger of 5% of system | |
50 | RAM or 256MB) to the previously registered region. It will | |
51 | also save system registers and hardware PTEs. | |
52 | ||
53 | NOTE: The term 'boot memory' means the size of the low memory chunk | |
54 | that is required for a kernel to boot successfully when | |
55 | booted with restricted memory. By default, the boot memory | |
56 | size will be the larger of 5% of system RAM or 256MB. | |
57 | Alternatively, the user can specify the boot memory size | |
58 | through the boot parameter 'fadump_reserve_mem=', which will | |
59 | override the default calculated size. Use this option | |
60 | if the default boot memory size is not sufficient for the second | |
61 | kernel to boot successfully. | |
62 | ||
63 | -- After the low memory (boot memory) area has been saved, the | |
64 | firmware will reset PCI and other hardware state. It will | |
65 | *not* clear the RAM. It will then launch the bootloader, as | |
66 | normal. | |
67 | ||
68 | -- The freshly booted kernel will notice that there is a new | |
69 | node (ibm,dump-kernel) in the device tree, indicating that | |
70 | there is crash data available from a previous boot. During | |
71 | early boot, the OS will reserve the rest of the memory above | |
72 | the boot memory size, effectively booting with a restricted | |
73 | memory size. This ensures that the second kernel will not | |
74 | touch any of the dump memory area. | |
75 | ||
76 | -- User-space tools will read /proc/vmcore to obtain the contents | |
77 | of memory, which holds the previous crashed kernel dump in ELF | |
78 | format. The userspace tools may copy this info to disk, | |
79 | network, NAS, SAN, iSCSI, etc. as desired. | |
80 | ||
81 | -- Once the userspace tool is done saving the dump, it will echo | |
82 | '1' to /sys/kernel/fadump_release_mem to release the reserved | |
83 | memory back to general use, except the memory required for | |
84 | next firmware-assisted dump registration. | |
85 | ||
86 | e.g. | |
87 | # echo 1 > /sys/kernel/fadump_release_mem | |
88 | ||
89 | Please note that the firmware-assisted dump feature | |
90 | is only available on Power6 and above systems with recent | |
91 | firmware versions. | |
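
The default boot memory size rule described in the NOTE above (the larger
of 5% of system RAM or 256MB) can be sketched as a small shell helper. This
is an illustration only: the function name default_boot_mem_mb is
hypothetical, it works in whole megabytes, and any rounding or alignment
the kernel applies is not modelled.

```shell
#!/bin/sh
# Sketch only: default_boot_mem_mb is a hypothetical helper, not part of
# fadump. It applies the documented rule: larger of 5% of RAM or 256MB.

# default_boot_mem_mb <total_ram_mb>
default_boot_mem_mb() {
    total_mb=$1
    five_percent=$(( total_mb * 5 / 100 ))   # integer division, rounds down
    if [ "$five_percent" -gt 256 ]; then
        echo "$five_percent"
    else
        echo 256                              # 256MB floor
    fi
}

# Usage:
#   default_boot_mem_mb 4096     # 5% is below the floor, so the floor applies
#   default_boot_mem_mb 16384    # 5% exceeds the floor
```

If the computed size turns out to be too small for the second kernel to
boot, 'fadump_reserve_mem=' overrides it as described above.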
92 | ||
93 | Implementation details: | |
94 | ---------------------- | |
95 | ||
96 | During boot, a check is made to see if firmware supports | |
97 | this feature on that particular machine. If it does, then | |
98 | we check to see if an active dump is waiting for us. If yes, | |
99 | then everything but boot memory size of RAM is reserved during | |
100 | early boot (see Fig. 2). This area is released once the userland | |
101 | scripts (e.g. kdump scripts) have finished collecting the dump. | |
102 | If there is dump data, then the | |
103 | /sys/kernel/fadump_release_mem file is created, and the reserved | |
104 | memory is held. | |
105 | ||
106 | If there is no waiting dump data, then only the memory required | |
107 | to hold CPU state, HPTE region, boot memory dump and elfcore | |
108 | header is reserved at the top of memory (see Fig. 1). This area | |
109 | is *not* released: this region will be kept permanently reserved, | |
110 | so that it can act as a receptacle for a copy of the boot memory | |
111 | content in addition to CPU state and HPTE region, in case a | |
112 | crash does occur. | |
113 | ||
114 | o Memory Reservation during first kernel | |
115 | ||
116 | Low memory                                        Top of memory | |
117 | 0      boot memory size                                       | | |
118 | |           |                       |<--Reserved dump area -->| | |
119 | V           V                       |   Permanent Reservation V | |
120 | +-----------+----------/ /----------+---+----+-----------+----+ | |
121 | |           |                       |CPU|HPTE|  DUMP     |ELF | | |
122 | +-----------+----------/ /----------+---+----+-----------+----+ | |
123 |       |                                           ^ | |
124 |       |                                           | | |
125 |       \                                           / | |
126 |        ------------------------------------------- | |
127 |         Boot memory content gets transferred to | |
128 |         reserved area by firmware at the time of | |
129 |         crash | |
130 |                    Fig. 1 | |
131 | ||
132 | o Memory Reservation during second kernel after crash | |
133 | ||
134 | Low memory                                        Top of memory | |
135 | 0      boot memory size                                       | | |
136 | |           |<------------- Reserved dump area -------------->| | |
137 | V           V                                                 V | |
138 | +-----------+----------/ /----------+---+----+-----------+----+ | |
139 | |           |                       |CPU|HPTE|  DUMP     |ELF | | |
140 | +-----------+----------/ /----------+---+----+-----------+----+ | |
141 |       |                                                    | | |
142 |       V                                                    V | |
143 |  Used by second                                    /proc/vmcore | |
144 |  kernel to boot | |
145 |                    Fig. 2 | |
146 | ||
147 | Currently the dump will be copied from /proc/vmcore to a | |
148 | new file upon user intervention. The dump data available through | |
149 | /proc/vmcore will be in ELF format. Hence the existing kdump | |
150 | infrastructure (kdump scripts) to save the dump works fine with | |
151 | minor modifications. | |
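
The save-then-release sequence described above can be sketched as follows.
This is a minimal illustration, not the actual kdump script: the function
name save_and_release and the default destination /var/crash/vmcore are
assumptions, and a plain cp stands in for whatever capture or filtering
tool a real setup uses. The paths are parameters so the logic can be tried
against ordinary files.

```shell
#!/bin/sh
# Sketch only: save_and_release and the /var/crash/vmcore default are
# hypothetical. The two fadump-related paths come from this document.

# save_and_release [vmcore] [dest] [release_file]
save_and_release() {
    vmcore=${1:-/proc/vmcore}
    dest=${2:-/var/crash/vmcore}
    release=${3:-/sys/kernel/fadump_release_mem}

    # Nothing to do unless a dump is actually present.
    if [ ! -r "$vmcore" ]; then
        echo "no dump available at $vmcore" >&2
        return 1
    fi

    # Copy the dump first; release the memory only if the copy succeeded,
    # because the dump is no longer available after the release.
    cp "$vmcore" "$dest" || return 1
    echo 1 > "$release"
}
```

Releasing only after a successful copy keeps the dump available for a
retry if the destination is full or unreachable.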
152 | ||
153 | The tools to examine the dump will be the same as the ones | |
154 | used for kdump. | |
155 | ||
156 | How to enable firmware-assisted dump (fadump): | |
157 | ------------------------------------- | |
158 | ||
159 | 1. Set config option CONFIG_FA_DUMP=y and build kernel. | |
160 | 2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option. | |
161 | 3. Optionally, the user can also set the 'fadump_reserve_mem=' kernel cmdline | |
162 | option to specify the size of the memory to reserve for boot memory dump | |
163 | preservation. | |
164 | ||
165 | NOTE: If firmware-assisted dump fails to reserve memory then it will | |
166 | fall back to the existing kdump mechanism if the 'crashkernel=' option | |
167 | is set on the kernel cmdline. | |
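
To confirm which of the options above a system was booted with, the kernel
command line can be scanned from userspace. The sketch below assumes a
hypothetical helper named fadump_cmdline_status that takes the command line
as a string; the value 512M in the usage comment is only an illustrative
size, not a recommendation.

```shell
#!/bin/sh
# Sketch only: fadump_cmdline_status is a hypothetical helper. It looks for
# the two options named in this document in a kernel command line string.

# fadump_cmdline_status "<kernel cmdline>"
fadump_cmdline_status() {
    status="fadump=off"
    reserve="default"
    # Unquoted on purpose: split the command line into separate arguments.
    for arg in $1; do
        case "$arg" in
            fadump=on)            status="fadump=on" ;;
            fadump_reserve_mem=*) reserve="${arg#fadump_reserve_mem=}" ;;
        esac
    done
    echo "$status reserve=$reserve"
}

# Usage on a running system:
#   fadump_cmdline_status "$(cat /proc/cmdline)"
# Usage with a literal string (512M is an illustrative value):
#   fadump_cmdline_status "root=/dev/sda2 fadump=on fadump_reserve_mem=512M"
```

This only reports what was requested at boot; whether fadump is actually
enabled is shown by /sys/kernel/fadump_enabled.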
168 | ||
169 | Sysfs/debugfs files: | |
170 | ------------ | |
171 | ||
172 | The firmware-assisted dump feature uses the sysfs file system to hold | |
173 | the control files and a debugfs file to display the reserved memory regions. | |
174 | ||
175 | Here is the list of files under kernel sysfs: | |
176 | ||
177 | /sys/kernel/fadump_enabled | |
178 | ||
179 | This is used to display the fadump status. | |
180 | 0 = fadump is disabled | |
181 | 1 = fadump is enabled | |
182 | ||
183 | This interface can be used by kdump init scripts to identify if | |
184 | fadump is enabled in the kernel and act accordingly. | |
185 | ||
186 | /sys/kernel/fadump_registered | |
187 | ||
188 | This is used to display the fadump registration status as well | |
189 | as to control (start/stop) the fadump registration. | |
190 | 0 = fadump is not registered. | |
191 | 1 = fadump is registered and ready to handle system crash. | |
192 | ||
193 | To register fadump, echo 1 > /sys/kernel/fadump_registered; to | |
194 | un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered. | |
195 | Once fadump is un-registered, a system crash will not | |
196 | be handled and vmcore will not be captured. This interface can be | |
197 | easily integrated with kdump service start/stop. | |
198 | ||
199 | /sys/kernel/fadump_release_mem | |
200 | ||
201 | This file is available only when fadump is active during the | |
202 | second kernel. It is used to release the reserved memory | |
203 | regions that are held for saving the crash dump. To release the | |
204 | reserved memory, echo 1 to it: | |
205 | ||
206 | echo 1 > /sys/kernel/fadump_release_mem | |
207 | ||
208 | After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region | |
209 | file will change to reflect the new memory reservations. | |
210 | ||
211 | The existing userspace tools (kdump infrastructure) can be easily | |
212 | enhanced to use this interface to release the memory reserved for | |
213 | dump and continue without a second reboot. | |
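
As an illustration of how the fadump_enabled and fadump_registered files
above might be driven from a service start/stop script, here is a minimal
sketch. The function name fadump_service and the two overridable variables
are assumptions made for this example; they are not part of any existing
kdump init script.

```shell
#!/bin/sh
# Sketch only: fadump_service and the FADUMP_* variables are hypothetical.
# The default paths are the sysfs files described in this document.
FADUMP_ENABLED=${FADUMP_ENABLED:-/sys/kernel/fadump_enabled}
FADUMP_REGISTERED=${FADUMP_REGISTERED:-/sys/kernel/fadump_registered}

# fadump_service start|stop|status
fadump_service() {
    # Act only when fadump is enabled in the running kernel.
    if [ "$(cat "$FADUMP_ENABLED" 2>/dev/null)" != "1" ]; then
        echo "fadump is not enabled" >&2
        return 1
    fi
    case "$1" in
        start)  echo 1 > "$FADUMP_REGISTERED" ;;   # register fadump
        stop)   echo 0 > "$FADUMP_REGISTERED" ;;   # un-register fadump
        status) cat "$FADUMP_REGISTERED" ;;        # 0 or 1
        *)      echo "usage: fadump_service start|stop|status" >&2
                return 2 ;;
    esac
}
```

A script written this way can fall through to its normal kdump handling
when fadump_service returns non-zero.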
214 | ||
215 | Here is the list of files under powerpc debugfs: | |
216 | (Assuming debugfs is mounted on /sys/kernel/debug directory.) | |
217 | ||
218 | /sys/kernel/debug/powerpc/fadump_region | |
219 | ||
220 | This file shows the reserved memory regions if fadump is | |
221 | enabled; otherwise this file is empty. The output format | |
222 | is: | |
223 | <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> | |
224 | ||
225 | e.g. | |
226 | Contents when fadump is registered during first kernel | |
227 | ||
228 | # cat /sys/kernel/debug/powerpc/fadump_region | |
229 | CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 | |
230 | HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 | |
231 | DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 | |
232 | ||
233 | Contents when fadump is active during second kernel | |
234 | ||
235 | # cat /sys/kernel/debug/powerpc/fadump_region | |
236 | CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 | |
237 | HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 | |
238 | DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 | |
239 | : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 | |
240 | ||
241 | NOTE: Please refer to Documentation/filesystems/debugfs.txt on | |
242 | how to mount the debugfs filesystem. | |
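
The fadump_region output format shown above is simple enough to process
with standard shell tools. The sketch below, with a hypothetical function
name fadump_region_total, sums the <reserved-size> field of every region
line read from standard input and prints the total in decimal bytes. It
assumes the sizes are hexadecimal with a 0x prefix, as in the examples.

```shell
#!/bin/sh
# Sketch only: fadump_region_total is a hypothetical helper that reads
# lines in the fadump_region format from stdin.

fadump_region_total() {
    total=0
    while IFS= read -r line; do
        # Extract <reserved-size>: the hex token between "] " and " bytes,".
        size=$(printf '%s\n' "$line" |
               sed -n 's/.*] \(0x[0-9a-fA-F]*\) bytes,.*/\1/p')
        [ -n "$size" ] || continue          # skip lines that do not match
        total=$(( total + $size ))          # shell arithmetic accepts 0x...
    done
    echo "$total"
}

# Usage on a running system (debugfs mounted on /sys/kernel/debug):
#   fadump_region_total < /sys/kernel/debug/powerpc/fadump_region
```

Lines that do not match the format, including an empty file when fadump is
not enabled, contribute nothing, so the result is 0 in that case.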
243 | ||
244 | ||
245 | TODO: | |
246 | ----- | |
247 | o Need to come up with a better approach to find a more | |
248 | accurate boot memory size that is required for a kernel to | |
249 | boot successfully when booted with restricted memory. | |
250 | o The fadump implementation introduces a fadump crash info structure | |
251 | in the scratch area before the ELF core header. The idea of introducing | |
252 | this structure is to pass some important crash info data to the second | |
253 | kernel which will help second kernel to populate ELF core header with | |
254 | correct data before it gets exported through /proc/vmcore. The current | |
255 | design does not address the possibility of introducing | |
256 | additional fields (in future) to this structure without affecting | |
257 | compatibility. Need to come up with a better approach to address this. | |
258 | The possible approaches are: | |
259 | 1. Introduce a version field for version tracking, and bump up the version | |
260 | whenever a new field is added to the structure in future. The version | |
261 | field can be used to find out what fields are valid for the current | |
262 | version of the structure. | |
263 | 2. Reserve an area of predefined size (say PAGE_SIZE) for this | |
264 | structure and have the unused area as reserved (initialized to zero) | |
265 | for future field additions. | |
266 | The advantage of approach 1 over 2 is that we don't need to reserve extra space. | |
267 | --- | |
268 | Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> | |
269 | This document is based on the original documentation written for phyp | |
270 | assisted dump by Linas Vepstas and Manish Ahuja. |