Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | |
2 | ||
3 | PCI Bus EEH Error Recovery | |
4 | -------------------------- | |
5 | Linas Vepstas | |
6 | <linas@austin.ibm.com> | |
7 | 12 January 2005 | |
8 | ||
9 | ||
10 | Overview: | |
11 | --------- | |
12 | The IBM POWER-based pSeries and iSeries computers include PCI bus | |
13 | controller chips that have extended capabilities for detecting and | |
14 | reporting a large variety of PCI bus error conditions. These features | |
15 | go under the name of "EEH", for "Extended Error Handling". The EEH | |
16 | hardware features allow PCI bus errors to be cleared and a PCI | |
17 | card to be "rebooted", without also having to reboot the operating | |
18 | system. | |
19 | ||
20 | This is in contrast to traditional PCI error handling, where the | |
21 | PCI chip is wired directly to the CPU, and an error would cause | |
22 | a CPU machine-check/check-stop condition, halting the CPU entirely. | |
23 | Another "traditional" technique is to ignore such errors, which | |
24 | can lead to data corruption, of either user data or kernel data, | |
25 | hung/unresponsive adapters, or system crashes/lockups. Thus, | |
26 | the idea behind EEH is that the operating system can become more | |
27 | reliable and robust by protecting it from PCI errors, and giving | |
28 | the OS the ability to "reboot"/recover individual PCI devices. | |
29 | ||
30 | Future systems from other vendors, based on the PCI-E specification, | |
31 | may contain similar features. | |
32 | ||
33 | ||
34 | Causes of EEH Errors | |
35 | -------------------- | |
36 | EEH was originally designed to guard against hardware failure, such | |
37 | as PCI cards dying from heat, humidity, dust, vibration and bad | |
38 | electrical connections. The vast majority of EEH errors seen in | |
01dd2fbf ML |
39 | "real life" are due to either poorly seated PCI cards, or, |
40 | unfortunately quite commonly, due to device driver bugs, device firmware | |
1da177e4 LT |
41 | bugs, and sometimes PCI card hardware bugs. |
42 | ||
43 | The most common software bug is one that causes the device to | |
44 | attempt to DMA to a location in system memory that has not been | |
45 | reserved for DMA access for that card. Catching this is a powerful | |
46 | feature of EEH, as it prevents what would otherwise have been silent | |
47 | memory corruption caused by the bad DMA. A number of device driver | |
48 | bugs have been found and fixed in this way over the past few | |
49 | years. Other possible causes of EEH errors include data or | |
50 | address line parity errors (for example, due to poor electrical | |
51 | connectivity caused by a poorly seated card), and PCI-X split-completion | |
52 | errors (due to software, device firmware, or device PCI hardware bugs). | |
53 | The vast majority of "true hardware failures" can be cured by | |
54 | physically removing and re-seating the PCI card. | |
55 | ||
56 | ||
57 | Detection and Recovery | |
58 | ---------------------- | |
59 | In the following discussion, a generic overview of how to detect | |
60 | and recover from EEH errors will be presented. This is followed | |
61 | by an overview of how the current implementation in the Linux | |
62 | kernel does it. The actual implementation is subject to change, | |
63 | and some of the finer points are still being debated. These | |
64 | may in turn be swayed if or when other architectures implement | |
65 | similar functionality. | |
66 | ||
67 | When a PCI Host Bridge (PHB, the bus controller connecting the | |
68 | PCI bus to the system CPU electronics complex) detects a PCI error | |
69 | condition, it will "isolate" the affected PCI card. Isolation | |
70 | will block all writes (either to the card from the system, or | |
71 | from the card to the system), and it will cause all reads to | |
72 | return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads). | |
73 | This value was chosen because it is the same value you would | |
74 | get if the device were physically unplugged from the slot. | |
75 | Isolation covers accesses to PCI memory, I/O space, and PCI config | |
76 | space. Interrupts, however, will continue to be delivered. | |
77 | ||
78 | Detection and recovery are performed with the aid of ppc64 | |
79 | firmware. The programming interfaces in the Linux kernel | |
80 | into the firmware are referred to as RTAS (Run-Time Abstraction | |
81 | Services). The Linux kernel does not (should not) access | |
82 | the EEH function in the PCI chipsets directly, primarily because | |
83 | there are a number of different chipsets out there, each with | |
84 | different interfaces and quirks. The firmware provides a | |
85 | uniform abstraction layer that will work with all pSeries | |
86 | and iSeries hardware (and be forwards-compatible). | |
87 | ||
88 | If the OS or device driver suspects that a PCI slot has been | |
89 | EEH-isolated, there is a firmware call it can make to determine if | |
90 | this is the case. If so, then the device driver should put itself | |
91 | into a consistent state (given that it won't be able to complete any | |
92 | pending work) and start recovery of the card. Recovery normally | |
d6bc8ac9 | 93 | would consist of resetting the PCI device (holding the PCI #RST |
1da177e4 LT |
94 | line high for two seconds), followed by setting up the device |
95 | config space (the base address registers (BAR's), latency timer, | |
96 | cache line size, interrupt line, and so on). This is followed by a | |
97 | reinitialization of the device driver. In a worst-case scenario, | |
98 | the power to the card can be toggled, at least on hot-plug-capable | |
99 | slots. In principle, layers far above the device driver probably | |
100 | do not need to know that the PCI card has been "rebooted" in this | |
101 | way; ideally, there should be at most a pause in Ethernet/disk/USB | |
102 | I/O while the card is being reset. | |
103 | ||
104 | If the card cannot be recovered after three or four resets, the | |
105 | kernel/device driver should assume the worst-case scenario, that the | |
106 | card has died completely, and report this error to the sysadmin. | |
107 | In addition, error messages are reported through RTAS and also through | |
108 | syslogd (/var/log/messages) to alert the sysadmin of PCI resets. | |
109 | The correct way to deal with failed adapters is to use the standard | |
110 | PCI hotplug tools to remove and replace the dead card. | |
111 | ||
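The bounded-retry policy described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: `MAX_RESETS`, `reset_fn`, `recover_card()` and `fake_reset()` are hypothetical names invented here, the limit of three is one reading of "three or four resets", and the reset itself is simulated rather than performed through firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RESETS 3   /* assumed limit; the text says "three or four" */

/* Hypothetical callback: performs one reset, reports whether the card recovered. */
typedef bool (*reset_fn)(void *card);

/*
 * Try to recover a card by resetting it at most MAX_RESETS times.
 * Returns the number of resets used on success, or -1 if the card
 * should be treated as dead and reported to the sysadmin.
 */
int recover_card(reset_fn reset, void *card)
{
    assert(reset != NULL);
    for (int attempt = 1; attempt <= MAX_RESETS; attempt++) {
        if (reset(card))
            return attempt;
    }
    return -1;
}

/* Simulated card: recovers once its "resets needed" countdown reaches zero. */
bool fake_reset(void *card)
{
    int *resets_needed = card;
    return --(*resets_needed) <= 0;
}
```

With `fake_reset()`, a card that needs two resets is recovered on the second attempt, while one that needs ten is given up on after `MAX_RESETS` attempts.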
112 | ||
113 | Current PPC64 Linux EEH Implementation | |
114 | -------------------------------------- | |
115 | At this time, a generic EEH recovery mechanism has been implemented, | |
116 | so that individual device drivers do not need to be modified to support | |
117 | EEH recovery. This generic mechanism piggy-backs on the PCI hotplug | |
312c004d | 118 | infrastructure, and percolates events up through the userspace/udev |
a2ffd275 | 119 | infrastructure. Following is a detailed description of how this is |
1da177e4 LT |
120 | accomplished. |
121 | ||
122 | EEH must be enabled in the PHBs very early during the boot process, | |
123 | and again whenever a PCI slot is hot-plugged. The former is performed by | |
2ef9481e | 124 | eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by |
1da177e4 LT |
125 | drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code. |
126 | EEH must be enabled before a PCI scan of the device can proceed. | |
127 | Current Power5 hardware will not work unless EEH is enabled, | |
128 | although older Power4 hardware can run with it disabled. Effectively, | |
129 | EEH can no longer be turned off. PCI devices *must* be | |
130 | registered with the EEH code; the EEH code needs to know about | |
131 | the I/O address ranges of the PCI device in order to detect an | |
132 | error. Given an arbitrary address, the routine | |
133 | pci_get_device_by_addr() will find the PCI device associated | |
134 | with that address (if any). | |
135 | ||
b8b572e1 | 136 | The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(), |
d533f671 | 137 | etc. include a check to see if the I/O read returned all-0xff's. |
1da177e4 LT |
138 | If so, these make a call to eeh_dn_check_failure(), which in turn |
139 | asks the firmware if the all-ff's value is the sign of a true EEH | |
140 | error. If it is not, processing continues as normal. The grand | |
141 | total number of these false alarms or "false positives" can be | |
142 | seen in /proc/ppc64/eeh (subject to change). Normally, almost | |
143 | all of these occur during boot, when the PCI bus is scanned, where | |
144 | a large number of 0xff reads are part of the bus scan procedure. | |
145 | ||
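The read-side check described above can be modelled in user-space C roughly as follows. It is a sketch, not the kernel's code: `is_all_ones()`, `check_read()`, `slot_is_frozen` and `false_positives` are hypothetical names, and the firmware query is replaced by a plain flag so that the example can run anywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter, standing in for the total shown in /proc/ppc64/eeh. */
unsigned long false_positives;

/* Hypothetical stand-in for the firmware answer; a real kernel asks RTAS. */
bool slot_is_frozen;

/* True if a read of `size` bytes (1, 2 or 4) returned all ones. */
bool is_all_ones(uint32_t val, unsigned int size)
{
    switch (size) {
    case 1:  return (val & 0xffu) == 0xffu;
    case 2:  return (val & 0xffffu) == 0xffffu;
    case 4:  return val == 0xffffffffu;
    default: return false;
    }
}

/* Returns true only when an all-ones read is confirmed as a real EEH freeze. */
bool check_read(uint32_t val, unsigned int size)
{
    assert(size == 1 || size == 2 || size == 4);
    if (!is_all_ones(val, size))
        return false;        /* fast path: ordinary data, no firmware call */
    if (slot_is_frozen)
        return true;         /* firmware confirms the slot is isolated */
    false_positives++;       /* legitimate all-ones data, e.g. during a bus scan */
    return false;
}
```

The point of the two-stage check is cost: the comparison against all ones is cheap and runs on every read, while the firmware query is made only for the rare reads that look suspicious.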
2ef9481e JM |
146 | If a frozen slot is detected, code in |
147 | arch/powerpc/platforms/pseries/eeh.c will print a stack trace to | |
148 | syslog (/var/log/messages). This stack trace has proven to be very | |
149 | useful to device-driver authors for finding out at what point the EEH | |
150 | error was detected, as the error itself usually occurs slightly | |
151 | beforehand. | |
1da177e4 LT |
152 | |
153 | Next, it uses the Linux kernel notifier chain/work queue mechanism to | |
154 | allow any interested parties to find out about the failure. Device | |
155 | drivers, or other parts of the kernel, can use | |
156 | eeh_register_notifier(struct notifier_block *) to find out about EEH | |
157 | events. The event will include a pointer to the pci device, the | |
158 | device node and some state info. Receivers of the event can "do as | |
159 | they wish"; the default handler will be described further in this | |
160 | section. | |
161 | ||
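To make the notifier idea concrete, here is a small user-space model of a notifier chain. Everything prefixed with `sketch_` is hypothetical and invented for this illustration: it is not the kernel's struct notifier_block, nor the real signature or event layout used by eeh_register_notifier(). The event fields merely mirror the three items named in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event payload: PCI device, device node and some state info. */
struct sketch_eeh_event {
    const char *pci_dev_name;
    const char *device_node;
    int state;
};

/* User-space model of a notifier block; not the kernel structure. */
struct sketch_notifier {
    int (*call)(struct sketch_notifier *self, struct sketch_eeh_event *ev);
    struct sketch_notifier *next;
};

struct sketch_notifier *sketch_chain;

/* Add a receiver at the head of the chain. */
void sketch_register_notifier(struct sketch_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = sketch_chain;
    sketch_chain = nb;
}

/* Deliver one event to every registered receiver; returns how many ran. */
int sketch_notify(struct sketch_eeh_event *ev)
{
    int called = 0;
    for (struct sketch_notifier *nb = sketch_chain; nb != NULL; nb = nb->next) {
        nb->call(nb, ev);
        called++;
    }
    return called;
}

/* Example receiver: remembers the state carried by the last event it saw. */
int sketch_last_state = -1;

int sketch_record_event(struct sketch_notifier *self, struct sketch_eeh_event *ev)
{
    (void)self;
    sketch_last_state = ev->state;
    return 0;
}
```

A receiver fills in a `struct sketch_notifier` with its callback, registers it once, and is then called for every event passed to `sketch_notify()`.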
162 | To assist in the recovery of the device, eeh.c exports the | |
163 | following functions: | |
164 | ||
165 | rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second | |
166 | rtas_configure_bridge() -- ask firmware to configure any PCI bridges | |
167 | located topologically under the pci slot. | |
168 | eeh_save_bars() and eeh_restore_bars(): save and restore the PCI | |
169 | config-space info for a device and any devices under it. | |
170 | ||
171 | ||
172 | A handler for the EEH notifier_block events is implemented in | |
173 | drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events(). | |
174 | It saves the device BAR's and then calls rpaphp_unconfig_pci_adapter(). | |
175 | This last call causes the device driver for the card to be stopped, | |
312c004d | 176 | which causes uevents to go out to user space. This triggers |
1da177e4 LT |
177 | user-space scripts that might issue commands such as "ifdown eth0" |
178 | for ethernet cards, and so on. This handler then sleeps for 5 seconds, | |
179 | hoping to give the user-space scripts enough time to complete. | |
180 | It then resets the PCI card, reconfigures the device BAR's, and | |
181 | any bridges underneath. It then calls rpaphp_enable_pci_slot(), | |
182 | which restarts the device driver and triggers more user-space | |
183 | events (for example, calling "ifup eth0" for ethernet cards). | |
184 | ||
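The order of operations in the handler described above can be summarized with the following sketch. The enum comments name the functions mentioned in the text, but `run_recovery()`, `step_fn`, `struct trace` and `record_step()` are hypothetical, and stopping at the first failed step is an assumption made for the example rather than something the text states.

```c
#include <assert.h>
#include <stddef.h>

/* Steps in the order the text describes them. */
enum step {
    STEP_SAVE_BARS,        /* eeh_save_bars() */
    STEP_UNCONFIG,         /* rpaphp_unconfig_pci_adapter(): stops the driver */
    STEP_WAIT_USERSPACE,   /* sleep 5 seconds for user-space scripts */
    STEP_RESET,            /* rtas_set_slot_reset() */
    STEP_RESTORE_BARS,     /* eeh_restore_bars() */
    STEP_CONFIGURE_BRIDGE, /* rtas_configure_bridge() */
    STEP_ENABLE_SLOT,      /* rpaphp_enable_pci_slot(): restarts the driver */
    STEP_COUNT
};

/* Hypothetical callback that performs one step; returns 0 on success. */
typedef int (*step_fn)(enum step s, void *ctx);

/*
 * Run every step in order. Returns STEP_COUNT if all succeeded,
 * otherwise the index of the first step that failed (assumed policy).
 */
int run_recovery(step_fn do_step, void *ctx)
{
    assert(do_step != NULL);
    for (int s = 0; s < STEP_COUNT; s++) {
        if (do_step((enum step)s, ctx) != 0)
            return s;
    }
    return STEP_COUNT;
}

/* Test double: records the steps it is asked to perform, optionally failing one. */
struct trace {
    int seen[STEP_COUNT];
    int n;
    int fail_at;   /* step to fail, or -1 for none */
};

int record_step(enum step s, void *ctx)
{
    struct trace *t = ctx;
    t->seen[t->n++] = (int)s;
    return ((int)s == t->fail_at) ? -1 : 0;
}
```

Saving the BARs has to come first because, once the slot is reset, the device's config space no longer holds the values needed to restore it.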
185 | ||
186 | Device Shutdown and User-Space Events | |
187 | ------------------------------------- | |
188 | This section documents what happens when a pci slot is unconfigured, | |
189 | focusing on how the device driver gets shut down, and on how the | |
190 | events get delivered to user-space scripts. | |
191 | ||
192 | Following is an example sequence of calls that causes a device driver's | |
193 | close function to be called during the first phase of an EEH reset. | |
194 | The example uses the pcnet32 device driver. | |
195 | ||
196 | rpaphp_unconfig_pci_adapter (struct slot *) // in rpaphp_pci.c | |
197 | { | |
198 | calls | |
199 | pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c | |
200 | { | |
201 | calls | |
202 | pci_destroy_dev (struct pci_dev *) | |
203 | { | |
204 | calls | |
205 | device_unregister (&dev->dev) // in /drivers/base/core.c | |
206 | { | |
207 | calls | |
208 | device_del (struct device *) | |
209 | { | |
210 | calls | |
211 | bus_remove_device() // in /drivers/base/bus.c | |
212 | { | |
213 | calls | |
214 | device_release_driver() | |
215 | { | |
216 | calls | |
217 | struct device_driver->remove() which is just | |
218 | pci_device_remove() // in /drivers/pci/pci_driver.c | |
219 | { | |
220 | calls | |
221 | struct pci_driver->remove() which is just | |
222 | pcnet32_remove_one() // in /drivers/net/pcnet32.c | |
223 | { | |
224 | calls | |
225 | unregister_netdev() // in /net/core/dev.c | |
226 | { | |
227 | calls | |
228 | dev_close() // in /net/core/dev.c | |
229 | { | |
230 | calls dev->stop(); | |
231 | which is just pcnet32_close() // in pcnet32.c | |
232 | { | |
233 | which does what you wanted | |
234 | to stop the device | |
235 | } | |
236 | } | |
237 | } | |
238 | which | |
239 | frees pcnet32 device driver memory | |
240 | } | |
241 | }}}}}} | |
242 | ||
243 | ||
244 | in drivers/pci/pci_driver.c, | |
245 | struct device_driver->remove() is just pci_device_remove() | |
246 | which calls struct pci_driver->remove() which is pcnet32_remove_one() | |
247 | which calls unregister_netdev() (in net/core/dev.c) | |
248 | which calls dev_close() (in net/core/dev.c) | |
249 | which calls dev->stop() which is pcnet32_close() | |
250 | which then does the appropriate shutdown. | |
251 | ||
252 | --- | |
253 | Following is the analogous stack trace for events sent to user-space | |
254 | when the pci device is unconfigured. | |
255 | ||
256 | rpaphp_unconfig_pci_adapter() { // in rpaphp_pci.c | |
257 | calls | |
258 | pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c | |
259 | calls | |
260 | pci_destroy_dev (struct pci_dev *) { | |
261 | calls | |
312c004d | 262 | device_unregister (&dev->dev) { // in /drivers/base/core.c |
1da177e4 | 263 | calls |
312c004d | 264 | device_del(struct device * dev) { // in /drivers/base/core.c |
1da177e4 | 265 | calls |
312c004d | 266 | kobject_del() { // in /lib/kobject.c |
1da177e4 | 267 | calls |
312c004d | 268 | kobject_uevent() { // in /lib/kobject.c |
1da177e4 | 269 | calls |
312c004d | 270 | kset_uevent() { // in /lib/kobject.c |
1da177e4 | 271 | calls |
312c004d | 272 | kset->uevent_ops->uevent() // which is really just |
1da177e4 | 273 | a call to |
312c004d | 274 | dev_uevent() { // in /drivers/base/core.c |
1da177e4 | 275 | calls |
312c004d KS |
276 | dev->bus->uevent() which is really just a call to |
277 | pci_uevent () { // in drivers/pci/hotplug.c | |
1da177e4 LT |
278 | which prints device name, etc.... |
279 | } | |
280 | } | |
312c004d KS |
281 | then kobject_uevent() sends a netlink uevent to userspace |
282 | --> userspace uevent | |
283 | (during early boot, nobody listens to netlink events and | |
284 | kobject_uevent() executes uevent_helper[], which runs the | |
285 | event process /sbin/hotplug) | |
1da177e4 LT |
286 | } |
287 | } | |
288 | kobject_del() then calls sysfs_remove_dir(), which would | |
289 | alert any user-space daemon that was watching sysfs | |
290 | to the delete event. | |
291 | ||
292 | ||
293 | Pros and Cons of the Current Design | |
294 | ----------------------------------- | |
295 | There are several issues with the current EEH software recovery design, | |
296 | which may be addressed in future revisions. But first, note that the | |
297 | big plus of the current design is that no changes need to be made to | |
298 | individual device drivers, so that the current design throws a wide net. | |
299 | The biggest negative of the design is that it potentially disturbs | |
300 | network daemons and file systems that didn't need to be disturbed. | |
301 | ||
302 | -- A minor complaint is that resetting the network card causes | |
303 | back-to-back user-space ifdown/ifup burps that potentially disturb | |
304 | network daemons that did not even need to know that the PCI | |
305 | card was being rebooted. | |
306 | ||
307 | -- A more serious concern is that the same reset, for SCSI devices, | |
308 | wreaks havoc on mounted file systems. Scripts cannot post-facto | |
309 | unmount a file system without flushing pending buffers, but this | |
310 | is impossible, because I/O has already been stopped. Thus, | |
311 | ideally, the reset should happen at or below the block layer, | |
312 | so that the file systems are not disturbed. | |
313 | ||
314 | Reiserfs does not tolerate errors returned from the block device. | |
315 | Ext3fs seems to be tolerant, retrying reads/writes until they | |
316 | succeed. Both have been only lightly tested in this scenario. | |
317 | ||
318 | The SCSI-generic subsystem already has built-in code for performing | |
319 | SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter | |
320 | (HBA) resets. These are cascaded into a chain of attempted | |
321 | resets if a SCSI command fails. These are completely hidden | |
322 | from the block layer. It would be very natural to add an EEH | |
323 | reset into this chain of events. | |
324 | ||
325 | -- If a SCSI error occurs for the root device, all is lost unless | |
326 | the sysadmin had the foresight to run /bin, /sbin, /etc, /var | |
327 | and so on, out of ramdisk/tmpfs. | |
328 | ||
329 | ||
330 | Conclusions | |
331 | ----------- | |
332 | There's forward progress ... | |
333 | ||
334 |