

                       PCI Error Recovery
                       ------------------
                         May 31, 2005

               Current document maintainer:
           Linas Vepstas <linas@austin.ibm.com>


Some PCI bus controllers are able to detect certain "hard" PCI errors
on the bus, such as parity errors on the data and address busses, as
well as SERR and PERR errors.  These chipsets are then able to disable
I/O to/from the affected device, so that, for example, a bad DMA
address doesn't end up corrupting system memory.  These same chipsets
are also able to reset the affected PCI device, and return it to
working condition.  This document describes a generic API for
performing error recovery.

The core idea is that after a PCI error has been detected, there must
be a way for the kernel to coordinate with all affected device drivers
so that the PCI card can be made operational again, possibly after
performing a full electrical #RST of the PCI card.  The API below
provides a generic way for device drivers to be notified of PCI
errors, and to be notified of, and respond to, a reset sequence.

Preliminary sketch of API, cut-n-pasted-n-modified email from
Ben Herrenschmidt, circa 5 April 2005

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver. The absence of this pointer in pci_driver denotes a
"non-aware" driver; behaviour for such drivers is platform dependent.
Platforms like ppc64 can try to simulate PCI hotplug remove/add.

The definition of "pci_error_token" is not covered here. It is based on
Seto's work on synchronous error detection. We still need to define
functions for extracting information out of an opaque error token. This
is separate from this API.

This structure has the form:

struct pci_error_handlers
{
	int (*error_detected)(struct pci_dev *dev, pci_error_token error);
	int (*mmio_enabled)(struct pci_dev *dev);
	int (*resume)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
};

A driver doesn't have to implement all of these callbacks. The
only mandatory one is error_detected(). If a callback is not
implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then the
driver is assumed not to do any direct recovery and to require
a reset. If link_reset() is not implemented, the card is assumed
not to care about link resets, in which case, if direct recovery is
supported, the core can try to recover (but must not call slot_reset()
unless it really did reset the slot). If slot_reset() is not supported,
link_reset() can be called instead on a slot reset.
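
The rules above for missing callbacks can be sketched as a few checks
that a platform core might make. This is only an illustration: the
stand-in definitions of struct pci_dev and pci_error_token, and the
helper names driver_is_aware(), driver_can_recover_directly() and
callback_after_slot_reset(), are assumptions made so that the example
builds outside the kernel; they are not part of the API.

```c
#include <stddef.h>

/* Stand-ins so the sketch builds outside the kernel (assumptions). */
struct pci_dev { int unused; };
typedef int pci_error_token;

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, pci_error_token error);
	int (*mmio_enabled)(struct pci_dev *dev);
	int (*resume)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
};

typedef int (*pci_reset_cb)(struct pci_dev *dev);

/* A driver is "aware" only if it supplies the mandatory callback. */
static int driver_is_aware(const struct pci_error_handlers *h)
{
	return h != NULL && h->error_detected != NULL;
}

/* Direct recovery needs both mmio_enabled() and resume(). */
static int driver_can_recover_directly(const struct pci_error_handlers *h)
{
	return driver_is_aware(h) &&
	       h->mmio_enabled != NULL && h->resume != NULL;
}

/* After a slot reset: prefer slot_reset(), fall back to link_reset(). */
static pci_reset_cb callback_after_slot_reset(const struct pci_error_handlers *h)
{
	if (!driver_is_aware(h))
		return NULL;
	return h->slot_reset != NULL ? h->slot_reset : h->link_reset;
}

/* Trivial callbacks, used only to populate a table in examples. */
static int demo_error_detected(struct pci_dev *dev, pci_error_token error)
{
	(void)dev;
	(void)error;
	return 0;
}

static int demo_link_reset(struct pci_dev *dev)
{
	(void)dev;
	return 0;
}
```

A table that sets only error_detected is "aware" but cannot recover
directly, and has nothing to call after a slot reset until it also
supplies link_reset() or slot_reset().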

The first call is always:

	1) error_detected()

	Error detected. This is sent once after an error has been detected. At
this point, the device might not be accessible anymore depending on the
platform (the slot will be isolated on ppc64). The driver may already
have "noticed" the error because of a failing IO, but this is the proper
"synchronisation point", that is, it gives the driver a chance to
clean up and wait for pending work (timers and so on) to complete; it
can take semaphores, schedule, etc... everything but touch the device.
Within this function and after it returns, the driver shouldn't do any
new IOs. Called in task context. This is sort of a "quiesce" point. See
the note about interrupts at the end of this doc.

	Result codes:
		- PCIERR_RESULT_CAN_RECOVER:
		  Driver returns this if it thinks it might be able to recover
		  the HW by just banging IOs or if it wants to be given
		  a chance to extract some diagnostic information (see
		  below).
		- PCIERR_RESULT_NEED_RESET:
		  Driver returns this if it thinks it can't recover unless the
		  slot is reset.
		- PCIERR_RESULT_DISCONNECT:
		  Driver returns this if it thinks it won't recover at all.
		  (This will detach the driver? Or just leave it
		  dangling? To be decided.)
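
As an illustration of the three result codes, a driver's
error_detected() might look roughly like the sketch below. The numeric
values of the result codes, the stand-in struct pci_dev with a
driver_data pointer, and the demo_priv state with its three flags are
all assumptions made for the example; none of them is defined by this
document.

```c
/* Result codes named in this document; the numeric values are assumed. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER = 1,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
	PCIERR_RESULT_RECOVERED
};

/* Stand-ins so the sketch builds outside the kernel (assumptions). */
struct pci_dev { void *driver_data; };
typedef int pci_error_token;

/* Hypothetical per-device driver state. */
struct demo_priv {
	int io_stopped;		/* set once the driver must not touch HW */
	int can_soft_recover;	/* driver knows how to recover without reset */
	int hw_gone;		/* driver already knows the device is lost */
};

static int demo_error_detected(struct pci_dev *dev, pci_error_token error)
{
	struct demo_priv *priv = dev->driver_data;

	(void)error;		/* token contents are not covered here */
	priv->io_stopped = 1;	/* quiesce: other paths check this flag */

	if (priv->hw_gone)
		return PCIERR_RESULT_DISCONNECT;
	if (priv->can_soft_recover)
		return PCIERR_RESULT_CAN_RECOVER;
	return PCIERR_RESULT_NEED_RESET;
}
```

The callback never touches the device; it only records that IO must
stop and reports what kind of recovery the driver expects to need.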

So at this point, we have called error_detected() for all drivers
on the segment that had the error. On ppc64, the slot is isolated. What
happens now typically depends on the result from the drivers. If all
drivers on the segment/slot return PCIERR_RESULT_CAN_RECOVER, we would
re-enable IOs on the slot (or do nothing special if the platform doesn't
isolate slots) and call 2). If not, and we can reset slots, we go to 4);
if neither, we have a dead slot. If it's a hotplug slot, we might
"simulate" a reset by triggering HW unplug/replug though.

>>> Current ppc64 implementation assumes that a device driver will
>>> *not* schedule or take a semaphore in this routine; the current ppc64
>>> implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.)  This seems excessively
>>> complex and not worth implementing.
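
The choice of the next step, once every driver on the segment has
answered error_detected(), can be sketched as follows. The enum
next_step, the function after_error_detected() and the numeric values
of the result codes are hypothetical; the logic simply follows the
paragraph before the ppc64 note above and is not taken from any
platform's code.

```c
#include <stddef.h>

/* Result codes named in this document; the numeric values are assumed. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER = 1,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT
};

/* Hypothetical names for the step the platform takes next. */
enum next_step {
	STEP_MMIO_ENABLED,	/* go to 2) */
	STEP_SLOT_RESET,	/* go to 4) */
	STEP_DEAD_SLOT		/* nothing more can be done */
};

static enum next_step after_error_detected(const int *results, size_t n,
					   int platform_can_reset)
{
	size_t i;
	int all_can_recover = (n > 0);	/* an empty segment never "recovers" */

	for (i = 0; i < n; i++)
		if (results[i] != PCIERR_RESULT_CAN_RECOVER)
			all_can_recover = 0;

	if (all_can_recover)
		return STEP_MMIO_ENABLED;
	if (platform_can_reset)
		return STEP_SLOT_RESET;
	return STEP_DEAD_SLOT;
}
```

A single driver that does not return PCIERR_RESULT_CAN_RECOVER is
enough to push the whole segment towards a slot reset or a dead slot.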

>>> The current ppc64 implementation doesn't much care if the device
>>> attempts i/o at this point, or not.  I/O's will fail, returning
>>> a value of 0xff on read, and writes will be dropped. If the device
>>> driver attempts more than 10K I/O's to a frozen adapter, the
>>> implementation will assume that the device driver has gone into an
>>> infinite loop, and it will panic the kernel.
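
Given the behaviour in the note above, a driver that polls a status
register may want to bound its loop and give up as soon as a read
comes back as 0xff. The sketch below is one hypothetical way to do
that; read8_fn, poll_until_ready(), MAX_POLLS and the fake readers are
invented for the example, and a real device may legitimately return
0xff from some registers, so the check is only a heuristic.

```c
#include <stdint.h>

#define FROZEN_READ_VALUE 0xffu	/* what reads return on a frozen slot */
#define MAX_POLLS 100		/* assumed bound, far below 10K reads */

/* Hypothetical register accessor; ctx identifies the device. */
typedef uint8_t (*read8_fn)(void *ctx);

/* Returns 0 once a ready bit is seen, -1 if frozen or timed out. */
static int poll_until_ready(read8_fn read8, void *ctx, uint8_t ready_mask)
{
	int i;

	for (i = 0; i < MAX_POLLS; i++) {
		uint8_t v = read8(ctx);

		if (v == FROZEN_READ_VALUE)
			return -1;	/* looks frozen: stop touching it */
		if (v & ready_mask)
			return 0;
	}
	return -1;			/* bounded: never spin forever */
}

/* Fake readers for illustration; ctx points at a read counter. */
static uint8_t fake_frozen_read(void *ctx)
{
	(*(int *)ctx)++;
	return 0xff;
}

static uint8_t fake_ready_read(void *ctx)
{
	(*(int *)ctx)++;
	return 0x01;
}

static uint8_t fake_busy_read(void *ctx)
{
	(*(int *)ctx)++;
	return 0x00;
}
```

With the frozen reader the loop stops after a single read; with the
busy reader it stops after MAX_POLLS reads instead of looping until
the platform intervenes.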

	2) mmio_enabled()

	This is the "early recovery" call. IOs are allowed again, but DMA is
not (hrm... to be discussed, I prefer not), with some restrictions. This
is NOT a callback for the driver to start operations again, only to
peek/poke at the device, extract diagnostic information, if any, and
possibly do things like trigger a device local reset or some such,
but not restart operations. This is sent if all drivers on a segment
agree that they can try to recover and no automatic link reset was
performed by the HW. If the platform can't just re-enable IOs without
a slot reset or a link reset, it doesn't call this callback and goes
directly to 3) or 4). All IOs should be done _synchronously_ from
within this callback; errors triggered by them will be returned via
the normal pci_check_whatever() api, and no new error_detected() callback
will be issued due to an error happening here. However, such an error
might cause IOs to be re-blocked for the whole segment, and thus
invalidate the recovery that other devices on the same segment might
have done, forcing the whole segment into one of the next states,
that is link reset or slot reset.

	Result codes:
		- PCIERR_RESULT_RECOVERED
		  Driver returns this if it thinks the device is fully
		  functional and thinks it is ready to start
		  normal driver operations again. There is no
		  guarantee that the driver will actually be
		  allowed to proceed, as another driver on the
		  same segment might have failed and thus triggered a
		  slot reset on platforms that support it.

		- PCIERR_RESULT_NEED_RESET
		  Driver returns this if it thinks the device is not
		  recoverable in its current state and it needs a slot
		  reset to proceed.

		- PCIERR_RESULT_DISCONNECT
		  Same as above. Total failure, no recovery even after
		  reset; the driver is dead. (To be defined more precisely)

>>> The current ppc64 implementation does not implement this callback.

	3) link_reset()

	This is called after the link has been reset. This is typically
a PCI Express specific state at this point and is done whenever a
non-fatal error has been detected that can be "solved" by resetting
the link. This call informs the driver of the reset and the driver
should check if the device appears to be in working condition.
This function acts a bit like 2) mmio_enabled(), in that the driver
is not supposed to restart normal driver I/O operations right away.
Instead, it should just "probe" the device to check its recoverability
status. If all is right, then the core will call resume() once all
drivers have ack'd link_reset().

	Result codes:
		(identical to mmio_enabled)

>>> The current ppc64 implementation does not implement this callback.

	4) slot_reset()

	This is called after the slot has been soft or hard reset by the
platform.  A soft reset consists of asserting the adapter #RST line
and then restoring the PCI BARs and PCI configuration header. If the
platform supports PCI hotplug, then it might instead perform a hard
reset by toggling power on the slot off/on. This call gives drivers
the chance to re-initialize the hardware (re-download firmware, etc.),
but drivers shouldn't restart normal I/O processing operations at
this point.  (See note about interrupts; interrupts aren't guaranteed
to be delivered until the resume() callback has been called). If all
device drivers report success on this callback, the platform will call
resume() to complete the error handling and let the driver restart
normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset.  If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again.  If the device still can't
be recovered, there is nothing more that can be done;  the platform
will typically report a "permanent failure" in such a case.  The
device will be considered "dead" in this case.

	Result codes:
		- PCIERR_RESULT_DISCONNECT
		Same as above.
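
The escalation from a soft reset to a hard reset described above can
be sketched as follows. Everything here is hypothetical scaffolding:
struct slot_ops, reset_and_notify(), the fake slot and the numeric
values of the result codes are assumptions for the example, and the
sketch treats PCIERR_RESULT_RECOVERED as the "success" answer from
slot_reset().

```c
/* Result codes named in this document; the numeric values are assumed. */
enum pcierr_result {
	PCIERR_RESULT_DISCONNECT = 1,
	PCIERR_RESULT_RECOVERED
};

enum reset_kind { RESET_SOFT, RESET_HARD };

/* Hypothetical platform hooks for one slot with one driver. */
struct slot_ops {
	void (*reset)(void *ctx, enum reset_kind kind);	/* platform reset */
	int (*slot_reset)(void *ctx);			/* driver callback */
	int can_power_cycle;				/* hotplug capable? */
};

static int reset_and_notify(const struct slot_ops *ops, void *ctx)
{
	ops->reset(ctx, RESET_SOFT);
	if (ops->slot_reset(ctx) == PCIERR_RESULT_RECOVERED)
		return PCIERR_RESULT_RECOVERED;

	if (ops->can_power_cycle) {
		/* Soft reset was not enough: power cycle and ask again. */
		ops->reset(ctx, RESET_HARD);
		if (ops->slot_reset(ctx) == PCIERR_RESULT_RECOVERED)
			return PCIERR_RESULT_RECOVERED;
	}
	return PCIERR_RESULT_DISCONNECT;	/* permanent failure */
}

/* Fake slot whose driver only comes back after a hard reset. */
struct fake_slot {
	enum reset_kind last;
	int resets;
};

static void fake_reset(void *ctx, enum reset_kind kind)
{
	struct fake_slot *s = ctx;

	s->last = kind;
	s->resets++;
}

static int fake_slot_reset(void *ctx)
{
	const struct fake_slot *s = ctx;

	return s->last == RESET_HARD ? PCIERR_RESULT_RECOVERED
				     : PCIERR_RESULT_DISCONNECT;
}
```

On a slot that can be power cycled the fake driver recovers on the
second attempt; without that capability the device is declared dead
after the single soft reset.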

>>> The current ppc64 implementation does not try a power-cycle reset
>>> if the driver returned PCIERR_RESULT_DISCONNECT. However, it should.

	5) resume()

	This is called if all drivers on the segment have returned
PCIERR_RESULT_RECOVERED from one of the 3 previous callbacks.
That basically tells the driver to restart activity, that everything
is back and running. No result code is taken into account here. If
a new error happens, it will restart a new error handling process.
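
The condition for calling resume() can be sketched as a simple gate
over the segment. struct seg_driver and resume_segment() are
hypothetical, the numeric values of the result codes are assumed, and
setting the resumed flag stands in for actually invoking each driver's
resume() callback.

```c
#include <stddef.h>

/* Result codes named in this document; the numeric values are assumed. */
enum pcierr_result {
	PCIERR_RESULT_NEED_RESET = 1,
	PCIERR_RESULT_DISCONNECT,
	PCIERR_RESULT_RECOVERED
};

/* Hypothetical bookkeeping for one driver on the segment. */
struct seg_driver {
	int last_result;	/* answer from the previous callback */
	int resumed;		/* set when resume() would be called */
};

/* Returns 1 and marks every driver resumed only if all recovered. */
static int resume_segment(struct seg_driver *drv, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (drv[i].last_result != PCIERR_RESULT_RECOVERED)
			return 0;	/* nobody is resumed */

	for (i = 0; i < n; i++)
		drv[i].resumed = 1;	/* stands in for ->resume(dev) */
	return 1;
}
```

The first loop finishes before any driver is resumed, so a single
unrecovered driver keeps the whole segment quiesced.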

That's it. I think this covers all the possibilities. The way those
callbacks are called is platform policy. A platform with no slot reset
capability for example may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, there is a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)

After much thinking, I decided to leave that to the platform. That is,
the recovery API only specifies that:

 - There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the
resume() callback is sent, at which point interrupts are expected to be
fully operational.

 - There is no guarantee that interrupt delivery is stopped, that is, a
driver that gets an interrupt after detecting an error, or that detects
an error within the interrupt handler such that it prevents proper
ack'ing of the interrupt (and thus removal of the source) should just
return IRQ_NOTHANDLED. It's up to the platform to deal with that
condition, typically by masking the irq source during the duration of
the error handling. It is expected that the platform "knows" which
interrupts are routed to error-management capable slots and can deal
with temporarily disabling that irq number during error processing (this
isn't terribly complex). That means some IRQ latency for other devices
sharing the interrupt, but there is simply no other way. High end
platforms aren't supposed to share interrupts between many devices
anyway :)
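
The driver side of the second point might look like the sketch below.
The handler signature is simplified, and struct demo_dev with its
in_error_recovery flag is hypothetical. The return value names follow
this document; the kernel's own irqreturn_t spells the "not handled"
value IRQ_NONE.

```c
/* Names follow this document; the numeric values are assumed. */
enum demo_irqreturn {
	IRQ_NOTHANDLED = 0,
	IRQ_HANDLED = 1
};

/* Hypothetical per-device state shared with the error callbacks. */
struct demo_dev {
	int in_error_recovery;		/* set in error_detected(),
					 * cleared in resume() */
	unsigned int irqs_serviced;
};

static enum demo_irqreturn demo_interrupt(struct demo_dev *dev)
{
	/* Device is isolated or being reset: do not touch it, and let
	 * the platform deal with the interrupt source. */
	if (dev->in_error_recovery)
		return IRQ_NOTHANDLED;

	dev->irqs_serviced++;		/* normal servicing would go here */
	return IRQ_HANDLED;
}
```

The flag would be set in error_detected() and cleared again in
resume(), matching the window in which interrupt delivery is not
guaranteed.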


Revised: 31 May 2005 Linas Vepstas <linas@austin.ibm.com>