HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still a work in progress, so it may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
Ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362    17        27622    7        6823
  128    758150     464364    21        9301     10       7738
  256    445632     774646    42        15507    21       12906
  512    232666     994445    241292    19147    241192   1062
 1024    119061    1000003    872519    19258    872511   0
 1440     85193    1000003    946576    19505    946569   0

Legend:
"Psize" == packet size in bytes.
"Ipps" == input packets per second.
"Tput" == packets out of a total of 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
the load, the more times we couldn't clean up the rx ring.

Observe that:
when the NIC receives 890K packets/sec, only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from that table). So there is
a possibility that under low enough input, you get one poll call for each
input packet caused by a single interrupt each time. And if the system
can't handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....


0) Prerequisites:
=================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts, or other events that send packets up
the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document moves only the receive processing
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handling to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to move everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms:
I) What is known as Clear-on-read (COR):
when you read the status/event register, it clears everything!
The natsemi and sunbmac NICs are known to do this.
In this case your only choice is to move everything to dev->poll().

II) Clear-on-write (COW):
i) You clear the status by writing a 1 in the bit-location you want
(a sketch of this follows below).
These are the majority of the NICs and they work best with NAPI.
Put only receive events in dev->poll(); leave the rest in
the old interrupt handler.
ii) Whatever you write in the status register clears everything ;->
We can't seem to find any NIC supported by Linux which does this. If
someone knows such a chip, please email us.
Move everything to dev->poll().

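As a minimal sketch of the COW type i) acknowledgement (INT_STATUS,
RX_INT and RX_NOBUF are hypothetical register/bit names; readl() and
writel() are the usual kernel MMIO accessors), the interrupt handler
might ack everything except the rx sources:

	u32 status = readl(ioaddr + INT_STATUS); /* read does not clear (COW) */
	/* writing 1s back clears only the written bits; leave the rx and
	 * rxnobuff bits set so NAPI can ack them later from dev->poll() */
	writel(status & ~(RX_INT | RX_NOBUF), ioaddr + INT_STATUS);
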
C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts were being
re-enabled (refer to appendix 2). A packet might sneak in during the period
we are enabling interrupts. We only get to know about such a packet when the
next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to further
discussion of it.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick up the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round-robin fashion.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to dev->poll()).
For example, link/MII and tx-complete interrupts continue functioning just
the same old way. This improves the latency of processing these events. It
is also assumed that the receive interrupt is the largest cause of noise.
Note this might not always be true.
[According to Manfred Spraul, the winbond chip insists on sending one
tx-complete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move everything to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
Puts the device in a state which allows it to be added to the
CPU polling list if it is up and running. You can look at this as
the first half of netif_rx_schedule(dev) above; the second half
being c) below.

c) __netif_rx_schedule(dev)
Adds the device to the poll list for this CPU, assuming that _prep above
has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
Called to reschedule polling for the device, specifically for some
deficient hardware. Read appendix 2 for more details.

e) netif_rx_complete(dev)

Removes the interface from the CPU poll list: it must be in the poll list
on the current CPU. This primitive is called by dev->poll() when
it completes its work. The device cannot be out of the poll list at this
call; if it is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.
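
To make the relationship between a), b) and c) concrete, a) is
essentially the composition of the other two; roughly (this mirrors the
description above, not necessarily the exact core source):

	static inline void netif_rx_schedule(struct net_device *dev)
	{
		if (netif_rx_schedule_prep(dev))
			__netif_rx_schedule(dev);
	}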

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets on the current CPU before yielding to the network
subsystem (so other devices can also get the opportunity to send to the
stack).

The dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
The total number of packets cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver just sends
to the stack, if it can, the packet quantity requested.

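To make the accounting contract concrete, here is a minimal sketch
(my_rx_one() is a hypothetical helper that passes one packet up the
stack and returns 0 when the rx ring is empty; the interrupt
re-enabling and race handling shown in later sections are omitted):

	static int my_poll(struct net_device *dev, int *budget)
	{
		int received = 0;
		int rx_work_limit = dev->quota;

		if (rx_work_limit > *budget)
			rx_work_limit = *budget;

		while (received < rx_work_limit && my_rx_one(dev))
			received++;

		dev->quota -= received;
		*budget -= received;

		if (received < rx_work_limit) {	/* ring drained */
			netif_rx_complete(dev);
			return 0;		/* done */
		}
		return 1;			/* limit hit: more work pending */
	}
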
More on dev->poll() below, after the interrupt changes are explained.

2) registering dev->poll() method
=================================

dev->poll should be set in the dev->probe() method,
e.g.:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;


3) scheduling dev->poll()
=========================
This involves modifying the interrupt handler and the code
path which takes packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical Donald Becker
interrupt handler:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;

	int work_count = my_work_count;	/* driver-specific work limit */
	int status;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE;	/* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED;	/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobufs) {
			make_rx_buffs_avail();
		}

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

	} while (!(status & error) || more_work_to_be_done);
	return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;
	int status;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE;	/* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED;	/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
/************************ start note *********************************/
		acknowledge_ints_ASAP();	/* don't ack rx and rxnobuff here */
/************************ end note ***********************************/

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
/************************ start note *********************************/
		if ((status & rx_interrupt) || (status & rx_nobufs)) {
			if (netif_rx_schedule_prep(dev)) {

				/* disable interrupts caused
				 * by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell system we have work to be done. */
				__netif_rx_schedule(dev);
			} else {
				printk(KERN_ERR "driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note ***********************************/

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);

			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/
	} while (!(status & error) || more_work_to_be_done(status));
/************************ end note ***********************************/
	return IRQ_HANDLED;
}

---------------------------------------------------------------------


We note several things from the above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) and b) a packet arriving and finding no DMA buffers
available (rxnobuff).
This means that acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
the proper work is done within NAPI: in poll() and refill_rx_ring(),
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in a running state and
was successfully added to the core poll list. If we get a zero value,
we can _almost_ assume we are already on the list (instead of not running;
this logic is based on the fact that you shouldn't get an interrupt if
the device is not running).
We rectify this by disabling rx and rxnobuff interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() may have disappeared.
These functionalities are actually still around......

In fact, receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
==============================================

We need to convert the classical Donald Becker receive_packets(dev)
to my_poll().

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by the interrupt handler */
static void receive_packets(struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	unsigned char *rx_ring = tp->rx_ring;
	unsigned int cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
		/* read size+status of next frame from DMA ring buffer */
		/* the numbers 16 and 4 are just examples */
		rx_status = le32_to_cpu(*(u32 *) (rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		/* process errors */
		if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err(rx_status, dev, tp, ioaddr);
			return;
		}

		if (--rx_work_limit < 0)
			break;

		/* grab a skb */
		skb = dev_alloc_skb(pkt_size + 2);
		if (skb) {
			.
			.
			netif_rx(skb);
			.
			.
		} else {	/* OOM */
			/* seems very driver specific ... some just pass
			   whatever is on the ring already. */
		}

		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++;

	}

	/* store current ring pointer state */
	tp->cur_rx = cur_rx;

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring();
	.
	.

}
-------------------------------------------------------------------
We change it to the new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll(struct net_device *dev, int *budget)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	unsigned char *rx_ring = tp->rx_ring;
	unsigned int cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int status;
	/* maximum packets to send to the stack */
/************************ note note *********************************/
	int rx_work_limit = dev->quota;

/************************ end note **********************************/
	do {	/* outer beginning loop starts here */

		clear_rx_status_register_bit();

		while (rx_ring_not_empty) {
			u32 rx_status;
			unsigned int rx_size;
			unsigned int pkt_size;
			struct sk_buff *skb;
			/* read size+status of next frame from DMA ring buffer */
			/* the numbers 16 and 4 are just examples */
			rx_status = le32_to_cpu(*(u32 *) (rx_ring + ring_offset));
			rx_size = rx_status >> 16;
			pkt_size = rx_size - 4;

			/* process errors */
			if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
			    (!(rx_status & RxStatusOK))) {
				netdrv_rx_err(rx_status, dev, tp, ioaddr);
				return 1;
			}

/************************ note note *********************************/
			if (--rx_work_limit < 0) { /* we got packets, but no quota */
				/* store current ring pointer state */
				tp->cur_rx = cur_rx;

				/* Refill the Rx ring buffers if they are needed */
				refill_rx_ring(dev);
				goto not_done;
			}
/********************** end note ************************************/

			/* grab a skb */
			skb = dev_alloc_skb(pkt_size + 2);
			if (skb) {
				.
				.
/************************ note note *********************************/
				netif_receive_skb(skb);
/********************** end note ************************************/
				.
				.
			} else {	/* OOM */
				/* seems very driver specific ... common is just pass
				   whatever is on the ring already. */
			}

			/* move to the next skb on the ring */
			entry = (++cur_rx) % RX_RING_SIZE;
			received++;

		}

		/* store current ring pointer state */
		tp->cur_rx = cur_rx;

		/* Refill the Rx ring buffers if they are needed */
		refill_rx_ring(dev);

		/* no packets on ring; but new ones can arrive since we last
		   checked */
		status = read_interrupt_status_reg();
		if (rx status is not set) {
			/* If something arrives in this narrow window,
			   an interrupt will be generated */
			goto done;
		}
		/* done! at least that's what it looks like ;->
		   if new packets came in after our last check on status bits
		   they'll be caught by the while check, and we go back and
		   clear them since we haven't exceeded our quota */
	} while (rx_status_is_set);

done:

/************************ note note *********************************/
	dev->quota -= received;
	*budget -= received;

	/* If the RX ring is not full, we are out of memory. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		goto oom;

	/* we are happy/done, no more packets on ring; put us back
	   to where we can start processing interrupts again */
	netif_rx_complete(dev);
	enable_rx_and_rxnobuff_ints();

	/* The last op happens after poll completion. Which means the following:
	 * 1. it can race with disabling irqs in the irq handler (which are done
	 *    to schedule polls)
	 * 2. it can race with dis/enabling irqs in other poll threads
	 * 3. if an irq was raised after the beginning of the outer beginning
	 *    loop (marked in the code above), it will be immediately
	 *    triggered here.
	 *
	 * Summarizing: the logic may result in some redundant irqs both
	 * due to races in masking and due to too late acking of already
	 * processed irqs. The good news: no events are ever lost.
	 */

	return 0;	/* done */

not_done:
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (!received) {
		printk("received==0\n");
		received = 1;
	}
	dev->quota -= received;
	*budget -= received;
	return 1;	/* not_done */

oom:
	/* Start timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
	return 0;	/* we'll take it from here so tell core "done" */

/************************ End note **********************************/
}
-------------------------------------------------------------------

From the above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work (a sketch of refill_rx_ring() follows this list).
2) We have a done and a not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the OOM condition.
5) A new outer do {} while loop has been added. This ensures that if a
new packet has come in after we are all set and done, and we have not
exceeded our quota, we continue sending packets up.

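Since refill_rx_ring() is referenced throughout but never shown, here is
a hypothetical outline (the DMA-mapping details are elided, PKT_BUF_SZ is
a placeholder buffer size and clear_rxnobuff_status_bit() is a made-up
helper for the ack described in note 1 above; real drivers differ):

	static void refill_rx_ring(struct net_device *dev)
	{
		struct my_private *tp = (struct my_private *)dev->priv;

		for (; tp->cur_rx - tp->dirty_rx > 0; tp->dirty_rx++) {
			unsigned int entry = tp->dirty_rx % RX_RING_SIZE;

			if (tp->rx_buffers[entry].skb == NULL) {
				struct sk_buff *skb = dev_alloc_skb(PKT_BUF_SZ);
				if (skb == NULL)
					break;	/* OOM: retry later */
				tp->rx_buffers[entry].skb = skb;
				/* ... map the skb for DMA and hand the
				 * descriptor back to the NIC ... */
			}
		}
		/* per note 1) above: ack rxnobuff now that buffers exist */
		clear_rxnobuff_status_bit();
	}
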

-----------------------------------------------------------
The poll timer code will need to do the following:

a)

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If the RX ring is still not full, we are still out of memory.
	   Restart the timer again. Else we re-add ourselves
	   to the master poll list.
	 */

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();
	else
		netif_rx_schedule(dev);	/* we are back on the poll list */

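For completeness, here is a hypothetical start_poll_timer() using the
standard kernel timer API (oom_timer and my_poll_timer_handler, which
would run the code in a) above, are made-up names; HZ/10 is an arbitrary
retry interval):

	static void start_poll_timer(struct net_device *dev)
	{
		struct my_private *tp = (struct my_private *)dev->priv;

		init_timer(&tp->oom_timer);
		tp->oom_timer.function = my_poll_timer_handler;
		tp->oom_timer.data = (unsigned long)dev;
		tp->oom_timer.expires = jiffies + HZ/10;
		add_timer(&tp->oom_timer);
	}
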
5) dev->close() and dev->suspend() issues
=========================================
The driver writer needn't worry about this; the top net layer takes
care of it.

6) Adding new Stats to /proc
============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this in later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC only send a pause packet when they run out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packetstorm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note that FC in itself is a good solution, but we have found it to not be
much of a commodity feature (in both NICs and switches) and hence it falls
under the same category as using NIC-based mitigation. Also, experiments
indicate that with FC it's much harder to resolve the resource allocation
issue (aka the lazy receiving that NAPI offers), and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but
is not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here:

1) status/int which honors level-triggered IRQs

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated [assuming the status bit was not turned off].
Generally, the concept of level-triggered IRQs in association with a status
and interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in the tulip).
The corresponding interrupt-enable bit (CSR7 in the tulip) might be turned
off (but CSR5 will continue to be turned on with new packet arrivals, even
if we clear it the first time).
Very important is the fact that if we turn the interrupt-enable bit back on
while status is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done", and then went on to do a few other things, then when we enable
interrupts there is a possibility that a new packet might have sneaked in
during this phase. It helps to look at the pseudocode for the tulip poll
routine:

--------------------------
do {
	ACK;
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, no touching irq status/mask
	}
	/* No packets, but new ones can arrive while we are doing this */
	CSR5 := read
	if (CSR5 is not set) {
		/* If something arrives in this narrow window here,
		 * where the comments are ;-> irq will be generated */
		unmask irqs;
		exit poll;
	}
} while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you just finished grabbing all the packets from the rx ring... you check if
the status bit says there are more packets just in... it says none; you then
enable rx interrupts again; if a new packet just came in during this check,
we are counting on the fact that CSR5 will be set in that small window of
opportunity and that by re-enabling interrupts, we would actually trigger
an interrupt to register the new packet for processing.

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]

2) non-capable hardware

These do not generally respect level-triggered IRQs. Normally,
irqs may be lost while being masked, and the only way to leave poll is to
double-check for new input after netif_rx_complete() is invoked
and re-enable polling (after seeing this new input).

Sample code:

---------
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	.
	.
	.
	enable_rx_interrupts()
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobufs()
		goto restart_poll
	}
---------

Basically, netif_rx_complete() removes us from the poll list, but a new
packet which would never be noticed (due to the possibility of a race)
might come in; so we attempt to re-add ourselves to the poll list.




APPENDIX 3: Scheduling issues
=============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt, and
to put them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It further means that the priority of ksoftirqd
needs to be considered when running very CPU-intensive applications and
networking, to get the proper softirq/user balance. Increasing the ksoftirqd
priority to 0 (or eventually more) is reported to cure problems with low
network performance at high CPU load.

Most used processes in a GigE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN  Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S    Aug 15  74:12 gated

--------------------------------------------------------------------

Relevant sites:
===============
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
=================
People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik <jgarzik@pobox.com>