Last reviewed: 04/04/2016

                     HPE iLO NMI Watchdog Driver
              NMI sourcing for iLO based ProLiant Servers
                     Documentation and Driver by
              Thomas Mingarelli <thomas.mingarelli@hpe.com>

The HPE iLO NMI Watchdog driver is a kernel module that provides basic
watchdog functionality and the added benefit of NMI sourcing. Both the
watchdog functionality and the NMI sourcing capability need to be enabled
by the user. The two modes are independent of one another: a user can have
NMI sourcing without the watchdog timer and vice versa.
All references to iLO in this document also apply to iLO2 and all
subsequent generations.

Watchdog functionality is enabled like any other common watchdog driver. That
is, an application needs to be started that opens the watchdog device and
periodically kicks the watchdog timer. A basic application called
watchdog-test.c exists in the Documentation/watchdog/src directory. Compile
the C file and run it. If the system gets into a bad state and hangs, the HPE
ProLiant iLO timer register will not be updated in a timely fashion and a
hardware system reset (also known as an Automatic Server Recovery (ASR))
event will occur.
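
As a rough illustration of what such an application does, the following C
sketch uses the generic Linux watchdog API from <linux/watchdog.h>, not
anything specific to hpwdt. Opening the device starts the timer, each
WDIOC_KEEPALIVE ioctl reloads it, and writing the magic character 'V' before
close() asks the driver to stop it. The function name kick_watchdog and its
interval/count parameters are hypothetical; a real daemon would loop forever
and pick an interval comfortably shorter than soft_margin.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Hypothetical helper: ping the watchdog `count` times, `interval_s`
 * seconds apart, then attempt a magic close.
 * Returns 0 on success, -1 if the device could not be opened or a
 * keepalive failed.
 */
int kick_watchdog(const char *dev, unsigned int interval_s, unsigned int count)
{
    int fd = open(dev, O_WRONLY); /* opening the device starts the timer */
    if (fd < 0) {
        perror("open watchdog");
        return -1;
    }

    for (unsigned int i = 0; i < count; i++) {
        /* Reload the timer; if this stops happening, the ASR fires. */
        if (ioctl(fd, WDIOC_KEEPALIVE, 0) < 0) {
            perror("WDIOC_KEEPALIVE");
            close(fd);
            return -1;
        }
        sleep(interval_s);
    }

    /*
     * Magic close: writing 'V' asks the driver to stop the timer on close.
     * When nowayout is set, the driver ignores this and the timer keeps
     * running, so the server will reset unless something keeps pinging.
     */
    if (write(fd, "V", 1) != 1)
        perror("magic close");
    close(fd);
    return 0;
}
```

For example, kick_watchdog("/dev/watchdog", 10, 6) would ping every 10
seconds for about a minute and then attempt a magic close.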
23 | ||
84df082c | 24 | The hpwdt driver also has three (3) module parameters. They are the following: |
47bece87 | 25 | |
84df082c NC |
26 | soft_margin - allows the user to set the watchdog timer value. |
27 | Default value is 30 seconds. | |
28 | allow_kdump - allows the user to save off a kernel dump image after an NMI. | |
29 | Default value is 1/ON | |
47bece87 TM |
30 | nowayout - basic watchdog parameter that does not allow the timer to |
31 | be restarted or an impending ASR to be escaped. | |
84df082c NC |
32 | Default value is set when compiling the kernel. If it is set |
33 | to "Y", then there is no way of disabling the watchdog once | |
34 | it has been started. | |
47bece87 TM |

NOTE: More information about watchdog drivers in general, including the ioctl
      interface to /dev/watchdog, can be found in
      Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt.
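
As a hedged sketch of that ioctl interface, the C function below requests a
new timeout at run time, which plays the same role as the soft_margin module
parameter. It first checks for WDIOF_SETTIMEOUT in the driver's
struct watchdog_info. The function name set_watchdog_timeout is hypothetical.
The driver may round or clamp the requested value, so the sketch returns
whatever the driver writes back rather than assuming the request was honored.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Hypothetical helper: request a new timeout (in seconds) from a loaded
 * watchdog driver. Returns the timeout reported back by the driver, or -1
 * on failure. Opening the device starts the timer, so the sketch ends with
 * a magic close; with nowayout=Y that close is ignored.
 */
int set_watchdog_timeout(const char *dev, int requested_s)
{
    struct watchdog_info info;
    int timeout = requested_s;
    int fd = open(dev, O_WRONLY);

    if (fd < 0) {
        perror("open watchdog");
        return -1;
    }

    /* Only try WDIOC_SETTIMEOUT if the driver advertises support for it. */
    if (ioctl(fd, WDIOC_GETSUPPORT, &info) < 0 ||
        !(info.options & WDIOF_SETTIMEOUT)) {
        fprintf(stderr, "%s: timeout cannot be changed\n", dev);
        timeout = -1;
    } else {
        /* identity is a fixed 32-byte field that may lack a terminator. */
        printf("driver identity: %.32s\n", (const char *)info.identity);
        /* The driver may adjust the value and writes the result back. */
        if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) < 0) {
            perror("WDIOC_SETTIMEOUT");
            timeout = -1;
        }
    }

    /* Magic close: ask the driver to stop the timer before closing. */
    if (write(fd, "V", 1) != 1)
        perror("magic close");
    close(fd);
    return timeout;
}
```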
39 | ||
44df7535 | 40 | The NMI sourcing capability is disabled by default due to the inability to |
47bece87 TM |
41 | distinguish between "NMI Watchdog Ticks" and "HW generated NMI events" in the |
42 | Linux kernel. What this means is that the hpwdt nmi handler code is called | |
43 | each time the NMI signal fires off. This could amount to several thousands of | |
44 | NMIs in a matter of seconds. If a user sees the Linux kernel's "dazed and | |
45 | confused" message in the logs or if the system gets into a hung state, then | |
84df082c | 46 | the hpwdt driver can be reloaded. |
47bece87 TM |

 1. If the kernel has not been booted with nmi_watchdog turned off, then
    add nmi_watchdog=0 to the end of the currently booting kernel line.
    Depending on your Linux distribution and platform setup, the kernel
    line is in one of these files:
    For non-UEFI systems:
        /boot/grub/grub.conf or
        /boot/grub/menu.lst
    For UEFI systems:
        /boot/efi/EFI/distroname/grub.conf or
        /boot/efi/efi/distroname/elilo.conf
 2. Reboot the server.
 3. Once the system comes up, unload the driver: modprobe -r hpwdt
 4. Load the driver again: modprobe hpwdt
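
To confirm after the reboot that step 1 took effect, the running kernel's
command line can be read from /proc/cmdline. The C sketch below is an
illustration only: the helper names cmdline_has_token and
booted_with_nmi_watchdog_off are hypothetical, and the check matches the exact
token nmi_watchdog=0 on the boot command line rather than querying the
kernel's run-time NMI watchdog state.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the whitespace-separated command
 * line contains exactly the token `param` (for example "nmi_watchdog=0").
 */
bool cmdline_has_token(const char *cmdline, const char *param)
{
    size_t len = strlen(param);
    const char *p = cmdline;

    while (*p) {
        /* Skip separators between tokens. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        /* Find the end of the current token. */
        const char *start = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
        /* Whole-token match, so "nmi_watchdog=00" does not count. */
        if ((size_t)(p - start) == len && strncmp(start, param, len) == 0)
            return true;
    }
    return false;
}

/*
 * Hypothetical helper: returns 1 if the running kernel was booted with
 * nmi_watchdog=0, 0 if not, and -1 if /proc/cmdline cannot be read.
 */
int booted_with_nmi_watchdog_off(void)
{
    char buf[4096];
    FILE *f = fopen("/proc/cmdline", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return cmdline_has_token(buf, "nmi_watchdog=0") ? 1 : 0;
}
```

On kernels that provide it, /proc/sys/kernel/nmi_watchdog also reports the
current NMI watchdog setting at run time.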

Now the hpwdt driver can successfully receive and source the NMI and provide a
log message that details the reason for the NMI (as determined by the HPE
BIOS).

Below is a list of NMIs the HPE BIOS understands along with the associated
code (reason):

    No source found                00h
    Uncorrectable Memory Error     01h
    ASR NMI                        1Bh
    PCI Parity Error               20h
    NMI Button Press               27h
    SB_BUS_NMI                     28h
    ILO Doorbell NMI               29h
    ILO IOP NMI                    2Ah
    ILO Watchdog NMI               2Bh
    Proc Throt NMI                 2Ch
    Front Side Bus NMI             2Dh
    PCI Express Error              2Fh
    DMA controller NMI             30h
    Hypertransport/CSI Error       31h
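
For user-space tooling that needs to turn these reason codes into text (for
example, when correlating logs), the table maps directly onto a lookup array.
This C sketch is illustrative only: nmi_reason_text is a hypothetical helper,
not part of the hpwdt driver, and its strings are copied from the table.

```c
#include <stddef.h>

/* Reason codes reported by the HPE BIOS, as listed in the table above. */
struct nmi_reason {
    unsigned char code;
    const char *text;
};

static const struct nmi_reason nmi_reasons[] = {
    { 0x00, "No source found" },
    { 0x01, "Uncorrectable Memory Error" },
    { 0x1B, "ASR NMI" },
    { 0x20, "PCI Parity Error" },
    { 0x27, "NMI Button Press" },
    { 0x28, "SB_BUS_NMI" },
    { 0x29, "ILO Doorbell NMI" },
    { 0x2A, "ILO IOP NMI" },
    { 0x2B, "ILO Watchdog NMI" },
    { 0x2C, "Proc Throt NMI" },
    { 0x2D, "Front Side Bus NMI" },
    { 0x2F, "PCI Express Error" },
    { 0x30, "DMA controller NMI" },
    { 0x31, "Hypertransport/CSI Error" },
};

/* Hypothetical helper: map a reason code to its description. */
const char *nmi_reason_text(unsigned char code)
{
    for (size_t i = 0; i < sizeof(nmi_reasons) / sizeof(nmi_reasons[0]); i++)
        if (nmi_reasons[i].code == code)
            return nmi_reasons[i].text;
    return "Unknown reason"; /* code not in the BIOS table */
}
```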

 -- Tom Mingarelli
    (thomas.mingarelli@hpe.com)