It was a regular Monday morning, busy as usual. Then one client popped up with a very interesting problem. His server had recently become unresponsive under very high load, and he was wondering why CloudLinux wasn't stopping the issue.
I quickly logged into the server (the unresponsiveness became obvious right away) and ran top. I start with top pretty much every time someone has an “overload” issue with a server. It was the right thing to do this time as well, since it gave me an idea of where to look next. The si value was at 70%. Something was wrong, really wrong.
si stands for the percentage of CPU time spent handling software interrupts. On most servers you would rarely see si using more than 2 to 4% of CPU. A software interrupt is an asynchronous signal that needs to be handled by some code. They are normal, and happen all the time. For example, software interrupts happen on each timer tick, or when a network card receives a packet of data and needs software to process it.
My next step was to see which interrupts were the most frequent on this system and might be causing the issue.
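The si figure that top reports can also be derived by hand from two samples of the aggregate cpu line in /proc/stat. The following is only a minimal sketch, assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, ...); the function names are my own and not part of any tool.

```python
import time


def softirq_percent(before: str, after: str) -> float:
    """Share of CPU time (in %) spent in softirq between two 'cpu' lines of /proc/stat."""
    def fields(line: str) -> list:
        parts = line.split()
        if not parts or parts[0] != "cpu" or len(parts) < 8:
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        return [int(value) for value in parts[1:]]

    deltas = [new - old for old, new in zip(fields(before), fields(after))]
    # Sum user..steal only; guest time (if present) is already counted in user.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[6] / total  # index 6 is the softirq counter


def sample_softirq_percent(interval: float = 1.0) -> float:
    """Hypothetical helper: sample /proc/stat twice on a live Linux system."""
    def read_cpu_line() -> str:
        with open("/proc/stat") as stat:
            return stat.readline()

    first = read_cpu_line()
    time.sleep(interval)
    return softirq_percent(first, read_cpu_line())
```

On a live system, sample_softirq_percent() should land close to the si value top shows over the same interval.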
# cat /proc/interrupts
           CPU0       CPU1
  0: 1566845520      60143  IO-APIC-edge   timer
  1:          1          2  IO-APIC-edge   i8042
  8:          0          1  IO-APIC-edge   rtc
  9:          0          0  IO-APIC-level  acpi
 12:          1          3  IO-APIC-edge   i8042
 50:        226          0  PCI-MSI        hda_intel
169:          0          0  IO-APIC-level  uhci_hcd:usb5
209:          0          0  IO-APIC-level  uhci_hcd:usb4
217:          0          0  IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2
225:      13475  111221010  IO-APIC-level  uhci_hcd:usb3, ata_piix
233:         62  327366496  PCI-MSI        eth0
NMI:     493362     580033
LOC: 1566919097 1566920751
RES:   27519611   15339092
ERR:          0
MIS:          0
Now, the first column is the IRQ (interrupt request) number, and the CPU0 and CPU1 columns show how many times the interrupt was handled by each CPU. The next column is the type of the interrupt, which is not important in this case, and the last column lists the drivers that are listening for the interrupt.
I knew that the timer could be ignored, since it increments on each clock tick.
LOC stands for local timer interrupts, and can be ignored as well.
RES stands for rescheduling interrupts, and it looked fine.
The other two IRQ numbers that were very active were 225 and 233.
225 was used by uhci_hcd:usb3, ata_piix while 233 was used by eth0.
This is a web server, so I expected lots of network traffic, and high IRQ activity for eth0 was normal.
IRQ 225 didn't look as good. uhci_hcd is a USB host controller driver, and ata_piix is the driver for a standard Intel ATA disk controller. The disk activity (based on iotop and iostat output) wasn't that high, but the counter was increasing very fast. Could it be an interrupt storm caused by some conflict between the two devices?
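Watching a counter "increase very fast" by eye works, but comparing two snapshots of /proc/interrupts makes it easier to rank the lines by growth. This is a rough sketch, assuming the layout shown above (a header row of CPU names, then one row per IRQ); parse_interrupts and fastest_growing are names I made up for illustration.

```python
def parse_interrupts(text: str) -> dict:
    """Map each IRQ label to (count summed over all CPUs, rest of the line)."""
    lines = [line for line in text.splitlines() if line.strip()]
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        tokens = rest.split()
        counts = []
        # Rows such as ERR or MIS carry fewer counters than there are CPUs.
        while tokens and len(counts) < cpu_count and tokens[0].isdigit():
            counts.append(int(tokens.pop(0)))
        table[label.strip()] = (sum(counts), " ".join(tokens))
    return table


def fastest_growing(before: str, after: str, top: int = 3) -> list:
    """Return (growth, irq, description) tuples, largest growth first."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    growth = [(new[irq][0] - old[irq][0], irq, new[irq][1])
              for irq in new if irq in old]
    return sorted(growth, reverse=True)[:top]
```

Feeding it two reads of /proc/interrupts taken a few seconds apart gives a ranked list; timer and LOC will usually sit near the top and can be discounted for the reasons given earlier.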
Well, USB is not needed on a web server, so it was easy to test.
# rmmod uhci_hcd ohci_hcd ehci_hcd
That unloaded the USB-related modules. Checking the interrupts again:
# cat /proc/interrupts
           CPU0       CPU1
  0: 1567154926      60143  IO-APIC-edge   timer
  1:          1          2  IO-APIC-edge   i8042
  8:          0          1  IO-APIC-edge   rtc
  9:          0          0  IO-APIC-level  acpi
 12:          1          3  IO-APIC-edge   i8042
 50:        226          0  PCI-MSI        hda_intel
225:      13475  111239531  IO-APIC-level  ata_piix
233:         62  327404778  PCI-MSI        eth0
NMI:     493550     580292
LOC: 1567228505 1567230167
RES:   27526034   15340183
ERR:          0
MIS:          0
USB was no more. The system became responsive again, si dropped to 3%, and the load average dropped as well. It was a conflict between USB and ATA. I added the USB modules to the blacklist so they wouldn't be loaded after a reboot. That was done by adding the following lines to /etc/modprobe.d/blacklist.conf:
# disable usb
blacklist uhci_hcd
blacklist ohci_hcd
blacklist ehci_hcd
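A blacklist entry only stops modprobe from loading a module automatically through its aliases, so it seems worth confirming after the next reboot that none of the three modules came back. A small sketch of such a check, assuming the standard /proc/modules format where the module name is the first field on each line; the helper names are hypothetical.

```python
USB_MODULES = ("uhci_hcd", "ohci_hcd", "ehci_hcd")


def loaded_modules(proc_modules_text: str) -> set:
    """Return the module names listed in the content of /proc/modules."""
    return {line.split()[0]
            for line in proc_modules_text.splitlines() if line.strip()}


def still_loaded(proc_modules_text: str, names=USB_MODULES) -> list:
    """Return the blacklisted modules that are nevertheless loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in names if name in loaded]
```

Calling still_loaded(open("/proc/modules").read()) on the server should return an empty list if the blacklist did its job.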
While the situation is highly unusual, and probably a sign of faulty hardware or BIOS, it raised an interesting question: should USB be disabled on a server? For most web servers USB is not in use* anyway. Is there harm in having those modules loaded? First of all, they take up a little memory. More importantly, on some motherboards (like in this example) they might share an IRQ number with another device, and that is bad. It means that each time such an interrupt happens, both interrupt handlers are called and each has to check whether the interrupt was meant for it. That wastes CPU cycles. It might not be as bad as in this case (as this one was caused by some hardware issue), but it is still a waste.
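To see whether a given server has devices sharing an IRQ line at all, the last column of /proc/interrupts can be scanned for rows listing more than one handler. This is a sketch only, assuming the older layout shown in this post, where a single interrupt-type token (such as IO-APIC-level) precedes a comma-separated handler list; newer kernels print extra columns, so treat the parsing as illustrative. shared_irqs is a made-up name.

```python
def shared_irqs(text: str) -> dict:
    """Return {irq: [handlers]} for numbered IRQ rows with more than one handler."""
    lines = [line for line in text.splitlines() if line.strip()]
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        irq = label.strip()
        if not irq.isdigit():
            continue  # skip NMI, LOC, RES and similar summary rows
        tokens = rest.split()
        skipped = 0
        # Drop the per-CPU counters.
        while tokens and skipped < cpu_count and tokens[0].isdigit():
            tokens.pop(0)
            skipped += 1
        # Assumption: tokens[0] is the interrupt type, the rest are handlers.
        handlers = [name.strip()
                    for name in " ".join(tokens[1:]).split(",") if name.strip()]
        if len(handlers) > 1:
            shared[irq] = handlers
    return shared
```

Run against the first listing in this post, it reports IRQ 217 (ehci_hcd:usb1 with uhci_hcd:usb2) and IRQ 225 (uhci_hcd:usb3 with ata_piix); only the second pairs USB with an unrelated device.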
It makes a lot of sense to disable any hardware not in use – it might give some extra breathing space for the server.
* If you use a KVM (keyboard/video/mouse) device for console access, it might rely on USB, so disabling USB is not recommended in that case.
