650i Ultra crashing under heavy I/O load

I have an EVGA NF66 motherboard, based on the nVidia 650i Ultra chipset, that I was configuring as a VMware host / iSCSI SAN. Every time I started to stress test it, it would hang in a very unceremonious fashion and corrupt the hard drives. Very frustrating, I can assure you.

After some investigation, I found that the BIOS on the board puts almost all the interrupts on IRQ 10 and 11, with no way of moving them around. Some testing showed me that IRQ 11 was being shared by the built-in Gigabit NIC and the second SATA controller. When I copied data from the network to the drives on that second SATA controller, everything would come to a screeching halt after about an hour. Installing a PCI NIC or SATA controller didn’t help either, as all the PCI buses get assigned to the same two IRQs. What a stupid BIOS design. I guess they don’t expect people to use more than a single hard drive, or if they do and the board hangs, they’ll just “blame it on Windows.”
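If you want to check for the same kind of sharing on your own box, the kernel lists every handler attached to an IRQ in /proc/interrupts, separated by commas. Here is a minimal sketch that prints only the shared lines. The function name `shared_irqs` is my own invention, and the optional file argument is there only so it can be pointed at a saved copy.

```shell
# shared_irqs: hypothetical helper, not part of any distribution.
# Prints the numbered lines of /proc/interrupts (or of the file given
# as $1) that contain a comma, i.e. IRQs with more than one handler.
shared_irqs() {
    grep -E '^ *[0-9]+:.*,' "${1:-/proc/interrupts}"
}
```

Running `shared_irqs` with no argument reads the live table. No output means no numbered IRQ is currently shared.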

Never one to give up without a fight, I decided to give it one more weekend of testing before I tossed it in the scrap bin. I’m glad I did. As it turns out, the board operates just fine as long as all the interrupts are handled by a single CPU. Linux’s ‘irqbalance’ service was moving IRQs between the cores on my dual-core CPU, which caused the board to lose interrupts and hang. Disabling irqbalance and forcing all interrupts to a single CPU core seems to have solved the problem. Knock on wood.

# /sbin/chkconfig irqbalance off
# /sbin/service irqbalance stop
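Those two commands only stop the balancer. To make the pinning explicit instead of relying on wherever the IRQs were last left, a loop over /proc/irq can write the same mask to every IRQ. This is a minimal sketch, not something I have tuned: `pin_all_irqs` is a made-up name, the mask 1 assumes CPU0 is the core you want, and it must run as root.

```shell
# pin_all_irqs: hypothetical helper that writes one affinity mask to every IRQ.
#   $1 = hexadecimal CPU mask (1 = CPU0 only)
#   $2 = base directory, defaults to /proc/irq (overridable for dry runs)
pin_all_irqs() {
    mask=$1
    base=${2:-/proc/irq}
    for f in "$base"/*/smp_affinity; do
        # Skip the unexpanded glob if the directory has no IRQ entries.
        [ -e "$f" ] || continue
        # Some IRQs refuse affinity changes; report and carry on.
        echo "$mask" > "$f" 2>/dev/null || echo "skipped $f" >&2
    done
}
```

Calling `pin_all_irqs 1` as root would then point every IRQ that accepts the change at CPU0.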

This of course will slow down interrupt handling slightly, but that is a fair tradeoff for my situation. If I were feeling ambitious, I could always manually move some of the interrupts off to the second CPU by adding something like the following to the /etc/rc.local file…

# echo 2 > /proc/irq/209/smp_affinity
# echo 2 > /proc/irq/223/smp_affinity

This would set my NIC and second SATA controller (IRQ 11, mapped to APIC IRQs 209 and 223 respectively) to have their interrupt requests processed by CPU1 instead of CPU0. The smp_affinity setting is a hexadecimal bitmask in which bit 0 stands for CPU0, so ‘echo 1’ selects CPU0, ‘echo 2’ CPU1, ‘echo 4’ CPU2, ‘echo 8’ CPU3, and ‘echo 6’ both CPU1 and CPU2. By doing so, I could manually balance the interrupts across all CPUs to gain a bit of speed, but at this point I’m happy just to have the box working properly, and I’m not terribly worried about squeezing the last bit of performance out of it.
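Since the mask arithmetic is easy to get wrong, here is a small sketch that builds the value from a list of zero-based CPU numbers. The function name `cpus_to_mask` is hypothetical; it sets one bit per CPU and prints the result in hexadecimal, which is the form smp_affinity expects.

```shell
# cpus_to_mask: hypothetical helper that converts zero-based CPU numbers
# into the hexadecimal bitmask used by /proc/irq/N/smp_affinity.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        # Set the bit whose position matches the CPU number.
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 1      # CPU1 only: prints 2
cpus_to_mask 1 2    # CPU1 and CPU2: prints 6
```

The output can be fed straight into the echo commands, for example by writing the result of `cpus_to_mask 1` to /proc/irq/209/smp_affinity.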
