Archive for September, 2008

650i Ultra crashing under heavy I/O load

Friday, September 5th, 2008

I have an EVGA NF66 motherboard based on the nVidia 650i Ultra chipset that I was configuring to use as a VMware host / iSCSI SAN, but every time I started to stress test it, it would hang in a very unceremonious fashion and corrupt the hard drives. Very frustrating, I can assure you.

After some investigation, I found that the BIOS on the board puts almost all the interrupts on IRQ 10 and 11 with no way of moving them around. Some testing showed me that IRQ 11 was being shared by the built-in Gigabit NIC and the second SATA controller. When I copied data from the network to the drives on that second SATA controller, everything would come to a screeching halt after about an hour. Installing a PCI NIC or SATA controller didn’t help either, as all the PCI buses get assigned to the same two IRQs. What a stupid BIOS design. I guess they don’t expect people to use more than a single hard drive, or if they do and the board hangs, they’ll just “blame it on Windows.”
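One common way to spot shared interrupts on Linux is to look for lines in /proc/interrupts that list more than one device. The following is only a sketch of that idea, not necessarily how the testing above was done; the function name shared_irqs is made up for illustration, and it assumes devices sharing a line are separated by commas, as the kernel normally prints them.

```shell
#!/bin/sh
# Print every numbered IRQ line that names more than one device.
# $1 is an optional path to an interrupts table; it defaults to
# /proc/interrupts. (shared_irqs is a hypothetical helper name.)
shared_irqs() {
    # Numbered IRQ lines start with "<number>:"; a comma on such a
    # line means several devices are listed for that interrupt.
    awk '$1 ~ /^[0-9]+:$/ && /,/' "${1:-/proc/interrupts}"
}
```

Running `shared_irqs` with no argument reads the live table. On a board in APIC mode the devices may show up on separate lines even though the BIOS routes them to the same legacy IRQ, so an empty result does not prove there is no sharing.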

Never one to give up without a fight, I decided to give it one more weekend of testing before I tossed it in the scrap bin. I’m glad I did. As it turns out, the board operates just fine as long as all the interrupts are handled by a single CPU. Linux’s ‘irqbalance’ service was moving the IRQs between the cores on my dual-core CPU, and this was causing the board to lose interrupts and hang. Disabling irqbalance and forcing all interrupts to a single CPU core seems to have solved the problem. Knock on wood.

# /sbin/chkconfig irqbalance off
# /sbin/service irqbalance stop
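Stopping irqbalance leaves each interrupt wherever the service last placed it, so pinning everything to one core may need an explicit write to each smp_affinity file. Here is a minimal sketch of that step, assuming the standard /proc/irq layout; the function name pin_all_irqs and its optional directory argument are hypothetical, added so the loop can be tried against a scratch directory first.

```shell
#!/bin/sh
# Write one affinity mask to every IRQ under a /proc/irq-style tree.
# $1: hexadecimal mask (1 means CPU0 only)
# $2: optional base directory, defaulting to /proc/irq
# (pin_all_irqs is a hypothetical helper name.)
pin_all_irqs() {
    mask=$1
    base=${2:-/proc/irq}
    for f in "$base"/*/smp_affinity; do
        # Skip the unexpanded pattern if the directory has no IRQs.
        [ -e "$f" ] || continue
        # Some interrupts cannot be moved; report those and carry on.
        echo "$mask" 2>/dev/null > "$f" || echo "skipped $f" >&2
    done
}
```

As root, `pin_all_irqs 1` would send every movable interrupt to CPU0.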

This of course will slow down interrupt handling slightly, but that is a fair tradeoff for my situation. If I were feeling ambitious, I could always manually move some of the interrupts off to the second CPU by adding something like the following to the /etc/rc.local file…

# echo 2 > /proc/irq/209/smp_affinity
# echo 2 > /proc/irq/223/smp_affinity

This would set my NIC and second SATA controller (IRQ 11, mapped to APIC IRQs 209 and 223 respectively) to have their interrupt requests processed by CPU1 instead of CPU0. The smp_affinity setting is a hexadecimal bitmask in which bit 0 (value 1) is CPU0, bit 1 (value 2) is CPU1, and so on. If you wanted CPU2, use ‘echo 4’; use ‘echo 8’ for CPU3, or ‘echo 6’ for CPU1 and CPU2 together. By doing so, I could manually balance the interrupts across all CPUs to gain a bit of speed, but at this point I’m happy to just have the box working properly and I’m not terribly worried about squeezing the last bit of performance out of it.
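Because the mask is just one bit per CPU, it can be computed rather than worked out by hand. The helper below is an illustrative sketch (cpu_mask is a made-up name, not a standard tool) that takes zero-based CPU numbers and prints the hexadecimal value to write into smp_affinity.

```shell
#!/bin/sh
# Print the hexadecimal smp_affinity mask for the zero-based CPU
# numbers given as arguments (CPU0 is bit 0, CPU1 is bit 1, ...).
# (cpu_mask is a hypothetical helper name.)
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        # Set the bit that corresponds to this CPU number.
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}
```

For example, `cpu_mask 1` prints 2 and `cpu_mask 1 2` prints 6, so `echo "$(cpu_mask 1)" > /proc/irq/209/smp_affinity` would select CPU1 for that interrupt.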

Using Wake-on-LAN with the forcedeth driver

Monday, September 1st, 2008

I use Wake-On-LAN quite a bit to boot my remote machines when I need access to do something on them or they have died after a prolonged power failure. OK, really I’m just too lazy to walk down to the basement and flip the power switch <rolleyes>.

One machine I have with an nVidia NIC in it was giving me fits. Sometimes a WOL packet would wake it up and other times it would just sit there daring me to get off my butt and press the power switch. Finally, after a bit of research, I stumbled upon this article that describes setting up Wake-On-LAN in Linux. Now that I’ve configured the machine correctly, it taunts me no more.

Essentially, the nVidia NIC defaults to disabling WOL after every reboot. You need to add an entry to your /etc/rc.local file to tell it to re-enable WOL, or the next time you shut down, you won’t be able to start it remotely.

# echo "/sbin/ethtool -s eth0 wol g" >> /etc/rc.local
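To confirm the setting took effect after a reboot, the current Wake-on value can be read back from ethtool. This is a small sketch, assuming the usual `ethtool <interface>` report with a "Wake-on:" line; wol_state is a hypothetical helper name, and eth0 is the interface used above.

```shell
#!/bin/sh
# Read `ethtool <iface>` output on stdin and print the current
# Wake-on setting: "g" means wake on magic packet, "d" means disabled.
# (wol_state is a hypothetical helper name.)
wol_state() {
    # Match only the "Wake-on:" line, not "Supports Wake-on:".
    awk -F': *' '/^[[:space:]]*Wake-on:/ { print $2 }'
}
```

Running `/sbin/ethtool eth0 | wol_state` as root should print g once the rc.local entry has run; a d means the NIC has WOL disabled again.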