Category: Linux

Nagios plugin: check_snmp_ifstatus.pl (redux)

Some time ago I posted a Nagios plugin I had knocked together for monitoring network interfaces using SNMP. Much to my surprise, a number of people have found it useful and suggested some enhancements. I’ve finally taken some time to implement those changes, and here are the results:

# ./check_snmp_ifstatus.pl --help

Usage: check_snmp_ifstatus.pl -H hostaddress -i "if_description" [-b if_max_bps] [-C snmp_community] [-v 1|2] [-p snmp_port] [-w warn] [-c crit]

Options:
  -H --host STRING or IPADDRESS
      IP Address of host to check. Required.
  -i --interface STRING
      Full name or numeric index number of the interface to check. Required.
      Examples would be "eth0", "FastEthernet0/1", 1 or 65539
  -C --community STRING
      SNMP Community string. Optional. Defaults to 'public'
  -v --version INTEGER
      SNMP Version ( 1 or 2 ). Optional. Defaults to 1
  -p --port INTEGER
      SNMP port. Optional. Defaults to 161
  -b --bandwidth INTEGER
      Interface maximum speed in bits per second. Optional.
      Use this to override the value returned by SNMP if it lies about
      the max speed of the interface.
  -6 --64bit
      Use 64bit counters for bandwidth usage. Not available on all devices.
  -w --warning INTEGER
      % of bandwidth usage necessary to result in warning status. Optional.
      Defaults to 85%
  -c --critical INTEGER
      % of bandwidth usage necessary to result in critical status. Optional.
      Defaults to 98%

The major difference in this new version is that it now lets you give the ifIndex number of the interface you wish to test directly, rather than relying on the device supporting the ifDescr OID. This means you can now use it to test interfaces on Microsoft Windows computers and on devices that don’t have a unique description for each interface. For example, if an snmpwalk of the device showed “IF-MIB::ifIndex.1 = INTEGER: 1” and “IF-MIB::ifDescr.1 = STRING: eth0”, you could refer to this interface as either ‘-i eth0’ or ‘-i 1’. On an MS Windows computer, you would typically use ‘-i 65539’ for the first LAN interface.
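To find the ifIndex for a given description, walk IF-MIB::ifDescr on the device (for example `snmpwalk -v 1 -c public 10.0.0.1 IF-MIB::ifDescr`) and look for the matching line. Below is a minimal bash sketch of that lookup. The function name is hypothetical, and it assumes net-snmp's default `IF-MIB::ifDescr.N = STRING: name` output format. If a device has duplicate descriptions it prints every matching index, which is exactly the case where you would pass the number to -i instead.

```shell
# Hypothetical helper: read `snmpwalk ... IF-MIB::ifDescr` output on stdin
# and print the ifIndex of every interface whose description equals $1.
# Exits non-zero if nothing matches.
ifindex_for_descr() {
    awk -v want="$1" '
        # lines look like: IF-MIB::ifDescr.2 = STRING: eth0
        /ifDescr\./ {
            idx = $1; sub(/.*ifDescr\./, "", idx)      # keep the trailing index
            d = $0;   sub(/^[^=]*= STRING: */, "", d)  # keep the description
            if (d == want) { print idx; found = 1 }
        }
        END { exit (found ? 0 : 1) }'
}
```

Usage: `snmpwalk -v 1 -c public 10.0.0.1 IF-MIB::ifDescr | ifindex_for_descr eth0`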

You may also notice that I have added some preliminary support for reading the 64bit counters on devices that support this. Please note that this is entirely untested, as I do not have any devices to test it on. If you do try this out and it works for you, please let me know.
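Some background on why the 64bit counters matter: a bps figure like the plugin's RX/TX values comes from two octet-counter readings taken some seconds apart. The standard ifInOctets/ifOutOctets counters are 32 bits wide and wrap after 2^32 bytes, which on a saturated gigabit link happens roughly every 34 seconds; the ifHCInOctets/ifHCOutOctets counters are the 64bit equivalents. Counter64 values cannot be carried in SNMPv1, so the -6 option will most likely need -v 2 as well. Here is a minimal bash sketch of the rate calculation with 32bit wrap correction. The function name and arguments are hypothetical, and this is not the plugin's own Perl code.

```shell
# Hypothetical helper: bits per second between two octet-counter samples.
# Arguments: old_octets new_octets elapsed_seconds [counter_bits, default 32]
octets_to_bps() {
    local old=$1 new=$2 secs=$3 bits=${4:-32} delta
    (( secs > 0 )) || return 1
    if (( new < old )); then
        # The counter wrapped between samples. Bash arithmetic is signed
        # 64-bit, so only a 32-bit wrap can be corrected here.
        (( bits == 32 )) || return 1
        delta=$(( new + 4294967296 - old ))
    else
        delta=$(( new - old ))
    fi
    echo $(( delta * 8 / secs ))
}
```

For example, `octets_to_bps 1000 4750 30` prints 1000 (3750 bytes over 30 seconds).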

Installing the plugin is very straightforward. First, download the check_snmp_ifstatus_v2.tar file and extract check_snmp_ifstatus.pl from it.

# tar -xf check_snmp_ifstatus_v2.tar

Copy the check_snmp_ifstatus.pl file to your /usr/lib/nagios/plugins/contrib folder.
Add a command to reference the plugin in your “/etc/nagios/objects/commands.cfg” file, like so…

define command {
    command_name check_snmp_ifstatus
    command_line /usr/bin/perl $USER1$/contrib/check_snmp_ifstatus.pl -H $HOSTADDRESS$ -i $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$
}

Add a service entry for each host you want to check in your “/etc/nagios/objects/services.cfg” file, like so…

define service{
    use                     generic-service
    check_command           check_snmp_ifstatus!"Ethernet0/0"!85!98
    service_description     Network Status: e0/0
    normal_check_interval   5
    retry_check_interval    1
    host_name               cisco1
}

define service{
    use                     generic-service
    check_command           check_snmp_ifstatus!"eth0"!50!90!-b 1000000000
    service_description     Network Status: eth0
    normal_check_interval   1
    retry_check_interval    1
    host_name               linux1,linux2
}

define service{
    use                     generic-service
    check_command           check_snmp_ifstatus!65539!25!80
    service_description     Network Status: LAN
    normal_check_interval   1
    retry_check_interval    1
    host_name               winserver1
}
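Nagios splits each check_command on ‘!’ and substitutes the pieces into $ARG1$, $ARG2$, and so on in the command definition; any $ARGn$ that a service doesn't supply expands to nothing. If you want to see what one of the service entries above will actually run, the hypothetical bash helper below mimics that substitution. It is only a sketch for eyeballing argument order, not part of Nagios, and you pass in the host address and $USER1$ value yourself.

```shell
# Hypothetical helper: expand a check_command string into the command line
# produced by a command_line template.
# Arguments: check_command host_address user1_path command_line_template
expand_check_command() {
    local check=$1 host=$2 user1=$3 tmpl=$4 out i
    local -a parts
    # parts[0] is the command name, parts[1..] become $ARG1$, $ARG2$, ...
    IFS='!' read -r -a parts <<< "$check"
    out=${tmpl//'$HOSTADDRESS$'/"$host"}
    out=${out//'$USER1$'/"$user1"}
    for i in 1 2 3 4 5 6 7 8 9; do
        out=${out//"\$ARG$i\$"/"${parts[i]}"}   # unset parts expand to ""
    done
    # drop trailing blanks left behind by unused $ARGn$ macros
    printf '%s\n' "${out%"${out##*[! ]}"}"
}
```

With the command definition above and $USER1$ set to /usr/lib/nagios/plugins, the winserver1 entry expands to `... check_snmp_ifstatus.pl -H <address> -i 65539 -w 25 -c 80`.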

Now verify that you haven’t made any mistakes in the nagios configuration files…

# nagios -v nagios.cfg

And restart the nagios service if all looks well…

# /sbin/service nagios restart

You should now see new service entries for your hosts listing the current interface status. Enjoy.

650i Ultra crashing under heavy I/O load

I have an EVGA NF66 motherboard, based on the nVidia 650i Ultra chipset, that I was configuring for use as a VMware host / iSCSI SAN, but every time I started to stress test it, it would hang in a very unceremonious fashion and corrupt the hard drives. Very frustrating, I can assure you.

After some investigation, I found that the BIOS on the board puts almost all the interrupts on IRQ 10 and 11, with no way of moving them around. Some testing showed me that IRQ 11 was shared by the built-in Gigabit NIC and the second SATA controller. Whenever I copied data from the network to the drives on that second SATA controller, everything would come to a screeching halt after about an hour. Installing a PCI NIC or SATA controller didn’t help either, as all the PCI buses get assigned to the same two IRQs. What a stupid BIOS design. I guess they don’t expect people to use more than a single hard drive, or if they do and the board hangs, they’ll just “blame it on Windows.”

Never one to give up without a fight, I decided to give it one more weekend of testing before I tossed it in the scrap bin. I’m glad I did. As it turns out, the board operates just fine as long as all the interrupts are handled by a single CPU. Linux’s ‘irqbalance’ service was moving the IRQ between the cores on my dual-core CPU and this was causing the board to lose interrupts, thus causing the hangs. Disabling irqbalance and forcing all interrupts to a single CPU core seems to have solved the problem. Knock on wood.

# /sbin/chkconfig irqbalance off
# /sbin/service irqbalance stop
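To check whether an interrupt really is being handled by a single core, compare the per-CPU columns of /proc/interrupts between two readings: if only one column keeps growing, the IRQ is effectively pinned. The helper below is a hypothetical bash sketch that prints those columns for one IRQ number; it assumes the usual layout of a CPU header row followed by `NNN:` rows.

```shell
# Hypothetical helper: print per-CPU interrupt counts for one IRQ,
# reading /proc/interrupts-formatted text on stdin.
irq_cpu_counts() {
    awk -v irq="$1:" '
        # header row: CPU0 CPU1 ...
        NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) cpu[i] = $i; next }
        $1 == irq {
            out = ""
            # column i+1 holds the count for the i-th CPU in the header
            for (i = 1; i <= ncpu; i++) out = out (i > 1 ? " " : "") cpu[i] "=" $(i + 1)
            print out; found = 1
        }
        END { exit (found ? 0 : 1) }'
}
```

Usage: `irq_cpu_counts 209 < /proc/interrupts`, run a few seconds apart.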

This of course will slow down interrupt handling slightly, but that is a fair tradeoff for my situation. If I were feeling ambitious, I could always manually move some of the interrupts off to the second CPU by adding something like the following to the /etc/rc.local file…

# echo 2 > /proc/irq/209/smp_affinity
# echo 2 > /proc/irq/223/smp_affinity

This would set my NIC and second SATA controller (IRQ 11, mapped to APIC IRQs 209 and 223 respectively) to have their interrupt requests processed by CPU1 instead of CPU0. The smp_affinity setting is a hexadecimal bitmask with one bit per CPU: 1 selects CPU0, 2 selects CPU1, 4 selects CPU2, 8 selects CPU3, and 6 selects both CPU1 and CPU2. By doing so, I could manually balance the interrupts across all CPUs to gain a bit of speed, but at this point I’m happy just to have the box working properly, and I’m not terribly worried about squeezing the last bit of performance out of it.
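Working out the mask by hand gets fiddly beyond a couple of cores. A small bash sketch that builds the hex value from a list of CPU numbers (the function name is hypothetical; CPU0 is bit 0, CPU1 is bit 1, and so on):

```shell
# Hypothetical helper: hex value for /proc/irq/N/smp_affinity
# built from a list of CPU numbers.
cpu_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
    done
    printf '%x\n' "$mask"               # smp_affinity expects hex
}
```

For example, `cpu_mask 1` prints 2 and `cpu_mask 1 2` prints 6, so as root `cpu_mask 1 > /proc/irq/209/smp_affinity` would move IRQ 209 to CPU1.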

Using Wake-on-LAN with the forcedeth driver

I use Wake-On-LAN quite a bit to boot my remote machines when I need access to do something on them, or when they have died after a prolonged power failure. OK, really I’m just too lazy to walk down to the basement and flip the power switch <rolleyes>.

One machine I have with an nVidia NIC in it was giving me fits. Sometimes a WOL packet would wake it up, and other times it would just sit there daring me to get off my butt and press the power switch. Finally, after a bit of research, I stumbled upon this article that describes setting up Wake-On-LAN in Linux. Now that I’ve configured the machine correctly, it taunts me no more.

Essentially, the nVidia NIC defaults to disabling WOL after every reboot. You need to add an entry to your /etc/rc.local file to tell it to re-enable WOL, or the next time you shut down, you won’t be able to start it remotely.

# echo "/sbin/ethtool -s eth0 wol g" >> /etc/rc.local
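To confirm the rc.local entry took effect after a reboot, check the ‘Wake-on:’ line in the output of `ethtool eth0`: ‘g’ means wake on magic packet (the mode set above) and ‘d’ means disabled. Here is a small bash sketch that extracts it (the function name is hypothetical). It deliberately skips the ‘Supports Wake-on:’ line, which lists what the NIC can do rather than what is currently set.

```shell
# Hypothetical helper: print the current Wake-on mode from `ethtool IFACE`
# output on stdin. Exits non-zero if no "Wake-on:" line is present.
wol_mode() {
    # "Supports Wake-on:" has "Supports" as its first field, so it is skipped
    awk '$1 == "Wake-on:" { print $2; found = 1 } END { exit (found ? 0 : 1) }'
}
```

Usage: `/sbin/ethtool eth0 | wol_mode` should print g once the rc.local line has run.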

Nagios plugin: check_cpu.sh and check_mem.sh

I have written a couple more Nagios plugins for use with NRPE on Linux machines. The first one, “check_cpu.sh,” grabs the CPU state from “/proc/stat” and sends back a status result and perfdata. You can tell it to send back either the aggregate data for all CPUs as a single total, or the data for each CPU individually. Be aware, though, that if you want it to send data on all CPUs, you will need to patch Nagios to allow for a larger perfdata return buffer. I didn’t want to mess with that, so I just have it watch the aggregate data.
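The usual way to turn /proc/stat into a utilisation figure is to take two readings of the aggregate ‘cpu’ line a few seconds apart and see what fraction of the elapsed time was idle. Here is a minimal bash sketch of that calculation, not the check_cpu.sh script itself. It assumes idle + iowait count as idle time and sums only the first eight counters (user through steal), because the guest columns on newer kernels are already included in user and nice. The function name is hypothetical.

```shell
# Hypothetical helper: integer % busy between two aggregate "cpu" lines
# from /proc/stat (arguments: earlier line, later line).
cpu_busy_pct() {
    local -a a b
    local i total=0 idle
    read -r -a a <<< "$1"
    read -r -a b <<< "$2"
    # indices 1..8 = user nice system idle iowait irq softirq steal
    for ((i = 1; i <= 8 && i < ${#b[@]}; i++)); do
        total=$(( total + b[i] - a[i] ))
    done
    # idle time = idle (index 4) + iowait (index 5)
    idle=$(( b[4] + b[5] - a[4] - a[5] ))
    if (( total <= 0 )); then
        echo 0
        return
    fi
    echo $(( 100 * (total - idle) / total ))
}
```

Usage: `s1=$(grep '^cpu ' /proc/stat); sleep 5; s2=$(grep '^cpu ' /proc/stat); cpu_busy_pct "$s1" "$s2"`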


The second plugin, “check_mem.sh,” will parse the output of “free -mt” to give you a look at the current memory and swap utilization.
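Here is a comparable minimal sketch of the free parsing, again not the actual check_mem.sh. It reads the Mem: and Swap: rows (total in column 2, used in column 3) and prints integer percentages. On older procps versions the Mem: ‘used’ figure includes buffers and cache, so you may prefer the ‘-/+ buffers/cache’ row there. The function name is hypothetical.

```shell
# Hypothetical helper: read `free -m` / `free -mt` output on stdin and
# print the integer percentage of memory and swap in use.
mem_swap_pct() {
    awk '
        $1 == "Mem:"  { mem  = ($2 > 0) ? int($3 * 100 / $2) : 0 }
        $1 == "Swap:" { swap = ($2 > 0) ? int($3 * 100 / $2) : 0 }  # no swap -> 0
        END { printf "mem=%d%% swap=%d%%\n", mem, swap }'
}
```

Usage: `free -mt | mem_swap_pct`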


Nagios plugin: check_snmp_ifstatus.pl

While looking around for a Nagios plugin to monitor the Ethernet interfaces on my equipment, I just could not find one that did exactly what I wanted. So… I did exactly what OSS was designed for. I took a plugin with similar functionality as a starting point and rewrote it to operate the way I needed. This is the result of my rewrite of check_iftraffic.pl, found on NagiosExchange.

“check_snmp_ifstatus.pl” will take a host IP address and interface name and return the current status of the interface. The returned data includes the UP/DOWN status, the interface line speed, and the current RX/TX bps. This status is returned as both an OK/Warning/Critical description and performance data usable with rrdtool. I have also created a pnp4nagios template for use with the plugin so that the perfdata can be easily viewed.

# check_snmp_ifstatus.pl -H 10.0.0.1 -i "Ethernet0/0"
OK: Ethernet0/0 is UP at 10Mbps. RX=72.27Kbps (0.72%), TX=4.094Kbps (0.04%) | RXbps=72271;8500000;9800000;0;10000000 TXbps=4094;8500000;9800000;0;10000000 RXpct=0.72%;85;98;0;100 TXpct=0.04%;85;98;0;100 elapsed=30s;3435;;;
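The part after the ‘|’ is standard Nagios perfdata: label=value[unit];warn;crit;min;max. The thresholds are the -w/-c percentages applied to the interface speed, so 85% and 98% of 10000000 bps give the 8500000 and 9800000 shown above, and RXpct is simply 72271 / 10000000 ≈ 0.72%. If you want to pick a single value out of that line in a script, here is a minimal bash sketch. The function name is hypothetical, and it assumes labels contain no spaces or quotes.

```shell
# Hypothetical helper: print one numeric perfdata value (unit stripped)
# from a plugin output line read on stdin. Argument: the perfdata label.
perf_value() {
    local label=$1 line field
    local -a fields
    IFS= read -r line
    [[ $line == *"|"* ]] || return 1
    read -r -a fields <<< "${line#*|}"          # perfdata items, space separated
    for field in "${fields[@]}"; do
        if [[ $field == "$label="* ]]; then
            field=${field#*=}                   # drop "label="
            field=${field%%;*}                  # keep the value before warn/crit
            printf '%s\n' "${field%%[!0-9.]*}"  # strip a unit such as % or s
            return 0
        fi
    done
    return 1
}
```

Usage: `check_snmp_ifstatus.pl -H 10.0.0.1 -i "Ethernet0/0" | perf_value RXbps` would print 72271 for the output above.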


Creating an NFS share on Windows XP

Recently, on the Mythtv-users mailing list, Gabe Rubin posted a question asking why he was getting no thumbnails in his MythWeb view. Having already resolved that problem in my own environment, I told him that there was a bug in CIFS that showed up when using the MythTV UPnP code (which MythWeb does as of 0.21), and that he was better off using NFS if possible. I also offered to help him get NFS working using Microsoft Services for Unix (SFU). Silly me 🙂

I’d never previously set up SFU under Windows XP; I’d only ever used it under a ‘Server’ version of Windows. As it turns out, there is a bug (or is that a feature?) in the User Name Mapping (UNM) feature of SFU that causes it to fail on Windows XP if the username/password combination of the *nix account you want to map does not match the username/password combination of the Windows account it is mapped to. UNM is supposed to map any *nix user account to any Windows user account, regardless of username and password, and indeed, this is the way it works in Windows Server. Not so for Windows XP, though.

So, for anybody out there trying to get SFU working properly (the NFS Server portion in particular) on a Windows XP machine, here are step-by-step instructions for what worked for me in my test lab…

Use VNC to view your X11 console display

Although I do 99% of the remote administration on my Linux servers via an SSH terminal, it is sometimes desirable to connect to the root X11 display. Fortunately, the X server can do this itself by loading a VNC module.

Step 1: Add the following lines to your /etc/X11/xorg.conf

Section "Module"
  ...
  Load "vnc"
  ...
EndSection
Section "Screen"
  ...
  Option "SecurityTypes" "VncAuth"
  Option "UserPasswdVerifier" "VncAuth"
  Option "PasswordFile" "/root/.vnc/passwd"
  ...
EndSection

Step 2: Set a VNC password

# /usr/bin/vncpasswd
Password: ******
Verify: ******

Step 3: Open TCP port 5900 in the firewall

# echo --port=5900:tcp >> /etc/sysconfig/system-config-securitylevel

Step 4: Restart the X server with Ctrl-Alt-Backspace, or reboot the server if possible

Step 5: Connect with your favorite vncviewer client

Note: If you are running in a secure environment and would like to connect without a password, you can replace the three ‘Option’ lines in the ‘Screen’ section of /etc/X11/xorg.conf (from Step 1 above) with a single line, like so:

Section "Screen"
  ...
  Option "SecurityTypes" "None"
  ...
EndSection

Installing Linux with a GUI under MS Virtual PC

Since I haven’t been able to find a version of VMware Server that works properly with Vista as the host system, I have fallen back to using MS Virtual PC 2007 to run the VMs on my desktop machine. This works well enough for Windows VMs, but installing Linux can be frustrating. The X-Windows installer always seems to want to come up in 24-bit colour mode, but Virtual PC thinks it’s in 16-bit colour mode. The result is a corrupt display and an installation that can’t continue. To work around this, remember to tell the Linux installer to run in either text mode or VESA compatibility mode.

Use text mode if you won’t be installing X-Windows…

boot: linux text

Use vesa compatibility mode if you want the full GUI experience…

boot: linux vesa

I’ve tried this with various Fedora/CentOS releases, but it should work with whatever your favorite flavour of Linux is.

VMWare Guest clocks running too fast

I use VMware Server on a couple of machines to run various Linux and Windows guest virtual machines. Depending on the CPU and the kind of power management enabled on the host machine, I can see some very severe clock drift on the guests. To work around the drift, I usually need to add a few items to the VMware host’s global config file.

For linux hosts, the config file can usually be found at…

/etc/vmware/config

For windows hosts, the config file location is…

C:\Documents and Settings\All Users\Application Data\VMware\VMware Server\config.ini

Open the global config file, add the following lines, then restart the VMware Server service/daemon…

host.cpukHz = 2000000
hostinfo.noTSC = TRUE
tools.syncTime = TRUE

The first line tells VMware the maximum clock frequency of the host CPU in kHz (2000000 kHz, i.e. 2 GHz, in my example). The ‘hostinfo.noTSC’ line informs VMware that the CPU is not running at a constant clock rate (SpeedStep, cpufreq, or other power management is active), so the time stamp counter (TSC) can’t be trusted and should be used as little as possible. The last line makes the VMware Tools time sync function the default. Adding these items is usually enough to keep the guest clocks close enough to proper time that ntp will stay locked.
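On Linux hosts with cpufreq loaded, the maximum frequency is already exposed in kHz in /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq, which is the unit host.cpukHz expects. Below is a small bash sketch that turns it into the three lines above. The function name is hypothetical; it assumes that sysfs file exists, and it accepts an alternate path as its argument.

```shell
# Hypothetical helper: print the three VMware config lines using the
# host's maximum CPU frequency from cpufreq (already in kHz).
vmware_clock_lines() {
    local f=${1:-/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq} khz
    [[ -r $f ]] || return 1
    khz=$(<"$f")
    [[ $khz =~ ^[0-9]+$ ]] || return 1   # refuse anything that isn't a plain number
    printf 'host.cpukHz = %s\nhostinfo.noTSC = TRUE\ntools.syncTime = TRUE\n' "$khz"
}
```

Usage (as root, after checking the output): `vmware_clock_lines >> /etc/vmware/config`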

Scripting remote execution with SSH

Today, while doing some maintenance on my MythTV master backend, I rebooted it, but forgot to restart the backend services on my slave backends. Of course this meant that I missed some recordings. This started me thinking about how I could automate the process of restarting the remote backend processes whenever I started the master backend process. The solution was to configure ssh to authenticate between the machines using a public/private keypair, rather than using password authentication. I could then write a script that would ssh to the slave machines and restart their mythbackend services.

Setting up the SSH authentication is simple. Let’s say I want to run a script from a machine named mythmbe and have it execute a command as the user mythtv on the remote machine named mythbe1.

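First, generate a private/public keypair using ssh-keygen; it will be stored in ~/.ssh. Leave the passphrase empty when prompted, so that scripts can use the key without anyone typing it in. Generate an RSA key rather than a DSA one, since current OpenSSH releases no longer accept DSA keys by default.

$ ssh-keygen -t rsa

Now copy the public key to every remote machine you want to ssh in to without supplying a password. ssh-copy-id appends the key to ~mythtv/.ssh/authorized_keys on the remote machine, rather than overwriting any keys that are already there…

$ ssh-copy-id mythtv@mythbe1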

You can now log in to the remote machine with ssh without needing to supply a password. The public key you copied to the remote machine will be tested against your private key to verify your identity whenever you try to connect. To test this, try something like…

$ ssh mythtv@mythbe1 /sbin/service mythbackend restart
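With key authentication in place, the restart script itself can be a simple loop over the slave backends. This is a minimal sketch, not the script I actually use: the function name, the slave host names, and the SSH variable (there only so the ssh command can be swapped out for testing) are all hypothetical. BatchMode=yes makes ssh fail instead of prompting for a password if the key isn't accepted, which is what you want in an unattended script. Restarting a system service normally requires root, so the mythtv account may need extra rights on the slaves.

```shell
# Hypothetical helper: restart mythbackend on each slave given as an argument.
# Returns non-zero if any host failed; set SSH to override the ssh binary.
restart_slaves() {
    local ssh_cmd=${SSH:-ssh} host rc=0
    for host in "$@"; do
        if ! "$ssh_cmd" -o BatchMode=yes "mythtv@$host" /sbin/service mythbackend restart; then
            echo "restart failed on $host" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: `restart_slaves mythbe1 mythbe2`, for example called right after the master backend starts.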