In server monitoring with Nagios, nobody likes to get paged any more than necessary. This article will show you how to prevent multiple “CHECK_NRPE: Socket timeout after 10 seconds” alerts every time a host goes down.
In this circumstance, I’m not trying to get NRPE working. I’m trying to shut it up when there is an outage.
Every time a host goes down due to network issues, Nagios sends a “host down” alert, which pages my phone. Nagios also sends one “CHECK_NRPE: Socket timeout after 10 seconds” alert for each remote service check on the box (check hard drive space, check processes, check zombies, check apt, check load, etc.).
If I’m watching 10 services on a single box, this is a total of 11 pages on my phone when that box is down. When you are monitoring a lot of systems, this can get out of control very fast.
It would be better if, when a host goes down, only the ping host check on Nagios alerted me that the host is down, and the other NRPE checks on the box stayed quiet. I know the box is down, so I don’t expect those other checks to succeed.
The solution
In generic-service.cfg where we configure our service templates, create a new service template (copying generic-service if you want). Call it nrpe-service.
In the notification_options of the new nrpe-service template, remove the u so that Nagios does not notify on “UNKNOWN” states.
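As a rough sketch of what such a template might look like: the definition below inherits from generic-service instead of copying it, and it assumes the original notification_options were w,u,c,r. Both are assumptions, so check your own generic-service definition, since defaults vary between installs.
define service{
        name                    nrpe-service      ; template name referenced by "use nrpe-service"
        use                     generic-service   ; inherit all other settings (assumes this template exists)
        notification_options    w,c,r             ; assumed original was w,u,c,r; the u (UNKNOWN) is dropped
        register                0                 ; this is a template, not a real service
        }
Inheriting keeps the two templates in sync if you later change generic-service. Copying the whole definition, as described above, works just as well.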
For each of your services that uses NRPE, change the definition to “use nrpe-service”:
define service{
host_name ipv4.securespot.net
check_command check_nrpe_1arg!check_disk
use nrpe-service
service_description Check Disk
}
Then we need to modify the call to check_nrpe and add a -u option. This is from the check_nrpe help page:
-u = Make socket timeouts return an UNKNOWN state instead of CRITICAL
On my box the command definition is found in /etc/nagios-plugins/config/check_nrpe.cfg:
define command{
command_name check_nrpe
command_line /usr/lib/nagios/plugins/check_nrpe -u -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}
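Note that the service example earlier calls check_nrpe_1arg, not check_nrpe. On Debian-style packages that command is typically defined in the same file, and it would need the -u flag too. The following is a sketch that assumes the stock definition with no extra arguments, so compare it against your own file before changing anything:
# Assumed stock check_nrpe_1arg definition, with -u added
define command{
        command_name    check_nrpe_1arg
        command_line    /usr/lib/nagios/plugins/check_nrpe -u -H $HOSTADDRESS$ -c $ARG1$
        }
After editing either file, it is worth validating the configuration (for example with the Nagios -v option against your main config file) before reloading.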
So we have told Nagios that for these types of services, we do not want to alert on UNKNOWN states. Then we told check_nrpe to make socket timeouts return an UNKNOWN state instead of CRITICAL.
Here is a reference for those Nagios options:
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
Now when a host goes down, I get only one alert, for the ping host check!
Good luck and happy monitoring!