How to prevent multiple “CHECK_NRPE: Socket timeout after 10 seconds” alerts

Written by

Nagios Monitor

In server monitoring with Nagios, nobody likes to get paged any more than necessary. This article will show you how to prevent multiple “CHECK_NRPE: Socket timeout after 10 seconds” alerts every time a host goes down.

In this case, I’m not trying to get NRPE working. I’m trying to shut it up during an outage.

Every time a host goes down due to network issues, Nagios sends a “host down” alert, which pages my phone. Nagios also sends one “CHECK_NRPE: Socket timeout after 10 seconds” alert for each remote service check on the box (hard drive space, processes, zombies, apt, load, etc.).

If I’m watching 10 services on a single box, this is a total of 11 pages on my phone when that box is down. When you are monitoring a lot of systems, this can get out of control very fast.

It would be better if, when a host goes down, only the ping host check alerted me and the other NRPE checks on that box stayed quiet. I already know the box is down, so I don’t expect those other checks to succeed.

The solution

In generic-service.cfg, where the service templates are configured, create a new service template called nrpe-service (copy generic-service as a starting point if you like).

In the new nrpe-service template, remove the u from notification_options so that Nagios does not notify on “UNKNOWN”:

notification_options w,c,r
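
As a rough sketch, the whole template could look like the block below. It assumes your existing template is named generic-service, and it inherits from that template with the use directive instead of copying it line by line. Adjust the names to match your own setup.

```
# Hypothetical nrpe-service template; assumes a generic-service template already exists.
define service {
    name                  nrpe-service      ; name that services reference with "use nrpe-service"
    use                   generic-service   ; inherit all other settings from the existing template
    notification_options  w,c,r             ; no "u", so UNKNOWN states do not notify
    register              0                 ; this is a template, not a real service
}
```

Inheriting keeps the two templates in sync if you later change generic-service. Copying works just as well if you prefer the templates to be independent.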

On each of your services that use nrpe, change them to “use nrpe-service”:

define service {
    host_name            ipv4.securespot.net
    check_command        check_nrpe_1arg!check_disk
    use                  nrpe-service
    service_description  Check Disk
}

Next, modify the check_nrpe command definition and add the -u option. This is from the check_nrpe help output:
-u = Make socket timeouts return an UNKNOWN state instead of CRITICAL

On my box this is found in /etc/nagios-plugins/config/check_nrpe.cfg:

define command {
    command_name  check_nrpe
    command_line  /usr/lib/nagios/plugins/check_nrpe -u -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}
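
Note that the example service above calls check_nrpe_1arg, not check_nrpe. On Debian-style installs that command is usually defined in the same file, and it needs the -u option as well. Here is a sketch, assuming the stock definition and the same plugin path as above:

```
# Assumed stock definition of check_nrpe_1arg, with -u added.
# Check your own check_nrpe.cfg; the path and arguments may differ.
define command {
    command_name  check_nrpe_1arg
    command_line  /usr/lib/nagios/plugins/check_nrpe -u -H $HOSTADDRESS$ -c $ARG1$
}
```

After editing either command, reload Nagios so the new definitions take effect.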

So we have told Nagios not to notify on UNKNOWN states for these services, and we have told check_nrpe to return UNKNOWN instead of CRITICAL on socket timeouts. A genuine CRITICAL result, such as a nearly full disk, still notifies as before.

Here is a reference for those Nagios options:
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html

Now when a host goes down, I get only one alert, from the ping host check!

Good luck and happy monitoring!


Comments

10 responses to “How to prevent multiple “CHECK_NRPE: Socket timeout after 10 seconds” alerts”

June 14, 2012

Chris

Brilliant, I was looking for that “-u” option in the nrpe command for ages!

July 22, 2012

marius

Thanks man , you saved my day !!

August 8, 2012

john

Greetings friends, I’m trying to monitor a disk on an external server (/dev/sda1). How do I do that? I have tried the check_ssh command and it gives me an error. Can you please help?
thanks

1. August 11, 2012
  
  Chris Carey
  
  This is basic Nagios configuration. You should be installing nrpe server and plugins on the remote machine. Then on your Nagios server you use check_nrpe to call a remote check_disk on the remote server.
  
September 13, 2012

Sven

What I don’t understand is what happens if, e.g., my NFS share, which I check with NRPE, goes down.

Do I get any alert then? NFS shares are vital to us, so I (or the management) would like to be informed immediately after they go down.

February 12, 2013

Alvin_Pogi

Does changing to -u mean that all critical states from NRPE are no longer treated as critical, so I will not be notified of these critical errors? Unknown errors are also not notified, so are both critical and unknown errors not notified? Kinda confused. Thanks for your reply.

1. March 20, 2013
  
  Chris Carey
  
  The -u option changes the nrpe plugin status from CRITICAL to UNKNOWN only in the case of “Socket timeout after n seconds”. You can still notify on UNKNOWN if you want to. That’s in the notification options for your contact! This gives you more granularity, since you can now distinguish between an explicit CRITICAL, such as 1% disk free, and an “UNKNOWN” state, where Nagios is simply unable to reach the nrpe service and we have no idea how full the disk is.
  
October 22, 2013

Richard

Thanks for posting this! This is exactly what I want!!

June 2, 2016

gabriel

What about other plugins, like check_http? If there is a socket timeout, it shouts critical, I would prefer unknown. It even ignores the max_check_attempts configuration, which I have set to 3 attempts before shouting the CRITICAL event. Please let me know, if you have any idea how to handle this. Thanks.

1. June 11, 2016
  
  Chris Carey
  
  For check_http you can use the -t flag with optional “timeout state”:
  
  -t, --timeout=INTEGER:<timeout state>
  Seconds before connection times out (default: 10)
  Optional “:<timeout state>” can be a state integer (0,1,2,3) or a state STRING
  
  You might want to also check the service_check_timeout_state configuration option where you can change the default state that is returned when a service check times out.
  https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/configmain.html
  