How to prevent multiple “CHECK_NRPE: Socket timeout after 10 seconds” alerts


In server monitoring with Nagios, nobody likes to get paged any more than necessary. This article will show you how to prevent multiple “CHECK_NRPE: Socket timeout after 10 seconds” alerts every time a host goes down.

In this circumstance, I’m not trying to get NRPE working. I’m trying to shut it up when there is an outage.

Every time a host goes down due to network issues, Nagios sends a “host down” alert, which pages my phone. Nagios also sends one “CHECK_NRPE: Socket timeout after 10 seconds” alert for each remote service check on the box (check hard drive space, check processes, check zombies, check apt, check load, etc.).

If I’m watching 10 services on a single box, this is a total of 11 pages on my phone when that box is down. When you are monitoring a lot of systems, this can get out of control very fast.

It would be better if, when a host goes down, only the ping host check on Nagios alerted me that the host is down, and the other NRPE checks on the box stayed quiet. I know the box is down, so I already expect that those other checks cannot run.

The solution

In generic-service.cfg where we configure our service templates, create a new service template (copying generic-service if you want). Call it nrpe-service.

In the notification_options of the new nrpe-service template, remove the u to prevent notifications on “UNKNOWN”:

notification_options            w,c,r
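
As a rough sketch, the whole template might look something like the following. It assumes a generic-service template is already defined and inherits every other setting from it instead of copying them; only the name nrpe-service and the notification_options line come from this article.

```cfg
# Sketch of a service template for NRPE-based checks.
# Assumption: a "generic-service" template already exists to inherit from.
define service {
  name                  nrpe-service     ; referenced by "use nrpe-service"
  use                   generic-service  ; inherit all other settings
  notification_options  w,c,r            ; WARNING, CRITICAL, RECOVERY; no "u" (UNKNOWN)
  register              0                ; template only, not a real service
}
```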

On each of your services that use nrpe, change them to “use nrpe-service”:

define service {
  host_name                         ipv4.securespot.net
  check_command                     check_nrpe_1arg!check_disk
  use                               nrpe-service
  service_description               Check Disk
}

Then we need to modify the call to check_nrpe and add a -u option. This is from the check_nrpe help page:
-u = Make socket timeouts return an UNKNOWN state instead of CRITICAL

On my box, the command definition is found in /etc/nagios-plugins/config/check_nrpe.cfg:

define command {
  command_name  check_nrpe
  command_line  /usr/lib/nagios/plugins/check_nrpe -u -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}
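
Note that the example service above calls check_nrpe_1arg rather than check_nrpe. If your check_nrpe.cfg defines that command separately, it would presumably need the -u flag as well. The definition below is a sketch that assumes the same plugin path as above and a single argument; compare it with what is actually in your file before changing anything.

```cfg
# Assumed definition of the one-argument variant; verify against your own check_nrpe.cfg
define command {
  command_name  check_nrpe_1arg
  command_line  /usr/lib/nagios/plugins/check_nrpe -u -H $HOSTADDRESS$ -c $ARG1$
}
```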

So we have told Nagios that for these types of services, we do not want to alert on UNKNOWN states. Then we told check_nrpe to make socket timeouts return an UNKNOWN state instead of CRITICAL.

Here is a reference for those Nagios options:
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html

Now when a host goes down, I get only 1 alert, from the ping host check!

Good luck and happy monitoring!



Comments

10 responses to “How to prevent multiple “CHECK_NRPE: Socket timeout after 10 seconds” alerts”

  1. Chris

    Brilliant, I was looking for that “-u” option in the nrpe command for ages!

  2. marius

    Thanks man , you saved my day !!

  3. john

    Greetings friends, I’m trying to monitor a disk on an external server (/dev/sda1). How do I do this? I tried the check_ssh command and it gives me an error. Can you please help?
    Thanks

    1. Chris Carey

      This is basic Nagios configuration. You should install the NRPE server and plugins on the remote machine. Then, on your Nagios server, use check_nrpe to call check_disk on the remote server.

  4. Sven

    What I don’t understand is what happens if, e.g., my NFS share, which I check with NRPE, goes down.

    Do I get any alert then? NFS shares are vital to us, so I (or the management) would like to be informed immediately after they go down.

  5. Alvin_Pogi

    Does changing to -u mean that all critical states of NRPE will no longer be considered critical, so you will not be notified of these critical errors? Also, unknown errors are not notified, so critical and unknown errors are both not notified? Kinda confused. Thanks for your reply.

    1. Chris Carey

      The -u option changes the NRPE plugin status from CRITICAL to UNKNOWN only in the case of “Socket timeout after n seconds”. You can still notify on UNKNOWN if you want to. That’s in the notification options for your contact! This gives you more granularity, since you can now distinguish between an explicit CRITICAL, such as 1% disk free, and an “UNKNOWN” state, where Nagios is simply not able to reach the NRPE service and we have no idea how full the disk is.

  6. Richard

    Thanks for posting this! This is exactly what I want!!

  7. gabriel

    What about other plugins, like check_http? If there is a socket timeout, it shouts critical, I would prefer unknown. It even ignores the max_check_attempts configuration, which I have set to 3 attempts before shouting the CRITICAL event. Please let me know, if you have any idea how to handle this. Thanks.

    1. Chris Carey

      For check_http you can use the -t flag with optional “timeout state”:

      -t, --timeout=INTEGER:<timeout state>
      Seconds before connection times out (default: 10)
      Optional ":<timeout state>" can be a state integer (0,1,2,3) or a state STRING

      You might also want to check the service_check_timeout_state configuration option, which lets you change the default state returned when a service check times out.
      https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/configmain.html
