Getting nagios back up and running… again.

— SjG @ 12:42 pm

Nagios monitoring on one Centos 6.9 server seemed to have stopped working after an upgrade. All the tests showed status OK, but they hadn’t actually run in days.

Looking at the service details was weird, because the next scheduled check was about a minute in the past.

The Nagios help page wasn’t. And we’re running Core 4.3.x anyway, without a MySQL database.

The first clue was a bunch of lines in the event log:
Error: Could not open check result queue directory '/var/log/nagios/spool/checkresults' for reading.

Turns out we didn’t even have a /var/log/nagios/spool directory. Creating those directories helped. But Nagios still wouldn’t start from the usual startup scripts. Nothing in the main log. But then, another clue.

Who doesn’t love to see shit like this:

$ cat /var/log/nagios/nagios.configtest
ERROR: Errors in config files – see log for details: /var/log/nagios/nagios.configtest
$

So the startup script /etc/init.d/nagios searches for warnings, and aborts if they exist. It’s supposed to log them. For some reason it didn’t.

You can manually get those warnings and errors yourself by running

/usr/sbin/nagios -v /etc/nagios/nagios.cfg (adjust paths as appropriate).

I ended up with a bunch of warnings for deprecated parameters. So I went in and edited my config files to remove them or update them to the new equivalents. Oh yes. Software authors, please keep in mind: nothing pleases your users more than changing the names of variables in config files. We users live for this shit. When, oh when, will the author of our software next change “retry_check_interval” to “retry_interval”?

Fixing all of the warnings was not enough, though. The startup script gave the message “Starting nagios:” and then silently died. Well, sort of. It actually was starting now, but brokenly:

# ps aux | grep -i nagios
nagios 12610 0.0 0.0 12296 1220 ? Ss 16:36 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
nagios 12611 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12612 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12613 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12614 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12616 0.0 0.0 11780 520 ? S 16:36 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
root 12744 0.0 0.0 103328 876 pts/0 S+ 16:44 0:00 grep -i nagios

Starting directly from the command line worked:

# /usr/sbin/nagios -d /etc/nagios/nagios.cfg
# ps aux | grep nagios
nrpe 7282 0.0 0.0 41380 1340 ? Ss 16:51 0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios 8010 0.0 0.0 16404 1280 ? Ss 17:31 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
nagios 8011 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios –worker /var/spool/nagios/cmd/nagios.qh
nagios 8012 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios –worker /var/spool/nagios/cmd/nagios.qh
nagios 8013 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios –worker /var/spool/nagios/cmd/nagios.qh
nagios 8014 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios –worker /var/spool/nagios/cmd/nagios.qh
nagios 8015 0.0 0.0 15888 552 ? S 17:31 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
root 8018 0.0 0.0 100956 616 pts/0 S+ 17:31 0:00 tail -f /var/log/nagios/nagios.log
root 8037 0.0 0.0 103328 856 pts/1 S+ 17:32 0:00 grep nagios

When things get this weird, there are only two options. Well, three, if you include the “rm -rf /” option. But the other two are: 1) reboot and see if stuff magically starts working, or 2) see what SELinux is breaking.

# tail -f /etc/audit/audit.log
type=AVC msg=audit(1504137329.209:41): avc: denied { execute_no_trans } for pid=7731 comm=”nagios” path=”/usr/sbin/nagios” dev=dm-0 ino=1201464 scontext=unconfined_u:system_r:nagios_t:s0 tcontext=system_u:object_r:nagios_exec_t:s0 tclass=file
…

Yup. As expected, SELinux breaking stuff.

So, my preferred way to solve this kind of problem now is to snip out all relevant the AVC “denied” sections from the log into a single file (which I called audit.log), and then using audit2allow to create a new module. Since there’s already a nagios module (containing insufficient privileges), I created a nagios2 module:

# audit2allow -M nagios2 < audit.log # semodule -i nagios2.pp

Hooray! After a few iterations of this process (discovering other blocked operation, granting them permission, restarting nagios), everything was working but check_disk_smb, which was returning “results from smbclient not suitable” even as it worked fine when tested from the command-line as follows:

# su – nagios -s /bin/bash -c “/usr/lib64/nagios/plugins/check_disk_smb -H SMBHOST -s share -a 10.X.X.X -u nagios -p \”password\” -w 90 -c 95″
Disk ok – 16.71G (11%) free on \\SMBHOST\share | ‘share’=134108676096B;136850492620.8;144453297766.4;0;152056102912

Diving in and editing check_disk_smb to throw the actual error message, I found nagios getting a “ERROR: Could not determine network interfaces, you must use a interfaces config line” from smbclient. So I edited /etc/samba/smb.conf, and explicitly told samba which interfaces it had available:

interfaces = lo eth0 10.X.X.X/24

Le sigh. Now this error went away, and I got to go for another fun and challenging round of “find all the SMB operations that SELinux is breaking.” This time, I got tripped up the “dontaudits” — there were operations being blocked, but not logged. I was saved by TrevorH and sfix, helpful people in Freenode’s #centos IRC channel:

TrevorH: semodule -DB to disable dontaudit rules, stay permissive, recreate, use the audit log to generate a policy as per the wiki
13:14 TrevorH: @selinux
13:14 centbot: Useful resources for SELinux: http://wiki.centos.org/HowTos/SELinux | http://wiki.centos.org/TipsAndTricks/SelinuxBooleans | http://docs.fedoraproject.org/en-US/Fedora/13/html/Security-Enhanced_Linux/ | http://www.youtube.com/watch?v=bQqX3RWn0Yw | http://opensource.com/business/13/11/selinux-policy-guide
13:15 _SjG_: Thanks
13:16 TrevorH: semodule -B when done (as well as setenforce 1)
13:25 _SjG_: TrevorH: thanks, that resolved it.
13:25 _SjG_: so what I was missing is that there can be donaudit rules that were preventing specific operations from showing up in the audit log?
13:28 TrevorH: yes
13:28 sfix: _SjG_: yep, there’s permissions in the policy that we know are requested but don’t want to allow for whatever reason. dontaudits are our way of preventing them from cluttering the audit log.
13:29 sfix: dontaudits tend to be a bit over-eager though

So there you have it.

I was finally back to where I had been mere days before.

Filed in:General, Technology, Unix

https://www.fogbound.net/archives/2017/08/31/getting-nagios-back-up/#respond

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Thu, 31 Aug 2017

Getting nagios back up and running… again.

Leave a Reply