fogbound.net




Thu, 31 Aug 2017

Getting nagios back up and running… again.

— SjG @ 12:42 pm

Nagios monitoring on one CentOS 6.9 server seemed to have stopped working after an upgrade. All the tests showed status OK, but they hadn’t actually run in days.

Looking at the service details was weird, because the next scheduled check was about a minute in the past.

The Nagios help page wasn’t. And we’re running Core 4.3.x anyway, without a MySQL database.

The first clue was a bunch of lines in the event log:
Error: Could not open check result queue directory '/var/log/nagios/spool/checkresults' for reading.

Turns out we didn’t even have a /var/log/nagios/spool directory. Creating those directories helped. But Nagios still wouldn’t start from the usual startup scripts. Nothing in the main log. But then, another clue.

Who doesn’t love to see shit like this:

$ cat /var/log/nagios/nagios.configtest
ERROR: Errors in config files - see log for details: /var/log/nagios/nagios.configtest
$

So the startup script /etc/init.d/nagios searches for warnings, and aborts if they exist. It’s supposed to log them. For some reason it didn’t.

You can manually get those warnings and errors yourself by running

/usr/sbin/nagios -v /etc/nagios/nagios.cfg (adjust paths as appropriate).
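If the verifier output is long, a small filter can narrow it down to the problem lines. This is only a sketch: the function name is made up, and it assumes the verifier prefixes such lines with "Warning:" or "Error:" (matched case-insensitively here, in case the capitalization varies between versions).

```shell
#!/bin/bash
# Print only the warning and error lines from `nagios -v` output read on stdin.
# Assumption: problem lines start with "Warning:" or "Error:" (any case).
filter_nagios_problems() {
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -Ei '^(warning|error):' || true
}

# Usage (paths as in the post; adjust as appropriate):
#   /usr/sbin/nagios -v /etc/nagios/nagios.cfg | filter_nagios_problems
```

Summary lines such as a warning total do not start with either prefix, so they are dropped along with the rest of the chatter.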

I ended up with a bunch of warnings for deprecated parameters. So I went in and edited my config files to remove them or update them to the new equivalents. Oh yes. Software authors, please keep in mind: nothing pleases your users more than changing the names of variables in config files. We users live for this shit. When, oh when, will the author of our software next change “retry_check_interval” to “retry_interval”?

Fixing all of the warnings was not enough, though. The startup script gave the message “Starting nagios:” and then silently died. Well, sort of. It actually was starting now, but brokenly:

# ps aux | grep -i nagios
nagios 12610 0.0 0.0 12296 1220 ? Ss 16:36 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
nagios 12611 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12612 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12613 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12614 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12616 0.0 0.0 11780 520 ? S 16:36 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
root 12744 0.0 0.0 103328 876 pts/0 S+ 16:44 0:00 grep -i nagios

Starting directly from the command line worked:

# /usr/sbin/nagios -d /etc/nagios/nagios.cfg
# ps aux | grep nagios
nrpe 7282 0.0 0.0 41380 1340 ? Ss 16:51 0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios 8010 0.0 0.0 16404 1280 ? Ss 17:31 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
nagios 8011 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios --worker /var/spool/nagios/cmd/nagios.qh
nagios 8012 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios --worker /var/spool/nagios/cmd/nagios.qh
nagios 8013 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios --worker /var/spool/nagios/cmd/nagios.qh
nagios 8014 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios --worker /var/spool/nagios/cmd/nagios.qh
nagios 8015 0.0 0.0 15888 552 ? S 17:31 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
root 8018 0.0 0.0 100956 616 pts/0 S+ 17:31 0:00 tail -f /var/log/nagios/nagios.log
root 8037 0.0 0.0 103328 856 pts/1 S+ 17:32 0:00 grep nagios

When things get this weird, there are only two options. Well, three, if you include the “rm -rf /” option. But the other two are: 1) reboot and see if stuff magically starts working, or 2) see what SELinux is breaking.

# tail -f /var/log/audit/audit.log
type=AVC msg=audit(1504137329.209:41): avc: denied { execute_no_trans } for pid=7731 comm="nagios" path="/usr/sbin/nagios" dev=dm-0 ino=1201464 scontext=unconfined_u:system_r:nagios_t:s0 tcontext=system_u:object_r:nagios_exec_t:s0 tclass=file

Yup. As expected, SELinux breaking stuff.

So, my preferred way to solve this kind of problem now is to snip out all the relevant AVC “denied” sections from the log into a single file (which I called audit.log), and then use audit2allow to create a new module. Since there’s already a nagios module (containing insufficient privileges), I created a nagios2 module:

# audit2allow -M nagios2 < audit.log
# semodule -i nagios2.pp
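The snipping step could be done with a small filter along these lines. It is a sketch, not what was actually run here: the function name is hypothetical, and the log path in the usage comment is the usual auditd default, which is an assumption about this machine.

```shell
#!/bin/bash
# Print the AVC "denied" records in an audit log that mention a given name.
# Arguments: $1 = path to the audit log, $2 = string to look for (e.g. nagios).
extract_avc_denials() {
    local logfile=$1 name=$2
    # Three plain filters: record type, denial, and the process/context name.
    grep 'type=AVC' "$logfile" | grep 'denied' | grep -F -- "$name" || true
}

# Usage (as root; log path is an assumption):
#   extract_avc_denials /var/log/audit/audit.log nagios > audit.log
#   audit2allow -M nagios2 < audit.log
#   semodule -i nagios2.pp
```

It is worth reading the extracted lines before feeding them to audit2allow, since the generated module will allow everything that appears in them.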

Hooray! After a few iterations of this process (discovering other blocked operations, granting them permission, restarting nagios), everything was working but check_disk_smb, which was returning “results from smbclient not suitable” even as it worked fine when tested from the command-line as follows:

# su - nagios -s /bin/bash -c "/usr/lib64/nagios/plugins/check_disk_smb -H SMBHOST -s share -a 10.X.X.X -u nagios -p \"password\" -w 90 -c 95"
Disk ok - 16.71G (11%) free on \\SMBHOST\share | 'share'=134108676096B;136850492620.8;144453297766.4;0;152056102912

Diving in and editing check_disk_smb to throw the actual error message, I found nagios getting a “ERROR: Could not determine network interfaces, you must use a interfaces config line” from smbclient. So I edited /etc/samba/smb.conf, and explicitly told samba which interfaces it had available:

interfaces = lo eth0 10.X.X.X/24

Le sigh. Now this error went away, and I got to go for another fun and challenging round of “find all the SMB operations that SELinux is breaking.” This time, I got tripped up by the “dontaudits”: there were operations being blocked, but not logged. I was saved by TrevorH and sfix, helpful people in Freenode’s #centos IRC channel:

TrevorH: semodule -DB to disable dontaudit rules, stay permissive, recreate, use the audit log to generate a policy as per the wiki
13:14 TrevorH: @selinux
13:14 centbot: Useful resources for SELinux: http://wiki.centos.org/HowTos/SELinux | http://wiki.centos.org/TipsAndTricks/SelinuxBooleans | http://docs.fedoraproject.org/en-US/Fedora/13/html/Security-Enhanced_Linux/ | http://www.youtube.com/watch?v=bQqX3RWn0Yw | http://opensource.com/business/13/11/selinux-policy-guide
13:15 _SjG_: Thanks
13:16 TrevorH: semodule -B when done (as well as setenforce 1)
13:25 _SjG_: TrevorH: thanks, that resolved it.
13:25 _SjG_: so what I was missing is that there can be donaudit rules that were preventing specific operations from showing up in the audit log?
13:28 TrevorH: yes
13:28 sfix: _SjG_: yep, there’s permissions in the policy that we know are requested but don’t want to allow for whatever reason. dontaudits are our way of preventing them from cluttering the audit log.
13:29 sfix: dontaudits tend to be a bit over-eager though
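The sequence TrevorH describes can be sketched roughly as follows. The semodule and setenforce calls come from the chat; `setenforce 0` is my reading of "stay permissive", and the wrapper functions are hypothetical. It defaults to a dry run that only prints the commands, because the real thing needs root and changes the system's SELinux state.

```shell
#!/bin/bash
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 and run as root to execute for real.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

find_hidden_denials() {
    run semodule -DB    # rebuild the policy with dontaudit rules disabled
    run setenforce 0    # permissive: denials get logged but not enforced
    # Reproduce the failing check here, then build a module from the
    # newly logged AVC lines with audit2allow, as earlier in the post.
    run semodule -B     # rebuild the policy with dontaudit rules restored
    run setenforce 1    # back to enforcing
}
```

Calling `find_hidden_denials` with the default settings prints the four commands in order, which makes it easy to check the plan before running anything.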

So there you have it.

I was finally back to where I had been mere days before.


Thu, 22 Sep 2016

Checking Solr index with nagios: obsolete versions

— SjG @ 12:33 pm

I needed to check that the index process that populates the Solr index succeeded and didn’t die during the night, leaving an empty index.

To make things more complicated, the versions of Solr and nagios in use are probably not the latest.

The check_solr -o numdocs command doesn’t work with our Solr configuration. But the internet tells me that the Solr query http://localhost:8983/solr/select/?debug=query&q=*:* includes the size of the result set. Testing it, I found this to be true:

<response>
   <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">0</int>
      <lst name="params">
         <str name="q">*:*</str>
         <str name="debug">query</str>
      </lst>
   </lst>
   <result name="response" numFound="9832" start="0">
      <doc>
...

I want to use nagios to check that numFound is never zero (or too small). I thought I’d just be able to use a nagios regex:

check_http -H localhost -p 8983 -u "/solr/select/?debug=query&q=*:*" -lr 'numFound=\"\d{2+}"'

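It didn’t work. To make a long story short, there’s regex and then there’s regex. As far as I can tell, check_http matches with POSIX regular expressions, which don’t understand Perl-style shorthand like \d. The kind that works for nagios is: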

check_http -H localhost -p 8983 -u "/solr/select/?debug=query&q=*:*" -lr 'numFound=\"[1-9][0-9][0-9]'

This guarantees at least a hundred docs are in the index.
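A quick way to try the pattern outside nagios is grep -E. This is a sketch that assumes grep's extended regex behaves closely enough to whatever check_http does internally; the function name is invented for illustration.

```shell
#!/bin/bash
# Succeed if the Solr response on stdin reports a numFound of three or more
# digits with a non-zero first digit, i.e. at least 100 documents.
index_has_100_docs() {
    grep -Eq 'numFound="[1-9][0-9][0-9]'
}

# Usage (URL as in the post; requires curl and a running Solr):
#   curl -s 'http://localhost:8983/solr/select/?debug=query&q=*:*' | index_has_100_docs \
#       && echo "index looks populated" || echo "index too small"
```

The pattern is anchored only on the left, so any count of three or more digits passes, while 0 through 99 fail because there is no third digit to match.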


Mon, 29 Jun 2015

Sorting lots of files into directories, by date

— SjG @ 10:32 am

A process handles periodic tasks, and each time it does, it spits out some telemetry. The telemetry gets written off into a file each time. This process could be any of a number of interesting things: a procmail script doing something with incoming email, a Twitter-bot responding to searches, an IRC-bot responding to events, or whatever. The only important thing in this case is that it creates a variable number of log files each day, and the log files all get dumped into a single directory.

Before long, the log directory will be filled with thousands of files, and will be unmanageable. But, for some reason, we want to keep all these logs, and maybe actually use them. So the key is a script to move the logs into sub-directories based on date.

It turns out it’s easy to write a bash script to create directories in a YYYY-MM format, and move the files into them appropriately. The key is in the stat command. Conveniently, the implementation of this command is completely different and incompatible between Mac OS/BSD and Linux. Jesus H. Christ.

Linux:

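#!/bin/bash
# Move each .log file into a YYYY-MM directory based on its modification time.
for i in *.log
do
    [ -e "$i" ] || continue    # skip the literal pattern when nothing matches
    filemonth=$(stat --format=%y "$i" | cut -c 1-7)
    mkdir -p "$filemonth"
    mv "$i" "$filemonth/$i"
done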

In Linux, the stat command will give you the modification date in an ISO-style format, and using the cut command, you can extract the YYYY-MM information. The -p flag to mkdir makes it succeed silently, without complaining, if a directory by that name already exists.

MacOS (or presumably other BSDs):

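#!/bin/bash
# Same idea, using the BSD flavor of stat.
for i in *.log
do
    [ -e "$i" ] || continue    # skip the literal pattern when nothing matches
    filemonth=$(stat -f %Sm -t %Y-%m "$i")
    mkdir -p "$filemonth"
    mv "$i" "$filemonth/$i"
done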

In this case, we’re telling stat that we want the modified date as a string, and we specify the time format.

Either of these would be easy to modify to single day resolution (changing which columns you cut in the Linux version, or the timestamp format in the Mac version).
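As an illustration of the single-day variant, here is a sketch of the Linux version wrapped in a function. The function name and the optional directory argument are additions for the example; the only substantive change is cutting ten characters instead of seven. It relies on GNU stat, so it will not work unmodified on a Mac.

```shell
#!/bin/bash
# Move each .log file in a directory into a YYYY-MM-DD subdirectory
# based on its modification time. Requires GNU stat (Linux).
sort_logs_by_day() {
    local dir=${1:-.} f day
    for f in "$dir"/*.log; do
        [ -f "$f" ] || continue    # no matches, or not a regular file
        # %y prints e.g. "2015-06-29 10:32:00.000000000 -0700";
        # the first 10 characters are the date.
        day=$(stat --format=%y "$f" | cut -c 1-10)
        mkdir -p "$dir/$day"
        mv "$f" "$dir/$day/"
    done
}

# Usage: sort_logs_by_day /path/to/logs
```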


Tue, 2 Dec 2014

Compiling Kannel 1.4.4 under CentOS 7.0

— SjG @ 4:28 pm

This took me a while to get to work. If you follow these steps in order, it should work nicely.


# yum install mariadb-devel
# yum install libxml2-devel
# yum install bison
# yum install byacc
# cd /usr/local/src
# wget http://kannel.org/download/1.4.4/gateway-1.4.4.tar.gz
# tar xzvf gateway-1.4.4.tar.gz
# cd gateway-1.4.4
# ./configure --prefix=/usr/local/kannel --with-mysql --with-mysql-dir=/usr/include/mysql --disable-wap
# make

There are a few tricks here. First, just having libxml2 installed is not enough. You need the libxml development headers, etc. Should be obvious, but tricked me. Next, if you run ./configure before you have some of the dependencies installed (e.g., Bison), you will have modified source files that will still fail even after you install the dependency. Thus it’s important to install all that stuff before you run ./configure.
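One way to avoid running ./configure too early is a small pre-flight check. This is a sketch with a made-up function name; it only verifies that commands are on the PATH, so the development headers still need a separate package query.

```shell
#!/bin/bash
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage before building:
#   missing_tools bison byacc wget tar make
#   rpm -q mariadb-devel libxml2-devel    # headers are packages, not commands
# If ./configure already ran with something missing, start again from a
# freshly extracted tarball rather than the modified source tree.
```

An empty result from `missing_tools` means every listed command was found.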

This stuff isn’t really that hard, but it can be time consuming to track down why it’s not working.


Tue, 22 Apr 2014

Annoying Xcode issue and resolution

— SjG @ 3:08 pm

I’d upgraded my home machine to Mavericks fairly soon after the OS was released, but hesitated in upgrading my main work machine. I didn’t want to have extensive downtime while tracking down odd dependencies and incompatibilities.

Well, time came to upgrade. It seemed safe. Everything was fine on my home machine. So I went ahead and upgraded my work machine.

Suddenly, I couldn’t get MacPorts to build MySQL.

I had followed the migration guide carefully. I tried all the usual tricks. In the port’s /opt/local/var/macports/build/…/config.log, the error was:

ld: library not found for -lcrt1.10.6.o

Google seemed to think this indicated that my Xcode command-line tools were not installed correctly. That library should be installed with all of the Unixy support that comes with Xcode’s command-line tools. Within the Xcode application, it told me that I had the command-line tools 5.1.1 (5B1008) properly installed. When trying various command-line options, the command-line tools were, in fact, installed. For example, xcrun gave the exact results one would expect.

Other tests also made it look like everything was good:
# xcode-select -p
/Applications/Xcode.app/Contents/Developer

Numerous Googled sites said to use the “xcode-select” command to install the tools if they were not functioning properly. Eventually, I gave in and tried it. Interestingly, this yielded an unexpected result:
# xcode-select --install
xcode-select: error: no developer tools were found, and no install
could be requested (perhaps no UI is present), please install
manually from 'developer.apple.com'.

Since I had originally installed Xcode under Leopard from a downloaded package rather than recently through the App store, I thought perhaps in the various upgrades something had gotten messed up. I decided to completely uninstall Xcode:

/Developer/Library/uninstall-devtools --mode=all

I also removed the vestigial /Developer directory, and the /Applications/Xcode.app directory.

I re-installed Xcode from the App store. Everything functioned and/or failed exactly as it had before.

In desperation, I downloaded the Mavericks command line tools package from Apple, and installed it. It should be the same thing as what was installed with Xcode. But it evidently is not, because now I can build the MacPort for MySQL.

edit/update: It may not be clear above, but normally doing the “xcode-select --install” is all that’s needed. It’s also not stated above, but I tried that after re-installing Xcode from scratch, and had the same issue. Evidently, whatever was mis-configured on my machine is quite rare.

Also, don’t trust Xcode when the Locations tab of its preferences panel tells you that it’s installed the command-line tools.
It’s probably lying to you – it’s installed stubs, but not the actual tools.