fogbound.net




Thu, 31 Aug 2017

Getting nagios back up and running… again.

— SjG @ 12:42 pm

Nagios monitoring on one Centos 6.9 server seemed to have stopped working after an upgrade. All the tests showed status OK, but they hadn’t actually run in days.

Looking at the service details was weird, because the next scheduled check was about a minute in the past.

The Nagios help page wasn’t. And we’re running Core 4.3.x anyway, without a MySQL database.

The first clue was a bunch of lines in the event log:
Error: Could not open check result queue directory '/var/log/nagios/spool/checkresults' for reading.

Turns out we didn’t even have a /var/log/nagios/spool directory. Creating those directories helped. But Nagios still wouldn’t start from the usual startup scripts. Nothing in the main log. But then, another clue.

Who doesn’t love to see shit like this:

$ cat /var/log/nagios/nagios.configtest
ERROR: Errors in config files – see log for details: /var/log/nagios/nagios.configtest
$

So the startup script /etc/init.d/nagios searches for warnings, and aborts if they exist. It’s supposed to log them. For some reason it didn’t.

You can manually get those warnings and errors yourself by running

/usr/sbin/nagios -v /etc/nagios/nagios.cfg (adjust paths as appropriate).

I ended up with a bunch of warnings for deprecated parameters. So I went in and edited my config files to remove them or update them to the new equivalents. Oh yes. Software authors, please keep in mind: nothing pleases your users more than changing the names of variables in config files. We users live for this shit. When, oh when, will the author of our software next change “retry_check_interval” to “retry_interval”?

Fixing all of the warnings was not enough, though. The startup script gave the message “Starting nagios:” and then silently died. Well, sort of. It actually was starting now, but brokenly:

# ps aux | grep -i nagios
nagios 12610 0.0 0.0 12296 1220 ? Ss 16:36 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
nagios 12611 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12612 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12613 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12614 0.0 0.0 0 0 ? Z 16:36 0:00 [nagios] <defunct>
nagios 12616 0.0 0.0 11780 520 ? S 16:36 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
root 12744 0.0 0.0 103328 876 pts/0 S+ 16:44 0:00 grep -i nagios

Starting directly from the command line worked:

# /usr/sbin/nagios -d /etc/nagios/nagios.cfg
# ps aux | grep nagios
nrpe 7282 0.0 0.0 41380 1340 ? Ss 16:51 0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
nagios 8010 0.0 0.0 16404 1280 ? Ss 17:31 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
nagios 8011 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios –worker /var/spool/nagios/cmd/nagios.qh
nagios 8012 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios –worker /var/spool/nagios/cmd/nagios.qh
nagios 8013 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios –worker /var/spool/nagios/cmd/nagios.qh
nagios 8014 0.0 0.0 10052 920 ? S 17:31 0:00 /usr/sbin/nagios –worker /var/spool/nagios/cmd/nagios.qh
nagios 8015 0.0 0.0 15888 552 ? S 17:31 0:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
root 8018 0.0 0.0 100956 616 pts/0 S+ 17:31 0:00 tail -f /var/log/nagios/nagios.log
root 8037 0.0 0.0 103328 856 pts/1 S+ 17:32 0:00 grep nagios

When things get this weird, there are only two options. Well, three, if you include the “rm -rf /” option. But the other two are: 1) reboot and see if stuff magically starts working, or 2) see what SELinux is breaking.

# tail -f /etc/audit/audit.log
type=AVC msg=audit(1504137329.209:41): avc: denied { execute_no_trans } for pid=7731 comm=”nagios” path=”/usr/sbin/nagios” dev=dm-0 ino=1201464 scontext=unconfined_u:system_r:nagios_t:s0 tcontext=system_u:object_r:nagios_exec_t:s0 tclass=file

Yup. As expected, SELinux breaking stuff.

So, my preferred way to solve this kind of problem now is to snip out all relevant the AVC “denied” sections from the log into a single file (which I called audit.log), and then using audit2allow to create a new module. Since there’s already a nagios module (containing insufficient privileges), I created a nagios2 module:

# audit2allow -M nagios2 < audit.log # semodule -i nagios2.pp

Hooray! After a few iterations of this process (discovering other blocked operation, granting them permission, restarting nagios), everything was working but check_disk_smb, which was returning “results from smbclient not suitable” even as it worked fine when tested from the command-line as follows:

# su – nagios -s /bin/bash -c “/usr/lib64/nagios/plugins/check_disk_smb -H SMBHOST -s share -a 10.X.X.X -u nagios -p \”password\” -w 90 -c 95″
Disk ok – 16.71G (11%) free on \\SMBHOST\share | ‘share’=134108676096B;136850492620.8;144453297766.4;0;152056102912

Diving in and editing check_disk_smb to throw the actual error message, I found nagios getting a “ERROR: Could not determine network interfaces, you must use a interfaces config line” from smbclient. So I edited /etc/samba/smb.conf, and explicitly told samba which interfaces it had available:

interfaces = lo eth0 10.X.X.X/24

Le sigh. Now this error went away, and I got to go for another fun and challenging round of “find all the SMB operations that SELinux is breaking.” This time, I got tripped up the “dontaudits” — there were operations being blocked, but not logged. I was saved by TrevorH and sfix, helpful people in Freenode’s #centos IRC channel:

TrevorH: semodule -DB to disable dontaudit rules, stay permissive, recreate, use the audit log to generate a policy as per the wiki
13:14 TrevorH: @selinux
13:14 centbot: Useful resources for SELinux: http://wiki.centos.org/HowTos/SELinux | http://wiki.centos.org/TipsAndTricks/SelinuxBooleans | http://docs.fedoraproject.org/en-US/Fedora/13/html/Security-Enhanced_Linux/ | http://www.youtube.com/watch?v=bQqX3RWn0Yw | http://opensource.com/business/13/11/selinux-policy-guide
13:15 _SjG_: Thanks
13:16 TrevorH: semodule -B when done (as well as setenforce 1)
13:25 _SjG_: TrevorH: thanks, that resolved it.
13:25 _SjG_: so what I was missing is that there can be donaudit rules that were preventing specific operations from showing up in the audit log?
13:28 TrevorH: yes
13:28 sfix: _SjG_: yep, there’s permissions in the policy that we know are requested but don’t want to allow for whatever reason. dontaudits are our way of preventing them from cluttering the audit log.
13:29 sfix: dontaudits tend to be a bit over-eager though

So there you have it.

I was finally back to where I had been mere days before.


Tue, 9 May 2017

This too shall pass

— SjG @ 9:49 pm

Back in October of 2015, I started writing the following, and never finished or published it:

I upgraded the Mac to Yosemite a year or so ago. Yesterday, I wanted to do some development on a project that I’d been idly thinking about. Unfortunately, it required a dependency in a package I’d installed via Mac Ports. I tried to upgrade it, but got an error that I was compiling for the wrong Darwin version. This means I haven’t actually updated any of my Ports since upgrading to Yosemite! For shame.

Rather than fix Mac Ports for Yosemite, and then again when I upgrade to El Capitan, I decided it was time to do that upgrade and then fix it. I also thought … hey, there’re all these neat new container technologies and configuration tools. Maybe I should look into some of those, and save myself the agony next time around.

So I dove into some articles, and pretty soon had become a seething mass of quivering rage.

To set up my environment in Docker, I need Docker, and a VM. I could set it up using Vagrant, or, as some people recommend, Vagrant running Chef or Solo. Then, of course, I need to set up some replacement for vboxsf so I can access my files in the Virtual environment. Each of these requires its own configuration, of course.

Today, I was struggling with something similar. I’m building an iOS app. Years ago, I’d built a few native iOS apps, but I’ve forgotten everything I ever knew about Objective C, and I don’t know Swift. Plus, I need to publish for Android too. So, six months ago, when I started this process, I decided I’d be using Ionic Framework. It had the advantage that it was based on AngularJS, and I’ve done some work in Angular.

Now that I’m starting, I discover that Ionic 2 is the way to go — oh wait, not Ionic 3 was just released! And my AngularJS experience is ancient v1.2.x, knowledge which is largely obsolete. I’d be learning Angular 2 — no, we’re up to Angular 4 now — so better get cracking on that.

I remember, many, many years ago, how excited I was was there was a new version of Windows. I couldn’t wait to get all those 3.5″ floppies home so I could upgrade my machine to the latest and greatest. Now, I dread each year when a new version of Mac OS comes out, and I need to upgrade and track down all the things that broke, and rebuild my ports and and and… Not to mention when I installed a recent Linux on a VM to host some sites, and discovered to my chagrin that systemd has replaced all manner of things Unixy that I’ve been doing mostly-the-same for thirty years.

Well shit. There it is. I’ve become the grumpy old software guy. “Why are they changing things? Why can’t they just leave them alone?” The fact is, some of these changes are indisputably improvements. But so many of them seem to be changes for the sake of change. We have to have “new, improved!” all the time, even if it’s just changing the syntax (why, oh why, is *ngFor so much better than ng-repeat !?).

Part of this is struggling with obsolescence in general. It’s hard being middle-aged in tech. You can’t help seeing that look in the eyes of the youngsters: that old guy is so backwards. But it goes beyond that. My neighborhood is changing around me. Younger families are moving in, and suddenly I’m that guy who’s been in the neighborhood for a long time. I find myself navigating by past landmarks — it’s across from the Burger King, er, those condos, right by the Foster’s Freeze, er, Dunkin’ Donuts. The world is changing around me rapidly. The political world I grew up in has shifted. Every year, I see more obituaries for people I know, or whose names I know. Things that were true when I was a child are no longer true.

I have vague memories of hearing these thoughts expressed when I was younger by people I thought were old. I didn’t understand them then. I’m beginning to understand them now.

I just have to remind myself that change is constant, and not all bad. When I was a kid, there were no known exoplanets. When I was in my twenties, I’d come home from a night out, and I’d be stinking of second-hand cigarette smoke. We had to struggle with card catalogs to find books in the library. When trying to reach my friends, I’d have to leave messages on their home answering machines, and I’d have to call from a pay phone where I’d enter in a multidigit phone card number. If you were interested in obscure music or books, you’d have to read tiny ads in the back of magazines to track down sources or information. LGBQT people were all but invisible, and same-sex marriage was barely even in the realm of speculative fiction.

So, that being said, I’d like change to slow down a bit. Could I please just finish a project before all the constituent languages, libraries, and frameworks have a major version increment?


Thu, 22 Sep 2016

Checking Solr index with nagios: obsolete versions

— SjG @ 12:33 pm

I needed to check that the index process that populates the Solr index succeeded and didn’t die during the night, leaving an empty index.

To make things more complicated, the versions of Solr and nagios in use are probably not the latest.

The check_solr -o numdocs command doesn’t work with our Solr configuration. But the internet tells me that the Solr query http://localhost:8983/solr/select/?debug=q‌uery&q=*:* includes the size of the result set. Testing it, I found this to be true:

<response>
   <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">0
      <lst name="params">
         <str name="q">*:*</str>
         <str name="debug">q‌uery</str>
      </lst>
   </lst>
   <result name="response" numFound="9832" start="0">
      <doc>
...

I want to use nagios to check that that numFound is never zero (or too small). I thought I’d just be able to use a nagios regex:

check_http -H localhost -p 8983 -u "/solr/select/?debug=query&q=*:*" -lr 'numFound=\"\d{2+}"'

It didn’t work. To make a long story short, there’s regex and then there’s regex. The kind that works for nagios is:

check_http -H localhost -p 8983 -u "/solr/select/?debug=query&q=*:*" -lr 'numFound=\"[1-9][0-9][0-9]'

This guarantees at least a hundred docs are in the index.


Tue, 7 Jun 2016

JavaScript compares things weirdly

— SjG @ 2:52 pm

We’ve already established that PHP compares things weirdly.

It shouldn’t surprise us that JavaScript does too.

Consider the following:

> var k=['hello'];
undefined
> (k=='hello'?'Equals':'Nope');
Equals

Now, purists will point out that that’s an “equals” operator not an “identity” operator, but I mean seriously? We’re just going to pretend that


> ['hello']=='hello'
true

I think I’ll just go and rewrite all my client side code in C now.


Mon, 28 Mar 2016

PHP Compares Things Weirdly

— SjG @ 10:36 am

This is a known .. uh … situation, but it bit me today.

So, consider the following:
$ php --version
PHP 5.4.16 (cli) (built: Jun 23 2015 21:17:27)
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2013 Zend Technologies
$ php -a
Interactive shell
php > $v1 = '479014103257633139480';
php > $v2 = '479014103257633139481';
php > echo ($v1==$v2?'Equal':'Not Equal');
Not Equal

Seems sane, yes? Reasonable. Kind of what you expect.

But then, consider this:

$ php --version
PHP 5.3.3 (cli) (built: Feb 9 2016 10:36:17)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies
$ php -a
Interactive shell
php > $v1 = '479014103257633139480';
php > $v2 = '479014103257633139481';
php > echo ($v1==$v2?'Equal':'Not Equal');
Equal

Yeah. Let that sink in for a moment.

Some versions of PHP (before 5.4.mumble) will preëmptively convert strings to numbers before comparing them (if they contain only digits). But if the number is large enough, you may lose the precision to compare them correctly.

Wow. I mean, just … well… I dunno.

For what it’s worth, strcmp will do the right thing regardless of PHP version. But seriously. I mean. Why do I use this turdburger of a language?