fogbound.net




Wed, 27 Jun 2007

Unix: How to find files lacking certain strings

— SjG @ 4:10 pm

So, I’m working on a convoluted web site, and a problem comes up. It seems that some vitally important code was not included in some pages (for the sake of argument, let’s say it’s a copyright string). This particular site has an ungodly mix of files, including .htm, .html, and .jsp files. Some of the .jsp files are actual pages, and others are stubs to be included in other .jsp pages. The majority of the full .jsp pages include a “footer.jsp” that has the desired string, so they’re good. But I need to generate a list of the full pages, of whatever sort, that lack this string.

The inverse of this problem is easy, and is the kind of thing I use all the time:
find . \( -name \*.htm -o -name \*.html -o -name \*.jsp \) -exec grep -il "myString" {} \;

Initially, I thought the -v flag to grep would work for me, but grep -vl returns nearly every file it sees: -v inverts the match per line, not per file, so -vl lists any file containing at least one line that doesn’t match. Then there’s the problem that I need to match “full” pages rather than included .jsp stubs.
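
As it turns out, grep does have a file-level invert: the -L flag lists the files with no matching lines at all. It doesn’t solve the stubs-versus-full-pages problem by itself, but the difference is easy to see on a couple of hypothetical files:

# -vl lists files with at least one non-matching line (nearly all of them)
grep -vl "uniqueCopyrightTag" *.html

# -L lists files containing no match at all -- the files lacking the string
grep -L "uniqueCopyrightTag" *.html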

So here’s how the Mighty Power of Unix came to my rescue:

find . \( -name \*.htm -o -name \*.html -o -name \*.jsp \) | xargs grep -il "</html>" | sort -u > full_pages.txt

provides me with a list of pages that are not mere inclusions, if you accept my assumption that an inclusion won’t match the closing HTML tag.

Then I generate a list of full pages that either contain the magic string or include the footer.jsp that would supply it:
find . \( -name \*.htm -o -name \*.html -o -name \*.jsp \) | xargs grep -il "</html>" | xargs grep -le "uniqueCopyrightTag\|footer\.jsp" | sort -u > pages_with_string.txt

Then I compare the files to find out which full pages lack both the magic string and the include:
comm -3 pages_with_string.txt full_pages.txt

Wow. There it is!
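
A small refinement on the comm step: since every page in pages_with_string.txt also appears in full_pages.txt, comm -13 prints only the lines unique to the second file, without the tab-indented column that comm -3 produces:

comm -13 pages_with_string.txt full_pages.txt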

I bet there’s an easier way. Post an example in the comments if you know of one!
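
One candidate for that easier way, using the same -L flag from above (untested against this particular tree, so consider it a sketch): chain the two greps and skip the temporary files and comm entirely:

find . \( -name \*.htm -o -name \*.html -o -name \*.jsp \) | xargs grep -il "</html>" | xargs grep -Le "uniqueCopyrightTag\|footer\.jsp"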

NOTE: All commands are on a single line, regardless of whether they wrap in this particular display.


Sun, 25 Mar 2007

Backups, cont.

— SjG @ 9:50 pm

OK. I’m a bonehead. The link I provided to my backup script tarball was broken. The link is fixed.

But wait! A new version of the scripts will be posted in a few days. It’s got some bug fixes and some new features. With it, the little birds really do sing more cheerfully, and the colors really will be brighter.

(As an aside … I don’t know why none of the people who clicked on the broken link bothered to send me an email or leave a comment to tell me there was a problem. Could that all have been robot traffic?)


Thu, 8 Mar 2007

Automated Backups – Updated!

— SjG @ 3:50 pm

[Update — fixed the link!]

Automated Backups are a good thing. Automated Backups make the little birds sing, the rainbows shine, and little fauns gambol about in beautiful green forests. When computers are backed up, the butterflies flutter, the flowers bloom, and the fruit from the trees tastes just a little sweeter. But when computers are not backed up, the universe becomes angry.

An angry universe is not a good thing. An angry universe makes little birds cry. An angry universe makes Cthulhu come and visit.

So. Automated backups. I’m partial to rdiff-backup because it allows me not only to back up data, but also to keep previous versions available. Backing up nightly doesn’t help if you accidentally overwrite the contents of a file and don’t notice for a day or two. But with rdiff-backup, you can restore the version from before the error.
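
As a sketch of what that looks like in practice (the host and paths here are invented), one command mirrors a directory while keeping reverse increments, and the -r flag pulls back any older version:

# mirror /var/www on the backup host, keeping reverse increments
rdiff-backup /var/www backuphost.example.com::/backups/www

# restore a file as it existed two days ago
rdiff-backup -r 2D backuphost.example.com::/backups/www/index.html /tmp/index.html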

Unfortunately, rdiff-backup really is designed for server-to-server backups, where each end of the transaction has shell access. Enter duplicity, a related project. It’s designed more for storing backups on servers that you don’t control and/or don’t trust. It allows encryption of your backup sets, and supports a wider variety of protocols (ftp, scp, s3, etc.).
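
Again as a rough sketch (the URL is invented): duplicity takes a source directory and a target URL, encrypting with GnuPG along the way, and reversing the arguments restores:

# encrypted incremental backup to an ftp server you don't fully trust
duplicity /var/www ftp://user@ftp.example.com/backups/www

# restore the most recent backup into a local directory
duplicity ftp://user@ftp.example.com/backups/www /tmp/www-restore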

So with a combination of these two tools, you can back up pretty much any POSIX-ish server to pretty much anything that you can ftp or ssh into. Still, it’d be nice if you could:

  • Check that the backups completed successfully, and get email confirming that success or warning on a failure.
  • Configure all of your various backups in a simple text file, rather than remembering the different command-line formats.
  • Create groups of options that can be applied to backup tasks.
  • Issue commands on the backup source and destinations before and/or after the backup (good for dumping a database into a flat file, for example, and then deleting it once it’s backed up; see the sketch after this list).
  • Get email confirmation on completion of backups.
  • Have some tools to simplify the securing of the backup process.
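
That pre/post-command idea, spelled out in plain shell (the database name, credentials, and paths are all placeholders):

# pre-command: dump the database to a flat file so it lands in the backup
mysqldump -u backupuser -p'secret' mydb > /var/backups/mydb.sql

# the backup itself
rdiff-backup /var/backups backuphost.example.com::/backups/db

# post-command: remove the dump now that it's safely backed up
rm /var/backups/mydb.sql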

For these reasons, I put together this backup script, which is basically a Ruby wrapper for rdiff-backup and duplicity. It’s almost entirely configured via two human-readable YAML files.

It’s flexible, reasonably simple to use, and comes without any guarantees whatsoever. Feel free to use it yourself!

DISCLAIMER: it’s as-is. Not to be used in place of a certified Cthulhu-deterrent. Use at your own risk. To quote the duplicity page: “[it] is not stable yet. It is thought to have a few bugs, but will work for normal usage, and should continue to work fine until you depend on it for your business or to protect important personal data.” — that goes for me too, only double.


Fri, 29 Dec 2006

eAccelerator Weirdness

— SjG @ 4:52 pm

I’ve been busy setting up a new hosting environment for a bunch of static HTML and PHP-based web sites on a Go Daddy Virtual Server. It was going swimmingly, until I came to an old CMS Made Simple site (running 0.10.x), which merely returned blank pages. Newer versions of CMS Made Simple ran fine. I could find nothing in the virtual host’s web error logs, the php log, the mysql error logs, the eaccelerator error logs, or any other system logs — except in the main Apache error log, there was:

child pid XXXXX exit signal Segmentation fault (11)

From searching around, this looks like it could be some kind of threading issue; however, I’m following the recommendations and using the Apache 2 prefork MPM.
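
For anyone retracing these steps, confirming which MPM is actually compiled in takes one command (the binary may be called httpd or apache2, depending on the distribution):

httpd -V | grep -i mpm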

Eventually, the (weak) solution I came up with was to turn off eAccelerator for that virtual host. This remedies the situation, although I can’t say it makes me very happy, since I don’t understand exactly what’s going on (or what the problem is).
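
For the record, this is roughly what that looks like under mod_php (the names here are placeholders; treat it as a sketch rather than a recipe):

<VirtualHost *:80>
    ServerName legacy-cms.example.com
    DocumentRoot /var/www/legacy-cms
    # disable eAccelerator for just this one site
    php_admin_flag eaccelerator.enable off
</VirtualHost>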

I’d welcome insight into this.

Details: Fedora Core 4, Apache/2.0.54 (Fedora), PHP 5.0.4, eAccelerator 0.9.5.


Thu, 12 Jan 2006

sa-exim config tweak

— SjG @ 11:13 pm

This is probably obvious to everyone in the universe but me, but I was having a problem where my outbound email was being scanned by sa-exim, in addition to the desired scanning of incoming email.

The trick is in setting your SAEximRunCond in sa-exim.conf correctly. This is probably documented somewhere, but I totally missed it. In any case, assuming you want to skip scanning of email originating in your local network (e.g., addresses in 10.3.2.0/24), and that you changed the secret SA-Do-Not-Run header’s name to SA-Do-Not-Think-Of-Running, you would use the following line in your sa-exim.conf:

SAEximRunCond: ${if and {{def:sender_host_address} {!eq {${mask:$sender_host_address/24}}{10.3.2.0/24}} {!eq {$h_X-SA-Do-Not-Think-Of-Running:}{Yes}} } {1}{0}}
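
If you want to sanity-check the mask arithmetic before trusting it, Exim’s expansion-test mode is handy:

# should print 10.3.2.0/24, confirming the /24 mask covers your LAN
exim -be '${mask:10.3.2.99/24}'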

Voilà, outbound emails are no longer checked. Of course, if you are sending spam, please do not make the above change, but instead please swallow six to ten large, unpeeled pineapples whole.