fogbound.net




Fri, 18 Jul 2008

Using Regular Expressions for HTML Processing in PHP

— SjG @ 4:16 pm

Well, not really. This is just one example of a bad approach.

The problem: an HTML file is read, but needs to be entity-escaped. However, not all entities need escaping. Specifically, double quotes within anchor tags need to be left alone.

The right solution: process the HTML via a DOM parser, escape nodes that are not anchor tags. Oh, but did I mention these HTML files may be crappy, non-validating files, or even snippets?

The next solution: Use a regular expression. Yes, this is ugly. Yes, it also works 🙂

Originally, I tried using variable-length lookahead, but ran into problems (PHP 4.x). But PHP provides another solution which is perfect for this sort of thing. Here’s the code:

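// Swap double quotes inside an <a ...> tag for a placeholder so that
// htmlentities() leaves them alone.
function pre_esc_quotes($inner)
{
    return preg_replace('/"/','QUOTE',$inner[0]);
}
// Put the double quotes back once everything else has been escaped.
function post_esc_quotes($inner)
{
    return preg_replace('/QUOTE/','"',$inner[0]);
}
$tmp = preg_replace_callback('/<a([^>]*?)>/s','pre_esc_quotes',$raw_html);
$tmp = htmlentities($tmp);
// htmlentities() has turned < and > into &lt; and &gt;, so match the escaped tag here.
echo preg_replace_callback('/&lt;a([^>]*?)&gt;/s','post_esc_quotes',$tmp);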

This, of course, presumes that the string “QUOTE” won’t show up anywhere in your raw html. Consider replacing it with an opaque string (like “JHG54JHGH76699597569” or something creative and long that will choke the interpreter).

This code is furthermore inefficient in a number of ways. It’s not something you should use. But it does show how preg_replace_callback avoids some scary regex work.
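
For what it's worth, the placeholder advice above can be taken a step further by generating a placeholder and checking that it really is absent from the input. The following is only a sketch under a few assumptions: the function name escape_except_anchor_quotes() is made up for illustration, random_bytes() needs PHP 7 or later, and the second regex pass is swapped for a plain str_replace(), which is enough once the placeholder is known to be unique.

```php
<?php
// A sketch, not the code from the post: escape_except_anchor_quotes() is a
// made-up name, and random_bytes() assumes PHP 7 or later.
function escape_except_anchor_quotes(string $raw): string
{
    // Pick a placeholder that does not occur anywhere in the input.
    // It is plain letters and digits, so htmlentities() will not alter it.
    do {
        $placeholder = 'Q' . bin2hex(random_bytes(8));
    } while (strpos($raw, $placeholder) !== false);

    // Hide the double quotes that sit inside <a ...> tags.
    $hidden = preg_replace_callback(
        '/<a\b[^>]*>/s',
        function (array $m) use ($placeholder) {
            return str_replace('"', $placeholder, $m[0]);
        },
        $raw
    );

    // Escape everything else, including the remaining double quotes.
    $escaped = htmlentities($hidden, ENT_QUOTES, 'UTF-8');

    // The placeholder is unique, so a plain string replace restores the quotes.
    return str_replace($placeholder, '"', $escaped);
}

echo escape_except_anchor_quotes('<a href="x.html">say "hi"</a>'), "\n";
```

The quotes around x.html survive, while the quotes around "hi" come out as &quot;. It is still a regex pretending to be an HTML parser, so the same warnings apply.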


Tue, 20 May 2008

Email Round-Robin using Procmail

— SjG @ 2:56 pm

The need arose to have a specific email address round-robin (i.e., cycle through a collection of destination email addresses).

A solution was achieved through use of procmail and a little perl script. It probably could be done more easily and/or better, but I figured other people might find this interesting.

So, first, an alias was created in /etc/aliases (used by postfix in this case, but it should work for sendmail, and variants should work for other MTAs):

rrtest:         |"/usr/bin/procmail -m /etc/postfix/roundrobin_procmail.rc"

Then, the following file was saved as /etc/postfix/roundrobin_procmail.rc:

:0 w:/tmp/rrlock
{
        :0
                dest=|/etc/postfix/rr.pl
        :0
                ! ${dest}
}

And then, of course, we need the perl program. Here’s /etc/postfix/rr.pl:

#!/bin/perl
# ----------------------------------------------------------
@recipients = (
'address1@sample.com',
'address2@sample.com',
'address3@sample.com'
);

$index_file = 'rr-index.txt';

# ----------------------------------------------------------

$index_exists = 1;
open(IN,"</tmp/${index_file}") || ($index_exists = 0);
if ($index_exists)
        {
        $index = <IN>;
        chomp($index);
        close(IN);
        $index++;
        }
else
        {
        $index = 0;
        }


if ($index > $#recipients)
        {
        $index = 0;
        }
open(OUT,">/tmp/${index_file}");
print OUT "$index\n";
close(OUT);

print STDOUT $recipients[$index];

exit 0;

Elegant? Not really. But it seems to work 🙂
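
The index bookkeeping in the script (read the last index, advance it, wrap around, write it back) can also be sketched with modulo arithmetic. This version is in PHP only to match the PHP used elsewhere on this page. The names next_round_robin_index() and pick_recipient() are hypothetical, and the flock() call is an extra assumption of mine, since the procmail lockfile above already serializes deliveries.

```php
<?php
// Hypothetical helper: given the previously used index (or null when there is
// no usable state) and the number of recipients, return the next index.
function next_round_robin_index(?int $previous, int $count): int
{
    if ($count < 1) {
        throw new InvalidArgumentException('need at least one recipient');
    }
    if ($previous === null) {
        return 0;                       // no state yet: start with the first recipient
    }
    return ($previous + 1) % $count;    // wraps back to 0 after the last recipient
}

// Read, advance and store the index while holding an exclusive lock.
function pick_recipient(array $recipients, string $stateFile): string
{
    $fh = fopen($stateFile, 'c+');      // create if missing, never truncate on open
    if ($fh === false) {
        throw new RuntimeException("cannot open $stateFile");
    }
    flock($fh, LOCK_EX);

    $raw = trim((string) stream_get_contents($fh));
    $previous = preg_match('/^\d+$/', $raw) ? (int) $raw : null;
    $index = next_round_robin_index($previous, count($recipients));

    // Overwrite the stored index with the one just handed out.
    ftruncate($fh, 0);
    rewind($fh);
    fwrite($fh, $index . "\n");
    fflush($fh);

    flock($fh, LOCK_UN);
    fclose($fh);

    return $recipients[$index];
}
```

One small difference from the Perl: if the recipient list shrinks while a larger index is still on disk, the modulo yields some valid index rather than resetting to 0.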


Mon, 24 Mar 2008

Open Source Software Development, Rant #1

— SjG @ 3:15 pm

Loath as I am to admit it, I know why Microsoft products all suffer from creeping featuritis. It’s because users are so damn creative.

In developing modules for CMS Made Simple, I’m continuously receiving feature requests. Some are reasonable. Many are not.

Reasonable:
“Could you extend your can opener to handle sardine tins as well as standard cylindrical cans?”

Unreasonable:
“I know it’s supposed to be a can opener, but I find it works well in extricating people from burning wreckage, so I was wondering if you could add a fire-hose feature, and maybe a siren or flashing lights.”

The skill I need to develop is saying “no” in an acceptable way. It’s easy when the requester phrases the question like “add this, or I won’t use your stupid system!” Yeah. Well. Golly, I’ll be awfully sad to see ’em go. Similarly, the ever-popular “it’s embarrassing to tell my client that I can’t provide them feature Y because you didn’t implement it!” always brings me copious, bitter tears at the thought of their shame and tragedy. Cry me a river indeed.

It’s a bit harder when the request is along the lines of “to be a truly professional system, it really should have feature Z,” because then I have to assess whether or not it really would be a professional grade feature.

Hardest yet is when someone requests a feature and gives at least a basic explanation of why it would be good for the project as a whole (in addition to their specific need). Even if I can’t see that I would use the feature myself, this will often sway me and I’ll add features, even against my better judgment.

Then, of course, there’s cash, which has a peculiar way of getting features added, no matter how ridiculous.


Fri, 21 Mar 2008

Interesting Image Problem

— SjG @ 3:33 pm

So we had a jpeg image from someone, and were distributing it through a web-based system (note that all non-technical details in this whole posting will be presented in annoyingly vague language). The web-based system is PHP and uses GD-lib. GD-lib successfully thumbnails the images, but when the images are downloaded, both Firefox and IE7 complain that the image has errors:

The image file "foo.bar.baz.quux.jpg" cannot be displayed, because it contains errors.

Windows image browser shows the image successfully, and Photoshop happily opens it. Looking at the file itself, I can see that it *is* a JFIF file (i.e., a valid jpeg). It starts with the FFD8 header, etc. It does have some strange characters in the IPTC data. This turns out to be a red herring, however. The problem turns out to be that it’s a jpeg image, but it’s using an 8-bit CMYK color space, which isn’t supported by Firefox, IE6, or IE7.

Firefox/Mozilla will be supporting CMYK jpegs in the future. Opera already does. I’m not sure if IE8 will.
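
Since the system is PHP anyway, a check at upload time could flag these files before anyone downloads them. This is a sketch rather than what the system actually does: is_cmyk_jpeg() is a made-up name, and it relies on getimagesize() reporting a channels value of 4 for CMYK JPEGs and 3 for RGB ones.

```php
<?php
// Hypothetical helper, not part of the original system.
// Returns true for a CMYK JPEG, false for any other readable image,
// and null when the file cannot be read as an image at all.
function is_cmyk_jpeg(string $path): ?bool
{
    // getimagesize() returns false for a missing file or a non-image;
    // the @ keeps the accompanying warning quiet.
    $info = @getimagesize($path);
    if ($info === false) {
        return null;
    }
    if ($info[2] !== IMAGETYPE_JPEG) {
        return false;
    }
    // For JPEGs, 'channels' is 3 for RGB and 4 for CMYK.
    return isset($info['channels']) && $info['channels'] === 4;
}
```

What to do with a flagged file (reject it, or convert it to RGB with some other tool) is a separate decision that this sketch leaves open.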

Later, I found a blog entry on this very topic that, strangely, I didn’t find when I Googled the error message. But there is good information out there. In fact, if you know to include the term “CMYK” you get tons of useful responses.


Thu, 3 Jan 2008

Backups, Updated Again

— SjG @ 5:11 pm

Had some updates to the backup script which I never published.

Here it is.

Enjoy!

Backup Scripts

(for background, see Automated Backups.)