Fri, 18 Jul 2008

Using Regular Expressions for HTML Processing in PHP

— SjG @ 4:16 pm

Well, not really. This is just one example of a bad approach.

The problem: an HTML file is read, but needs to be entity-escaped. However, not all entities need escaping. Specifically, double quotes within anchor tags need to be left alone.

The right solution: process the HTML via a DOM parser, escape nodes that are not anchor tags. Oh, but did I mention these HTML files may be crappy, non-validating files, or even snippets?

The next solution: Use a regular expression. Yes, this is ugly. Yes, it also works 🙂

Originally, I tried using variable-length lookahead, but ran into problems (PHP 4.x). But PHP provides another solution which is perfect for this sort of thing. Here’s the code:

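function pre_esc_quotes($inner) {
    return preg_replace('/"/','QUOTE',$inner[0]); // hide quotes inside the matched anchor tag
}
function post_esc_quotes($inner) {
    return preg_replace('/QUOTE/','"',$inner[0]); // put the quotes back after escaping
}
$tmp = preg_replace_callback('/<a([^>]*?)>/s','pre_esc_quotes',$tmp);
$tmp = htmlentities($tmp);
echo preg_replace_callback('/&lt;a([^>]*?)&gt;/s','post_esc_quotes',$tmp);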

This, of course, presumes that the string “QUOTE” won’t show up anywhere in your raw HTML. Consider replacing it with an opaque string (like “JHG54JHGH76699597569” or something creative and long that will choke the interpreter). Stick to plain letters and digits, so the placeholder passes through the entity escaping unchanged.
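If you would rather not trust your luck with a hand-picked placeholder, here is a minimal sketch of generating one and checking that it really is absent from the input. The helper name make_placeholder is made up for this illustration, and the md5/uniqid combination is just one convenient way to get a long alphanumeric string, not anything the approach requires.

```php
<?php
// Hypothetical helper: build a placeholder that does not occur in $html.
function make_placeholder($html)
{
    do {
        // 'QUOTE_' plus 32 hex characters: only letters, digits and an
        // underscore, so entity escaping leaves the token untouched.
        $token = 'QUOTE_' . md5(uniqid((string) mt_rand(), true));
    } while (strpos($html, $token) !== false); // retry on the (unlikely) collision
    return $token;
}

// Sample input that happens to contain the word QUOTE already.
$html  = 'He said "hi" and <a href="x.html">QUOTE</a>';
$token = make_placeholder($html);
```

The callbacks would then need to use $token in place of the literal 'QUOTE', for example by reading it from a global or a static variable.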

This code is furthermore inefficient in a number of ways. It’s not something you should use. But it does show how preg_replace_callback avoids some scary regex work.
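For comparison, here is a rough sketch of a variant that skips the placeholder entirely: split the input on opening anchor tags, then escape each piece with different quote flags. This swaps in preg_split for the two preg_replace_callback passes, so it is a different technique from the code above. The function name escape_keep_anchor_quotes is invented for this example, and the UTF-8 charset is an assumption about the input.

```php
<?php
// Hypothetical helper, not the code from this post.
function escape_keep_anchor_quotes($html)
{
    // Split on opening anchor tags and keep the tags themselves.
    // With PREG_SPLIT_DELIM_CAPTURE the captured tags land at odd indexes.
    $parts = preg_split('/(<a\b[^>]*>)/s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        // Anchor tags: escape everything except quotes.
        // Everything else: escape double quotes as well.
        $flags = ($i % 2 === 1) ? ENT_NOQUOTES : ENT_COMPAT;
        $out .= htmlentities($part, $flags, 'UTF-8');
    }
    return $out;
}

echo escape_keep_anchor_quotes('Say "hi" to <a href="x.html">me</a>');
// prints: Say &quot;hi&quot; to &lt;a href="x.html"&gt;me&lt;/a&gt;
```

It is still regex-on-HTML, with all the fragility that implies, but it makes one pass over the input and cannot collide with anything in the text.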