fogbound.net




Sat, 13 Sep 2008

Generating Plausible Fake User Data

— SjG @ 6:45 pm

So it’s a familiar problem, where you’re developing a data-driven application, and you want to optimize the queries that will run against your database (I’ll have more interesting stuff on this later). The problem, of course, is that to really optimize those queries, you need a lot of sample data.

So I needed to do some address lookup code against a huge collection of users. But because there was the possibility of having to demo the prototype, I really didn’t want 100,000 users named “Foo McBar” living at “10101 Binary Place.” So, with the help of the almighty Internet, the all-frobnicating Perl, and the all-knowing US Bureau of the Census, I created a quick, semi-flexible script to generate people with plausible names and addresses that, if not Google-mappable, at least had agreement on city/state/zip. The city/state/zip comes from a collection of 250 random zip codes. If you have good zip code data, you can easily extend this to be complete! Names are generated from the most popular forenames and surnames, with a probabilistic bias towards the most common ones. The script also supports “pick one of n items” fields, numbers picked from a range, plausible email addresses, and not-very-plausible phone numbers with or without extensions, and it can export as CSV or tab-delimited.
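For the curious, here is a rough sketch of the two ideas in that paragraph: biasing name selection towards the common ones, and keeping city/state/zip in agreement. It is written in Python rather than the script's Perl, and it is not the script's actual code. The tiny lists, the function names, and the 1/(rank+1) weighting are all assumptions made for illustration.

```python
import random

# Illustrative stand-ins: the real script uses Census name lists and a
# table of 250 zip codes, none of which are reproduced here.
SURNAMES = ["Smith", "Johnson", "Williams", "Brown", "Jones"]  # most common first
PLACES = [
    ("Duluth", "MN", "55806"),
    ("Monroe", "CT", "06468"),
    ("Lyford", "TX", "78569"),
]

def pick_biased(items, rng):
    """Pick one item, favouring those nearer the front of the list."""
    # Assumed weighting of 1/(rank+1); the script's real bias may differ.
    weights = [1.0 / (rank + 1) for rank in range(len(items))]
    return rng.choices(items, weights=weights, k=1)[0]

def fake_person(rng):
    """Build one record whose city, state, and zip come from a single draw."""
    city, state, zip5 = rng.choice(PLACES)  # one draw keeps all three in agreement
    return {"last": pick_biased(SURNAMES, rng),
            "city": city, "state": state, "zip": zip5}

if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(fake_person(rng))
```

Drawing the whole (city, state, zip) tuple at once is what guarantees agreement; picking each column independently would give you Duluth, TX.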

In principle, this should be easy to adapt to other countries, although you’ll need lists of common first names, surnames, street names, and a way of mapping cities to regions, states, districts, cantons, or whatever’s appropriate.

You can grab a copy of it here. It requires a Perl interpreter with the Text::CSV and Getopt::Long CPAN modules.

Usage: user-data-maker.pl [OPTIONS]
   -t, --header : header, a colon-delimited list of column headers
   -f, --format : format string, a colon-delimited list of column contents
       data types:
         fn - first name
         ln - last name
         a1 - street address
         a2 - apartment number
         c - city*
         s - state*
         z - zip 5*
         e - email address
         pne - phone (US), no extension
         pwe - phone (US), with extension
         [a,b,c] - one of a, b, or c
         {a,b,c} - one of a, b, or c in decreasing probability
         [x-y] - a number between x and y, inclusive

         * city, state, and zip will agree to create a valid address
           if you need multiple addresses, use the code ! to reset the
           synch. The reset works on a left-to-right scan of the format string.

   -n, --number : number of records to create

   Flags:
  -c, --csv : output CSV format (otherwise, tab-delimited).
  -v, --(no)verbose : verbose mode (default false)

Example:


Viajante:samuelg$ user-data-maker.pl --header "First:Last:Age:Email" --format "fn:ln:[10-100]:e" -n 5 --c
First,Last,Age,Email
Margot,Sawyer,33,Margot.Sawyer@netscape.com
Francisco,Cantrell,18,Cantrell@sbcglobal.com
Lynetta,Orozco,28,Lynetta@mac.com
Latrice,Dunlap,41,Latrice.Dunlap@sbcglobal.com
Anissa,Fitzgerald,59,Anissa@hotmail.com

or, more exotically:


Viajante:samuelg$ user-data-maker.pl --header "First Name:Last Name:Address:City:State:Zip:Super Power" --format "fn:ln:a1:c:s:z:[Invisibility,Invincibility,X-Ray Vision,Flight,Likes Squirrels]" -n 5 -c
"First Name","Last Name",Address,City,State,Zip,"Super Power"
Roseanna,Best,"8821 7th Str.",Manati,PR,00674,Flight
Euna,Crawford,"8195 Lee Str.","Fort Washington",PA,19034,Invincibility
Ted,Williams,"7140 Birch Ave.",Monroe,CT,06468,Invincibility
Mariano,Miranda,"2657 1st Way",Lyford,TX,78569,Flight
Tammy,Flowers,"2135 Washington Blvd.",Duluth,MN,55806,"Likes Squirrels"
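If you want to see how the list and range codes in the usage text might be interpreted, here is a minimal Python sketch (again, not the Perl script itself). It covers only the [a,b,c], {a,b,c}, and [x-y] types. The function names and the halving weights used for "decreasing probability" are assumptions, and splitting on colons means option values cannot themselves contain a colon.

```python
import random
import re

# [x-y] is only treated as a range when both ends are whole numbers,
# so a list containing "X-Ray Vision" is still read as a list.
RANGE = re.compile(r"^\[(\d+)-(\d+)\]$")

def render_field(token, rng):
    """Return one value for a single format token (list and range types only)."""
    m = RANGE.match(token)
    if m:  # [x-y]: a number between x and y, inclusive
        return str(rng.randint(int(m.group(1)), int(m.group(2))))
    if token.startswith("[") and token.endswith("]"):  # [a,b,c]: uniform pick
        return rng.choice(token[1:-1].split(","))
    if token.startswith("{") and token.endswith("}"):  # {a,b,c}: earlier is likelier
        options = token[1:-1].split(",")
        weights = [0.5 ** i for i in range(len(options))]  # assumed halving
        return rng.choices(options, weights=weights, k=1)[0]
    raise ValueError(f"unsupported token: {token}")

def render_row(fmt, rng):
    """Split a colon-delimited format string and render each column."""
    return [render_field(tok, rng) for tok in fmt.split(":")]

if __name__ == "__main__":
    print(render_row("[10-100]:[Flight,Likes Squirrels]:{gold,silver,bronze}",
                     random.Random()))
```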

Enjoy!


Mon, 1 Sep 2008

Baltasar and Blimunda

— SjG @ 3:56 pm

A translation of Memorial do Convento, written by José Saramago, translated by Giovanni Pontiero, Harcourt Brace & Co, 1987.

I don’t really understand why Saramago’s work is so compelling. And I can’t give a pithy summary as to what this book was about. I could, perhaps, follow the example of countless others and say this book is a love story set in the still-mostly-medieval Portugal of the 18th century, but this would be inaccurate. Yes, there’s the romance between the titular characters, but one that we see as distant outsiders. We know more about the quarrying, transporting, and history of the portico-stone of the Basilica at Mafra or of the peccadillos of the Portuguese royal family than we really do of Baltasar and Blimunda’s love.

If anything, this book is less a romance than a musing on the fundamental inequities of life, a rambling explication of the abuse of power, or a celebration of the lives of simple country folk. Yet these descriptions too do poor service to the book; they leave out the fire, the lyricism, the anger and passion. It’s a powerful monument to the downtrodden and forgotten; a sneer at the Church; an idyll; a history; a tall-tale of music, second-sight, and alchemy.

And, like The Cave (the only other Saramago I’ve read), the language is challenging, funny, brilliant, and the book is hard to stop thinking about.


Fri, 29 Aug 2008

“Save the Planet”

— SjG @ 8:18 pm

Maybe I’m a curmudgeon, but I’m getting heartily sick of the exhortation to “save the planet” (and even the debate about cretinous comments by Rep. Michele Bachmann about salvation).

But here’s the deal — when people say “save the planet,” they don’t mean that. Come on. We puny humans can’t destroy the planet. Sure, we can poison the surface, and make it inhospitable for many of the species who presently inhabit the place (ourselves included). Yes, we can wipe out forests and cause extinctions. But the planet’s been through worse — much worse — and probably will go through worse again.

So let’s can the “save the planet” talk and say what we really mean: preserving conditions that keep us comfortable.

Rant over.


Thu, 7 Aug 2008

Complete Tales of Washington Irving

— SjG @ 10:23 pm

Edited with an introduction by Charles Neider, Da Capo Press, 1998.

(This book is only 798 pages, but I’ve been reading it for over a year. And you thought I’d just given up on posting about books.)

Washington Irving is known for a number of things: being the first professional literary writer of North America, creating the character Diedrich Knickerbocker (for whom New York is called the Knickerbocker State), originating numerous popular legends (e.g., that people thought the earth was flat until Columbus), and, of course, authoring a few famous stories such as The Legend of Sleepy Hollow, Rip Van Winkle, and The Devil and Tom Walker. According to Neider, there was an anti-Irving backlash in the 1930s, which is probably why I was only familiar with the three tales mentioned above.

Irving is an amazing storyteller. Even given nearly two hundred years’ gap, his writing is still crystal clear and humorous and evocative and beautiful. He sketches out the Kentucky frontier, ghost-plagued swamps of New England, pre-Revolutionary War Dutch settlements in New York, medieval Spain, the mountains of Italy, and more with equal skill, each believable and very visually rendered. He tells rip-roaring adventures, satires, or fairy tales in those contexts. Some are simple — predictable, even, twee or corny to the modern reader — and yet the enthusiasm and charm with which he writes them makes it easy to forgive.

What really shines through in this collection of sixty some-odd tales, however, is how much Irving loves storytelling. He likes it so much that many of them are really framing stories, wherein the narrator meets up with some other character who tells a story — which may well itself be a framing story. Sometimes, I found myself popping out of a story-stack five or six deep.

Don’t let Hollywood’s pathetic interpretations sell Irving short. These are a lot of fun to read.


Fri, 18 Jul 2008

Using Regular Expressions for HTML Processing in PHP

— SjG @ 4:16 pm

Well, not really. This is just one example of a bad approach.

The problem: an HTML file is read, but needs to be entity-escaped. However, not all entities need escaping. Specifically, double quotes within anchor tags need to be left alone.

The right solution: process the HTML via a DOM parser, escape nodes that are not anchor tags. Oh, but did I mention these HTML files may be crappy, non-validating files, or even snippets?

The next solution: Use a regular expression. Yes, this is ugly. Yes, it also works 🙂

Originally, I tried using variable-length lookahead, but ran into problems (PHP 4.x). But PHP provides another solution which is perfect for this sort of thing. Here’s the code:

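// Swap double quotes inside a matched anchor tag for a placeholder.
function pre_esc_quotes($inner)
{
    return preg_replace('/"/','QUOTE',$inner[0]);
}
// Restore the double quotes inside a matched (now entity-escaped) anchor tag.
function post_esc_quotes($inner)
{
    return preg_replace('/QUOTE/','"',$inner[0]);
}
$tmp = preg_replace_callback('/<a([^>]*?)>/s','pre_esc_quotes',$raw_html);
$tmp = htmlentities($tmp);
// After htmlentities(), the tag delimiters are &lt; and &gt;.
echo preg_replace_callback('/&lt;a(.*?)&gt;/s','post_esc_quotes',$tmp);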

This, of course, presumes that the string “QUOTE” won’t show up anywhere in your raw html. Consider replacing it with an opaque string (like “JHG54JHGH76699597569” or something creative and long that will choke the interpreter).

This code is furthermore inefficient in a number of ways. It’s not something you should use. But it does show how preg_replace_callback avoids some scary regex work.
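For comparison, the same protect/escape/restore idea can be sketched in Python, with the fixed “QUOTE” string swapped for a random placeholder. This is only a sketch and carries the same caveats as the PHP. The function name is made up, the pattern uses \b so that tags like <abbr> are not mistaken for anchors, and the final step is a plain string replace rather than a second callback, since the placeholder only ever appears inside the anchor tags.

```python
import html
import re
import uuid

def escape_except_anchor_quotes(raw_html):
    """Entity-escape raw_html, leaving double quotes inside <a ...> tags alone."""
    # Random placeholder; assumed never to occur in the input.
    token = "Q" + uuid.uuid4().hex
    # 1. Hide the double quotes inside each opening anchor tag.
    hidden = re.sub(r"<a\b[^>]*>",
                    lambda m: m.group(0).replace('"', token),
                    raw_html)
    # 2. Escape everything; html.escape converts &, <, >, and quotes.
    escaped = html.escape(hidden)
    # 3. Put the hidden quotes back.
    return escaped.replace(token, '"')

if __name__ == "__main__":
    print(escape_except_anchor_quotes('<a href="x.html">say "hi"</a>'))
```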