fogbound.net




Sat, 13 Sep 2008

Generating Plausible Fake User Data

— SjG @ 6:45 pm

It’s a familiar problem: you’re developing a data-driven application, and you want to optimize the queries that will run against your database (I’ll have more interesting stuff on this later). The catch, of course, is that to really optimize those queries, you need a lot of sample data.

I needed to write some address lookup code that would run against a huge collection of users. But because I might have to demo the prototype, I really didn’t want 100,000 users named “Foo McBar” living at “10101 Binary Place.” So, with the help of the almighty Internet, the all-frobnicating Perl, and the all-knowing US Bureau of the Census, I created a quick, semi-flexible script to generate people with plausible names and addresses that, if not Google-mappable, at least agree on city/state/zip.

The city/state/zip data is a collection of 250 random zip codes; if you have good zip code data, you can easily extend it to be complete. Names are drawn from lists of the most popular forenames and surnames, with a probabilistic bias towards the most common ones. The script can also generate “pick one of n items” fields, numbers from a range, plausible email addresses, and not-very-plausible phone numbers with or without extensions, and it can export as CSV or tab-delimited text.
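For the curious, here is a minimal sketch of one way to get that bias towards common names. It is not lifted from the script itself; it assumes a hypothetical whitespace-separated file of "name weight" lines (say, a surname and its relative frequency), and the function name pick_weighted is made up for illustration. Each name is picked with probability proportional to its weight by walking a running cumulative total:

```shell
#!/bin/sh
# pick_weighted FILE [N] [SEED]
#   FILE: hypothetical input, one "name weight" pair per line
#   N:    how many names to print (default 1)
#   SEED: optional seed for repeatable output (assumption: handy for demos)
pick_weighted() {
    awk -v n="${2:-1}" -v seed="${3:-}" '
        BEGIN { if (seed == "") srand(); else srand(seed) }
        # Remember each name together with the running total of the weights.
        NF >= 2 { count++; name[count] = $1; total += $2; cum[count] = total }
        END {
            if (total <= 0) exit 1
            for (k = 0; k < n; k++) {
                # r lands in [0, total); the first cumulative total above it wins.
                r = rand() * total
                for (i = 1; i <= count; i++)
                    if (r < cum[i]) { print name[i]; break }
            }
        }' "$1"
}

# Usage (surnames.txt is a hypothetical file):
#   pick_weighted surnames.txt 5
```

Note that awk's srand() with no argument seeds from the clock, so asking for all the names you need in one call (the N argument) works better than calling the function repeatedly within the same second.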

In principle, this should be easy to adapt to other countries, although you’ll need lists of common first names, surnames, street names, and a way of mapping cities to regions, states, districts, cantons, or whatever’s appropriate.

You can grab a copy of it here. It requires a Perl interpreter with the Text::CSV and Getopt::Long CPAN modules.

Usage: user-data-maker.pl [OPTIONS]
   -t, --header : header, a colon-delimited list of column headers
   -f, --format : format string, a colon-delimited list of column contents
       data types:
         fn - first name
         ln - last name
         a1 - street address
         a2 - apartment number
         c - city*
         s - state*
         z - zip 5*
         e - email address
         pne - phone (US), no extension
         pwe - phone (US), with extension
         [a,b,c] - one of a, b, or c
         {a,b,c} - one of a, b, or c in decreasing probability
         [x-y] - a number between x and y, inclusive

         * city, state, and zip will agree to create a valid address;
           if you need multiple addresses, use the code ! to reset the
           synch. The reset works on a left-to-right scan of the format string.

   -n, --number : number of records to create

   Flags:
  -c, --csv : output CSV format (otherwise, tab-delimited).
  -v, --(no)verbose : verbose mode (default false)

Example:


Viajante:samuelg$ user-data-maker.pl --header "First:Last:Age:Email" --format "fn:ln:[10-100]:e" -n 5 --c
First,Last,Age,Email
Margot,Sawyer,33,Margot.Sawyer@netscape.com
Francisco,Cantrell,18,Cantrell@sbcglobal.com
Lynetta,Orozco,28,Lynetta@mac.com
Latrice,Dunlap,41,Latrice.Dunlap@sbcglobal.com
Anissa,Fitzgerald,59,Anissa@hotmail.com

or, more exotically:


Viajante:samuelg$ user-data-maker.pl --header "First Name:Last Name:Address:City:State:Zip:Super Power" --format "fn:ln:a1:c:s:z:[Invisibility,Invincibility,X-Ray Vision,Flight,Likes Squirrels]" -n 5 -c
"First Name","Last Name",Address,City,State,Zip,"Super Power"
Roseanna,Best,"8821 7th Str.",Manati,PR,00674,Flight
Euna,Crawford,"8195 Lee Str.","Fort Washington",PA,19034,Invincibility
Ted,Williams,"7140 Birch Ave.",Monroe,CT,06468,Invincibility
Mariano,Miranda,"2657 1st Way",Lyford,TX,78569,Flight
Tammy,Flowers,"2135 Washington Blvd.",Duluth,MN,55806,"Likes Squirrels"

Enjoy!


Tue, 20 May 2008

Email Round-Robin using Procmail

— SjG @ 2:56 pm

The need arose to have a specific email address round-robin (i.e., cycle through a collection of destination email addresses, handing each incoming message to the next one in the list).

A solution was achieved through use of procmail and a little perl script. It probably could be done more easily and/or better, but I figured other people might find this interesting.

So, first, an alias was created in /etc/aliases (used by postfix in this case, but it should work for sendmail, and variants should work for other MTAs):

rrtest:         |"/usr/bin/procmail -m /etc/postfix/roundrobin_procmail.rc"

Then, the following file was saved as /etc/postfix/roundrobin_procmail.rc:

:0 w:/tmp/rrlock
{
        :0
                dest=|/etc/postfix/rr.pl
        :0
                ! ${dest}
}

And then, of course, we need the perl program. Here’s /etc/postfix/rr.pl:

#!/bin/perl
# ----------------------------------------------------------
@recipients = (
'address1@sample.com',
'address2@sample.com',
'address3@sample.com'
);

$index_file = 'rr-index.txt';

# ----------------------------------------------------------

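# Read the last-used index; if there is no index file yet, start at the top.
$index_exists = 1;
open(IN,"</tmp/${index_file}") || ($index_exists = 0);
if ($index_exists)
        {
        $index = <IN>;
        close(IN);
        $index++;
        }
else
        {
        $index = 0;
        }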


if ($index > $#recipients)
        {
        $index = 0;
        }
open(OUT,">/tmp/${index_file}");
print OUT "$index\n";
close(OUT);

print STDOUT $recipients[$index];

exit 0;

Elegant? Not really. But it seems to work 🙂
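
For comparison, the same rotation can be sketched in plain shell. This is only an illustration, not what is running on the server: the function name next_recipient and the state-file argument are made up, and it does no locking of its own (in the setup above, procmail's lockfile does that job).

```shell
#!/bin/sh
# next_recipient STATE_FILE ADDRESS...
# Prints the next address in the rotation and records its index in STATE_FILE.
next_recipient() {
    state_file=$1; shift
    [ "$#" -gt 0 ] || return 1
    count=$#
    index=0
    if [ -r "$state_file" ]; then
        last=$(cat "$state_file")
        case $last in
            ''|*[!0-9]*) index=0 ;;                 # empty or garbage: start over
            *) index=$(( (last + 1) % count )) ;;   # advance, wrapping at the end
        esac
    fi
    printf '%s\n' "$index" > "$state_file"
    shift "$index"      # drop the addresses before the chosen one
    printf '%s' "$1"
}

# Usage (addresses as in the Perl script above):
#   next_recipient /tmp/rr-index.txt address1@sample.com address2@sample.com
```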


Thu, 3 Jan 2008

Backups, Updated Again

— SjG @ 5:11 pm

Had some updates to the backup script which I never published.

Here it is.

Enjoy!

Backup Scripts

(for background, see Automated Backups.)


Wed, 28 Nov 2007

Linux Security Camera Server on a Dell Vostro 400

— SjG @ 11:17 am

So I’ve been building a new security camera system. The last time I did this, I bought a Dell dual-core box and spent about a week installing Debian, building and rebuilding the kernel to support the dual cores and the BT878 video capture chipset, compiling and configuring motion, etc. It took a week of evenings and a weekend or two, because the Dell hardware wasn’t automatically supported; for example, it required special boot-time parameters to recognize the SATA controller.

So I’m building a new system on a Dell Vostro 400. Right off the bat, I ran into problems with installing the Debian net-install. I was booting off a CD, but the installer couldn’t find an ATA/IDE controller for which it had a driver. Weird. This article showed me the solution to that — set the Dell controller’s SATA mode to “RAID.”

But then I hit a wall with the Intel Gigabit network controller. I couldn’t find any workarounds for it, but, after extensive Googling, found that some of the Ubuntu people may have a patch. The posting was six months old.

So screw it, thought I, and downloaded a shiny new ISO of Ubuntu 7.10 Server Edition to see whether it would work.

Damn, am I impressed! Not only did it recognize the ATA/IDE controller and the network controller, it happily recognized the BT878-based card and loaded the kernel modules. It even has motion available as a package for easy installation. I was able to copy over all my support scripts and motion configuration files, and was up and running in less than two hours (and that includes setting up the web server, motd, special sshd tweaks, and all)!

Now, all I have to do is deal with my crisis of faith. Do I leave the Church of Debian for the radical new Ubuntuist movement?


Fri, 9 Nov 2007

Finding File and Directory Counts

— SjG @ 3:31 pm

So, in the process of organizing photographs, I wanted to examine my deeply-nested hierarchy to figure out how it’s possible I have 30,000 images (Aperture only wants me to have 10,000 in a project, so I need to re-organize the hierarchy even before I import).

So, I figured it’d be easy to use find to list all my directories, and how many images they contain. It turns out that (at least for me) it’s not.

My best stab so far is to use find and a loop, which gives me almost what I want (the count includes not only the images in each directory, but its subdirectories as well). It fails if there are too many directories. It’s good enough. But it’s not elegant.

So CLI Deities — how would you make this pretty?

find . -type d | while IFS= read -r dir; do echo $(ls -1 "$dir" | wc -l) "$dir"; done

Potential type-face issue disambiguation: after the ls, that first argument is a one, not an ell, although I suppose an ell would work too. The wc option is an ell.
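
One possible tidy-up, offered as a sketch rather than the definitive answer: ask find itself for the regular files directly inside each directory, so that subdirectories are not counted. This assumes a find that supports -maxdepth (the GNU and BSD/macOS versions do, but it is not in POSIX), and the function name count_files_per_dir is my own invention.

```shell
#!/bin/sh
# count_files_per_dir [ROOT]
# Prints "<count> <directory>" for ROOT (default ".") and every directory below it,
# counting only the regular files that sit directly in each directory.
count_files_per_dir() {
    find "${1:-.}" -type d | while IFS= read -r dir; do
        # -maxdepth 1 keeps find from descending; -type f skips subdirectories.
        n=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d '[:space:]')
        printf '%s %s\n' "$n" "$dir"
    done
}

# Usage: list the most crowded directories first
#   count_files_per_dir . | sort -rn | head
```

This counts every regular file, not just images; adding something like -name '*.jpg' to the inner find would narrow it down. File or directory names containing newlines will still confuse it.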