So it’s a familiar problem, where you’re developing a data-driven application, and you want to optimize the queries that will run against your database (I’ll have more interesting stuff on this later). The problem, of course, is that to really optimize those queries, you need a lot of sample data.
So I needed to do some address lookup code against a huge collection of users. But because there was the possibility of having to demo the prototype, I really didn’t want 100,000 users named “Foo McBar” living at “10101 Binary Place.” So, with the help of the almighty Internet, the all-frobnicating Perl, and the all-knowing US Bureau of the Census, I created a quick, semi-flexible script to generate people with plausible names and addresses that, if not Google-mappable, at least had agreement on city/state/zip. The city/state/zip is a collection of 250 random zip codes. If you have good zip code data, you can easily extend this to be complete! Names are generated from the most popular forenames and surnames, with a probabilistic bias towards the most common ones. The script also allows you to specify “pick one of n item” type fields, pick a number from a range, plausible email addresses, not-very-plausible phone numbers with or without extensions, and the ability to export as CSV or tab-delimited.
In principle, this should be easy to adapt to other countries, although you’ll need lists of common first names, surnames, street names, and a way of mapping cities to regions, states, districts, cantons, or whatever’s appropriate.
You can grab a copy of it here. It requires a Perl interpreter with the Text::CSV and Getopt::Long CPAN modules.
Usage: user-data-maker.pl [OPTIONS]
-t, --header : header, a colon-delimited list of column headers
-f, --format : format string, a colon-delimited list of column contents
data types:
fn - first name
ln - last name
a1 - street address
a2 - apartment number
c - city*
s - state*
z - zip 5*
e - email address
pne - phone (US), no extension
pwe - phone (US), with extension
[a,b,c] - one of a, b, or c
{a,b,c} - one of a, b, or c in decreasing probability
[x-y] - a number between x and y, inclusive
* city, state, and zip will be agree to create a valid address
if you need multiple addresses, use the code ! to reset the
synch. The reset works on a left-to-right scan of the format string.
-n, --number : number of records to create
Flags:
-c, --csv : output CSV format (otherwise, tab-delimited).
-v, --(no)verbose : verbose mode (default false)
Example:
Viajante:samuelg$ user-data-maker.pl --header "First:Last:Age:Email" --format "fn:ln:[10-100]:e" -n 5 --c
First,Last,Age,Email
Margot,Sawyer,33,Margot.Sawyer@netscape.com
Francisco,Cantrell,18,Cantrell@sbcglobal.com
Lynetta,Orozco,28,Lynetta@mac.com
Latrice,Dunlap,41,Latrice.Dunlap@sbcglobal.com
Anissa,Fitzgerald,59,Anissa@hotmail.com
or, more exotically:
Viajante:samuelg$ user-data-maker.pl --header "First Name:Last Name:Address:City:State:Zip:Super Power" --format "fn:ln:a1:c:s:z:[Invisibility,Invincibility,X-Ray Vision,Flight,Likes Squirrels]" -n 5 -c
"First Name","Last Name",Address,City,State,Zip,"Super Power"
Roseanna,Best,"8821 7th Str.",Manati,PR,00674,Flight
Euna,Crawford,"8195 Lee Str.","Fort Washington",PA,19034,Invincibility
Ted,Williams,"7140 Birch Ave.",Monroe,CT,06468,Invincibility
Mariano,Miranda,"2657 1st Way",Lyford,TX,78569,Flight
Tammy,Flowers,"2135 Washington Blvd.",Duluth,MN,55806,"Likes Squirrels"
Enjoy!