fogbound.net

Page 1 of 212

Tue, 20 Dec 2011

Kerning Pairs

(— SjG @ 11:22 pm)

I’ve been playing around with font creation for a couple of projects (more on that will be posted here at some point). One of the more surprising aspects of computer typography is its sheer complexity. I may once have naively thought that it was just a matter of splatting characters … er … glyphs out to some display device based on simple shapes, but I was sadly mistaken. In fact, TrueType and its successor OpenType not only use complex mathematical equations for creating the curves that define font outlines, but they also contain rules for scaling, hints for rendering these “mathematically perfect” curves on a bit-mapped display, and metrics for spacing character combinations. OpenType has its own internal language for doing such complex tasks as replacing some glyph pairs with ligatures, or doing fancy substitutions of glyphs depending on the surrounding glyphs or other rules. This allows ambitious font designers to do such things as imitate handwriting or handle non-Roman scripts naturally (for example, in Semitic languages, the same letter may be written quite differently if it’s at the beginning or end of a word, and sometimes also depending on where it is in the sentence).

There’s a lifetime of complexity in typography, and, as yet, I’ve only been swimming in the shallow end. Still, I was deep enough to be playing with kerning pairs. Kerning involves moving letters so they fit together nicely. For a visual demonstration and nice game, take a look here. This does more to explain kerning than anything I could write.

The program I’m using for font creation has a facility for creating kerning pair metrics. You can type in a pair of letters, and then adjust the spacing for that particular pair. Of course, you can’t really go through and tune them all [1]: consider the case where you only have upper case letters and digits from zero through nine. Neglecting accented characters, we’re talking 36 glyphs, or 666 unordered combinations (1,296 if you count “AV” and “VA” separately, as a kerning table does). Now throw in lower case, punctuation, etc., and you have an enormous list of possible combinations to tune.
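
The arithmetic is easy to check. The following is a small Python sketch (not part of the original tooling) that counts the pairs for the 36-glyph case both ways: 666 if order is ignored and doubled letters are included, 1,296 if “AV” and “VA” count as separate pairs.

```python
from itertools import combinations_with_replacement, product
import string

# 26 uppercase letters plus 10 digits
glyphs = string.ascii_uppercase + string.digits

# Unordered pairs, doubles such as "AA" included: n * (n + 1) / 2
unordered = sum(1 for _ in combinations_with_replacement(glyphs, 2))

# Ordered pairs, where "AV" and "VA" are distinct: n * n
ordered = sum(1 for _ in product(glyphs, repeat=2))

print(len(glyphs), unordered, ordered)
```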

But think about it for a moment. There are character combinations that will want tuning in just about every kind of Roman-character-based font, like “VA” or “To” or “ij”. Equally, depending on your language, there are character combinations that will almost never occur. For example, in English, you’ll almost never see a lowercase letter followed immediately by an uppercase, or combinations like “Yq” or “Td” or “zn” in sequence.

So in the interest of selecting kerning pairs intelligently, I wrote a script to analyze character combinations. My target audience is English-speakers, so for my source data, I used English-language texts. But which English texts to use? Being an absurdist, I selected Emma by Jane Austen; At The Mountains of Madness by H. P. Lovecraft; The Adventures of Tom Sawyer by Mark Twain; An Inquiry into the Nature and Causes of the Wealth of Nations by Adam Smith; Alice, or The Mysteries, Complete by Edward Bulwer Lytton; Tales of the Jazz Age by F. Scott Fitzgerald; Tarzan of the Apes by Edgar Rice Burroughs; An Unsocial Socialist by George Bernard Shaw; the collected writings of Thomas Jefferson; the complete works of William Shakespeare; the Project Gutenberg license text; and the Unix version of the English Dictionary that lives in /usr/share/dict/words.

To analyze the data, I loaded up the text, and stripped out all but the letters, digits, and the following punctuation: period, single-quote, double-quotes, exclamation mark, question mark, comma, semicolon, colon, left parenthesis, and right parenthesis [2]. I took all of the two-character combinations, and filtered out all pairs where one character was a space. Then I simply counted the number of instances.
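
The counting step can be sketched in a few lines of Python. This is not the original script, just an illustration of the approach described above; the function name `count_pairs` is made up, and keeping whitespace during the stripping step (so that pairs spanning a word boundary can be dropped afterwards) is an assumption.

```python
from collections import Counter
import string

# Characters kept for analysis: letters, digits, and the ten punctuation
# marks listed above. Whitespace is also kept (an assumption) so that
# pairs touching a word boundary can be recognised and skipped.
ALLOWED = set(string.ascii_letters + string.digits + ".'\"!?,;:()")

def count_pairs(text: str) -> Counter:
    """Count adjacent two-character pairs, skipping any pair that touches whitespace."""
    # Strip everything except the allowed characters and whitespace.
    cleaned = [ch for ch in text if ch in ALLOWED or ch.isspace()]
    counts = Counter()
    for left, right in zip(cleaned, cleaned[1:]):
        if left.isspace() or right.isspace():
            continue
        counts[left + right] += 1
    return counts

sample = "To be, or not to be: that is the question."
for pair, n in count_pairs(sample).most_common(5):
    print(pair, n)
```

Note that stripping a character makes its neighbours adjacent, so a hyphenated “o-k” would be counted as the pair “ok”. Whether that is desirable depends on the source text.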

Of course, the statistical analysis doesn’t match the experience of reading. While the frequency of combinations that start with an uppercase character followed by a lowercase character is low, those are possibly more important than combinations of lowercase characters. After all, they start each sentence, and are visually very prominent. Additionally, the shapes of the letters involved make these combinations more likely to need kerning adjustments. With these thoughts in mind, I generated a file of statistics from the same texts, but based solely on combinations containing an uppercase character.
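
Restricting the statistics to pairs containing a capital is a simple filter over the pair counts. A minimal Python sketch, assuming the counts are held in a `Counter` keyed by two-character strings; the function name and the tallies are invented for illustration and are not figures from the real analysis:

```python
from collections import Counter

def pairs_with_uppercase(counts: Counter) -> Counter:
    """Keep only the pairs in which at least one character is an uppercase letter."""
    return Counter({pair: n for pair, n in counts.items()
                    if any(ch.isupper() for ch in pair)})

# Made-up tallies, for illustration only.
example = Counter({"th": 120, "Th": 9, "he": 110, "VA": 2, "I.": 4})
print(pairs_with_uppercase(example).most_common())
```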

You can download the lists for your own nefarious purposes. Here’s the complete list, and here’s the list containing caps. The complete list contains what appears to be bad data, but keep in mind that the texts included such things as Roman numeral chapter headers, older style numeric abbreviations (e.g., “3dly” and “23d”), some currency abbreviations (e.g., “1s.6d” or “1/6d”, both of which stand for 1 shilling and sixpence), and poetic contractions (e.g., “oer,” “stol’n,” or “capdv’d”). I also see what I suspect are errors due to imperfect OCR of the original texts.

Last, but not least, I have two files, The 128 Vitally Important Kerning Pairs and The 255 Important Kerning Pairs With One Repeat, each of which gathers the most common combinations from the other two lists into a single text to examine when testing a font.

[1] Ideally, the way you define the spacing of the glyphs themselves saves you from having to tune all combinations. Most should start out looking pretty good. But you do, of course, want your font to lay out perfectly, hence the rest of this discussion.

[2] This was admittedly an arbitrary choice of allowable punctuation. I also excluded accented characters like ü and à which would obviously need to be taken into consideration for many European languages. Since my focus was on English, I deemed them rare enough to ignore.

Tue, 18 Oct 2011

Publishing Old Projects

I’ve been publishing a bunch of old projects that I may have posted here, or simply left on my hard drive to suffer the slings and arrows of outrageous bit-rot. Most of these are projects that I created for some specific purpose or another, and either coded to the point where I’m satisfied with them or abandoned.

I’m publishing this stuff in the hopes that it’ll be useful to somebody somewhere. In some cases, the code’s primary use may be as an example of how not to accomplish a task. In other cases, they’re projects that are being used in mission-critical operations, and so are reasonably robust.

I’ll be maintaining them on GitHub, if you want to get creative with the definition of “maintaining.”

Mon, 4 Oct 2010

More Plausible User Data

A few years back, I posted a quick’n’dirty tool for generating plausible user data. I had a need for some improvements, so I’m posting the new version here.

The new version supports back-references, composite fields, and SQL output. So, for example, you could do:

./user-data-maker.pl -t id:lname:fname:city:state_code:zip:company -f i:ln:fn:c:s:z:/1+^+[Cars,Trucks,Boats,Planes,Motorcycles,Ships,Trains]+^+of+^+/3 -s -m tbl_dealer -n 5

and get the following output:
-- generated data from ./user-data-maker.pl
INSERT INTO tbl_dealer (id,lname,fname,city,state_code,zip,company) VALUES (0,'Nelson','Leslee','Akron','OH',44311,'Nelson Boats of Akron');
INSERT INTO tbl_dealer (id,lname,fname,city,state_code,zip,company) VALUES (1,'Bowen','Beatriz','Miami','FL',33176,'Bowen Trucks of Miami');
INSERT INTO tbl_dealer (id,lname,fname,city,state_code,zip,company) VALUES (2,'Hammond','Raymond','Ninilchik','AK',99639,'Hammond Motorcycles of Ninilchik');
INSERT INTO tbl_dealer (id,lname,fname,city,state_code,zip,company) VALUES (3,'Kim','Arielle','Columbus','MI',48063,'Kim Ships of Columbus');
INSERT INTO tbl_dealer (id,lname,fname,city,state_code,zip,company) VALUES (4,'Estrada','Warner','Iuka','IL',62849,'Estrada Cars of Iuka');
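
Reading the example above, the composite format appears to work like this: `/1` and `/3` copy the values of columns 1 and 3 (counting from zero, so `lname` and `city`), `^` becomes a space, `+` joins the pieces, and a bracketed list picks one item. That reading is inferred from the sample output alone, not from the script’s source, so treat the following Python sketch as an illustration only; `resolve_composite` is a hypothetical name.

```python
import random

def resolve_composite(spec: str, row: list[str], rng: random.Random) -> str:
    """Build one composite value from a '+'-separated spec (syntax inferred from the example)."""
    parts = []
    for token in spec.split("+"):
        if token == "^":
            parts.append(" ")                  # '^' appears to stand for a space
        elif token.startswith("/") and token[1:].isdigit():
            parts.append(row[int(token[1:])])  # '/N' copies column N, zero-based
        elif token.startswith("[") and token.endswith("]"):
            parts.append(rng.choice(token[1:-1].split(",")))  # pick one list item
        else:
            parts.append(token)                # anything else is literal text
    return "".join(parts)

# Columns already generated for one record: id, lname, fname, city, state_code, zip
row = ["0", "Nelson", "Leslee", "Akron", "OH", "44311"]
spec = "/1+^+[Cars,Trucks,Boats,Planes,Motorcycles,Ships,Trains]+^+of+^+/3"
print(resolve_composite(spec, row, random.Random()))
```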

Nothing earth-shattering, but useful to me. Maybe to you too!

Download it here: user-data-maker.pl.gz

Sat, 13 Sep 2008

Generating Plausible Fake User Data

So it’s a familiar problem, where you’re developing a data-driven application, and you want to optimize the queries that will run against your database (I’ll have more interesting stuff on this later). The problem, of course, is that to really optimize those queries, you need a lot of sample data.

So I needed to do some address lookup code against a huge collection of users. But because there was the possibility of having to demo the prototype, I really didn’t want 100,000 users named “Foo McBar” living at “10101 Binary Place.” So, with the help of the almighty Internet, the all-frobnicating Perl, and the all-knowing US Bureau of the Census, I created a quick, semi-flexible script to generate people with plausible names and addresses that, if not Google-mappable, at least had agreement on city/state/zip. The city/state/zip data comes from a collection of 250 random zip codes. If you have good zip code data, you can easily extend this to be complete! Names are generated from the most popular forenames and surnames, with a probabilistic bias towards the most common ones. The script also lets you specify “pick one of n items” fields, numbers picked from a range, plausible email addresses, and not-very-plausible phone numbers with or without extensions, and it can export as CSV or tab-delimited text.
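
The two ideas here, frequency-biased names and city/state/zip values that agree with each other, can be sketched in Python as follows. This is not the Perl script itself. The name weights are invented placeholders rather than Census figures, the three zip rows are borrowed from sample output elsewhere on this page, and the function names are hypothetical.

```python
import random

# Placeholder (name, relative weight) tables; the weights are made up.
FORENAMES = [("Mary", 100), ("James", 90), ("Latrice", 5), ("Mariano", 3)]
SURNAMES = [("Smith", 100), ("Johnson", 80), ("Williams", 70), ("Orozco", 5)]

# Each row keeps city, state, and zip together so they always agree.
ZIP_ROWS = [("Akron", "OH", "44311"), ("Miami", "FL", "33176"), ("Duluth", "MN", "55806")]

def pick_weighted(rng: random.Random, table: list[tuple[str, int]]) -> str:
    """Pick a name with probability proportional to its weight."""
    names, weights = zip(*table)
    return rng.choices(names, weights=weights, k=1)[0]

def pick_person(rng: random.Random) -> dict[str, str]:
    """Generate one person whose city, state, and zip come from a single row."""
    city, state, zip_code = rng.choice(ZIP_ROWS)
    return {
        "first": pick_weighted(rng, FORENAMES),
        "last": pick_weighted(rng, SURNAMES),
        "city": city,
        "state": state,
        "zip": zip_code,
    }

rng = random.Random()
for _ in range(3):
    print(pick_person(rng))
```

Choosing a whole row at once, rather than picking city, state, and zip independently, is what keeps the address internally consistent.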

In principle, this should be easy to adapt to other countries, although you’ll need lists of common first names, surnames, street names, and a way of mapping cities to regions, states, districts, cantons, or whatever’s appropriate.

You can grab a copy of it here. It requires a Perl interpreter with the Text::CSV and Getopt::Long CPAN modules.

Usage: user-data-maker.pl [OPTIONS]
   -t, --header : header, a colon-delimited list of column headers
   -f, --format : format string, a colon-delimited list of column contents
       data types:
         fn - first name
         ln - last name
         a1 - street address
         a2 - apartment number
         c - city*
         s - state*
         z - zip 5*
         e - email address
         pne - phone (US), no extension
         pwe - phone (US), with extension
         [a,b,c] - one of a, b, or c
         {a,b,c} - one of a, b, or c in decreasing probability
         [x-y] - a number between x and y, inclusive

         * city, state, and zip will agree to create a valid address
           if you need multiple addresses, use the code ! to reset the
           synch. The reset works on a left-to-right scan of the format string.

   -n, --number : number of records to create

   Flags:
   -c, --csv : output CSV format (otherwise, tab-delimited).
   -v, --(no)verbose : verbose mode (default false)

Example:


Viajante:samuelg$ user-data-maker.pl --header "First:Last:Age:Email" --format "fn:ln:[10-100]:e" -n 5 --c
First,Last,Age,Email
Margot,Sawyer,33,Margot.Sawyer@netscape.com
Francisco,Cantrell,18,Cantrell@sbcglobal.com
Lynetta,Orozco,28,Lynetta@mac.com
Latrice,Dunlap,41,Latrice.Dunlap@sbcglobal.com
Anissa,Fitzgerald,59,Anissa@hotmail.com

or, more exotically:


Viajante:samuelg$ user-data-maker.pl --header "First Name:Last Name:Address:City:State:Zip:Super Power" --format "fn:ln:a1:c:s:z:[Invisibility,Invincibility,X-Ray Vision,Flight,Likes Squirrels]" -n 5 -c
"First Name","Last Name",Address,City,State,Zip,"Super Power"
Roseanna,Best,"8821 7th Str.",Manati,PR,00674,Flight
Euna,Crawford,"8195 Lee Str.","Fort Washington",PA,19034,Invincibility
Ted,Williams,"7140 Birch Ave.",Monroe,CT,06468,Invincibility
Mariano,Miranda,"2657 1st Way",Lyford,TX,78569,Flight
Tammy,Flowers,"2135 Washington Blvd.",Duluth,MN,55806,"Likes Squirrels"
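
For readers curious how the bracketed field types used in these examples might be handled, here is a minimal Python sketch (not the Perl implementation). The function name `make_value` is hypothetical, and the halving weights used for the `{a,b,c}` form are an assumption, since the usage text only says “decreasing probability”.

```python
import random
import re

def make_value(spec: str, rng: random.Random) -> str:
    """Generate one value for a '[x-y]', '[a,b,c]', or '{a,b,c}' field spec."""
    # '[x-y]': a number between x and y, inclusive. Only all-digit bounds
    # count as a range, so a list item such as 'X-Ray Vision' is not misread.
    range_match = re.fullmatch(r"\[(\d+)-(\d+)\]", spec)
    if range_match:
        low, high = map(int, range_match.groups())
        return str(rng.randint(low, high))
    # '[a,b,c]': one of the items, each equally likely.
    if spec.startswith("[") and spec.endswith("]"):
        return rng.choice(spec[1:-1].split(","))
    # '{a,b,c}': one of the items in decreasing probability.
    # Assumption: each item is half as likely as the one before it.
    if spec.startswith("{") and spec.endswith("}"):
        options = spec[1:-1].split(",")
        weights = [2.0 ** -i for i in range(len(options))]
        return rng.choices(options, weights=weights, k=1)[0]
    raise ValueError(f"unrecognised field spec: {spec!r}")

rng = random.Random()
print(make_value("[10-100]", rng))
print(make_value("[Invisibility,Invincibility,X-Ray Vision,Flight,Likes Squirrels]", rng))
print(make_value("{common,less common,rare}", rng))
```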

Enjoy!

Thu, 3 Jan 2008

Backups, Updated Again

Had some updates to the backup script which I never published.

Here it is.

Enjoy!

Backup Scripts

(for background, see Automated Backups.)
