I’ve been playing around with font creation for a couple of projects (more on that will be posted here at some point). One of the more surprising aspects of computer typography is the sheer complexity of it. I may have once naively thought that it was just a matter of splatting characters … er … glyphs out to some display device based on simple shapes, but I was sadly mistaken. In fact, TrueType and its successor OpenType not only use complex mathematical equations for creating the curves that define font outlines, but they also contain rules for scaling, hints for rendering these “mathematically perfect” curves on a bit-mapped display, and metrics for spacing character combinations. OpenType has its own internal language for doing such complex tasks as replacing some glyph pairs with ligatures, or doing fancy substitutions of glyphs depending on the surrounding glyphs or other rules. This allows ambitious font designers to do such things as imitate handwriting or handle non-Roman languages naturally (for example, in Semitic languages, the same letter may be written quite differently if it’s at the beginning or end of a word, and sometimes also depending on where it is in the sentence).
There’s a lifetime of complexity in typography, and, as yet, I’ve only been swimming in the shallow end. Still, I was deep enough to be playing with kerning pairs. Kerning involves moving letters so they fit together nicely. For a visual demonstration and nice game, take a look here. This does more to explain kerning than anything I could write.
The program I’m using for font creation has a facility for creating kerning pair metrics. You can type in a pair of letters, and then adjust the spacing for that particular pair. Of course, you can’t really go through and tune them all [1]: consider the case where you only have upper case letters and digits from zero through nine. Neglecting accented characters, we’re talking 36 glyphs, or 666 combinations even if you ignore order (1,296 if you count “AV” and “VA” separately, as kerning must). Now throw in lower case, punctuation, etc., and you have an enormous list of possible combinations to tune.
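To make the arithmetic concrete, here is a small sketch of the count for 36 glyphs. Counting each unordered combination once (including a glyph paired with itself) gives 666; treating “AV” and “VA” as different pairs, which is how kerning actually works, gives 1,296. The variable names are just for illustration.

```python
import string

# Uppercase letters plus the digits 0-9: 26 + 10 = 36 glyphs.
glyphs = string.ascii_uppercase + string.digits
n = len(glyphs)

# Ordered pairs: "AV" and "VA" are tuned separately.
ordered_pairs = n * n

# Unordered combinations, counting doubled glyphs such as "AA" once.
unordered_pairs = n * (n + 1) // 2

print(n, ordered_pairs, unordered_pairs)  # 36 1296 666
```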
But think about it for a moment. There are character combinations that will want tuning in just about every kind of Roman-character-based font, like “VA” or “To” or “ij”. Equally, depending on your language, there are character combinations that will almost never occur. For example, in English, you’ll almost never see a lowercase letter followed immediately by an uppercase one, or combinations like “Yq” or “Td” or “zn” in sequence.
So in the interest of selecting kerning pairs intelligently, I wrote a script to analyze character combinations. My target audience is English speakers, so for my source data, I used English-language texts. But which English texts to use? Being an absurdist, I selected Emma by Jane Austen, At the Mountains of Madness by H. P. Lovecraft, The Adventures of Tom Sawyer by Mark Twain, An Inquiry into the Nature and Causes of the Wealth of Nations by Adam Smith, Alice, or The Mysteries, Complete by Edward Bulwer-Lytton, Tales of the Jazz Age by F. Scott Fitzgerald, Tarzan of the Apes by Edgar Rice Burroughs, An Unsocial Socialist by George Bernard Shaw, the collected writings of Thomas Jefferson, the complete works of William Shakespeare, the Project Gutenberg license text, and the Unix version of the English dictionary that lives in /usr/share/dict/words.
To analyze the data, I loaded up the text and stripped out everything but letters, digits, and the following punctuation: period, single-quote, double-quotes, exclamation mark, question mark, comma, semicolon, colon, left parenthesis, and right parenthesis [2]. I then took all of the two-character combinations, filtered out every pair where either character was a space, and simply counted the number of instances of each.
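This is not the original script, just a minimal sketch of the same idea. It assumes that “letters” means unaccented ASCII letters, that whitespace is kept as a plain space so word boundaries survive, and that every other character is simply deleted. The names `KEEP` and `count_pairs` are made up for this example.

```python
import string
from collections import Counter

# Characters that survive the cleanup: letters, digits, and the ten
# punctuation marks listed above.
KEEP = set(string.ascii_letters + string.digits + ".'\"!?,;:()")

def count_pairs(text):
    """Count adjacent two-character combinations, skipping pairs with a space."""
    # Keep allowed characters, turn any whitespace into a single space,
    # and drop everything else (an assumption about what "stripped" means).
    cleaned = "".join(
        ch if ch in KEEP else " " if ch.isspace() else ""
        for ch in text
    )
    counts = Counter()
    for left, right in zip(cleaned, cleaned[1:]):
        if left != " " and right != " ":
            counts[left + right] += 1
    return counts

sample = "To be, or not to be: that is the question."
for pair, n in count_pairs(sample).most_common(5):
    print(pair, n)
```

Running the same function over the concatenated source texts, rather than one line of Hamlet, would give a frequency table of the kind described here.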
Of course, statistical analysis doesn’t match the experience of reading. While combinations that start with an uppercase character followed by a lowercase character are infrequent, they are possibly more important than combinations of lowercase characters. After all, they start each sentence and are very visually prominent. Additionally, the shapes of these letters increase the propensity of these combinations to need kerning adjustments. With these thoughts in mind, I generated a file of statistics from the same texts, but based solely on combinations containing an uppercase character.
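Given a table of pair counts, pulling out the pairs that contain an uppercase character is a short filter. The sketch below uses a hypothetical helper, `pairs_with_caps`, and invented counts purely to show the shape of the operation; the real numbers live in the downloadable lists.

```python
from collections import Counter

def pairs_with_caps(counts):
    """Keep only the pairs in which at least one character is uppercase."""
    return Counter({
        pair: n
        for pair, n in counts.items()
        if any(ch.isupper() for ch in pair)
    })

# Invented counts, for illustration only.
counts = Counter({"th": 120, "he": 110, "e.": 30, "Th": 14, "To": 9, "VA": 2})

caps = pairs_with_caps(counts)
for pair, n in caps.most_common():
    print(pair, n)
```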
You can download the lists for your own nefarious purposes. Here’s the complete list, and here’s the list containing caps. In the complete list, there is what appears to be bad data. Keep in mind that the text contained such things as Roman numeral chapter headers, older style numeric abbreviations (e.g., “3dly” and “23d”), some currency abbreviations (e.g., “1s.6d” or “1/6d”, both of which stand for 1 shilling and sixpence), and poetic contractions (e.g., “oer,” “stol’n,” or “capdv’d”). I also see what I suspect are errors due to imperfect OCR of the original texts.
Last, but not least, I have two files, The 128 Vitally Important Kerning Pairs and The 255 Important Kerning Pairs With One Repeat, which gather the most common combinations from the other two lists into a single text for examination when testing a font.
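One plausible way to turn a frequency table into a test text like this is to take the N most common pairs and join them with spaces. This is only a sketch and not necessarily how the two files were put together; `proof_text` is a hypothetical helper and the counts are invented.

```python
from collections import Counter

def proof_text(counts, n):
    """Join the n most common pairs into one space-separated test string."""
    return " ".join(pair for pair, _ in counts.most_common(n))

# Invented counts, for illustration only.
counts = Counter({"th": 120, "he": 110, "e.": 30, "Th": 14, "To": 9, "VA": 2})

print(proof_text(counts, 4))  # th he e. Th
```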
[1] Ideally, the way you define the spacing of the glyphs themselves saves you from having to tune all combinations. Most should start out looking pretty good. But you do, of course, want your font to lay out perfectly, hence the rest of this discussion.
[2] This was admittedly an arbitrary choice of allowable punctuation. I also excluded accented characters like ü and à, which would obviously need to be taken into consideration for many European languages. Since my focus was on English, I deemed them rare enough to ignore.