State Secrets: Place Names in Databases, News Story Bylines, and Elsewhere

A friend turned me on to this post which heralds a very significant change to The AP StyleBook that has editors and reporters atwitter today. It impacts datelines and article body text starting May 15.

Formerly, with some important exceptions, US place names in writing that follows AP style were followed by a state abbreviation — but it wasn’t the 2-character USPS abbreviation familiar to us in the database world, it was usually longer (Calif. for California, Ill. for Illinois, for example). Now it’ll be spelled out completely, except that in other countries (including Canada) the place name will simply be followed by the country name.

This strikes me as a lot of churn that will not generate a net gain in clarity and explicitness that perhaps the AP is hoping for. And applying state names only to the US certainly isn’t going to improve consistency.

As a database developer working with data from all over the world, I have some perspective on this that I doubt the Style Police have thought about. It’s relevant to the changes they’re making and also relevant to my fellow developers, who often have a myopic US-centric approach to schema design and rule development that eventually gets them into rather serious trouble.

First of all, only four major countries seem to use consistent state or province names in their postal addresses: The US, Canada, Mexico and Australia. India and other geographically large countries have their own geopolitical subdivisions and even smaller countries use breakdowns, such as Japan’s prefectures. But consistent usage in postal addresses and reasonably consistent, official abbreviations seem to prevail only in the four countries mentioned above (I’ll be grateful to any readers with documentation to the contrary to point me to). And Mexico, as far as I’ve been able to tell, doesn’t publish an official list of state abbreviations, though they are reasonably well defined by common usage.

There are two parallel systems of country abbreviations maintained by the ISO — a two and a three character version. The ISO has also adopted US and Canadian postal abbreviations for states / provinces. But the intersection of human nature and standards creates some interesting train wrecks for me:

New Brunswick, Canada is abbreviated NB and Nebraska, US is abbreviated NE. But NB is sometimes mis-abbreviated NE and Nebraska is sometimes mis-abbreviated NB.
Similarly, Newfoundland, Canada (which since 2001 has officially been known as “Newfoundland and Labrador” is officially NL and Nuevo Leon, Mexico is NL and this has caused more than one address to pass through the systems I maintain that place St John’s in Mexico or Monterrey in Canada. Similar hilarity flows from BC being a common abbreviation for Baja California Norte (or collectively for both Baja Californias) as well as British Columbia in Canada.
Then there are the edge cases. I’ll bet you didn’t know the name of Canada’s newest province, Nunavut (NU), formerly part of the Northwest Terrirories (NT) and often still classified under NT — it’s distinguishable for sure only by parsing the Canadian postal code. And by the way, NT is also the official abbreviation for Northern Territory, Australia.
And did I mention that WA is not just Washington State, US, but also the state of Western Australia? Or that Ontario, California often ends up in Ontario, Canada? Or that JA (the Mexican state of Jalisco) sometimes picks up Japanese addresses despite the fact that the 2 character ISO code for Japan is really JP? Or that since the 2 character ISO country code for India and the USPS state abbreviation for Indiana are the same, they often trade addresses with each other? And once in a great while, with Iowa?

A comprehensive data hygiene system has to look at all sorts of address hints to properly correct such mistakes. For instance, it’s a mercy that Canadian postal codes consist of six alternating letters and numbers, unlike the US and Mexico (five digits) and Australia (four digits).

The folks at the AP are in over their heads on this one, I’m afraid, if they think they are going to clear anything up with this style change. And don’t get me started on database and application developers, who often unwisely sweep such issues under the rug for some hapless future individual to deal with. As normal growth occurs, many times a US company starts dealing with non-US addresses (and phone numbers, a whole additional topic) and finds it needlessly difficult because someone didn’t take a little extra effort to prepare themselves for future global needs.

This is something that is increasingly important to address in a global economy. Briefly, here are a few things DB developers need to be looking at.

Even if you don’t need them now, consider designing your schemas with text data types that can support accented characters / ligatures and the alternative sorting schemes that go with them.
Use variable length fields for state names / abbreviations (note, even some Australian state abbreviations are 3, not 2 characters, and other countries may well just spell them out). Also use variable length or at least generously long fields for phone numbers.
Decide in advance whether to use 2 or 3 character ISO codes for countries (it’s probably best to use 3 character codes in most cases).
If you import data from other sources (and at some point you almost certainly will), be ready to translate / correct from coding systems other than the ones you’ve adopted. This will avoid, for instance, the spelled out name JAPAN, getting truncated to JA and misrouted to the state field, thus being mistaken for Jalisco, Mexico.
Make sure that data entry systems properly validate entries.
For addresses in the US and Canada, and perhaps even Mexico and Australia, consider investing in postal standardization libraries to correct and standardize addresses, even if you’re not using the database primarily for mailing.

State Secrets: Place Names in Databases, News Story Bylines, and Elsewhere

Archives

Categories

Recent Comments

Search