To Unicode or Not To Unicode

by bob on April 9, 2010

In another post, I discussed some issues surrounding proper coding of country and state names and design of database schemas to support global names and addresses.  I mentioned that text data types for name and address elements should consider supporting accented characters — another way of saying “Unicode characters”.

In terms of SQL Server this means using NText rather than Text, and NVarChar rather than VarChar.  I’d like to say a little bit more about this issue, however, as it’s not a simple decision, and has to be finessed based upon a number of factors.

First, Unicode fields take twice as much disk and buffer space as ASCII fields, since each character occupies two byes rather than one.  Although disk and memory space are relatively inexpensive these days, there’s no getting around the fact that such fields are going to result in some kind of hit, both in terms of cost and performance.

Furthermore, using Unicode fields is going to result in some design and maintenance overhead.  This overhead falls into two main areas: proper translation of encoding schemes and collation sequencing.  Let’s first look at encoding.

Currently I oversee data acquisition, intake, and hygiene for a large database that is assembled from hundreds of sources.  Some of those sources include accented characters.

In an ideal world, each data source would take proper care of its own internal management of code pages and coding schemes and would document for me the coding scheme being used to transmit the data to me.  Then it’d be a straightforward matter of reading the data with the proper encoding, and writing it back into my own system using the encoding I’ve chosen.

In practice, it isn’t nearly that pretty.

To start with, particularly in the United States, databases and software are often willfully ignorant of the need to properly represent accents and ligatures.  The result is that data is effectively encoded as 7-bit ASCII or the EBCDIC equivalent.  When someone inputs or imports data that includes “extended” characters, those extended characters are simply corrupted.  In the best case they simply lose the accenting; more often they are dropped or become spaces or some other default character; in the worst case they become some random character.

The country name “MEXICO” for instance normally has the “e” accented by a native speaker.  In data streams such as I’ve just described, that might become, for instance, MXICO or M?XICO.  This may happen as I import, because the data source uses undocumented code pages or the encoding randomly changes due to inconsistent, non-automated exporting practices.  Or it may already be corrupted before it’s even exported to a file and sent to me for processing.

Even computer systems in other countries stagger along using US-written software that doesn’t really accommodate their needs — legacy COBOL banking systems come to mind.

Given all this, unless all your data is keyed in by native speakers on software systems that properly handle the local character set, even this far into the 21st century there are probably a lot of systems that would do well for the foreseeable future to just boil everything down to 7-bit characters or some other single-byte scheme.  For one thing, if you’re merging many different data sets you need consistency and you need to properly match names and addresses.  You can’t readily do that if some data coming into the system is accented and some is not.

For some use cases, such “lowest common denominator” systems are “good enough”.  It’s a business trade-off decision, and there’s no universal “right” answer.  The important thing is that you’re mindful of the issues and make an intelligent decision.

Now let’s consider the matter of collation sequence.  This is another complication of allowing accented / extended characters into your data mix.  If you have the same word starting a name, other than the fact that in one version of the word the leading “A” is unaccented, another has an accent grave, another has an accent aigu, another has an umlaut, and so forth … how do you sort those?

The answer for many applications is you sort them however a particular user is used to seeing them sorted in the local telephone directories.  This creates all sorts of interesting sorting and searching issues.  But underlying them all is a decision of what collation sequence to use.  In some applications you may need to use different sequences for different use cases.  It can get terribly hairy.

As you can imagine, it’s not entirely Ugly American arrogance that causes developers to sidestep handling extended characters; it’s mighty tempting to just live without them.  You need a compelling business case to justify investing the extra effort.

Of course I have conducted this entire discussion in the context of the Roman alphabet; I haven’t even touched on Cryllic or Arabic or the various Asian glyph systems and the Roman transliterations thereof.  It’s enough to make your head spin.

The most I can hope to accomplish with this post is to expand your awareness of the issues out there in the wider world and encourage you to educate yourself about them and take them into consideration as you design and upgrade databases and supporting software.  I hope I’ve done that, because in my experience most US developers completely ignore extended characters or handle them very sloppily.  And that makes even a single-byte system difficult to properly maintain.

{ 1 comment }

A friend turned me on to this post which heralds a very significant change to The AP StyleBook that has editors and reporters atwitter today.  It impacts datelines and article body text starting May 15.

Formerly, with some important exceptions, US place names in writing that follows AP style were followed by a state abbreviation — but it wasn’t the 2-character USPS abbreviation familiar to us in the database world, it was usually longer (Calif. for California, Ill. for Illinois, for example).  Now it’ll be spelled out completely, except that in other countries (including Canada) the place name will simply be followed by the country name.

This strikes me as a lot of churn that will not generate a net gain in clarity and explicitness that perhaps the AP is hoping for.  And applying state names only to the US certainly isn’t going to improve consistency.

As a database developer working with data from all over the world, I have some perspective on this that I doubt the Style Police have thought about.  It’s relevant to the changes they’re making and also relevant to my fellow developers, who often have a myopic US-centric approach to schema design and rule development that eventually gets them into rather serious trouble.

First of all, only four major countries seem to use consistent state or province names in their postal addresses: The US, Canada, Mexico and Australia.  India and other geographically large countries have their own geopolitical subdivisions and even smaller countries use breakdowns, such as Japan’s prefectures.  But consistent usage in postal addresses and reasonably consistent, official abbreviations seem to prevail only in the four countries mentioned above (I’ll be grateful to any readers with documentation to the contrary to point me to).   And Mexico, as far as I’ve been able to tell, doesn’t publish an official list of state abbreviations, though they are reasonably well defined by common usage.

There are two parallel systems of country abbreviations maintained by the ISO — a two and a three character version.  The ISO has also adopted US and Canadian postal abbreviations for states / provinces.  But the intersection of human nature and standards creates some interesting train wrecks for me:

  • New Brunswick, Canada is abbreviated NB and Nebraska, US is abbreviated NE.  But NB is sometimes mis-abbreviated NE and Nebraska is sometimes mis-abbreviated NB.
  • Similarly, Newfoundland, Canada (which since 2001 has officially been known as “Newfoundland and Labrador” is officially NL and Nuevo Leon, Mexico is NL and this has caused more than one address to pass through the systems I maintain that place St John’s in Mexico or Monterrey in Canada.  Similar hilarity flows from BC being a common abbreviation for Baja California Norte (or collectively for both Baja Californias) as well as British Columbia in Canada.
  • Then there are the edge cases.  I’ll bet you didn’t know the name of Canada’s newest province, Nunavut (NU), formerly part of the Northwest Terrirories (NT) and often still classified under NT — it’s distinguishable for sure only by parsing the Canadian postal code.  And by the way, NT is also the official abbreviation for Northern Territory, Australia.
  • And did I mention that WA is not just Washington State, US, but also the state of Western Australia?  Or that Ontario, California often ends up in Ontario, Canada?  Or that JA (the Mexican state of Jalisco) sometimes picks up Japanese addresses despite the fact that the 2 character ISO code for Japan is really JP?  Or that since the 2 character ISO country code for India and the USPS state abbreviation for Indiana are the same, they often trade addresses with each other?  And once in a great while, with Iowa?

A comprehensive data hygiene system has to look at all sorts of address hints to properly correct such mistakes.  For instance, it’s a mercy that Canadian postal codes consist of six alternating letters and numbers, unlike the US and Mexico (five digits) and Australia (four digits).

The folks at the AP are in over their heads on this one, I’m afraid, if they think they are going to clear anything up with this style change.  And don’t get me started on database and application developers, who often unwisely sweep such issues under the rug for some hapless future individual to deal with.  As normal growth occurs, many times a US company starts dealing with non-US addresses (and phone numbers, a whole additional topic) and finds it needlessly difficult because someone didn’t take a little extra effort to prepare themselves for future global needs.

This is something that is increasingly important to address in a global economy.  Briefly, here are a few things DB developers need to be looking at.

  • Even if you don’t need them now, consider designing your schemas with text data types that can support accented characters / ligatures and the alternative sorting schemes that go with them.
  • Use variable length fields for state names / abbreviations (note, even some Australian state abbreviations are 3, not 2 characters, and other countries may well just spell them out).  Also use variable length or at least generously long fields for phone numbers.
  • Decide in advance whether to use 2 or 3 character ISO codes for countries (it’s probably best to use 3 character codes in most cases).
  • If you import data from other sources (and at some point you almost certainly will), be ready to translate / correct from coding systems other than the ones you’ve adopted.  This will avoid, for instance, the spelled out name JAPAN, getting truncated to JA and misrouted to the state field, thus being mistaken for Jalisco, Mexico.
  • Make sure that data entry systems properly validate entries.
  • For addresses in the US and Canada, and perhaps even Mexico and Australia, consider investing in postal standardization libraries to correct and standardize addresses, even if you’re not using the database primarily for mailing.

{ 1 comment }

Dumb Computers Are A Good Thing

April 1, 2010

Eric Sink had some thoughts about a sea change he feels is underway in the world of computing.  The short version: computers are no longer driven by the demands of uber-geeks, who are quickly becoming a minority.  The new driving force are people who are willing to trade power and flexibility for simplicity.  Shiny new [...]

Read the full article →

Continuous Learning is Relative

March 27, 2010

The conventional wisdom in software development is that you must always keep ahead of the alphabet soup-and-acronym marketing juggernaut that drives the technology.  While I don’t disagree, exactly, I have to confess that I’m seeing an awful lot of technology that’s driven by the need of traditional software vendors to sell version upgrades and counter [...]

Read the full article →

System Administrivia

January 17, 2010

My son is a crack sysadmin, but it’s never been my area of interest.  I’m a software developer.  I like that if something goes wrong with software I wrote, I know exactly who to blame (me) and who can do something about it (me).   With operating systems and hardware — I’m at the mercy of [...]

Read the full article →

Python: Unraveling the Snake

November 24, 2009

Something has been making me restless lately.  Maybe it’s the periodic need to shake things up; maybe it’s the growing sense that I may be getting a little parochial in the Microsoft-centric alternative reality that I’ve lived in for the past 15 years or so; maybe it’s simply indigestion.  Whatever the reason, I’ve been looking [...]

Read the full article →

The Kindle DX — Is it Worth the Money?

November 19, 2009

As a purchaser of the original Kindle 1 back around November of 2007, I’m a relatively early adopter of Amazon.com’s eReader technology. In general, I’ve been well satisfied with it, for a first-generation device. I find the ability to purchase most books for $9.99 or less in a very readable / usable / [...]

Read the full article →

Advice for Self-Taught Programmers

October 19, 2009

Bob returns to blogging with some advice for noobs.

Read the full article →

Time To Jump Ship?

October 29, 2008

Does Redmond have any room left to redeem itself on the OS front?

Read the full article →

VB.NET Eye for the C# Guy

October 21, 2008

Coming to VB.NET from C#? Oh, behave!

Read the full article →