To Unicode or Not To Unicode

by bob on April 9, 2010

In another post, I discussed some issues surrounding the proper coding of country and state names and the design of database schemas to support global names and addresses.  I mentioned that the text data types chosen for name and address elements should support accented characters — another way of saying “Unicode characters”.

In terms of SQL Server this means using NText rather than Text, and NVarChar rather than VarChar.  I’d like to say a little bit more about this issue, however, as it’s not a simple decision, and has to be finessed based upon a number of factors.
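To make the distinction concrete, here is a minimal sketch using hypothetical table and column names; the only point is the N-prefixed types (and the N prefix on string literals) versus their single-byte counterparts.

    -- Hypothetical tables: one with single-byte (code page) columns, one with Unicode columns.
    CREATE TABLE dbo.ContactAscii (
        ContactName varchar(100),   -- one byte per character, limited to the column's code page
        CountryName varchar(60)
    );

    CREATE TABLE dbo.ContactUnicode (
        ContactName nvarchar(100),  -- two bytes per character, handles accented characters
        CountryName nvarchar(60)
    );

    -- Unicode string literals need the N prefix, or the accent can be lost
    -- before the value ever reaches the column.
    INSERT INTO dbo.ContactUnicode (ContactName, CountryName)
    VALUES (N'José García', N'MÉXICO');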

First, Unicode fields take twice as much disk and buffer space as ASCII fields, since each character occupies two bytes rather than one.  Although disk and memory space are relatively inexpensive these days, there’s no getting around the fact that such fields are going to result in some kind of hit, both in terms of cost and performance.
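You can see the doubling directly with DATALENGTH, which reports bytes rather than characters:

    -- DATALENGTH returns bytes, so the two-byte-per-character cost is easy to demonstrate.
    SELECT
        DATALENGTH('MEXICO')  AS varchar_bytes,   -- 6
        DATALENGTH(N'MEXICO') AS nvarchar_bytes;  -- 12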

Furthermore, using Unicode fields is going to result in some design and maintenance overhead.  This overhead falls into two main areas: proper translation of encoding schemes and collation sequencing.  Let’s first look at encoding.

Currently I oversee data acquisition, intake, and hygiene for a large database that is assembled from hundreds of sources.  Some of those sources include accented characters.

In an ideal world, each data source would take proper care of its own internal management of code pages and coding schemes and would document for me the coding scheme being used to transmit the data to me.  Then it’d be a straightforward matter of reading the data with the proper encoding, and writing it back into my own system using the encoding I’ve chosen.
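In SQL Server, one place that declaration shows up is the CODEPAGE option of BULK INSERT.  A sketch, assuming a hypothetical staging table and file path, and a supplier who has documented that the file is in Windows code page 1252:

    -- Load a single-byte flat file whose encoding the supplier has documented.
    BULK INSERT dbo.StagingContacts
    FROM 'C:\feeds\supplier_a\contacts.txt'
    WITH (
        DATAFILETYPE    = 'char',   -- single-byte character data
        CODEPAGE        = '1252',   -- the encoding the supplier actually used
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR   = '\n'
    );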

In practice, it isn’t nearly that pretty.

To start with, particularly in the United States, databases and software are often willfully ignorant of the need to properly represent accents and ligatures.  The result is that data is effectively encoded as 7-bit ASCII or the EBCDIC equivalent.  When someone inputs or imports data that includes “extended” characters, those extended characters are simply corrupted.  In the best case they merely lose their accents; more often they are dropped or become spaces or some other default character; in the worst case they turn into some random character.

Take the country name MÉXICO, which a native speaker writes with an accented É.  In data streams such as I’ve just described, that might arrive as MEXICO, MXICO, or M?XICO.  The corruption may happen as I import, because the data source uses undocumented code pages or because the encoding changes unpredictably due to inconsistent, non-automated exporting practices.  Or the data may already be corrupted before it’s even exported to a file and sent to me for processing.
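It’s easy to reproduce the substitution flavor of this corruption by pushing a Unicode value through a code page that can’t represent it.  This is only an illustration; the exact replacement character depends on the code page and the conversion path involved:

    -- Forcing an accented value through a Cyrillic code page (1251), which has no é.
    SELECT CAST(N'MÉXICO' COLLATE Cyrillic_General_CI_AS AS varchar(10)) AS corrupted;
    -- Typically returns: M?XICO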

Even computer systems in other countries stagger along using US-written software that doesn’t really accommodate their needs — legacy COBOL banking systems come to mind.

Given all this, unless all your data is keyed in by native speakers on software that properly handles the local character set, there are probably a lot of systems, even this far into the 21st century, that would do well for the foreseeable future to just boil everything down to 7-bit characters or some other single-byte scheme.  For one thing, if you’re merging many different data sets you need consistency, and you need to properly match names and addresses.  You can’t readily do that if some of the data coming into the system is accented and some is not.
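If you do keep the accents, one partial mitigation in SQL Server is to do the matching under an accent-insensitive collation (the _AI variants), so accented and unaccented spellings compare equal.  A minimal sketch, with hypothetical table and column names in the join:

    -- Under an accent-insensitive (_AI) collation, MÉXICO and MEXICO compare equal.
    SELECT CASE WHEN N'MÉXICO' = N'MEXICO' COLLATE Latin1_General_CI_AI
                THEN 'match' ELSE 'no match'
           END AS result;   -- 'match'

    -- The same idea applied to matching an incoming feed against a master table
    -- (hypothetical table and column names):
    SELECT m.ContactId, s.ContactName
    FROM dbo.MasterContacts  AS m
    JOIN dbo.StagingContacts AS s
      ON m.ContactName = s.ContactName COLLATE Latin1_General_CI_AI;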

For some use cases, such “lowest common denominator” systems are “good enough”.  It’s a business trade-off decision, and there’s no universal “right” answer.  The important thing is that you’re mindful of the issues and make an intelligent decision.

Now let’s consider the matter of collation sequence.  This is another complication of allowing accented / extended characters into your data mix.  Suppose several names start with the same word, except that in one version the leading “A” is unaccented, in another it carries an accent grave, in another an accent aigu, in another an umlaut, and so forth … how do you sort those?

The answer, for many applications, is that you sort them however a particular user is used to seeing them sorted in the local telephone directory.  This creates all sorts of interesting sorting and searching issues.  But underlying them all is the decision of which collation sequence to use.  In some applications you may need different sequences for different use cases.  It can get terribly hairy.
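SQL Server lets you pick the sequence per query with a COLLATE clause on the ORDER BY, which is one way to serve different audiences from the same data.  A small sketch with made-up names; the exact ordering of the accented entries is collation-specific, which is the whole point:

    DECLARE @Names TABLE (Name nvarchar(20));
    INSERT INTO @Names VALUES (N'Aaron');
    INSERT INTO @Names VALUES (N'Ábel');
    INSERT INTO @Names VALUES (N'Äsa');
    INSERT INTO @Names VALUES (N'Zola');

    -- General Latin collation: the accented A's interleave with the plain A's.
    SELECT Name FROM @Names ORDER BY Name COLLATE Latin1_General_CI_AS;

    -- Finnish/Swedish collation: Ä belongs at the end of the alphabet, after Z.
    SELECT Name FROM @Names ORDER BY Name COLLATE Finnish_Swedish_CI_AS;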

As you can imagine, it’s not entirely Ugly American arrogance that causes developers to sidestep handling extended characters; it’s mighty tempting to just live without them.  You need a compelling business case to justify investing the extra effort.

Of course I have conducted this entire discussion in the context of the Roman alphabet; I haven’t even touched on Cyrillic or Arabic or the various Asian glyph systems and the Roman transliterations thereof.  It’s enough to make your head spin.

The most I can hope to accomplish with this post is to expand your awareness of the issues out there in the wider world and encourage you to educate yourself about them and take them into consideration as you design and upgrade databases and supporting software.  I hope I’ve done that, because in my experience most US developers completely ignore extended characters or handle them very sloppily.  And that makes even a single-byte system difficult to properly maintain.

Comments

Tod Birdsall June 10, 2010 at 6:45 am

Hi Bob, I used to live in Fountain Hills and we met once or twice to talk shop. The org I work for is in the process of converting our existing SQL 2005 data from ASCII to Unicode. It has been mostly “search and replace”, so far. Not fun, but not too hard, if you are fluent in regular expressions. ;-)
Just a side note: you mention “NText rather than Text”. If you are using SQL 2005/2008, NTEXT has been deprecated; you should now use NVARCHAR(MAX) instead.

Bob responds:

Hi, Tod. Thanks for the tip, I had not noticed that deprecation (largely because I don’t have any current projects that actually need Unicode). I did a little research and found this interesting and helpful link that deals with a little “gotcha” when converting legacy ntext to nvarchar(max) and also documents the fact that SQL 2005+ has much worse performance with ntext than previous versions:

http://geekswithblogs.net/johnsPerfBlog/archive/2008/04/16/ntext-vs-nvarcharmax-in-sql-2005.aspx
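For anyone facing the same conversion, the type change itself is a one-line ALTER.  The follow-up UPDATE that rewrites the column in place is often recommended so the values actually move to the new storage format; treat this as a sketch with hypothetical table and column names, and read the article before relying on it:

    -- Hypothetical legacy table: convert the deprecated ntext column to nvarchar(max).
    ALTER TABLE dbo.LegacyNotes
        ALTER COLUMN NoteBody nvarchar(max) NULL;

    -- Rewriting the values in place is often recommended afterward so the data
    -- is stored in the new format rather than left in the old LOB layout.
    UPDATE dbo.LegacyNotes
        SET NoteBody = NoteBody;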
