There are situations where an application needs to load large collections of strings, or of types that contain many string references. By default, two strings created at run time are separate objects in memory, even when an equality comparison shows they hold the same value. Because strings in .NET (and in Java) are immutable, such duplicates could safely share a single object, but the runtime does not arrange that for you. The duplication isn’t always desirable.
Suppose you need to add a thousand hypothetical Person objects to a collection. These objects include a person’s name and address information. Let us further suppose that these Persons all live in the United States. Worst case, there are just over 50 possible state codes (the 50 states plus territories and the District of Columbia and possibly military addresses). Wouldn’t it be nice not to have hundreds of copies of each unique state code in memory?
This is where an often overlooked technique called string interning comes into play. The CLR maintains an intern pool, which is in essence a table of unique strings shared by the whole process. The string literals in your code are recorded in the compiled assembly and are added to this pool when the runtime loads them. Since many literals in a program tend to appear in multiple places, this conserves memory. A simplistic example:
if (bar == "FOO")
{
    // do something here
}
else if (foo == "FOO")
{
    // do something here
}
else if (fooBar == "FOO")
{
    // do something here
}
In the above example, the literal “FOO” is interned, so there is one copy of “FOO” in memory rather than three.
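To see the effect for yourself, here is a minimal sketch. The class and method names are made up for illustration, and it assumes the usual runtime behavior in which literals from the same assembly are interned. It compares object identity rather than value.

```csharp
using System;

public static class LiteralInterningDemo
{
    // True when two occurrences of the same literal refer to one object.
    public static bool LiteralsShareOneObject()
    {
        string first = "FOO";
        string second = "FOO";
        return object.ReferenceEquals(first, second);
    }

    // True when a string built at run time has the same value as the literal
    // but is a separate object in memory.
    public static bool RuntimeStringIsSeparateObject()
    {
        string literal = "FOO";
        string built = new string(new[] { 'F', 'O', 'O' });
        return literal == built && !object.ReferenceEquals(literal, built);
    }
}
```

Both methods should return true: the two literals are one object, while the string assembled from characters is an equal-valued duplicate.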
For performance reasons I’ll go into shortly, interning is not normally applied to runtime string values, but at times it makes sense to do so. In our collection of Persons, let’s say we load each Person from a DataReader. We can intern the states as we load them, like so:
personInstance.State = String.Intern(drPerson["state"].ToString());
If, for example, the state code is “AZ”, String.Intern() checks whether “AZ” is already in the intern pool. If it is, a reference to the pooled string is returned. Otherwise, the string you passed in is added to the pool and that same reference is returned. If you end up with 1,200 persons in Arizona, you will hold one “AZ” instance instead of 1,200 of them, which is a tremendous memory savings.
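The Arizona case can be sketched as follows. This is an illustration only: the Person class, the PersonLoader helper and the simulated rows are hypothetical stand-ins, since the real code would read from a DataReader. Each simulated state value is deliberately built as a fresh string object so that every row starts out with its own copy.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the Person type discussed above.
public sealed class Person
{
    public string Name { get; set; } = "";
    public string State { get; set; } = "";
}

public static class PersonLoader
{
    // Builds persons from (name, state) pairs, interning the state as it is assigned.
    public static List<Person> Load(IEnumerable<KeyValuePair<string, string>> rows)
    {
        var people = new List<Person>();
        foreach (var row in rows)
        {
            people.Add(new Person
            {
                Name = row.Key,
                State = string.Intern(row.Value)
            });
        }
        return people;
    }

    // Simulates 1,200 rows whose state values are all distinct string objects.
    public static List<Person> LoadArizonaSample()
    {
        var rows = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < 1200; i++)
        {
            string state = new string(new[] { 'A', 'Z' }); // a new object on every pass
            rows.Add(new KeyValuePair<string, string>("Person " + i, state));
        }
        return Load(rows);
    }
}
```

After LoadArizonaSample() runs, every Person.State refers to the same “AZ” object, and the 1,200 temporary copies are eligible for garbage collection.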
Depending on the typical distribution of unique values, the city might also be useful to intern. Basically, any time you have an attribute with a limited number of values that will be assigned to a large number of entities, string interning might be worth considering. This is even truer for long strings.
Of course, any time you optimize for one thing, you de-optimize for one or more other things, and interning is no exception.
The first downside is performance. String.Intern() must search the intern pool, which is extra overhead compared to a plain string assignment. I have not formally benchmarked it, but in many cases the cost is large enough to be subjectively noticeable. So you’ll want to avoid interning when you absolutely need blistering performance. In the situations where I actually used it, adequate performance was all we needed, and interning allowed us to avoid unacceptably low system capacities. Besides, many alternatives to an entirely in-memory solution can put just as big a damper on performance.
The second downside is persistence. There is no way to remove string references from the intern pool, and the memory used by the intern pool is not likely to be released until the CLR itself exits. You heard that right: the intern pool and its memory usage will probably outlive your application’s execution and even the termination of your app domain. For purposes of managing string literals, this isn’t a practical problem. But if you are going to load zillions of additional unique values into the intern pool in a system with constant uptime (such as a process called from a Windows service), then you may use up enough memory to begin to impact the overall performance of that process and of the machine it runs on.
A final, more minor issue is that to intern a string you still have to create it. The string you pass to String.Intern() must already exist before the pool can be searched. If the value is already in the pool, your temporary copy becomes unreachable once you drop the reference to it and is eventually garbage-collected, but the overhead of creating it is not eliminated by String.Intern().
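A small sketch makes the temporary copy visible. The class and method names are invented for the example. It interns the same value twice and checks which object comes back the second time.

```csharp
using System;

public static class TemporaryStringDemo
{
    // Returns true when the second freshly built string was not the one kept,
    // that is, when it was only a short-lived temporary.
    public static bool SecondCopyIsDiscarded()
    {
        string first = new string(new[] { 'A', 'Z' });
        string pooled = string.Intern(first);            // the pool now holds an "AZ" entry

        string second = new string(new[] { 'A', 'Z' });  // allocated before Intern is called
        string result = string.Intern(second);           // returns the pooled entry instead

        return object.ReferenceEquals(result, pooled)
            && !object.ReferenceEquals(result, second);
    }
}
```

The second string is allocated in full before String.Intern() ever sees it, so interning saves the long-term storage but not the allocation itself.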
Despite these downsides, string interning, judiciously applied, can be a real lifesaver. Returning to my (simplistic) example above, you wouldn’t ordinarily apply this technique to state or region codes because in most applications you will not (and probably should not) load a large number of persons into a memory structure at one time anyway. There is usually no point in the added complexity and performance hit just to save a few kilobytes.
But the technique shines in bulk data processing applications, such as making iterative scrubbing passes over raw data from external sources before persisting it to a database. You might use a DataTable or a custom collection of your own to load arbitrary data in memory and then filter and transform it in various ways, entirely in memory, before committing it to some permanent data store. You might be seeking overall performance and / or simplicity by keeping everything in memory. Applying interning to certain columns could greatly increase the capacity of such a system.
Basically, the larger the strings, and the fewer the possible number of unique values, the more memory saved per DataRow or other collection item. Your mileage may vary, but string interning is a good tool in certain situations.
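As a rough illustration of interning selected columns, the sketch below uses a plain List of string arrays as the custom in-memory collection. That choice, along with the ColumnInterning class and its method names, is an assumption made for the example and not something prescribed above. The second method counts distinct string objects, not distinct values, so you can measure the effect.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

public static class ColumnInterning
{
    // Replaces the values in the chosen columns with their interned references, in place.
    public static void InternColumns(List<string[]> rows, params int[] columnIndexes)
    {
        foreach (string[] row in rows)
        {
            foreach (int index in columnIndexes)
            {
                if (row[index] != null)
                {
                    row[index] = string.Intern(row[index]);
                }
            }
        }
    }

    // Counts how many distinct string objects (by identity) one column refers to.
    public static int CountDistinctObjects(List<string[]> rows, int columnIndex)
    {
        var seen = new HashSet<string>(new ByReference());
        foreach (string[] row in rows)
        {
            if (row[columnIndex] != null)
            {
                seen.Add(row[columnIndex]);
            }
        }
        return seen.Count;
    }

    // Compares strings by object identity instead of by value.
    private sealed class ByReference : IEqualityComparer<string>
    {
        public bool Equals(string x, string y)
        {
            return object.ReferenceEquals(x, y);
        }

        public int GetHashCode(string obj)
        {
            return RuntimeHelpers.GetHashCode(obj);
        }
    }
}
```

With 1,000 rows that each carry their own copy of the same city name, CountDistinctObjects reports 1,000 before the call to InternColumns and 1 after it. A column of mostly unique values, such as names, would gain nothing and would only grow the pool.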
There is another win with interned strings. When you know that two strings are interned, you can make a super-fast string comparison this way:
if (object.ReferenceEquals(oneInternedString,anotherInternedString)) {}
Or, remembering that all string literals are interned, this more common scenario also works:
if (object.ReferenceEquals(anInternedString,"FOO")) {}
Again, this only makes sense if you have thousands of comparisons to do, of course — and you have to know at coding time that the two strings are both interned.
Update: It occurs to me that here lie the seeds of a near-perfect general usage for interning that should be high benefit / low risk. Any time you have a set of known possible string values that you need to compare against and take action on, you are going to have if / else or switch / case statements testing for those known string literals, so they will be in the intern pool anyway. Therefore, you might as well intern any runtime strings that will be members of that same set of values and use the above object.ReferenceEquals() optimization with them. You get a great performance benefit and you aren’t adding anything new to the intern pool at runtime. Again, this isn’t worth bothering with for a set of tests executed once, but in many iterative situations it could be well worth it.
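Here is one way that pattern might look. The status values and the StatusDispatcher class are hypothetical, and the sketch assumes the literals are interned in the usual way. The value is interned once as it enters the system, and every later test is a reference check.

```csharp
using System;

public static class StatusDispatcher
{
    // Hypothetical set of known values. As literals they end up in the intern pool anyway.
    private const string Active = "ACTIVE";
    private const string Closed = "CLOSED";

    // Intern a runtime value once, when it enters the system.
    public static string Normalize(string raw)
    {
        return string.Intern(raw);
    }

    // Assumes 'status' came from Normalize. Each test is a reference check,
    // not a character-by-character comparison.
    public static int Classify(string status)
    {
        if (object.ReferenceEquals(status, Active)) return 1;
        if (object.ReferenceEquals(status, Closed)) return 2;
        return 0; // unknown value, or a string that was never interned
    }
}
```

Note the failure mode: a string with the value “ACTIVE” that was never passed through Normalize falls through to 0, because it is equal in value but is not the same object. That is the price of skipping the value comparison, so the interning step has to be guaranteed by the design. Also, Normalize as written adds unknown values to the pool too, so it fits best when the incoming values really are limited to the known set.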