Text Processing in .NET: #1 – String Comparisons, Ephemeral Allocations, and Pass-By-Reference

by bob on December 22, 2020

In my experience, many developers write unintentionally inefficient code for text processing applications. Having presided over the design and maintenance of back end code responsible for processing huge daily volumes of text for much of my career, I have a few suggestions that are relatively easy conceptually and not costly to implement. I’ve seen the application of techniques like these increase throughput in records processed per second by a factor of 3 to 4 in some cases.

Much of my advice won’t move the performance needle for you unless it is applied to code that runs, for example, on a per-row basis against sizable datasets (which is to say, on “hot paths”). On the other hand, poor architectural decisions early on mean you’ll have scaling problems later, so I prefer to see text handling done with good performance in mind from the start. It doesn’t harm readability, so being in the habit of using these techniques even outside of hot paths is a Good Thing.

The first thing to understand is that, mostly for reasons of thread safety, .NET strings are immutable. This means that any time you concatenate or trim or otherwise alter a string, you’re creating and allocating a new string. In some cases those allocations are what are known as “ephemeral” in that they may be quickly created and discarded, but if you’re plowing through many iterations of a loop and hurling those ephemeral allocations at the garbage collection system, you’ll slow throughput and increase memory consumption.

It turns out that putting pressure on the garbage collector by creating many objects with short lifetimes is a significant performance suck.

By way of illustration, let’s try to count the allocations in this code snippet where, if a string ends with “XYZ”, we strip that off the end of the string:

if (someString.ToUpper().EndsWith("XYZ")) {
    someString = someString.Substring(0, someString.Length - 3);
}

The “XYZ” string is created at compile time, not runtime, and is in the string intern pool. But the ToUpper() call copies someString and converts it to uppercase, creating an ephemeral uppercase instance of the string purely for comparison purposes, as it’s immediately discarded after the expression is evaluated. Then, the assignment doesn’t simply truncate someString, but creates a new, shorter someString and discards the old instance — another allocation.

A better version of this removes the ToUpper() call and its associated allocation:

if (someString.EndsWith("XYZ", StringComparison.OrdinalIgnoreCase)) {
    someString = someString.Substring(0, someString.Length - 3);
}

I suspect that an awful lot of developers don’t pay much attention to the StringComparison options and habitually use ToUpper() or ToLower() when doing case-insensitive compares because … well, because if you aren’t thinking about how the CLR works at runtime under the hood, it probably doesn’t seem like it should matter. But the reality is that those options exist in part to avoid needless transformations by intelligently examining the characters of the original string rather than a copy of it.
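If you want to see the difference for yourself, here is a minimal sketch that compares the bytes allocated by the two versions. The AllocationDemo class and MeasureAllocatedBytes helper are hypothetical names of my own for illustration; GC.GetAllocatedBytesForCurrentThread() is a real runtime API, but treat the numbers as rough indicators rather than a proper benchmark.

```csharp
using System;

public static class AllocationDemo
{
    // Hypothetical helper: returns the bytes allocated on the current thread
    // while running `action` `iterations` times.
    public static long MeasureAllocatedBytes(Action action, int iterations)
    {
        action(); // warm-up call so one-time costs are less likely to be counted

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    public static void Main()
    {
        string someString = "some reasonably long record value ending in xyz";
        const int iterations = 100_000;

        // Allocates an upper-cased copy of someString on every call.
        long withToUpper = MeasureAllocatedBytes(
            () => someString.ToUpper().EndsWith("XYZ"), iterations);

        // Examines the characters of the original string in place.
        long withComparison = MeasureAllocatedBytes(
            () => someString.EndsWith("XYZ", StringComparison.OrdinalIgnoreCase), iterations);

        Console.WriteLine($"ToUpper() + EndsWith():      {withToUpper:N0} bytes");
        Console.WriteLine($"EndsWith(OrdinalIgnoreCase): {withComparison:N0} bytes");
    }
}
```

The exact figures will vary by runtime version and machine, so I have not listed expected output; the point is the relative gap between the two lines.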

So rule #1 is, when using Equals(), StartsWith(), EndsWith(), or basically any comparison operation, take full advantage of the comparison options and avoid the allocations caused by tricks with ToUpper() and ToLower(). Also, where possible, specify the *Ordinal comparison options explicitly, because for methods such as StartsWith(), EndsWith(), and Compare() the default for string arguments is a much more expensive culture-sensitive comparison. (Equals() without a comparison argument is already ordinal, but spelling the option out keeps the intent obvious.) If a culture-sensitive comparison is what you actually need, fine; but in the systems I write I’m generally dealing with 7-bit character schemes as the desired target. In that scenario, StringComparison.Ordinal is most efficient, followed by OrdinalIgnoreCase. Even if you have more complex culture-sensitive comparison needs, case-insensitive comparisons are still far better accomplished via the appropriate StringComparison option.
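As a quick illustration of rule #1, here is a small sketch of what explicit comparison options look like across a few common operations. The ComparisonExamples class and its method names are made up for this example; the framework overloads they call are the standard ones.

```csharp
using System;

public static class ComparisonExamples
{
    // Case-insensitive equality without creating upper- or lower-cased copies.
    // The static string.Equals overload also copes with null arguments.
    public static bool SameCode(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    // Explicit ordinal prefix test; the overload without a StringComparison
    // argument would use the current culture instead.
    public static bool HasPrefix(string value, string prefix) =>
        value.StartsWith(prefix, StringComparison.Ordinal);

    // Case-insensitive substring search, again with no intermediate string.
    public static bool ContainsToken(string value, string token) =>
        value.IndexOf(token, StringComparison.OrdinalIgnoreCase) >= 0;
}
```

For example, ComparisonExamples.SameCode("abc", "ABC") is true, while ComparisonExamples.HasPrefix("ABC-1", "abc") is false because Ordinal is case-sensitive.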

If you have a lot of such comparisons to make, you can create a method that encapsulates the above implementation into a more succinct single-line call like this example:

StringUtils.StripFromEnd(ref someString, "XYZ", StringComparison.OrdinalIgnoreCase);

The implementation does nothing to the passed string unless the match value is present, so if there is no match, no string allocations occur. A more exact but arguably too verbose name for it would be StripFromEndIfPresent().
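For the curious, a method like this might look roughly as follows. This is a sketch reconstructed from the description above, not the actual StringUtils code, and it assumes ordinal comparisons, where a match is always exactly as long as the suffix.

```csharp
using System;

public static class StringUtils
{
    // Removes `suffix` from the end of `value` only when it is present.
    // When there is no match, `value` is left untouched and nothing is allocated.
    // Assumption: `comparison` is Ordinal or OrdinalIgnoreCase, so the matched
    // text has the same length as `suffix`.
    public static void StripFromEnd(ref string value, string suffix, StringComparison comparison)
    {
        if (string.IsNullOrEmpty(value) || string.IsNullOrEmpty(suffix))
        {
            return;
        }

        if (value.EndsWith(suffix, comparison))
        {
            // The only allocation: the new, shorter string.
            value = value.Substring(0, value.Length - suffix.Length);
        }
    }
}
```

Calling it with a string that does not end in the suffix returns immediately, and the caller keeps the very same string instance it passed in.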


A side note on passing by reference: It’s important to note that in the above method signature, we’re using a void method and passing someString by reference rather than the typical pass-by-value semantics and returning a string, which would have looked like this:

someString = StringUtils.StripFromEnd(someString, "XYZ", StringComparison.OrdinalIgnoreCase);

Why would passing by value be problematic? My benchmarking around these techniques in the Long Ago revealed a modest but consistent performance advantage for passing by reference, and since most of our text processing is just making various tests and transformations on the same string, it seemed intuitive to not be passing it in and out of methods, but to just have methods all “seeing” the same instance. Today, in .NET Core and .NET 5, by-reference and by-value seem to be within margin of error of each other in terms of performance, so just use whichever semantics make the most sense to you and your team. I’m going with reference semantics so I can more easily share my code with you.

Another thought on by-reference semantics: it goes against the grain of what you were probably taught, namely that passing by reference is inherently “dangerous” and should be avoided when possible. But when it’s what you consciously want to do, the implementation is designed to change that value, and the compiler forces you to mark ref arguments with the ref keyword at the call site so that you’re aware it is happening, it’s just another way of obtaining the result. Indeed, pass-by-reference has been encouraged where appropriate since C# 7.0 introduced “ref returns” and “ref locals”, with later additions such as in parameters (C# 7.2) and reassignable ref locals (C# 7.3).
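If you haven’t run into ref returns and ref locals yet, here is a tiny sketch of the idea. RefDemo and FirstNegative are hypothetical names used only for illustration; the point is that the caller gets an alias to the original storage rather than a copy.

```csharp
using System;

public static class RefDemo
{
    // Ref return: hands back a reference to the array slot itself,
    // not a copy of the value stored in it.
    public static ref int FirstNegative(int[] values)
    {
        for (int i = 0; i < values.Length; i++)
        {
            if (values[i] < 0)
            {
                return ref values[i];
            }
        }
        throw new InvalidOperationException("No negative value found.");
    }

    public static void Main()
    {
        int[] counts = { 3, -1, 7 };
        ref int slot = ref FirstNegative(counts); // ref local aliases counts[1]
        slot = 0;                                 // writes through to the array
        Console.WriteLine(counts[1]);             // prints 0
    }
}
```

Note that the ref keyword appears at every step (declaration, return, and call site), which is the same “you can’t do this by accident” property that applies to ref arguments.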


You also can make string comparison type arguments optional and default to the most common comparison type, sparing you the need to pass that argument explicitly in most cases. My most common need is to make cheaper ordinal comparisons because we upper-case everything in our DB anyway, so that is my default, making the call even more concise for a case-sensitive comparison:

StringUtils.StripFromEnd(ref someString, "XYZ");
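A signature with a defaulted comparison argument could be sketched like this. The class name StringUtilsWithDefault is hypothetical and chosen only to keep the example self-contained; the default of StringComparison.Ordinal reflects my own preference described above, so pick whichever default suits your data.

```csharp
using System;

public static class StringUtilsWithDefault
{
    // `comparison` is optional; callers that omit it get a case-sensitive
    // ordinal comparison. Assumes an ordinal comparison type is used, so the
    // matched text is exactly as long as `suffix`.
    public static void StripFromEnd(
        ref string value,
        string suffix,
        StringComparison comparison = StringComparison.Ordinal)
    {
        if (string.IsNullOrEmpty(value) || string.IsNullOrEmpty(suffix))
        {
            return;
        }

        if (value.EndsWith(suffix, comparison))
        {
            value = value.Substring(0, value.Length - suffix.Length);
        }
    }
}
```

One design point to keep in mind: optional parameter defaults are compiled into each call site, so if you later change the default in a shared library, callers need to be recompiled to pick it up.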

There’s a lot more we can do to avoid needless allocations, which we’ll explore in later installments. Indeed, such optimizations were a major impetus for Microsoft to essentially fork the .NET Framework into .NET Core, and then re-unify it in .NET 5. It gave them an opportunity to go through the framework code with a fine-tooth comb with the objective of making a well-written C# app closely approach the performance of native C++ code. The new runtime and libraries accomplish this in part by minimizing allocations, including string allocations, and therefore garbage collections. These improvements were then “dogfooded” back into the framework itself. This means that in the .NET 5 world, your own code will be the major remaining source of allocation issues. You can improve performance with a bit of attention to, and awareness of, allocation issues and a few other common bottlenecks.

In the next installment in this series, we’ll look at the ways .NET 5 (and, to a lesser extent, .NET Core and .NET Framework 4.5+) gives us tools to get closer to zero-allocation string and character array manipulation.
