Text Processing in .NET #2: Zero Allocation Techniques

In our first installment, we discussed the importance of avoiding allocating new strings, which is devilishly easy to do when manipulating them. We also touched on the use of StringComparison types to avoid allocations via calls to ToUpper() or ToLower() when doing case-insensitive comparisons.

Strings are, deep in the .NET internals, not “normal” reference types and in some ways refuse to behave as such. They are their own special beast, due to their immutability. They are essentially a wrapper around a character array. In this fact lies a clue as to how we avoid code with allocating side effects like these:

string newString = oldString1 + oldString2 + oldString3;
string newString = oldString1.Substring(0, oldString1.IndexOf(' ') + 1) + oldString2;

You might look at these assignments and think that they are easily resolved via the use of StringBuilder. And you would be right. But is that the most efficient way? Not really, not for hot paths.

.NET Core and .NET 5 actually have some optimizations for concatenating two strings similar to what we’re about to discuss, and in the future, may have more optimizations for other concatenation scenarios. There also will doubtless be improvements in StringBuilder and String.Format() as well. We are out on the bleeding edge here with what we’re attempting, for a particular purpose: extensive string manipulation on hot paths. As with any performance optimization, you should test these techniques against conceptually simpler and arguably more readable code, and you should re-test when new major .NET releases occur annually. But for now, these techniques will wring the last bit of performance out of things for you, and they should be fast enough that there would be no particular advantage in ripping them out later, or even delegating to .NET internals really.

The techniques I’ll show you can be encapsulated in method libraries with easy and natural calling interfaces so that they don’t burden your mainline code with weird requirements or awkward calling conventions (beyond passing some string arguments by reference, which may “feel” weird but actually makes perfect sense in context).

So: let’s get to it.

In most cases, when we want to create a new string, we know, or can easily figure out, the total length of the string we’re trying to create. Since a string instance is “just” a wrapper around a character array containing the string’s contents, we can create a character array of the appropriate size and copy the contents of various (sub)strings into it at the appropriate places, then create a new string from that character array using the new string(characterArray) constructor.

The problem is that until .NET Core, we didn’t have very fast tools for doing this. Copying one character at a time from a source character array or string indexer to a target character array isn’t the fastest way to go about this.

This is solved for us in the latest .NET iterations with Span<T> and ReadOnlySpan<T>. For C programmers reading this, you can think of Spans as a mechanism to allow type-safe and bounds checked memcpy() calls that are every bit as fast as directly copying blocks of memory. For the rest of us, think of Spans as a lens through which you can access all or part of an array of any type for the purpose of safe, ultra-fast copying. A Span superimposes itself over the raw data in an array and allows you to examine and copy all or part of it easily.

Let’s start with a simple example routine to create a new string from components of an existing one, and show how we can use spans to move data around. Suppose we want to replace all instances of runs of a given character in a string, with a single instance of that character. You could use this, for example, to normalize a string to never have multiple spaces in it. A simple implementation to replace all character runs with single instances of a character might be:

public void RemoveConsecutiveCharacters(ref string s, char c)
{
    if (s == null || s.Length < 2)
        return;

    string consecutiveRun = new string(c, 2);

    if (s.IndexOf(consecutiveRun, StringComparison.Ordinal) > -1)
    {
        string charAsString = new string(c, 1);

        while (s.Length > 1 && s.IndexOf(consecutiveRun, StringComparison.Ordinal) > -1)
            s = s.Replace(consecutiveRun, charAsString);
    }

}

We can make this, on average, TWICE as fast, though, at the expense of lengthier code. But let’s assume it’s a frequently needed normalization call, and let’s recast this in a more efficient way.

First, let’s count the allocations in the above code. Creating the string consecutiveRun is an allocation. Creating charAsString is another. Each Replace() call is another. The biggest offender is inside the loop, creating a new string every time we encounter another run of characters. This is particularly inefficient if the string contains runs longer than 2. In the case, for example, where there’s a run of 3 characters, that will be changed to 2, then to 1.

The way we’ll approach this is to copy the string’s characters into a character array, then scan that array for character runs of any length, and use copy techniques to shift everything to the right of the character run to the left, overwriting all but one instance of the character in the run. This will leave “garbage” at the end of the array, but we’ll track the shrinking length of the real string data as we go.

At the end, we’ll do just one allocation — the new string — via a new string(char[],start,length) call where start is zero and length is the new, shorter length.

Here’s the entire code; we’ll break this down into steps in the ensuing commentary. In a future installment, we’ll tighten it down still further.

public static void RemoveConsecutiveCharacters(ref string s, char c) {

    if (s == null || s.Length < 2)
        return;

    int state = 0;
    int pos = -1;
    int runLength = 0;
    int newLength = s.Length;
    char[] newChars = new char[s.Length];
    s.AsSpan().CopyTo(newChars.AsSpan());
    int i = 0;
             
    while (i < newLength) {

        if (newChars[i] == c) {
 
            if (state == 0) {
                state = 1; // encountered start of a run of characters
                pos = i; // pointer to first character in run
                runLength = 1;
            } else {
                runLength++;
            }

        } else {

            if (state == 1) {

                if (runLength > 1) {
                    // exited run of characters; copy rest of string over top of all but one
                    int copyLength = newLength - pos - runLength + 1;
                    newLength -= runLength;
                    newLength++;
                    newChars.AsSpan(pos + runLength - 1, copyLength).CopyTo(newChars.AsSpan(pos, copyLength));
                    i -= (runLength - 1); // adjust pointer for next character to examine, as we've shortened the evolving new string value
                    newChars[newLength] = '\0'; // This is just for debugging / visualization ease, it can be commented out after testing
                }
 
                state = 0;
            }
 
         }
 
         i++;
     }
 
     if (newLength < s.Length) {

        if (state == 1 && runLength > 1) {
            // We're inside a run at the end of the string.
            newLength -= runLength;
            newLength++;
        }
 
        s = new string(newChars, 0, newLength);
     }
 
}

Let’s begin by looking at the creation of the char[] array to hold the string’s contents:

char[] newChars = new char[s.Length];
s.AsSpan().CopyTo(newChars.AsSpan());

Let’s break that down into equivalent steps that might be clearer to understand:

char[] newChars = new char[s.Length];
var sourceSpan = new ReadOnlySpan<char>(s);
var targetSpan = new Span<char>(newChars);
sourceSpan.CopyTo(targetSpan);

As the name suggests, a ReadOnlySpan is just a Span that we can’t write to. Since strings are read-only (remember, they’re immutable), and we just want to read from it anyway, we create a ReadOnlySpan. This is a good habit. We can simplify all this, though, using the AsSpan() extension method:

char[] newChars = new char[s.Length];
var sourceSpan = s.AsSpan();
var targetSpan = newChars.AsSpan();
sourceSpan.CopyTo(targetSpan);

String.AsSpan() automatically returns a ReadOnlySpan. Since the need for the source and target spans were ephemeral, we had no real need to store them in variables, so we just shortened those last three lines into one:

s.AsSpan().CopyTo(newChars.AsSpan());

This is saying, in compact form, “directly copy all the characters in s to newChars.” (Note that the character array allocation and copy could be accomplished via s.ToCharArray(), but there’s a reason we’re not doing that which will become clear in our next installment; besides this is an easily grasped example of using Spans).

You might say after looking at this, “but the Span over newChars ceases to exist after this call!” And you’d be right, but it would be no problem: the copying operation is done, and the result is already in the newChars array. Remember that a Span is just a type-safe, bounds-checked mechanism for accessing the memory occupied by some or all the items in an array; it’s only needed for copy operations.

To understand what’s going on in the rest of the method, let’s consider the following string:

"A string  that has some double  spaces here and there.  "

We need to collapse the double spaces between “string” and “that”, between “double” and “spaces”, and at the very end of the string. We’ll do that via this call, assuming variable myString holds the above string:

FastString.RemoveConsecutiveCharacters(ref myString, ' ');

The first run begins at position 8 in the string. The method now creates a Span over the portion of the newChars array beginning at position 9 to the end, and copies it to a Span over the the same array from position 8, for the same length. It then places a “null character” (zero value) just past the end of the “new” string data. Here’s a representation of the characters in the newChars array, initially, and just after each copy operation (“dead” characters represented by asterisks):

"A string  that has some double  spaces here and there.  "
"A string that has some double  spaces here and there.  *"
"A string that has some double spaces here and there.  **"
"A string that has some double spaces here and there. ***"

Note, per the code comments, that the setting of null characters is purely for easy visualization; you can change it to any value (such as asterisks, above) or simply comment it out for a release build.

Eventually, the main loop encounters the end of the actual string data, sparing us (if characters were actually removed) from examining the “junk” portion of the array any further, and the special case of a trailing run of spaces is handled by adjusting the newLength variable. The 56 character string has become a 53 character string with just one string allocation involved! That’s better than the five allocations (at least) that would happen under the first implementation shown above.

Testing under CLR 4.7.2 shows this to have a run time that is 39 to 62 percent of the naive, if straightforward, implementation we began with, and this holds pretty well regardless of whether we do a thousand or ten thousand iterations, detect one consecutive run of characters, or many. That first implementation would be adequate for occasional use, but clearly on hot paths with thousands of calls, the kind of savings represented by our Span-based implementation adds up rapidly. And there’s no difference in the actual call; all the extra effort (and testing) is hidden from the client, as it should be.

There’s one more allocation going on in this routine. It’s not a string allocation; it’s the allocation of the newChars char[] array. In our next installment, we’ll examine ways to eliminate even that!

{ 2 comments… read them below or add one }

Nathan June 1, 2022 at 4:16 pm: I’m curious why you are passing the string into the method by reference. It seems to me it would be semantically and computationally identical to just passing a string in and returning a new string.
bob June 3, 2022 at 6:55 pm: Historical reasons relating to the system I originally developed these techniques for. Either way works fine, but we were passing around the same string from function to function and when I benchmarked this a few years ago in that use case there was a slight performance advantage. But that was probably back on framework 4.0 and dependent on the usage pattern. The benchmark is actually slightly slower now. Also the original client for this was VB.NET and it was not as semantically clumsy to pass by reference in VB because the compiler figured out the called arguments that were ByRef and passed them that way from the caller for you (arguably nice). It even went so far as to handle passing properties by reference for you, which camouflaged that a temporary variable was being created and assigned (arguably bad).

Text Processing in .NET #2: Zero Allocation Techniques

Archives

Categories

Recent Comments

Search