Few things yesterday were less expected than the OutOfMemoryException suddenly thrown by very mature code that had run essentially unchanged for about five years, especially since the offending line simply created an empty DataRow with fewer than two dozen columns. What, I wondered, could possibly cause the app to bonk its head on a memory ceiling with the innocuous creation of a single little ol’ DataRow?
Well, plenty, it turns out. After some Googling around and head-scratching I realized there appears to be a sick, codependent relationship between NewRow() and the DataTable it’s called on, because others occasionally hit this issue, too. My educated guess, without actually probing around in the framework source, is that NewRow() assumes the DataRow it’s about to return will shortly be added to its DataTable, so it checks whether the row will fit within the current memory and collection allocations and, if not, makes room for it. This probably has the nice side effect of allowing a few milliseconds for any necessary garbage collection to take place before the (presumably inevitable) Rows.Add() call.
The assumption that the new DataRow will in fact be added to the DataTable is probably sound most of the time, but not in our particular use case. We were processing arbitrarily large input files in 5,000-record batches cached in a DataTable, and we weren’t necessarily adding all of the incoming rows to it. When each batch was committed, Rows.Clear() was called so that the DataTable could begin receiving the next batch.
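Our loop looked roughly like the following sketch (the schema, the validation rule, and the CommitBatch call are invented for illustration). The key point is that NewRow() runs for every input record, whether or not the record is ultimately added:

```csharp
using System;
using System.Data;

class BatchLoader
{
    const int BatchSize = 5000;

    // Hypothetical schema standing in for our real two-dozen-column table.
    static DataTable BuildSchema()
    {
        var table = new DataTable("Batch");
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Name", typeof(string));
        return table;
    }

    static void Main()
    {
        var table = BuildSchema();   // one long-lived table, reused for every batch

        for (int batch = 0; batch < 3; batch++)
        {
            for (int i = 0; i < BatchSize; i++)
            {
                // NewRow() may reserve capacity in the table for the row it returns...
                DataRow row = table.NewRow();
                row["Id"] = i;
                row["Name"] = "record " + i;

                // ...but only a fraction of the rows pass validation and get added
                // (stand-in condition; the real check examined the record's contents).
                if (i % 20 == 0)
                    table.Rows.Add(row);
            }

            // CommitBatch(table);  // hypothetical commit step
            table.Rows.Clear();     // empties the collection, but apparently does not
                                    // return all of the memory reserved along the way
        }
    }
}
```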
What I didn’t realize is that Rows.Clear() apparently fails to release some or all of the memory held onto by the Rows collection, effectively creating a memory leak.
The reason we hadn’t encountered the problem before was that the leak wasn’t severe enough in our typical case. This input table, however, was unusually large, and it contained an unusual number of records that, once examined, were rejected for one reason or another and therefore never added to the DataTable. In other words, we were calling NewRow() about twenty times as often as we were actually adding records, which multiplied the per-record allocation twenty-fold.
The simplest fix in our case was to stop reusing the DataTable and instead create a fresh instance for each batch. This was simple: create the table once, hold a reference to it, and call Clone() on that instance whenever you want a new, empty DataTable. This stabilized our memory situation immediately; the local working set quit its open-ended growth and just oscillated up and down within a safe range.
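A minimal sketch of that fix (class and method names are invented for illustration). Clone() copies the schema but no rows, so each batch starts with a genuinely fresh table, and the previous batch's table becomes eligible for garbage collection as a whole:

```csharp
using System.Data;

class BatchProcessor
{
    // The template is built once and holds only the schema.
    private readonly DataTable _template;

    public BatchProcessor(DataTable template)
    {
        _template = template;
    }

    public void ProcessBatch(/* batch input */)
    {
        // Clone() copies columns and constraints but not data,
        // so this is a new, empty table each time.
        DataTable table = _template.Clone();

        // ... fill the table from the batch and commit it ...

        // table goes out of scope here; the GC can reclaim everything,
        // unlike Rows.Clear(), which left memory tied to the reused table.
    }
}
```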
On a related note, it’s good to remind ourselves that DataTables were never intended to be gigantic. Regardless of how much RAM is available, or whether you’re on 32-bit or 64-bit Windows, a DataTable is limited to roughly 1.5 GB in size, and that includes versioning (e.g., uncommitted copies of changed rows) and other overhead (e.g., indexes). Use firehose cursors and database back ends for heavier-duty requirements. The main reason we used a DataTable in this application was for backward compatibility with a poorly written API that we had to retrofit to handle arbitrarily large sets of data by “chunkifying” the processing into manageable batches.
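For reference, a firehose cursor in .NET is just a forward-only, read-only data reader such as SqlDataReader, which streams one record at a time instead of materializing the whole result set in memory. A hedged sketch (connection string, query, and column layout are placeholders):

```csharp
using System.Data;
using System.Data.SqlClient;

class FirehoseExample
{
    static void Stream(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Id, Name FROM Records", conn))
        {
            conn.Open();

            // SequentialAccess streams each row's data as it is read,
            // keeping memory use flat no matter how large the result set is.
            using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
            {
                while (reader.Read())
                {
                    int id = reader.GetInt32(0);
                    string name = reader.GetString(1);
                    // process one record at a time
                }
            }
        }
    }
}
```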