Working with Large ADO.NET DataSets

by bob on October 23, 2007

I’ve been working on a custom data transformation application for a while. Its default mode of operation is to load an entire input data file into a DataTable, make several transformation / scrubbing passes through it, and then write it back out to a file and/or a database. This is very, very fast, but when the original file’s size approached 70 MB or so, we’d start getting “out of memory” errors.

Now, given that a DataTable can theoretically reach a size of about 1.5 GB (out of the 2 GB process address space in 32-bit Windows), it seemed that, even allowing for some of the things going on within this application (copying and saving some initial values for later comparison against the scrubbed values, and so on), a 70 MB file should never consume more than about 250 MB of memory, still very far from the theoretical maximum.

Your mileage may vary, but what worked very well for us was to stop after every 1000 rows processed and make a DataTable.AcceptChanges() call followed by a GC.Collect(). Each of these pauses takes around 0.7 seconds, but it gets rid of the “out of memory” issue by minimizing memory fragmentation.
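
Stripped down to its essentials, the pattern looks roughly like this (a bare-bones sketch, with a hypothetical ScrubRow() standing in for the real transformation work and no row deletions yet; the restructured version we actually ended up with is further down):

for (int i = 0; i < dt.Rows.Count; i++) {
  ScrubRow(dt.Rows[i]);         // hypothetical per-row transformation / scrubbing

  if (i > 0 && i % 1000 == 0) {
    dt.AcceptChanges();         // commit pending changes, dropping the old row versions
    GC.Collect();               // ~0.7 second pause, but keeps fragmentation in check
  }
}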

The basic problem here is that many, many copies of strings are being made in a tight processing loop, some of which are temporary and some of which are being persisted to the DataTable. Strings are immutable in .NET, so unless you are in a situation where string interning techniques are useful, two strings with the value “CAT” will take up twice the space of one string; the instances don’t point to the same data.
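
For instance (a quick illustration, not code from the application itself):

string a = "CAT";                                                 // literals are interned automatically
string b = new string(new[] { 'C', 'A', 'T' });                   // built at run time, so not interned

Console.WriteLine(object.ReferenceEquals(a, b));                  // False: two separate allocations
Console.WriteLine(object.ReferenceEquals(a, string.Intern(b)));   // True: Intern() returns the pooled "CAT"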

Compounding this is the way DataTables work: when you change a string column value, you don’t overwrite the original string; the new value goes into a separate, uncommitted version of the row, so the reference to the original string(s) in the original row version won’t be released (and even then only non-deterministically, by the GC) until you call AcceptChanges() to commit your changes.
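
For example (assuming a string column named "Name" on a table whose rows have already been accepted once, say right after the initial load):

DataRow dr = dt.Rows[0];
dr["Name"] = "scrubbed value";   // the row now carries two versions of this value

string original = (string)dr["Name", DataRowVersion.Original];   // the old string is still referenced here
string current  = (string)dr["Name", DataRowVersion.Current];    // the new string

dt.AcceptChanges();   // commits the change; the old Original value is dropped and its string becomes collectible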

Calling AcceptChanges() regularly has a side effect: it limits what can happen inside the processing loop. You can’t delete rows within the loop, because committing a deletion removes the row from the collection, and that violates the prohibition against changing a collection while it’s being iterated (typically, with foreach).

In our case it was fairly simple to change the original foreach into an inner for loop, driven by an outer while loop that performed the housekeeping: committing the previous batch with AcceptChanges() / GC.Collect() and determining the start and end rows for the next batch to be processed. Deletions were requested by adding the index of the row to be deleted to a List<int>, and the outer loop took care of applying the deletions without losing track of the proper start row for the next “batch”.

In the end it looked something like this, given a DataTable named dt to iterate:

int startRow = -1;   // -1 signifies the first pass
int endRow = 0;
List<int> toDelete = new List<int>();

while (SUtility.CommitBatchAndSetUpNextBatch(dt, toDelete, ref startRow, ref endRow)) {

  for (int currentRow = startRow; currentRow < endRow; currentRow++) {
    DataRow dr = dt.Rows[currentRow];
    // do stuff to each row; request a deletion with toDelete.Add(currentRow)
  }

}

The CommitBatchAndSetUpNextBatch() method applies the deletions, calls toDelete.Clear(), issues the AcceptChanges() and GC.Collect() calls, calculates the next row index range (taking bounds conditions into account), and returns false when there are no more batches left. The batch size of 1000 rows was hard-coded but could have been made a config item. This method lets the same housekeeping code be reused in any loop that processes a DataTable.
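
Roughly, the shape of it is something like this (a simplified sketch rather than the production code; the deletion bookkeeping in particular is trimmed down):

using System;
using System.Collections.Generic;
using System.Data;

public static class SUtility {

  private const int BatchSize = 1000;   // hard-coded in the original; could be a config item

  public static bool CommitBatchAndSetUpNextBatch(DataTable dt, List<int> toDelete,
                                                  ref int startRow, ref int endRow) {
    if (startRow < 0) {
      // First pass: nothing to commit yet, just set up the first batch.
      startRow = 0;
    }
    else {
      // Apply the deletions requested during the previous batch, highest index first
      // so any indexes still pending remain valid. Assumes the requested indexes
      // all fall within the batch just processed.
      toDelete.Sort();
      for (int i = toDelete.Count - 1; i >= 0; i--)
        dt.Rows[toDelete[i]].Delete();

      // Rows removed from the previous batch shift everything after them down,
      // so the next batch starts that much earlier.
      startRow = endRow - toDelete.Count;
      toDelete.Clear();

      dt.AcceptChanges();   // commit edits and physically remove the deleted rows
      GC.Collect();         // reclaim the now-unreferenced original row versions
    }

    if (startRow >= dt.Rows.Count)
      return false;         // no more batches left

    endRow = Math.Min(startRow + BatchSize, dt.Rows.Count);
    return true;
  }
}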

1 comment

liviu October 24, 2007 at 2:42 pm

I think with large-scale processing the GC is showing its weakness, or rather, the fact that data is NOT allocated on the stack.
The stack is so easy to rewind…
Sometimes I think about the possibility of writing the data access in C++…

Bob responds: I don’t know that the GC is showing an inherent weakness so much as that it’s tuned for other, more common use cases. Arguably, this issue also shows a downside of making strings immutable, but rumblings I’ve heard coming out of Microsoft recently suggest that more immutable collections and objects are in our future, essentially trading memory for thread safety and ease of use.

Something I wish for: a lightweight DataTable without all the versioning cruft, and probably with some other little-used features jettisoned. It wouldn’t be read-only; it would just keep one version of each row, with no commit / rollback functionality. For many uses, that’s enough. It would have a smaller memory footprint and slightly better performance, and would tend not to create so much fragmentation in the first place. In addition, I’d love to see a full-featured DataSet like we already have, but one that uses a physical disk file or a DB table for its backing store: in other words, a DataSet of effectively unlimited size, like a DataReader but able to be navigated at will in any direction.
