When updating the way you are, you need to retrieve the document content in order to inspect it and make such modifications. MongoDB has no atomic operation that acts on an existing value in the way you want, so iteration is of course required.
There is no real difference in the "query" portion of how you are matching on the regular expression between your two versions of the statement. No matter what, the content is converted to BSON before being sent to the server anyway, so whether you use a standard expression builder or a direct BSON document is of little consequence.
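As a minimal sketch (assuming the JobInfoRecord class and field name from your question), both forms below produce the same BSON filter on the wire; the typed builder just adds compile-time checking:

var regex = new BsonRegularExpression(@"(?si)<([^\s<]*workUnit[^\s<]*)>.*?</\1>", "i");

// Typed expression builder
var typedFilter = Builders<JobInfoRecord>.Filter.Regex(
    x => x.SerializedBackgroundJobInfo, regex);

// Direct BSON document; implicitly convertible to a FilterDefinition
var rawFilter = new BsonDocument("SerializedBackgroundJobInfo", regex);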
But on to the performance improvements that can be made.
Use Bulk Operations to update
As stated, Bulk Operations are the way you should be updating over such a list iteration, and you also "should" be using a cursor rather than converting all results to a list, since doing so saves on memory.
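For contrast, and purely as a sketch since the filter is only defined in the full example below, this is the difference between materializing a list and walking a cursor:

// Converting to a list materializes every matching document in memory at once
var allDocs = await records.Find(filter).ToListAsync();

// A cursor instead holds only the current batch of results
var cursor = await records.FindAsync(filter);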
Eschewing all the specific type declarations and just representing everything as BsonDocument (which will probably save you on marshalling, but is not strictly needed), the basic example process would be:
// Assumes: using MongoDB.Bson; using MongoDB.Driver; using System.Text.RegularExpressions;
// and that "records" here is an IMongoCollection<BsonDocument>
var pattern = @"(?si)<([^\s<]*workUnit[^\s<]*)>.*?</\1>";
var filter = Builders<BsonDocument>.Filter.Regex("SerializedBackgroundJobInfo",
    new BsonRegularExpression(pattern, "i"));

var ops = new List<WriteModel<BsonDocument>>();
var writeOptions = new BulkWriteOptions() { IsOrdered = false };

using (var cursor = await records.FindAsync(filter))
{
    while (await cursor.MoveNextAsync())
    {
        foreach (var doc in cursor.Current)
        {
            // Replace inspected value
            var updatedJobInfo = Regex.Replace(
                doc["SerializedBackgroundJobInfo"].AsString, pattern, "<$1></$1>");

            // Add WriteModel to list
            ops.Add(
                new UpdateOneModel<BsonDocument>(
                    Builders<BsonDocument>.Filter.Eq("JobTypeValue", doc["JobTypeValue"]),
                    Builders<BsonDocument>.Update.Set("SerializedBackgroundJobInfo", updatedJobInfo)
                )
            );

            // Execute once in every 1000 and clear list
            if (ops.Count == 1000)
            {
                BulkWriteResult<BsonDocument> result = await records.BulkWriteAsync(ops, writeOptions);
                ops = new List<WriteModel<BsonDocument>>();
            }
        }
    }

    // Submit any remaining operations
    if (ops.Count > 0)
    {
        BulkWriteResult<BsonDocument> result = await records.BulkWriteAsync(ops, writeOptions);
    }
}
So rather than make a request to the database for every single document retrieved from the query, you create a List of WriteModel operations instead. Once this list has grown to a reasonable size (1000 in this example) you commit the write operations to the server in a single request and response for all batched operations. Here we use BulkWriteAsync.
You can create the batches in a size greater than 1000 if you like, but 1000 is generally a reasonable number to deal with. The only real hard limit is the BSON limit of 16MB; since all requests are still actually BSON documents, this still applies. In any case it takes a lot of requests to approach 16MB, but there is also an impedance match to consider in how the request will be processed when it actually reaches the server, as documented:
"Each group of operations can have at most 1000 operations. If a group exceeds this limit, MongoDB will divide the group into smaller groups of 1000 or less. For example, if the bulk operations list consists of 2000 insert operations, MongoDB creates 2 groups, each with 1000 operations."
Therefore, by keeping the request size at the same level as how the server will process it, you also get the benefit from the yield where "multiple batches" can in fact be acting over parallel connections to the server, rather than letting the server do the splitting and queuing.
The returned result is of type BulkWriteResult, which contains information on the number of "matches" and "modifications" etc. from the batch of operations sent.
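For example, the counts can be read off each returned batch result; these properties exist on BulkWriteResult in the .NET driver:

BulkWriteResult<BsonDocument> result = await records.BulkWriteAsync(ops, writeOptions);

Console.WriteLine($"Requests: {result.RequestCount}");
Console.WriteLine($"Matched:  {result.MatchedCount}");
if (result.IsModifiedCountAvailable)
    Console.WriteLine($"Modified: {result.ModifiedCount}");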
Naturally, since the operations are in "batches", it makes sense to check at the end of the loop iteration whether any more "batched" operations remain in the list, and then of course submit them in the same way.
Also note the IsOrdered = false on BulkWriteOptions. This means that the batch of operations is not actually executed in serial order, so the server can in fact run the tasks in "parallel". This can make "huge" speed improvements where the order of commitment is not required. The default is to submit "ordered" and serially. Setting this option is not required, but if your order is not important (which it should not be in this case, since no other operation request here depends on the previous modification of a document) then the improvement you get is worthwhile.
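One practical consequence, sketched below: an ordered batch stops at the first failed operation, while an unordered batch attempts every operation and reports any failures together via MongoBulkWriteException:

try
{
    // IsOrdered = false: every operation is attempted even if some fail
    await records.BulkWriteAsync(ops, new BulkWriteOptions { IsOrdered = false });
}
catch (MongoBulkWriteException<BsonDocument> ex)
{
    // Failures are collected rather than aborting at the first error
    foreach (var error in ex.WriteErrors)
        Console.WriteLine($"Operation {error.Index} failed: {error.Message}");
}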
What this is all about is "reducing" the number of actual requests made to the server. Sending an update and awaiting a response takes time, and over large operations it is a very costly exercise. That is what Bulk Operations are meant to deal with, by applying several operations within the one request. Reducing that overhead is a "huge" performance gain, and that is why you use this.