c# - How to increase performance of the update operation in Mongo?

foreach (var doc in await records.Find(filter).ToListAsync())
{
    var query = Builders<JobInfoRecord>.Filter.Eq("JobTypeValue", doc.JobTypeValue);
    var updatedJobInfo = Regex.Replace(doc.SerializedBackgroundJobInfo, pattern, "<$1></$1>");
    var update = Builders<JobInfoRecord>.Update.Set("SerializedBackgroundJobInfo", updatedJobInfo);

    records.UpdateOneAsync(query, update).Wait();
}

Is this the best way to update documents? (I'm changing the values of XML tags whose names contain "password" to an empty tag, e.g. <adminPassword></adminPassword>.) I'm using the MongoDB C# driver 2.0.2.
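For clarity, a small sketch of the transformation I want (the tag name and pattern here are illustrative, not my real pattern):

var input = "<adminPassword>secret</adminPassword>";
var output = Regex.Replace(input, @"(?si)<([^\s<]*password[^\s<]*)>.*?</\1>", "<$1></$1>");
// output == "<adminPassword></adminPassword>"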

I have a collection of 500,000 documents, and each minute I (hopefully) update approximately 3,000 of them.

How can I increase the performance of the update operation?


1 Reply


When updating in the way that you are, you need to retrieve the document content in order to inspect it and make such a modification. MongoDB has no atomic operation that acts on an existing value in the way you want (an update expression cannot compute a new value from the current one here), so iterating the matched documents is of course required.

There is also no real difference in the "query" portion, i.e. how you match on the regular expression, between the two forms of the statement: the content is converted to BSON before being sent to the server anyway, so whether you use a typed expression builder or a raw BsonDocument is of little consequence.
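To illustrate, here is a short sketch (the field name is taken from the question) showing that a typed builder and a raw BsonDocument produce an equivalent filter:

var pattern = @"(?si)<([^\s<]*workUnit[^\s<]*)>.*?</\1>";

// Typed builder form
var builderFilter = Builders<BsonDocument>.Filter.Regex(
    "SerializedBackgroundJobInfo", new BsonRegularExpression(pattern, "i"));

// Raw BSON form; the server sees the same condition:
// { "SerializedBackgroundJobInfo": { "$regex": <pattern>, "$options": "i" } }
var rawFilter = new BsonDocument("SerializedBackgroundJobInfo",
    new BsonDocument { { "$regex", pattern }, { "$options", "i" } });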

But on to the performance improvements that can be made.

Use Bulk Operations to update

As stated, Bulk Operations are the way you should be updating with this sort of list iteration, and you also "should" be using a cursor rather than converting all results to a list, since that saves memory.

Eschewing the specific type declarations and just representing everything as BsonDocument (which will probably save you some marshalling, though it isn't required), the basic example process would be:
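First, the assumed setup (the connection string, database name, and collection name here are illustrative, not from the question):

// Illustrative setup: an untyped BsonDocument view of the collection, so the
// filter, cursor, and write models below all share one document type.
var client = new MongoClient("mongodb://localhost:27017");
var database = client.GetDatabase("jobs");
var records = database.GetCollection<BsonDocument>("jobInfoRecords");

Then the update loop itself: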

var pattern = @"(?si)<([^\s<]*workUnit[^\s<]*)>.*?</\1>";
var filter = Builders<BsonDocument>.Filter.Regex("SerializedBackgroundJobInfo",
                                                 new BsonRegularExpression(pattern, "i"));

var ops = new List<WriteModel<BsonDocument>>();
var writeOptions = new BulkWriteOptions() { IsOrdered = false };

using (var cursor = await records.FindAsync(filter))
{
    while (await cursor.MoveNextAsync())
    {
        foreach (var doc in cursor.Current)
        {
            // Replace inspected value
            var updatedJobInfo = Regex.Replace(
                doc["SerializedBackgroundJobInfo"].AsString, pattern, "<$1></$1>");

            // Add WriteModel to list
            ops.Add(
                new UpdateOneModel<BsonDocument>(
                    Builders<BsonDocument>.Filter.Eq("JobTypeValue", doc["JobTypeValue"]),
                    Builders<BsonDocument>.Update.Set("SerializedBackgroundJobInfo", updatedJobInfo)
                )
            );

            // Execute once in every 1000 and clear list
            if (ops.Count == 1000)
            {
                BulkWriteResult<BsonDocument> result = await records.BulkWriteAsync(ops, writeOptions);
                ops = new List<WriteModel<BsonDocument>>();
            }
        }
    }

    // Commit any operations remaining in the final partial batch
    if (ops.Count > 0)
    {
        BulkWriteResult<BsonDocument> result = await records.BulkWriteAsync(ops, writeOptions);
    }
}

So rather than making a request to the database for every single document retrieved from the query, you build up a List of WriteModel operations instead.

Once this list has grown to a reasonable size (1,000 in this example), you commit the write operations to the server in a single request and response for the whole batch, here using BulkWriteAsync.

You can create batches larger than 1,000 if you like, but 1,000 is generally a reasonable number to work with. The only real hard limit is the 16 MB BSON limit: since each request is still itself a BSON document, that limit still applies. It takes a lot of operations to approach 16 MB anyway, but there is also an impedance match to consider in how the request will be processed once it actually reaches the server, as documented:

"Each group of operations can have at most 1000 operations. If a group exceeds this limit, MongoDB will divide the group into smaller groups of 1000 or less. For example, if the bulk operations list consists of 2000 insert operations, MongoDB creates 2 groups, each with 1000 operations."

Therefore, by keeping the request size at the same level at which the server will process it, you also get the benefit that "multiple batches" can in fact act over parallel connections to the server, rather than letting the server do the splitting and queuing.

The returned result is a BulkWriteResult, which contains information on the number of "matches" and "modifications" etc. from the batch of operations sent.
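As a small sketch, those counts can be inspected directly on the result, e.g.:

// Counts reported by the server for the submitted batch
Console.WriteLine($"Matched: {result.MatchedCount}, Modified: {result.ModifiedCount}");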

Naturally, since the operations go out in "batches", it makes sense to check at the end of the loop whether any more "batched" operations remain in the list, and then of course to submit them in the same way.

Also note IsOrdered = false in the BulkWriteOptions: this means that the batch of operations is not actually executed in serial order, so the server can in fact run the writes in "parallel". This can bring "huge" speed improvements where the order of commitment is not required. The default is to submit "ordered", serially.

Setting this option is not required, but if order is not important (which it should not be in this case, since no operation here depends on the previous modification of another document), then the improvement you get is worthwhile.
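For contrast, a minimal sketch of the two modes:

// Ordered (the default): operations execute serially, and the first error
// aborts the remainder of the batch.
var orderedOptions = new BulkWriteOptions { IsOrdered = true };

// Unordered: the server is free to execute the operations in parallel, and an
// error in one operation does not prevent the others from being attempted.
var unorderedOptions = new BulkWriteOptions { IsOrdered = false };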

What this is all about is reducing the number of actual requests made to the server. Sending an update and awaiting the response takes time, and over a large operation that cost is substantial: as a rough illustration, 3,000 individual updates paying even ~1 ms of network round trip apiece spend about 3 seconds just waiting, whereas the same work in batches of 1,000 pays that round-trip cost only 3 times. That is what Bulk Operations are meant to deal with, by applying several operations within the one request.

Reducing that overhead is a "huge" performance gain, and that is why you use them.

