Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

mongodb - How to delete documents by query efficiently in mongo?

I have a query that selects the documents to be removed. Right now, I remove them manually, like this (using Python):

for id in mycoll.find(query, fields={}):
  mycoll.remove(id)

This does not seem to be very efficient. Is there a better way?

EDIT

OK, I owe an apology for forgetting to mention the query details, because they matter. Here is the complete Python code:

from pymongo import ASCENDING

def reduce_duplicates(mydb, max_group_size):
  # 1. Count the group sizes
  res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce, 'filter_scratch', full_response = True)
  # 2. For each entry from the filter scratch collection having count > max_group_size
  deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
  for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
    key = entry['_id']
    group_size = int(entry['value'])
    # 2b. query the original collection by the entry key, order it by test_date ascending, limit to the group size minus max_group_size.
    for id in mydb.static.find(key, limit = group_size - max_group_size, **deleteFindArgs):
      mydb.static.remove(id)
  return res['counts']['input']

So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:

  1. Map-reduce the data to (key, count) pairs.
  2. Iterate over all the pairs with count > max_group_size.
  3. Query the data by key, sorting it ascending by the timestamp (the oldest first) and limiting the result to the (count - max_group_size) oldest records.
  4. Delete each and every found record.

As you can see, this reduces the duplicates to at most the N newest records. The last two steps are a foreach-found-remove loop, and that is the important detail of my question. It changes everything, and I should have been more specific about it. Sorry.

Now, about the collection remove command. It does accept a query, but mine includes sorting and limiting. Can I do it with remove? Well, I have tried:

mydb.static.remove(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])

This attempt fails miserably. Moreover, it seems to screw up mongo. Observe:

C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database

Needless to say, the foreach-found-remove approach works and yields the expected results.

Now, I hope I have provided enough context and (hopefully) have restored my lost honour.

1 Reply

You can use a query to remove all matching documents:

var query = {name: 'John'};
db.collection.remove(query);
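
Since the question uses Python, here is a minimal sketch of the same idea with pymongo, assuming version 3 or later, where delete_many is the counterpart of the shell's remove(query). The helper names delete_matching and delete_oldest are made up for illustration. delete_many takes a filter only, with no sort or limit, so the second helper first collects the _id values of the oldest documents with a sorted, limited find and then deletes them in one call.

```python
def delete_matching(collection, query):
    """Remove every document matching `query` in a single server-side call.

    `collection` is assumed to be a pymongo 3+ Collection.
    """
    return collection.delete_many(query).deleted_count


def delete_oldest(collection, query, sort_field, count):
    """Remove the `count` matching documents with the smallest `sort_field`."""
    if count <= 0:
        return 0  # limit(0) means "no limit" in MongoDB, so bail out early
    # 1 is the value of pymongo.ASCENDING, so the oldest documents come first.
    cursor = collection.find(query, {"_id": 1}).sort(sort_field, 1).limit(count)
    ids = [doc["_id"] for doc in cursor]
    if not ids:
        return 0
    # A single delete replaces the per-document remove loop.
    return collection.delete_many({"_id": {"$in": ids}}).deleted_count
```

With this, the inner loop from the question could become something like delete_oldest(mydb.static, key, 'test_date', group_size - max_group_size).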

Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.

Let's say you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
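
One way to delete in chunks from Python is sketched below. It assumes pymongo 3 or later (Collection.find and Collection.delete_many); the function name delete_in_chunks and its chunk_size parameter are illustrative, not part of any library. Each pass reads the _id values of the next batch and removes that batch with a single delete.

```python
def delete_in_chunks(collection, query, chunk_size=1000):
    """Delete documents matching `query` in batches of at most `chunk_size`.

    `collection` is assumed to be a pymongo 3+ Collection.
    Returns the total number of deleted documents.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total_deleted = 0
    while True:
        # Fetch only the _id values of the next batch of matching documents.
        cursor = collection.find(query, {"_id": 1}).limit(chunk_size)
        ids = [doc["_id"] for doc in cursor]
        if not ids:
            break  # nothing left that matches the query
        # One round trip removes the whole batch.
        result = collection.delete_many({"_id": {"$in": ids}})
        total_deleted += result.deleted_count
    return total_deleted
```

A call such as delete_in_chunks(db.collection, {'name': 'John'}, chunk_size=1000) would then issue roughly 100 small deletes for 100k matching documents, leaving room for other operations between batches.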

