Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
290 views
in Technique[技术] by (71.8m points)

amazon web services - batch processing s3 objects using lambda

The use case is that 1000s of very small-sized files are uploaded to s3 every minute and all the incoming objects are to be processed and stored in a separate bucket using lambda. But using s3-object-create as a trigger will make many lambda invocations and concurrency needs to be taken care of. I am trying to batch process the newly created objects for every 5-10 minutes. S3 provides batch operations but it reports are generated everyday/week. Is there a service available that can help me?

question from:https://stackoverflow.com/questions/66058822/batch-processing-s3-objects-using-lambda

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

According to AWS documentation, S3 can publish "New object created events" to following destinations:

  • Amazon SNS
  • Amazon SQS
  • AWS Lambda

In your case I would:

  1. Create SQS.
  2. Configure S3 Bucket to publish S3 new object events to SQS.
  3. Reconfigure your existing Lambda to subscribe to SQS.
  4. Configure batching for input SQS events.

Currently, the maximum batch size for SQS-Lambda subscription is 1000 events. But since your Lambda needs around 2 seconds to process single event, then you should start with something smaller, otherwise Lambda will timeout, because it won't be able to process all of the events.

Thanks to this, uploading X items to S3 will produce X / Y events, where Y is maximum batch size of SQS. For 1000 S3 items and batch size of 100, it will only invoke around 10 concurrent Lambda executions.

The AWS document mentioned above explains, how to publish S3 events to SQS. I won't explain it here, as it's more about implementation details.

Execution time

However you might run into a problem, where the processing is too slow, because Lambda will be processing probably events one-by-one in a loop.

The workaround would be to use asynchronous processing and implementation depends what runtime you use for Lambda, for Node.js it would be very easy to achieve.

Also if you want to speed up the processing in other ways, simply reduce maximum batch size and increase Lambda memory configuration, so single execution will be processing smaller number of events and will have access to more CPU units.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...