Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
588 views
in Technique[技术] by (71.8m points)

python - Apache Spark reads for S3: can't pickle thread.lock objects

So I want my Spark App to read some text from Amazon's S3. I Wrote the following simple script:

import boto3
s3_client = boto3.client('s3')
text_keys = ["key1.txt", "key2.txt"]
data = sc.parallelize(text_keys).flatMap(lambda key: s3_client.get_object(Bucket="my_bucket", Key=key)['Body'].read().decode('utf-8'))

When I do data.collect I get the following error:

TypeError: can't pickle thread.lock objects

and I don't seem to find any help online. Have perhaps someone managed to solve the above?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Your s3_client isn't serialisable.

Instead of flatMap use mapPartitions, and initialise s3_client inside the lambda body to avoid overhead. That will:

  1. init s3_client on each worker
  2. reduce initialisation overhead

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...