I run Spark Word2Vec on about 2 GB of data using PySpark 3.0.0, with the following config:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("word2vec")
    .master("local[*]")
    .config("spark.driver.memory", "32g")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)
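(For reference, since local[*] runs everything in a single driver JVM, the driver memory setting is the one that governs the shuffle here; the effective values can be checked at runtime with the standard SparkConf API:)

# Sanity check: confirm the settings actually took effect in the running JVM.
print(spark.sparkContext.getConf().get("spark.driver.memory"))
print(spark.sparkContext.getConf().get("spark.master"))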
The code that triggers the problem is simply the following:
import pyspark.sql.functions as f
from pyspark.ml.feature import Word2Vec

sentence = sample_corpus.withColumn('sentences', f.split(sample_corpus.sentences, ',')).select('sentences')
word2vec = Word2Vec(vectorSize=300, inputCol="sentences", outputCol="result", minCount=10)
model = word2vec.fit(sentence)
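(For context, the knobs that most directly shape the shuffle performed by fit() are the partitioning of the input DataFrame and Word2Vec's numPartitions parameter, which defaults to 1; the values below are purely illustrative, not a verified fix:)

# Illustrative sketch only: more partitions means smaller shuffle blocks
# per task, at the cost of some accuracy/speed for Word2Vec itself.
sentence = sentence.repartition(64)  # 64 is an arbitrary example value
word2vec = Word2Vec(vectorSize=300, inputCol="sentences", outputCol="result",
                    minCount=10, numPartitions=8)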
However, it throws the following error during the computation, which does not give much useful information for debugging.
FetchFailed(BlockManagerId(driver, fedora, 34105, None), shuffleId=2, mapIndex=0, mapId=73, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: requirement failed
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:50)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:110)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:268)
at org.apache.spark.storage.ShuffleBlockFetcherIterator$SuccessFetchResult.&lt;init&gt;(ShuffleBlockFetcherIterator.scala:981)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:422)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:536)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.&lt;init&gt;(ShuffleBlockFetcherIterator.scala:171)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
... 14 more
)
My machine has 64 GB of memory. I wonder what the cause is here and how I can fix it. Thank you.
question from:
https://stackoverflow.com/questions/65602060/apache-spark-word2vec-throws-org-apache-spark-shuffle-fetchfailedexception-requ