Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
353 views
in Technique[技术] by (71.8m points)

rdd - Spark Mlib FPGrowth job fails with Memory Error

I have a fairly simple use case, but potentially very large result set. My code does the following (on pyspark shell):

from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
# Perform any RDD operation
for item in model.freqItemsets().toLocalIterator():
    # do something with item

I find that whenever I kick off the actual processing by calling either count() or toLocalIterator, my operation ultimately ends with out of memory error. Is FPGrowth not partitioning my data? Is my result data so big that getting even a single partition chokes up my memory? If yes, is there a way I can persist an RDD to disk in a "streaming" fashion without trying to hold it in memory?

Thanks for any insights.

Edit: A fundamental limitation of FPGrowth is that the entire FP Tree has to fit in memory. So, the suggestions about raising the minimum support threshold are valid.

-Raj

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Well, the problem is most likely a support threshold. When you set a very low value like here (I wouldn't call one-in-a-million frequent) you basically throw away all the benefits of downward-closure property.

It means that number of itemsets consider is growing exponentially and in the worst case scenario it will be equal to 2N - 1m where N is a number of items. Unless you have a toy data with a very small number of items it is simply not feasible.

Edit:

Note that with ~200K transactions (information taken from the comments) and support threshold 1e-6 every itemset in your data has to be frequent. So basically what you're trying to do here is to enumerate all observed itemsets.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...