I have a Spark DataFrame that I want to split into train, validation, and test sets in the ratio 0.60 / 0.20 / 0.20.
I used the following code for this:
def data_split(x):
    global data_map_var
    d_map = data_map_var.value
    data_row = x.asDict()
    import random
    rand = random.uniform(0.0, 1.0)
    ret_list = ()
    if rand <= 0.6:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'train')
    elif rand <= 0.8:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'test')
    else:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'validation')
    return ret_list
split_sdf = ratings_sdf.map(data_split)
train_sdf = split_sdf.filter(lambda x : x[-1] == 'train').map(lambda x :(x[0],x[1],x[2]))
test_sdf = split_sdf.filter(lambda x : x[-1] == 'test').map(lambda x :(x[0],x[1],x[2]))
validation_sdf = split_sdf.filter(lambda x : x[-1] == 'validation').map(lambda x :(x[0],x[1],x[2]))
print "Total Records in Original Ratings RDD is {}".format(split_sdf.count())
print "Total Records in training data RDD is {}".format(train_sdf.count())
print "Total Records in validation data RDD is {}".format(validation_sdf.count())
print "Total Records in test data RDD is {}".format(test_sdf.count())
#help(ratings_sdf)
This prints:

Total Records in Original Ratings RDD is 300001
Total Records in training data RDD is 180321
Total Records in validation data RDD is 59763
Total Records in test data RDD is 59837
My original DataFrame is ratings_sdf, which I pass through the mapper function that does the splitting.
As you can see, the counts of train, validation, and test do not sum to the count of split_sdf (the original ratings), and the numbers change on every run of the code.
Where are the remaining records going, and why doesn't the sum add up?
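To show what I mean, here is a standalone plain-Python sketch (made-up data, not the actual Spark job) that mimics the mapper's random bucketing. A single pass over the records always sums to the total, but three independent counting passes, each drawing fresh random numbers, generally do not:

```python
import random

def assign_bucket(rand):
    # Same thresholds as data_split: 60% train, 20% test, 20% validation
    if rand <= 0.6:
        return 'train'
    elif rand <= 0.8:
        return 'test'
    else:
        return 'validation'

n = 300001  # same size as the ratings RDD

# One pass: every record gets exactly one bucket,
# so the three counts always sum to n.
labels = [assign_bucket(random.uniform(0.0, 1.0)) for _ in range(n)]
one_pass = dict((b, labels.count(b)) for b in ('train', 'test', 'validation'))

# Three independent passes, each drawing fresh random numbers
# for every record, then counting one bucket per pass:
def count_fresh(bucket):
    return sum(1 for _ in range(n)
               if assign_bucket(random.uniform(0.0, 1.0)) == bucket)

three_passes = sum(count_fresh(b) for b in ('train', 'test', 'validation'))
# three_passes is almost never exactly n, and it varies from run to run
```

In the single pass the counts partition the data; in the three-pass version each count is taken from a different random labelling, so nothing forces them to partition anything.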