My Spark application uses RDDs of numpy arrays.
At the moment I'm reading my data from AWS S3, and it's represented as
a simple text file where each line is a vector and the elements are separated by spaces, for example:
1 2 3
5.1 3.6 2.1
3 0.24 1.333
I'm using numpy's loadtxt() function
to create a numpy array from it.
However, this method seems to be very slow, and my app is spending too much time (I think) converting the dataset to numpy arrays.
Can you suggest a better way of doing this? For example, should I keep my dataset as a binary file,
or should I create the RDD in another way?
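To make the binary-file idea concrete, here is roughly what I have in mind, assuming the vectors were pre-saved with np.save as one .npy file per chunk (the S3 prefix and function name below are just placeholders):

import io
import numpy as np

def readNpyFile(pair):
    # sc.binaryFiles yields (filename, raw_bytes) pairs;
    # np.load can read the .npy bytes through a BytesIO wrapper
    _, content = pair
    return np.load(io.BytesIO(content))

data = sc.binaryFiles("s3_npy_prefix_url").map(readNpyFile)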
Here is the code for how I currently create my RDD:

data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(readPointBatch)

The readPointBatch function:

import numpy as np

def readPointBatch(iterator):
    # np.loadtxt accepts any iterable of lines, so it reads the whole partition into one 2-D array
    return [np.loadtxt(iterator, dtype=np.float64)]
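As for creating the RDD in another way, one variation I've been considering is to skip np.loadtxt and parse each line directly; a minimal sketch, assuming every line has the same number of elements:

import numpy as np

def readPointBatchFast(iterator):
    # same contract as readPointBatch above, but split the lines
    # manually and let numpy convert the string tokens to float64
    return [np.array([line.split() for line in iterator], dtype=np.float64)]

data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(readPointBatchFast)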