I am running the following PySpark code, which is supposed to build a list of RDDs, where the i-th RDD adds i to every row of the input.
from pyspark.context import SparkContext

file = "file:///home/sree/code/scrap/sample.txt"
sc = SparkContext('local', 'TestApp')
data = sc.textFile(file)

# build one RDD per value of i, each meant to add i to every row
splits = [data.map(lambda p: int(p) + i) for i in range(4)]

print(splits[0].collect())
print(splits[1].collect())
print(splits[2].collect())
The content of the input file (sample.txt) is:
1
2
3
I was expecting output like this (the rows of the RDD incremented by 0, 1, and 2 respectively):
[1, 2, 3]
[2, 3, 4]
[3, 4, 5]
whereas the actual output was:
[4, 5, 6]
[4, 5, 6]
[4, 5, 6]
which means that every lambda in the comprehension ended up using i = 3, the last value produced by range(4).
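In fact, I can reproduce the same behavior with plain Python, no Spark involved. This is a minimal sketch (the names adders and f are mine, just for illustration):

# each lambda looks up i when it is *called*, not when it is defined
adders = [lambda p: int(p) + i for i in range(4)]
print([f("1") for f in adders])  # prints [4, 4, 4, 4], not [1, 2, 3, 4]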
Why does this behavior happen?
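For what it's worth, binding i at definition time with a default argument (a workaround I have seen suggested for closures created in loops) seems to give the output I expected, though I have only tried it in local mode:

# i=i freezes the current value of i into each lambda's default argument
splits = [data.map(lambda p, i=i: int(p) + i) for i in range(4)]
print(splits[0].collect())  # [1, 2, 3]
print(splits[1].collect())  # [2, 3, 4]
print(splits[2].collect())  # [3, 4, 5]

So the question is really: why does the original version capture only the final value of i?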