I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates
method to work. It keeps returning the error:
"AttributeError: 'list' object has no attribute 'dropDuplicates'"
Not quite sure why as I seem to be following the syntax in the latest documentation.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()
#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()
#dropping duplicates from the dataframe
df1.dropDuplicates().show()
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…