Using Spark 1.4.0, Scala 2.10
I've been trying to figure out a way to forward fill null values with the last known observation, but I don't see an easy way. I would think this is a pretty common thing to do, but can't find an example showing how to do this.
I see functions to forward fill NaN with a value, or lag / lead functions to fill or shift data by an offset, but nothing to pick up the last known value.
Looking online, I see lots of Q/A about the same thing in R, but not in Spark / Scala.
I was thinking about mapping over a date range, filter the NaNs out of the results and pick the last element but I guess I'm confused about the syntax.
Using DataFrames I try something like
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
var spec = Window.orderBy("Date")
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("test.csv")
val df2 = df.withColumn("testForwardFill", (90 to 0).map(i=>lag(df.col("myValue"),i,0).over(spec)).filter(p=>p.getItem.isNotNull).last)
but that doesn't get me anywhere.
The filter part doesn't work; the map function returns a Sequence of spark.sql.Columns, but the filter function expects to return a Boolean, so I need to get a value out of the Column to test on but there only seem to be Column methods that return a Column.
Is there any way to do this more 'simply' on Spark?
Thanks for your input
Edit:
Simple example sample input:
2015-06-01,33
2015-06-02,
2015-06-03,
2015-06-04,
2015-06-05,22
2015-06-06,
2015-06-07,
...
Expected output:
2015-06-01,33
2015-06-02,33
2015-06-03,33
2015-06-04,33
2015-06-05,22
2015-06-06,22
2015-06-07,22
Note:
- I have many columns, many of which have this missing data pattern, but not at the same date/time. If I need to I will do the transform one column at a time.
EDIT:
Following @zero323 's answer I tried this way:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
val rows: RDD[Row] = df.orderBy($"Date").rdd
def notMissing(row: Row): Boolean = { !row.isNullAt(1) }
val toCarry: scala.collection.Map[Int,Option[org.apache.spark.sql.Row]] = rows.mapPartitionsWithIndex{
case (i, iter) => Iterator((i, iter.filter(notMissing(_)).toSeq.lastOption)) }
.collectAsMap
val toCarryBd = sc.broadcast(toCarry)
def fill(i: Int, iter: Iterator[Row]): Iterator[Row] = { if (iter.contains(null)) iter.map(row => Row(toCarryBd.value(i).get(1))) else iter }
val imputed: RDD[Row] = rows.mapPartitionsWithIndex{ case (i, iter) => fill(i, iter)}
the broadcast variable ends up as a list of values without nulls. That's progress but I still can't get the mapping to work.
but i get nothing, because the index i
in the doesn't map to the original data, it maps to the subset without null.
What am I missing here?
EDIT and solution (as infered from @zero323 's answer):
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
var spec = Window.partitionBy("id").orderBy("Date")
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("test.csv")
val df2 = df.withColumn("test", coalesce((0 to 90).map(i=>lag(df.col("test"),i,0).over(spec)): _*))
See zero323's answer below for more options if you're using RDDs instead of DataFrames. The solution above may not be the most efficient but works for me. If you're looking to optimize, check out the RDD solution.
See Question&Answers more detail:
os