I'm using Spark 3.0.x with Java, and we have a transformation requirement that goes as follows.
A sample (hypothetical) DataFrame:
+==============+===========+
| UserCategory | Location  |
+==============+===========+
| Premium      | Mumbai    |
+--------------+-----------+
| Premium      | Bangalore |
+--------------+-----------+
| Gold         | Kolkata   |
+--------------+-----------+
| Premium      | Moscow    |
+--------------+-----------+
| Silver       | Tokyo     |
+--------------+-----------+
| Gold         | Sydney    |
+--------------+-----------+
| Bronze       | Dubai     |
+--------------+-----------+
| ...          | ...       |
+--------------+-----------+
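For reference, the sample frame can be stubbed out in Java along these lines (a minimal sketch; the schema is taken from the table above and the app name is just a placeholder):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    SparkSession spark = SparkSession.builder()
            .appName("coupon-example")  // placeholder name
            .getOrCreate();

    // Schema as per the sample table: two string columns.
    StructType schema = new StructType()
            .add("UserCategory", DataTypes.StringType)
            .add("Location", DataTypes.StringType);

    List<Row> rows = Arrays.asList(
            RowFactory.create("Premium", "Mumbai"),
            RowFactory.create("Premium", "Bangalore"),
            RowFactory.create("Gold",    "Kolkata"),
            RowFactory.create("Premium", "Moscow"),
            RowFactory.create("Silver",  "Tokyo"),
            RowFactory.create("Gold",    "Sydney"),
            RowFactory.create("Bronze",  "Dubai"));

    Dataset<Row> df = spark.createDataFrame(rows, schema);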
Let's say we have a UserCategory column on which we need to perform a groupBy(). Then, on each row of the grouped data, we need to apply a certain function. That function depends on both columns of the DataFrame: e.g., we might generate a CouponCode based on how many users there are in a specific UserCategory and also on each user's Location. The point is, there has to be a groupBy() before the row-by-row rule comes in. The expected output might be something like the following:
+==============+===========+============+
| UserCategory | Location  | CouponCode |
+==============+===========+============+
| Premium      | Mumbai    | IN01P      |
+--------------+-----------+------------+
| Premium      | Bangalore | IN02P      |
+--------------+-----------+------------+
| Premium      | Moscow    | RU03P      |
+--------------+-----------+------------+
| Gold         | Kolkata   | IN01G      |
+--------------+-----------+------------+
| Gold         | Sydney    | AU01G      |
+--------------+-----------+------------+
| Silver       | Tokyo     | JP01S      |
+--------------+-----------+------------+
| Bronze       | Dubai     | AE01B      |
+--------------+-----------+------------+
| ...          | ...       | ...        |
+--------------+-----------+------------+
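For what it's worth, the purely aggregate part of such a rule (group size, sequence number within the group) is reachable with window functions; here is a minimal sketch using the columns above. The part that doesn't fit a plain column expression is the coupon construction itself, which is why I'm asking:

    import static org.apache.spark.sql.functions.*;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    // Unordered window for group-level aggregates, ordered one for sequencing.
    WindowSpec perCategory = Window.partitionBy("UserCategory");
    WindowSpec perCategoryOrdered = perCategory.orderBy("Location");

    Dataset<Row> withStats = df
            // Number of users in this row's UserCategory.
            .withColumn("groupSize", count(lit(1)).over(perCategory))
            // 1-based position of the row within its UserCategory.
            .withColumn("seqInGroup", row_number().over(perCategoryOrdered));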
Now, if we were using PySpark, I could have leveraged a (grouped-map) pandas UDF to achieve this. But our stack is built on Java, so that's not an option. I was wondering how the same could be achieved in Java.
I was looking at the javadoc here, and it seems that groupBy() on a DataFrame returns a RelationalGroupedDataset. I then browsed through the methods available on the RelationalGroupedDataset class, and the only one that looks promising is apply(). Unfortunately, it doesn't have any documentation, and the method signature looks a bit intimidating:
public static RelationalGroupedDataset apply(
        Dataset<Row> df,
        scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs,
        RelationalGroupedDataset.GroupType groupType)
So, the question is: is that the way to go? If so, a dummy example of the apply() method would be very much appreciated. Otherwise, if you could point me towards a different approach that achieves the same, that would also be fine.
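To make the ask concrete: the closest analogue I can see in the typed API is groupByKey() followed by flatMapGroups(), which hands each whole group to a function, much like a pandas grouped-map UDF. Below is a rough sketch of the shape I mean; the coupon rule here is a dummy stand-in (zero-padded sequence number plus the category's first letter), not our real logic:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.function.FlatMapGroupsFunction;
    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.catalyst.encoders.RowEncoder;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    StructType outSchema = new StructType()
            .add("UserCategory", DataTypes.StringType)
            .add("Location", DataTypes.StringType)
            .add("CouponCode", DataTypes.StringType);

    Dataset<Row> withCoupons = df
            // Key each row by its UserCategory.
            .groupByKey(
                    (MapFunction<Row, String>) r -> r.getString(r.fieldIndex("UserCategory")),
                    Encoders.STRING())
            // Receive one whole group at a time, like a grouped-map pandas UDF.
            .flatMapGroups(
                    (FlatMapGroupsFunction<String, Row, Row>) (category, rowsInGroup) -> {
                        List<Row> out = new ArrayList<>();
                        int seq = 0;
                        while (rowsInGroup.hasNext()) {
                            Row r = rowsInGroup.next();
                            String location = r.getString(r.fieldIndex("Location"));
                            seq++;
                            // Dummy rule: zero-padded sequence + category initial.
                            // The real rule would also derive a country code from location.
                            String coupon = String.format("%02d%s", seq, category.substring(0, 1));
                            out.add(RowFactory.create(category, location, coupon));
                        }
                        return out.iterator();
                    },
                    RowEncoder.apply(outSchema));

If that flatMapGroups() route is the idiomatic one for this, confirmation would help too; mainly I'd like to know whether apply() is intended for this at all.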