Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
785 views
in Technique[技术] by (71.8m points)

scala - How to do custom operations on GroupedData in Spark?

I want to rewrite some of my code written with RDDs to use DataFrames. It was working quite smoothly until I found this:

 events
  .keyBy(row => (row.getServiceId + row.getClientCreateTimestamp + row.getClientId, row) )
  .reduceByKey((e1, e2) => if(e1.getClientSendTimestamp <= e2.getClientSendTimestamp) e1 else e2)
  .values

it is simple to start with

 events
  .groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))

but what's next? What if I'd like to iterate over every element in the current group? Is it even possible? Thanks in advance.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

GroupedData cannot be used directly. Data is not physically grouped and it is just a logical operation. You have to apply some variant of agg method for example:

events
 .groupBy($"service_id", $"client_create_timestamp", $"client_id")
 .min("client_send_timestamp")

or

events
 .groupBy($"service_id", $"client_create_timestamp", $"client_id")
 .agg(min($"client_send_timestamp"))

where client_send_timestamp is a column you want to aggregate.

If you want to keep information than aggregate just join or use Window functions - see Find maximum row per group in Spark DataFrame

Spark also supports User Defined Aggregate Functions - see How to define and use a User-Defined Aggregate Function in Spark SQL?

Spark 2.0+

You could use Dataset.groupByKey which exposes groups as an iterator.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...