This is not really the right approach for this task - you shouldn't issue separate queries from inside a UDF, because Spark won't be able to parallelize or optimize them.
The better way is to do a join between your streaming dataframe and the appleMart dataframe - this allows Spark to optimize all operations. As I understand from your code, you just need to check that an apple with the given ID exists. In that case you can do an inner join - it will keep only the IDs for which there are matching rows in appleMart, something like this:
```scala
// read the lookup table once as a static dataframe
val appleMart = spark.read.format("delta").load("path_to_delta")

// inner join: keeps only the streaming rows whose appleId exists in appleMart
val newApple = apples.join(appleMart, apples("appleId") === appleMart("appleId"))
```
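Note that with an explicit join condition like this, both dataframes' appleId columns end up in the result; joining on Seq("appleId") instead would deduplicate the join key.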
If for some reason you need to keep the apples entries that don't exist in appleMart, you can use a left join instead, as in the sketch below...
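A minimal sketch of that variant, reusing the same apples and appleMart dataframes as above (allApples is just an illustrative name):

```scala
// left join: keeps every streaming row; the appleMart columns come back
// as null when there is no matching appleId
val allApples = apples.join(appleMart, apples("appleId") === appleMart("appleId"), "left")
```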
P.S. If appleMart doesn't change very often, you can cache it (see the sketch below). Although for streaming jobs, something like Cassandra could be a better fit for lookup tables from a performance standpoint.
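As a rough sketch of the caching idea, assuming the lookup table is small enough to fit in executor memory (the broadcast hint is standard Spark, but whether it pays off depends on the table size):

```scala
import org.apache.spark.sql.functions.broadcast

// cache the static lookup table so it isn't re-read from Delta on every micro-batch
val appleMart = spark.read.format("delta").load("path_to_delta").cache()

// optionally hint Spark to broadcast the small table to all executors,
// turning the join into a map-side broadcast hash join
val newApple = apples.join(broadcast(appleMart), apples("appleId") === appleMart("appleId"))
```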