scala - Calculate Cosine Similarity Spark Dataframe

Question

Welcome To Ask or Share your Answers For Others

scala - Calculate Cosine Similarity Spark Dataframe

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

scala - Calculate Cosine Similarity Spark Dataframe

I am using Spark Scala to calculate cosine similarity between the Dataframe rows.

Dataframe format is below

root
    |-- SKU: double (nullable = true)
    |-- Features: vector (nullable = true)

Sample of the dataframe below

    +-------+--------------------+
    |    SKU|            Features|
    +-------+--------------------+
    | 9970.0|[4.7143,0.0,5.785...|
    |19676.0|[5.5,0.0,6.4286,4...|
    | 3296.0|[4.7143,1.4286,6....|
    |13658.0|[6.2857,0.7143,4....|
    |    1.0|[4.2308,0.7692,5....|
    |  513.0|[3.0,0.0,4.9091,5...|
    | 3753.0|[5.9231,0.0,4.846...|
    |14967.0|[4.5833,0.8333,5....|
    | 2803.0|[4.2308,0.0,4.846...|
    |11879.0|[3.1429,0.0,4.5,4...|
    +-------+--------------------+

I tried to transpose the matrix and check the following mentioned links.Apache Spark Python Cosine Similarity over DataFrames, calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf But I believe there is a better solution

I am tried the below sample code

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  case (v,i:Vector) => IndexedRow(v, i)


}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities

But I got the below error

Error:(80, 12) constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: org.apache.spark.sql.Row
      case (v,i:Vector) => IndexedRow(v, i)

I checked the following Link Apache Spark: How to create a matrix from a DataFrame? But can't do it using Scala

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:48:36+0000

DataFrame.rdd returns RDD[Row] not RDD[(T, U)]. You have to pattern match the Row or directly extract interesting parts.
ml Vector used with Datasets since Spark 2.0 is not the same as mllib Vector use by old API. You have to convert it to use with IndexedRowMatrix.
Index has to be Long not string.

import org.apache.spark.sql.Row

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  Row(_, v: org.apache.spark.ml.linalg.Vector) => 
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })

Categories

scala - Calculate Cosine Similarity Spark Dataframe

scala - Calculate Cosine Similarity Spark Dataframe

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags