Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Converting Fields to Ints, Doubles, etc. in Scala in Spark Shell RDD

I have an assignment where I need to load a csv dataset in a spark-shell using spark.read.csv(), and accomplish the following:

  1. Convert the dataset to RDD
  2. Remove the heading (first record (line) in the dataset)
  3. Convert the first two fields to integers
  4. Convert the other fields except the last one to doubles. Question marks should become NaN. The last field should be converted to a Boolean.

I was able to do steps 1 and 2 with the following code:

//load the dataset as an RDD

val dataRDD = spark.read.csv("block_1.csv").rdd //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:23

dataRDD.count() //output 574914

//import Row since RDD is of Row

import org.apache.spark.sql.Row 

//function to recognize if a string contains "id_1"

def isHeader(r : Row) = r.toString.contains("id_1") 

//filter will apply !isHeader to every line in dataRDD; the result forms another RDD

val nohead = dataRDD.filter(x => !isHeader(x))

nohead.count() //output is now 574913

nohead.first //output is [37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE]

nohead //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[15] at filter at <console>:28

I'm trying to convert the fields, but every time I use a function like toDouble I get an error saying it is not a member of Row:

:25: error: value toDouble is not a member of org.apache.spark.sql.Row

if ("?".equals(s)) Double.NaN else s.toDouble

I'm not sure what I'm doing wrong. I've looked at the Row API docs at https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html#anyNull() but I still can't see the problem. I don't know how to convert a field when there is no toDouble, toInt, or toBoolean method on Row.

Can someone please point me in the right direction, or suggest where I could look for an answer? I need to convert the first two fields to integers and the remaining fields, except the last one, to doubles; question marks should become NaN, and the last field should be converted to a Boolean.
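For what it's worth, the error happens because toDouble is a method on String, not on Row: the Row is the container, and its fields must be extracted (e.g. as strings) before they can be converted. A small plain-Scala analogue, using Seq[Any] to stand in for the untyped Row (no Spark needed to see the principle):

```scala
// A Row-like container holds untyped fields.
val fields: Seq[Any] = Seq("37291", "53113", "0.8333", "?")

// fields.toDouble            // does not compile: the container has no toDouble
val d = fields(2).toString.toDouble  // extract the field first, then convert
```

The same idea applies to a real Row: pull the field out (Row.getString(i), or mkString to flatten the whole row) before calling toInt / toDouble / toBoolean.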

question from:https://stackoverflow.com/questions/65926834/converting-fields-to-ints-doubles-ect-in-scala-in-spark-shell-rdd


1 Reply

  3. Convert the first two fields to integers
  4. Convert the other fields except the last one to doubles. Question marks should be NaN. The last field should be converted to a Boolean.

You can do steps 3 and 4 at once with a parse function. First define the toDouble helper, since parse uses it:

def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
}

def parse(line: String) = {
    val pieces = line.split(',')  
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val scores = pieces.slice(2, 11).map(toDouble)  
    val matched = pieces(11).toBoolean
    (id1, id2, scores, matched)
}
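To see what parse produces, you can try it on the first data record from the question (plain Scala, no Spark session required, since parse operates on an ordinary String):

```scala
// Same helpers as above, repeated so this snippet is self-contained.
def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
}

def parse(line: String) = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val scores = pieces.slice(2, 11).map(toDouble)  // fields 2..10 as doubles
    val matched = pieces(11).toBoolean              // "TRUE"/"FALSE" parse case-insensitively
    (id1, id2, scores, matched)
}

// First data record from the question:
val sample = "37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE"
val (id1, id2, scores, matched) = parse(sample)
// id1 = 37291, id2 = 53113, scores has 9 doubles ("?" -> NaN), matched = true
```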

After that, you can call parse on each row in your RDD using map; however, you still have the type issue. You could convert nohead from an RDD[Row] to an RDD[String], but it's probably easier to convert each row to a string as you pass it:

val parsed = nohead.map(line => parse(line.mkString(",")))

This gives parsed the type RDD[(Int, Int, Array[Double], Boolean)].
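One caveat once the question marks become NaN: NaN never compares equal to anything, including itself, so equality checks on those fields silently fail. Use isNaN when filtering or counting missing scores, as in this small plain-Scala illustration:

```scala
// NaN != NaN by IEEE 754 rules, so == cannot detect missing values.
val scores = Array(0.5, Double.NaN, 1.0)
val missing = scores.count(_.isNaN)    // counts the NaN entries
val present = scores.count(!_.isNaN)   // counts the real values
```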

