How to convert a mix of text and numerical data to feature data in Apache Spark

posted Jan 31, 2022 in Technique by 深蓝 (71.8m points)

I have a CSV file containing both textual and numerical data. I need to convert it to feature vector data (Double values) in Spark. Is there any way to do that?

I have seen some examples where each keyword is mapped to a double value, and that mapping is used for the conversion. However, this is difficult to do by hand when there are many distinct keywords.

Is there another way? I see that Spark provides feature extractors that convert data into feature vectors. Could someone please give an example?

48, Private, 105808, 9th, 5, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K
42, Private, 169995, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:23:34+0000

In the end I did it this way: for each categorical column, I iterate over its distinct values and build a map keyed by value, assigning each one the next value of an incrementing Double counter.

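import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assign each item a Double code (1.0, 2.0, ...) in the order collect() returns them.
// Callers pass distinct values only, so collecting them to the driver is cheap.
def createMap(data: RDD[String]): Map[String, Double] =
  data.collect().zipWithIndex.map { case (item, i) => item -> (i + 1).toDouble }.toMap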

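// Map the salary class to a binary label; fail loudly on any other value.
def getLablelValue(input: String): Double = input match {
  case "<=50K" => 0.0
  case ">50K" => 1.0
  case other => throw new IllegalArgumentException(s"Unexpected label: $other")
}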
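
To make the counter-based mapping concrete, here is a minimal plain-Scala sketch (no Spark needed) applied to the two sample rows from the question. The helper name `indexValues` is hypothetical and only mirrors the idea of `createMap`; it is an illustration, not the code used in the answer.

```scala
// Assign 1.0, 2.0, ... to distinct values in first-seen order.
def indexValues(values: Seq[String]): Map[String, Double] =
  values.distinct.zipWithIndex.map { case (v, i) => v -> (i + 1).toDouble }.toMap

// The two sample rows from the question.
val rows = Seq(
  "48, Private, 105808, 9th, 5, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K",
  "42, Private, 169995, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K"
)
val sampleFields = rows.map(_.split(", "))

val orgTypeIndex = indexValues(sampleFields.map(_(1)))   // column 1
val gradeTypeIndex = indexValues(sampleFields.map(_(3))) // column 3

println(orgTypeIndex)   // Map(Private -> 1.0)
println(gradeTypeIndex) // Map(9th -> 1.0, Some-college -> 2.0)
```

On an RDD, `distinct` does not guarantee any particular order, so the code assigned to a given value can differ between runs. The mapping is only consistent within a single job unless it is saved and reused.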


val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd  = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct

val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)


// Build one LabeledPoint per row: label from column 14, features from columns 0 to 13.
// The label column is left out of the features; including it would leak the target.
val featureVector = census.map { line =>
  val fields = line.split(", ")
  val features = Vectors.dense(
    fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble, gradeTypeMap(fields(3)), fields(4).toDouble,
    marStatusMap(fields(5)), jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
    genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble, fields(12).toDouble, countryMap(fields(13)))
  LabeledPoint(getLablelValue(fields(14)), features)
}
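
The question also asks about Spark's built-in feature transformers. As a hedged alternative to the hand-built maps, the sketch below uses `StringIndexer` and `VectorAssembler` from the DataFrame-based `spark.ml` package. It assumes Spark 2.x or later with `spark-mllib` on the classpath. The column names are hypothetical (the file has no header), and the two rows are the samples from the question, built in memory so the snippet is self-contained. In `spark-shell` a `spark` session already exists, so the builder line can be dropped.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("census-features").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical column names, chosen for illustration only.
val columns = Seq("age", "workclass", "fnlwgt", "education", "education_num",
  "marital_status", "occupation", "relationship", "race", "sex",
  "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")

val df = Seq(
  (48.0, "Private", 105808.0, "9th", 5.0, "Widowed", "Transport-moving", "Unmarried",
    "White", "Male", 0.0, 0.0, 40.0, "United-States", ">50K"),
  (42.0, "Private", 169995.0, "Some-college", 10.0, "Married-civ-spouse", "Prof-specialty", "Husband",
    "White", "Male", 0.0, 0.0, 45.0, "United-States", "<=50K")
).toDF(columns: _*)

val categorical = Seq("workclass", "education", "marital_status", "occupation",
  "relationship", "race", "sex", "native_country")
val numeric = Seq("age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week")

// One StringIndexer per text column: each distinct string becomes a Double index.
val indexers = categorical.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
}
// The label is indexed too; which class gets 0.0 depends on value frequencies.
val labelIndexer = new StringIndexer().setInputCol("income").setOutputCol("label")

// Combine the numeric columns and the indexed text columns into one vector column.
val assembler = new VectorAssembler()
  .setInputCols((numeric ++ categorical.map(_ + "_idx")).toArray)
  .setOutputCol("features")

val stages: Seq[PipelineStage] = indexers ++ Seq(labelIndexer, assembler)
val prepared = new Pipeline().setStages(stages.toArray).fit(df).transform(df)

prepared.select("label", "features").show(false)
```

Like the hand-built maps, `StringIndexer` yields arbitrary numeric codes, which some models will read as an ordering. For such models, a common follow-up is to pass the indexed columns through `OneHotEncoder` before assembling them.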
