Environment:
Spark version: 2.3.0
Run Mode: Local
Java version: Java 8
The spark application trys to do the following
1) Convert input data into a Dataset[GenericRecord]
2) Group by the key propery of the GenericRecord
3) Using mapGroups after group to iterate the value list and get some result in String format
4) Output the result as String in text file.
The error happens when writing to text file. Spark deduced that the Dataset generated in step 3 has a binary column, not a String column. But actually it returns a String in the mapGroups function.
Is there a way to do the column data type convertion or let Spark knows that it is actually a string column not binary?
val dslSourcePath = args(0)
val filePath = args(1)
val targetPath = args(2)
val df = spark.read.textFile(filePath)
implicit def kryoEncoder[A](implicit ct: ClassTag[A]): Encoder[A] = Encoders.kryo[A](ct)
val mapResult = df.flatMap(abc => {
JavaConversions.asScalaBuffer(some how return a list of Avro GenericRecord using a java library).seq;
})
val groupResult = mapResult.groupByKey(result => String.valueOf(result.get("key")))
.mapGroups((key, valueList) => {
val result = StringBuilder.newBuilder.append(key).append(",").append(valueList.count(_=>true))
result.toString()
})
groupResult.printSchema()
groupResult.write.text(targetPath + "-result-" + System.currentTimeMillis())
And the output said it is a bin
root
|-- value: binary (nullable = true)
Spark gives out an error that it can't write binary as text:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Text data source supports only a string column, but you have binary.;
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.verifySchema(TextFileFormat.scala:55)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.prepareWrite(TextFileFormat.scala:78)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:595)
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…