Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
824 views
in Technique[技术] by (71.8m points)

scala - NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities while reading s3 Data with spark

I would like to run a simple spark job on my local dev machine (through Intellij) reading data from Amazon s3.

my build.sbt file:

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql" % "2.3.1",
  "com.amazonaws" % "aws-java-sdk" % "1.11.407",
  "org.apache.hadoop" % "hadoop-aws" % "3.1.1"
)

my code snippet:

val spark = SparkSession
    .builder
    .appName("test")
    .master("local[2]")
    .getOrCreate()

  spark
    .sparkContext
    .hadoopConfiguration
    .set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")

  val schema_p = ...

  val df = spark
    .read
    .schema(schema_p)
    .parquet("s3a:///...")

And I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2093)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2058)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2152)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
    at Test$.delayedEndpoint$Test$1(Test.scala:27)
    at Test$delayedInit$body.apply(Test.scala:4)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at Test$.main(Test.scala:4)
    at Test.main(Test.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StreamCapabilities
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 41 more

When replacing s3a:/// to s3:/// I get another error: No FileSystem for scheme: s3

As I am new to AWS, I do not know if I should user s3:///, s3a:/// or s3n:///. I have already setup my AWS credentials with aws-cli.

I have not any Spark installation on my machine.

Thanks in advance for your help

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I would start by looking at the S3A troubleshooting docs

Do not attempt to “drop in” a newer version of the AWS SDK than that which the Hadoop version was built with Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.

whatever version of the hadoop- JARs you have on your local spark installation, you need to have exactly the same version of hadoop-aws, and exactly the same version of the aws SDK which hadoop-aws was built with. Try mvnrepository for the details.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...