scala - What is RDD in spark

Question

Welcome To Ask or Share your Answers For Others

scala - What is RDD in spark

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

scala - What is RDD in spark

Definition says:

RDD is immutable distributed collection of objects

I don't quite understand what does it mean. Is it like data (partitioned objects) stored on hard disk If so then how come RDD's can have user-defined classes (Such as java, scala or python)

From this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch03.html It mentions:

Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program

I am really confused understanding RDD in general and in relation to spark and hadoop.

Can some one please help.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:43:17+0000

An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any datasource, e.g. text files, a database via JDBC, etc.

The formal definition is:

RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

If you want the full details on what an RDD is, read one of the core Spark academic papers, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Categories

scala - What is RDD in spark

scala - What is RDD in spark

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags