I am getting heap space errors even on fairly small datasets, and I am confident that I'm not running out of system memory. For example, consider a dataset of about 20M rows and 9 columns that takes up 1GB on disk. I am playing with it on a Google Compute Engine node with 30GB of memory.
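Since the exact contents probably don't matter, here is a hedged sketch of how a dataframe of roughly that shape could be simulated; the column names (my_key, x1 through x8), the value types, and the key cardinality are invented purely for illustration:

library(tibble)

# Roughly 20M rows and 9 columns: one (assumed) grouping key plus 8 numeric columns
n <- 2e7
df <- tibble(
  my_key = sample(1e6, n, replace = TRUE),
  x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n),
  x5 = runif(n), x6 = runif(n), x7 = runif(n), x8 = runif(n)
)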
Let's say that I have this data in a dataframe called df. The following works fine, albeit somewhat slowly:
library(tidyverse)

# Deduplicate on the key with plain dplyr, entirely in local R memory
uniques <- df %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup()
The following throws java.lang.OutOfMemoryError: Java heap space.
library(tidyverse)
library(sparklyr)

sc <- spark_connect(master = "local")
df_tbl <- copy_to(sc, df)

# Same deduplication, pushed to Spark; collect() pulls the result back into R
unique_spark <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
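To narrow down whether the failure happens during the aggregation itself or only when collect() pulls the result into the R session, here is a hedged check that keeps everything on the Spark side; distinct(my_key) is assumed to be equivalent to the group_by/summarise above, and sdf_nrow() counts rows without collecting them:

# Hedged diagnostic: count the distinct keys without ever collecting them,
# so any failure here happens inside Spark rather than during collect()
df_tbl %>%
  distinct(my_key) %>%
  sdf_nrow()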
I tried this suggestion for increasing the heap space available to Spark. The problem persists. Watching the machine's state in htop, I see that total memory usage never goes over about 10GB.
library(tidyverse)
library(sparklyr)

# Raise the driver heap to 24G via extra JVM options and pass the config in
config <- spark_config()
config[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"

sc <- spark_connect(master = "local", config = config)
df_tbl <- copy_to(sc, df)

unique_spark <- df_tbl %>%
  group_by(my_key) %>%
  summarise() %>%
  ungroup() %>%
  collect()
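One hedged way to verify whether that extraJavaOptions setting actually reached the driver JVM is to ask the driver's Runtime for its maximum heap via sparklyr's invoke_static() and invoke(); the value comes back in bytes:

# Hedged check: report the driver JVM's actual max heap, in bytes,
# to see whether the MaxHeapSize option took effect
invoke_static(sc, "java.lang.Runtime", "getRuntime") %>%
  invoke("maxMemory")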
Finally, per Sandeep's comment, I tried lowering MaxHeapSize to 4G. (Is MaxHeapSize per virtual worker or for the entire Spark local instance?) I still got the heap space error, and again, I did not use much of the system's memory.
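For reference, sparklyr also exposes the driver memory through its own config key, which maps to spark-submit's --driver-memory; a sketch of that alternative (the 8G value here is just an example, not a recommendation) looks like this:

# Hedged alternative: set the driver memory via sparklyr's own config key
# rather than extraJavaOptions; 8G is an arbitrary illustration value
config <- spark_config()
config[["sparklyr.shell.driver-memory"]] <- "8G"
sc <- spark_connect(master = "local", config = config)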