I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found that using the `jsonlite` package's `stream_in`/`stream_out` functions is really helpful. With these functions, I can subset the data in chunks without loading the whole file first, write the subset data to a new, smaller JSON file, and then load that file as a `data.frame`. However, this intermediary JSON file is getting truncated (if that's the right term) while being written with `stream_out`. I will now attempt to explain in further detail.
What I'm attempting:
I have written my code like this (following an example from the documentation):

```r
library(jsonlite)

# Open a writable connection to a temporary file for the subset data
con_out <- file(tmp <- tempfile(), open = "wb")

# Stream the original file in chunks of 5000 lines; subset each chunk
# and stream it out to the temporary file
stream_in(file("C:/User/myFile.json"), handler = function(df) {
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)

# Read the smaller subset file back in as a data.frame
myData <- stream_in(file(tmp))
```
As you can see, I open a connection to a temporary file, read my original JSON file with `stream_in`, and have the `handler` function subset each chunk of data and write it to the connection.
The problem
This procedure runs without any problems until I try to read the result with `myData <- stream_in(file(tmp))`, upon which I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:
{"Var1":"some data","Var2":3,"Var3":"some othe
I then have to manually remove that last line, after which the file loads without issue.
Solutions I've tried
I've tried reading the documentation thoroughly and looking at the `stream_out` function, and I can't figure out what may be causing this issue. The only slight clue I have is that `stream_out` automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing?
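Based on that hunch, the one thing I've sketched out (but have not yet confirmed does anything) is closing the connection myself before reading the temporary file back in:

```r
# Sketch only: explicitly close the output connection so any buffered
# output is flushed to disk before the file is read back. Untested
# against the truncation issue.
close(con_out)
myData <- stream_in(file(tmp))
```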
I inserted a `print()` call inside the `handler` function to show the `tail()` end of the `data.frame` at every chunk, to rule out problems with the intermediary `data.frame`. The `data.frame` is produced flawlessly at every interval, and I can see that the final two or three rows are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire output, after `stream_out` has written every chunk, that is getting chopped, not the end of any intermediate chunk.
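For completeness, the debugging version of the handler looked roughly like this (same connection and subset as in the code above):

```r
# Same pipeline as before, with a print() added to inspect the end of
# every chunk before it is written out
stream_in(file("C:/User/myFile.json"), handler = function(df) {
  df <- df[which(df$Var > 0), ]
  print(tail(df))  # the last rows of every chunk look complete here
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
```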
I've tried playing around with the `pagesize` arguments, including trying very large numbers, omitting them, and `Inf`. Nothing has worked.
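Roughly, the combinations looked like this (illustrative values; `df` and `con_out` as in the code above):

```r
stream_out(df, con_out, pagesize = 100000)  # very large page size
stream_out(df, con_out)                     # no pagesize (the default)
stream_out(df, con_out, pagesize = Inf)     # everything in one page
```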
I can't use `jsonlite`'s other functions like `fromJSON` because the original JSON file is too large to read without streaming, and it is actually in minified(?)/ndjson format.
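To be clear, by ndjson I mean one complete JSON object per line, along these lines (made-up records):

```
{"Var":1,"Var2":"a","Var3":"x"}
{"Var":-2,"Var2":"b","Var3":"y"}
```

Since the file as a whole isn't a single valid JSON document, `fromJSON` can't parse it directly.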
System info
I'm running R 3.3.3 x64 on Windows 7 x64, with 6 GB of RAM and an AMD Athlon II 4-core 2.6 GHz CPU.
Treatment
I can still deal with this issue by manually opening the JSON files and correcting them, but it leads to some data loss and prevents my script from being automated, which is an inconvenience since I have to run it repeatedly throughout my project.
I really appreciate any help with this; thank you.
See Question&Answers more detail:
os