memory management - Reading 40 GB csv file into R using bigmemory

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

The title is pretty self-explanatory, but I will elaborate. Some of my current techniques for attacking this problem are based on the solutions presented in this question. However, I am facing several challenges and constraints, so I was wondering if someone might take a stab at it. I am trying to solve the problem with the bigmemory package, but I have been running into difficulties.

Present Constraints:

Linux server with 16 GB of RAM
CSV size: 40 GB
Number of rows: 67,194,126,114

Challenges

Need to be able to randomly sample smaller datasets (5-10 Million rows) from a big.matrix or equivalent data structure.
Need to be able to remove any row with a single instance of NULL while parsing into a big.matrix or equivalent data structure.

So far, the results are not good. Evidently I am failing at something, or maybe I just don't understand the bigmemory documentation well enough. So, I thought I would ask here to see if anyone has used

Any tips or advice on this line of attack? Or should I change to something else? I apologize if this question is very similar to the previous one, but my data is about 20 times bigger than in the previous questions. Thanks!

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:06:55+0000

I don't know about bigmemory, but to satisfy your challenges you don't need to read the file in. Simply pipe some bash/awk/sed/python/whatever processing to do the steps you want, i.e. throw out NULL lines and randomly select N lines, and then read that in.

Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).

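# awk does selection sampling: keep up to m of the file's n lines, skipping lines that contain "NULL".
# The counter is named n because length is a built-in awk function and cannot be assigned to.
read.csv(pipe("awk -F, 'BEGIN{srand(); m = 100; n = 1000000;}
                       !/NULL/{if (rand() < m/(n - NR + 1)) {
                                 print; m--;
                                 if (m == 0) exit;
                              }}' filename"
        )) -> df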

It wasn't obvious to me what you meant by NULL, so I took it literally (any line containing the string NULL is dropped), but it should be easy to modify this to fit your needs.
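As one possible modification, here is a rough sketch of a two-pass variant. It rests on assumptions that are not stated in the question: the file has a one-line header, fields contain no quoted commas, and a "NULL" row means a row in which some field is exactly the string NULL. The function name sample_csv is made up for illustration. The first pass counts the rows that survive the filter, so the second pass can sample from that count rather than from the total line count.

```shell
# sample_csv FILE M: print the header plus M randomly chosen data rows,
# skipping every row in which any field is exactly NULL.
# Hypothetical helper; assumes a one-line header and no quoted commas.
sample_csv() {
    file=$1
    m=$2
    # Pass 1: count the data rows that survive the NULL filter.
    n=$(awk -F, 'NR > 1 { for (i = 1; i <= NF; i++) if ($i == "NULL") next; c++ }
                 END { print c + 0 }' "$file")
    # Pass 2: selection sampling over the surviving rows only.
    # srand() seeds from the clock, so two runs within the same second
    # may return the same sample.
    awk -F, -v m="$m" -v n="$n" '
        BEGIN { srand() }
        NR == 1 { print; next }                              # always keep the header
        { for (i = 1; i <= NF; i++) if ($i == "NULL") next } # drop rows with a NULL field
        { seen++
          if (rand() * (n - seen + 1) < m) { print; m-- }
          if (m == 0) exit }
    ' "$file"
}
```

Reading the file twice doubles the I/O on a 40 GB input, but neither pass holds more than one line in memory, and when M is no larger than the number of surviving rows the output has exactly M data rows.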
