Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

R - Memory limits in data.table: negative length vectors are not allowed

I have a data table of social media users and their followers. The original data table has the following format:

X.USERID FOLLOWERS
1081     4053807021,2476584389,4713715543, ...

So each row contains a user's ID together with a vector of that user's followers (separated by commas). In total I have 24,000 unique user IDs and 160,000,000 unique followers. I wish to convert my original table into the following format:

X.USERID          FOLLOWERS
1:     1081         4053807021
2:     1081         2476584389
3:     1081         4713715543
4:     1081          580410695
5:     1081         4827723557
6:     1081 704326016165142528

In order to get this data table I used the following line of code (assume that my original data table is called dt):

uf <- dt[,list(FOLLOWERS = unlist(strsplit(x = FOLLOWERS, split= ','))), by = X.USERID]

However when I run this code on the entire dataset I get the following error:

negative length vectors are not allowed

According to a post on Stack Overflow ("Negative number of rows in data.table after incorrect use of set"), it seems that I am bumping into the memory limits of a column in data.table. As a workaround, I ran the code in smaller blocks (10,000 at a time) and this seemed to work.
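
The blockwise workaround could look roughly like the sketch below. It is only an illustration: the toy table, the made-up follower values and the name `block_size` are assumptions, and the real run used blocks of 10,000. Each block is kept as its own table, because binding all blocks back into one data.table would need a single table with the full number of rows again.

```r
library(data.table)

# Toy stand-in for the original table `dt` (values are illustrative only).
dt <- data.table(
  X.USERID  = c(1081L, 1082L, 1083L),
  FOLLOWERS = c("4053807021,2476584389,4713715543",
                "580410695",
                "4827723557,704326016165142528")
)

block_size <- 2L  # hypothetical; 10,000 in the real run

# First row index of every block.
starts <- seq(1L, nrow(dt), by = block_size)

# Apply the same split to one block of rows at a time.
blocks <- lapply(starts, function(i) {
  rows <- i:min(i + block_size - 1L, nrow(dt))
  dt[rows,
     list(FOLLOWERS = unlist(strsplit(FOLLOWERS, split = ",", fixed = TRUE))),
     by = X.USERID]
})

# Total number of rows across blocks, summed as doubles so the sum
# itself cannot overflow integer arithmetic.
total_rows <- sum(as.numeric(vapply(blocks, nrow, integer(1))))
```

Each element of `blocks` can then be processed or written to disk separately.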

My question is: can I change my code to prevent this error from occurring, or am I bumping into the limits of R?

PS. I have a machine with 140gb RAM at my disposal, so physical memory space should not be the issue.

> memory.limit()
[1] 147446

1 Reply

This problem occurs when the number of rows in your dataset exceeds R's limit of 2^31-1 (the largest value a 32-bit signed integer can hold). One way to deal with this problem is to read your dataset in chunks (within a loop). It looks like your file is sorted by the X.USERID field, so your chunks (when you read the file) should overlap by the number of unique followers to ensure that each user belongs to at least one chunk that contains all of that user's followers. The way you process these chunks will depend very much on what you need to do with your data.
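
As a rough check before splitting, the number of rows the long table would have can be estimated from the comma counts and compared with R's integer maximum, `.Machine$integer.max` (2^31 - 1). This is a minimal sketch in base R; the `followers` vector is a made-up stand-in for the FOLLOWERS column, and it assumes every row lists at least one follower.

```r
# Largest value of a 32-bit signed integer: 2^31 - 1 = 2147483647.
max_rows <- .Machine$integer.max

# Toy stand-in for the FOLLOWERS column (values are illustrative only).
followers <- c("4053807021,2476584389,4713715543", "580410695")

# Followers per row = number of commas + 1
# (assumes no row has an empty follower list).
n_commas <- nchar(followers) - nchar(gsub(",", "", followers, fixed = TRUE))
rows_per_user <- n_commas + 1

# Sum as doubles so the total cannot overflow integer arithmetic.
expected_rows <- sum(as.numeric(rows_per_user))

# TRUE if the long table would stay within the integer limit.
fits <- expected_rows <= max_rows
```

If `fits` is `FALSE` on the real data, the long format cannot be held in a single table and has to be processed in pieces.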

