Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share


r - data.table::fread doesn't like missing values in first column

Is this a bug in data.table::fread (version 1.9.2) or misplaced user expectation/error?

Consider this trivial example: I have a TAB-separated table of values, some of which may be missing. If a value is missing in the first column, fread fails, but if the missing values are elsewhere I get the data.table I expect:

# Data with missing value in first column, third row and last column, second row:
12  876 19
23  39  
    15  20

fread("12   876 19
23  39  
    15  20")
#Error in fread("12\t876\t19\n23\t39\t\n\t15\t20") : 
#  Not positioned correctly after testing format of header row. ch='    '

# Data with missing values last column, rows two and three: 
"12 876 19
23  39  
15  20  "

fread( "12  876 19
23  39  
15  20  " )
#   V1  V2 V3
#1: 12 876 19
#2: 23  39 NA
#3: 15  20 NA
# Returns as expected.

Is this a bug, or is it not possible to have missing values in the first column? Or is my data malformed somehow?
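
One way to rule out malformed input is to count the separators on each row. The following is a minimal sketch in base R, assuming the data is the three-row, TAB-separated string from the example with tabs and newlines written as explicit escapes; the variable names are illustrative only.

```r
# Same data as in the question, with tabs and newlines written explicitly.
x <- "12\t876\t19\n23\t39\t\n\t15\t20"

# Split into rows, then count the tab characters on each row.
rows <- strsplit(x, "\n", fixed = TRUE)[[1]]
tabs_per_row <- lengths(regmatches(rows, gregexpr("\t", rows, fixed = TRUE)))

# Three fields per row means exactly two tabs per row.
all_rows_have_three_fields <- all(tabs_per_row == 2L)
```

If every row has the same number of tabs, the table is rectangular and the empty fields are simply missing values rather than a formatting error.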



1 Reply


I believe this is the same bug that I reported here.

The most recent version that I know works with this type of input is Rev. 1180. You can check out and build that version by adding @1180 to the end of the svn checkout command.

svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable/@1180

If you're not familiar with checking out and building packages, see here

However, many features, bug fixes, and enhancements have been implemented since Rev. 1180 (the development version at the time of this writing is Rev. 1272). So a better solution is to keep the current version, replace only the R/fread.R and src/fread.c files with the versions from Rev. 1180 or older, and then rebuild the package.

You can find those files online without checking them out here (sorry, I can't figure out how to post links that include '*', so you have to copy/paste):

fread.R:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/fread.R?revision=988&root=datatable

fread.c:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/src/fread.c?revision=1159&root=datatable

Once you've rebuilt the package, you'll be able to read your tsv file.

> fread("12\t876\t19\n23\t39\t\n\t15\t20")
   V1  V2 V3
1: 12 876 19
2: 23  39 NA
3: NA  15 20

The downside to doing this is that the old version of fread() does not pass a newer test -- you won't be able to read fields that have quotes in the middle.

> fread('A,B,C
1.2,Foo"Bar,"a"b"c"d"
fo"o,bar,"b,az""
')
Error in fread("A,B,C\n1.2,Foo\"Bar,\"a\"b\"c\"d\"\nfo\"o,bar,\"b,az\"\"\n") : 
  Not positioned correctly after testing format of header row. ch=','

With newer versions of fread, you would get this

> fread('A,B,C
1.2,Foo"Bar,"a"b"c"d"
fo"o,bar,"b,az""
')
      A       B       C
1:  1.2 Foo"Bar a"b"c"d
2: fo"o     bar   b,az"

So, for now, which version "works" depends on whether you're more likely to have missing values in the first column, or quotes in fields. For me, it's the former, so I'm still using the old code.
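
If rebuilding the package is not an option, a slower fallback is to read the data with base R instead of fread. This is only a sketch of that alternative, not a fix for fread itself; it assumes the same TAB-separated input as in the question, and the conversion to a data.table in the last comment assumes data.table is installed.

```r
# Same data as in the question, with tabs and newlines written explicitly.
x <- "12\t876\t19\n23\t39\t\n\t15\t20"

# Base R treats empty fields in numeric columns as NA,
# including an empty field in the first column.
df <- read.delim(text = x, header = FALSE, sep = "\t")

# Optional: convert to a data.table afterwards (requires data.table).
# dt <- data.table::as.data.table(df)
```

For a file on disk, the same call takes a file path as its first argument in place of `text = x`.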

