Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - DT[!(x == .)] and DT[x != .] treat NA in x inconsistently

This is something that I thought I should ask following this question. I'd like to confirm whether this is a bug/inconsistency before filing it as such in the R-forge tracker.

Consider this data.table:

library(data.table)
DT <- data.table(x = c(1, 0, NA), y = 1:3)

Now, to select all rows of DT where x is not 0, we could do either of the following:

DT[x != 0]
#    x y
# 1: 1 1
DT[!(x == 0)]
#     x y
# 1:  1 1
# 2: NA 3

DT[x != 0] and DT[!(x == 0)] give different results even though the underlying logical operations are equivalent.
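To make "logically equivalent" concrete, here is a minimal sketch using plain R vectors only (no data.table involved). The names a and b are just illustrative.

```r
# Plain R vectors, no data.table involved.
x <- c(1, 0, NA)

a <- x != 0      # TRUE FALSE NA
b <- !(x == 0)   # TRUE FALSE NA

print(a)
print(b)
print(identical(a, b))   # NA propagates the same way through both forms
```

Outside of [.data.table, then, the two expressions produce the same logical vector, NA included.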

Note: Converting this into a data.frame and running these operations gives results that are identical to each other for the two logically equivalent operations, but that result differs from both data.table results. For an explanation of why, see ?`[` under the section "NAs in indexing".

Edit: Since some of you have stressed consistency with data.frame, here is the output of the same operations on a data.frame:

DF <- as.data.frame(DT)
# check ?`[` under the section `NAs in indexing` as to why this happens
DF[DF$x != 0, ]
#     x  y
# 1   1  1
# NA NA NA
DF[!(DF$x == 0), ]
#     x  y
# 1   1  1
# NA NA NA
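As a small aside on the data.frame output above: the all-NA row comes from the NA in the logical index. A sketch, assuming you want base R to drop that row, is to wrap the index in which(); the names idx, with_na and no_na are only illustrative.

```r
DF <- data.frame(x = c(1, 0, NA), y = 1:3)

idx <- DF$x != 0             # TRUE FALSE NA
with_na <- DF[idx, ]         # the NA in idx produces an all-NA row
no_na   <- DF[which(idx), ]  # which() keeps only the positions that are TRUE

print(with_na)
print(no_na)
```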

I think this is an inconsistency and both should provide the same result. But, which result? The documentation for [.data.table says:

i ---> Integer, logical or character vector, expression of column names, list or data.table.

integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame.

It's clear why the results are different from what one would get from doing the same operation on a data.frame. But still, within data.table, if this is the case, then both of them should return:

#    x y
# 1: 1 1

I went through the [.data.table source code and now understand why this is happening. See this post for a detailed explanation.

Briefly, x != 0 evaluates to a logical vector, and its NA is replaced by FALSE. For !(x == 0), however, (x == 0) is evaluated to a logical vector first and its NA is replaced by FALSE; only then is the negation applied, so the NA effectively becomes TRUE.
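The order-of-operations point can be emulated with plain vectors. This is only a sketch of the behaviour described above, not data.table's actual code; na_to_false is a hypothetical helper introduced for illustration.

```r
x <- c(1, 0, NA)

# Hypothetical helper: mimic "NA in logical i is treated as FALSE".
na_to_false <- function(v) {
  v[is.na(v)] <- FALSE
  v
}

# x != 0: evaluate, then replace NA by FALSE.
path_neq <- na_to_false(x != 0)       # TRUE FALSE FALSE

# !(x == 0): replace NA by FALSE on the inner result, then negate.
path_not_eq <- !na_to_false(x == 0)   # TRUE FALSE TRUE

print(path_neq)
print(path_not_eq)
```

Replacing NA before negating is what lets the NA row through in the second case.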

So, my first (or rather main) question is: is this a bug/inconsistency? If so, I'll file it as one in the data.table R-forge tracker. If not, I'd like to know the reason for this difference, and I would like to suggest a correction to the documentation explaining it (to the already amazing documentation!).

Edit: Following up on the comments, the second question is: should data.table's handling of subsetting on columns containing NA resemble that of data.frame? (But I agree with @Roland's comment that this may very well lead to opinion-based answers, and I'm perfectly fine with this question not being answered at all.)


1 Reply


I think it is documented and consistent behaviour.

The main thing to note is that the prefix ! within the i argument is a flag for a not-join, so x != 0 and !(x == 0) are no longer the same logical operation under the documented handling of NA within data.table.

The section from the NEWS regarding the not-join:

A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
            DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
            DT[!"a"]                              # same result, now preferred.
            DT[!J(6),...]                         # !J == not-join
            DT[!2:3,...]                          # ! on all types of i
            DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
            DT[!J(6L,23L)]                        # same result, faster binary search
        '!' has been used rather than '-' :
            * to match the 'not-join'/'not-where' nomenclature
            * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
              compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
              base R) and after this new feature.
            * to leave DT[+J...] and DT[-J...] available for future use

And from ?data.table:

All types of 'i' may be prefixed with !. This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present. See examples.


Why is it consistent with the documented handling of NA within data.table

NA values are considered FALSE. Think of it like doing isTRUE on each element.
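A minimal illustration of the isTRUE analogy, using plain R only; idx and as_true_only are illustrative names, and this is an analogy rather than what data.table does internally.

```r
idx <- c(1, 0, NA) != 0                          # TRUE FALSE NA

# isTRUE() is TRUE only for a genuine TRUE, so NA maps to FALSE.
as_true_only <- vapply(idx, isTRUE, logical(1))  # TRUE FALSE FALSE

print(as_true_only)
```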

So DT[x != 0] is indexed with TRUE FALSE NA, which becomes TRUE FALSE FALSE due to the documented NA handling.

You want to subset where things are TRUE.

This means you get the rows where x != 0 is TRUE (and not NA).

DT[!(x == 0)] uses the not-join, which states that you want everything that is not 0 (which can and will include the NA values).
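If you would rather not depend on how a leading ! is interpreted, one option is to state the NA handling explicitly so that the logical i contains no NA and does not start with !. This is a sketch assuming the data.table package is installed; drop_na and keep_na are illustrative names.

```r
library(data.table)

DT <- data.table(x = c(1, 0, NA), y = 1:3)

# Exclude NA rows explicitly: the mask is TRUE FALSE FALSE.
drop_na <- DT[x != 0 & !is.na(x)]

# Include NA rows explicitly: the mask is TRUE FALSE TRUE.
keep_na <- DT[is.na(x) | x != 0]

print(drop_na)
print(keep_na)
```

Because neither mask contains NA and the top-level call is & or |, the result does not hinge on the NA-as-FALSE rule or on the not-join prefix.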


follow up queries / further examples

DT[!(x!=0)]

## returns
    x y
1:  0 2
2: NA 3

x != 0 is TRUE for one value, so the not-join returns what isn't TRUE, i.e. what was FALSE (actually == 0) or NA.

DT[!!(x==0)]

## returns
    x y
1:  0 2
2: NA 3

This is parsed as !(!(x==0)). The prefix ! denotes a not join, and the inner !(x==0) is parsed identically to x!=0, so the reasoning from the case immediately above applies.
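The four cases discussed in this answer can be reproduced with a small emulation in plain R. It follows the reasoning given here (strip one leading ! as a not-join flag, treat NA as FALSE, then invert if flagged); emulate_rows is a hypothetical helper, not part of data.table, and actual results may depend on the data.table version.

```r
x <- c(1, 0, NA)

# Hypothetical helper: `inner` is i with one leading "!" already stripped,
# `notjoin` says whether that "!" was present.
emulate_rows <- function(inner, notjoin = FALSE) {
  inner[is.na(inner)] <- FALSE           # NA in logical i is treated as FALSE
  keep <- if (notjoin) !inner else inner
  which(keep)                            # row numbers that would be returned
}

r1 <- emulate_rows(x != 0)                      # DT[x != 0]
r2 <- emulate_rows(x == 0, notjoin = TRUE)      # DT[!(x == 0)]
r3 <- emulate_rows(x != 0, notjoin = TRUE)      # DT[!(x != 0)]
r4 <- emulate_rows(!(x == 0), notjoin = TRUE)   # DT[!!(x == 0)]

print(list(r1, r2, r3, r4))
```

Here r1 is row 1, r2 is rows 1 and 3, and r3 and r4 are both rows 2 and 3, matching the outputs shown in the question and above.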

