Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
390 views
in Technique[技术] by (71.8m points)

memory - Why do logicals (booleans) in R require 4 bytes?

For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)

Now, I realize that R also facilitates storage of NA values, but couldn't that be done with an additional bit vector? In other words, why isn't it enough to just use a cheap two bit data structure?

For what it's worth, Matlab uses 1 byte for logicals, though it doesn't facilitate NA values. I'm not sure why MathWorks isn't satisfied with one bit functionality, much less a two bit data structure, but they have fancy pants marketers... [I'm gonna milk "two bit" for all it's worth in this question. ;-)]


Update 1. I think that the architecture reasons offered make some sense, but that feels a little ex post facto. I haven't checked 32 bit or 16 bit R to see how large their logicals are - that could lend some support to the idea. It seems, from the R Internals manual that logical vectors (LGLSXP) and integers (INTSXP) are 32 bits on every platform. I can understand a universal size for integers, independent of word size. Similarly, storage of logicals also seems to be independent of word size. But it's so BIG. :)

In addition, if the word size argument is so powerful, it seems strange to me to see Matlab (I think it's a 32 bit Matlab) consume only 1 byte - I wonder if MathWorks chose to be more memory efficient with a tradeoff for programming complexity and some other overhead for finding sub-word objects.

Also, there are certainly other options in are: as Brian Diggs notes, the bit package facilitates bit vectors, which was very useful for the problem in the question above (an 8X-10X speedup for the task was obtained by converting from 4 byte logical values to bit vectors). Although speed of accessing memory is important, moving 30-31 extra uninformative bits (from an information theory perspective) is wasteful. For instance, one could use something like the memory tricks used for integers described here - grab a bunch of extra memory (V cells) and then process things at the bit level (a la bit()). Why not do that and save 30 bits (1 for the value, 1 for NA) for a long vector?

To the extent that my RAM and computational speed are affected by booleans, I intend to switch over to using bit, but that's because a 97% savings in space matters in some cases. :)

I think that the answer to this question will come from someone with a deeper understanding of R's design or internals. The best example is that Matlab uses a different size for their logical, and memory word sizes wouldn't be the answer in that case. Python may be similar to R, for what it's worth.

A related way to phrase this might be: why would LGLSXP be 4 bytes on all platforms? (Is CHARSXP typically smaller, and wouldn't that work as well? Why not go even smaller, and just over-allocate?) (Updated The idea of using CHARSXP is likely bogus, because operations on CHARSXP aren't as fully useful as those for integers, such as sum. Using the same data structure as characters might save space, but would constrain which existing methods could operate on it. A more appropriate consideration is the use of smaller integers, as discussed below.)


Update 2. There have been some very good and enlightening answers here, especially relative to how one should implement retrieval and processing of booleans for the goals of speed and programming efficiency. I think that Tommy's answer is particularly plausible regarding the why it appears this way in R, which seems to arise from 2 premises:

  1. In order to support addition on a logical vector (note that "logical" is defined by programming language / environment, and is not the same as a boolean), one is best served by reusing code for adding integers. In the case of R, integers consume 4 bytes. In the case of Matlab, the smallest integer is 1 byte (i.e. int8). This would explain why something different would be a nuisance to write for logicals. [To those not familiar with R, it supports many numerical operations on logicals, such as sum(myVector), mean(myVector), etc.]

  2. Legacy support makes it exceedingly difficult to do something other than what has been done in R and S-Plus for a long time now. Moreover, I suspect that in the early days of S, S-Plus, and R, if someone was doing a lot of boolean operations, they did them in C, rather than trying to do so much work with logicals in R.

The other answers are fantastic for the purposes of how one might implement better boolean handling - don't naively assume that one can get at any individual bit: it's most efficient to load a word, then mask the bits that are not of interest, as Dervall has described. This is very, very useful advice should one write specialized code for boolean manipulation for R (e.g. my question on cross tabulations): don't iterate over bits, but instead work at the word level.

Thanks to all for a very thorough set of answers and insights.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Without knowing R at all, I suspect for much the same reason as C does, because it's way faster to load a size equal to the processors native word size.

Loading a single bit would be slow, especially from a bitfield since you'd have to mask out the bits that do not apply to your particular query. With a whole word you can just load it in a registry and be done with it. Since the size difference usually is not a problem the default implementation is to use a word sized variable. If the user wants something else there is always the option to do the bit-shifting required manually.

Concerning packing, at least on some processors it will cause a fault to read from a non-aligned address. So while you might declare a structure with a single byte in it surrounded by two int the byte might be padded to be 4 bytes in size regardless. Again, I don't know anything about R in particular, but I suspect the behaviour might be the same for performance reasons.

Addressing a single byte in an array is quite more involved, say you have an array bitfield and want to address bit x in it, the code would be something like this:

bit b = (bitfield[x/8] >> (x % 8)) & 1

to obtain either 0 or 1 for the bit you requested. In comparison to the straightforward array addressing of from a boolean array obtaining value number x: bool a = array[x]


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...