Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Why does as.matrix add extra spaces when converting numeric to character?

If you use apply over the rows of a data.frame with character and numeric columns, apply uses as.matrix internally to convert the data.frame to a character matrix. But if the numeric column contains numbers of different widths, as.matrix pads them with leading spaces to match the widest number.

An example:

df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE) 
df
##   id1 id2
## 1   a 100
## 2   a  90
## 3   a   8
as.matrix(df)
##      id1 id2  
## [1,] "a" "100"
## [2,] "a" " 90"
## [3,] "a" "  8"

I would have expected the result to be:

     id1 id2  
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"

Why the extra spaces?

They can create unexpected results when using apply on a data.frame:

myfunc <- function(row){
  paste(row[1], row[2], sep = "")
}
> apply(df, 1, myfunc)
[1] "a100" "a 90" "a  8"

Looping over the rows, by contrast, gives the expected result:

> for (i in 1:nrow(df)){
  print(myfunc(df[i,]))
}
[1] "a100"
[1] "a90"
[1] "a8"

So does calling paste on the columns directly:

> paste(df[,1], df[,2], sep = "")
[1] "a100" "a90"  "a8"  

Are there any situations where the extra spaces added by as.matrix are useful?



1 Reply


This is because of the way non-character columns are converted in the as.matrix.data.frame method when the data frame also contains a non-numeric column. There is a simple work-around, shown below.

Details

?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:

 ‘as.matrix’ is a generic function.  The method for data frames
 will return a character matrix if there is only atomic columns and
 any non-(numeric/logical/complex) column, applying ‘as.vector’ to
 factors and ‘format’ to other non-character columns.  Otherwise,
 the usual coercion hierarchy (logical < integer < double <
 complex) will be used, e.g., all-logical data frames will be
 coerced to a logical matrix, mixed logical-integer will give a
 integer matrix, etc.
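
As a quick check of the two cases that passage describes, the sketch below compares a mixed data frame with an all-numeric one. The names mixed and nums are made up for this illustration and do not appear in the question.

```r
# Hypothetical data frames, for illustration only
mixed <- data.frame(id1 = c("a", "a", "a"), id2 = c(100, 90, 8),
                    stringsAsFactors = FALSE)
nums  <- data.frame(x = c(100, 90, 8), y = c(1, 2, 3))

# A character column is present, so id2 goes through format()
m_mixed <- as.matrix(mixed)
# All columns are numeric, so the usual coercion hierarchy applies
m_nums  <- as.matrix(nums)

typeof(m_mixed)          # "character"
typeof(m_nums)           # "double"
nchar(m_mixed[, "id2"])  # every entry has the same width because of the padding
```

So the padding only shows up when as.matrix has to fall back to a character matrix; an all-numeric data frame keeps its numbers as numbers.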

?format also notes that:

Character strings are padded with blanks to the display width of the widest.

Consider this example, which illustrates the behaviour:

> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3

format doesn't have to work this way, as it has a trim argument:

trim: logical; if ‘FALSE’, logical, numeric and complex values are
      right-justified to a common width: if ‘TRUE’ the leading
      blanks for justification are suppressed.

For example:

> format(df[,2], trim = TRUE)
[1] "100" "90"  "8"

but there is no way to pass this argument along to the as.matrix.data.frame method.

Workaround

One way to work around this is to apply format() to each column yourself via sapply, where you can pass in trim = TRUE:

> sapply(df, format, trim = TRUE)
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"

or, using vapply we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):

> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"
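
To tie this back to the apply call in the question, here is a minimal sketch that builds the trimmed character matrix first and then applies myfunc to its rows. It reuses df and myfunc from the question; the names m and res are only for this example.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

myfunc <- function(row) {
  paste(row[1], row[2], sep = "")
}

# Build the character matrix ourselves, with trim = TRUE,
# instead of letting apply() call as.matrix() on the data frame
m <- vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)

# apply() now receives a matrix without the padding
res <- apply(m, 1, myfunc)
res
```

Note that with a one-row data frame vapply returns a plain vector rather than a matrix, so this sketch assumes df has at least two rows.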
