Consistency problems in Base R data manipulation

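This started with an SO link that @tctcab posted in the thread <https://d.cosx.org/d/420762/16>, which compares data.table and dplyr. I don't intend to continue that comparison here. I just want to lay out a problem I ran into. Of course it may not be a problem at all; maybe my skills are limited and I'm simply manipulating the data the wrong way?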

The example is again lifted straight from SO. It combines filtering, grouped summaries, and sorting, so it is fairly representative.

library(ggplot2)
data("diamonds")

myfun <- function(x) {
  c(AvgPrice = mean(x), MedianPrice = median(x), Count = length(x))
}
dat <- aggregate(data = diamonds, price ~ cut, FUN = myfun, subset = cut != "Fair", simplify = TRUE)
dat
#>         cut price.AvgPrice price.MedianPrice price.Count
#> 1      Good       3928.864          3050.500    4906.000
#> 2 Very Good       3981.760          2648.000   12082.000
#> 3   Premium       4584.258          3185.000   13791.000
#> 4     Ideal       3457.542          1810.000   21551.000
str(dat)
#> 'data.frame':    4 obs. of  2 variables:
#>  $ cut  : Ord.factor w/ 5 levels "Fair"<"Good"<..: 2 3 4 5
#>  $ price: num [1:4, 1:3] 3929 3982 4584 3458 3050 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr  "AvgPrice" "MedianPrice" "Count"
class(dat$cut)
#> [1] "ordered" "factor"
class(dat$price)
#> [1] "matrix"
dat$price
#>      AvgPrice MedianPrice Count
#> [1,] 3928.864      3050.5  4906
#> [2,] 3981.760      2648.0 12082
#> [3,] 4584.258      3185.0 13791
#> [4,] 3457.542      1810.0 21551

<sup>Created on 2019-07-04 by the reprex package (v0.3.0)</sup>

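Next, I didn't know how to sort by Count, because the data frame dat is a hybrid of a vector and a matrix. The printed output shows four columns, but str(dat) reports only two: cut, and a matrix column price that holds all three statistics. This is my biggest complaint about Base R: the input is a data frame, but the output is not a consistently typed data frame (it has a matrix mixed in). As a result, sorting with order the obvious way no longer works, because dat$Count does not exist: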

dat[order(dat$cut, dat$Count,  decreasing = TRUE), ]

It has to become

dat[order(dat$cut, dat$price[,"Count"], decreasing = TRUE), ]
#>         cut price.AvgPrice price.MedianPrice price.Count
#> 4     Ideal       3457.542          1810.000   21551.000
#> 3   Premium       4584.258          3185.000   13791.000
#> 2 Very Good       3981.760          2648.000   12082.000
#> 1      Good       3928.864          3050.500    4906.000
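
One possible way around the matrix column is to splice it back into ordinary columns before sorting. The following is only a minimal sketch, not something from the SO answer: it assumes the built-in mtcars data (so it runs without ggplot2) and a stand-in summary function named myfun like the one above; the names agg, flat, and sorted are made up for illustration.

```r
# Stand-in for the post's myfun: three statistics per group
myfun <- function(x) c(Avg = mean(x), Median = median(x), Count = length(x))

# aggregate() returns 2 columns: 'cyl' plus a matrix column 'mpg'
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = myfun)
names(agg)           # "cyl" "mpg"
is.matrix(agg$mpg)   # TRUE
is.null(agg$Count)   # TRUE: there is no top-level Count column

# Turn the matrix column into a data frame and bind it next to 'cyl'
flat <- cbind(agg["cyl"], as.data.frame(agg$mpg))

# Now every column is a plain vector, so order() works as expected
sorted <- flat[order(flat$Count, decreasing = TRUE), ]
sorted
```

After flattening, flat has the columns cyl, Avg, Median, and Count, and sorting by flat$Count needs no matrix indexing.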

Careful readers may notice that the help page for aggregate describes the simplify argument as

a logical indicating whether results should be simplified to a vector or matrix if possible.

So if simplify is set to FALSE, the returned data frame contains a list column instead. I don't know what emoji to put here!!

> dat <- aggregate(data = diamonds, price ~ cut, FUN = myfun, subset = cut != "Fair", simplify = FALSE)
> dat
        cut                         price
1      Good  3928.864, 3050.500, 4906.000
2 Very Good    3981.76, 2648.00, 12082.00
3   Premium 4584.258, 3185.000, 13791.000
4     Ideal 3457.542, 1810.000, 21551.000
> str(dat)
'data.frame':   4 obs. of  2 variables:
 $ cut  : Ord.factor w/ 5 levels "Fair"<"Good"<..: 2 3 4 5
 $ price:List of 4
  ..$ : Named num  3929 3050 4906
  .. ..- attr(*, "names")= chr  "AvgPrice" "MedianPrice" "Count"
  ..$ : Named num  3982 2648 12082
  .. ..- attr(*, "names")= chr  "AvgPrice" "MedianPrice" "Count"
  ..$ : Named num  4584 3185 13791
  .. ..- attr(*, "names")= chr  "AvgPrice" "MedianPrice" "Count"
  ..$ : Named num  3458 1810 21551
  .. ..- attr(*, "names")= chr  "AvgPrice" "MedianPrice" "Count"
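
The list column can be flattened in a similar spirit by row-binding its elements. Again this is just a sketch under assumptions: it uses the built-in mtcars data rather than diamonds, and myfun, agg, stats, flat, and sorted are illustrative names, not part of the original example.

```r
# Stand-in for the post's myfun
myfun <- function(x) c(Avg = mean(x), Median = median(x), Count = length(x))

# With simplify = FALSE, 'mpg' is a list of named numeric vectors
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = myfun, simplify = FALSE)
is.list(agg$mpg)   # TRUE

# Stack the per-group vectors into a matrix, one row per group
stats <- do.call(rbind, agg$mpg)

# Bind the statistics next to the grouping column as ordinary columns
flat <- cbind(agg["cyl"], as.data.frame(stats))

# Sort by Count, largest group first
sorted <- flat[order(flat$Count, decreasing = TRUE), ]
sorted
```

Either way, an extra step is needed before the result behaves like a plain data frame, which is exactly the inconsistency being complained about.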

Here are the dplyr and data.table implementations for reference

# copied from https://stackoverflow.com/questions/21435339/
library(dplyr)

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

library(data.table)

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]
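
For comparison, Base R can also produce a flat result directly if the grouping is done by hand with split and the pieces are row-bound as one-row data frames. This is a sketch of one such route, not the approach from the SO thread; it assumes the built-in mtcars data, skips the filtering step for brevity, and the names groups and res are hypothetical.

```r
# Split the values by group: a named list of numeric vectors
groups <- split(mtcars$mpg, mtcars$cyl)

# Build one single-row data frame per group, then stack them
res <- do.call(rbind, lapply(names(groups), function(g) {
  x <- groups[[g]]
  data.frame(
    cyl    = as.numeric(g),
    Avg    = mean(x),
    Median = median(x),
    Count  = length(x)
  )
}))

# Every column is a plain vector, so sorting by Count is direct
res <- res[order(-res$Count), ]
res
```

It works, but it is noticeably more verbose than the dplyr and data.table versions above.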

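This post is also meant to clarify what I mean by data type consistency. I may not have explained it clearly in the thread <https://d.cosx.org/d/420697/9>, so I'm opening a separate thread to discuss it.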

    Cloud2016

    That's scary. It's tiring just to look at: a simple data operation and you still have to spend effort sorting out a mix of list, matrix, and data.frame

      tctcab This also shows how much care went into dplyr. For data manipulation, getting type consistency right really matters

      8 months later
      with(aggregate(disp ~ cyl, mtcars, function(x) c(mean = mean(x), n = length(x))), as.data.frame(cbind(cyl, disp)))