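This story starts with a Stack Overflow link that @tctcab posted in the thread <https://d.cosx.org/d/420762/16>, which compares data.table and dplyr. I don't intend to continue that comparison here. I just want to lay out a problem I ran into. It may not be a real problem at all; perhaps my own limited skill simply led me to handle the data the wrong way?
The example is again taken straight from SO. It combines filtering, grouped summary statistics, and sorting, so it is fairly representative.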
library(ggplot2)
data("diamonds")
myfun <- function(x) {
  c(AvgPrice = mean(x), MedianPrice = median(x), Count = length(x))
}
dat <- aggregate(data = diamonds, price ~ cut, FUN = myfun, subset = cut != "Fair", simplify = TRUE)
dat
#>         cut price.AvgPrice price.MedianPrice price.Count
#> 1      Good       3928.864          3050.500    4906.000
#> 2 Very Good       3981.760          2648.000   12082.000
#> 3   Premium       4584.258          3185.000   13791.000
#> 4     Ideal       3457.542          1810.000   21551.000
str(dat)
#> 'data.frame': 4 obs. of 2 variables:
#> $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 2 3 4 5
#> $ price: num [1:4, 1:3] 3929 3982 4584 3458 3050 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr "AvgPrice" "MedianPrice" "Count"
class(dat$cut)
#> [1] "ordered" "factor"
class(dat$price)
#> [1] "matrix"
dat$price
#>      AvgPrice MedianPrice Count
#> [1,] 3928.864      3050.5  4906
#> [2,] 3981.760      2648.0 12082
#> [3,] 4584.258      3185.0 13791
#> [4,] 3457.542      1810.0 21551
<sup>Created on 2019-07-04 by the reprex package (v0.3.0)</sup>
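Next, I didn't know how to sort by Count, because the data frame dat is a mix of a vector and a matrix. This is what I most want to complain about in Base R: the input is a data frame, but the output is not a consistent data frame (it has a matrix column mixed in). There is no dat$Count column, since Count is only a column of the matrix dat$price, so the obvious attempt to sort with order doesn't work: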
dat[order(dat$cut, dat$Count, decreasing = TRUE), ]
It has to become:
dat[order(dat$cut, dat$price[,"Count"], decreasing = TRUE), ]
#>         cut price.AvgPrice price.MedianPrice price.Count
#> 4     Ideal       3457.542          1810.000   21551.000
#> 3   Premium       4584.258          3185.000   13791.000
#> 2 Very Good       3981.760          2648.000   12082.000
#> 1      Good       3928.864          3050.500    4906.000
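One possible workaround, sketched here under assumptions: split the matrix column back into ordinary columns before sorting. To keep the snippet self-contained it uses a tiny made-up data frame `toy` in place of diamonds, and the names `flat` and `sorted` are purely illustrative.

```r
# Toy stand-in for ggplot2::diamonds (values are made up for illustration)
toy <- data.frame(
  cut   = factor(c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal", "Premium"),
                 levels = c("Fair", "Good", "Premium", "Ideal"), ordered = TRUE),
  price = c(100, 300, 500, 200, 400, 900, 700)
)

myfun <- function(x) {
  c(AvgPrice = mean(x), MedianPrice = median(x), Count = length(x))
}

dat <- aggregate(price ~ cut, data = toy, FUN = myfun, subset = cut != "Fair")

# dat really has only 2 columns; price is a 3-column matrix
ncol(dat)
is.matrix(dat$price)

# Bind the matrix columns next to cut so every column is a plain vector
flat <- cbind(dat["cut"], dat$price)
str(flat)

# A plain order() on Count now works
sorted <- flat[order(flat$Count, decreasing = TRUE), ]
sorted
```

Another commonly cited idiom is `do.call(data.frame, dat)`, which should flatten the result as well while keeping the `price.` prefix in the column names.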
Careful readers may notice that the help page for the function aggregate describes the simplify argument as follows:
a logical indicating whether results should be simplified to a vector or matrix if possible.
So if the simplify argument is set to FALSE, the returned data frame contains a list column. I don't know which emoji belongs here!!
dat <- aggregate(data = diamonds, price ~ cut, FUN = myfun, subset = cut != "Fair", simplify = FALSE)
> dat
cut price
1 Good 3928.864, 3050.500, 4906.000
2 Very Good 3981.76, 2648.00, 12082.00
3 Premium 4584.258, 3185.000, 13791.000
4 Ideal 3457.542, 1810.000, 21551.000
> str(dat)
'data.frame': 4 obs. of 2 variables:
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 2 3 4 5
$ price:List of 4
..$ : Named num 3929 3050 4906
.. ..- attr(*, "names")= chr "AvgPrice" "MedianPrice" "Count"
..$ : Named num 3982 2648 12082
.. ..- attr(*, "names")= chr "AvgPrice" "MedianPrice" "Count"
..$ : Named num 4584 3185 13791
.. ..- attr(*, "names")= chr "AvgPrice" "MedianPrice" "Count"
..$ : Named num 3458 1810 21551
.. ..- attr(*, "names")= chr "AvgPrice" "MedianPrice" "Count"
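If you do end up with the list column, one way to recover a flat data frame is to stack the list elements with rbind. This is only a sketch: `toy` is a made-up stand-in for diamonds, and `stats`, `flat` and `sorted` are illustrative names.

```r
# Toy stand-in for ggplot2::diamonds (values are made up for illustration)
toy <- data.frame(
  cut   = factor(c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal", "Premium"),
                 levels = c("Fair", "Good", "Premium", "Ideal"), ordered = TRUE),
  price = c(100, 300, 500, 200, 400, 900, 700)
)

myfun <- function(x) {
  c(AvgPrice = mean(x), MedianPrice = median(x), Count = length(x))
}

dat <- aggregate(price ~ cut, data = toy, FUN = myfun,
                 subset = cut != "Fair", simplify = FALSE)

# price is a list with one named numeric vector per group
is.list(dat$price)

# Stack the per-group vectors into a matrix, one row per group
stats <- do.call(rbind, dat$price)

# Bind the matrix columns next to cut to get plain vector columns
flat <- cbind(dat["cut"], stats)

sorted <- flat[order(flat$Count, decreasing = TRUE), ]
sorted
```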
For reference, here are the dplyr and data.table implementations:
# copy from https://stackoverflow.com/questions/21435339/
library(dplyr)
diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))
library(data.table)
diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair",
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ),
  by = cut
][
  order(-Count)
]
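For comparison, Base R can also return a flat data frame directly if one bypasses aggregate and does the split-apply-combine by hand. The following is a rough sketch, not a recommendation: `toy` is a made-up stand-in for diamonds, and `kept`, `pieces`, `res` and `sorted` are illustrative names.

```r
# Toy stand-in for ggplot2::diamonds (values are made up for illustration)
toy <- data.frame(
  cut   = factor(c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal", "Premium"),
                 levels = c("Fair", "Good", "Premium", "Ideal"), ordered = TRUE),
  price = c(100, 300, 500, 200, 400, 900, 700)
)

# Filter first, and drop the now-empty "Fair" level so split() skips it
kept <- droplevels(toy[toy$cut != "Fair", ])

# Build one single-row data frame per group, then stack them
pieces <- lapply(split(kept$price, kept$cut), function(p) {
  data.frame(AvgPrice = mean(p), MedianPrice = median(p), Count = length(p))
})
res <- do.call(rbind, pieces)

# The group labels end up in the row names; move them into a real column
res <- data.frame(cut = rownames(res), res, row.names = NULL)

# Every column is now a plain vector, so sorting by Count is direct
sorted <- res[order(-res$Count), ]
sorted
```

Note that in this sketch cut comes back as a plain label column rather than an ordered factor; restoring the factor would need an extra step.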
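This post is also meant to clarify what I mean by data type consistency. I may not have explained it clearly in the thread <https://d.cosx.org/d/420697/9>, so I am starting a separate thread to discuss it.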