求证一个诡异的 lm() 处理因子的问题

Ihavenothing · 2017年8月10日

今天想展示一个简单的两因子方差分析，就拿 ggplot2 包中的钻石数据跑了一下，代码如下：

library(ggplot2)
data(diamonds)

m = lm(price ~ cut + color, data = diamonds)
summary(m)

输出结果是酱婶儿的：

Call:
lm(formula = price ~ cut + color, data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max 
 -5511  -2628  -1344   1327  16091 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4226.45      26.06 162.203  < 2e-16 ***
cut.L        -293.50      67.06  -4.377 1.21e-05 ***
cut.Q        -265.77      59.75  -4.448 8.70e-06 ***
cut.C        -630.47      52.02 -12.119  < 2e-16 ***
cut^4        -262.85      41.92  -6.271 3.62e-10 ***
color.L      2064.35      56.76  36.370  < 2e-16 ***
color.Q       180.63      53.97   3.347 0.000818 ***
color.C      -264.05      50.80  -5.198 2.02e-07 ***
color^4        37.01      46.65   0.793 0.427662    
color^5      -248.25      44.11  -5.628 1.83e-08 ***
color^6        65.32      40.00   1.633 0.102486

这几个因子水平是什么鬼？？如果直接查看因子水平，应该是这样的

levels(diamonds$color)
# [1] "D" "E" "F" "G" "H" "I" "J"
levels(diamonds$cut)
# [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

我印象中 lm() 给水平命令是用“因子名+水平名”的办法，比如 colorD，cutFair 这种，什么时候变成这么诡异的输出了？只有我是这样吗？

Cloud2016 · 2017年8月10日

我的也是这样，变量如果不在一个数据框里，就是Ihavenothing 期望的那样，以?lm自带的例子来看

ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group-1)
summary(lm.D9)

Call:
lm(formula = weight ~ group - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4938  0.0685  0.2462  1.3690 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
groupCtl   5.0320     0.2202   22.85 9.55e-15 ***
groupTrt   4.6610     0.2202   21.16 3.62e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared:  0.9818,    Adjusted R-squared:  0.9798 
F-statistic: 485.1 on 2 and 18 DF,  p-value: < 2.2e-16

再大胆猜测L、Q、C的含义
线性 linear
二次 quadratic
三次 cube

zhangpj · 2017年8月11日

diamond 數據集裏的 color 和 cut 是 ordinal variable。

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

可以當作普通 categorical variable 來處理，但這樣就忽略了順序這個信息。我記得 R 裡默認用 orthogonal polynomials 編碼，來表示不同的間隔。具體可以讀下這篇文章的最後一部分：The Basics of Encoding Categorical Data for Predictive Models。

Ihavenothing · 2017年8月11日

zhangpj 之前还真不知道这个信息，多谢了！

yihui · 2017年8月25日

zhangpj 上学期间我对这个问题似懂非懂，一直没明白为啥要对定序变量用正交多项式编码，今天再看了一遍你给的链接，明白多了。