Why does R-squared change when the constant term is moved into the regressor matrix?

drewlee

Why do the following two lm fits give different R-squared values?

> x=matrix(rnorm(20),ncol=2)
> xx=cbind(rep(1,10),x)
> y=x%*%c(1,2)+rnorm(10)
>
> lm=lm(y~x)
> summary(lm)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-1.55898 -0.33028 -0.08908  0.77964  1.12273

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.7713     0.3544  -2.177 0.065969 .
x1            0.7086     0.3621   1.957 0.091269 .
x2            2.1141     0.3470   6.093 0.000494 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.019 on 7 degrees of freedom
Multiple R-Squared: 0.8414,     Adjusted R-squared: 0.7961
F-statistic: 18.57 on 2 and 7 DF,  p-value: 0.001589

>
> lm=lm(y~xx-1)
> summary(lm)

Call:
lm(formula = y ~ xx - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.55898 -0.33028 -0.08908  0.77964  1.12273

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
xx1  -0.7713     0.3544  -2.177 0.065969 .
xx2   0.7086     0.3621   1.957 0.091269 .
xx3   2.1141     0.3470   6.093 0.000494 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.019 on 7 degrees of freedom
Multiple R-Squared: 0.9014,     Adjusted R-squared: 0.8592
F-statistic: 21.34 on 3 and 7 DF,  p-value: 0.0006728

drewlee

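I computed R-squared from its definition, and the value reported by the first lm is the correct one. I don't know how the R-squared of the second lm is calculated.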

drewlee

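It gets stranger. Doesn't everyone say that the R-squared of a linear model always increases when regressors are added? The example below does not behave that way: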

> x=matrix(rnorm(20),ncol=2)
> y=rnorm(10)
>
> lm=lm(y~1)
> y.hat=rep(1*lm$coefficients,length(y))
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 2.646815e-33
>
> lm=lm(y~x-1)
> y.hat=x%*%lm$coefficients
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 0.4443356
>
> ################ This is the biggest model, but its R2 is not the biggest, why?
> lm=lm(y~x)
> y.hat=cbind(rep(1,length(y)),x)%*%lm$coefficients
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 0.2704789

drewlee

Also, how can R-squared be greater than 1?

> x=rnorm(10)
> y=runif(10)
> lm=lm(y~x-1)
> y.hat=x*lm$coefficients
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 3.513865

drewlee

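Duncan Murdoch on r-help answered my questions, and now it all makes sense. I'll write it up here anyway; it may be useful to other students whose theoretical background is, like mine, not that solid.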

---------(1)----------- Does R2 always increase as variables are added?

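Answer: Yes, provided every model being compared includes an intercept and R^2 is defined as sum((y.hat-mean(y))^2)/sum((y-mean(y))^2). More precisely, R^2 never decreases as variables are added. The 0.4443356 from lm(y~x-1) in my earlier example is not comparable with the other two values, because that model has no intercept.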
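A minimal sketch of this point, assuming simulated data like in the earlier posts. The seed, the helper name `r2_centered`, and the three model objects are illustrative choices of mine, not part of the original session.

```r
set.seed(1)                           # arbitrary seed, only for reproducibility
n <- 10
x <- matrix(rnorm(2 * n), ncol = 2)   # two candidate regressors
y <- rnorm(n)                         # response unrelated to x, as in the thread

# Centered R^2: variation of the fitted values around mean(y),
# divided by the variation of y around mean(y).
r2_centered <- function(fit, y) {
  sum((fitted(fit) - mean(y))^2) / sum((y - mean(y))^2)
}

fit0 <- lm(y ~ 1)         # intercept only
fit1 <- lm(y ~ x[, 1])    # intercept + first column
fit2 <- lm(y ~ x)         # intercept + both columns

# Nested models, all with an intercept: the sequence should not decrease.
r2 <- c(r2_centered(fit0, y), r2_centered(fit1, y), r2_centered(fit2, y))
print(r2)
```

For models with an intercept, this hand-computed value should also agree with `summary(fit2)$r.squared`.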

---------(2)----------- Is R2 always smaller than 1?

Answer: No. It can be greater than 1 when the model has no intercept term and R^2 is still defined as sum((y.hat-mean(y))^2)/sum((y-mean(y))^2).
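A small numerical sketch of why the centered ratio is not bounded by 1 here. Without an intercept the residuals need not sum to zero, so the total sum of squares around mean(y) picks up a cross term. The seed and the names `ess`, `rss`, `tss`, and `cross` are my own, chosen for illustration.

```r
set.seed(2)                 # arbitrary seed, only for reproducibility
x <- rnorm(10)
y <- runif(10)              # mean(y) is far from 0, as in the example above

fit <- lm(y ~ x - 1)        # regression through the origin
e   <- resid(fit)

ess <- sum((fitted(fit) - mean(y))^2)   # "explained" part, centered at mean(y)
rss <- sum(e^2)                         # residual sum of squares
tss <- sum((y - mean(y))^2)             # centered total sum of squares

# Cross term from expanding (y - mean(y))^2 = ((y.hat - mean(y)) + e)^2
cross <- 2 * sum((fitted(fit) - mean(y)) * e)

print(all.equal(tss, ess + rss + cross))   # the full expansion always holds
print(sum(e))                              # generally nonzero without an intercept
print(ess / tss)                           # exceeds 1 whenever rss + cross < 0
```

With an intercept in the model the cross term is zero, which is what keeps ess/tss between 0 and 1.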

---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared calculated?

Answer: I tried the formula sum(y.hat^2)/sum(y^2) and found that it reproduces the value of summary(lm(y~x-1))$r.squared. This definition ensures that R^2 is never greater than 1.
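This check can be sketched as follows, and it also covers the original question: a column of ones passed in through the matrix with `- 1` in the formula gives the same fit as `y ~ x`, but `summary()` treats the model as having no intercept and reports the uncentered value. The seed and object names below are illustrative assumptions.

```r
set.seed(3)                          # arbitrary seed, only for reproducibility
x  <- matrix(rnorm(20), ncol = 2)
xx <- cbind(1, x)                    # constant column placed inside the matrix
y  <- drop(x %*% c(1, 2)) + rnorm(10)

fit_int <- lm(y ~ x)                 # intercept supplied by the formula
fit_xx  <- lm(y ~ xx - 1)            # same column space, but no formula intercept

# Identical fitted values ...
print(all.equal(unname(fitted(fit_int)), unname(fitted(fit_xx))))

# ... but different reported R^2 values.
r2_centered   <- sum((fitted(fit_int) - mean(y))^2) / sum((y - mean(y))^2)
r2_uncentered <- sum(fitted(fit_xx)^2) / sum(y^2)

print(c(summary(fit_int)$r.squared, r2_centered))    # these two should match
print(c(summary(fit_xx)$r.squared,  r2_uncentered))  # and so should these two
```

The formula's intercept attribute, `attr(terms(fit_xx), "intercept")`, is 0 for the second fit, which is what `summary()` looks at when choosing between the two definitions.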

wuguohui

That clears it up!

rtist

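I hadn't noticed this issue before, nor this thread.

When the model has no intercept, there is effectively no "corrected" total sum of squares, so subtracting mean(y) is no longer justified.

In that case, the definition R uses is the reasonable one.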
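The argument can be written out as a short derivation. This is a sketch in my own notation, where $e_i = y_i - \hat{y}_i$ are the least-squares residuals and $R^2_{\text{uncentered}}$ is a label introduced here for sum(y.hat^2)/sum(y^2).

```latex
% With an intercept, the normal equations give \sum_i e_i = 0 and
% \sum_i \hat{y}_i e_i = 0, so the corrected total sum of squares splits cleanly:
\[
  \sum_i (y_i - \bar{y})^2
    = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i^2 .
\]
% Without an intercept, only \sum_i \hat{y}_i e_i = 0 is guaranteed,
% so the decomposition that always holds is the uncorrected one:
\[
  \sum_i y_i^2 = \sum_i \hat{y}_i^2 + \sum_i e_i^2 ,
  \qquad
  R^2_{\text{uncentered}}
    = \frac{\sum_i \hat{y}_i^2}{\sum_i y_i^2} \in [0, 1] .
\]
```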