sunjiadee
A question, please: my regressors are correlated with one another, so I used partial least squares (PLS). After extracting the first factor, none of the later factors passed the F test, yet R kept climbing as factors were added. Why is that? Since the columns of the X matrix are correlated, there must be redundant information, so why does R keep rising as more components are extracted? R should stop growing at some point.
Sorry for the trouble, and thank you.
rtist
R square NEVER decreases when more terms are added.
Adjusted R-sq, Mallows's Cp, AIC, BIC, ... might be what you want to check.
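A minimal numpy sketch (my own illustration, not from the thread; all names are made up) of rtist's point: appending pure-noise predictors to an OLS fit never lowers R-square, while adjusted R-square gets penalized for the junk terms:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # one real predictor

def r_squared(X, y):
    # OLS with intercept: R^2 = 1 - SSE / SST
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) ** 2).sum()

r2s, adj_r2s = [], []
X = x.reshape(-1, 1)
for _ in range(4):
    p = X.shape[1]
    r2 = r_squared(X, y)
    r2s.append(r2)
    adj_r2s.append(1.0 - (1.0 - r2) * (n - 1) / (n - p - 1))
    X = np.column_stack([X, rng.normal(size=n)])  # append a pure-noise column

print([round(v, 4) for v in r2s])      # monotonically nondecreasing
print([round(v, 4) for v in adj_r2s])  # penalized for model complexity
```

The first list can only go up (the smaller model is nested in the larger one), which is exactly the behavior sunjiadee observed.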
cran
But the model might be overparameterized.
rtist
My understanding is that even if the model is overparameterized, R-sq shouldn't decrease; at most it stays at the same level.
cran
But some parameters might not be significant.
And multicollinearity might occur.
rtist
It does NOT matter whether multicollinearity exists, in terms of R-square, because in linear models the predictors are known, or at least conditioned on. That is, the X's are treated as fixed rather than random quantities. Multicollinearity causes practical problems, but that is not to say that linear models require a no-multicollinearity assumption. Linear models make NO assumptions about the correlation structure of the predictors.
Now back to the problem: when more predictors are added to the model, the old model becomes nested in the new one. The bigger model has more parameters and will fit the data no worse than the smaller model, regardless of whether the newly added term is significant. Think of the likelihood ratio statistic: it can never be negative! R-square and the likelihood are equivalent in this sense: both measure how well the model fits the data, without accounting for model complexity.
Adjusted R-square, Mallows's Cp, AIC, BIC, and the like include components that measure both model fit and model complexity, and thus serve better as model-selection criteria.
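The nested-model argument above can be checked numerically. In this hypothetical sketch (made-up data, my own variable names), adding an irrelevant column still shrinks the SSE, so R-square rises, and the partial F statistic for the added term is nonnegative by construction, just like the likelihood ratio:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 60
x1 = rng.normal(size=n)
y = 1.5 * x1 + rng.normal(size=n)

def sse(X, y):
    # residual sum of squares from an OLS fit with intercept
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ beta
    return r @ r

X_small = x1.reshape(-1, 1)
X_big = np.column_stack([x1, rng.normal(size=n)])  # add one irrelevant column

sse_small, sse_big = sse(X_small, y), sse(X_big, y)
# Partial F for the added term: nonnegative, because SSE can only shrink
df_resid = n - X_big.shape[1] - 1
F = (sse_small - sse_big) / (sse_big / df_resid)
print(sse_big <= sse_small, round(F, 3))
```

The added term typically fails its F test (F is small), yet SSE still drops and R-square still rises, which is the pattern in the original question.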
abel
To add to rtist's explanation (I'm not sure my wording is precise):
Under the five M-G (Gauss-Markov) assumptions, the independent variables are fixed values, not random variables, so there is no correlation in the probabilistic sense; multicollinearity does not violate the model's assumptions.
However, computing the estimates from the sample requires inverse(X'X). If the determinant of that matrix is very small, the parameter estimates become extremely unstable, and every statistic that depends on it is affected, at the very least by observation error.
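A quick numerical illustration of that instability (a sketch with made-up data, not from the post): with two nearly collinear columns, X'X is badly conditioned, and a tiny perturbation of y can shift the individual coefficients noticeably, even though their sum stays stable:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)   # almost an exact copy of x1
y = x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
cond = np.linalg.cond(X.T @ X)
print(f"cond(X'X) = {cond:.2e}")      # huge: inverting X'X is numerically fragile

beta = np.linalg.lstsq(X, y, rcond=None)[0]
beta_pert = np.linalg.lstsq(X, y + 1e-3 * rng.normal(size=n), rcond=None)[0]
print(beta[1:], beta_pert[1:])        # individual slopes are poorly determined
print(beta[1] + beta[2], beta_pert[1] + beta_pert[2])  # their sum is stable
```

Only the nearly collinear direction is fragile; the fitted values and well-determined linear combinations of the coefficients remain reliable.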
There are already mature methods for handling this situation. See the literature on the topic: 《多元分析引论》 (Introduction to Multivariate Analysis, Science Press) covers it, as do 《高等计量经济学》 (Advanced Econometrics) and similar books; there are quite a few methods.
As for the indicators rtist mentioned (I call them "indicators" because some are not, strictly speaking, statistics), they are essential to consider whenever multicollinearity is present. Whenever I see someone run a multiple regression without mentioning them, I suggest they reconsider the model. That R-square never decreases as predictors are added can be derived directly from its formula.
Personally, I think that in the multivariate case, diagnosing the properties of the random term (the residuals) is even more important, since most of the five assumptions concern it. 《多元分析引论》 (Science Press) has a fairly detailed discussion. If you're interested, it's worth deriving these indicators yourself to see where they come from, and then checking the assumptions one by one; that should give you a fairly detailed sense of how to handle things.
One last note: in practice the independent variables are not necessarily fixed values. Sampling introduces randomness, and so does observation error. There are models that treat the independent variables as random as well; there the multicollinearity problem becomes prominent, and the question of correlation between the regressors and the error term also arises. In short, the model becomes much more complex. That said, in practical work one sometimes has to strengthen seemingly unreasonable assumptions; in research it's the opposite, of course: consider as many special and boundary cases as possible, and something interesting may turn up.
Learning from rtist, hehe.
rtist
a typo: it might be inverse(X'X) if X is the model matrix
It's really me who should be learning from abel.
abel
Thanks, rtist, very attentive of you. I've always been a bit careless; embarrassing!
xqy3406320
You're both so modest! I'll learn from you both!
liuxihe0405
Learning from this!