Help needed: a question about running logistic regression in SAS

怎么可能

After running PROC LOGISTIC I get this message: There is possibly a quasi-complete separation of data points. The maximum likelihood estimate may not exist.

Could one of the experts here explain the cause?

outsider

There exists a linear combination of the regressors that almost perfectly predicts the response variable. This condition causes the maximum likelihood iteration either to fail or to produce extremely large estimates. Complete separation is the special case in which the prediction is perfect. You may need to identify the regressor responsible and make some adjustment.

hongtianli

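This message indicates that the data are completely linearly separable in the independent variables, so no conclusion can be drawn.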

怎么可能

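[quote]Quoting reply #2, posted by hongtianli on 2007-10-10 11:55:

This message indicates that the data are completely linearly separable in the independent variables, so no conclusion can be drawn.[/quote]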

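What does it mean for the independent variables to be completely linearly separable? Could you explain it with an example?

Thank you very much.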

hongtianli

Existence of Maximum Likelihood Estimates

The likelihood equation for a logistic regression model does not always have a finite solution. Sometimes there is a nonunique maximum on the boundary of the parameter space, at infinity. The existence, finiteness, and uniqueness of maximum likelihood estimates for the logistic regression model depend on the patterns of data points in the observation space (Albert and Anderson 1984; Santner and Duffy 1986). The existence checks are not performed for conditional logistic regression.

Consider a binary response model. Let Yj be the response of the jth subject and let xj be the vector of explanatory variables (including the constant 1 associated with the intercept). There are three mutually exclusive and exhaustive types of data configurations: complete separation, quasi-complete separation, and overlap.

Complete Separation

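There is a complete separation of data points if there exists a vector b that correctly allocates all observations to their response groups; that is,

$$
\begin{cases}
\mathbf{b}'\mathbf{x}_j > 0 & \text{if } Y_j = 1 \\
\mathbf{b}'\mathbf{x}_j < 0 & \text{if } Y_j = 2
\end{cases}
$$

(Here 1 and 2 label the two response groups; in the examples further down they are coded 1 and 0.)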

This configuration gives nonunique infinite estimates. If the iterative process of maximizing the likelihood function is allowed to continue, the log likelihood diminishes to zero, and the dispersion matrix becomes unbounded.

Quasi-Complete Separation

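The data are not completely separable but there is a vector b such that

$$
\begin{cases}
\mathbf{b}'\mathbf{x}_j \ge 0 & \text{if } Y_j = 1 \\
\mathbf{b}'\mathbf{x}_j \le 0 & \text{if } Y_j = 2
\end{cases}
$$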

and equality holds for at least one subject in each response group. This configuration also yields nonunique infinite estimates. If the iterative process of maximizing the likelihood function is allowed to continue, the dispersion matrix becomes unbounded and the log likelihood diminishes to a nonzero constant.

Overlap

If neither complete nor quasi-complete separation exists in the sample points, there is an overlap of sample points. In this configuration, the maximum likelihood estimates exist and are unique.

Examples:

1. When you submit:

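/* y = 1 occurs only at x = 2 and y = 0 only at x < 0, so x splits the two groups perfectly */
data a;
  input y x;
  cards;
1 2
0 -1
1 2
0 -1
1 2
0 -1
1 2
0 -1
1 2
0 -1
1 2
0 -2
1 2
0 -1
1 2
0 -1
;
run;

/* Fit the logistic model of y on x */
proc logistic data=a;
  model y = x;
run;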

Then the log shows:

"WARNING: There is a complete separation of data points. The maximum likelihood estimate does not exist.

WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable."
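To see numerically why no finite estimate exists here, the following is a minimal Python sketch (not SAS output) that evaluates the log likelihood on the same 16 observations. The helper names `log_sigmoid` and `log_likelihood` are my own, and holding the intercept at 0 is an assumption made only to keep the illustration short. As the slope grows, the log likelihood keeps rising toward zero, so the iteration never finds a maximum.

```python
import math

# Same 16 observations as the SAS data step above (y, x pairs in order).
ys = [1, 0] * 8
xs = [2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -2, 2, -1, 2, -1]

def log_sigmoid(z):
    """Return log(1 / (1 + exp(-z))) without overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(intercept, slope, xs, ys):
    """Binary logistic log likelihood, modelling P(y = 1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = intercept + slope * x
        total += log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    return total

# Intercept fixed at 0 (illustrative assumption); only the slope is increased.
slopes = (1, 5, 10, 20)
values = [log_likelihood(0.0, s, xs, ys) for s in slopes]
for s, v in zip(slopes, values):
    print(s, v)
```

Every larger slope gives a strictly better fit, which is what "the log likelihood diminishes to zero" means in the definition of complete separation.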

2. And when you submit:

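/* Same idea, but x = 0 now appears in both groups, so the groups touch at x = 0 */
data a;
  input y x;
  cards;
1 0
0 -1
1 2
0 -1
1 2
0 -1
1 2
0 -1
1 2
0 -1
1 2
0 -2
1 2
0 0
1 2
0 -1
;
run;

/* List the data to confirm it was read correctly */
proc print data=a; run;

/* Fit the logistic model of y on x */
proc logistic data=a;
  model y = x;
run;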

Then the log shows:

"WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood estimate may not exist.

WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable."
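The same kind of numerical check can be done for this second data set. This is again only a sketch in Python with hypothetical helper names, and the intercept is held at 0 as a simplifying assumption. The two observations at x = 0 (one in each group) always get probability 0.5, so each contributes log(0.5) no matter how large the slope is. The log likelihood therefore keeps rising but levels off at 2*log(0.5), a nonzero constant, instead of reaching zero.

```python
import math

# Same 16 observations as the second SAS data step (y, x pairs in order).
ys_quasi = [1, 0] * 8
xs_quasi = [0, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -2, 2, 0, 2, -1]

def log_sigmoid(z):
    """Return log(1 / (1 + exp(-z))) without overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(intercept, slope, xs, ys):
    """Binary logistic log likelihood, modelling P(y = 1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = intercept + slope * x
        total += log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    return total

# Limit along this path: only the two tied points at x = 0 still contribute.
limit = 2 * math.log(0.5)

slopes = (1, 5, 10, 20, 50)
values_quasi = [log_likelihood(0.0, s, xs_quasi, ys_quasi) for s in slopes]
for s, v in zip(slopes, values_quasi):
    print(s, v, "gap to limit:", limit - v)
```

The shrinking gap shows the slope estimate drifting toward infinity while the fit improves by less and less, which matches the "possibly a quasi-complete separation" warning.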

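In summary: when the regression coefficients are estimated iteratively by maximizing the likelihood function, no finite estimates can be obtained under complete separation or quasi-complete separation; estimates can be obtained only when the data overlap. Linear separability means that a single line clearly divides the two groups with no crossover, so the groups are already separated without any estimation.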
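For a model with one numeric predictor plus an intercept, the three configurations can be checked by comparing the ranges of x in the two groups. The sketch below is a rough Python illustration, not what PROC LOGISTIC does internally; the function name `separation_type` is hypothetical. It assumes y is coded 0/1 and that x is not constant. With several predictors the check is harder, because a separating linear combination has to be searched for.

```python
def separation_type(xs, ys):
    """Classify data for a logistic model with one predictor and an intercept.

    Assumes ys contains only 0 and 1, and that xs is not constant.
    """
    ones = [x for x, y in zip(xs, ys) if y == 1]
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    if not ones or not zeros:
        raise ValueError("both response groups must be present")

    # Either group may lie on the low side, so try both orientations.
    orientations = ((zeros, ones), (ones, zeros))

    # Complete separation: a gap lies strictly between the two groups.
    if any(max(low) < min(high) for low, high in orientations):
        return "complete separation"

    # Quasi-complete separation: the groups touch at one boundary value only.
    if any(max(low) == min(high) for low, high in orientations):
        return "quasi-complete separation"

    return "overlap"

# The two data sets from the examples above, plus an overlapping one.
first = ([2, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -2, 2, -1, 2, -1], [1, 0] * 8)
second = ([0, -1, 2, -1, 2, -1, 2, -1, 2, -1, 2, -2, 2, 0, 2, -1], [1, 0] * 8)
mixed = ([1, 2, 3, 4], [0, 1, 0, 1])

for xs, ys in (first, second, mixed):
    print(separation_type(xs, ys))
```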

huadli

I finally understand. Thank you!

losttemple

Many thanks to hongtianli for the explanation.