我知道这个好像是数据存在 perfect separation,但是我自己看并不觉得好像存在 perfect separation:
str(data)
# tibble [5,906 × 4] (S3: tbl_df/tbl/data.frame)
# $ sex : Factor w/ 2 levels "F","M": 2 2 1 2 1 2 2 1 2 2 ...
# $ age : num [1:5906] 76.5 82.7 60 45.4 85 82 37.5 41.3 23.7 63.3 ...
# $ trt : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
# $ death: Factor w/ 2 levels "0","1": 1 2 2 1 1 2 1 1 1 2 ...
xtabs(~ death + trt, data)
# trt
# death 0 1
# 0 4360 175
# 1 1331 40
summary(xtabs(~ death + trt, data))
# Call: xtabs(formula = ~death + trt, data = data)
# Number of cases in table: 5906
# Number of factors: 2
# Test for independence of all factors:
# Chisq = 2.6591, df = 1, p-value = 0.103
但是我只要尝试做 logistic 回归就会发现有问题:
f_raw <- glm(death ~ trt,
data = data,
# control=glm.control(maxit = 100),
family = binomial(link = "logit"))
# Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred
f_raw <- glm(death ~ trt,
data = data,
control=glm.control(maxit = 100),
family = binomial(link = "logit"))
# Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred
f_adj <- glm(death ~ trt + sex + age,
data = data,
# control=glm.control(maxit = 100),
family = binomial(link = "logit"))
# Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred
f_adj <- glm(death ~ trt + sex + age,
data = data,
control=glm.control(maxit = 100),
family = binomial(link = "logit"))
# Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred
# summary(f_raw)
#
# Call:
# glm(formula = death ~ trt, family = binomial(link = "logit"),
# data = data, control = glm.control(maxit = 100))
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -8.49 -8.49 -8.49 0.00 8.49
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 7.690e+15 9.293e+05 8.275e+09 <2e-16 ***
# trt1 -3.796e+17 4.367e+06 -8.692e+10 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 6400.3 on 5905 degrees of freedom
# Residual deviance: 317184.1 on 5904 degrees of freedom
# AIC: 317188
#
# Number of Fisher Scoring iterations: 9
#
summary(f_adj)
#
# Call:
# glm(formula = death ~ trt + sex + age, family = binomial(link = "logit"),
# data = data, control = glm.control(maxit = 100))
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -8.49 -8.49 -8.49 0.00 8.49
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 4.726e+17 1.095e+06 4.317e+11 <2e-16 ***
# trt1 -1.765e+18 5.150e+06 -3.428e+11 <2e-16 ***
# sexM -1.426e+17 9.124e+05 -1.563e+11 <2e-16 ***
# age 1.245e+15 3.110e+03 4.002e+11 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 6400.3 on 5905 degrees of freedom
# Residual deviance: 317184.1 on 5902 degrees of freedom
# AIC: 317192
#
# Number of Fisher Scoring iterations: 6
这个时候我该怎么找到到底是哪里出现了 perfect separation 呢?
完整数据放在这里了: data.rda
备份 data.rda