如何理解glm（generalized linear model）函数中的weights

Finley · 2020年9月11日

任务描述：需要对一组带权重的数据做logistic regression，这里的权重是指根据对象的年龄、性别、居住地等占总人数的比例确定的。

我在看help file中文档里对weights的解释是

an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector.

然后我直接采用如下的代码，但感觉好像不对。

model2.full <- glm(data = regress_data, score2~self+family+neighbor+coworker+friends+others+age+gender+province+urban+education+vocation+income,weights = weight,family = binomial(link = "logit")) summary(model2.full)

想知道是否应该带入这样的数据权重，以及在glm 中逻辑回归的weights的解释和意义

caimiao0714 · 2020年9月12日

这个weight就是抽样中用样本来推断总体的办法。假设现在全中国有10亿人，假设我们通过人口普查已经知道总体中20%是女性，80%是男性。在实际抽样中我们当然不可能调查10亿人。

假设现在我们样本只有2个人，一男一女，我们可以用这两个人直接放到回归模型里面得到参数。但是这里存在的问题是我们已经知道总体中的性别分布，而我们样本中的性别分布跟总体不同，也就是说我们的样本不代表总体，所以样本的回归中推断出的beta并不能反映总体的真实情况。

在这个例子中我们就可以用到weight。我们可以给男性权重4，女性权重1，这样再进行估计的参数就能够更“准确地”反映出总体中的参数。有一点需要指出的是，对于样本的两行数据，如果你把男性重复4行，女性还是一行，总共五行进行回归，你得到的beta会跟用weight得到的相同，但是标准误会小很多。因为实际上我们并没有真正拥有4行男性的数据，只是用weight来模拟真实总体的分布。如果将男性重复4次，意味着你真的调查了4位男性的数据，你的统计功效会比只调查一位男性但是用weight的情形大很多。

实际研究过程中做propensity score的时候用weight很多，常常都是用inverse probability weighting的办法进行分析。

nan.xiao · 2020年9月13日

关键的信息在 glm 文档的 Details 部分：

Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.

所以这里的 weight 不可以是比例，而必须是整数。

library("msaenet")
dat <- msaenet.sim.binomial(n = 1000, p = 20, snr = 5, coef = c(1, 15), p.train = 0.8, seed = 2020)

w <- rep(1, nrow(dat$x.tr))
w[which(dat$y.tr == 1)] <- 10
fit <- glm.fit(dat$x.tr, dat$y.tr, weights = w, family = binomial(link = "logit"))

fit <- glm.fit(dat$x.tr, dat$y.tr, weights = w / (sum(w)), family = binomial(link = "logit"))

# Warning message:
#   In eval(family$initialize) : non-integer #successes in a binomial glm!

Finley · 2020年9月15日

caimiao0714 很感谢您的回答，我想再追问一下，在您的例子中weight取4和1与weight取8亿和2亿有区别吗？因为在@nan.xiao#437540 的code中如果取比例的话似乎会带来一些问题？

caimiao0714 · 2020年9月15日

Finley 对于beta是没有影响的。但是我觉得对于标准误应该会有影响，我觉得weight同时乘以一个很大的数会使得标准误变小。这两个我觉得都不一定是对的。你可以做个简单的simulation试试告诉我们结果。

caimiao0714 · 2020年9月15日

nan.xiao 在propensity score做inverse probability weighting(IPW)的时候，这个1/p应该永远都不是整数，所以IPW的weighted regression跟这里的weight不是一个东西吗？

Finley · 2020年9月15日

caimiao0714
好的！我做了一下simulation，从结果上来看似乎非整数的weight同样能够运行，至于比例也主要是通过影响标准误进而影响p-value，不过从这里也可以看出如果weight直接按照归一化的方式取得很小的话，那么可能β会很不显著...

n=500

#weight1=c(1,1,...)
weight1<-rep(1,n)
#weight2=c(100,100,...)
weight2<-weight1*100
#weight3=c(0.99,2.31,0.03,...)
weight3<-abs(round(rnorm(n),2))
#weight4=c(99,231,3,...)
weight4<-weight3*100

x<-rnorm(n=n,sd=1)
y<-10*x+5+rnorm(n,sd=10)
y0<-1/(1+exp(-y))
y<-round(y0)
data<-data.frame(x,y,y0)


fit0<-glm(data = data,y~x,family = binomial(link = "logit"))
coef(fit0)
#(Intercept)      x 
#0.7941683   1.6303079 
confint(fit0)
#Waiting for profiling to be done...
#               2.5 %   97.5 %
#(Intercept) 0.5662565 1.032330
#x           1.3327177 1.956054
fit1<-glm(data = data,y~x,weights = weight1,family = binomial(link = "logit"))
coef(fit1)
#(Intercept)      x 
#0.7941683   1.6303079 
confint(fit1)
#Waiting for profiling to be done...
#               2.5 %   97.5 %
#(Intercept) 0.5662565 1.032330
#x           1.3327177 1.956054
fit2<-glm(data = data,y~x,weights = weight2,family = binomial(link = "logit"))
coef(fit2)
#(Intercept)      x 
#0.7941683   1.6303079 
confint(fit2)
#Waiting for profiling to be done...
#               2.5 %    97.5 %
#(Intercept) 0.7709583 0.8174808
#x           1.5993421 1.6615557
fit3<-glm(data = data,y~x,weights = weight3,family = binomial(link = "logit"))
coef(fit3)
#(Intercept)      x 
#0.8426232   1.7929444 
confint(fit3)
#Waiting for profiling to be done...
#               2.5 %   97.5 %
#(Intercept) 0.5751225 1.125583
#x           1.4425999 2.183978
#There were 22 warnings (use warnings() to see them)
fit4<-glm(data = data,y~x,weights = weight4,family = binomial(link = "logit"))
coef(fit4)
#(Intercept)      x 
#0.8426232   1.7929444
confint(fit4)
#Waiting for profiling to be done...
#              2.5 %    97.5 %
#(Intercept) 0.8152457 0.8701551
#x           1.7561727 1.8301234