Help! Debugging an R function

areg

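While working through Chapter 7 of the handbook for the HSAUR package, I hit a problem: the machine ran for half an hour without producing a result. I have copied the expression and the steps leading up to it below. I hope someone can try it out and, ideally, annotate what to watch for at each step, so that I can learn from it.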

##################################################

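# Negative log-likelihood of a two-component normal mixture.
# param = c(p, mu1, sd1, mu2, sd2); x = the observations.
logL <- function(param, x)
{
    d1 <- dnorm(x, mean = param[2], sd = param[3])
    d2 <- dnorm(x, mean = param[4], sd = param[5])
    -sum(log(param[1] * d1 + (1 - param[1]) * d2))
}

# Load the data before it is first used.
data("faithful", package = "datasets")
x <- faithful$waiting

# Minimise the negative log-likelihood under box constraints.
startparam <- c(p = 0.5, mu1 = 50, sd1 = 3, mu2 = 80, sd2 = 3)
opp <- optim(startparam, logL, x = faithful$waiting,
             method = "L-BFGS-B", lower = c(0.01, rep(1, 4)),
             upper = c(0.99, rep(200, 4)))

# Fit the same kind of mixture by model-based clustering.
library("mclust")
mc <- Mclust(faithful$waiting)

library("boot")

# Statistic for boot(): refit a two-component mixture to the resample
# x[indx] and return (p, mu1, mu2), ordered so that p is always the
# smaller mixing proportion and mu1 the mean of that component.
fit <- function(x, indx)
{
    a <- Mclust(x[indx], minG = 2, maxG = 2)$parameters
    if (a$pro[1] < 0.5)
        return(c(p = a$pro[1], mu1 = a$mean[1], mu2 = a$mean[2]))
    return(c(p = 1 - a$pro[1], mu1 = a$mean[2], mu2 = a$mean[1]))
}

bootpara <- boot(faithful$waiting, fit, R = 1000) # this step runs for ages without returning a result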

###################################################
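One way to tell a slow run from a hung one is to make the statistic report its own progress. The sketch below is only an illustration: `with_progress` and `mean_stat` are hypothetical helpers that are not part of the boot package, and a cheap mean is used as a stand-in for the Mclust-based `fit` so that the example finishes quickly.

```r
library(boot)

# Hypothetical helper (not part of boot): wrap a statistic so that it
# prints a message every `every` calls.
with_progress <- function(statistic, every = 10) {
  calls <- 0
  function(data, indices, ...) {
    calls <<- calls + 1
    if (calls %% every == 0) message("statistic called ", calls, " times")
    statistic(data, indices, ...)
  }
}

# Cheap stand-in statistic; in the thread the expensive `fit`
# function would be passed here instead.
mean_stat <- function(x, indx) mean(x[indx])

set.seed(1)
bp <- boot(faithful$waiting, with_progress(mean_stat, every = 25), R = 100)
```

Because boot() also evaluates the statistic once on the original data, the counter ends one call ahead of the number of replicates.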

huadeng

Which library is the boot function in? When I run this, the boot function cannot be found.

areg

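[quote]Quoting huadeng (post #1, 2006-11-24 21:36):

Which library is the boot function in? When I run this, the boot function cannot be found.[/quote]

It is in the R package "boot", which is about 1 MB.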

areg

boot(boot)

Bootstrap Resampling

Description

Generate R bootstrap replicates of a statistic applied to data. Both parametric and nonparametric resampling are possible. For the nonparametric bootstrap, possible resampling methods are the ordinary bootstrap, the balanced bootstrap, antithetic resampling, and permutation. For nonparametric multi-sample problems stratified resampling is used. This is specified by including a vector of strata in the call to boot. Importance resampling weights may be specified.

Usage

boot(data, statistic, R, sim="ordinary", stype="i",

strata=rep(1,n), L=NULL, m=0, weights=NULL,

ran.gen=function(d, p) d, mle=NULL, ...)

Arguments

data

The data as a vector, matrix or data frame. If it is a matrix or data frame then each row is considered as one multivariate observation.

statistic

A function which when applied to data returns a vector containing the statistic(s) of interest. When sim="parametric", the first argument to statistic must be the data. For each replicate a simulated dataset returned by ran.gen will be passed. In all other cases statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample. Further, if predictions are required, then a third argument is required which would be a vector of the random indices used to generate the bootstrap predictions. Any further arguments can be passed to statistic through the ... argument.

R

The number of bootstrap replicates. Usually this will be a single positive integer. For importance resampling, some resamples may use one set of weights and others use a different set of weights. In this case R would be a vector of integers where each component gives the number of resamples from each of the rows of weights.

……

Details

The statistic to be bootstrapped can be as simple or complicated as desired as long as its arguments correspond to the dataset and (for a nonparametric bootstrap) a vector of indices, frequencies or weights. statistic is treated as a black box by the boot function and is not checked to ensure that these conditions are met.
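To make the "data plus a vector of indices" contract concrete, here is a hand-rolled sketch of the ordinary nonparametric bootstrap in base R. It is not how boot() is implemented internally; the names `stat`, `t0` and `t_star` are chosen for illustration, and the median is just an example statistic.

```r
x <- faithful$waiting
n <- length(x)

# A statistic in the form boot() expects for stype = "i":
# first the original data, then the indices defining the resample.
stat <- function(data, indices) median(data[indices])

set.seed(42)
R <- 200

# Observed value: the statistic on the untouched data.
t0 <- stat(x, seq_len(n))

# Each replicate draws n indices with replacement and re-evaluates.
t_star <- vapply(seq_len(R),
                 function(r) stat(x, sample.int(n, n, replace = TRUE)),
                 numeric(1))

summary(t_star)
```

The cost of a run is therefore roughly R times the cost of one call to the statistic, which is why an expensive statistic makes the whole bootstrap slow.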

The first order balanced bootstrap is described in Davison, Hinkley and Schechtman (1986). The antithetic bootstrap is described by Hall (1989) and is experimental, particularly when used with strata. The other non-parametric simulation types are the ordinary bootstrap (possibly with unequal probabilities), and permutation which returns random permutations of cases. All of these methods work independently within strata if that argument is supplied.

For the parametric bootstrap it is necessary for the user to specify how the resampling is to be conducted. The best way of accomplishing this is to specify the function ran.gen which will return a simulated data set from the observed data set and a set of parameter estimates specified in mle.

Value

The returned value is an object of class "boot", containing the following components

t0 The observed value of statistic applied to data.

t A matrix with R rows each of which is a bootstrap replicate of statistic.

R The value of R as passed to boot.

data The data as passed to boot.

seed The value of .Random.seed when boot was called.

statistic The function statistic as passed to boot.
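The components listed above are enough to recompute the "bias" and "std. error" columns that printing a boot object shows for an ordinary bootstrap. A minimal sketch, using the mean of `faithful$waiting` as a cheap example statistic (the names `b`, `bias` and `se` are illustrative):

```r
library(boot)

set.seed(123)
b <- boot(faithful$waiting, function(x, i) mean(x[i]), R = 200)

# t0 is the statistic on the original data; t holds one row per replicate.
bias <- mean(b$t[, 1]) - b$t0   # average replicate minus observed value
se   <- sd(b$t[, 1])            # spread of the replicates

c(original = b$t0, bias = bias, std.error = se)
```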

Examples

# usual bootstrap of the ratio of means using the city data

ratio <- function(d, w)

sum(d$x * w)/sum(d$u * w)

boot(city, ratio, R=999, stype="w")

# Stratified resampling for the difference of means. In this

# example we will look at the difference of means between the final

# two series in the gravity data.

diff.means <- function(d, f)

{ n <- nrow(d)

gp1 <- 1:table(as.numeric(d$series))[1]

m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])

m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])

ss1 <- sum(d[gp1,1]^2 * f[gp1]) -

(m1 * m1 * sum(f[gp1]))

ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) -

(m2 * m2 * sum(f[-gp1]))

c(m1-m2, (ss1+ss2)/(sum(f)-2))

}

grav1 <- gravity[as.numeric(gravity[,2])>=7,]

boot(grav1, diff.means, R=999, stype="f", strata=grav1[,2])

# In this example we show the use of boot in a prediction from

# regression based on the nuclear data. This example is taken

# from Example 6.8 of Davison and Hinkley (1997). Notice also

# that two extra arguments to statistic are passed through boot.

nuke <- nuclear[,c(1,2,5,7,8,10,11)]

nuke.lm <- glm(log(cost)~date+log(cap)+ne+ ct+log(cum.n)+pt, data=nuke)

nuke.diag <- glm.diag(nuke.lm)

nuke.res <- nuke.diag$res*nuke.diag$sd

nuke.res <- nuke.res-mean(nuke.res)

# We set up a new data frame with the data, the standardized

# residuals and the fitted values for use in the bootstrap.

nuke.data <- data.frame(nuke,resid=nuke.res,fit=fitted(nuke.lm))

# Now we want a prediction of plant number 32 but at date 73.00

new.data <- data.frame(cost=1, date=73.00, cap=886, ne=0,

ct=0, cum.n=11, pt=1)

new.fit <- predict(nuke.lm, new.data)

nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred)

{

assign(".inds", inds, envir=.GlobalEnv)

lm.b <- glm(fit+resid[.inds] ~date+log(cap)+ne+ct+

log(cum.n)+pt, data=dat)

pred.b <- predict(lm.b,x.pred)

remove(".inds", envir=.GlobalEnv)

c(coef(lm.b), pred.b-(fit.pred+dat$resid[i.pred]))

}

nuke.boot <- boot(nuke.data, nuke.fun, R=999, m=1,

fit.pred=new.fit, x.pred=new.data)

# The bootstrap prediction error would then be found by

mean(nuke.boot$t[,8]^2)

# Basic bootstrap prediction limits would be

new.fit-sort(nuke.boot$t[,8])[c(975,25)]

# Finally a parametric bootstrap. For this example we shall look

# at the air-conditioning data. In this example our aim is to test

# the hypothesis that the true value of the index is 1 (i.e. that

# the data come from an exponential distribution) against the

# alternative that the data come from a gamma distribution with

# index not equal to 1.

air.fun <- function(data)

{ ybar <- mean(data$hours)

para <- c(log(ybar),mean(log(data$hours)))

ll <- function(k) {

if (k <= 0) out <- 1e200 # not NA

else out <- lgamma(k)-k*(log(k)-1-para[1]+para[2])

out

}

khat <- nlm(ll,ybar^2/var(data$hours))$estimate

c(ybar, khat)

}

air.rg <- function(data, mle)

# Function to generate random exponential variates. mle will contain

# the mean of the original data

{ out <- data

out$hours <- rexp(nrow(out), 1/mle)

out

}

air.boot <- boot(aircondit, air.fun, R=999, sim="parametric",

ran.gen=air.rg, mle=mean(aircondit$hours))

# The bootstrap p-value can then be approximated by

sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)

huadeng

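Heh, my mistake. I have debugged it a bit, and I suspect the problem is in the last two steps. I know very little about the bootstrap, so I cannot explain it. Sorry about that.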

areg

The examples in the help page all run fine, but I still have not found what is wrong with the example from the handbook.

huadeng

I will go and take a look at the handbook.

yihui

I did not dare try 1000 replicates; even 50 bootstrap replicates were quite slow:

> system.time(bootpara <- boot(faithful$waiting, fit, R = 50))

[1] 39.66 0.16 45.05 NA NA

Then again, my computer only has 256 MB of RAM.
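A timing from a small run can be scaled up to guess how long a larger one will take. The sketch below uses the elapsed time printed above and assumes, as a simplification, that cost grows linearly with R on the same machine; `est_minutes` is a hypothetical helper name.

```r
# Timing reported above: 50 replicates, 45.05 s elapsed
# (the third value printed by system.time).
elapsed_sec <- 45.05
reps_timed  <- 50

sec_per_rep <- elapsed_sec / reps_timed   # about 0.9 s per replicate

# Assumption: run time is proportional to the number of replicates.
est_minutes <- function(R) sec_per_rep * R / 60

est_minutes(c(100, 500, 1000))
```

Under that assumption, 1000 replicates would need about 15 minutes on this particular machine; other hardware, or memory pressure, will change the figure.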

areg

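With 100 replicates mine does finish, in about a minute. With 1000 I have tried several times; the longest attempt ran for 45 minutes without a result.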

huadeng

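The program is fine. I ran 100 replicates, but it was slow: my computer is dual-core with 512 MB of RAM, and it took a little over a minute. The results: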

> bootpara <- boot(faithful$waiting, fit, R = 100)

> bootpara

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = faithful$waiting, statistic = fit, R = 100)

Bootstrap Statistics :

      original        bias    std. error
t1*  0.3610159 -0.003395616   0.03105552
t2* 54.6191115 -0.014340831   0.77774902
t3* 80.0938427 -0.235098402   1.81724872

>

areg

> bootpara <- boot(faithful$waiting, fit, R = 500)

> bootpara

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = faithful$waiting, statistic = fit, R = 500)

Bootstrap Statistics :

      original        bias    std. error
t1*  0.3610159 -0.007233957   0.03717666
t2* 54.6191115 -0.115011885   0.92811284
t3* 80.0938427 -0.411865915   2.66701469

>

## 500 replicates took 9 minutes

anning189

The program ran for a long time without finishing; I have not yet looked closely at why.

areg

That damned handbook, setting R so high. I thought I had done something wrong.

Thanks to the two posters above for the help.

areg

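[quote]Quoting anning189 (post #11, 2006-11-24 22:36):

The program ran for a long time without finishing; I have not yet looked closely at why.[/quote]

huadeng just called me long-distance and said that this kind of resampling is simply slow. So for study purposes, lower R from the handbook's 1000 to 100, and most people should be able to run it. With R = 500, my machine (P4 1.7 GHz, 512 MB RAM, 2 GB virtual memory) took almost 10 minutes.

Thanks, everyone, for the support.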

anning189

Sigh, my machine is not great.

AMD 2500+, 256 MB DDR, 2 GB virtual memory: a low-end configuration.

fu_neng

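When tinkering with the bootstrap, Monte Carlo permutation and the like, I suggest using fewer randomizations; in biology, reaching a 95% confidence level is enough. If you have a lot of data, get a Linux system. I have tried it on Linux, and it was nearly 10 times more efficient. Windows does not support scientific computing software as well as Unix-like systems do.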

rtist

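[quote]Quoting fu_neng (post #15, 2006-12-17 16:51):

When tinkering with the bootstrap, Monte Carlo permutation and the like, I suggest using fewer randomizations; in biology, reaching a 95% confidence level is enough. If you have a lot of data, get a Linux system. I have tried it on Linux, and it was nearly 10 times more efficient. Windows does not support scientific computing software as well as Unix-like systems do.[/quote]How large is the error of a 95% CI obtained from 100 replicates? If you are using the bootstrap to estimate a variance, 100 replicates may be about enough. If you are after a CI, that is far from enough.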
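The point about variance versus confidence intervals can be checked empirically by repeating the whole bootstrap several times and seeing how much the results move from run to run. The following is a rough sketch in base R, using the mean of `faithful$waiting` as a cheap stand-in statistic; `one_boot`, `mc_spread` and the choice of 40 repeats are assumptions made for illustration.

```r
x <- faithful$waiting
n <- length(x)

# One bootstrap run: returns the standard error estimate and the
# upper end of a 95% percentile interval.
one_boot <- function(R) {
  t_star <- replicate(R, mean(x[sample.int(n, n, replace = TRUE)]))
  c(se = sd(t_star), upper = unname(quantile(t_star, 0.975)))
}

# Repeat the run and measure how much each quantity varies
# purely because of the random resampling.
mc_spread <- function(R, repeats = 40) {
  runs <- replicate(repeats, one_boot(R))   # 2 x repeats matrix
  apply(runs, 1, sd)
}

set.seed(2006)
spread_100  <- mc_spread(100)
spread_1000 <- mc_spread(1000)

rbind(R100 = spread_100, R1000 = spread_1000)
```

Comparing each row against the size of the standard error itself gives a feel for how much of a reported interval endpoint is resampling noise at a given R.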

rtist

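[quote]Quoting huadeng (post #9, 2006-11-24 22:26):

The program is fine. I ran 100 replicates, but it was slow: my computer is dual-core with 512 MB of RAM, and it took a little over a minute. The results:

> bootpara <- boot(faithful$waiting, fit, R = 100)

> bootpara

ORDINARY NONPARAMETRIC BOOTSTRAP

.......[/quote]Can R even use the other core? Nothing in this program explicitly asks for both cores, so a dual-core machine will usually behave like a single core, or the process will just hop back and forth between the two cores.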

fu_neng

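[quote]Quoting rtist (post #16, 2006-12-17 23:25):

How large is the error of a 95% CI obtained from 100 replicates? If you are using the bootstrap to estimate a variance, 100 replicates may be about enough. If you are after a CI, that is far from enough.[/quote]

For example, when estimating a CI for Ripley's K, a 99% confidence level requires 100 simulations.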

fu_neng

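[quote]Quoting rtist (post #17, 2006-12-17 23:28):

Can R even use the other core? Nothing in this program explicitly asks for both cores, so a dual-core machine will usually behave like a single core, or the process will just hop back and forth between the two cores.[/quote]

I really had not paid attention to whether the run on the Linux server was set up to use all of the CPUs. Could you tell me how you configure it for multiple cores?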
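For readers on a current R installation: later versions of the boot package accept `parallel` and `ncpus` arguments, which did not exist when this thread was written. A minimal sketch, assuming such a version is installed and using a cheap mean statistic in place of the Mclust-based `fit`; the cap of two workers is an arbitrary choice for the example.

```r
library(boot)
library(parallel)

# Cheap stand-in statistic for illustration.
stat_mean <- function(x, idx) mean(x[idx])

# Use at most two workers; detectCores() may return NA on some systems.
n_cores <- max(1L, min(2L, detectCores(), na.rm = TRUE))

# Forking ("multicore") is not available on Windows, so fall back
# to a socket cluster ("snow") there.
backend <- if (.Platform$OS.type == "windows") "snow" else "multicore"

set.seed(1)
b_par <- boot(faithful$waiting, stat_mean, R = 200,
              parallel = backend, ncpus = n_cores)

b_par
```

With the "snow" backend each worker starts as a fresh R session, so a statistic that depends on another package (such as mclust) would need that package loaded inside the statistic function or on the workers.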