An Example of the Bootstrap Method in SAS

abel

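This material comes from the web. Someone asked me about the bootstrap and insisted on doing it in SAS, so I am reposting it here.

The examples in it come from An Introduction to the Bootstrap by Efron and Tibshirani.

In R, the bootstrap is handled very well by the boot package, which I recommend. The later part of An Introduction to the Bootstrap itself uses S!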

/*********************************************************************

name: jackboot

title: Jackknife and Bootstrap Analyses

product: stat

system: all

support:

update: 21Sep95

DISCLAIMER:

THIS INFORMATION IS PROVIDED BY SAS INSTITUTE INC. AS A SERVICE

TO ITS USERS. IT IS PROVIDED "AS IS". THERE ARE NO WARRANTIES,

EXPRESSED OR IMPLIED, AS TO MERCHANTABILITY OR FITNESS FOR A

PARTICULAR PURPOSE REGARDING THE ACCURACY OF THE MATERIALS OR CODE

CONTAINED HEREIN.

Introduction

------------

The %JACK macro does jackknife analyses for simple random samples,

computing approximate standard errors, bias-corrected estimates, and

confidence intervals assuming a normal sampling distribution.

The %BOOT macro does elementary nonparametric bootstrap analyses for

simple random samples, computing approximate standard errors,

bias-corrected estimates, and confidence intervals assuming a normal

sampling distribution. Also, for regression models, the %BOOT macro can

resample either observations or residuals.

The %BOOTCI macro computes several varieties of confidence intervals

that are suitable for sampling distributions that are not normal.

In order to use the %JACK or %BOOT macros, you need to know enough about

the SAS macro language to write simple macros yourself. See _The SAS

Guide to Macro Processing_ for information on the SAS macro language.

This document does not explain how the jackknife and bootstrap are

performed or how the various confidence intervals are computed, but does

provide some advice and caveats regarding usage. For an elementary

introduction, see Dixon in the bibliography below. There is a thorough

exposition in E&T that should be accessible to anyone who has done a

year or more of statistical study.

There is a widespread myth that bootstrapping is a magical spell to

perform valid statistical inference on _anything_. S&T dispel this myth

very effectively and very technically. For an elementary demonstration

of the dangers of bootstrapping, see the "Cautionary Example" below.

The Jackknife

-------------

The jackknife works only for statistics that are smooth functions of the

data. Statistics that are not smooth functions of the data, such as

quantiles, may yield inconsistent jackknife estimates. The best results

are obtained with statistics that are linear functions of the data. For

highly nonlinear statistics, the jackknife can be inaccurate. See S&T,

chapter 2, for a detailed discussion of the validity of the jackknife.
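
As a rough illustration of the leave-one-out computation (not the %JACK
macro itself), the following sketch jackknifes the LSAT/GPA correlation
by hand. The data set names LOO_DATA, LOO_CORR, LOO_DIST, LOO_SUM, and
LOO_STAT are hypothetical, and the standard error and bias formulas are
the usual textbook jackknife formulas, assumed here rather than taken
from the macro source.

```sas
/* Law school data as listed later in this document. */
data law;
   input lsat gpa @@;
   datalines;
576 3.39  635 3.30  558 2.81  578 3.03  666 3.44
580 3.07  555 3.00  661 3.43  651 3.36  605 3.13
653 3.12  575 2.74  545 2.76  572 2.88  594 2.96
;

/* Leave-one-out resamples: resample k omits observation k. */
data loo_data;
   set law nobs=n;
   _obs_ = _n_;
   do _sample_ = 1 to n;
      if _sample_ ne _obs_ then output;
   end;
run;

proc sort data=loo_data;
   by _sample_;
run;

/* Correlation within each resample, using BY processing. */
proc corr noprint data=loo_data outp=loo_corr;
   by _sample_;
   var lsat gpa;
run;

data loo_dist;
   set loo_corr;
   where _type_ = 'CORR' and upcase(_name_) = 'LSAT';
   corr = gpa;
   keep _sample_ corr;
run;

/* Full-sample estimate, stored in a macro variable. */
proc corr noprint data=law outp=full_corr;
   var lsat gpa;
run;

data _null_;
   set full_corr;
   where _type_ = 'CORR' and upcase(_name_) = 'LSAT';
   call symputx('theta_hat', gpa);
run;

/* n, mean, and corrected sum of squares of the leave-one-out values. */
proc means noprint data=loo_dist;
   var corr;
   output out=loo_sum n=n mean=theta_dot css=css;
run;

data loo_stat;
   set loo_sum;
   theta_hat = &theta_hat;
   stderr    = sqrt((n - 1) / n * css);            /* jackknife standard error */
   bias      = (n - 1) * (theta_dot - theta_hat);  /* jackknife bias estimate  */
   corrected = theta_hat - bias;
   keep n theta_hat stderr bias corrected;
run;

proc print data=loo_stat;
run;
```

Because the correlation is a smooth function of the data, it is the kind
of statistic for which this computation is expected to behave well.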

The Bootstrap

-------------

Bootstrap estimates of standard errors are valid for many commonly-used

statistics, generally requiring no major assumptions other than simple

random sampling and finite variance. There do exist some statistics for

which the standard error estimates will fail, such as the maximum or

minimum. The bootstrap standard error is consistent for some nonsmooth

statistics such as the median. However, the bootstrap standard error may

not be consistent even for very smooth statistics when the population

distribution has very heavy tails. Inconsistency of the usual bootstrap

estimators can often be remedied by using a resample size m(n) that is

smaller than the sample size n, so that m(n)->infinity and m(n)/n->0 as

n->infinity. Theoretical results on the consistency of the bootstrap

standard error are not extensive. See S&T, chapter 3, for details.
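
The bootstrap standard error described above can be sketched without
the macros as follows. This is a minimal illustration using uniform
resampling with replacement; the data set names RESAMPLES, BOOTCORR, and
CORRDIST, the macro variable NBOOT, and the seed are assumptions made
for this example and are not part of %BOOT.

```sas
data law;
   input lsat gpa @@;
   datalines;
576 3.39  635 3.30  558 2.81  578 3.03  666 3.44
580 3.07  555 3.00  661 3.43  651 3.36  605 3.13
653 3.12  575 2.74  545 2.76  572 2.88  594 2.96
;

%let nboot = 200;   /* number of resamples (assumed value) */

/* Draw NBOOT resamples of size n with replacement by direct access. */
data resamples;
   call streaminit(123);
   do _sample_ = 1 to &nboot;
      do k = 1 to n;
         p = ceil(n * rand('uniform'));   /* random observation number 1..n */
         set law point=p nobs=n;
         output;
      end;
   end;
   stop;
   drop k;
run;

/* The statistic in every resample, via BY processing. */
proc corr noprint data=resamples outp=bootcorr;
   by _sample_;
   var lsat gpa;
run;

data corrdist;
   set bootcorr;
   where _type_ = 'CORR' and upcase(_name_) = 'LSAT';
   corr = gpa;
   keep _sample_ corr;
run;

/* The standard deviation of the replicates is the bootstrap standard error. */
proc means data=corrdist n mean std;
   var corr;
run;
```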

The bootstrap estimates of bias provided by the %BOOT macro are valid

under simple random sampling for many commonly-used _plug-in_

estimators. A _plug-in_ estimator is one that uses the same formula to

compute an estimate from a sample that is used to compute a parameter

from the population. For example, if the sample variance is computed

with a divisor of n (VARDEF=N), it is a plug-in estimate; if it is

computed with a divisor of n-1 (VARDEF=DF, the default), it is _not_ a

plug-in estimate. R-squared is a plug-in estimator; adjusted R-squared

is not. Estimating the bias of a non-plug-in estimator requires

special treatment; see "Bias Estimation" below. If you are using an

estimator that is known to be unbiased, use the BIASCORR=0 argument with

%BOOT. See E&T, chapter 10, for more discussion of bootstrap estimation

of bias.
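
To make the plug-in distinction concrete, the following sketch computes
the variance of LSAT with both divisors and checks that the two
estimates differ by the factor (n-1)/n. The data set names V_PLUGIN,
V_DF, and COMPARE are hypothetical.

```sas
data law;
   input lsat gpa @@;
   datalines;
576 3.39  635 3.30  558 2.81  578 3.03  666 3.44
580 3.07  555 3.00  661 3.43  651 3.36  605 3.13
653 3.12  575 2.74  545 2.76  572 2.88  594 2.96
;

/* Plug-in estimate: divisor n. */
proc means noprint data=law vardef=n;
   var lsat;
   output out=v_plugin var=var_n n=n;
run;

/* Usual unbiased estimate: divisor n-1. */
proc means noprint data=law vardef=df;
   var lsat;
   output out=v_df var=var_df;
run;

data compare;
   merge v_plugin v_df;
   ratio    = var_n / var_df;   /* observed ratio of the two estimates */
   expected = (n - 1) / n;      /* ratio implied by the two divisors   */
   keep n var_n var_df ratio expected;
run;

proc print data=compare;
run;
```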

The approximate normal confidence intervals computed by the %BOOT macro

are valid if both the bias and standard error estimates are valid and

if the sampling distribution is approximately normal. For non-normal

sampling distributions, you should use the %BOOTCI macro, which

requires a much larger number of resamples for adequate approximation.

If you plan to use only %BOOT, 200 resamples will typically be

enough. If you plan to use %BOOTCI, 1000 or more resamples are likely

to be needed for a 90% confidence interval; greater confidence

levels require even more resamples. The proper use of bootstrap

confidence intervals is a matter of considerable controversy; see

S&T, chapter 4, for a review.
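
For a sampling distribution that is not normal, one simple variety of
interval is the percentile interval, which reads the limits directly
from the quantiles of the bootstrap distribution. The sketch below
computes a 90% percentile interval for the correlation from 1000
resamples. It is an illustration only and does not claim to reproduce
what %BOOTCI computes; the data set names and the macro variable NBOOT
are assumptions.

```sas
data law;
   input lsat gpa @@;
   datalines;
576 3.39  635 3.30  558 2.81  578 3.03  666 3.44
580 3.07  555 3.00  661 3.43  651 3.36  605 3.13
653 3.12  575 2.74  545 2.76  572 2.88  594 2.96
;

%let nboot = 1000;   /* assumed; the text suggests 1000 or more */

data resamples;
   call streaminit(123);
   do _sample_ = 1 to &nboot;
      do k = 1 to n;
         p = ceil(n * rand('uniform'));
         set law point=p nobs=n;
         output;
      end;
   end;
   stop;
   drop k;
run;

proc corr noprint data=resamples outp=bootcorr;
   by _sample_;
   var lsat gpa;
run;

data corrdist;
   set bootcorr;
   where _type_ = 'CORR' and upcase(_name_) = 'LSAT';
   corr = gpa;
   keep _sample_ corr;
run;

/* The 5th and 95th percentiles bound a 90% percentile interval. */
proc univariate data=corrdist noprint;
   var corr;
   output out=pctl_ci pctlpts=5 95 pctlpre=corr_p;
run;

proc print data=pctl_ci;
run;
```

The limits are tail quantiles of the resampling distribution, which is
why far more resamples are needed here than for a standard error.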

The %BOOT macro does balanced resampling when possible. Balanced

resampling yields more accurate approximations to the ideal bootstrap

estimators of bias and standard errors than does uniform resampling. Of

course, both balanced resampling and uniform resampling produce

approximations that converge to the same ideal bootstrap estimators as

the number of resamples goes to infinity. Balanced resampling is of

little benefit with %BOOTCI. See Hall, appendix II, for a discussion of

balanced resampling and other methods for improving the computational

efficiency of the bootstrap.
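
One standard way to construct balanced resamples is to make B copies of
every observation, shuffle the pooled copies, and cut the shuffled pool
into B consecutive blocks of size n, so that every observation appears
exactly B times overall. The sketch below does this; it is not taken
from the %BOOT source, and the names POOL, BALANCED, _KEY_, and NBOOT
are assumptions.

```sas
data law;
   input lsat gpa @@;
   datalines;
576 3.39  635 3.30  558 2.81  578 3.03  666 3.44
580 3.07  555 3.00  661 3.43  651 3.36  605 3.13
653 3.12  575 2.74  545 2.76  572 2.88  594 2.96
;

%let nboot = 200;   /* number of resamples B (assumed value) */

/* B copies of each observation, each tagged with a random sort key. */
data pool;
   set law nobs=n;
   _obs_ = _n_;
   if _n_ = 1 then do;
      call streaminit(123);
      call symputx('n', n);   /* save the sample size for the last step */
   end;
   do copy = 1 to &nboot;
      _key_ = rand('uniform');
      output;
   end;
   drop copy;
run;

/* Sorting by the random key shuffles the pooled copies. */
proc sort data=pool;
   by _key_;
run;

/* Consecutive blocks of n shuffled rows form the resamples. */
data balanced;
   set pool;
   _sample_ = ceil(_n_ / &n);
   drop _key_;
run;

/* Check: every original observation should occur NBOOT times. */
proc freq data=balanced noprint;
   tables _obs_ / out=obs_counts;
run;

proc print data=obs_counts;
run;
```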

Using %JACK and %BOOT

---------------------

To use the %JACK or %BOOT macros, you must write a macro called %ANALYZE

to do the data analysis that you want to bootstrap. The %ANALYZE macro

must have two arguments:

   DATA=   the name of the input data set to analyze

   OUT=    the name of the output data set containing the statistics
           for which you want to compute bootstrap distributions.

If possible, you should write the %ANALYZE macro to use BY processing.

The BY statement must be specified via the %BYSTMT macro, which

generates a BY statement in which the list of BY variables is given by a

macro variable &BY. The &BY macro variable is not an argument to

%ANALYZE or to %BYSTMT, but is specified by a %LET statement when

needed. The %JACK and %BOOT macros run %ANALYZE once without a BY

variable and then once with the BY variable _SAMPLE_.

If you do not use the %BYSTMT macro, the computations will be done with

a macro loop instead of with BY processing. A macro loop takes much more

computer time than BY processing but requires less disk space.

If the %ANALYZE macro uses the %BYSTMT macro, two output data sets

are created by the %JACK macro:

   JACKDATA   contains the jackknife resamples. The variable _SAMPLE_
              gives the resample number, and _OBS_ gives the original
              observation number.

   JACKDIST   contains the resampling distributions of the statistics
              in the OUT= data set created by the %ANALYZE macro. The
              variable _SAMPLE_ gives the resample number.

Two similar data sets are also created by the %BOOT macro when the

%BYSTMT macro is used:

   BOOTDATA   contains the bootstrap resamples. The variable _SAMPLE_
              gives the resample number, and _OBS_ gives the original
              observation number.

   BOOTDIST   contains the resampling distributions of the statistics
              in the OUT= data set created by the %ANALYZE macro. The
              variable _SAMPLE_ gives the resample number.

In addition, the %JACK macro creates a data set JACKSTAT and the %BOOT

macro creates a data set BOOTSTAT regardless of whether the %BYSTMT

macro is used. These data sets contain the approximate standard errors,

bias-corrected estimates, and 95% confidence intervals assuming a

normal sampling distribution. The %BOOTCI macro creates a data set

BOOTCI containing the confidence intervals.

If the OUT= data set contains more than one observation per BY group,

you must specify a list of ID= variables when you run the %JACK or %BOOT

macros. These ID= variables identify observations that correspond to

the same statistic in different BY groups. For many procedures, these

ID= variables would naturally be _TYPE_ and _NAME_, but those names are

_not_ allowed to be used as ID= variables--you must use the RENAME= data

set option to rename them. (Renaming variables can be tricky. You must

use the _old_ name with the DROP= and KEEP= data set options, but you

must use the _new_ name with the WHERE= data set option.)
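
The naming rule in the parenthetical note can be seen in a small
example. The sketch below applies the three options to an input data
set in a DATA step rather than to a procedure's OUT= data set, and the
data set SCORES and its variables are made up for the illustration.

```sas
data scores;
   input id score grp $;
   datalines;
1 10 a
2 20 b
3 30 a
;

data subset;
   set scores(keep=id score            /* KEEP= uses the old name  */
              rename=(score=points)
              where=(points > 15));    /* WHERE= uses the new name */
run;

proc print data=subset;
run;
```

Writing KEEP=ID POINTS or WHERE=(SCORE > 15) instead would refer to a
name that does not exist at the point where that option is applied.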

Consider analyzing the correlation of the LSAT and GPA variables from

Efron and Tibshirani (1993):

   title 'Law School Data from Efron and Tibshirani, p. 19';
   data law;
      input lsat gpa;
      cards;
   576 3.39
   635 3.30
   558 2.81
   578 3.03
   666 3.44
   580 3.07
   555 3.00
   661 3.43
   651 3.36
   605 3.13
   653 3.12
   575 2.74
   545 2.76
   572 2.88
   594 2.96
   ;

The following %ANALYZE macro could be used to process all the statistics

in the OUT= data set from PROC CORR:

   %macro analyze(data=,out=);
      proc corr noprint data=&data
                out=&out(rename=(_type_=stat _name_=with));
         var lsat gpa;
         %bystmt;
      run;
   %mend;

   title2 'Jackknife Analysis';
   %jack(data=law,id=stat with)

   title2 'Bootstrap Analysis';
   %boot(data=law,id=stat with,random=123)

However, if you are interested only in the correlation, it is more

efficient to extract only the relevant observations and variables. It

is also helpful to provide descriptive names and labels, as in this

example:

   %macro analyze(data=,out=);
      proc corr noprint data=&data out=&out;
         var lsat gpa;
         %bystmt;
      run;
      %if &syserr=0 %then %do;
         data &out;
            set &out;
            /* UPCASE keeps the test valid if names are stored in mixed case */
            where _type_='CORR' & upcase(_name_)='LSAT';
            corr=gpa;
            label corr='Correlation';
            keep corr &by;
         run;
      %end;
   %mend;

   title2 'Jackknife Analysis';
   %jack(data=law)

   title2 'Bootstrap Analysis';
   %boot(data=law,random=123)

It is advisable to make the OUT= data set as small as possible to

conserve computer time and disk space. If you are running release 6.11

or later, you can use a WHERE= data set option on an output data set:

   title2 'Using WHERE= with an output data set--6.11 only';
   %macro analyze(data=,out=);
      /* UPCASE keeps the test valid if names are stored in mixed case */
      proc corr noprint data=&data
                out=&out(where=(_type_='CORR' & upcase(_name_)='LSAT')
                         rename=(gpa=corr)
                         keep=gpa _type_ _name_ &by);
         var lsat gpa;
         %bystmt;
      run;
   %mend;

   title3 'Jackknife Analysis';
   %jack(data=law)

   title3 'Bootstrap Analysis';
   %boot(data=law,random=123)

Unfortunately, you may not DROP any variable used in a WHERE= data set

option.

Bias Estimation

---------------

The sample correlation is a plug-in estimator and hence is suitable for

the bias estimator in %BOOT. The sample variance computed with a divisor

of n-1 is not a plug-in estimator and therefore requires special

treatment. In some procedures, you can use the VARDEF= option to obtain

a plug-in estimate of the variance. The default value of VARDEF= is DF,

which yields the usual adjustment for degrees of freedom, instead of the

plug-in estimate. For example:

   title2 'The unbiased variance estimator is not a plug-in estimator';
   proc means data=law var vardef=df;
      var lsat gpa;
   run;

The following %ANALYZE macro could be used to jackknife the unbiased

variance estimator, but the bootstrap over-corrects for the nonexistent

bias:

   title2 'Estimating the bias of the unbiased estimator of variance';
   %macro analyze(data=,out=);
      proc means noprint data=&data vardef=df;
         output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa;
         var lsat gpa;
         %bystmt;
      run;
   %mend;

   title3 'The jackknife computes the correct bias of zero';
   %jack(data=law)

   title3 'The bootstrap over-corrects for bias';
   %boot(data=law,random=123)

By specifying VARDEF=N instead of VARDEF=DF, you can tell the MEANS

procedure to compute a plug-in estimate of the variance:

   title2 'Estimating the bias of the plug-in estimator of variance';
   %macro analyze(data=,out=);
      proc means noprint data=&data vardef=n;
         output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa;
         var lsat gpa;
         %bystmt;
      run;
   %mend;

With the above %ANALYZE macro, %JACK yields an exact bias correction,

while the bias-corrected estimates from %BOOT are very close to the

unbiased estimates:

   title3 'Jackknife Analysis';
   %jack(data=law)

   title3 'Bootstrap Analysis';
   %boot(data=law,random=123)
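
The arithmetic behind "very close" can be checked by hand. For the
plug-in variance, the expected value of the resampled variance under
the ideal bootstrap is (n-1)/n times the sample value, so the ideal
bootstrap bias estimate is minus the plug-in variance divided by n. The
bias-corrected estimate is then (n+1)/n times the plug-in variance,
whereas the unbiased estimate is n/(n-1) times it. The sketch below
compares a Monte Carlo bias estimate with these values for LSAT. It
uses uniform resampling and made-up data set names, and is not the
%BOOT computation itself.

```sas
data law;
   input lsat gpa @@;
   datalines;
576 3.39  635 3.30  558 2.81  578 3.03  666 3.44
580 3.07  555 3.00  661 3.43  651 3.36  605 3.13
653 3.12  575 2.74  545 2.76  572 2.88  594 2.96
;

%let nboot = 2000;   /* number of resamples (assumed value) */

data resamples;
   call streaminit(123);
   do _sample_ = 1 to &nboot;
      do k = 1 to n;
         p = ceil(n * rand('uniform'));
         set law point=p nobs=n;
         output;
      end;
   end;
   stop;
   drop k;
run;

/* Plug-in variance in every resample. */
proc means noprint data=resamples vardef=n;
   by _sample_;
   var lsat;
   output out=bootvar var=var_star;
run;

/* Plug-in variance in the original sample. */
proc means noprint data=law vardef=n;
   var lsat;
   output out=full var=var_hat n=n;
run;

/* Mean of the resampled variances. */
proc means noprint data=bootvar;
   var var_star;
   output out=bootmean mean=mean_star;
run;

data biasdemo;
   merge full bootmean;
   boot_bias  = mean_star - var_hat;     /* Monte Carlo bias estimate      */
   ideal_bias = -var_hat / n;            /* ideal bootstrap bias estimate  */
   corrected  = var_hat - boot_bias;     /* bias-corrected plug-in value   */
   unbiased   = var_hat * n / (n - 1);   /* what VARDEF=DF would report    */
   keep n var_hat boot_bias ideal_bias corrected unbiased;
run;

proc print data=biasdemo;
run;
```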

If the procedure you are using supports the VARDEF= option to produce

plug-in estimates, you can use the %VARDEF macro to obtain correct

bootstrap bias estimates of the corresponding non-plug-in estimates. The

%VARDEF macro generates a VARDEF= option with a value of either N or DF

as appropriate for use with the %BOOT macro (the %JACK macro ignores the

%VARDEF macro). In the %ANALYZE macro, use %VARDEF in the procedure

statement where the VARDEF= option would be syntactically correct. For

example:

   title2 'Estimating the bias of the unbiased variance estimator';
   %macro analyze(data=,out=);
      proc means noprint data=&data %vardef;
         output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa;
         var lsat gpa;
         %bystmt;
      run;
   %mend;

   title3 'Bootstrap Analysis';
   %boot(data=law,random=123)

The variance estimator using VARDEF=DF is unbiased, so the bias

correction estimated by bootstrapping is much smaller than in the

previous example, in which the biased plug-in estimator was used.

susanyue

Thanks to the original poster. I'm interested in this and will study it.