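This material comes from the web. Someone asked me about the bootstrap and insisted that it be done in SAS, so I am reposting it here.
The examples are taken from An Introduction to the Bootstrap by Efron et al.
In R, the bootstrap is handled very well by the boot package, which I recommend. The back of An Introduction to the Bootstrap itself uses S!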
/*********************************************************************
name: jackboot
title: Jackknife and Bootstrap Analyses
product: stat
system: all
support:
update: 21Sep95
DISCLAIMER:
THIS INFORMATION IS PROVIDED BY SAS INSTITUTE INC. AS A SERVICE
TO ITS USERS. IT IS PROVIDED "AS IS". THERE ARE NO WARRANTIES,
EXPRESSED OR IMPLIED, AS TO MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE REGARDING THE ACCURACY OF THE MATERIALS OR CODE
CONTAINED HEREIN.
Introduction
------------
The %JACK macro does jackknife analyses for simple random samples,
computing approximate standard errors, bias-corrected estimates, and
confidence intervals assuming a normal sampling distribution.
The %BOOT macro does elementary nonparametric bootstrap analyses for
simple random samples, computing approximate standard errors,
bias-corrected estimates, and confidence intervals assuming a normal
sampling distribution. Also, for regression models, the %BOOT macro can
resample either observations or residuals.
The %BOOTCI macro computes several varieties of confidence intervals
that are suitable for sampling distributions that are not normal.
In order to use the %JACK or %BOOT macros, you need to know enough about
the SAS macro language to write simple macros yourself. See _The SAS
Guide to Macro Processing_ for information on the SAS macro language.
This document does not explain how the jackknife and bootstrap are
performed or how the various confidence intervals are computed, but does
provide some advice and caveats regarding usage. For an elementary
introduction, see Dixon in the bibliography below. There is a thorough
exposition in E&T that should be accessible to anyone who has done a
year or more of statistical study.
There is a widespread myth that bootstrapping is a magical spell to
perform valid statistical inference on _anything_. S&T dispel this myth
very effectively and very technically. For an elementary demonstration
of the dangers of bootstrapping, see the "Cautionary Example" below.
The Jackknife
-------------
The jackknife works only for statistics that are smooth functions of the
data. Statistics that are not smooth functions of the data, such as
quantiles, may yield inconsistent jackknife estimates. The best results
are obtained with statistics that are linear functions of the data. For
highly nonlinear statistics, the jackknife can be inaccurate. See S&T,
chapter 2, for a detailed discussion of the validity of the jackknife.
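For readers who want to see the mechanics, here is a minimal hand-rolled
sketch of a jackknife for a linear statistic (the mean), the case where
the method works best. It is not the code inside %JACK. The data set
SASHELP.CLASS and the names JACKDEMO, JACKDIST_DEMO, THETA and JACK_SE are
choices made for this illustration only.

```sas
* Build n leave-one-out resamples. Resample i omits observation i ;
data jackdemo;
   set sashelp.class(keep=height) nobs=n;
   _obs_ = _n_;
   do _sample_ = 1 to n;
      if _sample_ ne _obs_ then output;
   end;
run;

proc sort data=jackdemo;
   by _sample_;
run;

* The statistic (here the mean) computed on each resample ;
proc means data=jackdemo noprint;
   by _sample_;
   var height;
   output out=jackdist_demo mean=theta;
run;

* Jackknife standard error is sqrt of (n-1)/n times the corrected ;
* sum of squares of the leave-one-out estimates ;
proc sql;
   select sqrt((count(*) - 1) / count(*) * css(theta)) as jack_se
   from jackdist_demo;
quit;
```

Because the mean is linear in the data, JACK_SE should match the usual
standard error of the mean (the STDERR statistic from PROC MEANS), which
gives a quick check on the sketch.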
The Bootstrap
-------------
Bootstrap estimates of standard errors are valid for many commonly-used
statistics, generally requiring no major assumptions other than simple
random sampling and finite variance. There do exist some statistics for
which the standard error estimates will fail, such as the maximum or
minimum. The bootstrap standard error is consistent for some nonsmooth
statistics such as the median. However, the bootstrap standard error may
not be consistent even for very smooth statistics when the population
distribution has very heavy tails. Inconsistency of the usual bootstrap
estimators can often be remedied by using a resample size m(n) that is
smaller than the sample size n, so that m(n)->infinity and m(n)/n->0 as
n->infinity. Theoretical results on the consistency of the bootstrap
standard error are not extensive. See S&T, chapter 3, for details.
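The idea of a resample size m smaller than n can be sketched with a plain
DATA step. This is an illustration under stated assumptions, not the
implementation used by %BOOT: the value m=5, the 200 resamples, the seed
123, the use of SASHELP.CLASS and the names BOOTDEMO, BOOTDIST_DEMO, _PICK_
and THETA are all arbitrary choices for the example.

```sas
* Resample size, deliberately smaller than the 19 rows of SASHELP.CLASS ;
%let m = 5;

* Draw 200 resamples of size m with replacement by random direct access ;
data bootdemo(drop=_i_);
   do _sample_ = 1 to 200;
      do _i_ = 1 to &m;
         _pick_ = ceil(ranuni(123) * n);
         set sashelp.class(keep=height) point=_pick_ nobs=n;
         _obs_ = _pick_;
         output;
      end;
   end;
   stop;
run;

* The statistic (here the median) computed on each resample ;
proc means data=bootdemo noprint;
   by _sample_;
   var height;
   output out=bootdist_demo median=theta;
run;

* The standard deviation of THETA over resamples is the bootstrap ;
* standard error at resample size m ;
proc means data=bootdist_demo n mean std;
   var theta;
run;
```

Setting &m to the full sample size gives the ordinary bootstrap. With m
smaller than n the spread is on the scale of a sample of size m; how to
rescale it to size n depends on the convergence rate of the statistic and
is not shown here.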
The bootstrap estimates of bias provided by the %BOOT macro are valid
under simple random sampling for many commonly-used _plug-in_
estimators. A _plug-in_ estimator is one that uses the same formula to
compute an estimate from a sample that is used to compute a parameter
from the population. For example, if the sample variance is computed
with a divisor of n (VARDEF=N), it is a plug-in estimate; if it is
computed with a divisor of n-1 (VARDEF=DF, the default), it is _not_ a
plug-in estimate. R-squared is a plug-in estimator; adjusted R-squared
is not. Estimating the bias of a non-plug-in estimator requires
special treatment; see "Bias Estimation" below. If you are using an
estimator that is known to be unbiased, use the BIASCORR=0 argument with
%BOOT. See E&T, chapter 10, for more discussion of bootstrap estimation
of bias.
The approximate normal confidence intervals computed by the %BOOT macro
are valid if both the bias and standard error estimates are valid and
if the sampling distribution is approximately normal. For non-normal
sampling distributions, you should use the %BOOTCI macro, which
requires a much larger number of resamples for adequate approximation.
If you plan to use only %BOOT, 200 resamples will typically be
enough. If you plan to use %BOOTCI, 1000 or more resamples are likely
to be needed for a 90% confidence interval; greater confidence
levels require even more resamples. The proper use of bootstrap
confidence intervals is a matter of considerable controversy; see
S&T, chapter 4, for a review.
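To make the difference between a normal-theory interval and an interval
read off the resampling distribution concrete, the following sketch
bootstraps a correlation by hand and prints both kinds of 90% limits. It
is only an illustration and not necessarily how %BOOT or %BOOTCI compute
their intervals: the normal limits here are simply centred on the
bootstrap mean, and the second pair is the plain percentile method. The
data set SASHELP.CLASS, the seed, and the names BOOTDEMO, BOOTDIST_DEMO and
CI_DEMO are assumptions of the example.

```sas
* 1000 resamples of full size n, drawn with replacement ;
data bootdemo(drop=_i_);
   do _sample_ = 1 to 1000;
      do _i_ = 1 to n;
         _pick_ = ceil(ranuni(123) * n);
         set sashelp.class(keep=height weight) point=_pick_ nobs=n;
         output;
      end;
   end;
   stop;
run;

* Keep one row per resample. WEIGHT then holds corr(height, weight) ;
proc corr data=bootdemo noprint
          out=bootdist_demo(where=(_type_='CORR' and upcase(_name_)='HEIGHT'));
   by _sample_;
   var height weight;
run;

* Bootstrap mean, standard error, and 5th and 95th percentiles ;
proc univariate data=bootdist_demo noprint;
   var weight;
   output out=ci_demo mean=boot_mean std=boot_se
          pctlpts=5 95 pctlpre=pctl_;
run;

* Symmetric normal-theory limits for comparison with PCTL_5 and PCTL_95 ;
data ci_demo;
   set ci_demo;
   norm_lo = boot_mean - probit(0.95) * boot_se;
   norm_hi = boot_mean + probit(0.95) * boot_se;
run;

proc print data=ci_demo;
run;
```

For a statistic with a skewed sampling distribution, such as a correlation
near 1, the percentile limits will generally not be symmetric about the
mean, which is the situation the paragraph above has in mind.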
The %BOOT macro does balanced resampling when possible. Balanced
resampling yields more accurate approximations to the ideal bootstrap
estimators of bias and standard errors than does uniform resampling. Of
course, both balanced resampling and uniform resampling produce
approximations that converge to the same ideal bootstrap estimators as
the number of resamples goes to infinity. Balanced resampling is of
little benefit with %BOOTCI. See Hall, appendix II, for a discussion of
balanced resampling and other methods for improving the computational
efficiency of the bootstrap.
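One common way to obtain balanced resamples is to pool B copies of every
observation, shuffle the pool, and deal it out into B resamples of size n,
so that each observation appears exactly B times overall. The sketch below
follows that recipe; whether %BOOT does it the same way is not stated in
this document, so treat it as an assumption. B=4 is kept tiny for
readability, and the names POOL, BALANCED, _COPY_, _KEY_ and _NOBS_ are
hypothetical.

```sas
* Number of resamples ;
%let b = 4;

* Copy every observation b times and attach a random sort key ;
data pool;
   set sashelp.class(keep=name height) nobs=n;
   _obs_  = _n_;
   _nobs_ = n;
   do _copy_ = 1 to &b;
      _key_ = ranuni(123);
      output;
   end;
run;

* Shuffle the pooled copies ;
proc sort data=pool;
   by _key_;
run;

* Deal the shuffled pool into b resamples of n rows each ;
data balanced(drop=_copy_ _key_ _nobs_);
   set pool;
   _sample_ = ceil(_n_ / _nobs_);
run;

* Every _OBS_ value occurs b times and every _SAMPLE_ has n rows ;
proc freq data=balanced;
   tables _obs_ _sample_;
run;
```

Uniform resampling, by contrast, draws each resample independently, so the
total count of a given observation across resamples varies at random.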
Using %JACK and %BOOT
---------------------
To use the %JACK or %BOOT macros, you must write a macro called %ANALYZE
to do the data analysis that you want to bootstrap. The %ANALYZE macro
must have two arguments:
   DATA=   the name of the input data set to analyze
   OUT=    the name of the output data set containing the statistics
           for which you want to compute bootstrap distributions.
If possible, you should write the %ANALYZE macro to use BY processing.
The BY statement must be specified via the %BYSTMT macro, which
generates a BY statement in which the list of BY variables is given by a
macro variable &BY. The &BY macro variable is not an argument to
%ANALYZE or to %BYSTMT, but is specified by a %LET statement when
needed. The %JACK and %BOOT macros run %ANALYZE once without a BY
variable and then once with the BY variable _SAMPLE_.
If you do not use the %BYSTMT macro, the computations will be done with
a macro loop instead of with BY processing. A macro loop takes much more
computer time than BY processing but requires less disk space.
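The interplay between &BY and a BY-statement macro can be pictured with
the following sketch. %BYSTMT_SKETCH is a hypothetical stand-in written
for this illustration; the real %BYSTMT is defined in the macro file and
may differ. The example uses SASHELP.CLASS and its SEX variable purely as
convenient test data.

```sas
%macro bystmt_sketch;
   %* Emit a BY statement only when the macro variable BY is non-empty;
   %if %length(&by) > 0 %then %do;
      by &by;
   %end;
%mend bystmt_sketch;

* BY processing needs sorted input ;
proc sort data=sashelp.class out=class;
   by sex;
run;

* With an empty BY list no BY statement is generated ;
%let by=;
proc means data=class mean;
   var height;
   %bystmt_sketch
run;

* With BY set, the same step runs once per BY group ;
%let by=sex;
proc means data=class mean;
   var height;
   %bystmt_sketch
run;
```

Writing the analysis once and letting a macro variable switch BY
processing on or off is what allows the same %ANALYZE macro to be run on
the original data and then on all resamples in a single pass.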
If the %ANALYZE macro uses the %BYSTMT macro, two output data sets
are created by the %JACK macro:
   JACKDATA   contains the jackknife resamples. The variable _SAMPLE_
              gives the resample number, and _OBS_ gives the original
              observation number.
   JACKDIST   contains the resampling distributions of the statistics
              in the OUT= data set created by the %ANALYZE macro. The
              variable _SAMPLE_ gives the resample number.
Two similar data sets are also created by the %BOOT macro when the
%BYSTMT macro is used:
   BOOTDATA   contains the bootstrap resamples. The variable _SAMPLE_
              gives the resample number, and _OBS_ gives the original
              observation number.
   BOOTDIST   contains the resampling distributions of the statistics
              in the OUT= data set created by the %ANALYZE macro. The
              variable _SAMPLE_ gives the resample number.
In addition, the %JACK macro creates a data set JACKSTAT and the %BOOT
macro creates a data set BOOTSTAT regardless of whether the %BYSTMT
macro is used. These data sets contain the approximate standard errors,
bias-corrected estimates, and 95% confidence intervals assuming a
normal sampling distribution. The %BOOTCI macro creates a data set
BOOTCI containing the confidence intervals.
If the OUT= data set contains more than one observation per BY group,
you must specify a list of ID= variables when you run the %JACK or %BOOT
macros. These ID= variables identify observations that correspond to
the same statistic in different BY groups. For many procedures, these
ID= variables would naturally be _TYPE_ and _NAME_, but those names are
_not_ allowed to be used as ID= variables--you must use the RENAME= data
set option to rename them. (Renaming variables can be tricky. You must
use the _old_ name with the DROP= and KEEP= data set options, but you
must use the _new_ name with the WHERE= data set option.)
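The naming rule in the parenthetical remark can be seen in a small
stand-alone step. This is a sketch on SASHELP.CLASS, with CORROUT,
CORR_ONLY, STAT and ROWVAR as made-up names: KEEP= refers to the variables
by their old names, while WHERE= refers to them by the names that RENAME=
assigns.

```sas
* PROC CORR writes _TYPE_ and _NAME_ to its OUT= data set ;
proc corr data=sashelp.class noprint out=corrout;
   var height weight;
run;

data corr_only;
   * KEEP= uses the old names, WHERE= uses the new names ;
   set corrout(keep=_type_ _name_ weight
               rename=(_type_=stat _name_=rowvar)
               where=(stat='CORR' and upcase(rowvar)='HEIGHT'));
run;

proc print data=corr_only;
run;
```

The single remaining row holds the correlation of HEIGHT with WEIGHT in
the WEIGHT column, identified by STAT and ROWVAR rather than by _TYPE_ and
_NAME_.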
Consider analyzing the correlation of the LSAT and GPA variables from
Efron and Tibshirani (1993):
title 'Law School Data from Efron and Tibshirani, p. 19';
data law; input lsat gpa; cards;
576 3.39
635 3.30
558 2.81
578 3.03
666 3.44
580 3.07
555 3.00
661 3.43
651 3.36
605 3.13
653 3.12
575 2.74
545 2.76
572 2.88
594 2.96
;
The following %ANALYZE macro could be used to process all the statistics
in the OUT= data set from PROC CORR:
   %macro analyze(data=,out=);
      %* Keep every statistic and rename _TYPE_ and _NAME_ for use as ID= variables;
      proc corr noprint data=&data
         out=&out(rename=(_type_=stat _name_=with));
         var lsat gpa;
         %bystmt;
      run;
   %mend;
   title2 'Jackknife Analysis';
   %jack(data=law,id=stat with)
   title2 'Bootstrap Analysis';
   %boot(data=law,id=stat with,random=123)
However, if you are interested only in the correlation, it is more
efficient to extract only the relevant observations and variables. It
is also helpful to provide descriptive names and labels, as in this
example:
   %macro analyze(data=,out=);
      proc corr noprint data=&data out=&out;
         var lsat gpa;
         %bystmt;
      run;
      %* Post-process only if PROC CORR ran without error;
      %if &syserr=0 %then %do;
         data &out;
            set &out;
            %* UPCASE guards against releases that store _NAME_ in mixed case;
            where _type_='CORR' & upcase(_name_)='LSAT';
            corr=gpa;
            label corr='Correlation';
            keep corr &by;
         run;
      %end;
   %mend;
   title2 'Jackknife Analysis';
   %jack(data=law)
   title2 'Bootstrap Analysis';
   %boot(data=law,random=123)
It is advisable to make the OUT= data set as small as possible to
conserve computer time and disk space. If you are running release 6.11
or later, you can use a WHERE= data set option on an output data set:
   title2 'Using WHERE= with an output data set--6.11 or later';
   %macro analyze(data=,out=);
      %* UPCASE guards against releases that store _NAME_ in mixed case;
      proc corr noprint data=&data
         out=&out(where=(_type_='CORR' & upcase(_name_)='LSAT')
                  rename=(gpa=corr)
                  keep=gpa _type_ _name_ &by);
         var lsat gpa;
         %bystmt;
      run;
   %mend;
   title3 'Jackknife Analysis';
   %jack(data=law)
   title3 'Bootstrap Analysis';
   %boot(data=law,random=123)
Unfortunately, you may not DROP any variable used in a WHERE= data set
option.
Bias Estimation
---------------
The sample correlation is a plug-in estimator and hence is suitable for
the bias estimator in %BOOT. The sample variance computed with a divisor
of n-1 is not a plug-in estimator and therefore requires special
treatment. In some procedures, you can use the VARDEF= option to obtain
a plug-in estimate of the variance. The default value of VARDEF= is DF,
which yields the usual adjustment for degrees of freedom, instead of the
plug-in estimate. For example:
   title2 'The unbiased variance estimator is not a plug-in estimator';
   proc means data=law var vardef=df;
      var lsat gpa;
   run;
The following %ANALYZE macro could be used to jackknife the unbiased
variance estimator, but the bootstrap over-corrects for the nonexistent
bias:
   title2 'Estimating the bias of the unbiased estimator of variance';
   %macro analyze(data=,out=);
      proc means noprint data=&data vardef=df;
         output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa;
         var lsat gpa;
         %bystmt;
      run;
   %mend;
   title3 'The jackknife computes the correct bias of zero';
   %jack(data=law)
   title3 'The bootstrap over-corrects for bias';
   %boot(data=law,random=123)
By specifying VARDEF=N instead of VARDEF=DF, you can tell the MEANS
procedure to compute a plug-in estimate of the variance:
   title2 'Estimating the bias of the plug-in estimator of variance';
   %macro analyze(data=,out=);
      proc means noprint data=&data vardef=n;
         output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa;
         var lsat gpa;
         %bystmt;
      run;
   %mend;
With the above %ANALYZE macro, %JACK yields an exact bias correction,
while the bias-corrected estimates from %BOOT are very close to the
unbiased estimates:
   title3 'Jackknife Analysis';
   %jack(data=law)
   title3 'Bootstrap Analysis';
   %boot(data=law,random=123)
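The claim that the jackknife bias correction is exact for the plug-in
variance can be checked by hand, without the macros. The sketch below
applies the standard jackknife bias-corrected estimate, n times the
full-sample estimate minus (n-1) times the mean of the leave-one-out
estimates, to the VARDEF=N variance and prints it next to the VARDEF=DF
variance. SASHELP.CLASS stands in for the law school data so that the step
runs on its own; all data set and variable names below are invented for
the example.

```sas
* Leave-one-out resamples of HEIGHT ;
data jackdemo;
   set sashelp.class(keep=height) nobs=n;
   _obs_ = _n_;
   do _sample_ = 1 to n;
      if _sample_ ne _obs_ then output;
   end;
run;

proc sort data=jackdemo;
   by _sample_;
run;

* Plug-in variance (divisor n) on each leave-one-out resample ;
proc means data=jackdemo noprint vardef=n;
   by _sample_;
   var height;
   output out=jackdist_demo var=theta_i;
run;

* Plug-in variance and sample size on the full data ;
proc means data=sashelp.class noprint vardef=n;
   var height;
   output out=full_n var=theta_hat n=n;
run;

* Unbiased variance (divisor n-1) on the full data ;
proc means data=sashelp.class noprint vardef=df;
   var height;
   output out=full_df var=unbiased;
run;

* JACK_CORRECTED should agree with UNBIASED up to rounding ;
proc sql;
   select f.theta_hat,
          f.n * f.theta_hat - (f.n - 1) * mean(j.theta_i) as jack_corrected,
          d.unbiased
   from full_n as f, full_df as d, jackdist_demo as j
   group by f.theta_hat, f.n, d.unbiased;
quit;
```

The agreement is an algebraic identity for the variance, not a general
property of the jackknife; for other plug-in estimators the correction is
only approximate.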
If the procedure you are using supports the VARDEF= option to produce
plug-in estimates, you can use the %VARDEF macro to obtain correct
bootstrap bias estimates of the corresponding non-plug-in estimates. The
%VARDEF macro generates a VARDEF= option with a value of either N or DF
as appropriate for use with the %BOOT macro. (The %JACK macro ignores the
%VARDEF macro.) In the %ANALYZE macro, use %VARDEF in the procedure
statement where the VARDEF= option would be syntactically correct. For
example:
   title2 'Estimating the bias of the unbiased variance estimator';
   %macro analyze(data=,out=);
      proc means noprint data=&data %vardef;
         output out=&out(drop=_freq_ _type_) var=var_lsat var_gpa;
         var lsat gpa;
         %bystmt;
      run;
   %mend;
   title3 'Bootstrap Analysis';
   %boot(data=law,random=123)
The variance estimator using VARDEF=DF is unbiased, so the bias
correction estimated by bootstrapping is much smaller than in the
previous example, in which the biased plug-in estimator was used.