- Toy collector solution
- Plug-In and the Bootstrap
- Nonparametric and Parametric Bootstraps
- Examples
Children (and some adults) are frequently enticed to buy breakfast cereal in an effort to collect all the action figures. Assume there are 15 action figures and each cereal box contains exactly one with each figure being equally likely.
Figure | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Probability | .2 | .1 | .1 | .1 | .1 | .1 | .05 | .05 | .05 | .05 | .02 | .02 | .02 | .02 | .02 |
prob.table <- c(.2, .1, .1, .1, .1, .1, .05, .05, .05, .05, .02, .02, .02, .02, .02) prob.table <- rep(1, 15)/15 boxes <- seq(1,15) box.count <- function(prob=prob.table){ check <- double(length(prob)) i <- 0 while(sum(check)<length(prob)){ x <- sample(boxes, 1, prob=prob) check[x] <- 1 i <- i+1 } return(i) }
trials <- 1000 sim.boxes <- double(trials) for(i in 1:trials){ sim.boxes[i] <- box.count() } est <- mean(sim.boxes) sd(sim.boxes)
## [1] 18.67468
mcse <- sd(sim.boxes) / sqrt(trials) interval <- est + c(-1,1)*1.96*mcse est
## [1] 50.928
interval
## [1] 49.77053 52.08547
hist(sim.boxes, main="Histogram of Total Boxes", xlab="Boxes") abline(v=300, col="red", lwd=2)
World | Real | Bootstrap |
---|---|---|
true distribution | \(F\) | \(\hat{F}_n\) |
data | \(X_1, \dots, X_n\) i.i.d. \(F\) | \(X_1^*, \dots, X_n^*\) i.i.d. \(\hat{F}_n\) |
empirical distribution | \(\hat{F}_n\) | \(F_n^*\) |
parameter | \(\theta = t(F)\) | \(\hat{\theta}_n = t(\hat{F}_n)\) |
estimator | \(\hat{\theta}_n = t(\hat{F}_n)\) | \(\theta_n^* = t(F_n^*)\) |
error | \(\hat{\theta}_n - \theta\) | \(\theta_n^* - \hat{\theta}_n\) |
standardized error | \(\frac{\hat{\theta}_n - \theta}{s(\hat{F}_n)}\) | \(\frac{\theta_n^* - \hat{\theta}_n}{s(F_n^*)}\) |
Notation \(\theta = t(F)\) means \(\theta\) is some function of the true unknown distribution
A bootstrap estimate of the standard error is \[ \hat{se}(\hat{\theta}^*) = \sqrt{\frac1{B-1}\sum_{b=1}^B (\hat{\theta}^{(b)}-\bar{\hat{\theta}^{*}})^2}, \] where \(\bar{\hat{\theta}^{*}} = \frac1{B}\sum_{b=1}^B \hat{\theta}^{(b)}\).
library(bootstrap) print(cor(law$LSAT, law$GPA)) # Small data set
## [1] 0.7763745
print(cor(law82$LSAT, law82$GPA)) # Full data
## [1] 0.7599979
# Set up the bootstrap B <- 200 # Number of bootstrapping n <- nrow(law) # Sample size R <- numeric(B) # Bootstrap estimate of standard error of Pearson correlation for(b in 1:B){ i <- sample(1:n, size=n, replace = TRUE) # i is a vector of indices LSAT <- law$LSAT[i] GPA <- law$GPA[i] R[b] <- cor(LSAT, GPA) } print(se.R <- sd(R))
## [1] 0.1285065
hist(R, probability = TRUE)
We can also use the built in bootstrapping functions, see boot
and bootstrap
library.
library(boot) r <- function(x, i) cor(x[i,1], x[i,2]) obj <- boot(data = law, statistic = r, R = 200) obj
## ## ORDINARY NONPARAMETRIC BOOTSTRAP ## ## ## Call: ## boot(data = law, statistic = r, R = 200) ## ## ## Bootstrap Statistics : ## original bias std. error ## t1* 0.7763745 0.01031661 0.1166956
sd(obj$t)
## [1] 0.1166956
library(bootstrap) theta <- function(x,xdata){ cor(xdata[x,1],xdata[x,2]) } obj2 <- bootstrap(1:n, nboot = 200, theta = theta, law) sd(obj2$thetastar)
## [1] 0.1148487
A bootstrap estimate of the bias is \[ bias(\hat\theta) = \bar{\hat{\theta}^{*}} - \hat{\theta}. \]
R.hat <- cor(law$LSAT, law$GPA) bias <- mean(R - R.hat) bias
## [1] -0.007035124
By CLT, \[ Z = \frac{\hat{\theta} - \theta}{se(\hat\theta)} \sim \mathcal{N}(0,1), \text{ as } n \to \infty. \]
Then the 100(1 - \(\alpha\))% approximation confidence interval is given by \[ (\hat{\theta} - z_{\alpha/2} se(\hat{\theta}), \ \hat{\theta} + z_{\alpha/2} se(\hat{\theta})), \] where \(z_{\alpha/2} = \Phi^{-1}(1-\alpha/2)\).
The 100(1 - \(\alpha\))% BCa confidence interval is \((\hat{\theta}^*_{\alpha_1}, \ \hat{\theta}^*_{\alpha_2})\), where \[ \alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1-\hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right),\\ \alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1-\hat{a}(\hat{z}_0 + z_{\alpha/2})}\right), \]
Acceleration factor: \[ \hat{a} = \frac{\sum_{i=1}^n (\hat\theta_{(i)} - \hat{\hat\theta_{(\cdot)}})^3}{6(\sum_{i=1}^n (\hat\theta_{(i)} - \hat{\hat\theta_{(\cdot)}})^2)^{3/2}} \]
library(boot) data(law, package = "bootstrap") boot.obj <- boot(law, R = 2000, statistic = function(x, i){cor(x[i,1], x[i,2])}) print(boot.ci(boot.obj, type=c("norm","basic", "perc", "bca")))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS ## Based on 2000 bootstrap replicates ## ## CALL : ## boot.ci(boot.out = boot.obj, type = c("norm", "basic", "perc", ## "bca")) ## ## Intervals : ## Level Normal Basic ## 95% ( 0.5218, 1.0355 ) ( 0.5906, 1.0741 ) ## ## Level Percentile BCa ## 95% ( 0.4787, 0.9621 ) ( 0.3173, 0.9406 ) ## Calculations and Intervals on Original Scale ## Some BCa intervals may be unstable
Exercise: Calculate the studentized bootstrap interval.
World | Real | Bootstrap |
---|---|---|
parameter | \(\theta\) | \(\hat{\theta}_n\) |
true distribution | \(F_{\theta}\) | \(F_{\hat{\theta}_n}\) |
data | \(X_1, \dots, X_n\) i.i.d. \(F_{\theta}\) | \(X_1^*, \dots, X_n^*\) i.i.d. \(F_{\hat{\theta}_n}\) |
estimator | \(\hat{\theta}_n = t(X_1, \dots, X_n)\) | \(\theta_n^* = t(X_1^*, \dots, X_n^*)\) |
error | \(\hat{\theta}_n - \theta\) | \(\theta_n^* - \hat{\theta}_n\) |
standardized error | \(\frac{\hat{\theta}_n - \theta}{s(X_1, \dots, X_n)}\) | \(\frac{\theta_n^* - \hat{\theta}_n}{s(X_1^*, \dots, X_n^*)}\) |
In this example our aim is to test the hypothesis that the true value of the index is 1 (i.e. that the data come from an exponential distribution) against the alternative that the data come from a gamma distribution with index not equal to 1.
air.fun <- function(data) { ybar <- mean(data$hours) para <- c(log(ybar), mean(log(data$hours))) ll <- function(k) { if (k <= 0) 1e200 else lgamma(k)-k*(log(k)-1-para[1]+para[2]) } khat <- nlm(ll, ybar^2/var(data$hours))$estimate c(ybar, khat) } air.rg <- function(data, mle) { # Function to generate random exponential variates. # mle will contain the mean of the original data out <- data out$hours <- rexp(nrow(out), 1/mle) out }
air.boot <- boot(aircondit, air.fun, R = 999, sim = "parametric", ran.gen = air.rg, mle = mean(aircondit$hours)) # The bootstrap p-value can then be approximated by sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)
## [1] 0.463
speed <- c(28, -44, 29, 30, 26, 27, 22, 23, 33, 16, 24, 29, 24, 40 , 21, 31, 34, -2, 25, 19)
mean(speed)
## [1] 21.75
newspeed <- speed - mean(speed) + 33.02
newspeed
will have exactly the same shape as speed, but will be shiftedn <- 1000 bstrap <- double(n) for (i in 1:n){ newsample <- sample(newspeed, 20, replace=T) bstrap[i] <- mean(newsample) }
(sum(bstrap < 21.75) + sum(bstrap > 44.29))/1000
## [1] 0.008
bootstrap.resample <- function (object) sample (object, length(object), replace=TRUE) diff.in.means <- function(df) { mean(df[df$group==1,"extra"]) - mean(df[df$group==2,"extra"]) } resample.diffs <- replicate(2000, diff.in.means(sleep[bootstrap.resample(1:nrow(sleep)),]))
hist(resample.diffs, main="Bootstrap Sampling Distribution") abline(v=diff.in.means(sleep), col=2, lwd=3)