Bootstrap Estimates of Confidence Intervals

Bootstrapping is a statistical procedure that utilizes resampling (with replacement) of a sample to infer properties of a wider population.

More often than not, we want to understand the properties of a population but we only have access to a small sample of that population. Sometimes, we are unable to gather more data because it is too expensive, too time consuming, or just not possible. When this is the situation, we must use the sample we already have in a clever way to learn about the characteristics of the population we are interested in. In comes the bootstrap method! The whole idea of bootstrapping is to randomly resample (with replacement) our existing sample so we in effect have more "samples" to work with. These resamples can be used to estimate confidence intervals (which will be the focus of this blog post), reduce biases, perform hypothesis tests, and more. With bootstrapping, we are quite literally pulling our data up by its bootstraps. Let's take a look at how it works.

How Do We Use the Bootstrap Method to Estimate a Confidence Interval?

  1. Take B repeated random samples, with replacement, from the given dataset. These are called "resamples" and should each be the same size as the original sample.
  2. Calculate the statistic of interest (e.g., mean, median, standard deviation) for each resample.
  3. Now that you have a distribution of B different estimates of the statistic of interest, you can calculate a confidence interval on that statistic to quantify its variability.
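The steps above can be sketched in a few lines of Python. Here the statistic of interest is the sample mean, and the small dataset is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
sample = np.array([2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9])

n_resamples = 10_000
# Steps 1 & 2: resample with replacement, same size as the original
# sample, and compute the statistic for each resample
boot_means = [
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_resamples)
]

# Step 3: the 2.5th and 97.5th percentiles give a 95% interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```

The same pattern works for any statistic: swap `.mean()` for whatever estimator you care about.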

You might be wondering why exactly this works and how this method allows us to understand the properties of the population. In short, bootstrapping treats the distribution of statistics computed from the resamples as a reasonable approximation of the statistic's true sampling distribution: the variability we observe across resamples approximates the variability we would see across independent samples drawn from the population.

The basic idea is that inferences made from the resampled data are a good proxy for inferences about the population itself. Check out Bradley Efron's paper if you are interested in diving deeper into this reasoning.

Worked Example with Python

In our Python example we will use data from the Hubble Space Telescope. The data contains distances and velocities of 24 galaxies containing Cepheid stars, collected by the Hubble Space Telescope Key Project to measure the Hubble Constant.

The data contains three columns:

  • Galaxy: A factor label identifying the galaxy
  • y: The galaxy’s relative velocity measured in kilometers/second (km/s)
  • x: The galaxy’s distance from Earth measured in Megaparsecs (Mpc) (Note: 1 Mpc = 3.09e19 km)

We can use this data to estimate the Hubble Constant, \(\beta\), and the age of the universe, \(\beta^{-1}\), with the following:

\[\begin{align} y = \beta x \end{align}\]

Here I’ll give some quick scientific context as to what the Hubble Constant is and how it can be used to estimate the age of the universe.

According to the standard Big Bang model, the universe is expanding uniformly according to Hubble’s Law:

\[\begin{align} v = H_0 d \end{align}\]

where \(v\) is apparent velocity of the galaxy and \(d\) is the distance to the galaxy. \(v\) and \(d\) are related linearly by \(H_0\), which we call the Hubble Constant. These variables, \(v, d, H_0\), are the standard astrophysical notations for velocity, distance, and the Hubble Constant. In terms of the variables given in our dataset, Hubble’s Law is:

\[\begin{align} y = \beta x \end{align}\]

where \(y\) is the relative velocity of the galaxy, \(\beta\) is the Hubble Constant, and \(x\) is the distance to the galaxy. From now on, I’ll use \(y, x,\) and \(\beta\) to denote galactic velocity, distance, and the Hubble Constant.

Now, how does the Hubble Constant help us determine the age of the universe? The inverse of the Hubble Constant, \(\beta ^{-1}\), gives the approximate age of the universe. This might not be immediately clear, so let’s take a look at the units of our quantities to see why it’s \(\beta ^{-1}\). \(y\) is measured in the units km/s, or distance over time. \(x\) is measured in units of Mpc, or distance. Writing this out we see:

\[\begin{align} \frac{distance}{time} = [\beta] distance \end{align}\]

Here, I’m using square brackets around \(\beta\) as a placeholder for the dimensions of \(\beta\).

Now we can determine the units of \(\beta\) by comparing the units on the left-hand side of the equation to those on the right-hand side, and figuring out what units \(\beta\) must take to make the dimensions of the two sides equal. From this, we can see that the units of the Hubble Constant, \(\beta\), must be \(\frac{1}{time}\):

\[\begin{align} \frac{distance}{time} = \frac{1}{time} distance \end{align}\]

\[\begin{align} \frac{distance}{time} = \frac{distance}{time} \end{align}\]

We now see the units of the Hubble Constant are \(\frac{1}{time}\), so taking the inverse of \(\beta\) will give units of time, and thus an estimate of the age of the universe (because the unit of age is time).
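As a quick numerical check on this dimensional argument, plugging in a hypothetical value of \(\beta\) = 70 km/s/Mpc (a commonly quoted ballpark figure, not a result from our data) and converting units gives an age of roughly 14 billion years:

```python
beta = 70.0           # hypothetical Hubble Constant in km/s/Mpc
km_per_mpc = 3.09e19  # 1 Mpc in km
sec_per_year = 365 * 24 * 60**2

# Convert beta from km/s/Mpc to units of 1/s, then invert to get time
beta_per_sec = beta / km_per_mpc
age_years = (1 / beta_per_sec) / sec_per_year
print(f"Age of the universe for beta = {beta}: {age_years:.2e} years")
```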

The next question that comes to mind is: how do we determine the Hubble Constant from our velocity and distance data? Because these quantities are related linearly by Hubble’s Law, the answer is simply linear regression. We could perform a linear regression on the 24 velocity-distance pairs in our dataset and read off the linear coefficient to get a value for \(\beta\), but a single estimate can’t tell us much about how well the true value of \(\beta\) is constrained.

We could submit a proposal to the Hubble Telescope team to apply for more telescope time to measure the velocities of and distances to more galaxies to get more samples to analyze, but that would be expensive and time consuming and there’s no guarantee that our proposal would be accepted. So we turn to bootstrapping!

Now, let’s work through estimating a 95% confidence interval on the value of the Hubble Constant with Python.

First, let’s import the required libraries for our analysis:

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import bootstrap
import matplotlib.pyplot as plt
```

Next, let’s import the data and look at the first 5 rows using the read_csv() and head() functions from pandas. Click here to download the data if you would like to follow along.

```python
data = pd.read_csv("hubble.csv")
data.head()
```
     Galaxy     y      x
0   NGC0300   133   2.00
1   NGC0925   664   9.16
2  NGC1326A  1794  16.14
3   NGC1365  1594  17.95
4   NGC1425  1473  21.88

We can see the three columns, Galaxy, y, and x. Again, Galaxy is the galaxy identifier/name, y is the relative velocity of the galaxy, and x is the distance to the galaxy. Let’s plot our data to investigate the relationship between distance and velocity.

```python
# Extract velocity, y, and distance, x, from our imported data
y = data["y"]
x = data["x"]

# Plot x vs y
plt.scatter(x, y)
plt.title("Galactic Distance vs Relative Velocity")
plt.xlabel("Distance (Mpc)")
plt.ylabel("Relative Velocity (km/s)")
plt.show()
```

[Figure: scatter plot of galactic distance vs. relative velocity]

We can see that the relationship between relative velocity and distance is roughly linear.

Now our goal is to bootstrap this data to estimate a 95% confidence interval on the Hubble Constant. Recall bootstrapping requires that we resample the data many times so that we get a distribution of a particular statistic, in this case the Hubble Constant, that we can use to estimate a confidence interval on that statistic. Since we will resample the data many many times, let’s define a function that creates a resample of the data that we can call later in a loop:

```python
def resample(data, seed):
    '''
    Creates a resample of the provided data that is
    the same length as the provided data
    '''
    import random
    random.seed(seed)
    res = random.choices(data, k=len(data))
    return res
```

Our resample() function takes in data that it will resample from. It also takes in seed, which sets a pseudorandom seed; this is purely for reproducibility of this example.
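To see what resample() produces, here is a quick check on a toy list (the function definition is repeated so the snippet runs on its own):

```python
import random

def resample(data, seed):
    '''Same resample() as above, repeated so this snippet is standalone.'''
    random.seed(seed)
    return random.choices(data, k=len(data))

toy = [1, 2, 3, 4, 5]
# Same length as toy; elements may repeat because we sample with replacement
print(resample(toy, seed=0))
```

Because the seed is fixed, calling the function twice with the same seed returns the identical resample.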

Now let’s set up our data so that we can feed it into our resample() function. Our data contains velocity-distance pairs: specific velocities correspond to specific distances. So we want to randomly resample pairs of velocity and distance; we don’t want to randomly sample velocity and then randomly sample distance separately. Let’s use Python’s zip() function to “zip” our corresponding velocities and distances together, then resample the “zipped” pairs. This ensures that we maintain the correct velocity-distance pairs throughout our bootstrap analysis.

```python
# Extract the distance, x, and velocity, y, values from our pandas dataframe
distances = data["x"].values
velocities = data["y"].values

# Zip our distances and velocities together and store the zipped pairs as a list
dist_vel_pairs = list(zip(distances, velocities))

# Print out the first 5 zipped distance-velocity pairs
print(dist_vel_pairs[:5])
```
[(2.0, 133), (9.16, 664), (16.14, 1794), (17.95, 1594), (21.88, 1473)]

In the above output, we can see the first 5 “zipped” distance-velocity pairs. Each pair is a tuple containing the distance in index 0, and the corresponding velocity in index 1. Now let’s generate 10,000 resamples of distance-velocity pairs using our resample() function in a list comprehension. After generating the 10,000 resamples, let’s use a for loop to perform a linear regression on each of them to get a distribution of 10,000 Hubble Constant, \(\beta\), estimates. Let’s use the LinearRegression() function from the sklearn.linear_model module to perform our linear regressions. In the argument of LinearRegression() we set fit_intercept=False so the regression does not fit an intercept coefficient. This is because there is no intercept in Hubble’s Law.

```python
# Generate 10,000 resamples with a list comprehension
boot_resamples = [resample(dist_vel_pairs, val) for val in range(10000)]

# Calculate beta from a linear regression for each of the 10,000 resamples
# and store them in a list called "betas"
betas = []
for res in boot_resamples:
    # "Unzip" the resampled pairs to separate x and y so we can
    # use them in LinearRegression()
    dist_unzipped, vel_unzipped = zip(*res)
    dist_unzipped = np.array(dist_unzipped).reshape((-1, 1))

    # Find the linear coefficient beta for this resample and append it to the list
    betas.append(LinearRegression(fit_intercept=False).fit(dist_unzipped, vel_unzipped).coef_[0])

# Print out the first 5 beta values
print(betas[:5])
```
[70.49289924780366, 86.37984957925575, 75.39193217270235, 78.0888441398601, 75.35740068419938]

This may take a minute to run because we are performing 10,000 linear regressions. At the end I printed out the first 5 estimates of the Hubble Constant just so we can see what some of them look like. We do see some variability in the values! Let’s now take a look at the distribution of Hubble Constants we found with a histogram.

```python
# Distribution of betas (Hubble Constants)
plt.clf()
plt.hist(betas, bins=50)
plt.title("Distribution of the Hubble Constant")
plt.show()
```

[Figure: histogram of the 10,000 bootstrap estimates of the Hubble Constant]

Now that we have many possible values for the Hubble Constant, \(\beta\), we can calculate the 95% confidence interval on this distribution. This will serve as an approximate confidence interval on the true value of the Hubble Constant. Let’s use the numpy.percentile() function to calculate our confidence interval.

```python
# Calculate the values of the 2.5th and 97.5th percentiles of our distribution of betas
conf_interval = np.percentile(betas, [2.5, 97.5])
print(conf_interval)
```
[66.86548795 86.30720865]

We find the boundaries of our 95% confidence interval are about 66.9 and about 86.3. This quantifies the uncertainty in our estimate: the process of sampling data and calculating a 95% confidence interval captures the true value we’re trying to estimate about 95% of the time. In this case, we’re 95% confident the true value of the Hubble Constant lies between about 66.9 and 86.3 km/s/Mpc.
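To make this coverage interpretation concrete, here is a small simulation, using a made-up normal population purely for illustration: we repeatedly draw a sample, bootstrap a 95% confidence interval for the mean, and count how often the interval captures the true mean.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean = 5.0
n_trials = 200  # kept small so the demo runs quickly
hits = 0

for _ in range(n_trials):
    # Draw a fresh sample from the (normally unknown) population
    sample = rng.normal(loc=true_mean, scale=2.0, size=30)

    # Bootstrap a 95% percentile CI for the mean of this sample
    boot_means = [
        rng.choice(sample, size=len(sample), replace=True).mean()
        for _ in range(500)
    ]
    lo, hi = np.percentile(boot_means, [2.5, 97.5])

    # Count whether the interval captured the true mean
    if lo <= true_mean <= hi:
        hits += 1

print(f"Empirical coverage: {hits / n_trials:.2f}")
```

The empirical coverage typically comes out close to the nominal 95%, slightly under for small samples like this one.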

Let’s replot our histogram with the confidence interval marked by vertical lines:

```python
plt.clf()
plt.hist(betas, bins=50)
plt.title("Distribution of the Hubble Constant with 95% CI Indicated")
plt.axvline(conf_interval[0], color="red")
plt.axvline(conf_interval[1], color="red")
plt.show()
```

[Figure: histogram of the bootstrap distribution with the 95% confidence interval marked in red]

Now let’s take a look at the age of the universe from our estimates. The universe is currently estimated to be about 13.8 billion years old; let’s see how well our estimates match up to this value. When doing our calculations, we must keep our units consistent, so let’s convert megaparsecs (Mpc) to kilometers (km) and seconds (s) to years (yr).

The conversion from Mpc to km:

\[\begin{align} 1 \ \text{Mpc} = 3.09\times 10^{19} \ \text{km} \end{align}\]

The conversion from s to yr:

\[\begin{align} 1 \ \text{yr} = 1 \ \text{yr} * \frac{365 \ \text{days}}{1 \ \text{yr}} * \frac{24 \ \text{hours}}{1 \ \text{day}} * \frac{60 \ \text{minutes}}{1 \ \text{hour}} * \frac{60 \ \text{seconds}}{1 \ \text{minute}} \end{align}\]

\[\begin{align} 1 \ \text{yr} = 365 * 24 * 60^2 \ \text{s} \end{align}\]

Let’s use these conversion factors to convert our Hubble Constant confidence interval to a confidence interval on the age of the universe:

```python
# Calculation of the 95% confidence interval for the age of the universe, in years
conf_interval_age = 3.09e19 / (conf_interval * (365*24*60**2))
conf_interval_age
```
array([1.46537863e+10, 1.13528474e+10])

From this calculation, we’re 95% confident the age of the universe is between about 11.4 billion and 14.7 billion years old. That interval comfortably contains the current estimate of the age of the universe, 13.8 billion years! Very cool.

Summary of What We Did

We used the bootstrap method to randomly resample (with replacement) our 24 galactic relative velocity and distance datapoints 10,000 times, performed a linear regression on each resample to get a distribution of 10,000 estimates of the Hubble Constant, and calculated a 95% confidence interval on that distribution. We then converted this confidence interval into a confidence interval on the age of the universe. These confidence intervals serve as good proxies for what we would obtain if we could compute these values using relative velocity and distance data from the entire population of galaxies in the universe.

References

Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife." The Annals of Statistics, 7(1), 1-26.

Samantha Lomuscio
StatLab Associate
University of Virginia Library
March 28, 2023

