Bootstrapping the M-\sigma relation

Bootstrapping the M-\(\sigma\) relation#

Yasmeen Asali,Imad Pasha, William Cerny, Sebastian Monzon (Yale University)

Description: Learn about bootstrap resampling while fitting the supermassive black hole M-\(\sigma\) relation

Intended Audience: Beginner Undergraduate

tags: libraries:numpy, model-fitting,

Requirements: requirements.txt

Last Updated: August 9, 2024

Learning Objectives

  1. Be able to apply bootstrap resampling in the context of model fitting.

  2. Understand the limitations of linear least squares when data have uncertainties in both directions

Introduction#

The M-sigma relation, also known as the black hole mass-velocity dispersion relation, is an empirical correlation observed between the mass of supermassive black holes (\(M_{bh}\)) at the centers of galaxies and the velocity dispersion (\(\sigma\)) of stars in the bulges of those galaxies. This relationship suggests that there is a tight connection between the properties of galaxies and the central black holes they host. Specifically, galaxies with larger velocity dispersions tend to harbor more massive black holes at their centers.

The M-\(\sigma\) relation has profound implications for our understanding of galaxy formation and evolution, as well as the co-evolution of galaxies and their central black holes. It provides valuable insights into the mechanisms governing the growth and regulation of supermassive black holes, the role of feedback processes in galaxy evolution, and the relationship between the properties of galactic bulges and their central black holes.

Loading the data#

We’ll begin by loading some measurements of black hole mass (M_bh) and bulge velocity dispersion (sigma) for a set of nearby galaxies.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('m-sigma.txt')
df
Galaxy M_bh sigma sigma_err M_bh_err
0 MW 2.950000e+06 100 20 3.500000e+05
1 I1459 4.600000e+08 312 41 2.800000e+08
2 N221 3.900000e+06 76 10 9.000000e+05
3 N3115 9.200000e+08 278 36 3.000000e+08
4 N3379 1.350000e+08 201 26 7.300000e+07
5 N4258 3.900000e+07 138 18 3.400000e+06
6 N4261 5.400000e+08 290 38 1.200000e+08
7 N4342 3.300000e+08 261 34 1.900000e+08
8 N4374 1.700000e+09 286 37 1.200000e+09
9 N4486 3.570000e+09 345 45 1.020000e+09
10 N6251 5.900000e+08 297 39 2.000000e+08
11 N7052 3.700000e+08 261 34 2.600000e+08
12 N821 5.100000e+07 196 26 2.000000e+07
13 N224 3.500000e+07 112 15 2.500000e+07
14 N1023 3.900000e+07 201 14 1.100000e+07
15 N1068 1.700000e+07 149 19 1.300000e+07
16 N2778 2.000000e+07 171 22 1.600000e+07
17 N3377 1.030000e+08 131 17 1.600000e+08
18 N3384 1.850000e+07 151 20 9.100000e+06
19 N3608 1.130000e+08 206 27 1.440000e+08
20 N4291 1.540000e+08 269 35 3.100000e+08
21 N4473 1.026000e+08 188 25 8.200000e+07
22 N4564 5.700000e+07 153 20 1.700000e+07
23 N4649 2.060000e+09 331 43 1.020000e+09
24 N4697 1.220000e+08 163 21 4.000000e+07
25 N5845 3.520000e+08 275 36 2.000000e+08
26 N7457 3.500000e+06 73 10 2.700000e+06

The relation is typically plotted with the black hole masses in log form. Let’s add two columns to

plt.plot(df.sigma,df.M_bh,'s')
plt.yscale('log')
../../_images/d5deb5183f94d0df97456cf1a15ddf3dadfabf15fb20b890fe74c2b78018e573.png
log_BH = np.log10(df.M_bh)
BH_err = df.M_bh_err # note: we have already converted the errors to logspace for you! 
fig, ax = plt.subplots(figsize=(7,7))
ax.errorbar(df.sigma, log_BH, yerr=BH_err, xerr=df.sigma_err, fmt='None', color='gray')
ax.plot(df.sigma, log_BH, 's', color='C0', ms=12, alpha=0.5, mec='k')
ax.set_xlabel('Velocity Dispersion [km/s]', fontsize=16)
ax.tick_params(which='both', top=True, right=True, direction='in', length=6)
ax.set_ylabel(r'log M$_{\rm BH}$ [$M_{\odot}$]', fontsize=16);
../../_images/1b7420f8b1d0bf8dfce546019f64a68cd72cef61195c974f422a1af58a12711d.png

Last time, we showed you how you could use np.polyfit to fit a line. The code we used to do this looked like:

# Fit a straight line (polynomial of degree 1) to the data
coefficients = np.polyfit(x, y, deg=1)

# For a first order polynomial, there are two coefficients
# Extract the slope (m) and intercept (b) from the coefficients
slope, intercept = coefficients 

# Generate points along the fitted line for plotting
x_fit = np.linspace(min(x), max(x), 100)
y_fit = slope * x_fit + intercept

We can do this more efficiently using the np.polyval function to generate y_fit without needing to write out the equation y = mx + b.

coefficients = np.polyfit(df.sigma,log_BH,deg=1,w=1/BH_err)
x_fit = np.linspace(50,400,100)
y_fit = np.polyval(coefficients, x_fit)
fig, ax = plt.subplots(figsize=(7,7))
ax.errorbar(df.sigma, log_BH, yerr=BH_err, xerr=df.sigma_err, fmt='None', color='gray')
ax.plot(df.sigma, log_BH, 's', color='blue', ms=12, alpha=0.5, mec='k')
ax.plot(x_fit, y_fit, 'k', lw=3)
ax.set_xlabel('Velocity Dispersion [km/s]', fontsize=16)
ax.tick_params(which='both', top=True, right=True, direction='in', length=6)
ax.set_ylabel(r'log M$_{\rm BH}$ [$M_{\odot}$]', fontsize=16);
../../_images/1ce3e2fdc62c34c8f27cf101d12a3b07b2509d0b1bd79e016901ce88d3354b02.png

Fitting x(y)#

We also discussed how linear least squares is only accounting for the uncertainties in the independent variable (here our \(y\)-axis). But both quantities being plotted had major uncertainties, so we asked you to fit x(y) as well, using the errors on x instead of y. The solution to that attempt is below.

coeff2 = np.polyfit(log_BH,df.sigma,deg=1,w=1/df.sigma_err)
in_masses = np.linspace(-1.5,2,100)
out_velocities = np.polyval(coeff2,in_masses)
fig, ax = plt.subplots(figsize=(7,7))
ax.errorbar(df.sigma, log_BH, yerr=BH_err, xerr=df.sigma_err, fmt='None', color='gray')
ax.plot(df.sigma, log_BH, 's', color='blue', ms=12, alpha=0.5, mec='k')
ax.plot(x_fit, y_fit, 'k', lw=3, label='fit y(x)')
ax.plot(out_velocities, in_masses, 'r', lw=3, label='fit x(y)')
ax.set_xlabel('Velocity Dispersion [km/s]', fontsize=16)
ax.tick_params(which='both', top=True, right=True, direction='in', length=6)
ax.legend()
ax.set_ylabel(r'log M$_{\rm BH}$ [$M_{\odot}$]', fontsize=16);
../../_images/edfb0e9a495e32e9b87ff5a973d1fe22da7ca132caa3048fde37896292325648.png

Bootstrap Resampling#

Bootstrapping is a resampling technique used in statistics to estimate the sampling distribution of a statistic by repeatedly sampling with replacement from the observed data. It is particularly useful when the underlying population distribution is unknown or difficult to model.

  1. Sample Creation: Bootstrapping starts with the creation of multiple bootstrap samples by randomly selecting observations from the original dataset with replacement. This means that each observation in the original dataset has the same chance of being selected for each bootstrap sample, and some observations may be selected multiple times while others may not be selected at all.

  2. Statistical Estimation: After creating the bootstrap samples, the statistic of interest (e.g., mean, median, standard deviation, regression coefficient) is calculated for each bootstrap sample. This results in a distribution of the statistic across the bootstrap samples, known as the bootstrap distribution. In our case, this would be fitting our line to the data and seeing how much the paramters (slope and intercept) change for each fit. Now, we have some way of expressing uncertainty in our fitted parameters!

The goal for today is to use bootstrapping to estimate the uncertainty on our fit for the M-sigma relation!

bootstrapping_image

Exercise 1: Bootstrap Resampling for our Linear Fit to the M-sigma Relation

Using your code from Monday, start by loading the data and fitting a linear model.

  1. Perform Bootstrap Resampling: To do this, you will re-fit the data many times after creating subsamples of randomly drawn datasets (with replacement) from our actual data. In general, you can choose the number of bootstrap samples to generate based on computational resources and desired precision. For this case, you can start by generating ~500 bootstrap samples. For each bootstrap iteration:

    • Generate a bootstrap sample from the observed data by randomly sampling with replacement N points from the data where N is the length of the data. Remember: you need to sample x, y, and yerr in some way that preserves the correct ordering (aka you can’t take the black hole mass of one galaxy and the diserpersion of another galaxy!). There is a numpy function that can do this for you, so your first task is to search for this function!

    • For each bootstrap sample, fit a linear model to the resampled data using the same procedure as before (first order polynomial with np.polyfit).

    • Append the coefficients (slope and intercept) of the linear fit for each bootstrap sample to a container of your choice (probably a list or array).

  2. Visualize Results: You should generate some plots to assess our bootstrapping results.

    • First, add all 500 lines from your 500 bootstrap fits to your original M-sigma plot. Note: use a low alpha for your plots so the lines are semi-transparent!

    • Plot histograms or density plots of the distributions of slopes and intercepts obtained from the bootstrap resampling. Additionally, overlay the slope and intercept derived from the original linear fit on the same plot for comparison.

  3. Evaluate and Interpret: Examine the distribution of slopes and intercepts across the bootstrap samples to assess their variability and uncertainty. How well-constrained is our fit to the M-sigma relation on the basis of this data?