# Estimating Uncertainty in the Black Hole Mass–Velocity Dispersion Relation
## Learning Objectives

- Understand and implement bootstrap resampling to assess model uncertainty.
- Practice using `numpy` to generate random samples.
- Fit and interpret linear models.
- Visualize parameter distributions to quantify fit variability.
## Introduction
The M-sigma relation, also known as the black hole mass-velocity dispersion relation, is an empirical correlation observed between the mass of supermassive black holes (\(M_{bh}\)) at the centers of galaxies and the velocity dispersion (\(\sigma\)) of stars in the bulges of those galaxies. This relationship suggests that there is a tight connection between the properties of galaxies and the central black holes they host. Specifically, galaxies with larger velocity dispersions tend to harbor more massive black holes at their centers.
The M-\(\sigma\) relation has profound implications for our understanding of galaxy formation and evolution, as well as the co-evolution of galaxies and their central black holes. It provides valuable insights into the mechanisms governing the growth and regulation of supermassive black holes, the role of feedback processes in galaxy evolution, and the relationship between the properties of galactic bulges and their central black holes.
## Loading the data
We’ll begin by loading some measurements of black hole mass (`M_bh`) and bulge velocity dispersion (`sigma`) for a set of nearby galaxies.

**Important:** Use the `pandas` library to load the `m-sigma.txt` file as a `DataFrame` (download here); it will read in with column names.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('m-sigma.txt')
df.head()
```
|   | Galaxy | M_bh | sigma | sigma_err | M_bh_err |
|---|--------|------|-------|-----------|----------|
| 0 | MW | 2950000.0 | 100 | 20 | 350000.0 |
| 1 | I1459 | 460000000.0 | 312 | 41 | 280000000.0 |
| 2 | N221 | 3900000.0 | 76 | 10 | 900000.0 |
| 3 | N3115 | 920000000.0 | 278 | 36 | 300000000.0 |
| 4 | N3379 | 135000000.0 | 201 | 26 | 73000000.0 |
## Linear Regression
Linear regression just means “let’s fit a line”! It’s a statistical method used to model the relationship between a dependent variable (often denoted ‘y’) and one or more independent variables (often denoted ‘x’). It assumes that the relationship between the variables can be described by a linear equation, such as a straight line in the case of simple linear regression. As with any inference task, the goal of linear regression is to find the best-fitting line: the one that minimizes the differences between the observed values and the values predicted by the model.

For the simplest case of a straight line, aka a first-order polynomial, we can describe our model using two parameters: slope and intercept. In the case of the M-sigma relation, these fit parameters can be used to constrain our models of galaxy and black hole co-evolution.
### How can we implement this in Python?
There are many ways you can perform linear regression in Python! `np.polyfit` is a function in NumPy used for polynomial regression, which fits a polynomial curve to a set of data points. In the context of linear regression, `np.polyfit` can fit a polynomial of degree 1 (a straight line) to the given data points by minimizing the sum of squared errors between the observed and predicted values. This function returns the coefficients of the polynomial that best fits the data, allowing us to determine the slope and intercept of the regression line.
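As a quick sketch with made-up data (not the M-sigma measurements), illustrating how `np.polyfit` recovers a slope and intercept:

```python
import numpy as np

# Made-up data: y is roughly 2x + 1 with some Gaussian noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Degree-1 fit; coefficients are returned highest power first
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # should be close to 2 and 1
```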
**Exercise 1: Fitting the M-\(\sigma\) Relation**

We will apply the concept of linear regression to analyze the M-sigma relation using observational data. We have provided a dataset containing measurements of black hole masses (`M_bh`), velocity dispersions (`sigma`), and their associated uncertainties. Your task is to perform linear regression to fit both the black hole mass as a function of velocity dispersion (`M_bh` vs. `sigma`) and velocity dispersion as a function of black hole mass (`sigma` vs. `M_bh`).

**Prepare the Data:** Inspect the first few rows of the Pandas DataFrame to familiarize yourself with the structure of the data. The M-sigma relation is typically plotted with the black hole masses in log form, so you will need to convert your data to logspace before fitting the relation. Add new columns to the DataFrame representing log(M_bh) and the error on log(M_bh).
**Hint: Error Propagation in Logspace**

In order to compute the error on log(M_bh), we will need to propagate the error when taking the logarithm (base 10). The formula that describes this is as follows:

$$ \sigma_{\log_{10}(x)} = \frac{1}{\ln(10) \cdot x} \cdot \sigma_x $$

The above will give you the uncertainty on the logarithm of the black hole mass, and you can store this in the DataFrame.
**Plot the Data:** Create a scatter plot of the log of the black hole mass (log(M_bh)) against velocity dispersion (sigma), with error bars representing the uncertainties in both quantities.
**Perform Linear Regression:** In standard least-squares regression, uncertainties are only applied to the dependent variable (y). This means that if you fit \(M_{\mathrm{bh}}\) as a function of \(\sigma\), the method only accounts for uncertainties in \(M_{\mathrm{bh}}\), while ignoring errors in \(\sigma\). But both \(M_{\mathrm{bh}}\) and \(\sigma\) have significant measurement uncertainties in our dataset. To explore this limitation, we will fit the relation in both directions. This is not the same as taking the inverse of one fit: each regression is a different model that incorporates errors on a different variable.

- Fit a linear regression model to the data, treating log(M_bh) as the dependent variable (y) and sigma as the independent variable (x), and incorporate the errors on log(M_bh) in the fit.
- Fit another linear regression model, treating sigma as the dependent variable (y) and log(M_bh) as the independent variable (x), and incorporate the errors on sigma in the fit.
**Hint: Incorporating Errors in the Fit**

Look into the `w` (weights) keyword argument of `np.polyfit` to figure out how to incorporate errors; for Gaussian uncertainties, weights of \(1/\sigma_y\) are the standard choice.
**Visualize the Results:** Plot the regression lines along with the data points to observe how well they fit the M-sigma relation. Compare the slopes and intercepts of the regression lines for both fits. One possible implementation is sketched below.
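Here is a minimal sketch of one way to work through this exercise, assuming the column names shown in `df.head()` above; the variable names (`logM`, `slope1`, etc.) are illustrative, not prescribed.

```python
import numpy as np
import matplotlib.pyplot as plt

# Convert to logspace and propagate the error through log10
df['logM'] = np.log10(df['M_bh'])
df['logM_err'] = df['M_bh_err'] / (np.log(10) * df['M_bh'])

# Fit 1: log(M_bh) as a function of sigma, weighted by the log-mass errors
slope1, int1 = np.polyfit(df['sigma'], df['logM'], deg=1, w=1.0 / df['logM_err'])

# Fit 2: sigma as a function of log(M_bh), weighted by the dispersion errors
slope2, int2 = np.polyfit(df['logM'], df['sigma'], deg=1, w=1.0 / df['sigma_err'])

# Plot the data with error bars and both regression lines
fig, ax = plt.subplots()
ax.errorbar(df['sigma'], df['logM'], xerr=df['sigma_err'], yerr=df['logM_err'],
            fmt='o', capsize=3, label='data')
sig_grid = np.linspace(df['sigma'].min(), df['sigma'].max(), 100)
ax.plot(sig_grid, slope1 * sig_grid + int1, label='fit: logM(sigma)')
# Invert fit 2 (sigma = slope2 * logM + int2) to draw it in the same plane
ax.plot(sig_grid, (sig_grid - int2) / slope2, label='fit: sigma(logM), inverted')
ax.set_xlabel('sigma')
ax.set_ylabel('log10(M_bh)')
ax.legend()
plt.show()
```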
## Beyond a Single Best-Fit Line: Representing Uncertainty in our Fit
By now you should have produced a figure resembling the one below:
In our linear fits, we were only able to account for errors in one variable at a time. Standard least squares regression assumes that all uncertainty lies along the dependent variable (the y-axis). But in the M–\(\sigma\) relation, both quantities (black hole mass and velocity dispersion) have significant observational uncertainties. To explore this, we asked you to fit the relation in two directions: \(M_{\mathrm{bh}}(\sigma)\) and \(\sigma(M_{\mathrm{bh}})\).
As you probably noticed, these two fits give very different slopes! Neither is “more correct” than the other. Rather, the discrepancy reflects the fact that our data are noisy in both dimensions, and with limited data we cannot uniquely pin down the “true” relation. What we can say is that there is uncertainty in the fitted parameters of the M–\(\sigma\) relation.
Up to this point we’ve been thinking about measurement errors (the observational error bars on each galaxy’s \(M_{\mathrm{bh}}\) and \(\sigma\)). But there is a second kind of uncertainty that arises in astronomy: sample variance. If you repeated the experiment with a different set of galaxies, you might measure a slightly different relation. Since we cannot observe every galaxy in the universe, we need a way to estimate how much our finite sample might bias or vary our results.
This is where bootstrap resampling comes in. Bootstrapping provides a way to simulate many alternative versions of our dataset and quantify how much the fitted parameters (slope and intercept) might change from sample to sample. Instead of quoting a single slope, we can quote a slope with an uncertainty range based on the distribution of slopes recovered from many bootstrap resamples.
## Bootstrap Resampling
Bootstrapping is a resampling technique used in statistics to estimate the sampling distribution of a statistic by repeatedly sampling with replacement from the observed data. It is particularly useful when the underlying population distribution is unknown or difficult to model. There are two general steps needed for bootstrap resampling:
1. **Sample Creation:** Bootstrapping starts with the creation of multiple bootstrap samples by randomly selecting observations from the original dataset with replacement. This means that each observation in the original dataset has the same chance of being selected for each bootstrap sample, and some observations may be selected multiple times while others may not be selected at all.
2. **Statistical Estimation:** After creating the bootstrap samples, the statistic of interest (e.g., mean, median, standard deviation, regression coefficient) is calculated for each bootstrap sample. This results in a distribution of the statistic across the bootstrap samples, known as the bootstrap distribution. In our case, this means fitting our line to the data and seeing how much the parameters (slope and intercept) change for each fit. Now we have a way of expressing uncertainty in our fitted parameters!
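As a toy illustration of these two steps, here is a sketch that bootstraps the mean of a made-up sample (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=40)  # made-up observations

boot_means = []
for _ in range(1000):
    # Step 1: create a bootstrap sample of N points, drawn with replacement
    sample = np.random.choice(data, size=len(data), replace=True)
    # Step 2: compute the statistic of interest on the resample
    boot_means.append(sample.mean())

# The spread of the bootstrap distribution estimates the statistic's uncertainty
print(np.mean(boot_means), np.std(boot_means))
```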
The goal for this next exercise is to use bootstrapping to estimate the uncertainty on our fit for the M-sigma relation!
**Exercise 2: Bootstrap Resampling for our Linear Fit to the M-sigma Relation**

Starting from your linear model fit from Exercise 1:

**Perform Bootstrap Resampling:** To do this, you will re-fit the data many times after creating subsamples of randomly drawn datasets (with replacement) from our actual data. In general, you can choose the number of bootstrap samples to generate based on computational resources and desired precision. For this case, you can start by generating ~500 bootstrap samples. For each bootstrap iteration:

- Generate a bootstrap sample from the observed data by randomly sampling, with replacement, N points from the data, where N is the length of the data. Remember: you need to sample x, y, and yerr in some way that preserves the correct pairing (aka you can’t take the black hole mass of one galaxy and the dispersion of another galaxy!). There is a numpy function that can do this for you, so your first task is to search for this function!
**Hint: Sampling with `numpy`**

If you can’t find it, the function you should use for this step is `np.random.choice`. Look at the documentation to figure out how to implement it.

- For each bootstrap sample, fit a linear model to the resampled data using the same procedure as before (first-order polynomial with `np.polyfit`).
- Append the coefficients (slope and intercept) of the linear fit for each bootstrap sample to a container of your choice (probably a list or array).
**Visualize Results:** You should generate some plots to assess our bootstrapping results.

- First, add all 500 lines from your 500 bootstrap fits to your original M-sigma plot. Note: use a low `alpha` for your plots so the lines are semi-transparent! Does this result match your expectations? What can you infer about the magnitude of the uncertainty in the M-\(\sigma\) relation based on this visualization?
- Plot histograms or density plots of the distributions of slopes and intercepts obtained from the bootstrap resampling. Additionally, overlay the slope and intercept derived from the original linear fit on the same plot for comparison.
**Evaluate and Interpret:** Examine the distribution of slopes and intercepts across the bootstrap samples to assess their variability and uncertainty. How well constrained is our fit to the M-sigma relation on the basis of these data? A sketch of the resampling loop is given below.
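Below is a minimal sketch of one way to implement the bootstrap loop, assuming the `logM` and `logM_err` columns created in Exercise 1; all variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

n = len(df)
n_boot = 500
boot_coeffs = []

for _ in range(n_boot):
    # Resample row indices with replacement so each galaxy's
    # (sigma, logM, logM_err) values stay paired together
    idx = np.random.choice(n, size=n, replace=True)
    sample = df.iloc[idx]
    coeffs = np.polyfit(sample['sigma'], sample['logM'], deg=1,
                        w=1.0 / sample['logM_err'])
    boot_coeffs.append(coeffs)

boot_coeffs = np.array(boot_coeffs)  # shape (n_boot, 2): slope, intercept

# Overplot all bootstrap lines with low alpha
sig_grid = np.linspace(df['sigma'].min(), df['sigma'].max(), 100)
fig, ax = plt.subplots()
for slope, intercept in boot_coeffs:
    ax.plot(sig_grid, slope * sig_grid + intercept, color='C0', alpha=0.02)
ax.errorbar(df['sigma'], df['logM'], xerr=df['sigma_err'], yerr=df['logM_err'],
            fmt='ko', capsize=3)
ax.set_xlabel('sigma')
ax.set_ylabel('log10(M_bh)')
plt.show()

# Histograms of the slope and intercept distributions
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(boot_coeffs[:, 0], bins=30)
ax1.set_xlabel('slope')
ax2.hist(boot_coeffs[:, 1], bins=30)
ax2.set_xlabel('intercept')
plt.show()
```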
## Orthogonal Distance Regression and Bootstrapping Together
So far, you have seen that fitting the M-\(\sigma\) relation while considering uncertainty is not straightforward:
- Ordinary least squares can only incorporate uncertainties in one variable at a time.
- Bootstrapping helps us capture how much our finite sample influences the fitted parameters.
But ideally, we would like to account for both sources of uncertainty at once: measurement errors in both \(M_{\mathrm{bh}}\) and \(\sigma\), and the sampling error from having only a finite number of galaxies.
Orthogonal Distance Regression (ODR), also known as total least squares, provides a way to fit a line that accounts for uncertainties in both x and y simultaneously. In Python, this is implemented in the `scipy.odr` module. ODR minimizes the perpendicular distance from each data point to the fit line, rather than only the vertical offset, and can be weighted by the uncertainties on both variables.
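As a minimal sketch of the `scipy.odr` workflow, assuming the log-space columns from Exercise 1 (the starting values here are illustrative):

```python
import numpy as np
from scipy import odr

# Model function: scipy.odr expects f(beta, x) with beta = [slope, intercept]
def line(beta, x):
    return beta[0] * x + beta[1]

model = odr.Model(line)

# RealData accepts standard deviations for both x (sx) and y (sy)
data = odr.RealData(df['sigma'], df['logM'],
                    sx=df['sigma_err'], sy=df['logM_err'])

# beta0 is the required initial guess; a previous np.polyfit result works well
fit = odr.ODR(data, model, beta0=[0.01, 6.0])  # illustrative starting values
output = fit.run()

slope, intercept = output.beta
print(slope, intercept, output.sd_beta)  # best-fit parameters and standard errors
```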
**Exercise 3: Putting It All Together**

In this exercise you will combine what you learned in Exercises 1 and 2 by applying ODR together with bootstrapping.

**Implement ODR for the M–σ Relation:**

- Use the `scipy.odr` package to fit a straight line to the data.
- Pass both \(M_{\mathrm{bh}}\) and \(\sigma\) uncertainties into the fit so that the regression accounts for errors in both variables.
- Compare the ODR fit visually with your earlier least-squares fits.

**Hint: Starting Values**

ODR requires an initial guess for the slope and intercept. Use the coefficients from your earlier `np.polyfit` result as a good starting point.

**Bootstrap the ODR Fit:**

- Perform bootstrap resampling (e.g., 500 iterations) as in Exercise 2, but now apply ODR to each resampled dataset.
- Collect the slope and intercept values from each bootstrap iteration.
**Visualize and Interpret:**

- Overplot the 500 ODR bootstrap fits on your M–σ scatter plot (use a low `alpha` again).
- Plot histograms of the slope and intercept distributions.
- Compare the spread of these distributions to the spread you found in Exercise 2 (least-squares bootstraps). Do the uncertainties look more realistic when both x and y errors are included? A sketch combining ODR with the bootstrap loop is given below.
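Here is a minimal sketch of the combined approach, reusing `df` and the log-space columns from the earlier sketches; the 500-iteration count and starting values are illustrative.

```python
import numpy as np
from scipy import odr

def line(beta, x):
    # scipy.odr model functions take the parameter vector first
    return beta[0] * x + beta[1]

model = odr.Model(line)
n = len(df)
odr_boot = []

for _ in range(500):
    # Resample rows with replacement, keeping each galaxy's values paired
    idx = np.random.choice(n, size=n, replace=True)
    s = df.iloc[idx]
    data = odr.RealData(s['sigma'], s['logM'],
                        sx=s['sigma_err'], sy=s['logM_err'])
    # Seed each fit with plausible least-squares coefficients (illustrative)
    out = odr.ODR(data, model, beta0=[0.01, 6.0]).run()
    odr_boot.append(out.beta)

odr_boot = np.array(odr_boot)  # shape (500, 2): slope, intercept
print('slope: {:.4f} +/- {:.4f}'.format(odr_boot[:, 0].mean(), odr_boot[:, 0].std()))
print('intercept: {:.2f} +/- {:.2f}'.format(odr_boot[:, 1].mean(), odr_boot[:, 1].std()))
```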
By the end of this exercise, you will have a more robust estimate of the M–σ relation that incorporates both measurement errors and sampling variance, providing a more complete picture of the uncertainties in this fundamental relation.