Revealing the Galaxy-Halo Connection Through Machine Learning


1University of California, Santa Cruz

2University of Chicago

3University of Pittsburgh

Abstract

Understanding the connections between galaxy stellar mass, star formation rate, and dark matter halo mass represents a key goal of the theory of galaxy formation. Cosmological simulations that include hydrodynamics, physical treatments of star formation, feedback from supernovae, and the radiative transfer of ionizing photons can capture the processes relevant for establishing these connections. The complexity of these physics can prove difficult to disentangle and can obscure how mass-dependent trends in the galaxy population originate. Here, we train a machine learning method called Explainable Boosting Machines (EBMs) to infer how the stellar mass and star formation rate of nearly 6 million galaxies simulated by the Cosmic Reionization on Computers (CROC) project depend on halo mass, the peak circular velocity of the galaxy during its formation history \(v_\textrm{peak}\), cosmic environment, and redshift. The resulting EBM models reveal the relative importance of these properties in setting galaxy stellar mass and star formation rate, with \(v_\textrm{peak}\) providing the dominant contribution. Environmental properties provide substantial improvements for modeling the stellar mass and star formation rate in only ≲10% of the simulated galaxies. We also provide alternative formulations of EBM models that enable low-resolution simulations, which cannot track the interior structure of dark matter halos, to predict the stellar mass and star formation rate of galaxies computed by high-resolution simulations with detailed baryonic physics.

Highlights

  1. \(S\!F\!R\) and \(M_\star\) primarily depend on \(M_\textrm{vir}\) and \(v_\textrm{peak}\), followed by redshift, environmental density, and environmental gas temperature.

  2. When including \(M_\textrm{vir}\) and \(v_\textrm{peak}\) in the parameter set used to train the EBM, the model recovers better than 97% of the distribution of \(M_\star\) or \(S\!F\!R\) with virial mass \(M_\textrm{vir}\) in the CROC simulations.

  3. If the model fit excludes \(v_\textrm{peak}\), the fraction of outliers in the CROC data relative to the predicted model distribution increases to 7.6% for \(S\!F\!R\) and 2.8% for \(M_\star\).

  4. To ameliorate the degradation of the model performance when excluding \(v_\textrm{peak}\), we define a composite EBM model composed of a weighted sum of two models: a base EBM fit to the main trend of \(S\!F\!R\) and \(M_\star\) with the halo properties, and a second EBM fit to the outliers not represented by the base EBM. The weighting coefficients are themselves determined by an EBM fit.

  5. The composite EBM model improves the performance to \(\approx\) 95–98% accuracy in the distribution of \(S\!F\!R\) or \(M_\star\) with virial mass, even when excluding \(v_\textrm{peak}\) measurements from the training dataset.
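The composite model described in highlight 4 can be sketched in a few lines of Python. The highlights state only that the composite is a weighted sum of a base model and an outlier model, with weights set by a third EBM. The convex-combination form, the clipping of the weight to [0, 1], and all function names below are assumptions made for illustration, not the released implementation.

```python
def composite_predict(x, base_model, outlier_model, weight_model):
    """Blend a base prediction with an outlier prediction.

    Assumed form (not confirmed by the source):
        y(x) = (1 - w(x)) * y_base(x) + w(x) * y_outlier(x)
    where w(x) is itself predicted by a model and clipped to [0, 1].
    """
    w = min(1.0, max(0.0, weight_model(x)))
    return (1.0 - w) * base_model(x) + w * outlier_model(x)


# Hypothetical stand-ins for the three fitted EBMs.
def toy_base(x):
    return 1.0      # prediction following the main trend


def toy_outlier(x):
    return 3.0      # prediction for galaxies off the main trend


def toy_weight(x):
    return 0.25     # predicted weight given to the outlier model


blended = composite_predict({}, toy_base, toy_outlier, toy_weight)
```

With a weight near zero the composite reduces to the base model, so only galaxies flagged by the weight model are affected by the outlier model.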

Demonstration EBM Predicting Star Formation Rate

In this work, we release four models. Two predict the star formation rate and stellar mass of a galaxy as functions of \( M_{\text{vir}} \), \( z \), \( v_{\text{peak}} \), \( \rho_1 \), \( T_1 \), and \( \Upsilon_{0.1} \), and two predict the same quantities as functions of \( M_{\text{vir}} \), \( z \), \( \rho_1 \), \( T_1 \), and \( \Upsilon_{0.1} \) (excluding \( v_{\text{peak}} \)). The input parameters are defined as follows:

    Intrinsic Properties of the Galaxy
  • \( \text{log}_{10}M_{\text{vir}}[M_{\odot}] \): Galaxy virial mass
  • \( z \): Redshift
  • \( \text{log}_{10}v_{\text{peak}}[\text{km}\,\text{s}^{-1}] \): Peak circular velocity of the galaxy over its formation history
    Extrinsic Properties of the Galaxy
  • \( \text{log}_{10}\rho_{1} \): Environmental density, \(\rho_1 \equiv 1 + \Delta_1\), where \( \Delta_1 \) is the dimensionless matter overdensity measured on 1 Mpc scales
  • \( \text{log}_{10}T_{1}[K] \): Environmental gas temperature averaged on 1 Mpc scales
  • \( \text{log}_{10}\Upsilon_{0.1} \): The ratio of the virial mass of the most massive neighbor within 100 kpc (\( M_{\text{max},0.1} \)) to the virial mass of the galaxy, offset by one: \( \Upsilon_{0.1} \equiv 1 + M_{\text{max},0.1}/M_{\text{vir}} \)
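As a minimal sketch of how these six inputs could be assembled from raw quantities, the Python function below applies the definitions \(\rho_1 \equiv 1 + \Delta_1\) and \(\Upsilon_{0.1} \equiv 1 + M_{\text{max},0.1}/M_{\text{vir}}\) and takes base-10 logarithms. The function name, dictionary keys, and example values are hypothetical, and treating a galaxy with no neighbor within 100 kpc as having \(M_{\text{max},0.1} = 0\) is an assumption.

```python
import math


def build_features(m_vir, z, v_peak, delta_1, t_1, m_max_01):
    """Return the six model inputs (hypothetical key names).

    m_vir    : virial mass in solar masses
    z        : redshift
    v_peak   : peak circular velocity in km/s
    delta_1  : dimensionless matter overdensity on 1 Mpc scales
    t_1      : gas temperature averaged on 1 Mpc scales, in K
    m_max_01 : virial mass of the most massive neighbor within 100 kpc
               (assumed to be 0 when there is no neighbor)
    """
    rho_1 = 1.0 + delta_1                  # rho_1 = 1 + Delta_1
    upsilon_01 = 1.0 + m_max_01 / m_vir    # Upsilon_0.1 = 1 + M_max,0.1 / M_vir
    return {
        "log10_Mvir": math.log10(m_vir),
        "z": z,
        "log10_vpeak": math.log10(v_peak),
        "log10_rho1": math.log10(rho_1),
        "log10_T1": math.log10(t_1),
        "log10_Upsilon01": math.log10(upsilon_01),
    }


# Made-up example: an isolated halo at mean density.
features = build_features(
    m_vir=1e10, z=7.0, v_peak=50.0, delta_1=0.0, t_1=1e4, m_max_01=0.0
)
```

The offset of one in both \(\rho_1\) and \(\Upsilon_{0.1}\) keeps the logarithm finite: a region at mean density and a galaxy with no massive neighbor both map to a logarithm of zero.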

The generic EBM is formulated as follows:

\( E[y|\mathbf{x}] = \beta + \sum_{i=0}^{n} f_i(\mathbf{x}_i) + \sum_{i=0}^{n} \sum_{j=0,\, j \ne i}^{n} f_{ij}(\mathbf{x}_i, \mathbf{x}_j) \)

In this case, the EBM models the expected star formation rate given the galaxy features as the sum of a baseline \( \beta \), univariate functions \( f_i \), and bivariate functions \( f_{ij} \). Each univariate function models the impact of a single feature on the expected star formation rate, and each bivariate function models how a pair of features jointly impacts the star formation rate.
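To make the additive structure concrete, the sketch below evaluates a model of this form in plain Python. It assumes each \( f_i \) and \( f_{ij} \) is stored as a piecewise-constant lookup table over bin edges, which is how EBM shape functions are commonly represented. The bin edges, table values, and feature names are invented toy numbers, not the fitted CROC models.

```python
from bisect import bisect_right


def lookup(edges, values, x):
    """Piecewise-constant f_i: values[k] applies in the k-th bin.

    `edges` holds m sorted bin boundaries, `values` holds m + 1 entries.
    """
    return values[bisect_right(edges, x)]


def lookup2d(edges_i, edges_j, table, xi, xj):
    """Piecewise-constant f_ij stored as a 2D table of bin values."""
    return table[bisect_right(edges_i, xi)][bisect_right(edges_j, xj)]


def ebm_predict(x, beta, univariate, bivariate):
    """Evaluate beta + sum_i f_i(x_i) + sum over stored pairs f_ij(x_i, x_j)."""
    total = beta
    for name, (edges, values) in univariate.items():
        total += lookup(edges, values, x[name])
    for (name_i, name_j), (edges_i, edges_j, table) in bivariate.items():
        total += lookup2d(edges_i, edges_j, table, x[name_i], x[name_j])
    return total


# Toy model with two features and one interaction term (all values invented).
beta = -2.0
univariate = {
    "log10_Mvir": ([9.0, 10.0], [-1.0, 0.0, 1.0]),
    "z": ([8.0], [0.2, -0.2]),
}
bivariate = {
    ("log10_Mvir", "z"): ([10.0], [8.0], [[0.0, 0.1], [0.05, -0.05]]),
}

prediction = ebm_predict({"log10_Mvir": 10.5, "z": 7.0}, beta, univariate, bivariate)
```

Because the prediction is a plain sum, each term can be read off separately, which is what allows the contribution of each feature or feature pair to be inspected on its own.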

Below, we provide a demonstration of the EBM trained to predict star formation rate as a function of \( M_{\text{vir}} \), \( z \), \( v_{\text{peak}} \), \( \rho_1 \), \( T_1 \), and \( \Upsilon_{0.1} \). Insert values into the input boxes below and click calculate to see what the EBM predicts for the star formation rate.

[Interactive demonstration: input fields for \( \text{log}_{10}M_{\text{vir}}[M_{\odot}] \), \( z \), \( \text{log}_{10}v_{\text{peak}}[\text{km}\,\text{s}^{-1}] \), \( \text{log}_{10}\rho_{1} \), \( \text{log}_{10}T_{1}[K] \), and \( \text{log}_{10}\Upsilon_{0.1} \), with panels showing the univariate functions \( f_i \) and the bivariate functions \( f_{ij} \).]

Acknowledgements

This work was supported by the NASA Theoretical and Computational Astrophysics Network (TCAN) grant 80NSSC21K0271. The authors acknowledge use of the lux supercomputer at UC Santa Cruz, funded by NSF MRI grant AST 1828315. This manuscript has been co-authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. The CROC project relied on resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. The CROC project is also part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. We have used resources from DOE INCITE award AST 175.