Astrophysicists Show How to ‘Weigh’ Galaxy Clusters with Artificial Intelligence

Galaxy clusters are the most massive objects in the universe, and understanding them is crucial to pinning down the origin and evolution of our universe

Illustration of bright circular galaxy with dust lanes, nebulae and myriads of stars.

Machine learning can help us better understand galaxy clusters, the largest things in the universe, according to new research (Adobe Stock).

Scholars from the Institute for Advanced Study and UConn’s Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project have used a machine learning algorithm known as “symbolic regression” to generate new equations that help solve a fundamental problem in astrophysics: inferring the mass of galaxy clusters. Their work was recently published in Proceedings of the National Academy of Sciences.

Galaxy clusters are the most massive objects in the Universe: a single cluster contains anything from a hundred to many thousands of galaxies, alongside collections of plasma, hot X-ray emitting gas, and dark matter. These components are held together by the cluster’s own gravity. Understanding such galaxy clusters is crucial to pinning down the origin and continuing evolution of our universe.

“Measuring how many clusters exist, and then what their masses are, can help us understand fundamental properties like the total matter density in the universe, the nature of dark energy, and other fundamental questions,” says co-author and UConn Professor of Physics Daniel Anglés-Alcázar.

Perhaps the most crucial quantity determining the properties of a galaxy cluster is its total mass. But measuring this quantity is difficult: galaxies cannot be “weighed” by placing them on a scale. The problem is further complicated by the fact that the dark matter making up much of a cluster’s mass is invisible. Instead, scientists infer the mass of a cluster from other observable quantities.

Anglés-Alcázar notes another limitation: there are many ideas for how galaxies form and evolve and give rise to galaxy clusters, but there are still uncertainties on some of these processes.

Previously, scholars considered a cluster’s mass to be roughly proportional to another, more easily measurable quantity called the “integrated electron pressure” (or the Sunyaev-Zel’dovich flux, often abbreviated to YSZ). The theoretical foundations of the Sunyaev-Zel’dovich flux were laid in the early 1970s by Rashid Sunyaev, current Distinguished Visiting Professor in the Institute for Advanced Study’s School of Natural Sciences, and his collaborator Yakov B. Zel’dovich.

However, the integrated electron pressure is not a reliable proxy for mass because it can behave inconsistently across different galaxy clusters. The outskirts of clusters tend to exhibit very similar YSZ, but their cores are much more variable. The YSZ/mass equivalence was problematic in that it gave equal weight to all parts of the cluster. As a result, a lot of “scatter” was observed, meaning that the error bars on the mass inferences were large.

The trade-offs between different machine learning techniques. Symbolic regression is much less powerful than deep neural networks on high-dimensional datasets, but it is much more interpretable as it provides mathematical equations as output.
The trade-offs between different machine learning techniques. Symbolic regression is much less powerful than deep neural networks on high-dimensional datasets, but it is much more interpretable as it provides mathematical equations as output (courtesy of Digvijay Wadekar).

Digvijay Wadekar, current Member of the Institute for Advanced Study’s School of Natural Sciences, has worked with collaborators across ten different institutions to develop an AI program to improve the understanding of the relationship between the mass and the YSZ.

Wadekar and his collaborators “fed” their AI program with state-of-the-art cosmological simulations developed by groups at Harvard and at the Center for Computational Astrophysics in New York. Their program searched for and identified additional variables that might make inferring the mass from the YSZ more accurate.

Anglés-Alcázar explains the CAMELS collaboration provided a large suite of simulations where the researchers could measure the properties of galaxies and clusters, and how they depend on the assumptions of the underlying galaxy formation physics.

“Because we have thousands of parallel universes simulated under different assumptions, when we train a machine learning algorithm on large amounts of simulated data, we can test whether the predictions are robust relative to those variations or not.”

AI is useful for identifying new parameter combinations that could be overlooked by human analysts. While it is easy for human analysts to identify two significant parameters in a data set, AI is better able to parse through high data volumes often revealing unexpected influencing factors.

“Machine learning can be fantastic for making predictions,” says Anglés-Alcázar, “but it’s only as good as the data you use to train the machine learning model. For this type of application, it’s important to have simulations that can represent the real universe accurately enough, and understand the range of uncertainties, so that when you train the machine learning model, hopefully, you can apply that to real galaxy clusters to improve the actual measurements in the real universe.”

More specifically, the AI method the researchers employed is known as symbolic regression. “Right now, a lot of the machine learning community focuses on deep neural networks,” Wadekar says. “These are very powerful but the drawback is that they are almost like a black box. We cannot understand what goes on in them. In physics, if something is giving good results, we want to know why it is doing so. Symbolic regression is beneficial because it searches a given dataset and generates simple, mathematical expressions in the form of simple equations that you can understand. It provides an easily interpretable model.”

Their symbolic regression program handed them a new equation, which was able to better predict the mass of the galaxy cluster by augmenting YSZ with information about the cluster’s gas concentration. Wadekar and his collaborators then worked backward from this AI-generated equation and tried to find a physical explanation for it. They realized that gas concentration is in fact correlated with the noisy areas of clusters where mass inferences are less reliable. Their new equation, therefore, improved mass inferences by providing a way for these noisy areas of the cluster to be “down-weighted”. In a sense, the galaxy cluster can be compared to a spherical doughnut. The new equation extracts the jelly at the center of the doughnut (that introduces larger errors), and concentrates on the doughy outskirts for more reliable mass inferences.

The new equations can provide observational astronomers engaged in upcoming galaxy cluster surveys with better insights into the mass of the objects that they observe. “There are quite a few surveys targeting galaxy clusters which are planned in the near future,” Wadekar says. “Examples include the Simons Observatory (SO), the Stage 4 CMB experiment (CMB-S4), and an X-ray survey called eROSITA. The new equations can help us in maximizing the scientific return from these surveys.”

He also hopes that this publication will be just the tip of the iceberg when it comes to using symbolic regression in astrophysics. “We think that symbolic regression is highly applicable to answering many astrophysical questions,” Wadekar says. “In a lot of cases in astronomy, people make a linear fit between two parameters and ignore everything else. But nowadays, with these tools, you can go further. Symbolic regression and other artificial intelligence tools can help us go beyond existing two-parameter power laws in a variety of different ways, ranging from investigating small astrophysical systems like exoplanets to galaxy clusters, the biggest things in the universe.”