Symbolic regression in the physical sciences

28 - 29 April 2025 09:00 - 17:00 The Royal Society Free Watch online

Discussion meeting organised by Dr Deaglan Bartlett, Dr Harry Desmond, Professor Pedro G Ferreira and Professor Gabriel Kronberger.

Symbolic Regression is a branch of Machine Learning that attempts to find interpretable mathematical expressions which can accurately approximate a data set. This meeting will bring together practitioners of Symbolic Regression with physicists who are tackling problems which are particularly amenable to their analysis.

Attending the meeting

This event is intended for researchers in relevant fields.

  • Free to attend
  • Both virtual and in-person attendance are available. Advance registration is essential. Please follow the link to register
  • Lunch is available on both days of the meeting for an optional £25 per day. There are plenty of places to eat nearby if you would prefer to purchase food offsite
  • Participants are welcome to bring their own lunch to the meeting

Enquiries: Scientific Programmes team.

Organisers

  • Dr Deaglan John Bartlett, Institut d’Astrophysique de Paris, France

    Deaglan is a Postdoctoral Fellow at the Institut d'Astrophysique de Paris, where he works on projects at the border of artificial intelligence and Bayesian inference problems, and their applications to cosmology. He was awarded his MA and MSci in Natural Sciences at the University of Cambridge (Trinity College), and obtained his DPhil in Astrophysics at the University of Oxford, where he was the Graduate Teaching and Research Scholar in Physics at Oriel College.

    He is interested in statistical and machine learning methodology in astrophysics and cosmology, Bayesian large-scale structure inference, field-level inference, and probing fundamental physics with astronomical surveys. His work centres on the reconstruction of the initial conditions of the Universe and constraining cosmological parameters through Bayesian forward modelling of large scale structure, the acceleration of simulations and inference methods through machine learning and the development of new methods and applications of interpretable machine learning with symbolic regression.

  • Dr Harry Desmond, University of Portsmouth, UK

    Harry Desmond is a University Research Fellow at the Institute of Cosmology and Gravitation, University of Portsmouth. His interests in astrophysics include galaxy dynamics & phenomenology, the galaxy-halo connection, constrained cosmological simulations, stellar structure, modified gravity and dark matter. He is also interested in astrostatistics, particularly Bayesian methods and field-level inference. In symbolic regression he was a developer of the Exhaustive Symbolic Regression algorithm, which also introduced description length as a joint measure of functions’ accuracy and complexity. He is keen both on methodological developments in symbolic regression – for example developing hybrids between the exhaustive and stochastic (eg genetic programming) approaches – and on applications in astrophysics and cosmology. This includes creating more principled fitting functions, or even learning new physics.

  • Professor Pedro Ferreira, University of Oxford, UK

    Pedro G Ferreira is a Professor of Astrophysics in the Physics Department of the University of Oxford and a Director of the Beecroft Institute of Particle Astrophysics and Cosmology.

    His main field of expertise is cosmology, in particular the early universe and the large scale structure of the universe with a focus on trying to extract information about fundamental physics from these data sets. This has led him to explore how one might test and constrain General Relativity on large scales. Recently he has become interested in black holes, gravitational waves and, inevitably, he has also developed an interest in machine learning with a particular focus on methods for inferring physical laws or equations using symbolic regression.

  • Dr Gabriel Kronberger, University of Applied Sciences Upper Austria, Austria

    Gabriel Kronberger is Professor at the University of Applied Sciences Upper Austria and has been working on algorithms for symbolic regression for more than 15 years. From 2018 until 2022 he led the Josef Ressel Center for Symbolic Regression. In 2024, he published the book "Symbolic Regression" together with Burlacu, Kommenda, Winkler, and Affenzeller. His current research interests are symbolic regression for physics-based machine learning and applications in science and engineering.

Schedule

Chair

Professor Pedro Ferreira, University of Oxford, UK

09:00-09:05 Welcome by the Royal Society and lead organiser
Dr Deaglan John Bartlett, Institut d'Astrophysique de Paris, France

09:05-09:30 (Exhaustive) Symbolic Regression and model selection by minimum description length

After a broad overview of the symbolic regression landscape, I will present the ambitious programme of searching function space exhaustively with a single, objective metric that combines accuracy and simplicity. At low complexity, an exhaustive search is afforded by our algorithm Exhaustive Symbolic Regression (ESR), which generates all possible simple functions produced from a user-defined basis set of operators. Single-objective optimisation is possible by means of the Minimum Description Length (MDL) principle, which puts function accuracy (log-likelihood of the data) and complexity (both structural and parametric) on the same scale (number of nats) to allow them to be directly traded off. I will showcase the ESR+MDL combination on several hot topics in astrophysics, and discuss potential future developments.
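As a rough illustration of how accuracy and complexity can be placed on a common scale in nats, the sketch below scores a fitted function by adding a structural term and a parametric term to its negative log-likelihood. The function name, its arguments and the exact form of each term are assumptions made for illustration; the abstract does not spell out the formula used by ESR, which may differ.

```python
import math


def description_length(neg_log_likelihood, n_nodes, n_basis_symbols, params, fisher_diag):
    """Simplified description length in nats (illustrative, not the ESR formula)."""
    # Structural cost: each of the n_nodes tree nodes is one of n_basis_symbols symbols.
    structural = n_nodes * math.log(n_basis_symbols)
    # Parametric cost: grows with each parameter's magnitude and with how sharply
    # the likelihood constrains it (assumed diagonal Fisher information).
    parametric = sum(
        0.5 * math.log(info) + math.log(abs(theta))
        for theta, info in zip(params, fisher_diag)
    )
    parametric -= 0.5 * len(params) * math.log(3)
    # Accuracy cost: negative log-likelihood of the data at the best-fit parameters.
    return neg_log_likelihood + structural + parametric


# Hypothetical numbers: a 3-node function with one parameter versus a
# 7-node function with two parameters and a slightly better likelihood.
simple_model = description_length(10.0, 3, 5, [2.0], [100.0])
complex_model = description_length(9.5, 7, 5, [2.0, 0.5], [100.0, 100.0])
print(simple_model < complex_model)
```

With these made-up inputs the simpler function has the shorter description length, because its small loss in likelihood is outweighed by the saving in structural and parametric cost.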

Dr Harry Desmond, University of Portsmouth, UK

09:30-09:45 Discussion
09:45-10:15 Symbolic regression in beyond Standard Model physics

In this talk I will discuss symbolic regression as a tool for studying new ideas in high energy physics. Many proposals for new physics that goes beyond the Standard Model yield a high dimensional parameter space that is difficult to examine by conventional means. Traditional exhaustive search methods are often time consuming and are possible for only the simplest of theories. This situation holds across surprisingly many activities, from confronting theories with experimental data, to model building in string theory. In this talk I will describe ongoing work that brings symbolic regression to bear on such problems. The first example I will discuss is one in which a phenomenological proposal for new physics confronts experimental data. The model I consider is the so-called Constrained Minimal Supersymmetric Standard Model, which has a four-dimensional parameter space. We provide a set of analytical expressions that reproduce low-energy observables of interest in terms of the fundamental parameters of the theory. These observables are traditionally derived through a long chain of difficult computation, but we will see that this process can be significantly short-cut using symbolic regression, which provides a set of powerful formulae that can be used in analyses instead. The second problem I will discuss occurs in string theory, where one would like to determine the Kähler metric, which is crucial for determining the effective 4D physics. Typically it is very difficult to find an analytic formula for the Kähler metric, although approximations can often be found using, for example, the Donaldson algorithm. I will discuss how symbolic regression could play a crucial role in providing the desired analytic expressions. The talk will aim to provide pedagogical descriptions of the physical problems.

Professor Steven Abel, Durham University, UK

10:15-10:30 Discussion
10:30-11:00 Break
11:00-11:30 Constitutive modelling using symbolic regression

Process-structure-property relationships are fundamental in material science and engineering and key to developing new and better materials. Symbolic regression is a powerful tool for discovering mathematical models that describe these relationships. It can automatically generate equations that predict material behaviour under given manufacturing conditions and optimise its performance, e.g. strength and elasticity. Unlike the commonly used “black-box” machine learning models, the results are interpretable and provide insights into the studied system, e.g. the influence of individual alloying elements or processing parameters like temperature. This method thus facilitates the design of new materials with desired properties, the optimization of manufacturing processes, and the acceleration of material design without relying on predefined models or assumptions. The forthcoming lecture will illustrate how symbolic regression can derive the constitutive laws that describe the behaviour of various metallic alloys during plastic deformation. Constitutive modelling is a vital mathematical framework for understanding the relationship between stress and strain in materials under different loading conditions. It clarifies how materials respond to external forces by employing established material laws or equations. Such models are indispensable for predicting material behaviour in engineering applications, including structural analysis, design, and simulations.

Dr Evgeniya Kabliman, Technical University of Munich, Germany

11:30-11:45 Discussion
11:45-12:15 Speaker to be confirmed
12:15-12:30 Discussion

Chair

Dr Deaglan John Bartlett, Institut d’Astrophysique de Paris, France

13:30-14:00 Brush: incorporating split-wise functions and multi-armed bandits into symbolic regression

Symbolic Regression, a form of supervised learning, attempts to solve the NP-hard problem of searching for both the functional form and its free parameters. It is useful when the objective is a concise and accurate mathematical description of the data.

Decision trees, an often-considered alternative, can be easy to interpret but grow exponentially, whereas mathematical functions can represent many non-linearities that are otherwise challenging to interpret.

We propose Brush, a multi-objective symbolic regression framework that combines the decision tree’s split-wise structures with flexible conditionals to build mathematical expressions.

The inclusion of split-wise operations enables the combination of the interpretability of decision trees with the flexibility of mathematical expressions.

To complement the evolutionary loop, we test static and dynamic, contextual and context-free, multi-armed bandits that adapt the sampling process for symbols during search.

We benchmark Brush against several state-of-the-art algorithms over 250 datasets, reporting R², training time, recovery rate of ground-truth equations, and final expressions.

Brush showed Pareto-optimal performance in R² and model size ranks, and multi-armed bandits shifted the trade-off between accuracy and expression size along the optimal curve.

Notably, training time was shorter than that of zero-shot, transformer-based approaches on a single core.

Our experiments highlight how Brush effectively navigates a much larger search space by incorporating split-wise equations.

This tool shows promise as both a first-principles modelling technique and for learning interpretable prediction models, enhancing the practical applications of symbolic regression in various fields.
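To make the idea of a split-wise expression concrete, here is a minimal sketch of an expression tree in which a decision-tree-style split chooses between two ordinary sub-expressions. The tuple encoding, the node names and the `evaluate` function are hypothetical and are not taken from Brush.

```python
def evaluate(expr, row):
    """Evaluate a small expression tree on one data row (a dict of feature values)."""
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "var":
        return row[expr[1]]
    if kind == "add":
        return evaluate(expr[1], row) + evaluate(expr[2], row)
    if kind == "mul":
        return evaluate(expr[1], row) * evaluate(expr[2], row)
    if kind == "split":
        # Split node: pick a sub-expression depending on a feature threshold.
        _, feature, threshold, below, above = expr
        branch = below if row[feature] < threshold else above
        return evaluate(branch, row)
    raise ValueError(f"unknown node type: {kind}")


# f(x) = 3*x if x < 2 else x + 4
model = (
    "split", "x", 2.0,
    ("mul", ("const", 3.0), ("var", "x")),
    ("add", ("var", "x"), ("const", 4.0)),
)
low = evaluate(model, {"x": 1.0})
high = evaluate(model, {"x": 5.0})
print(low, high)
```

The resulting model is still a single readable expression, but it can represent piecewise behaviour that a purely smooth formula would struggle to capture.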

Dr William G La Cava, Boston Children's Hospital, USA

14:00-14:15 Discussion
14:15-14:45 A Galois theorem for machine learning: functions on symmetric matrices and point clouds via lightweight invariant features

We present a mathematical formulation for machine learning of (1) functions on symmetric matrices that are invariant with respect to the action of permutations by conjugation, and (2) functions on point clouds that are invariant with respect to rotations, reflections, and permutations of the points. To achieve this, we provide a general construction of generically separating invariant features using ideas inspired by Galois theory. We construct O(n^2) invariant features derived from generators for the field of rational functions on n × n symmetric matrices that are invariant under joint permutations of rows and columns. We show that these invariant features can separate all distinct orbits of symmetric matrices except for a measure zero set; such features can be used to universally approximate invariant functions on almost all weighted graphs. For point clouds in a fixed dimension, we prove that the number of invariant features can be reduced, generically without losing expressivity, to O(n), where n is the number of points. We combine these invariant features with DeepSets to learn functions on symmetric matrices and point clouds with varying sizes. We empirically demonstrate the feasibility of our approach on molecule property regression and point cloud distance prediction. 

Assistant Professor Soledad Villar, Johns Hopkins University, USA

14:45-15:00 Discussion
15:00-15:30 Break
15:30-16:00 Derivable scientific discovery

Scientists aim to create mathematical models that accurately describe observed phenomena. In the past, models were manually created from domain knowledge and subsequently validated using data. More recently, models are automatically extracted from large datasets using machine learning algorithms. However, finding meaningful models from data remains an ongoing challenge. In this talk I will explore two novel approaches for deriving scientific laws that exploit both experimental data and axiomatic knowledge, enabling the creation of interpretable models with minimal data requirements: (1) AI-Descartes utilizes symbolic regression to extract models from data and then applies logical reasoning to select those that best align with established axioms; (2) AI-Hilbert combines polynomial optimization and logical constraints, ensuring both theoretical consistency and empirical validation at the same time. Together, these methods offer a new perspective on how to integrate data and logic, establishing the new paradigm of derivable scientific discovery. More information can be found at ai-descartes.github.io and ai-hilbert.github.io.

Dr Cristina Cornelio, Samsung AI, UK

16:00-16:15 Discussion
16:15-16:45 Sparse regression for symbolic representations in latent space dynamics

Sensing is a universal task in science and engineering. Downstream tasks from sensing include learning dynamical models, inferring full state estimates of a system (system identification), control decisions, and forecasting. These tasks are exceptionally challenging to achieve with limited sensors, noisy measurements, and corrupt or missing data. Existing techniques typically use current (static) sensor measurements to perform such tasks and require principled sensor placement or an abundance of randomly placed sensors. In contrast, we propose a SHallow REcurrent Decoder (SHRED) neural network structure which incorporates (i) a recurrent neural network (LSTM) to learn a latent representation of the temporal dynamics of the sensors, and (ii) a shallow decoder that learns a mapping between this latent representation and the high-dimensional state space.  The latent space is critical for understanding the symbolic dynamics, which provides an on-the-fly compression for modelling physical and engineering systems. Forecasting in the latent space symbolic representation is also achieved from the sensor time-series data alone, producing an efficient paradigm for predicting temporal evolution with an exceptionally limited number of sensors. In the example cases explored, including turbulent flows, complex spatio-temporal dynamics can be characterized with exceedingly limited sensors that can be randomly placed with minimal loss of performance.

Professor Jose Nathan Kutz, University of Washington, USA

16:45-17:00 Discussion

Chair

Dr Harry Desmond, University of Portsmouth, UK

09:00-09:30 Accelerating cosmological modelling with symbolic regression

Cosmology seeks to understand the structure and evolution of the universe by comparing theoretical models to observational data. Achieving this requires exploring a vast parameter space, varying quantities such as matter density, dark energy properties, neutrino mass, and the initial conditions of cosmic structure. Running large-volume, high-resolution simulations across this space is computationally expensive, and a direct comparison to data would demand an infeasible number of simulations. To address this, emulation techniques have been developed to approximate simulation outputs efficiently. Recently, symbolic regression has emerged as a promising approach for emulating cosmological processes, offering interpretable and computationally efficient models. This talk will review the role of symbolic regression in cosmology, its application to simulation-based inference, and its potential for accelerating scientific discovery.

Dr Deaglan John Bartlett, Institut d'Astrophysique de Paris, France

09:30-09:45 Discussion
09:45-10:15 Concept evolution and SymbolicRegression.jl as a modular research platform

I will introduce SymbolicRegression.jl (and its Python frontend, PySR), and discuss its use as a modular platform for creating symbolic regression algorithms, including how its generic interface enables other researchers to make use of efficient distributed computation, fast expression evaluation, higher-order symbolic/reverse-mode/forward-mode differentiation, as well as dimensional analysis. A key example of this extensibility is our recent method LaSR, or "Library Augmented Symbolic Regression", which augments symbolic regression with "concept evolution"—leveraging large language models to evolve and crossover abstract ideas that guide search. By building on SymbolicRegression.jl, LaSR overrides core components while maintaining the efficiency of the underlying engine. This talk will explore concepts on both the engineering side of SymbolicRegression.jl, as well as the methodological side of LaSR.

Dr Miles Cranmer, University of Cambridge, UK

10:15-10:30 Discussion
10:30-11:00 Break
11:00-11:30 Multi-view Symbolic Regression: from independent experiments to general laws

Symbolic regression (SR) searches for analytical expressions representing the relationship between a set of explanatory and response variables. Current SR methods assume a single dataset extracted from a single experiment. Frequently, however, the researcher is confronted with multiple sets of results obtained from the same experiment conducted with different set-ups. In this context, traditional SR methods may only describe the experiments independently, and will fail to capture the common phenomenon that generated each particular dataset. Multi-view Symbolic Regression (MvSR) overcomes this limitation by taking into account multiple datasets simultaneously, effectively mimicking experimental environments. Such an approach fits the evaluated expression to each independent dataset and returns a parametric family of functions f(x; θ) capable of accurately fitting all datasets simultaneously. It is therefore particularly suited to many practical scientific use cases, where the aim is often to discover a general law to model a phenomenon. Recently, multiple approaches similar to MvSR have emerged. In this paper, we evaluate them regarding different aspects, including computation speed, quality of the fits and simplicity of the solutions. We place a special emphasis on applying the methods to real-world data from a wide range of scientific fields, which allows for a better understanding of their direct applicability to scientific communities. We demonstrate their good performance and offer insights on how to improve them further.
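The multi-view idea can be sketched as follows: a candidate functional form is fitted separately to each dataset, with its own parameters per dataset, and the per-dataset errors are then combined into one score for the form. The two candidate forms, the helper names and the choice of the worst-case (maximum) error as the aggregate are assumptions made for this illustration and are not taken from the MvSR implementation.

```python
def mse_affine(xs, ys):
    """Best-fit mean squared error of f(x; a, b) = a*x + b on one dataset."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def mse_proportional(xs, ys):
    """Best-fit mean squared error of f(x; a) = a*x on one dataset."""
    a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def multi_view_loss(fit_one_view, views):
    """Score a functional form by its worst fit across all views (assumed aggregate)."""
    return max(fit_one_view(xs, ys) for xs, ys in views)


# Two hypothetical experiments following the same law with different parameters:
# y = 2x + 1 and y = -3x + 4.
views = [
    ([0, 1, 2, 3], [1, 3, 5, 7]),
    ([0, 1, 2, 3], [4, 1, -2, -5]),
]
affine_loss = multi_view_loss(mse_affine, views)
proportional_loss = multi_view_loss(mse_proportional, views)
print(affine_loss, proportional_loss)
```

Here the affine form describes both experiments with different parameter values, whereas the proportional form fits neither well, so the affine family would be preferred as the common law.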

Dr Etienne Russeil, Stockholm University, Sweden

11:30-11:45 Discussion
11:45-12:15 Symbolic regression via posterior sampling

Symbolic regression is a machine learning approach for producing interpretable equations from data but can suffer from overfitting and bloat, especially when data is noisy or sparse. Recent works have partially addressed these issues by recasting the selection of models in a genetic programming-based symbolic regression (GPSR) framework as a Bayesian model selection task. While this enhancement, referred to as Bayesian GPSR, was shown to reduce bloat and overfitting, remaining challenges included a potential lack of diversity in the final population and difficulty interpreting the relative fitness of resulting equations. This work departs from genetic programming and instead introduces an approach for sampling directly from the posterior distribution of models. As such, each equation in the final sample is associated with a computed likelihood of producing the observed data, enabling more principled post-processing and model discovery in the presence of uncertainty. The method employs a Laplace approximation of the marginal likelihood for computational efficiency and uses sequential Monte Carlo (SMC) to generate the posterior samples. A key benefit of SMC is the use of likelihood annealing, which gradually introduces the effect of the data during model evolution. It was hypothesized that such a feature would improve equation diversity by avoiding oversampling from local minima. The algorithm is compared to GPSR and Bayesian GPSR on a suite of benchmark problems, comparing trade-offs in computational cost, equation diversity, predictive error, and the ability to recover the true data-generating function from noisy, sparse data.
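For readers unfamiliar with the two ingredients named above, their standard textbook forms are given below. The notation is an assumption made here: M is a candidate equation with d free parameters θ, θ̂ is the posterior mode, H is the Hessian of the negative log joint density at θ̂, and φ_t are the annealing exponents. The specific variants used in the talk may differ.

```latex
% Laplace approximation of the marginal likelihood of model M
\log p(D \mid M) \approx \log p(D \mid \hat{\theta}, M) + \log p(\hat{\theta} \mid M)
  + \frac{d}{2} \log 2\pi - \frac{1}{2} \log \det H,
\qquad
H = -\nabla_{\theta}^{2} \log \bigl[ p(D \mid \theta, M)\, p(\theta \mid M) \bigr] \Big|_{\theta = \hat{\theta}}

% Likelihood annealing in SMC: intermediate targets between prior and posterior
\pi_t(M) \propto p(D \mid M)^{\phi_t}\, p(M),
\qquad 0 = \phi_0 < \phi_1 < \dots < \phi_T = 1
```

At φ_0 = 0 the samples follow the prior over models, and at φ_T = 1 they follow the posterior, so the data's influence is introduced gradually.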

Dr Geoffrey Bomarito, National Aeronautics and Space Administration (NASA), USA

12:15-12:30 Discussion

Chair

Dr Gabriel Kronberger, University of Applied Sciences Upper Austria, Austria

13:30-14:00 Speaker to be confirmed
14:00-14:15 Discussion
14:15-14:45 Zobrist hash-based duplicate detection in symbolic regression

The evolutionary search in genetic programming may visit the same points in solution space multiple times, which affects efficiency and overall odds of convergence. This work introduces a method for avoiding duplicate exploration through an efficient detection mechanism for already seen solutions, based on the concept of Zobrist hashing. This type of hashing, frequently used in abstract board games, allows the efficient construction and subsequent update of transposition tables (essentially, a cache of previously seen solutions) during the run, thus providing the ability to detect duplicates during the recombination phase of the algorithm.

We prototype this idea using the open-source symbolic regression library Operon and perform empirical testing on a number of synthetic and real-world datasets. Furthermore, we investigate the ability of the Zobrist-based approach to cover an artificial set of solution points generated exhaustively by a deterministic approach. The results show that our approach provides a reliable way of avoiding the evaluation of duplicate solutions during the search, with a moderate amount of overhead.
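A minimal sketch of Zobrist hashing applied to expressions may help here. It assumes expressions are stored as fixed-position token lists in prefix notation; the class, the token encoding and the `is_duplicate` helper are hypothetical and do not reflect how Operon implements the idea.

```python
import random


class ZobristHasher:
    """Assigns a random 64-bit key to every (position, symbol) pair."""

    def __init__(self, symbols, max_length, seed=0):
        rng = random.Random(seed)
        self.keys = {
            (pos, sym): rng.getrandbits(64)
            for pos in range(max_length)
            for sym in symbols
        }

    def hash(self, tokens):
        # The hash of an expression is the XOR of the keys of its tokens.
        value = 0
        for pos, sym in enumerate(tokens):
            value ^= self.keys[(pos, sym)]
        return value

    def update(self, value, pos, old_sym, new_sym):
        # Incremental update after changing one token: XOR out the old key, XOR in the new.
        return value ^ self.keys[(pos, old_sym)] ^ self.keys[(pos, new_sym)]


hasher = ZobristHasher(["+", "*", "x", "y", "1"], max_length=8)
seen = set()  # plays the role of the transposition table


def is_duplicate(tokens):
    """Return True if an identical token sequence was hashed before."""
    value = hasher.hash(tokens)
    if value in seen:
        return True
    seen.add(value)
    return False


parent = ["+", "x", "y"]  # prefix notation for x + y
child = ["*", "x", "y"]   # the same tree with the root operator changed
first_visit = is_duplicate(parent)
second_visit = is_duplicate(list(parent))
print(first_visit, second_visit)
```

The incremental `update` is what makes the scheme cheap: after a small change to an expression, its hash follows from the parent's hash without rehashing the whole tree. Note that this sketch detects syntactically identical expressions only, and that distinct expressions can in principle collide, although with 64-bit keys this is very unlikely.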

Professor Bogdan Burlacu, University of Applied Sciences Upper Austria, Austria

14:45-15:00 Discussion
15:00-15:30 Break
15:30-16:00 Equality graph assisted symbolic regression

Automatic equation discovery searches for a mathematical function that is capable of describing a set of experimental observations. The generated function can help to further the study of a certain phenomenon by understanding how the different measured variables interact, the meaning of the constant values, and the effects of the different adjustable parameters. This technique is named symbolic regression and is often implemented using a populational meta-heuristic called genetic programming (GP). Recent studies showed that GP search can be inefficient, exploring many redundant solution candidates. Another issue of modern GP is the excessive number of tuneable hyperparameters that either require pre-optimization or must be set from the user's own experience, both creating an unnecessary burden on the experimentation process.

The equality graph (e-graph) and equality saturation can compactly store expressions together with their equivalent forms, in such a way that it is possible to verify whether a given expression, or any equivalent form of it, has already been visited by the search.

In this paper we propose a new search algorithm for equation recovery from data called SymRegg, which revolves around the e-graph structure and follows three simple steps: sample one of the top-N expressions from the e-graph, perturb the selected solution, and insert it together with its equivalent forms if it was never visited during the search.

We show that this approach is capable of improving the efficiency of the search, maintaining consistently accurate results across different datasets while requiring only a minimal set of hyperparameters to be chosen.
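The "insert only if never visited" step can be illustrated with a much simpler stand-in for an e-graph: a set of canonical forms in which the operands of commutative operators are sorted. This is an assumption made purely for illustration. A real e-graph with equality saturation recognises far more equivalences than operand order, and none of the names below come from SymRegg.

```python
COMMUTATIVE = {"+", "*"}


def canonical(expr):
    """Return a canonical form in which commutative operands are sorted."""
    if not isinstance(expr, tuple):
        return expr  # a variable or constant, stored as a string
    op, *args = expr
    args = [canonical(arg) for arg in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)  # any fixed ordering works; repr is just convenient
    return (op, *args)


visited = set()


def insert_if_new(expr):
    """Record the expression unless an equivalent form was already visited."""
    key = canonical(expr)
    if key in visited:
        return False
    visited.add(key)
    return True


first = insert_if_new(("+", "x", ("*", "y", "2")))
again = insert_if_new(("+", ("*", "2", "y"), "x"))  # same expression, operands reordered
other = insert_if_new(("-", "x", "y"))
print(first, again, other)
```

In a search loop, a perturbed candidate for which `insert_if_new` returns False would be skipped before any costly evaluation on the data.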

Dr Fabricio Olivetti de França, Universidade Federal do ABC, Brazil

16:00-16:15 Discussion
16:15-17:00 Panel discussion