I am hoping this also opens up more opportunities to leverage Lisp's symbolic powers. I had great fun with Structure and Interpretation of Classical Mechanics (SICM), and recently enjoyed a paper on analyzing music using symbolic regression [0] and another on symbolic regression with Lisp [1]. Julia and CL seem perfect for this, with Mathematica as another option for quickly playing with ideas.
Yeah, Julia does great in this domain. DataDrivenDiffEq.jl (https://datadriven.sciml.ai/) is a comprehensive symbolic regression package that pulls together everything mentioned in the article (SINDy and variants, EQSearch, SymbolicRegression.jl (the core of PySR), OccamNet, etc.) into a single API, allowing comparisons between the methods.
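To make the idea concrete, here's a minimal self-contained sketch of the SINDy-style core that these packages wrap; this is not the DataDrivenDiffEq.jl API itself, just the sequentially thresholded least squares idea behind it, on made-up toy data:

```julia
# Self-contained sketch of the core SINDy idea: sparse regression of
# derivatives onto a library of candidate terms.
using LinearAlgebra

# Toy data: x'(t) = -2x + x^2, with the derivatives assumed known.
x  = collect(0.1:0.1:2.0)
dx = -2 .* x .+ x .^ 2

# Candidate library Θ(x) = [1, x, x^2, x^3]
Θ = hcat(ones(length(x)), x, x .^ 2, x .^ 3)

# Sequentially thresholded least squares (STLSQ)
function stlsq(Θ, dx; λ = 0.1, iters = 10)
    ξ = Θ \ dx                   # initial least-squares fit
    for _ in 1:iters
        small = abs.(ξ) .< λ     # prune near-zero coefficients...
        ξ[small] .= 0.0
        big = .!small
        ξ[big] = Θ[:, big] \ dx  # ...and refit on the survivors
    end
    return ξ
end

ξ = stlsq(Θ, dx)
# ξ ≈ [0.0, -2.0, 1.0, 0.0], i.e. it recovers dx = -2x + x^2
```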
The Julia work on automated discovery of physical equations showcases these symbolic regression techniques quite heavily. For example, in the State of SciML talk (https://www.youtube.com/watch?v=eSeY4K4bITI) we discuss how this has now been used across hundreds of different scientific use cases. The universal differential equations paper, which describes the Julia SciML organization (https://arxiv.org/abs/2001.04385), demonstrates cases where neural networks embedded in differential equations are combined with symbolic regression, so that prior physical knowledge can be mixed with data to discover just the unknown parts of the physical equations. And there are a lot more directions we're going next.
Chris, the work of you and others on Julia is nothing short of amazing. Thanks! I tried Julia when it first came out, then picked it up again to replace my MATLAB habit, and I have not looked back. I can't wait until the Sims.jl package and others can be used like Simulink/Modelica.
The thing to watch in the space of Simulink/Modelica is https://github.com/SciML/ModelingToolkit.jl . It's an acausal modeling system similar to Modelica (though extended to things like SDEs, PDEs, and nonlinear optimization), and it has a standard library (https://github.com/SciML/ModelingToolkitStandardLibrary.jl) similar to the MSL. There's still a lot to do, but it's pretty functional at this point. Two other projects to watch: FunctionalModels.jl (https://github.com/tshort/FunctionalModels.jl, the renamed Sims.jl), which is built on ModelingToolkit.jl and puts a more functional interface on it, and Modia.jl (https://github.com/ModiaSim/Modia.jl), which had a complete rewrite not too long ago; in its new form it's fairly similar to ModelingToolkit.jl, with the differences more in the details. For causal modeling similar to Simulink, there's Causal.jl (https://github.com/zekeriyasari/Causal.jl), which is fairly feature-complete. That said, I think a lot of people these days are moving towards acausal modeling, so flipping Simulink -> acausal, and picking up Julia in that transition, is the most likely direction (and given that MTK has gotten 40,000 downloads in the last year, there's good data backing that up).
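For a taste of what the symbolic layer looks like, here's a minimal ModelingToolkit.jl sketch (assuming the v8-era API, so details may have shifted; the acausal component library builds on top of this):

```julia
# Minimal ModelingToolkit sketch: define a damped oscillator
# symbolically, simplify the system, and solve it numerically.
using ModelingToolkit, OrdinaryDiffEq

@variables t x(t) v(t)
@parameters k c m
D = Differential(t)

eqs = [D(x) ~ v,
       D(v) ~ (-k * x - c * v) / m]

@named sys = ODESystem(eqs, t)
simplified = structural_simplify(sys)

prob = ODEProblem(simplified,
                  [x => 1.0, v => 0.0],   # initial conditions
                  (0.0, 10.0),            # timespan
                  [k => 1.0, c => 0.2, m => 1.0])
sol = solve(prob, Tsit5())
```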
And a quick mention to bring it back to the main thread: the DataDrivenDiffEq symbolic regression API gives back Symbolics.jl/ModelingToolkit.jl objects, meaning the learned equations can be put directly into the simulation tools or composed with other physical models. We're really trying to marry this process modeling and engineering world with these "newer" AI tools.
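As a tiny sketch of what that hand-off looks like (the "learned" expression below is a hypothetical stand-in, not actual DataDrivenDiffEq output):

```julia
# Sketch: treating a (hypothetical) learned expression as a Symbolics.jl
# object and compiling it into a plain Julia function for simulation use.
using Symbolics

@variables x v
learned_rhs = -2x + 0.1 * x * v    # stand-in for a regression result

# Evaluate it symbolically...
substitute(learned_rhs, Dict(x => 1.0, v => 0.5))   # -> -1.95

# ...or compile it to a fast function that a simulator can call
f = build_function(learned_rhs, x, v; expression = Val{false})
f(1.0, 0.5)
```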
It's all coming together. I am definitely back on board, and I will try to apply your package suggestions to a real-world project as soon as I can to test it out. Thanks again!
I'm really curious to see whether efforts like this, and computational augmentation and automation more generally, will yield progress on symbolic modeling of large natural systems.
There is a segment in Adam Curtis's "All Watched Over By Machines of Loving Grace" describing scientists trying to model the ecology of a prairie. As Curtis tells it, the more data they collected and the more complex their model became, the worse its predictions got. It feels like the lessons of the past couple of generations (many decades) have been that symbolic models don't compose together easily; that symbolic analysis only works in simple systems and has hit diminishing returns; that numeric and ML methods work well enough; etc. I'm curious whether better tooling (augmentation and/or automation) can push through some of these challenges and yield real understanding, or whether there are fundamental scaling problems that cannot be overcome.
I found it curious that one of the implementations of symbolic regression (the "machine scientist" referenced in the article) is a Python wrapper on Julia: https://github.com/MilesCranmer/PySR
I don't think I've seen a Python wrapper on Julia code before.
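For what it's worth, the Julia core can be driven directly too. A rough sketch, assuming the SymbolicRegression.jl API from around PySR's early days (EquationSearch has since been renamed equation_search, so treat the details as indicative):

```julia
# Sketch of calling PySR's Julia core directly (early API;
# EquationSearch was later renamed equation_search).
using SymbolicRegression

X = randn(2, 100)                        # 2 features × 100 samples
y = 2 .* cos.(X[1, :]) .+ X[2, :] .^ 2   # target: 2cos(x1) + x2^2

options = Options(binary_operators = (+, *, /, -),
                  unary_operators  = (cos, exp))

hall_of_fame = EquationSearch(X, y; niterations = 20, options = options)
```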
I worked on genetic programming for my MSc in econometrics, using it to build human-readable risk models for time series data. It worked really well. The article seems to call these methods genetic algorithms, though Koza himself (the "godfather" of genetic programming) was pretty keen on distinguishing the two. The main difference is that genetic algorithms represent solutions as linear strings / vectors whereas genetic programming works on tree structures. Maybe just splitting hairs here though.
It should also be emphasized that genetic programming is just one approach to program synthesis, i.e. automatically deriving computer programs from data.
You don't have to use genetic/evolutionary algorithms to search the space of programs, it's just the most popular method.
You can even try pure random search if you're feeling particularly lucky.
"The main difference is that genetic algorithms represent solutions as linear strings / vectors whereas genetic programming works on tree structures."
Yes, and also that the tree data structure used by genetic programming tends to have functions in every non-leaf node, and static values or variables in the leaf nodes.
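A minimal self-contained sketch of that representation (and note that the random tree generator is already all you need for the pure random search mentioned above):

```julia
# GP expression tree: functions at internal nodes, variables or
# constants at the leaves.
abstract type Node end

struct Leaf <: Node
    value::Union{Symbol, Float64}    # :x, or a constant
end

struct Op <: Node
    f::Function                      # e.g. +, -, *
    args::Vector{Node}
end

# Evaluate a tree given variable bindings
evaltree(n::Leaf, env) = n.value isa Symbol ? env[n.value] : n.value
evaltree(n::Op, env)   = n.f((evaltree(a, env) for a in n.args)...)

# Grow a random tree
function randtree(depth)
    if depth == 0 || rand() < 0.3
        return rand() < 0.5 ? Leaf(:x) : Leaf(round(10rand(), digits = 1))
    end
    Op(rand([+, -, *]), Node[randtree(depth - 1), randtree(depth - 1)])
end

t = randtree(3)
evaltree(t, Dict(:x => 2.0))
```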
Basically it's approximating a solution, and then finding an equation from some data. Nothing wrong with writing down an approximation to some problem; an approximation can be very useful, and as long as you know its limitations it's fine to use it to answer questions where it applies. But this cannot replace deeper understanding. It's all good to write down approximate solutions to some hard n-body problem or nonlinear PDE, just don't look too far ahead in time, or update your model with fresh data if you need to.
> It's all good to write down approximate solutions to some hard n-body problem or nonlinear PDE, just don't look too far ahead in time, or update your model with fresh data if you need to.
It sounds like this symbolic regression approach can potentially find correct PDE solutions directly from data:
"In February, they fed their system 30 years’ worth of real positions of the solar system’s planets and moons in the sky. The algorithm skipped Kepler’s laws altogether, directly inferring Newton’s law of gravitation and the masses of the planets and moons to boot."
Note that Newton's law of gravitation can also be expressed as a differential equation. Also note that simply knowing the exact solution to the PDE does not mean that the PDE model itself is an exact representation of reality. For example, we know from Einstein's general relativity that Newton's laws are not exact (the precession of Mercury's perihelion being the classic case). In the symbolic regression approach, the solution would be limited by the number of different observable variables available to train the model on.
Many PDEs in real-world problems also do not have any known exact symbolic solution, so approximate symbolic formulas are routinely used anyway, but they are painstakingly discovered and derived by humans with "deeper understanding" (actually, just hordes of PhD students and their advisors throwing everything they can think of at the problem until one of them lands on a usable formula).
Well, a human coming up with a PDE model (i.e. a system of equations) based on their intuition about the causes of a given phenomenon would be limited by their intuition/imagination about which variables are significant for the model. For all practical purposes, the variables a modeler imagines to be significant are a subset of all the possible observable variables. A machine learning approach would learn the significant variables directly from the data, and is therefore not limited by imagination. There's an example of this mentioned in the article, in the context of ocean models:
“The algorithm picked up on additional terms,” Zanna said, producing a “beautiful” equation that “really represents some of the key properties of ocean currents, which are stretching, shearing and [rotating].”
Yes, I would say most of the physics we can currently make predictions with consists of effective models that are only applicable at a certain scale. Certainly there may be some underlying ideas that get shared around, but for actually doing calculations and making predictions, the effective models are what we use.
Yes, they are in absolute terms. To prove that they are exact we would need to have a complete understanding of the universe, which we do not have.
In practical terms, some of them are exact in the sense that it is impossible to measure their inaccuracy with common equipment and our primitive human senses. In the end, their purpose is to make predictions and help us make sense of what's around us; they do not need to be exact.
The ones you learn in high school are, but general and special relativity have been confirmed to accuracies of trillionths of a percent (the high school equations are their approximations at low speed and without time dilation).
We don't know if they break down, because we have no means of measuring them more accurately on Earth amid all the statistical fluctuations.
I think the point is that with conventional equations we can make increasingly better approximations by just using pen and paper, while with these ML methods, you need to train your system ab initio if you want to improve your approximations.
This article seems to be about ML, but good old-fashioned brute-force searches are also possible.
It's trivial to find formulas that fit data to within the variables' uncertainties. You can easily drown in such output. There are two main challenges with this approach: 1. making sure you have the right "ingredients" (factors) for the search, and 2. filtering the output so that human reviewers don't waste time on obvious nonsense equations.
Filtering the output is not terribly difficult. Some assessment of complexity is required, and depending on the formula a symmetry score might be useful. Some actual physics equations are quite complex, though, so unless you have reason to expect a simple equation, complexity-based filtering will be less effective.
Making sure you have all of the right potential factors available is very hard. Expanding the input factor set extends the search space and time.
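A toy illustration of both points, with a deliberately tiny, made-up ingredient set: even at depth one, the search space is every pairing of operators and leaves, and adding factors multiplies it out from there:

```julia
# Brute-force search over depth-one formulas built from a fixed
# ingredient set, keeping exact fits and ranking them by complexity.
ops    = [(+, "+"), (-, "-"), (*, "*")]
leaves = [(x -> x, "x"), (x -> 1.0, "1"), (x -> 2.0, "2")]

xs = 0.0:0.5:3.0
ys = xs .^ 2                  # "secret" target: x^2

hits = Tuple{String, Int}[]
for (f, fname) in ops, (a, aname) in leaves, (b, bname) in leaves
    g = t -> f(a(t), b(t))
    if maximum(abs.(g.(xs) .- ys)) < 1e-9
        # crude complexity score: symbol count
        push!(hits, ("($aname $fname $bname)", length(aname) + length(bname) + 1))
    end
end
sort!(hits, by = last)        # simplest fitting formulas first
# -> [("(x * x)", 3)]
```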
Brute-force costs grow exponentially (well, combinatorially) with the number of terms. It's viable for the small cases, but not for the larger ones. There's a nice paper from 2016 which demonstrates this very directly (http://www.cogsys.org/proceedings/2016/paper-2016-2.pdf), though I'm sure there's some stuff from the 70s on it as well.
Aw, no mention of the two Robot Scientists, Adam and Eve, who don't just invent theories but also design and run their own experiments to verify them?
“These algorithms resemble supercharged versions of Excel’s curve-fitting function, except they look not just for lines or parabolas to fit a set of data points, but billions of formulas of all sorts. In this way, the machine scientist could give the humans insight into why cells divide, whereas a neural network could only predict when they do.”
It’s very cool witnessing the consequences of Moore’s law; as the cost of billions of calculations goes to pennies, this becomes easier and easier.
Since the complexity of nature is unbounded, I wonder if the slowdown of Moore’s law will represent a plateau in what can be symbolically discovered.
For example, will the next set of scientific discoveries require quintillions of combinatorial checks that can’t be accomplished in a human lifetime?
> A constant implied that it had identified two proportional quantities — in this case, period squared and distance cubed. In other words, it stopped when it found an equation.
Very interesting article. Thanks for posting.
I've been thinking about physics equations and their relation to proportionality. Not all physics equations are proportionalities, because in physics the equality sign is loaded.
To me, proportionality is fundamental, not the physics equation. So I would have written the last sentence of the quote as "...it stopped when it found [a proportionality]."
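Concretely, for the Kepler example (my notation, not the article's):

```latex
T^2 \propto a^3
\quad\Longleftrightarrow\quad
\frac{T^2}{a^3} = \text{const} = \frac{4\pi^2}{G(M+m)}
```

Stopping at a constant ratio is exactly the discovery of the proportionality; Newton's law is what later explains the value of that constant.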
"They started by downloading all the equations in Wikipedia. They then statistically analyzed those equations to see what types are most common. This allowed them to ensure that the algorithm’s initial guesses would be straightforward — making it more likely to try out a plus sign than a hyperbolic cosine, for instance. The algorithm then generated variations of the equations using a random sampling method that is mathematically proven to explore every nook and cranny in the mathematical landscape."
Wikipedia contains equations from many different branches of math.
Aren't mathematical equations written in all sorts of notations (really obscure notations sometimes, for obscure branches of math), depending on the branch of math they're used in?
How would this program even be able to use equations written in a notation it doesn't understand? Even if it somehow understood the notation, that doesn't mean the algorithm understood the branch of math the equation was for.
This all sounds unworkable to me, unless they're somehow limiting the equations they use to some branch of math they already understand.
Could you elaborate on this? What assumptions do you object to?
Are you saying that different branches of math don't have their own special notations?
Sure, they have notation in common, but they have their own notation too, to express objects, relations, or operations that are of special relevance to them.
That's not to mention that any given symbol could mean different things depending on the branch of math it's used in.
This is also how a few hedge funds (WorldQuant and spinoffs) get alpha signals/factors for their long/short portfolios. They get 95% of their aggregate alpha/returns from genetic programming rather than from human quant researchers.
Wouldn't this be a "what we know about physics" distillation? I'm basing this assumption on the "raw data" being data from experiments that have already been performed, etc.
"One might define simplicity as the length of the equation, say, and accuracy as how close the curve gets to each point in the data set, but those are just two definitions from a smorgasbord of options."
"..the algorithm evaluated candidate equations in terms of how well they could compress a data set. A random smattering of points, for example, can’t be compressed at all; you need to know the position of every dot. But if 1,000 dots fall along a straight line, they can be compressed into just two numbers (the line’s slope and height). The degree of compression, the couple found, gave a unique and unassailable way to compare candidate equations."
Yes. The article says something about statistics and Bayesian methods and other terms I only barely understand. It prefers "+" over "hyperbolic cosine", for example (to quote the article).
[0] https://www.researchgate.net/publication/286905402_Symbolic_...
[1] https://towardsdatascience.com/symbolic-regression-the-forgo...