Data-free Data Science
This is Part 3 of my series of posts on physics-informed machine learning; for backstory, see Parts 1 and 2.
In my previous post, I discussed physics-informed neural networks, or PINNs. These are networks trained to approximate the solution of a (possibly nonlinear) PDE: \[ \mathcal{N}(u) = f \] The idea is to parameterize the solution as \(u = u_\Theta\) (where \(\Theta\) denotes the weights of a neural network) and optimize the residual loss: \[ L(\Theta) = E_\Omega (\mathcal{N}(u_\Theta) - f)^2 . \] Here the expectation is taken over points in the domain and/or on the boundary. In practice, it can be replaced by a finite sum over collocation points sampled either on a uniform grid or at random. Once this is done, optimization proceeds by standard back-propagation through the network, with assorted tips and tricks.
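To make this concrete, here is a minimal sketch of the residual-loss approach for a toy 1-D Poisson problem \(u'' = f\) on \((0,1)\) with zero boundary conditions. The network size, optimizer, forcing function, and collocation scheme are my own arbitrary choices rather than anything from a specific PINN paper; the point is only that the loss is assembled from the PDE residual plus a boundary term, with no training data in sight.

```python
# A minimal PINN sketch (illustrative, not from any particular paper):
# fit u_theta so that u''(x) = f(x) on (0, 1) with u(0) = u(1) = 0.
import math
import torch
import torch.nn as nn

u_theta = nn.Sequential(
    nn.Linear(1, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

def f(x):  # forcing term, chosen arbitrarily for this toy problem
    return -(math.pi ** 2) * torch.sin(math.pi * x)

def residual_loss(x_interior, x_boundary):
    x = x_interior.requires_grad_(True)
    u = u_theta(x)
    # second derivative via two rounds of automatic differentiation
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    pde = ((d2u - f(x)) ** 2).mean()        # E_Omega (N(u_theta) - f)^2
    bc = (u_theta(x_boundary) ** 2).mean()  # enforce u = 0 on the boundary
    return pde + bc

opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)
x_bd = torch.tensor([[0.0], [1.0]])
for _ in range(5000):
    x_in = torch.rand(256, 1)               # random collocation points
    opt.zero_grad()
    loss = residual_loss(x_in, x_bd)
    loss.backward()
    opt.step()
```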
One point that struck me was that the above approach requires no training data. (There are data-dependent variations of PINNs, but that is perhaps a topic for a later post.) From a traditional ML modeling perspective, my first instinct would have been to generate a training dataset of input-output pairs (in this case, a bunch of \((f,u)\) tuples) and try to learn a neural map from forcing function space to solution space. But getting such a dataset would require considerable work up front.
In machine learning, we tend to take for granted the availability of ginormous training datasets. But this is not the case in most scientific or engineering applications — for proof, spend time talking with any physicist or material scientist. Data generation takes time, money, and manpower. But PINNs neatly sidestep this issue; the only information needed is
- the structure of the PDE (encoded in the loss function) and
- a list of collocation/boundary points (also encoded in the loss function).
However, reading about PINNs brought to mind another recent success story in deep learning that also works without any training data: the Deep Image Prior (DIP).
A quick historical digression. Back when we were kids, there was a lot of fuss about modeling data via priors. Priors represent your belief that the solution to a particular problem obeys a certain structure. That structure is encoded by either a probability distribution or a deterministic low-dimensional manifold, depending on what strikes your fancy. Priors are hand-picked: at the outset, one declares what prior they want (“We will assume a stick-breaking Griffiths-Engen-McCloskey distribution…”), and then proceeds to solve the problem of interest.
The pendulum has now swung the other way. Picking priors by hand has been largely supplanted by using learned priors in all sorts of applications. The idea is to parameterize the prior distribution/manifold of the data somehow, and figure out what parameters make sense for your problem. Yann LeCun makes the fascinating point that assuming a priori structure about your problem is a “necessary evil”: your priors can be (and often are) wrong, and even if they are not, they might become obsolete when a new prior comes along.
What kind of parametric form should the priors take? A decade ago, dictionary learning would have been the answer. Now, one uses deep neural networks, typically with some form of convolutional structure. The results are amazing; look no further than the (synthetic) images sampled from the GANGogh prior.
But there is a subtle point to be made here. The belief encoded in neural networks consists of two parts:
- the data-free part, which is encapsulated via the architecture of the neural network (whether convolutional, recurrent, or whatever).
- the data-dependent part, which is encapsulated via training over a given dataset.
Both play major (if individually unacknowledged) roles in the predictive power of neural networks. I wish there were a clean mathematical way to separate the contributions of these two components. Far too often, a paper that constructs a very rich and complicated prior using massive amounts of training data is superseded a few years later by one that replicates similar behavior while training on a single example. The repeated emergence of such instances tells me that new, more sensible, more sample-efficient neural network training approaches are waiting to be discovered.
The DIP is one example of a learned prior modeled via a (deep) neural network. The key is that this learning is done without any training data. Imagine, for example, a deblurring application where the input is a noisy, blurry image \(x\), and the goal is to sharpen it up. Call the clean image \(u\). The classical way to estimate \(u\) would be to solve the linear inverse problem: \[ x = \mathcal{A}(u) \] where \(\mathcal{A}\) models the blurring operator. Since \(\mathcal{A}\) is most likely rank-deficient, one would have to use ridge regression, LASSO, or some other regularization scheme to make the problem well-posed.
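For contrast, here is what that classical route might look like in a toy setting: a small 1-D moving-average blur solved in closed form with a ridge (Tikhonov) penalty. The blur matrix, noise level, and regularization weight below are all invented for illustration.

```python
# Toy classical baseline (my own setup, for contrast with the DIP below):
# solve min_u ||A u - x||^2 + lam ||u||^2 in closed form for a 1-D blur.
import numpy as np

n = 100
A = np.zeros((n, n))                         # simple moving-average blur matrix
for i in range(n):
    A[i, max(0, i - 2): i + 3] = 1.0 / 5.0

u_true = np.random.rand(n)
x = A @ u_true + 0.01 * np.random.randn(n)   # blurry, noisy observation
lam = 1e-2                                   # ridge regularization weight
u_hat = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ x)
```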
The modern approach is to use deep learning. But let us pose the problem in the language of priors. If we assume that \(u = u_\Theta\), i.e., the solution is the output of a neural network with weights \(\Theta\), then those weights can be learned by optimizing, over \(\Theta\), the loss function: \[ L(\Theta) = E_\Omega (\mathcal{A}(u_\Theta) - x)^2 \] where the expectation is interpreted as the average over all the pixel intensities. (The latent code fed into this network is drawn once and held fixed throughout.) In essence, we are adjusting the weights of the network to fit the blurry measurements of the particular image that we are interested in.
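Here is a minimal sketch of that idea, under my own assumptions about the architecture and the blur operator \(\mathcal{A}\) (the original DIP work uses a much larger encoder-decoder network): a random latent code is drawn once, held fixed, and pushed through a small convolutional network whose weights are fitted to the blurry measurements of this one image.

```python
# Minimal Deep Image Prior sketch (illustrative only; the network and the
# blur operator A are my own stand-ins, not the paper's choices).
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.rand(1, 3, 64, 64)            # stand-in for the blurry observation

# A fixed blur operator A: a 5x5 box blur applied per channel.
kernel = torch.ones(3, 1, 5, 5) / 25.0
def A(u):
    return F.conv2d(u, kernel, padding=2, groups=3)

# A small convolutional "prior" network u_theta.
u_theta = nn.Sequential(
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
)
z = torch.randn(1, 8, 64, 64)           # latent code, drawn once and held fixed

opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)
for step in range(2000):                # early stopping guards against overfitting
    opt.zero_grad()
    loss = ((A(u_theta(z)) - x) ** 2).mean()  # E_Omega (A(u_theta) - x)^2
    loss.backward()
    opt.step()

u_hat = u_theta(z).detach()             # the recovered (sharpened) image
```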
Pause, for a moment, to observe a few points:
- that this learning problem is completely data-free: there is no auxiliary training dataset needed whatsoever.
- that there is a risk of overfitting if the network is too large (the original DIP paper took care of this via early stopping, but later works, such as the Deep Decoder, have dealt with it by designing smarter architectures).
- that each new reconstruction problem requires us to train a whole new network, possibly from scratch (although perhaps some degree of transfer learning is possible).
- that the DIP learning problem is strikingly similar to the PINN formulation above.
There are several unanswered questions here (adding to the already long list of questions from last time). There is hope that some of these can be answered, at least in the context of DIP.
More pertinent to this series of posts: Can the lessons learned from the DIP and its descendants be applied to learning physics-informed models? What does this mean for solving PDEs? And is there any hope for theoretical analysis at all here? I will try to explain why I think these questions have affirmative answers in a future post.