It has become clear over the last decade that progress in practical applications of deep learning has considerably outpaced our understanding of its foundations. Many fundamental questions remain unanswered. Why are we able to train neural networks so efficiently? Why do they perform so well on unseen data? Is there any benefit of one network architecture over another?

These lecture notes are an attempt to sample a growing body of work in the mathematics of deep learning that addresses some of these questions. They supplement my graduate-level course on this topic taught at NYU Tandon in the Spring of 2022.

All pages on that site are under construction. Corrections, pointers to omitted results, and other feedback are welcome: just email me, or open a GitHub pull request at this repository.

In my previous post, I discussed physics-informed neural networks, or PINNs. These are networks trained to approximate the solution to a (possibly nonlinear) PDE: \[ \mathcal{N}(u) = f \] The idea is to parameterize the solution as \(u = u_\Theta\) (where \(\Theta\) represents the weights of a neural network) and optimize the residual loss: \[ L(\Theta) = E_\Omega (\mathcal{N}(u_\Theta) - f)^2 . \] Here the expectation is taken over points in the domain and/or the boundary. In practice, it can be replaced by a finite sum by sampling points either uniformly or at random. Once this is done, optimization can be performed by standard back-propagation through the network, with assorted tips and tricks.
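For concreteness, here is a minimal sketch of that residual loss for a toy 1D problem \(u'' = f\), assuming PyTorch. All names, sizes, and the choice of forcing function are illustrative, not the canonical PINN recipe:

```python
import torch

# Small fully-connected network parameterizing the candidate solution u_Theta
u_net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)

def residual_loss(n_points=128):
    # Sample collocation points in the domain (0, 1)
    x = torch.rand(n_points, 1, requires_grad=True)
    u = u_net(x)
    # Autodiff gives u' and u'' with respect to the network inputs
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    f = torch.sin(x)  # an example forcing function
    # Monte Carlo estimate of E(N(u_Theta) - f)^2
    return ((d2u - f) ** 2).mean()

loss = residual_loss()
loss.backward()  # gradients flow back to the network weights
```

Note that no \((f, u)\) training pairs appear anywhere: the only ingredients are the operator (here, \(d^2/dx^2\)) and sampled collocation points.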

One point that struck me was that the above approach *requires no training data*. (There are data-dependent variations of PINNs, but that is perhaps a topic for a later post.) From a traditional ML modeling perspective, my first instinct would have been to generate a training dataset of input-output *pairs* (in this case, a bunch of \((f,u)\) tuples) and try to learn a neural map from forcing function space to solution space. But getting such a dataset would require considerable work up front.

In machine learning, we tend to take for granted the availability of ginormous training datasets. But this is not the case in most scientific or engineering applications — for proof, spend time talking with any physicist or materials scientist. Data generation takes time, money, and manpower. But PINNs neatly sidestep this issue; the only information needed is

- the structure of the PDE (encoded in the loss function) and
- a list of collocation/boundary points (also encoded in the loss function).

However, reading about PINNs brought to mind another recent success story in deep learning that also works without any training data: the Deep Image Prior (DIP).

A quick historical digression. Back when we were kids, there was a lot of fuss about modeling data via *priors*. Priors represent your belief that the solution to a particular problem obeys a certain structure. That structure is encoded by either a probability distribution or a deterministic low-dimensional manifold, depending on what strikes your fancy. Priors are hand-picked: at the onset, one declares what prior they want (“We will assume a stick-breaking Griffiths-Engen-McCloskey distribution…”), and then proceeds to solve the problem of interest.

The pendulum has now swung the other way. Picking priors by hand has been all but supplanted by *learned* priors in all sorts of applications. The idea is to parameterize the prior distribution/manifold of the data somehow, and figure out what parameters make sense for your problem. Yann LeCun makes the fascinating point that assuming *a priori* structure about your problem is a “necessary evil”: your priors can be (and often are) wrong, and even if they are not, they might become obsolete when a new prior comes along.

What kind of parametric form should the priors take? A decade ago, dictionary learning would have been the answer. Now, one uses deep neural networks, typically with some form of convolutional structure. The results are amazing; look no further than the (synthetic) images sampled from the GANGogh prior.

But there is a subtle point to be made here. The belief encoded in neural networks consists of two parts:

- the *data-free* part, which is encapsulated via the architecture of the neural network (whether convolutional, recurrent, or whatever).
- the *data-dependent* part, which is encapsulated via training over a given dataset.

Both play major (if individually unacknowledged) roles in the predictive power of neural networks. I wish there were a clean mathematical way to separate the contributions of these two components. Far too often, a paper that constructs a very rich and complicated prior using massive amounts of training data is superseded a few years later by a paper that replicates similar behavior while trained over a single training example. The repeated emergence of such instances tells me that there are new, more sensible, more sample-efficient neural network training approaches waiting to be discovered.

The DIP is one example of a learned prior modeled via a (deep) neural network. The key is that this learning is done *without any training data*. Imagine, for example, a deblurring application where the input is a noisy, blurry image \(x\), and the goal is to sharpen it up. Call the clean image \(u\). The classical way to estimate \(u\) would be to solve the linear inverse problem:
\[
x = \mathcal{A}(u)
\]
where \(\mathcal{A}\) models the blurring operator. Since \(\mathcal{A}\) is most likely rank-deficient, one would have to use ridge regression, LASSO, or some other regularization scheme to make the problem well-posed.
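A minimal numpy sketch of the classical route, with a made-up local-averaging matrix standing in for \(\mathcal{A}\): the operator has fewer rows than columns, so plain least squares is ill-posed, but a ridge penalty makes the normal equations invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# A hypothetical rank-deficient "blur": each measurement averages 5 neighboring
# entries, and there are only n/2 measurements for n unknowns.
A = np.array(
    [[1.0 if abs(i - j) <= 2 else 0.0 for j in range(n)] for i in range(n // 2)]
) / 5.0
u_true = rng.standard_normal(n)
x = A @ u_true  # blurry measurements

# Ridge regression: minimize ||A u - x||^2 + alpha ||u||^2.
# The alpha * I term makes A^T A + alpha I invertible despite rank deficiency.
alpha = 1e-2
u_ridge = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ x)
```

The regularizer encodes a (hand-picked!) prior — here, a preference for small-norm solutions — which is exactly the kind of belief the DIP replaces with a network architecture.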

The modern approach is to use deep learning. But let us pose the problem in the language of priors. If we assume that \(u = u_\Theta\), i.e., the *solution* is modeled via a *neural prior* with weights \(\Theta\), then those weights can be learned by optimizing, over \(\Theta\), the loss function:
\[
L(\Theta) = E_\Omega (\mathcal{A}(u_\Theta) - x)^2
\]
where the expectation is interpreted as the average over all the pixel intensities. (The latent code that serves as input to this network is held fixed throughout.) In essence, we are adjusting the weights of the network to fit the blurry measurements *of the particular image that we are interested in*.
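A minimal sketch of this fitting loop, assuming PyTorch, with a 1D signal standing in for the image and a fixed random linear map standing in for \(\mathcal{A}\) (all names and sizes here are illustrative):

```python
import torch

torch.manual_seed(0)
n = 64
A = torch.randn(n // 2, n) / n ** 0.5   # stand-in for the blur operator (fixed)
u_true = torch.randn(n)                 # the unknown clean signal
x = A @ u_true + 0.05 * torch.randn(n // 2)  # noisy, "blurry" measurements

z = torch.randn(1, 16)                  # latent code: fixed, never updated
net = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, n)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

losses = []
for _ in range(200):                    # early stopping would cap this loop
    opt.zero_grad()
    u = net(z).squeeze(0)               # candidate clean signal u_Theta
    # Fit the measurements of this one signal -- no training dataset anywhere
    loss = ((A @ u - x) ** 2).mean()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The only "data" is the single degraded measurement \(x\); everything else the network knows comes from its architecture.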

Pause, for a moment, to observe a few points:

- that this learning problem is completely *data-free*: no auxiliary training dataset is needed whatsoever.
- that there is a risk of overfitting if the network is too large (the original DIP paper took care of this via early stopping, but later works, such as the Deep Decoder, have dealt with it by designing smarter architectures).
- that *each* new denoising problem requires us to retrain a whole new network, possibly from scratch (although perhaps some degree of transfer learning is possible).
- that the DIP learning problem is strikingly similar to the PINN formulation above.

There are several unanswered questions here (adding to the already long list of questions from last time). There is hope that some of these can be answered, at least in the context of DIP.

More pertinently to this series of posts: Can the lessons learned by DIP and its descendants be applied to learning physics-informed models? What does this mean for solving PDEs? And is there any hope for theoretical analysis at all here? I will try to explain why I think these have affirmative answers in a future post.

In Book 1 of the Principia Mathematica, Newton puts forth his celebrated Laws of Motion. He uses them to provide quantitative explanations of a staggering number of measured phenomena (including the inverse-square behavior of gravity, Kepler’s laws of planetary motion, the basis of tides, the precession of the equinoxes, and the orbits of comets, among others). In particular, the Second Law of Motion assumes the form of an *ordinary differential equation* (ODE):

\[ \frac{d^2 x(t)}{dt^2} = \frac{1}{m} F(t) \]

where \(x(t)\) is the instantaneous position of a body of mass \(m\) and \(F(t)\) is the force acting on it.

Despite the tag “ordinary”, ODEs can be simple to write down but quickly become complex. For the Second Law, analytical solutions for \(x(t)\) are available only if the force function is well-behaved. If not, one has to resort to numerical methods that involve appropriate discretization of the differential operators in the ODE. Natural questions emerge here: how does one perform the discretization? How finely should we discretize? Does the approximation error induced by the discretization converge to zero, and if so, at what rate?
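To make the discretization question concrete, here is the simplest such scheme (forward Euler) applied to the Second Law with a constant force, where a closed-form solution exists for comparison. The step size and constants are arbitrary choices:

```python
m, F = 2.0, 4.0        # mass and a constant force, so acceleration a = F/m = 2
dt, T = 1e-3, 1.0      # step size and time horizon
steps = int(T / dt)

x, v = 0.0, 0.0        # start at rest at the origin
for _ in range(steps):
    x += v * dt        # forward Euler update for position
    v += (F / m) * dt  # forward Euler update for velocity

# Closed-form solution for constant force: x(T) = (1/2)(F/m)T^2
x_exact = 0.5 * (F / m) * T ** 2
```

Here the discretization error shrinks linearly with `dt` — exactly the kind of convergence question that becomes much harder for general force functions, and harder still for PDEs.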

All these matters are doubly/triply exacerbated when we start talking about *partial* differential equations (PDEs). Here, the unknown is a multivariate function \(u\). Let us be concrete and limit ourselves to the variables being space and time, so that \(u = u(x,t)\). The equation to be solved now involves an arbitrary operator with partial derivatives:

\[ \mathcal{N}(u) = f \]

where \(f = f(x,t)\) is called the *forcing* function. Let us assume this is deterministically fixed. If not, then the above equation is called a *stochastic* PDE.

Unlike the case of ODEs, there is no general understanding of when (or whether) a generic PDE even admits a solution. The celebrated Navier-Stokes equations are an example of a system of PDEs whose theoretical understanding is incomplete:

\[ \frac{\partial u}{\partial t} + (u \cdot \nabla) u - \nu \Delta u + \frac{1}{\rho} \nabla p = f . \]

where \( \Delta, \nabla \) represent the Laplacian and the gradient respectively. Things have already become hairy, since the above PDE is *nonlinear* in its unknown variables. So even a heuristic application of numerical methods may not always work well.

Let us now briefly set aside the 300+ year history of solving ODEs/PDEs, and instead imagine a completely different approach. Suppose we *parameterize* the solution using a *deep feedforward neural network.* In concrete terms, if \(\Theta\) represents the weights and biases of the network, we write down:

\[ u = u_\Theta(x,t) \]

and formulate the *physics-informed* loss function:

\[ L(\Theta) = \sum_{(x_i,t_i) \in S_{\text{int}}} (\mathcal{N} (u_\Theta(x_i, t_i)) - f(x_i, t_i) )^2 + \lambda \sum_{(x_j,t_j) \in S_{\text{bdry}}} (u_\Theta(x_j, t_j) - u_0(x_j, t_j))^2 \]

where \(S_{\text{int}}\) denotes a set of collocation points in the interior and \(S_{\text{bdry}}\) denotes a set of boundary points. The weights can now be learned using standard neural training paraphernalia (autodiff, Adam, batch normalization, etc.). Once the weights are learned, the solution can be reconstructed by evaluating \(u_\Theta(x,t)\) over the entire domain.
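The full recipe — interior residual plus a weighted boundary penalty, trained with autodiff — can be sketched as follows, assuming PyTorch. A 1D Poisson problem with zero boundary data stands in for the general \(\mathcal{N}(u) = f\); the network size, sampling, and forcing function are illustrative:

```python
import torch

torch.manual_seed(0)
u_net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)

def pinn_loss(lam=10.0):
    # Interior collocation points S_int, resampled uniformly in (0, 1)
    x_int = torch.rand(64, 1, requires_grad=True)
    u = u_net(x_int)
    du = torch.autograd.grad(u.sum(), x_int, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x_int, create_graph=True)[0]
    f = -torch.sin(torch.pi * x_int)     # example forcing function
    interior = ((d2u - f) ** 2).mean()   # PDE residual term

    # Boundary points S_bdry, with boundary data u_0 = 0
    x_bdry = torch.tensor([[0.0], [1.0]])
    boundary = (u_net(x_bdry) ** 2).mean()
    return interior + lam * boundary     # lam plays the role of lambda above

opt = torch.optim.Adam(u_net.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = pinn_loss()
    loss.backward()
    opt.step()
```

After training, the approximate solution at any point in the domain is just a forward pass `u_net(x)` — no mesh is ever constructed.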

Variants of this idea have been floating around since (at least) the 1990s. As with most ideas based on neural networks, they didn’t gain traction until much later, with an influential series of papers from George Karniadakis and co-authors starting in 2017. They call this approach *PINNs*, short for *Physics-Informed Neural Networks*.

The simplicity of the above formulation lends itself to a number of extensions. Among many others:

- Additional training data (in the form of, say, values of the solution at a pre-identified set of collocation points) can be incorporated by throwing in new loss terms.
- The PDE operator \(\mathcal{N}\) itself can involve unknown parameters, in which case those parameters are estimated jointly with \(\Theta\).
- Extension to stochastic PDEs can be achieved by taking the expectation of the physics-informed loss function over appropriately defined probability measures.

In this manner, the considerable advances in neural network learning over the last five years (and the democratization of software tools for learning neural nets, including powerful packages like TensorFlow and PyTorch) can now directly be ported to the field of numerical PDE analysis. The results are very impressive.

It leads me to wonder more broadly: what other fields in science are waiting for such a clean connection to be made to neural nets?

However, despite all these exciting advances, there are several open questions here.

First, why should the above neural approach to solving PDEs be any better than a standard numerical method? The issues of discretization, solution uniqueness, and convergence continue to persist (note that the standard PINN formulation does *not* explicitly discretize the domain, but there is an implicit level of discretization achieved by how the collocation points in \(S_{\text{int}}\) are distributed).

Second, how do we know that we have obtained the right solution? One answer is to do a post hoc check: if we see low/zero training loss, then we are good. But *a priori* there do not seem to be any guarantees on how to achieve low loss, and I find this a bit unsatisfying.

Third, does the solution given by the PINN generalize to *all* points in the domain? In other words, how can we control the generalization error? See here and here for interesting generalization upper bounds. But I am not entirely sure how powerful these are. In any case getting non-vacuous bounds on neural net generalization is a challenging problem in itself.

Fourth, and somewhat unfortunately, we have to learn a *different* network from scratch for each new set of boundary conditions and/or PDE system.

In a later post, I will describe potential avenues towards addressing some of these questions.

We are now nearing the end of Month 4 of the You-Know-What, and in the absence of commuting I suddenly found time to catch up on the (numerous) unread books sitting on my Kindle. One of them was *Quicksilver* by fellow ex-Ames resident Neal Stephenson. The book is (typically) dense and packed with information, but there is an interesting connection to the current capital-P Pandemic: the first part is set during the Great Plague of 1665-66, which forced Newton to spend a year in the countryside watching apples fall out of trees.

In the preface of the Principia Mathematica, Newton defines the primary guiding principle of the scientific method as “[observing] the phenomena of motions to investigate the forces of nature, and then from these forces to demonstrate the other phenomena”.

IANA scientific historian, but it is interesting to interpret this in the light of today’s AI revolution. His words resonate with the principle of *generalization* in machine learning, where an abstract model is constructed (using numerical and/or other tools) from a given set of observations, and then deployed in new, unseen contexts. This forms the bedrock of a lot of ML (both theoretical and applied) research. Unfortunately, it has become clear that the community faces more questions than answers.

Suppose we take for granted the (plausible) ability of modern AI to deduce patterns from observations, and to apply these deductions to new unseen contexts. Could a properly constructed AI system be used to discover new scientific laws? Or design new engineering systems? Or even just accelerate the current cycle of scientific progress, much of which relies on expensive and time-consuming trial-and-error?

I spent a large portion of the last eighteen months leading a DARPA AIRA project on the interplay between AI and scientific discovery. The goal of AIRA was to explore the possibility of AI being a co-equal partner to human ingenuity in the loop of scientific discovery. Kudos to the DARPA team for shepherding an inspirational program with an intellectually diverse collection of performers.

In the coming weeks, I will describe some key recent advances in this area. I will focus on some of the work developed during the AIRA program, although it should be clear that this is only the tip of a very large iceberg. Somewhere in this list is also our recent work on GAN-like models for accelerating PDE solvers and materials characterization.
