Foundations of Deep Learning
NYU Tandon, Spring 2022

Chapter 8 - PAC learning primer and error bounds

OK, so implicit regularization (via the choice of training algorithm) may be part of the reason for generalization, but almost surely not the entire picture.

It may be useful to revisit classical ML here. Very broadly, the conventional thinking has been:

Parsimony enables generalization.

The challenge is to precisely define what “parsimony” means here, since there are myriad ways of doing this. The bulk of work in generalization theory explores more and more refined complexity measures of ML models, but as we will see below, most existing classical approaches lead to vacuous bounds for deep networks. Getting non-vacuous generalization guarantees is a major challenge.

Warmup: Finite hypothesis classes

Below, $R(f)$ denotes the population risk and $\hat{R}(f)$ the empirical risk; we would like the gap $R(f) - \hat{R}(f)$ to be (a) small, and (b) decreasing as a function of the sample size $n$.
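
For concreteness, one standard way to write these quantities for binary classification with the 0-1 loss (the precise loss is not important for what follows) is:

\[R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\mathbf{1}\{f(x) \neq y\}\right], \qquad \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{f(x_i) \neq y_i\},\]

where $\mathcal{D}$ is the (unknown) data distribution and $(x_1, y_1), \ldots, (x_n, y_n)$ are $n$ i.i.d. samples from $\mathcal{D}$.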

Theorem Consider the setting of binary classification, and let $\f$ be a hypothesis class with finitely many elements. If $n \geq O\left( \frac{1}{\varepsilon^2} \log \frac{|\f|}{\delta}\right)$ then with probability at least $1-\delta$, we have that for every $f \in \f$, $|R(f) - \hat{R}(f)| \leq \varepsilon$.

Proof Hoeffding + union bound.
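
In slightly more detail (a sketch; constants are not optimized): for a fixed $f$, the indicators $\mathbf{1}\{f(x_i) \neq y_i\}$ are i.i.d. random variables bounded in $[0,1]$ with mean $R(f)$, so Hoeffding’s inequality gives

\[\Pr\left[\, |R(f) - \hat{R}(f)| > \varepsilon \,\right] \leq 2 e^{-2 n \varepsilon^2}.\]

Taking a union bound over all $|\f|$ hypotheses bounds the overall failure probability by $2 |\f| e^{-2 n \varepsilon^2}$; setting this equal to $\delta$ and solving for $n$ gives $n \geq \frac{1}{2\varepsilon^2} \log \frac{2|\f|}{\delta}$, which matches the claimed sample complexity up to constants.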

Remark Observe that this holds for all $f \in \f$, not just the optimal predictor. Such a bound is called a uniform convergence result.

Remark This works irrespective of the choice of $\f$ or the data distribution, which could be a weakness of the technique. Even in badly mismatched cases (for example, the data is highly nonlinear but we only use linear classifiers), the above result shows that the generalization error, i.e., the gap between train and test error, is small; the train error itself may of course be large. The lesson is that we need to control both the train error and the generalization error.

Remark Rearranging terms, we get that the generalization error is $\lesssim \frac{1}{\sqrt{n}}$, which is very typical. Classical theory tells us that this is a natural stopping point for optimization: it is the “noise floor”, so we don’t have to optimize the train error below this level. (Note: unfortunately, deep learning doesn’t obey this, and optimizing the train error all the way to zero is indeed often beneficial; this is related to the “double descent” phenomenon.)

Remark The scaling of the sample complexity looks like $n \approx \log |\f|$, which is essentially the number of “bits” needed to encode any particular hypothesis.
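
As a quick numerical sanity check of the $1/\sqrt{n}$ scaling, here is a small simulation sketch (the hypothesis class, noise level, and all names below are illustrative choices, not from the lecture): we take a finite class of threshold classifiers on $[0,1]$, draw training sets of increasing size, and estimate the worst-case gap $\max_{f \in \f} |R(f) - \hat{R}(f)|$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite hypothesis class: thresholds t on [0, 1]; f_t(x) = 1{x >= t}.
thresholds = np.linspace(0.0, 1.0, 50)

def sample_data(n):
    # Ground truth y = 1{x >= 0.5}, with 10% label noise.
    x = rng.uniform(0.0, 1.0, size=n)
    y = (x >= 0.5).astype(int)
    flip = rng.uniform(size=n) < 0.1
    return x, np.where(flip, 1 - y, y)

def risks(x, y):
    # Empirical 0-1 risk of every threshold classifier on the sample (x, y).
    preds = (x[None, :] >= thresholds[:, None]).astype(int)
    return (preds != y[None, :]).mean(axis=1)

# "Population" risk, approximated on a large held-out sample.
x_pop, y_pop = sample_data(200_000)
R_pop = risks(x_pop, y_pop)

for n in [100, 400, 1600, 6400, 25600]:
    gaps = []
    for _ in range(50):  # average over independent training draws
        x_tr, y_tr = sample_data(n)
        gaps.append(np.max(np.abs(risks(x_tr, y_tr) - R_pop)))
    print(f"n={n:6d}   worst-case gap ~ {np.mean(gaps):.4f}   1/sqrt(n) = {n ** -0.5:.4f}")
```

The printed worst-case gap shrinks roughly in proportion to $1/\sqrt{n}$, consistent with the theorem.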

Complexity measures

There is a long list of ways to extend the above reasoning. The major drawback of the warmup result is that the bound is meaningful only for hypothesis classes $\f$ with finite cardinality. Alternatives that handle infinite classes include the following (a representative bound for the VC-dimension case is sketched after the list):

  • Covering number
  • VC-dimension
  • Pseudo-dimension
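
For instance, a standard uniform-convergence bound in terms of the VC-dimension $d$ of $\f$ (one common form; constants omitted) states that with probability at least $1-\delta$, for every $f \in \f$,

\[|R(f) - \hat{R}(f)| \leq O\left( \sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}} \right).\]

The cardinality term $\log |\f|$ from the warmup is replaced by a combinatorial measure of the richness of $\f$, so the bound also applies to infinite classes.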

Agnostic (PAC) learning

Data-dependent bounds

Definition of Rademacher complexity, upper bounds for NN.
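
One common way to define it: the empirical Rademacher complexity of $\f$ on a sample $S = (x_1, \ldots, x_n)$ is

\[\hat{\mathfrak{R}}_S(\f) = \mathbb{E}_{\sigma}\left[ \sup_{f \in \f} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right],\]

where the $\sigma_i$ are i.i.d. Rademacher (uniform $\pm 1$) random variables. Intuitively, it measures how well the class can correlate with random signs on the observed data, and it plays (roughly) the role that $\sqrt{\log |\f| / n}$ played in the warmup bound.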

PAC-Bayes bounds

Possibly the first approach to produce “non-vacuous” bounds, at least for small networks. Key results: basic approach1, application to DL2.
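
For reference, one common (McAllester-style) form of the PAC-Bayes bound, with exact constants and log factors differing between statements: fix a “prior” distribution $P$ over hypotheses before seeing the data; then with probability at least $1-\delta$, simultaneously for all “posterior” distributions $Q$,

\[\mathbb{E}_{f \sim Q}[R(f)] \;\leq\; \mathbb{E}_{f \sim Q}[\hat{R}(f)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log\frac{n}{\delta}}{2(n-1)}}.\]

The KL term plays the role of the description length $\log|\f|$ in the warmup: if the learned posterior $Q$ stays close to the prior $P$, the bound can remain small even when the hypothesis class is huge, which is what makes non-vacuous bounds for (small) networks possible.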

Error bounds for (deep) neural networks

Key results: here3, here4, here5.

Possible roadblocks?

All of the above bounds lead to generalization bounds of the form:

\[\text{Test error} \leq \text{train error}~+~O \left(\frac{\text{complexity measure}}{\sqrt{n}}\right),\]

and progress has been focused on defining better and better complexity measures. However, there are two issues with this:

  • Usually, the complexity measure in the numerator is far too large anyway, leading to “vacuous” bounds. For example, in 5 it reduces to $\text{depth} \times \text{width}$, which is far too large in the overparameterized setting (see the toy calculation after this list).

  • This also (seemingly) implies that, for a fixed model class, the error bound should decrease as the dataset size grows. Empirically, this is not what we observe :( .
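
To get a feel for the first issue, here is a toy back-of-the-envelope computation (the width, depth, and dataset size below are made-up illustrative numbers, not from the lecture): plugging a $\text{depth} \times \text{width}$ style complexity measure into the bound above for a modestly overparameterized MLP on a CIFAR-10-sized dataset already gives a “bound” far above 1, which is vacuous for the 0-1 error.

```python
import math

# Illustrative numbers only: a plain MLP on a CIFAR-10-sized dataset.
width, depth = 1024, 10      # hidden width and number of layers
n_train = 50_000             # number of training samples

complexity = depth * width   # the depth x width measure mentioned above
bound_term = complexity / math.sqrt(n_train)

print(f"complexity / sqrt(n) = {bound_term:.1f}")
# ~ 45.8, while the 0-1 test error is trivially at most 1.
```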

A recent result provides evidence6 that any bound based on uniform convergence may suffer from this kind of looseness. We likely need alternative techniques, which we will turn to next.