# Chapter 8 - PAC learning primer and error bounds

OK, so implicit regularization (via the choice of training algorithm) may be part of the reason for generalization, but almost surely not the entire picture.

It may be useful to revisit classical ML here. Very broadly, the conventional thinking has been:

> Parsimony enables generalization.

The challenge is to precisely define what “parsimony” means here, since there are myriad ways of doing this. The bulk of work in generalization theory explores ever more refined complexity measures of ML models, but as we will see below, most existing classical approaches lead to vacuous bounds for deep networks. Getting non-vacuous generalization guarantees remains a major challenge.

## Warmup: Finite hypothesis classes

Below, $R(f)$ denotes the population risk and $\hat{R}(f)$ the empirical risk, and we would like $R(f) - \hat{R}(f)$ to be (a) small, and (b) decreasing as a function of the sample size $n$.

Theorem Consider the setting of binary classification, and let $\mathcal{F}$ be a hypothesis class with finitely many elements. If $n \geq \frac{C}{\varepsilon^2} \log \frac{|\mathcal{F}|}{\delta}$ for an absolute constant $C$, then with probability at least $1-\delta$, we have that for every $f \in \mathcal{F}$, $|R(f) - \hat{R}(f)| \leq \varepsilon$.

Proof By Hoeffding’s inequality, for each fixed $f \in \mathcal{F}$ the event $|R(f) - \hat{R}(f)| > \varepsilon$ has probability at most $2 e^{-2n\varepsilon^2}$; a union bound over the $|\mathcal{F}|$ hypotheses and solving for $n$ gives the claim.
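As a sanity check on the theorem, here is a small simulation. All specifics (the data distribution, the class of threshold classifiers, the noise level) are illustrative choices, not from the text: we estimate $\max_{f \in \mathcal{F}} |R(f) - \hat{R}(f)|$ for a finite class and watch it shrink as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X ~ Uniform[0, 1], true label 1[x > 0.5] with 10%
# label noise, and a finite class of 20 threshold classifiers f_t(x) = 1[x > t].
thresholds = np.linspace(0.05, 0.95, 20)

def sample(n):
    X = rng.uniform(size=n)
    y = (X > 0.5).astype(int)
    flip = rng.uniform(size=n) < 0.1
    y[flip] = 1 - y[flip]
    return X, y

def risks(X, y):
    # Empirical 0-1 risk of every f_t on the sample (X, y); shape (|F|,).
    preds = (X[:, None] > thresholds[None, :]).astype(int)  # (n, |F|)
    return (preds != y[:, None]).mean(axis=0)

# Approximate the population risk R(f) with a very large sample.
R = risks(*sample(1_000_000))

# The uniform deviation max_f |R(f) - R_hat(f)| shrinks roughly like 1/sqrt(n).
for n in [100, 1_000, 10_000]:
    gap = np.mean([np.abs(risks(*sample(n)) - R).max() for _ in range(50)])
    print(f"n = {n:>6}:  E[max_f |R(f) - R_hat(f)|] ~ {gap:.4f}")
```

Averaging over 50 repetitions smooths the randomness; the observed decay tracks the $1/\sqrt{n}$ rate discussed below.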

Remark Observe that this holds for all $f \in \mathcal{F}$, not just the optimal predictor. Such a bound is called a uniform convergence result.

Remark This works irrespective of the choice of $\mathcal{F}$ or the data distribution, which can be a weakness of the technique. Even in badly mismatched cases (for example, the data is highly nonlinear, but we only use linear classifiers), the above result shows that the generalization gap is small, even though the training error itself may be large. The lesson is that we need to control both the training error and the generalization gap.

Remark Rearranging terms, we get that the generalization error is $\lesssim \frac{1}{\sqrt{n}}$, which is very typical. Classical theory tells us that this is a natural stopping point for optimization: this is the “noise floor”, so we don’t have to optimize the train error below this level. (Note: unfortunately, deep learning doesn’t obey this, and optimizing the train error to zero is indeed beneficial; this is the “double descent” phenomenon.)

Remark The sample complexity scales as $n \propto \log |\mathcal{F}|$, which is essentially the number of “bits” needed to encode any particular hypothesis.
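To make the scaling concrete, here is a small sketch that evaluates the sample-size expression from the theorem. The absolute constant is hidden inside the $O(\cdot)$, so it is exposed as a parameter rather than fixed:

```python
import math

def pac_sample_size(card_F, eps, delta, C=1.0):
    """Smallest n suggested by the finite-class bound,
    n >= (C / eps^2) * log(|F| / delta).
    C is the absolute constant hidden in the O(.) of the theorem;
    we leave it as a parameter rather than commit to a value."""
    return math.ceil(C / eps**2 * math.log(card_F / delta))

# A class of 10^6 hypotheses (about 20 "bits"), eps = 0.05, delta = 0.01:
print(pac_sample_size(10**6, 0.05, 0.01))  # → 7369
```

Note how mild the dependence on $|\mathcal{F}|$ is: doubling the class size only adds $\frac{C}{\varepsilon^2}\log 2$ samples, consistent with the “bits” interpretation.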

## Complexity measures

There is a long list of ways to extend the above reasoning. The major drawback is that the above bound is meaningful only for hypothesis classes $\mathcal{F}$ with finite cardinality. Alternatives include:

• Covering number
• VC-dimension
• Pseudo-dimension
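For instance, for classes of binary classifiers the VC-dimension plays the role of $\log |\mathcal{F}|$ in the finite-class bound. One standard form of the resulting uniform convergence guarantee (constants and log factors vary across references, so treat this as a sketch) is: with probability at least $1-\delta$,

$\sup_{f \in \mathcal{F}} |R(f) - \hat{R}(f)| \lesssim \sqrt{\frac{\mathrm{VC}(\mathcal{F}) \log \frac{n}{\mathrm{VC}(\mathcal{F})} + \log \frac{1}{\delta}}{n}}.$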

### Data-dependent bounds

Definition of Rademacher complexity: for a sample $S = (x_1, \ldots, x_n)$, $\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i)\right]$, where the $\sigma_i$ are i.i.d. uniform $\pm 1$ signs; it measures how well the class can correlate with random noise on the observed sample. Upper bounds are known for NN.
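For a finite class, the empirical Rademacher complexity $\mathbb{E}_{\sigma}[\sup_{f} \frac{1}{n} \sum_i \sigma_i f(x_i)]$ can be estimated by Monte Carlo over the random signs. The setup below (a class of threshold classifiers on random points) is a hypothetical choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_rademacher(F_outputs, n_draws=2000):
    # Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ]
    # for a finite class given as a (|F|, n) matrix of {-1, +1} outputs.
    n = F_outputs.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    corr = sigma @ F_outputs.T / n          # (n_draws, |F|)
    return float(corr.max(axis=1).mean())

# Hypothetical example: 50 threshold classifiers evaluated on 200 points.
X = rng.uniform(size=200)
thresholds = np.linspace(0.0, 1.0, 50)
F_outputs = np.where(X[None, :] > thresholds[:, None], 1.0, -1.0)

print(empirical_rademacher(F_outputs))
```

Shrinking the sample (or enlarging the class) increases the estimate, matching the intuition that Rademacher complexity quantifies how easily the class fits random labels.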

### PAC-Bayes bounds

Possibly the first approach to produce “non-vacuous” bounds, at least for small networks. Key results: the basic approach [1], and its application to DL [2].
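One common McAllester-style form of the bound (the exact constants and log factor vary across references, so treat this as a sketch): for any prior $P$ over hypotheses chosen before seeing the data, with probability at least $1-\delta$ over the sample, simultaneously for all posteriors $Q$,

$\mathbb{E}_{f \sim Q}[R(f)] \leq \mathbb{E}_{f \sim Q}[\hat{R}(f)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log \frac{n}{\delta}}{2(n-1)}}.$

Choosing $Q$ concentrated near a single $f$ recovers a finite-class-style bound, with $\mathrm{KL}(Q \,\|\, P)$ playing the role of $\log |\mathcal{F}|$: the “description length” of $f$ under the prior.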

## Error bounds for (deep) neural networks

Key results: here [3], here [4], here [5].

$\text{Test error} \leq \text{train error}~+~O \left(\frac{\text{complexity measure}}{\sqrt{n}}\right).$
• Usually, the complexity measure in the numerator is far too large, leading to “vacuous” bounds. For example, in [5] it reduces to $\text{depth} \times \text{width}$, which is too large in the overparameterized setting.