Chapter 9  Generalization bounds via stability
In the last two lectures, we have investigated generalization through the lens of
 marginbased regularization
 model parsimony
Each had their shortfalls, especially in the context of deep learning:
 Regularization very much depends on model architecture and/or training procedure. Implicit bias of GD shows up mysteriously in several places.
 Model parsimony is hard to measure; interacts with the data distribution in hardtoquantify ways; and anyway in the last chapter we saw that uniform convergence seems to be an insufficient tool for getting nontrivial bounds, especially in the overparameterized setting.
Let us now briefly a third lens to understand generalization:
 Algorithmic stability.
Our central hypothesis in this chapter is:
Stable learning implies generalization.
The high level idea is that if the model produced by a training algorithm is not sensitive to any single training data point, then the model generalizes. This idea has been around for some time (and in fact, methods such as bagging^{1} explicitly were developed in order to make classifiers stable to individual points.)
(Notice that the definition of stability involves specifying the training procedure; once again, the role of the training algorithm is more clearly illuminated here.)
The plus side is that we won’t need to appeal to uniform convergence, so overparameterization is not a concern. The minus side is that convexity of the loss seems to be the key tool here, which makes application to deep learning difficult.
On the other hand, our tour of optimization (Chapter 4) has given us tools to deal with (some) nonconvex losses. All these will be prove to be fruitful.
Setup
Stability as a tool for generalization dates back to an influential paper by Bousquet and Elisseef^{2}, who introduced the notion of uniform stability.
Further refined in the context of empirical risk minimization by ShalevSchwartz et al.^{3}.
Algorithmic stability of (S)GD for empirical risk minimization
Beyond convexity
Stability under smoothness + PLcondition. Key paper: here^{4}.
Connections to differential privacy
Fascinating parallels between
 generalization via uniform stability
 differential privacy (DP)
which has a different origin story, dating back to several papers by Dwork and coauthors ^{5} ^{6} ^{7}.
In DP the goal is to protect “data leakage” – nothing about the user’s identity (as far as possible) should be revealed from the model parameters. But the method to achieve is the same as the ones we discussed above: make sure the model does not depend on any one data point too much.
Upshot: DP implies uniform stability (and DPstyle algorithm design gives a way to control generalization.)
Complete.

L. Breiman, Bagging Predictors, 1994. ↩︎

O. Bousquet and A. Elisseef, Stability and Generalization, 2002. ↩︎

S. ShalevSchwartz, O. Shamir, K. Sridharan, N. Srebro, Learnability, Stability and Uniform Convergence, 2010. ↩︎

Z. Charles and D. Papailiopoulos, Stability and Generalization of Learning Algorithms that Converge to Global Optima, 2018. ↩︎

C. Dwork, Differential Privacy, 2006. ↩︎

C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating Noise to Sensitivity in Private Data Analysis, 2006. ↩︎

C. Dwork and A. Roth, The Algorithmic Foundations of Differential Privacy, 2014. ↩︎