Publication Detail
On the information bottleneck theory of deep learning
Publication Type: Journal article
Publication Sub Type: article
Authors: Saxe AM, Bansal Y, Dapello J, Advani M, Kolchinsky A, Tracey BD, Cox DD
Publication date: 12/2019
Pagination: 124020
Journal: Journal of Statistical Mechanics: Theory and Experiment
Article number: 12
Print ISSN: 1742-5468
Notes: Number: 12; Publisher: IOP Publishing; keywords: publications
Abstract
The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case. Through a combination of analytical results and simulation, we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.
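The saturation argument in the abstract can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' code, and it rests on several assumptions that do not come from this record: a single neuron with a scalar standard-Gaussian input, activations discretized into fixed-width bins, and an illustrative bin width of 2/30. The function names (`relu`, `binned_entropy_bits`, `activation_entropy_bits`) are hypothetical. Because the neuron is deterministic, the mutual information between the input and the binned activity equals the entropy of the binned activity.

```python
import numpy as np


def relu(z):
    """Single-sided nonlinearity: zero for negative inputs, identity otherwise."""
    return np.maximum(z, 0.0)


def binned_entropy_bits(activations, bin_width):
    """Entropy, in bits, of activations discretized into fixed-width bins."""
    # Assign each activation to the bin [k * bin_width, (k + 1) * bin_width).
    bin_ids = np.floor(activations / bin_width).astype(np.int64)
    _, counts = np.unique(bin_ids, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def activation_entropy_bits(nonlinearity, weight, x, bin_width=2.0 / 30.0):
    """Binned-activity entropy of the toy neuron h = nonlinearity(weight * x).

    The neuron is deterministic, so this entropy equals the mutual
    information between the input x and the binned activity.
    The default bin width is an illustrative assumption.
    """
    return binned_entropy_bits(nonlinearity(weight * x), bin_width)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000)  # assumed input distribution
    for w in (0.1, 1.0, 10.0, 100.0):
        print(
            f"weight={w:6.1f}  "
            f"tanh: {activation_entropy_bits(np.tanh, w, x):.2f} bits  "
            f"relu: {activation_entropy_bits(relu, w, x):.2f} bits"
        )
```

Under these assumptions, the tanh entropy should first rise and then fall as the weight grows, because the outputs pile up in the bins near -1 and +1. The ReLU entropy should keep growing, because its positive outputs spread over ever more bins. This mirrors the qualitative contrast the abstract draws between double-sided and single-sided saturating nonlinearities. It says nothing about trained networks.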