
Analytical variation in the generalization of deep feed-forward neural networks

Neves, Carlos Guatimosim

Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Matemática e Estatística 2021-01-26

Online access. The library also holds printed copies.

  • Title:
    Analytical variation in the generalization of deep feed-forward neural networks
  • Author: Neves, Carlos Guatimosim
  • Advisor: Vicente, Renato
  • Subjects: Hardy-Krause Variation; Neural Networks; Regularization; Generalization Error; Generalization Theory; Generalization Gap; Deep Learning; Statistical Learning; Analytical Learning; Machine Learning; Artificial Intelligence
  • Notes: Master's Dissertation
  • Description: The essence of Machine Learning modelling may be summarized as follows: to unravel the implicit pattern in the data while having access to only a finite number of samples. The theory that studies this process is rich, and two quantities are of particular importance: the performance errors inside and outside the sample. The error inside the sample is called the training error, and is computed over the set used to optimize the model's parameters. The outside error is the average error over all samples, and may be understood as the true error. Although the final objective is to construct a model with a low true error, we only have access to the training error, which is an empirical estimate. Thus, in order to deduce the general pattern in the data, it is necessary that the two be similar.
    The distance between these errors is called the generalization gap, and much of the theory is dedicated to studying its properties and upper bounds, so that we may understand under which circumstances it is controlled. The gap measures the model's ability to properly induce the global pattern, and is a major topic in all applied instances of Machine Learning (these quantities are written out after this record).
    The classical view of statistics correlates the generalization property with the model's capacity to fit patterns in the data. The reasoning is that a model capable of fitting many configurations is prone to reading noise in the sample, and thus performs poorly in general. However, the definition of a model's complexity is loose, and while it usually translates to the number of parameters, there is one hypothesis space that seemingly escapes this intuition: Neural Networks. Using Neural Networks with many layers (Deep Learning) is proving to be the best modelling paradigm for many benchmark problems, and many of the advances in industry are due to their success. However, this seemingly goes against what classical Statistical Learning theory states about generalization and complexity, since deep networks are capable of fitting many patterns. Indeed, there have been experiments showing that networks may fit even random labels. This apparent paradox is an open question in the field and the main topic of this work.
    After an introduction and an overview of the classical understanding of generalization, we introduce the work of [20], which is central to our contributions. In it, a new approach named Analytical Learning is proposed, aiming to complement the classical one and hopefully bring some insight into the apparent contradiction that emerged from Deep Learning. Instead of analyzing probabilistic bounds, that paper studies the generalization gap in a context where the predictor and the dataset are fixed. By doing so, we avoid the pessimistic cases, and a tighter bound is hopefully achieved. Additionally, it provides a more realistic scenario, since in practice the data is usually given. The main result of [20] bounds the gap by a term related to the data and another related to the Hardy-Krause Variation of the loss function (a bound of this shape is illustrated after this record).
    Our main contribution revolves around tracing similarities between this variation term and the stability concept studied in the classical approach to generalization, drawing parallels with what may be understood in the Analytical case as information. The main idea is that the variation of the loss function decreases if the partial derivatives of the predictor, with respect to the instance space, are close to the oracle's.
    The derivatives in this sense may be understood as how much information the function is reading, since they measure the impact of a given dimension on a local prediction. Thus, if the predictor reads information similarly to the oracle, then we guarantee a low gap. With this, we argue that the partial derivatives of the predictor are the main measure of regularization in the analytical sense. One of the advantages of this view is simplicity: by rewriting the SGD (Stochastic Gradient Descent) optimization step in function space, we obtain an easy way to investigate the evolution of the model's complexity during training (a first-order version of this rewriting is sketched after this record).
    Furthermore, we use this interpretation to build on two relatively recent papers that tackle the generalization paradox in deep learning, [28] and [37]. For the former we carry out an extensive analysis, while for the latter we take a briefer, qualitative approach, showing how our interpretation relates to their result. In [28], the complexity of networks is studied through the lens of Fourier theory. It is shown that the space of ReLU (Rectified Linear Unit) networks has a strong spectral decay: during optimization, the increments in the k-th harmonic caused by the weight updates decay at least as fast as 1/k^2. This means that high-frequency magnitudes in this space are naturally damped during training, suggesting an inherent regularization property. However, at no point in [28] is the generalization gap mentioned, so it is not clear whether the spectral decay is enough to guarantee a good estimate of the true error. Motivated by this, we show a bound using the Hardy-Krause Variation on splines which decreases with the degree, justifying the special properties of ReLU activation functions.
    In [37], the main theorem shows that if the architecture of the network follows a funnel pattern (the number of neurons decreases as the network gets deeper), then increasing the number of layers actually reduces the generalization gap, thus supporting the deep learning approach. This happens because the funnel-like architecture forces a nontrivial kernel in the linear transformations, which translates into a loss of information. It implies that as the number of layers increases, the information shared between the final layer and the dataset decreases, making the prediction less data-dependent and thus regularized (a minimal funnel architecture is sketched after this record). This result relates closely to our interpretation of information in the analytical sense. Having a nontrivial kernel means that in some cases the prediction is constant with respect to perturbations along certain dimensions, so the overall variation (in the sense of derivatives) is smaller, which according to Analytical Learning translates into a smaller generalization gap.
  • DOI: 10.11606/D.45.2021.tde-19042021-202404
  • Publisher: Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Matemática e Estatística
  • Date of creation/publication: 2021-01-26
  • Format: Adobe PDF
  • Language: English
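
As a reference for the quantities discussed in the description above, the standard definitions are written out here; the notation (predictor f, loss ℓ, sample S, distribution D) is ours and not necessarily that of the dissertation.

```latex
% Training (in-sample) error, true (out-of-sample) error, and the generalization gap.
% Notation (ours): predictor f, loss \ell, sample S = \{(x_i, y_i)\}_{i=1}^{n} drawn from \mathcal{D}.
\[
  \widehat{R}_S(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)
  \qquad \text{(training error)}
\]
\[
  R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\bigl[\ell\bigl(f(x), y\bigr)\bigr]
  \qquad \text{(true error)}
\]
\[
  \mathrm{gap}(f, S) = \bigl|\, R(f) - \widehat{R}_S(f) \,\bigr|
  \qquad \text{(generalization gap)}
\]
```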
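The description says only that the main result of [20] bounds the gap by a data term times the Hardy-Krause Variation of the loss. The classical inequality with exactly this shape is the Koksma-Hlawka inequality, reproduced below as an illustration; the precise statement and assumptions in [20] may differ.

```latex
% Koksma-Hlawka inequality: an illustration of a "data term times Hardy-Krause
% variation" bound. Here g : [0,1]^d -> R plays the role of the loss composed
% with the fixed predictor, x_1, ..., x_n is the fixed sample, D_n^* is its
% star discrepancy (the data term), and V_HK(g) is the Hardy-Krause variation.
\[
  \left| \frac{1}{n}\sum_{i=1}^{n} g(x_i) - \int_{[0,1]^d} g(u)\,\mathrm{d}u \right|
  \;\le\;
  D_n^{*}(x_1, \dots, x_n) \cdot V_{\mathrm{HK}}(g)
\]
```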
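The description also mentions rewriting the SGD step in function space to track the model's complexity during training. A common first-order way to do this rewriting (our sketch, of the kind that underlies neural-tangent-kernel analyses, and not necessarily the exact derivation in the dissertation) is:

```latex
% One SGD step on the parameters, followed by its first-order effect on the
% function. \theta_t are the parameters, \eta the learning rate, B the
% mini-batch, \ell the per-example loss, f_\theta the predictor.
\[
  \theta_{t+1} = \theta_t - \eta \sum_{i \in B}
    \nabla_\theta\, \ell\bigl(f_{\theta_t}(x_i), y_i\bigr)
\]
\[
  f_{\theta_{t+1}}(x) \approx f_{\theta_t}(x)
  - \eta \sum_{i \in B}
    \bigl\langle \nabla_\theta f_{\theta_t}(x),\, \nabla_\theta f_{\theta_t}(x_i) \bigr\rangle\,
    \partial_f \ell\bigl(f_{\theta_t}(x_i), y_i\bigr)
\]
% The second line expresses the update directly on the function, so changes in
% its input-space partial derivatives (the "information" it reads) can be
% followed across training steps.
```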
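Finally, the funnel architecture of [37] (layer widths shrinking with depth, so that each linear map has a nontrivial kernel) can be illustrated with a minimal sketch; the code below, including the names and widths, is ours and is not taken from the dissertation.

```python
# Minimal sketch of a "funnel" ReLU network: widths strictly decrease with
# depth, so each weight matrix maps a wider space into a narrower one and
# therefore has a nontrivial kernel (null space). Perturbations of the input
# along that kernel cannot change the layer's output, which is the
# information-loss mechanism described in the abstract.
import torch
import torch.nn as nn


def funnel_mlp(widths):
    """Build a ReLU MLP whose layer widths strictly decrease, e.g. [64, 32, 16, 8, 1]."""
    assert all(a > b for a, b in zip(widths, widths[1:])), "widths must shrink"
    layers = []
    for d_in, d_out in zip(widths, widths[1:]):
        layers.append(nn.Linear(d_in, d_out))  # d_out < d_in => nontrivial kernel
        layers.append(nn.ReLU())
    return nn.Sequential(*layers[:-1])  # no activation after the last layer


if __name__ == "__main__":
    model = funnel_mlp([64, 32, 16, 8, 1])
    x = torch.randn(5, 64)
    print(model(x).shape)  # torch.Size([5, 1])
```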
