
Error estimation based on variance analysis of k-fold cross-validation

Jiang, Gaoxia ; Wang, Wenjian

Pattern Recognition, 2017-09, Vol. 69, pp. 94-106 [Peer-reviewed journal]

Elsevier Ltd

Full text available

  • Title:
    Error estimation based on variance analysis of k-fold cross-validation
  • Author: Jiang, Gaoxia ; Wang, Wenjian
  • Subjects: Error estimation ; k-fold cross-validation ; Model selection ; Variance analysis
  • Is part of: Pattern Recognition, 2017-09, Vol. 69, pp. 94-106
  • Description:
    Highlights:
    • When the numbers of samples and folds are both large enough, we prove that CV variance and CV accuracy have a quadratic relationship.
    • The relationships between CV variance and its influencing factors are derived, making it possible to predict which variance is smaller before applying k-fold CV.
    • Theoretical explanations are given for some empirical findings of Rodríguez and Kohavi from the perspective of variance analysis.
    • The proposed normalized variance correlates significantly with the error and is unrelated to k, so it can serve as a stable error measurement.
    Abstract: Cross-validation (CV) is often used to estimate the generalization capability of a learning model. The variance of the CV error has a considerable impact on the accuracy of the CV estimator and the adequacy of the learning model, so it is very important to analyze CV variance. The aim of this paper is to investigate how to improve the accuracy of error estimation through variance analysis. We first describe the quantitative relationship between CV variance and CV accuracy, which provides guidance for improving accuracy by reducing variance. We then study the relationships between the variance and relevant variables, including the sample size, the number of folds, and the number of repetitions; these form the basis of theoretical strategies for regulating CV variance. Our classification results theoretically explain the empirical results of Rodríguez and Kohavi. Finally, we propose a uniform normalized variance which not only measures model accuracy but is also independent of the fold number; it therefore simplifies the selection of the fold number in k-fold CV, and the normalized variance can serve as a stable error measurement for model comparison and selection. We report the results of experiments using 5 supervised learning models and 20 datasets. The results indicate that the proposed theorems can reliably determine which variance is smaller before running k-fold CV, and thus the accuracy of error estimation can be improved by reducing variance. In so doing, we are more likely to select the best parameter or model. A minimal illustrative sketch of repeated k-fold CV and its error variance follows the record fields below.
  • Publisher: Elsevier Ltd
  • Language: English
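
As a concrete illustration of the quantities the abstract analyzes, here is a minimal sketch (not the authors' code) of repeated k-fold CV: each repetition yields one CV error estimate, and the spread of those estimates is the CV variance the paper studies. The dataset, classifier, k, and number of repetitions are illustrative assumptions, not taken from the paper.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative choices (assumptions): dataset, classifier, k, repetitions.
X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)
k, repetitions = 10, 20

errors = []
for r in range(repetitions):
    # Each repetition reshuffles the data, giving one k-fold CV error estimate.
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=r)
    fold_accuracy = cross_val_score(model, X, y, cv=cv)  # accuracy per fold
    errors.append(1.0 - fold_accuracy.mean())            # CV error estimate

errors = np.array(errors)
# The sample variance across repetitions estimates the CV variance;
# averaging over repetitions is one way to reduce it.
print(f"mean CV error over {repetitions} repetitions: {errors.mean():.4f}")
print(f"variance of the CV error estimate: {errors.var(ddof=1):.6f}")

Increasing the number of repetitions (or the sample size) shrinks this variance, which is the lever the paper's theorems formalize. The paper's normalized variance itself is not reproduced here, since its formula is not given in the abstract.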
