skip to main content
Tipo de recurso Mostra resultados com: Mostra resultados com: Índice

Employing nonlinear time series analysis tools with stable clustering algorithms for detecting concept drift on data streams

Costa, Fausto Guzzo Da

Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Ciências Matemáticas e de Computação 2017-08-17

Acesso online. A biblioteca também possui exemplares impressos.

  • Título:
    Employing nonlinear time series analysis tools with stable clustering algorithms for detecting concept drift on data streams
  • Autor: Costa, Fausto Guzzo Da
  • Orientador: Mello, Rodrigo Fernandes de
  • Assuntos: Agrupamento; Mudanças De Conceito; Fluxos De Dados; Aprendizado De Máquina; Séries Temporais Não Lineares; Concept Drift; Data Streams; Clustering; Machine Learning; Nonlinear Time Series
  • Notas: Tese (Doutorado)
  • Descrição: Several industrial, scientific and commercial processes produce open-ended sequences of observations which are referred to as data streams. We can understand the phenomena responsible for such streams by analyzing data in terms of their inherent recurrences and behavior changes. Recurrences support the inference of more stable models, which are deprecated by behavior changes though. External influences are regarded as the main agent actuacting on the underlying phenomena to produce such modifications along time, such as new investments and market polices impacting on stocks, the human intervention on climate, etc. In the context of Machine Learning, there is a vast research branch interested in investigating the detection of such behavior changes which are also referred to as concept drifts. By detecting drifts, one can indicate the best moments to update modeling, therefore improving prediction results, the understanding and eventually the controlling of other influences governing the data stream. There are two main concept drift detection paradigms: the first based on supervised, and the second on unsupervised learning algorithms. The former faces great issues due to the labeling infeasibility when streams are produced at high frequencies and large volumes. The latter lacks in terms of theoretical foundations to provide detection guarantees. In addition, both paradigms do not adequately represent temporal dependencies among data observations. In this context, we introduce a novel approach to detect concept drifts by tackling two deficiencies of both paradigms: i) the instability involved in data modeling, and ii) the lack of time dependency representation. Our unsupervised approach is motivated by Carlsson and Memolis theoretical framework which ensures a stability property for hierarchical clustering algorithms regarding to data permutation. To take full advantage of such framework, we employed Takens embedding theorem to make data statistically independent after being mapped to phase spaces. Independent data were then grouped using the Permutation-Invariant Single-Linkage Clustering Algorithm (PISL), an adapted version of the agglomerative algorithm Single-Linkage, respecting the stability property proposed by Carlsson and Memoli. Our algorithm outputs dendrograms (seen as data models), which are proven to be equivalent to ultrametric spaces, therefore the detection of concept drifts is possible by comparing consecutive ultrametric spaces using the Gromov-Hausdorff (GH) distance. As result, model divergences are indeed associated to data changes. We performed two main experiments to compare our approach to others from the literature, one considering abrupt and another with gradual changes. Results confirm our approach is capable of detecting concept drifts, both abrupt and gradual ones, however it is more adequate to operate on complicated scenarios. The main contributions of this thesis are: i) the usage of Takens embedding theorem as tool to provide statistical independence to data streams; ii) the implementation of PISL in conjunction with GH (called PISLGH); iii) a comparison of detection algorithms in different scenarios; and, finally, iv) an R package (called streamChaos) that provides tools for processing nonlinear data streams as well as other algorithms to detect concept drifts.
  • DOI: 10.11606/T.55.2017.tde-13112017-105506
  • Editor: Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Ciências Matemáticas e de Computação
  • Data de criação/publicação: 2017-08-17
  • Formato: Adobe PDF
  • Idioma: Inglês

Buscando em bases de dados remotas. Favor aguardar.