Idioma:

Employing nonlinear time series analysis tools with stable clustering algorithms for detecting concept drift on data streams

Costa, Fausto Guzzo Da

Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Ciências Matemáticas e de Computação 2017-08-17

Acesso online. A biblioteca também possui exemplares impressos.

Enviar para

Título:
Employing nonlinear time series analysis tools with stable clustering algorithms for detecting concept drift on data streams
Autor: Costa, Fausto Guzzo Da
Orientador: Mello, Rodrigo Fernandes de
Assuntos: Agrupamento; Mudanças De Conceito; Fluxos De Dados; Aprendizado De Máquina; Séries Temporais Não Lineares; Concept Drift; Data Streams; Clustering; Machine Learning; Nonlinear Time Series
Notas: Tese (Doutorado)
Descrição: Several industrial, scientific and commercial processes produce open-ended sequences of observations which are referred to as data streams. We can understand the phenomena responsible for such streams by analyzing data in terms of their inherent recurrences and behavior changes. Recurrences support the inference of more stable models, which are deprecated by behavior changes though. External influences are regarded as the main agent actuacting on the underlying phenomena to produce such modifications along time, such as new investments and market polices impacting on stocks, the human intervention on climate, etc. In the context of Machine Learning, there is a vast research branch interested in investigating the detection of such behavior changes which are also referred to as concept drifts. By detecting drifts, one can indicate the best moments to update modeling, therefore improving prediction results, the understanding and eventually the controlling of other influences governing the data stream. There are two main concept drift detection paradigms: the first based on supervised, and the second on unsupervised learning algorithms. The former faces great issues due to the labeling infeasibility when streams are produced at high frequencies and large volumes. The latter lacks in terms of theoretical foundations to provide detection guarantees. In addition, both paradigms do not adequately represent temporal dependencies among data observations. In this context, we introduce a novel approach to detect concept drifts by tackling two deficiencies of both paradigms: i) the instability involved in data modeling, and ii) the lack of time dependency representation. Our unsupervised approach is motivated by Carlsson and Memolis theoretical framework which ensures a stability property for hierarchical clustering algorithms regarding to data permutation. To take full advantage of such framework, we employed Takens embedding theorem to make data statistically independent after being mapped to phase spaces. Independent data were then grouped using the Permutation-Invariant Single-Linkage Clustering Algorithm (PISL), an adapted version of the agglomerative algorithm Single-Linkage, respecting the stability property proposed by Carlsson and Memoli. Our algorithm outputs dendrograms (seen as data models), which are proven to be equivalent to ultrametric spaces, therefore the detection of concept drifts is possible by comparing consecutive ultrametric spaces using the Gromov-Hausdorff (GH) distance. As result, model divergences are indeed associated to data changes. We performed two main experiments to compare our approach to others from the literature, one considering abrupt and another with gradual changes. Results confirm our approach is capable of detecting concept drifts, both abrupt and gradual ones, however it is more adequate to operate on complicated scenarios. The main contributions of this thesis are: i) the usage of Takens embedding theorem as tool to provide statistical independence to data streams; ii) the implementation of PISL in conjunction with GH (called PISLGH); iii) a comparison of detection algorithms in different scenarios; and, finally, iv) an R package (called streamChaos) that provides tools for processing nonlinear data streams as well as other algorithms to detect concept drifts.
DOI: 10.11606/T.55.2017.tde-13112017-105506
Editor: Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Ciências Matemáticas e de Computação
Data de criação/publicação: 2017-08-17
Formato: Adobe PDF
Idioma: Inglês

Links

Voltar para lista de resultados

Anterior Resultado 3 Avançar Ir para próxima página

Realização: Logos de Redes Sociais:

Employing nonlinear time series analysis tools with stable clustering algorithms for detecting concept drift on data streams

Costa, Fausto Guzzo Da

Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Ciências Matemáticas e de Computação 2017-08-17

Buscando em bases de dados remotas. Favor aguardar.