skip to main content
Primo Search
Search in: Busca Geral

Techniques for indexing large and complex datasets with missing attribute values.

Brinis, Safia

Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Ciências Matemáticas e de Computação 2016-07-18

Acesso online. A biblioteca também possui exemplares impressos.

  • Título:
    Techniques for indexing large and complex datasets with missing attribute values.
  • Autor: Brinis, Safia
  • Orientador: Traina Junior, Caetano; Traina, Agma Juci Machado
  • Assuntos: Busca Por Similaridade; Dimensão Fractal; Métodos De Acesso Métricos; Valores De Atributos Faltantes; Fractal Dimension; Metric Access Methods; Missing Attribute Values; Similarity Search
  • Notas: Tese (Doutorado)
  • Descrição: Due to the increasing amount and complexity of data processed in real world applications, similarity search became a vital task to store and retrieve such data. However, missing attribute values are very frequent and metric access methods (MAMs), designed to support similarity search, do not operate on datasets when attribute values are missing. Currently, the approach to use the existing indexing techniques on datasets with missing attribute values just use an indicator to identify the missing values and employ a traditional indexing technique. Although, this approach can be applied over multidimensional indexing techniques, it is impractical for metric access methods. This dissertation presents the results of a research conducted to identify and deal with the issues related to indexing and querying datasets with missing values in metric spaces. An empirical analysis of the metric access methods when applied on incomplete datasets leads us to identify two main issues: distortion of the internal structure of the index when data are missing at random and skew of the index structure when data are not missing at random. Based on those findings, a new variant of the Slim-tree access method, called Hollow-tree, is presented. It employs new techniques that are capable to handle missing data issues when missingness is ignorable. The first technique includes a set of indexing policies that allow to index objects with missing attribute values and prevent distortions to occur in the internal structure of the indexes. The second technique targets the similarity queries to improve the query performance over incomplete datasets. This technique employs the fractal dimension of the dataset and the local density around the query object to estimate an ideal radius able to achieve an accurate query answer, considering data with missing values as a potential response. Results from experiments with a variety of real and synthetic datasets show that Hollow-tree achieves nearly 100% of precision and recall for Range queries and more than 90% for k Nearest Neighbor queries, while Slim-tree access method deteriorates with the increasing amount of missing values. The results confirm that the indexing technique helps to establish consistency in the index structure and the searching technique achieves a remarkable performance. When combined, the new techniques allow to explore properly all the available data even with high amounts of missing attribute values. As they are independent of the underlying access method, they can be adopted by a broad range of metric access methods, allowing to extend the class of MAMs.
  • DOI: 10.11606/T.55.2016.tde-01122016-150947
  • Editor: Biblioteca Digital de Teses e Dissertações da USP; Universidade de São Paulo; Instituto de Ciências Matemáticas e de Computação
  • Data de criação/publicação: 2016-07-18
  • Formato: Adobe PDF
  • Idioma: Inglês

Buscando em bases de dados remotas. Favor aguardar.