|dc.description.abstract||Computer Science applications are often concerned with efficient storage and
retrieval of data. Well defined structure of traditional databases help to access
required query objects effectively using the Relational Database paradigm. However,
in recent times, we are faced with the challenges of dealing with unstructured
and complex data such as images, video, sound clips and text documents.
Multimedia Information Retrieval, Data Mining, Pattern Recognition, Machine
Learning, Computer Vision and Biomedical Databases are examples of the fields
that require efficient management of complex data. Complex, unstructured type
of data often cannot be broken down into well-defined components, and exact
matching cannot be applied for defining queries. Instead, the notion of similarity
search is used where a query or prototype object is provided by the user and the
database retrieves the objects that are similar.
One popular approach for similarity searching is to approximate the relationship
between database objects by mapping them into a vector space. There are
well-known indexing methods in literature that support similarity queries in vector
spaces, however, it has been shown that these methods are ineffective for high
dimensional data. Another approach is to use Metric Spaces model for indexing.
Metric spaces are defined by a distance function that has the triangular inequality
property. Since there are no assumptions about the structure of the data itself,
they constitute a higher level abstraction and thus have more applicability. They
have also been shown to perform better in higher dimensions.
A lot of the previous work in metric spaces have concentrated on static methods
that do not allow new insertions once the index structure has been initialized.
M-Tree, Slim-Tree, DF-Tree, Omni are some of the popular dynamic structures.
These methods can grow incrementally by splitting overflowed nodes and adding
new levels to the tree very much like the B-tree variants. Unfortunately, they have
been shown to perform very poorly compared to flat structures such as AESA,
LAESA, Spaghettis and Kvp that use a fixed set of global pivots. The distances
between the query object and the pivots are computed to eliminate some portion
of the database from consideration. The number of pivots can be easily increased
to provide more selectivity, thus better query performance. However, there is an
optimum number of pivots for a given query radius, and using too many pivots
increases the costs of queries and the initialization of the index. Recently, Sparse
Spatial Selection(SSS) was introduced as a LAESA variant that allows insertions
of new database objects and dynamically promotes some of the new objects as
In this thesis, we argue that SSS has fundamental problems that results in
poor query performance for clustered or otherwise skewed distributions. Real
datasets have often been observed to show such characteristics. We show that
SSS has been optimized to work for a symmetrical, balanced distribution and
for a specific radius value. Our first main contribution is offering a new pivot
promotion scheme that can perform robustly for clustered or skewed distributions.
Our second contribution is proposing new methods that solve the problem of
determining the right number of pivots for different query radius values. We
show that our new indexing scheme performs significantly better than tree-based
dynamic structures while having lower insertion costs. We also show that our
structure adapts to changes in the database population in a superior way.||en_US