Browsing by Subject "Indexing."

Open Access
Characteristics of Web-based textual communications
(2012) Küçükyılmaz, Tayfun
In this thesis, we analyze different aspects of Web-based textual communications and argue that all such communications share some common properties. To provide practical evidence for this argument, we focus on two such properties and examine them on various types of Web-based textual communications data. These properties are: all Web-based communications contain features attributable to their author and receiver; and all Web-based communications exhibit similar heavy-tailed distributional properties. To support these claims, we present three practical, real-life research problems and exploit the proposed common properties of Web-based textual communications to find practical solutions to them. First, we provide a feature-based result caching framework for real-life search engines. To this end, we mine attributes from user queries in order to classify queries and estimate a quality metric for making admission and eviction decisions for the query result cache. Second, we analyze messages of an online chat server in order to predict user and message attributes. Our results show that several user- and message-based attributes can be predicted with significant accuracy using both chat-message-based and writing-style-based features of the chat users. Third, we provide a parallel framework for in-memory construction of term-partitioned inverted indexes. In this work, in order to minimize the total communication time between processors, we provide a bucketing scheme that is based on term-based distributional properties of Web page contents.
Open Access
A new dynamic and adaptive scheme for indexing in metric spaces
(2007) Tosun, Umut
Computer science applications are often concerned with efficient storage and retrieval of data. The well-defined structure of traditional databases helps to access required query objects effectively using the relational database paradigm. However, in recent times, we are faced with the challenges of dealing with unstructured and complex data such as images, video, sound clips and text documents. Multimedia Information Retrieval, Data Mining, Pattern Recognition, Machine Learning, Computer Vision and Biomedical Databases are examples of the fields that require efficient management of complex data. Complex, unstructured data often cannot be broken down into well-defined components, and exact matching cannot be applied for defining queries. Instead, the notion of similarity search is used, where a query or prototype object is provided by the user and the database retrieves the objects that are similar to it. One popular approach for similarity searching is to approximate the relationship between database objects by mapping them into a vector space. There are well-known indexing methods in the literature that support similarity queries in vector spaces; however, it has been shown that these methods are ineffective for high-dimensional data. Another approach is to use the metric space model for indexing. Metric spaces are defined by a distance function that satisfies the triangle inequality. Since there are no assumptions about the structure of the data itself, they constitute a higher-level abstraction and thus have wider applicability. They have also been shown to perform better in higher dimensions. Much of the previous work in metric spaces has concentrated on static methods that do not allow new insertions once the index structure has been initialized. M-Tree, Slim-Tree, DF-Tree and Omni are some of the popular dynamic structures. These methods can grow incrementally by splitting overflowed nodes and adding new levels to the tree, very much like the B-tree variants.
Unfortunately, they have been shown to perform very poorly compared to flat structures such as AESA, LAESA, Spaghettis and Kvp that use a fixed set of global pivots. The distances between the query object and the pivots are computed to eliminate some portion of the database from consideration. The number of pivots can be easily increased to provide more selectivity, and thus better query performance. However, there is an optimum number of pivots for a given query radius, and using too many pivots increases the costs of queries and of the initialization of the index. Recently, Sparse Spatial Selection (SSS) was introduced as a LAESA variant that allows insertions of new database objects and dynamically promotes some of the new objects as pivots. In this thesis, we argue that SSS has fundamental problems that result in poor query performance for clustered or otherwise skewed distributions. Real datasets have often been observed to show such characteristics. We show that SSS has been optimized to work for a symmetrical, balanced distribution and for a specific radius value. Our first main contribution is a new pivot promotion scheme that performs robustly for clustered or skewed distributions. Our second contribution is a set of new methods that solve the problem of determining the right number of pivots for different query radius values. We show that our new indexing scheme performs significantly better than tree-based dynamic structures while having lower insertion costs. We also show that our structure adapts to changes in the database population in a superior way.
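The pivot-based elimination mentioned in this abstract can be illustrated with a minimal sketch. It is a generic pivot-table range query in the spirit of LAESA-style flat structures, not the indexing scheme proposed in the thesis and not SSS. The names `PivotTable` and `euclidean`, the choice of pivots, and the sample data are all hypothetical. The idea is that, for a query q, radius r, object o and pivot p, the triangle inequality gives d(q, o) >= |d(q, p) - d(o, p)|, so any object whose lower bound exceeds r can be discarded without computing d(q, o).

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any metric obeying the triangle inequality works."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class PivotTable:
    """Hypothetical flat index: stores the distance from every object to every pivot."""

    def __init__(
        self,
        objects: Sequence[Point],
        pivots: Sequence[Point],
        dist: Callable[[Point, Point], float] = euclidean,
    ) -> None:
        self.objects: List[Point] = list(objects)
        self.pivots: List[Point] = list(pivots)
        self.dist = dist
        # Precomputed once at index construction time.
        self.table = [[dist(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, q: Point, r: float) -> Tuple[List[Point], int]:
        """Return the objects within distance r of q, and how many objects
        needed an actual distance computation after pivot filtering."""
        q_to_pivots = [self.dist(q, p) for p in self.pivots]
        matches: List[Point] = []
        verified = 0
        for o, row in zip(self.objects, self.table):
            # Best lower bound on d(q, o) over all pivots.
            lower = max(
                (abs(dq - do) for dq, do in zip(q_to_pivots, row)),
                default=0.0,
            )
            if lower > r:
                continue  # safely eliminated by the triangle inequality
            verified += 1
            if self.dist(q, o) <= r:
                matches.append(o)
        return matches, verified


# Illustrative data only (an assumption, not taken from the thesis).
objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
index = PivotTable(objects, pivots)
matches, verified = index.range_query((0.5, 0.0), 1.0)
```

In this toy run, the two far-away objects are pruned using only the stored pivot distances, so only two of the four objects are compared with the query directly. Adding pivots tightens the lower bound but also adds one distance computation per pivot to every query, which is the trade-off behind the optimum number of pivots noted above.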
Open Access
Storage management and indexing in object-oriented database management systems
(1990) Al-Hajj, Reda
Storage management and indexing methods used in existing conventional database management systems are not appropriate for object-oriented database management systems due to the distinctive features of the latter systems. A model for storage management suitable for object-oriented database management systems is proposed in this thesis. It supports object identity, multiple inheritance, composite objects, a fine degree of granularity and schema evolution. An index provides fast access to data stored in files at the price of additional storage space and an overhead in update operations. Work has been carried out on indexing, and an indexing method for object-oriented database systems is proposed. Identity and equality indexes are treated. Object identity and information hiding are provided. Schema changes are handled without affecting existing indexes. The method is general enough to be applicable to most existing object-oriented database systems. The mapping of the proposed storage and indexing approaches into a relational database scheme is also presented.