Bilkent Repository :: Browsing by Subject "Query processing"

Open Access
Detection of compound structures using multiple hierarchical segmentations
(IEEE, 2014) Akçay, Hüseyin Gökhan; Aksoy, Selim
In this paper, we present a method for automatic compound structure detection in high-resolution images. Given a query compound structure, our aim is to detect coherent regions with similar spatial arrangement and characteristics in multiple hierarchical segmentations. A Markov random field is constructed by representing query regions as variables and connecting the vertices that are spatially close by edges. Then, a maximum entropy distribution is assumed over the query region process and selection of similar region processes among a set of region hierarchies is achieved by maximizing the query model. Experiments using WorldView-2 images show the efficiency of probabilistic modeling of compound structures. © 2014 IEEE.
Open Access
Distributed block formation and layout for disk-based management of large-scale graphs
(Springer, 2017) Yaşar, A.; Gedik, B.; Ferhatosmanoğlu, H.
We are witnessing an enormous growth in social networks as well as in the volume of data generated by them. An important portion of this data is in the form of graphs. In recent years, several graph processing and management systems emerged to handle large-scale graphs. The primary goal of these systems is to run graph algorithms and queries in an efficient and scalable manner. Unlike relational data, graphs are semi-structured in nature. Thus, storing and accessing graph data using secondary storage requires new solutions that can provide locality of access for graph processing workloads. In this work, we propose a scalable block formation and layout technique for graphs, which aims at reducing the I/O cost of disk-based graph processing algorithms. To achieve this, we designed a scalable MapReduce-style method called ICBL, which can divide the graph into a series of disk blocks that contain sub-graphs with high locality. Furthermore, ICBL can order the resulting blocks on disk to further reduce non-local accesses. We experimentally evaluated ICBL to showcase its scalability, layout quality, as well as the effectiveness of automatic parameter tuning for ICBL. We deployed the graph layouts generated by ICBL on the Neo4j open source graph database management system (http://www.neo4j.org/, 2015). Our results show that the layout generated by ICBL reduces the query running times over Neo4j by more than 2× compared to the default layout. © 2017, Springer Science+Business Media New York.
Open Access
Document replication strategies for geographically distributed web search engines
(Elsevier Ltd., 2013) Kayaaslan, E.; Cambazoglu, B. B.; Aykanat, Cevdet
Large-scale web search engines are composed of multiple data centers that are geographically distant from each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times as the network latencies between the users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine. © 2012 Elsevier Ltd. All rights reserved.
Open Access
Effective use of space for pivot-based metric indexing structures
(IEEE, 2008-04) Çelik, Cengiz
Among the metric space indexing methods, AESA is known to produce the lowest query costs in terms of the number of distance computations. However, its quadratic construction cost and space consumption make it infeasible for large datasets. There has been some work on reducing the space requirements of AESA. Instead of keeping all the distances between objects, LAESA appoints a subset of the database as pivots, keeping only the distances between objects and pivots. Kvp uses the idea of prioritizing the pivots based on their distances to objects, only keeping pivot distances that it evaluates as promising. FQA discretizes the distances using a fixed number of bits per distance instead of using the system's floating point types. Varying the number of bits to produce a performance-space trade-off was also studied in Kvp. Recently, BAESA has been proposed based on the same idea, but using different distance ranges for each pivot. The t-spanner based indexing structure compacts the distance matrix by introducing an approximation factor that makes the pivots less effective. In this work, we show that the Kvp prioritization is oriented toward symmetric distance distributions. We offer a new method that evaluates the effectiveness of pivots in a better fashion by making use of the overall distance distribution. We also simulate the performance of our method combined with distance discretization. Our results show that our approach is able to offer very good space-performance trade-offs compared to AESA and tree-based methods. © 2008 IEEE.
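The pivot idea that LAESA relies on can be illustrated with a minimal sketch. This is not the method proposed in the paper above; it only shows generic pivot-based filtering with the triangle inequality, and the function names (`build_pivot_table`, `range_query`) and the Euclidean distance are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """A metric used only for this illustration."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_pivot_table(objects: Sequence[Point], pivots: Sequence[Point],
                      dist: Distance) -> List[List[float]]:
    """Store only object-to-pivot distances (not all pairwise distances)."""
    return [[dist(obj, p) for p in pivots] for obj in objects]


def range_query(query: Point, radius: float, objects: Sequence[Point],
                pivots: Sequence[Point], table: List[List[float]],
                dist: Distance) -> Tuple[List[Point], int]:
    """Return objects within `radius` of `query` and the number of
    distance computations performed at query time."""
    query_to_pivots = [dist(query, p) for p in pivots]
    computed = len(pivots)
    results = []
    for obj, row in zip(objects, table):
        # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o) for every pivot p,
        # so the largest such difference is a lower bound on d(q, o).
        lower_bound = max(abs(qp - op) for qp, op in zip(query_to_pivots, row))
        if lower_bound > radius:
            continue  # pruned without computing d(q, o)
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed
```

With a single pivot at the origin, a query near the origin prunes the distant objects using only the stored pivot distances; more pivots tighten the lower bound at the cost of more stored distances, which is the space-performance trade-off the abstract discusses.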
Open Access
Efficiency and effectiveness of query processing in cluster-based retrieval
(Elsevier, 2004) Can, F.; Altingövde, I. S.; Demir, E.
Our research shows that for large databases, without considerable additional storage overhead, cluster-based retrieval (CBR) can compete with the time efficiency and effectiveness of the inverted index-based full search (FS). The proposed CBR method employs a storage structure that blends the cluster membership information into the inverted file posting lists. This approach significantly reduces the cost of similarity calculations for document ranking during query processing and improves efficiency. For example, in terms of in-memory computations, our new approach can reduce query processing time to 39% of FS. The experiments confirm that the approach is scalable and system performance improves with increasing database size. In the experiments, we use the cover coefficient-based clustering methodology (C3M), and the Financial Times database of TREC containing 210158 documents of size 564 MB defined by 229748 terms with a total of 29545234 inverted index elements. This study provides CBR efficiency and effectiveness experiments using the largest corpus in an environment that employs no user interaction or user behavior assumption for clustering. © 2003 Elsevier Ltd. All rights reserved.
Open Access
Efficient processing of category-restricted queries for web directories
(Springer, 2008-03-04) Altıngövde, İsmail Şengör; Can, Fazlı; Ulusoy, Özgür
We show that a cluster-skipping inverted index (CS-IIS) is a practical and efficient file structure to support category-restricted queries for searching Web directories. The query processing strategy with CS-IIS improves CPU time efficiency without imposing any limitations on the directory size. © 2008 Springer-Verlag Berlin Heidelberg.
Open Access
Energy-price-driven query processing in multi-center web search engines
(IEEE, 2011-07) Kayaaslan, Enver; Cambazoglu, B. B.; Blanco, R.; Junqueira, F. P.; Aykanat, Cevdet
Concurrently processing thousands of web queries, each with a response time under a fraction of a second, necessitates maintaining and operating massive data centers. For large-scale web search engines, this translates into high energy consumption and a huge electric bill. This work takes the challenge to reduce the electric bill of commercial web search engines operating on data centers that are geographically far apart. Based on the observation that energy prices and query workloads show high spatio-temporal variation, we propose a technique that dynamically shifts the query workload of a search engine between its data centers to reduce the electric bill. Experiments on real-life query workloads obtained from a commercial search engine show that significant financial savings can be achieved by this technique.
Open Access
First large-scale information retrieval experiments on Turkish texts
(ACM, 2006-08) Can, Fazlı; Koçberber, Seyit; Balcık, Erman; Kaynak, Cihan; Öcalan, H. Çağdaş; Vursavaş, Onur M.
We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, which has been created for this study, contains 95.5 million words, 408,305 documents, 72 ad hoc queries and has a size of about 800 MB. All documents come from the Turkish newspaper Milliyet. We implement and apply simple to sophisticated stemmers and various query-document matching functions and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
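The prefix-truncation stemmer mentioned in the abstract above is simple enough to sketch. The function name `truncate_stem` and the example words are assumptions for illustration; the sketch assumes tokens are already lowercased and does not reproduce the tokenization or matching functions used in the study.

```python
def truncate_stem(word: str, prefix_len: int = 5) -> str:
    """Keep only the first `prefix_len` characters of a token.

    Words shorter than the prefix length are returned unchanged.
    """
    return word[:prefix_len]


def stem_tokens(tokens, prefix_len: int = 5):
    """Apply truncation stemming to every token of a query or document."""
    return [truncate_stem(token, prefix_len) for token in tokens]
```

Because Turkish is agglutinative, inflected forms such as "kitaplar" and "kitaplık" share the stem "kitap" after truncation, so a query term and a document term can match without any morphological analysis.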
Open Access
A five-level static cache architecture for web search engines
(Elsevier Ltd, 2012) Ozcan, R.; Altingovde, I. S.; Cambazoglu, B. B.; Junqueira, F. P.; Ulusoy, Özgür
Caching is a crucial performance component of large-scale web search engines, as it greatly helps reduce average query response times and query processing workloads on backend search clusters. In this paper, we describe a multi-level static cache architecture that stores five different item types: query results, precomputed scores, posting lists, precomputed intersections of posting lists, and documents. Moreover, we propose a greedy heuristic to prioritize items for caching, based on gains computed by using items' past access frequencies, estimated computational costs, and storage overheads. This heuristic takes into account the inter-dependency between individual items when making its caching decisions, i.e., after a particular item is cached, gains of all items that are affected by this decision are updated. Our simulations under realistic assumptions reveal that the proposed heuristic performs better than dividing the entire cache space among particular item types at fixed proportions. © 2010 Elsevier Ltd. All rights reserved.
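A greedy, gain-based cache fill with dependency updates can be sketched as follows. This is an illustration under stated assumptions, not the heuristic from the paper: the gain formula (frequency × cost / size), the `Item` class, and the `affects` mapping that models inter-dependency as a cost reduction are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    name: str
    freq: float   # past access frequency
    cost: float   # estimated cost of producing the item on a cache miss
    size: float   # storage overhead in the cache
    # Hypothetical dependency model: caching this item lowers the miss cost
    # of the named items by the given amount.
    affects: Dict[str, float] = field(default_factory=dict)


def gain(item: Item) -> float:
    """Assumed gain: expected cost saved per unit of cache space."""
    return item.freq * item.cost / item.size


def greedy_fill(items: List[Item], capacity: float) -> List[str]:
    """Repeatedly cache the fitting item with the highest current gain.

    Note: the `cost` of dependent items is updated in place.
    """
    remaining = {item.name: item for item in items}
    cached: List[str] = []
    free = capacity
    while remaining:
        fitting = [it for it in remaining.values() if it.size <= free]
        if not fitting:
            break
        best = max(fitting, key=gain)
        if gain(best) <= 0:
            break
        cached.append(best.name)
        free -= best.size
        del remaining[best.name]
        # Update the gains of items affected by this caching decision.
        for dep_name, saved in best.affects.items():
            if dep_name in remaining:
                dep = remaining[dep_name]
                dep.cost = max(0.0, dep.cost - saved)
    return cached
```

In this model, caching a posting list makes the query results that depend on it cheaper to recompute, so their gain drops and another item type may be chosen next. That re-evaluation is what distinguishes the approach from splitting the cache among item types at fixed proportions.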
Open Access
HandVR: a hand-gesture-based interface to a video retrieval system
(Springer U K, 2015) Genç, S.; Baştan, M.; Güdükbay, Uğur; Atalay, V.; Ulusoy, Özgür
Using one’s hands in human–computer interaction increases both the effectiveness of computer usage and the speed of interaction. One way of accomplishing this goal is to utilize computer vision techniques to develop hand-gesture-based interfaces. A video database system is one application where a hand-gesture-based interface is useful, because it provides a way to specify certain queries more easily. We present a hand-gesture-based interface for a video database system to specify motion and spatiotemporal object queries. We use a regular, low-cost camera to monitor the movements and configurations of the user’s hands and translate them to video queries. We conducted a user study to compare our gesture-based interface with a mouse-based interface on various types of video queries. The users evaluated the two interfaces in terms of different usability parameters, including the ease of learning, ease of use, ease of remembering (memory), naturalness, comfortable use, satisfaction, and enjoyment. The user study showed that querying video databases is a promising application area for hand-gesture-based interfaces, especially for queries involving motion and spatiotemporal relations.
Open Access
Implications of non-volatile memory as primary storage for database management systems
(IEEE, 2017) Mustafa, Naveed Ul; Armejach, A.; Öztürk, Özcan; Cristal, A.; Unsal, O. S.
Traditional Database Management System (DBMS) software relies on hard disks for storing relational data. Hard disks are cheap, persistent, and offer huge storage capacities. However, data retrieval latency for hard disks is extremely high. To hide this latency, DRAM is used as an intermediate storage. DRAM is significantly faster than disk, but deployed in smaller capacities due to cost and power constraints, and without the necessary persistency feature that disks have. Non-Volatile Memory (NVM) is an emerging storage class technology which promises the best of both worlds. It can offer large storage capacities, due to better scaling and cost metrics than DRAM, and is non-volatile (persistent) like hard disks. At the same time, its data retrieval time is much lower than that of hard disks and it is also byte-addressable like DRAM. In this paper, we explore the implications of employing NVM as primary storage for DBMS. In other words, we investigate the modifications necessary to be applied on a traditional relational DBMS to take advantage of NVM features. As a case study, we have modified the storage engine (SE) of PostgreSQL enabling efficient use of NVM hardware. We detail the necessary changes and challenges such modifications entail and evaluate them using a comprehensive emulation platform. Results indicate that our modified SE reduces query execution time by up to 40% and 14.4% when compared to disk and NVM storage, with average reductions of 20.5% and 4.5%, respectively. © 2016 IEEE.
Open Access
Incremental cluster-based retrieval using compressed cluster-skipping inverted files
(Association for Computing Machinery, 2008-06) Altingovde, I. S.; Demir, E.; Can, F.; Ulusoy, Özgür
We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size. © 2008 ACM.
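The skipping behavior described above can be sketched with a simplified in-memory structure. This sketch assumes each posting list is stored as blocks grouped by cluster and that the set of best clusters is already known; in the incremental-CBR strategy the best clusters are recomputed during evaluation and the file is compressed, neither of which is modeled here. The names `score_with_cluster_skipping` and `index` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# term -> list of (cluster_id, [(doc_id, term_weight), ...]) blocks
ClusterSkippingIndex = Dict[str, List[Tuple[int, List[Tuple[int, float]]]]]


def score_with_cluster_skipping(index: ClusterSkippingIndex,
                                query_terms: Iterable[str],
                                best_clusters: Set[int]):
    """Accumulate document scores, skipping blocks of non-best clusters.

    Returns the ranked (doc_id, score) list and the number of postings
    that were skipped without being scored.
    """
    scores = defaultdict(float)
    skipped = 0
    for term in query_terms:
        for cluster_id, postings in index.get(term, []):
            if cluster_id not in best_clusters:
                skipped += len(postings)  # whole block is jumped over
                continue
            for doc_id, weight in postings:
                scores[doc_id] += weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, skipped
```

In a compressed file, a skipped block also avoids decompression work for its postings, which is one reason such a layout suits compressed environments.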
Open Access
Information retrieval on Turkish texts
(John Wiley & Sons, Inc., 2008-02) Can, F.; Kocberber, S.; Balcik, E.; Kaynak, C.; Ocalan, H. C.; Vursavas, O. M.
In this study, we investigate information retrieval (IR) on Turkish texts using a large-scale test collection that contains 408,305 documents and 72 ad hoc queries. We examine the effects of several stemming options and query-document matching functions on retrieval performance. We show that a simple word truncation approach, a word truncation approach that uses language-dependent corpus statistics, and an elaborate lemmatizer-based stemmer provide similar retrieval effectiveness in Turkish IR. We investigate the effects of a range of search conditions on the retrieval performance; these include scalability issues, query and document length effects, and the use of stop-word list in indexing. © 2007 Wiley Periodicals, Inc.
Open Access
Lexical cohesion and term proximity in document ranking
(Elsevier Ltd, 2008-07) Vechtomova, O.; Karamuftuoglu, M.
We demonstrate effective new methods of document ranking based on lexical cohesive relationships between query terms. The proposed methods rely solely on the lexical relationships between original query terms, and do not involve query expansion or relevance feedback. Two types of lexical cohesive relationship information between query terms are used in document ranking: short-distance collocation relationship between query terms, and long-distance relationship, determined by the collocation of query terms with other words. The methods are evaluated on TREC corpora, and show improvements over baseline systems. © 2008 Elsevier Ltd. All rights reserved.
Open Access
Mobile image search using multi-query images
(IEEE, 2015) Çalışır, Fatih; Bastan, M.; Güdükbay, Uğur; Ulusoy, Özgür
Recent advances in mobile device technology have turned mobile phones into powerful devices with high resolution cameras and fast processing capabilities. Having more user interaction potential compared to regular PCs, mobile devices with cameras can enable richer content-based object image queries: the user can capture multiple images of the query object from different viewing angles and at different scales, thereby providing much more information about the object to improve the retrieval accuracy. The goal of this paper is to improve the mobile image retrieval performance using multiple query images. To this end, we use the well-known bag-of-visual-words approach to represent the images, and employ early and late fusion strategies to utilize the information in multiple query images. With extensive experiments on an object image dataset with a single object per image, we show that multi-image queries result in higher average precision performance than single image queries. © 2015 IEEE.
Open Access
Models and algorithms for parallel text retrieval
(Bilkent University, 2006) Cambazoğlu, Berkant Barla
In the last decade, search engines became an integral part of our lives. The current state-of-the-art in search engine technology relies on parallel text retrieval. Basically, a parallel text retrieval system is composed of three components: a crawler, an indexer, and a query processor. The crawler component aims to locate, fetch, and store the Web pages in a local document repository. The indexer component converts the stored, unstructured text into a queryable form, most often an inverted index. Finally, the query processing component performs the search over the indexed content. In this thesis, we present models and algorithms for efficient Web crawling and query processing. First, for parallel Web crawling, we propose a hybrid model that aims to minimize the communication overhead among the processors while balancing the number of page download requests and storage loads of processors. Second, we propose models for document- and term-based inverted index partitioning. In the document-based partitioning model, the number of disk accesses incurred during query processing is minimized while the posting storage is balanced. In the term-based partitioning model, the total amount of communication is minimized while, again, the posting storage is balanced. Finally, we develop and evaluate a large number of algorithms for query processing in ranking-based text retrieval systems. We test the proposed algorithms over our experimental parallel text retrieval system, Skynet, currently running on a 48-node PC cluster. In the thesis, we also discuss the design and implementation details of another, somewhat untraditional, grid-enabled search engine, SE4SEE. Among our practical work, we present the Harbinger text classification system, used in SE4SEE for Web page classification, and the K-PaToH hypergraph partitioning toolkit, to be used in the proposed models.
Open Access
An MPEG-7 compatible video retrieval system with integrated support for complex multimodal queries
(IEEE Computer Society, 2019) Baştan, Muhammet; Çam, Hayati; Güdükbay, Uğur; Ulusoy, Özgür
We present BilVideo-7, an MPEG-7 compatible, video indexing and retrieval system that supports complex multimodal queries in a unified framework. An MPEG-7 profile is developed to represent the videos by decomposing them into Shots, Keyframes, Still Regions and Moving Regions. The MPEG-7 compatible XML representations of videos according to this profile are obtained by the MPEG-7 compatible video feature extraction and annotation tool of BilVideo-7, and stored in a native XML database. Users can formulate text-based semantic, color, texture, shape, location, motion and spatio-temporal queries on an intuitive, easy-to-use Visual Query Interface, whose Composite Query Interface can be used to specify very complex queries containing any type and number of video segments with their descriptors. The multi-threaded Query Processing Server parses incoming queries into subqueries and executes each subquery in a separate thread. Then, it fuses subquery results in a bottom-up manner to obtain the final query result. The whole system is unique in that it provides very powerful querying capabilities with a wide range of descriptors and multimodal query processing in an MPEG-7 compatible interoperable environment. We present sample queries to demonstrate the capabilities of the system.
Open Access
Natural language querying for video databases
(Elsevier Inc., 2008-06-15) Erozel, G.; Cicekli, N. K.; Cicekli, I.
Video databases have become popular in various areas due to the recent advances in technology. Video archive systems need user-friendly interfaces to retrieve video frames. In this paper, a user interface based on natural language processing (NLP) to a video database system is described. The video database is based on a content-based spatio-temporal video data model. The data model is focused on the semantic content which includes objects, activities, and spatial properties of objects. Spatio-temporal relationships between video objects and also trajectories of moving objects can be queried with this data model. In this video database system, a natural language interface enables flexible querying. The queries, which are given as English sentences, are parsed using a link parser. The semantic representations of the queries are extracted from their syntactic structures using information extraction techniques. The extracted semantic representations are used to call the related parts of the underlying video database system to return the results of the queries. Not only exact matches but also similar objects and activities are returned from the database with the help of the conceptual ontology module. This module is implemented using a distance-based method of semantic similarity search on the semantic domain-independent ontology, WordNet. © 2008 Elsevier Inc. All rights reserved.
Open Access
New formulations for the hop-constrained minimum spanning tree problem via Sherali and Driscoll's tightened Miller-Tucker-Zemlin constraints
(Elsevier, 2010) Akgün, İbrahim
Given an undirected network with positive edge costs and a natural number p, the hop-constrained minimum spanning tree problem (HMST) is the problem of finding a spanning tree with minimum total cost such that each path starting from a specified root node has no more than p hops (edges). In this paper, new models based on the Miller-Tucker-Zemlin (MTZ) subtour elimination constraints are developed and computational results together with comparisons against MTZ-based, flow-based, and hop-indexed formulations are reported. The first model is obtained by adapting the MTZ-based Asymmetric Traveling Salesman Problem formulation of Sherali and Driscoll [18] and the other two models are obtained by combining topology-enforcing and MTZ-related constraints offered by Akgün and Tansel (submitted for publication) [20] for HMST with the first model appropriately. Computational studies show that the best LP bounds of the MTZ-based models in the literature are improved by the proposed models. The best solution times of the MTZ-based models are not improved for optimally solved instances. However, the results for the harder, large-size instances imply that the proposed models are likely to produce better solution times. The proposed models do not dominate the flow-based and hop-indexed formulations with respect to LP bounds. However, good feasible solutions can be obtained in a reasonable amount of time for problems for which even the LP relaxations of the flow-based and hop-indexed formulations can be solved in about 2 days. © 2010 Elsevier Ltd. All rights reserved.
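For orientation, a basic MTZ-style formulation of HMST can be written as below. This is the plain, untightened form and not one of the models proposed in the paper; the notation is an assumption made here: r is the root, A is the arc set obtained by replacing each edge with two opposite arcs (arcs entering r are dropped), c_{ij} is the arc cost, x_{ij} indicates that arc (i,j) is in the tree, and u_i is the number of hops from r to node i.

```latex
\begin{align}
  \min \quad & \sum_{(i,j) \in A} c_{ij}\, x_{ij} \\
  \text{s.t.} \quad
  & \sum_{i : (i,j) \in A} x_{ij} = 1
      && \forall j \in V \setminus \{r\}, \\
  & u_i - u_j + p\, x_{ij} \le p - 1
      && \forall (i,j) \in A,\; i \ne r,\; j \ne r, \\
  & 1 \le u_i \le p
      && \forall i \in V \setminus \{r\}, \\
  & x_{ij} \in \{0, 1\}
      && \forall (i,j) \in A.
\end{align}
```

When x_{ij} = 1, the second constraint reduces to u_j \ge u_i + 1, so hop labels strictly increase along every path from the root and cannot exceed p; when x_{ij} = 0, it is implied by the bounds on u. These constraints therefore eliminate subtours and enforce the hop limit at the same time.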
Open Access
Ottoman archives explorer: a retrieval system for digital Ottoman archives
(Association for Computing Machinery, 2009-12) Yalniz, I. Z.; Altingovde, I. S.; Güdükbay, Uğur; Ulusoy, Özgür
This article presents Ottoman Archives Explorer, a Content-Based Retrieval (CBR) system based on character recognition for printed and handwritten historical documents. Several methods for character segmentation and recognition stages are investigated. In particular, sliding-window and histogram segmentation methods are coupled with recognition approaches using spatial features, neural networks, and a graph-based model. The prototype system provides CBR of document images using both example-based queries and a virtual keyboard to construct query words. © 2009 ACM.