Efficiency and effectiveness of XML keyword search using a full element index

buir.advisorUlusoy, Özgür
dc.contributor.authorAtılgan, Duygu
dc.date.accessioned2016-01-08T18:18:08Z
dc.date.available2016-01-08T18:18:08Z
dc.date.issued2010
dc.descriptionAnkara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2010.en_US
dc.descriptionThesis (Master's) -- Bilkent University, 2010.en_US
dc.descriptionIncludes bibliographical references leaves 63-67.en_US
dc.description.abstractIn the last decade, both the academia and industry proposed several techniques to allow keyword search on XML databases and document collections. A common data structure employed in most of these approaches is an inverted index, which is the state-of-the-art for conducting keyword search over large volumes of textual data, such as world wide web. In particular, a full element-index considers (and indexes) each XML element as a separate document, which is formed of the text directly contained in it and the textual content of all of its descendants. A major criticism for a full element-index is the high degree of redundancy in the index (due to the nested structure of XML documents), which diminishes its usage for large-scale XML retrieval scenarios. As the first contribution of this thesis, we investigate the efficiency and effectiveness of using a full element-index for XML keyword search. First, we suggest that lossless index compression methods can significantly reduce the size of a full element-index so that query processing strategies, such as those employed in a typical search engine, can efficiently operate on it. We show that once the most essential problem of a full element-index, i.e., its size, is remedied, using such an index can improve both the result quality (effectiveness) and query execution performance (efficiency) in comparison to other recently proposed techniques in the literature. Moreover, using a full element-index also allows generating query results in different forms, such as a ranked list of documents (as expected by a search engine user) or a complete list of elements that include all of the query terms (as expected by a DBMS user), in a unified framework. As a second contribution of this thesis, we propose to use a lossy approach, static index pruning, to further reduce the size of a full element-index.
In this way, we aim to eliminate the repetition of an element's terms at upper levels in an adaptive manner considering the element's textual content and search system's ranking function. That is, we attempt to remove the repetitions in the index only when we expect that removal of them would not reduce the result quality. We conduct a well-crafted set of experiments and show that pruned index files are comparable or even superior to the full element-index up to very high pruning levels for various ad hoc tasks in terms of retrieval effectiveness. As a final contribution of this thesis, we propose to apply index pruning strategies to reduce the size of the document vectors in an XML collection to improve the clustering performance of the collection. Our experiments show that for certain cases, it is possible to prune up to 70% of the collection (or, more specifically, underlying document vectors) and still generate a clustering structure that yields the same quality with that of the original collection, in terms of a set of evaluation metrics.en_US
dc.description.provenanceMade available in DSpace on 2016-01-08T18:18:08Z (GMT). No. of bitstreams: 1 0006151.pdf: 4875012 bytes, checksum: 6ef73044aab41e78cd8015dadfa910c6 (MD5)en
dc.description.statementofresponsibilityAtılgan, Duyguen_US
dc.format.extentxv, 67 leavesen_US
dc.identifier.itemidB122787
dc.identifier.urihttp://hdl.handle.net/11693/15411
dc.language.isoEnglishen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectInformation Retrievalen_US
dc.subjectXML Keyword Searchen_US
dc.subjectFull Element-Indexen_US
dc.subjectSLCAen_US
dc.subjectStatic Pruning, Clusteringen_US
dc.subject.lccZA3075 .A85 2010en_US
dc.subject.lcshInformation retrieval.en_US
dc.subject.lcshInformation storage and retrieval systems.en_US
dc.subject.lcshXML (Document markup language).en_US
dc.subject.lcshDatabase searching.en_US
dc.titleEfficiency and effectiveness of XML keyword search using a full element indexen_US
dc.typeThesisen_US
thesis.degree.disciplineComputer Engineering
thesis.degree.grantorBilkent University
thesis.degree.levelMaster's
thesis.degree.nameMS (Master of Science)

Files

Original bundle
Name:
0006151.pdf
Size:
4.65 MB
Format:
Adobe Portable Document Format