dc.contributor.advisor | Aykanat, Cevdet | |
dc.contributor.author | Varol, Emre | |
dc.date.accessioned | 2016-01-08T18:19:49Z | |
dc.date.available | 2016-01-08T18:19:49Z | |
dc.date.issued | 2012 | |
dc.identifier.uri | http://hdl.handle.net/11693/15526 | |
dc.description | Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2012. | en_US |
dc.description | Thesis (Master's) -- Bilkent University, 2012. | en_US |
dc.description | Includes bibliographical references (leaves 37-39). | en_US |
dc.description.abstract | In this thesis, we investigate containment detection, a generalized version
of the well-known near-duplicate detection problem that asks whether a
document is a subset of another document. In text-based applications, there are
three ways of observing document containment: exact duplicates, near-duplicates,
and containments, where the first two are special cases of containment. To detect
containments, we introduce CoDet, a novel algorithm that focuses
particularly on the containment problem. We also construct a test collection using a
novel pooling technique, which enables us to make reliable judgments about the relative
effectiveness of algorithms using limited human assessments. We compare CoDet's
performance with four well-known near-duplicate detection methods (DSC, full
fingerprinting, I-Match, and SimHash) that are adapted to containment detection.
Our algorithm is especially suitable for streaming news. It is also expandable to
different domains. Experimental results show that CoDet mostly outperforms the
other algorithms and produces remarkable results in detection of containments in
text corpora. | en_US |
dc.description.statementofresponsibility | Varol, Emre | en_US |
dc.format.extent | xii, 40 leaves, graphics | en_US |
dc.language.iso | English | en_US |
dc.rights | info:eu-repo/semantics/openAccess | en_US |
dc.subject | Corpus Tree | en_US |
dc.subject | Near-Duplicate Detection | en_US |
dc.subject | Similarity | en_US |
dc.subject | Test Collection | en_US |
dc.subject | Document Containment | en_US |
dc.subject.lcc | Z699 .V37 2012 | en_US |
dc.subject.lcsh | Information storage and retrieval systems. | en_US |
dc.subject.lcsh | Information retrieval. | en_US |
dc.subject.lcsh | Electronic data processing--Distributed processing. | en_US |
dc.subject.lcsh | Similarity. | en_US |
dc.title | CoDet : a new algorithm for containment and near duplicate detection in text corpora | en_US |
dc.type | Thesis | en_US |
dc.department | Department of Computer Engineering | en_US |
dc.publisher | Bilkent University | en_US |
dc.description.degree | M.S. | en_US |
dc.identifier.itemid | B131817 | |