CoDet : a new algorithm for containment and near duplicate detection in text corpora

Date
2012
Advisor
Aykanat, Cevdet
Supervisor
Co-Advisor
Co-Supervisor
Instructor
Source Title
Print ISSN
Electronic ISSN
Publisher
Bilkent University
Volume
Issue
Pages
Language
English
Type
Thesis
Journal Title
Journal ISSN
Volume Title
Series
Abstract

In this thesis, we investigate containment detection, which is a generalized version of the well known near-duplicate detection problem concerning whether a document is a subset of another document. In text-based applications, there are three way of observing document containment: exact-duplicates, near-duplicates, or containments, where first two are the special cases of containment. To detect containments, we introduce CoDet, which is a novel algorithm that focuses particularly on containment problem. We also construct a test collection using a novel pooling technique, which enables us to make reliable judgments for the relative effectiveness of algorithms using limited human assessments. We compare its performance with four well-known near duplicate detection methods (DSC, full fingerprinting, I-Match, and SimHash) that are adapted to containment detection. Our algorithm is especially suitable for streaming news. It is also expandable to different domains. Experimental results show that CoDet mostly outperforms the other algorithms and produces remarkable results in detection of containments in text corpora.

Course
Other identifiers
Book Title
Keywords
Corpus Tree, Near-Duplicate Detection, Similarity, Test Collection, Document Containment
Citation
Published Version (Please cite this version)