CoDet : a new algorithm for containment and near duplicate detection in text corpora

buir.advisor: Aykanat, Cevdet
dc.contributor.author: Varol, Emre
dc.date.accessioned: 2016-01-08T18:19:49Z
dc.date.available: 2016-01-08T18:19:49Z
dc.date.issued: 2012
dc.description: Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2012. [en_US]
dc.description: Thesis (Master's) -- Bilkent University, 2012. [en_US]
dc.description: Includes bibliographical references leaves 37-39. [en_US]
dc.description.abstract: In this thesis, we investigate containment detection, a generalization of the well-known near-duplicate detection problem that asks whether one document is a subset of another. In text-based applications, document containment can be observed in three ways: exact duplicates, near-duplicates, and containments, where the first two are special cases of containment. To detect containments, we introduce CoDet, a novel algorithm that focuses specifically on the containment problem. We also construct a test collection using a novel pooling technique, which enables us to make reliable judgments about the relative effectiveness of algorithms using limited human assessments. We compare CoDet's performance with that of four well-known near-duplicate detection methods (DSC, full fingerprinting, I-Match, and SimHash) adapted to containment detection. Our algorithm is especially suitable for streaming news, and it can also be extended to other domains. Experimental results show that CoDet mostly outperforms the other algorithms and produces remarkable results in the detection of containments in text corpora. [en_US]
dc.description.provenance: Made available in DSpace on 2016-01-08T18:19:49Z (GMT). No. of bitstreams: 1 0006258.pdf: 1134148 bytes, checksum: 5f06c5a8e52ad06c2ef859f5cceac8e4 (MD5) [en]
dc.description.statementofresponsibility: Varol, Emre [en_US]
dc.format.extent: xii, 40 leaves, graphics [en_US]
dc.identifier.itemid: B131817
dc.identifier.uri: http://hdl.handle.net/11693/15526
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Corpus Tree [en_US]
dc.subject: Near-Duplicate Detection [en_US]
dc.subject: Similarity [en_US]
dc.subject: Test Collection [en_US]
dc.subject: Document Containment [en_US]
dc.subject.lcc: Z699 .V37 2012 [en_US]
dc.subject.lcsh: Information storage and retrieval systems. [en_US]
dc.subject.lcsh: Information retrieval. [en_US]
dc.subject.lcsh: Electronic data processing--Distributed processing. [en_US]
dc.subject.lcsh: Similarity. [en_US]
dc.title: CoDet : a new algorithm for containment and near duplicate detection in text corpora [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Master's
thesis.degree.name: MS (Master of Science)

Files

Original bundle
Name: 0006258.pdf
Size: 1.08 MB
Format: Adobe Portable Document Format