Near-duplicate news detection using named entities

buir.advisorCan, Fazlı
dc.contributor.authorUyar, Erkan
dc.date.accessioned2016-01-08T18:18:28Z
dc.date.available2016-01-08T18:18:28Z
dc.date.issued2009
dc.descriptionAnkara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009.en_US
dc.descriptionThesis (Master's) -- Bilkent University, 2009.en_US
dc.descriptionIncludes bibliographical references (leaves 60-65).en_US
dc.description.abstractThe number of web documents has been increasing exponentially for more than a decade. Similarly, partially or completely duplicate documents appear frequently on the Web. Advances in Internet technologies have increased the number of news agencies. People tend to read news from news portals that aggregate documents from different sources. The existence of duplicate or near-duplicate news in these portals is a common problem. Duplicate documents create redundancy, and only a few users may want to read news containing identical information. Duplicate documents also decrease the efficiency and effectiveness of search engines. In this thesis, we propose and evaluate a new near-duplicate news detection algorithm: Tweezer. In this algorithm, named entities and the words that appear before and after them are used to create document signatures. Documents sharing the same signatures are considered near-duplicates. For named entity detection, we introduce a method called Turkish Named Entity Recognizer (TuNER). For the evaluation of Tweezer, a document collection is created using news articles obtained from Bilkent News Portal. In the experiments, Tweezer is compared with I-Match, a state-of-the-art near-duplicate detection algorithm that creates document signatures using the Inverse Document Frequency (IDF) values of terms. It is experimentally shown that the effectiveness of Tweezer is statistically significantly better than that of I-Match, using a cost function that combines false alarm and miss rate probabilities, and the F-measure, which combines precision and recall. Furthermore, Tweezer is at least 7% faster than I-Match.en_US
dc.description.provenanceMade available in DSpace on 2016-01-08T18:18:28Z (GMT). No. of bitstreams: 1 0006175.pdf: 1217520 bytes, checksum: 7bf2af2ace3024c650cbce05a3acd030 (MD5)en
dc.description.statementofresponsibilityUyar, Erkanen_US
dc.format.extentxiii, 74 leaves, graphicsen_US
dc.identifier.itemidBILKUTUPB116265
dc.identifier.urihttp://hdl.handle.net/11693/15436
dc.language.isoEnglishen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectBilkent News Portalen_US
dc.subjectI-Matchen_US
dc.subjectInverse document frequency (IDF)en_US
dc.subjectNamed entity recognition (NER)en_US
dc.subjectNear-duplicate detectionen_US
dc.subjectT-testen_US
dc.subjectTurkish Named Entity Recognizer (TuNER)en_US
dc.subjectTweezeren_US
dc.subject.lccZ699 .U93 2009en_US
dc.subject.lcshInformation storage and retrieval systems.en_US
dc.subject.lcshInformation retrieval.en_US
dc.subject.lcshElectronic data processing--Distributed processing.en_US
dc.subject.lcshSimilarity.en_US
dc.titleNear-duplicate news detection using named entitiesen_US
dc.typeThesisen_US
thesis.degree.disciplineComputer Engineering
thesis.degree.grantorBilkent University
thesis.degree.levelMaster's
thesis.degree.nameMS (Master of Science)
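
The signature idea summarized in the abstract (named entities plus the words immediately before and after them, with documents grouped by identical signatures) can be illustrated with a minimal Python sketch. This is not the thesis's Tweezer implementation: the entity lookup is a hypothetical gazetteer standing in for TuNER, and the tokenizer, the SHA-1 hashing, and all function names are assumptions made only for illustration.

```python
import hashlib
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def entity_contexts(tokens, entities):
    """Collect (previous word, entity, next word) triples for each entity hit.

    `entities` is a hypothetical set of lowercase single-token entity names;
    a real system would use a named entity recognizer instead.
    """
    triples = set()
    for i, tok in enumerate(tokens):
        if tok in entities:
            prev_tok = tokens[i - 1] if i > 0 else ""
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
            triples.add((prev_tok, tok, next_tok))
    return triples


def signature(text, entities):
    """Hash the sorted context triples into a single document signature."""
    triples = sorted(entity_contexts(tokenize(text), entities))
    if not triples:
        return None  # no entities found: this sketch assigns no signature
    joined = "|".join(" ".join(t) for t in triples)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def group_near_duplicates(docs, entities):
    """Return lists of document ids that share the same signature."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        sig = signature(text, entities)
        if sig is not None:
            groups[sig].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical sample data, not taken from the thesis collection.
entities = {"bilkent", "ankara"}
docs = {
    "a": "The rector visited Bilkent University in Ankara on Monday.",
    "b": "Sources say the rector visited Bilkent University in Ankara on Tuesday.",
    "c": "Ankara hosted a concert near Bilkent campus.",
}
duplicates = group_near_duplicates(docs, entities)
```

In this sample, documents "a" and "b" yield the same entity contexts and therefore the same signature, while "c" mentions the same entities in different contexts and is kept apart.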

Files

Original bundle
Name: 0006175.pdf
Size: 1.16 MB
Format: Adobe Portable Document Format