Near-duplicate news detection using named entities

Uyar, Erkan

Near-duplicate news detection using named entities

Files

0006175.pdf (1.16 MB)

Date

2009

Authors

Uyar, Erkan

Advisor

Can, Fazlı

BUIR Usage Stats

2
views

34
downloads

Abstract

The number of web documents has been increasing in an exponential manner for more than a decade. In a similar way, partially or completely duplicate documents appear frequently on the Web. Advances in the Internet technologies have increased the number of news agencies. People tend to read news from news portals that aggregate documents from different sources. The existence of duplicate or near-duplicate news in these portals is a common problem. Duplicate documents create redundancy and only a few users may want to read news containing identical information. Duplicate documents decrease the efficiency and effectiveness of search engines. In this thesis, we propose and evaluate a new near-duplicate news detection algorithm: Tweezer. In this algorithm, named entities and the words that appear before and after them are used to create document signatures. Documents sharing the same signatures are considered as a nearduplicate. For named entity detection, we introduce a method called Turkish Named Entity Recognizer, TuNER. For the evaluation of Tweezer, a document collection is created using news articles obtained from Bilkent News Portal. In the experiments, Tweezer is compared with I-Match, which is a state-of-the-art near-duplicate detection algorithm that creates document signatures using Inverse Document Frequency, IDF, values of terms. It is experimentally shown that the effectiveness of Tweezer is statistically significantly better than that of I-Match by using a cost function that combines false alarm and miss rate probabilities, and the F-measure that combines precision and recall. Furthermore, Tweezer is at least 7% faster than I-Match.

Keywords

Bilkent News Portal, I-Match, Inverse document frequency (IDF), Named entity recognition (NER), Near-duplicate detection, T-test, Turkish Named Entity Recognizer (TuNER), Tweezer

Degree Discipline

Computer Engineering

Degree Level

Master's

Degree Name

MS (Master of Science)

Permalink

http://hdl.handle.net/11693/15436

Collections

Graduate School of Engineering and Science

Language

English

Type

Thesis

Full item page

Near-duplicate news detection using named entities

Files

Date

Authors

Editor(s)

Advisor

Supervisor

Co-Advisor

Co-Supervisor

Instructor

BUIR Usage Stats

Series

Abstract

Source Title

Publisher

Course

Other identifiers

Book Title

Keywords

Degree Discipline

Degree Level

Degree Name

Citation

Permalink

Published Version (Please cite this version)

Collections

Language

Type

Near-duplicate news detection using named entities

Files

Date

Authors

Editor(s)

Advisor

Supervisor

Co-Advisor

Co-Supervisor

Instructor

BUIR Usage Stats

Share

Series

Abstract

Source Title

Publisher

Course

Other identifiers

Book Title

Keywords

Degree Discipline

Degree Level

Degree Name

Citation

Permalink

Published Version (Please cite this version)

Collections

Language

Type