Near-duplicate news detection using named entities
Author(s)
Advisor
Can, Fazlı
Date
2009
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract
The number of web documents has been increasing in an exponential manner for more
than a decade. In a similar way, partially or completely duplicate documents appear
frequently on the Web. Advances in Internet technologies have increased the number
of news agencies. People tend to read news from news portals that aggregate documents
from different sources. The existence of duplicate or near-duplicate news in these portals
is a common problem. Duplicate documents create redundancy, and few users
want to read news containing identical information. Duplicate documents also decrease
the efficiency and effectiveness of search engines. In this thesis, we propose and
evaluate a new near-duplicate news detection algorithm: Tweezer. In this algorithm,
named entities and the words that appear before and after them are used to create
document signatures. Documents that share the same signature are considered near-duplicates.
For named entity detection, we introduce a method called Turkish Named
Entity Recognizer, TuNER. For the evaluation of Tweezer, a document collection is
created using news articles obtained from Bilkent News Portal. In the experiments,
Tweezer is compared with I-Match, a state-of-the-art near-duplicate detection
algorithm that creates document signatures using the inverse document frequency (IDF)
values of terms. Experiments show that Tweezer is statistically significantly more
effective than I-Match under two measures: a cost function that combines false alarm
and miss probabilities, and the F-measure, which combines precision and recall.
Furthermore, Tweezer is at least 7% faster than I-Match.
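The signature idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the one-word context window, the sorting of entity contexts, the SHA-1 hash, and the single-token entity set (a stand-in for the output of a recognizer such as TuNER) are all assumptions made for illustration.

```python
import hashlib
import re
from collections import defaultdict


def signature(tokens, entities, window=1):
    """Hash every named entity together with its neighbouring words.

    `entities` is a hypothetical set of lowercase single-token entities;
    `window` is an assumed number of context words on each side.
    Returns None when the document contains no known entity.
    """
    parts = []
    for i, tok in enumerate(tokens):
        if tok in entities:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            parts.append(" ".join(left + [tok] + right))
    if not parts:
        return None
    # Sorting makes the signature independent of entity order (an assumption).
    joined = "|".join(sorted(parts))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def near_duplicate_groups(docs, entities, window=1):
    """Group document ids whose signatures are identical."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = re.findall(r"\w+", text.lower())
        sig = signature(tokens, entities, window)
        if sig is not None:  # skip documents without any entity
            buckets[sig].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


docs = {
    "a": "Officials in Ankara said talks resumed.",
    "b": "Yesterday officials in Ankara said that talks ended.",
    "c": "Officials from Ankara denied it.",
}
groups = near_duplicate_groups(docs, {"ankara"})
```

In this toy collection, documents `a` and `b` share the context "in ankara said" and fall into one group, while `c` does not. A real system would also need multi-word entities and language-aware case folding, which the sketch omits.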
Keywords
Bilkent News Portal
I-Match
Inverse document frequency (IDF)
Named entity recognition (NER)
Near-duplicate detection
T-test
Turkish Named Entity Recognizer (TuNER)
Tweezer