Near-duplicate news detection using named entities
Item Usage Stats
The number of web documents has been increasing in an exponential manner for more than a decade. In a similar way, partially or completely duplicate documents appear frequently on the Web. Advances in the Internet technologies have increased the number of news agencies. People tend to read news from news portals that aggregate documents from different sources. The existence of duplicate or near-duplicate news in these portals is a common problem. Duplicate documents create redundancy and only a few users may want to read news containing identical information. Duplicate documents decrease the efficiency and effectiveness of search engines. In this thesis, we propose and evaluate a new near-duplicate news detection algorithm: Tweezer. In this algorithm, named entities and the words that appear before and after them are used to create document signatures. Documents sharing the same signatures are considered as a nearduplicate. For named entity detection, we introduce a method called Turkish Named Entity Recognizer, TuNER. For the evaluation of Tweezer, a document collection is created using news articles obtained from Bilkent News Portal. In the experiments, Tweezer is compared with I-Match, which is a state-of-the-art near-duplicate detection algorithm that creates document signatures using Inverse Document Frequency, IDF, values of terms. It is experimentally shown that the effectiveness of Tweezer is statistically significantly better than that of I-Match by using a cost function that combines false alarm and miss rate probabilities, and the F-measure that combines precision and recall. Furthermore, Tweezer is at least 7% faster than I-Match.
KeywordsBilkent News Portal
Inverse document frequency (IDF)
Named entity recognition (NER)
Turkish Named Entity Recognizer (TuNER)