First large-scale information retrieval experiments on Turkish texts

Date

2006-08

Editor(s)

Advisor

Supervisor

Co-Advisor

Co-Supervisor

Instructor

Source Title

Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Print ISSN

Electronic ISSN

Publisher

ACM

Volume

Issue

Pages

627 - 628

Language

English

Journal Title

Journal ISSN

Volume Title

Series

Abstract

We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, which has been created for this study, contains 95.5 million words, 408,305 documents, 72 ad hoc queries and has a size of about 800MB. All documents come from the Turkish newspaper Milliyet. We implement and apply simple to sophisticated stemmers and various query-document matching fonctions and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.

Course

Other identifiers

Book Title

Citation