First large-scale information retrieval experiments on Turkish texts
Date
2006-08
Source Title
Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher
ACM
Pages
627 - 628
Language
English
Type
Conference Paper
Abstract
We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, which has been created for this study, contains 95.5 million words, 408,305 documents, 72 ad hoc queries and has a size of about 800MB. All documents come from the Turkish newspaper Milliyet. We implement and apply simple to sophisticated stemmers and various query-document matching functions and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
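The prefix-truncation stemmer mentioned in the abstract can be illustrated with a minimal sketch. Only the prefix length of 5 comes from the abstract; the function names (`turkish_lower`, `truncate_stem`, `index_terms`), the tokenization rule, and the case-folding step are assumptions made for illustration and are not taken from the paper.

```python
import re


def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules.

    Python's str.lower() maps "I" to "i" and "İ" to "i" plus a combining
    dot, so the two Turkish-specific letters are handled first.
    """
    return text.replace("İ", "i").replace("I", "ı").lower()


def truncate_stem(word: str, prefix_len: int = 5) -> str:
    """Return the first prefix_len characters of the lowercased word.

    Words shorter than prefix_len are returned unchanged (lowercased).
    """
    return turkish_lower(word)[:prefix_len]


def index_terms(text: str, prefix_len: int = 5) -> list[str]:
    """Split text into runs of letters and truncate each one.

    The letters-only tokenization is an assumption for this sketch.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    return [truncate_stem(token, prefix_len) for token in tokens]


if __name__ == "__main__":
    # Two inflected forms of the same root collapse to one index term.
    print(index_terms("Kitaplar ve kitaplıklar"))
```

Truncation needs no dictionary or morphological analysis, which is what makes it a simple baseline. It cannot recover a root whose surface form changes under inflection (for example, "kitabı" truncates to "kitab" rather than "kitap"), which is the kind of case a lemmatizer-based stemmer is designed to handle.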
Keywords
IR test collection creation, Lemmatizer, Stemming, Turkish, Data acquisition, Data mining, Information technology, Query languages, Ad hoc networks, Query processing, Text processing, Large-scale information retrieval, Matching functions, Information retrieval, Information retrieval systems