First large-scale information retrieval experiments on Turkish texts
Date
2006-08
Source Title
Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher
ACM
Pages
627 - 628
Language
English
Abstract
We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, created for this study, contains 95.5 million words, 408,305 documents, and 72 ad hoc queries, and is about 800 MB in size. All documents come from the Turkish newspaper Milliyet. We implement and apply stemmers ranging from simple to sophisticated, together with various query-document matching functions, and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
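As an illustration of the prefix-truncation stemming mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The function names (`truncate_stem`, `turkish_lower`, `index_terms`), the whitespace tokenization, and the case-folding step are assumptions introduced here; only the prefix length of 5 comes from the abstract.

```python
def truncate_stem(word: str, prefix_len: int = 5) -> str:
    """Return the first `prefix_len` characters of `word`.

    Words shorter than `prefix_len` are returned unchanged.
    """
    return word[:prefix_len]


def turkish_lower(text: str) -> str:
    """Lowercase text with Turkish dotted/dotless I handling (assumed step).

    The replacements run before str.lower() so that 'I' maps to 'ı'
    and 'İ' maps to a plain 'i'.
    """
    return text.replace("I", "ı").replace("İ", "i").lower()


def index_terms(text: str, prefix_len: int = 5) -> list[str]:
    """Split on whitespace (an assumed, simplistic tokenizer) and truncate."""
    return [truncate_stem(tok, prefix_len) for tok in turkish_lower(text).split()]


if __name__ == "__main__":
    # Inflected forms of "kitap" (book) collapse to the same index term.
    print(index_terms("kitaplar kitaplarda kitap"))
```

Applying the same truncation to both documents and queries lets inflected forms that share their first five characters match each other, without requiring any morphological analysis.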
Keywords
IR test collection creation, Lemmatizer, Stemming, Turkish, Data acquisition, Data mining, Information technology, Query languages, Ad hoc networks, Query processing, Text processing, Large-scale information retrieval, Matching functions, Information retrieval, Information retrieval systems