Browsing by Subject "Word segmentation"
Now showing 1 - 3 of 3
Item Open Access
Atatürk'ün el yazmalarının işlenmesi (IEEE, 2010-04) Soysal, Talha; Adıgüzel, Hande; Öktem, Alp; Haman, Alican; Can, Ethem Fatih; Duygulu, Pınar; Kalpaklı, Mehmet
Bu çalışmada Atatürk'ün el yazmalarının etkin ve kolay erişimini sağlayabilecek kelime tabanlı bir arama sisteminin ilk aşaması olarak sayısallaştırılmış belgelerin ön işlemesi ve satır ve kelimelere bölütlenmesi konusunda çalışmalar yapılmıştır. Tarihi el yazması belgeler çeşitli zorluklar getirmekte, basılı belgelerde kullanılan yöntemlerin uygulanması başarılı sonuçlar üretememektedir. Bu nedenle daha gelişmiş çözümler üzerine yoğunlaşarak satır bölütlemede Hough dönüşümü [1] tabanlı bir yöntem uyarlanmış, kelime bölütlemede ise yazıların eğikliği göz önüne alınmıştır. Afet İnan tarafından sağlanan belgelerin [4] 30 sayfası üzerinde yapılan çalışmalarda elde edilen sonuçlar gelecek çalışmalar açısından umut vericidir.
In this paper, as a first step toward easy and convenient access to the manuscripts of Atatürk through a word-based search engine, the preprocessing of digitized documents and their segmentation into lines and words is studied. Historical manuscripts pose various difficulties, and the techniques applied to printed documents may not yield satisfactory results on them. For this reason, more advanced techniques are applied: a method based on the Hough transform [1] for line segmentation, and a method that accounts for the skew of the writing for word segmentation. The results obtained on 30 pages of the documents provided by Afet İnan [2] are accurate and promising for future research. ©2010 IEEE.

Item Open Access
OTAP Ottoman archives internet interface (IEEE, 2012) Şahin, Emre; Adıgüzel, Hande; Duygulu, Pınar; Kalpaklı, Mehmet
Within the Ottoman Text Archive Project, a web interface has been developed to aid in the uploading, binarization, line and word segmentation, labeling, recognition, and testing of Ottoman Turkish texts.
This interface makes it possible to capture the expert knowledge of scholars working with the Ottoman archives and to apply that knowledge in developing further technologies for the transliteration of historical manuscripts. © 2012 IEEE.

Item Open Access
A statistical information extraction system for Turkish (2000) Tür, Gökhan
This thesis presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. We have successfully applied statistical methods using both lexical and morphological information to the following tasks:
- The Turkish Text Deasciifier task aims to convert the ASCII characters in a Turkish text into the corresponding non-ASCII Turkish characters (i.e., "ü", "ö", "ç", "ş", "ğ", "ı", and their upper cases).
- The Word Segmentation task aims to detect word boundaries, given a sequence of characters without spaces or punctuation.
- The Vowel Restoration task aims to restore the vowels of an input stream whose vowels have been deleted.
- The Sentence Segmentation task aims to divide a stream of text or speech into grammatical sentences. Given a sequence of (written or spoken) words, the aim is to find the boundaries of the sentences.
- The Topic Segmentation task aims to divide a stream of text or speech into topically homogeneous blocks. Given a sequence of (written or spoken) words, the aim is to find the boundaries where topics change.
- The Name Tagging task aims to mark the names (persons, locations, and organizations) in a text.
For relatively simple tasks, such as Turkish Text Deasciification, Word Segmentation, and Vowel Restoration, lexical information alone is sufficient; but to obtain better performance on more complex tasks, such as Sentence Segmentation, Topic Segmentation, and Name Tagging, we exploit not only lexical but also morphological and contextual information.
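The word segmentation task described above can be illustrated with a minimal dictionary-based dynamic program. This is only a toy sketch with a hypothetical word list and no probabilities; the thesis itself uses statistical (n-gram lexical) models rather than a plain lexicon lookup:

```python
# Toy dictionary-based word segmentation: split an unspaced character
# stream into known words (illustration only; the thesis uses
# statistical lexical models, not a fixed word list).

def segment(text, lexicon):
    """Return one segmentation of `text` into lexicon words, or None."""
    n = len(text)
    # best[i] holds a segmentation of text[:i], built left to right
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in lexicon:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

words = {"detect", "word", "boundaries"}       # hypothetical toy lexicon
print(segment("detectwordboundaries", words))  # ['detect', 'word', 'boundaries']
```

A statistical variant would replace the lexicon membership test with a cost (e.g., negative log word probability) and keep the minimum-cost split at each position.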
For sentence segmentation, we have modeled the final inflectional groups of the words and combined them with the lexical model, decreasing the error rate to 4.34%. For name tagging, in addition to the lexical and morphological models, we have also employed contextual and tag models, reaching an F-measure of 91.56%. For topic segmentation, stems of the words (nouns) have been found to be more effective than their surface forms, and we have achieved a 10.90% segmentation error rate on our test set.
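The Hough-transform line segmentation adapted in the first item above can be sketched in a few lines. This is a bare voting scheme on synthetic pixels with an assumed coarse angle grid; the paper's actual method adds preprocessing and handles the irregularities of handwritten lines, which this sketch omits:

```python
import math

# Toy Hough-transform line detection: each "ink" pixel votes for the
# (rho, theta) lines passing through it; the accumulator peak is the
# dominant line (illustration only, not the paper's full pipeline).

def hough_peak(points, thetas, rho_max):
    """Vote (rho, theta) for each pixel; return the best-supported line."""
    acc = {}
    for x, y in points:
        for t in thetas:
            rho = round(x * math.cos(t) + y * math.sin(t))
            if -rho_max <= rho <= rho_max:
                acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return max(acc, key=acc.get)

# A synthetic horizontal "text line" at y = 3.
pixels = [(x, 3) for x in range(10)]
thetas = [0.0, math.pi / 4, math.pi / 2]  # coarse angle grid (assumption)
rho, theta = hough_peak(pixels, thetas, rho_max=20)
print(rho, theta)  # peak at rho = 3, theta = pi/2 (horizontal line)
```

For text-line segmentation, near-horizontal theta values dominate the vote, and each strong accumulator peak corresponds to one line of writing.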