Browsing by Subject "Computational linguistics."

Now showing 1 - 15 of 15

Open Access
An ATN grammar for Turkish
(1993) Demir, Coşkun
Syntactic parsing is an important step in any natural language processing system. Augmented Transition Networks (ATNs) are procedural mechanisms which have been one of the earliest and most common paradigms for parsing natural language. ATNs have the generative power of a Turing machine and were first popularized by Woods in 1970. This thesis presents our efforts in developing an ATN grammar for a subset of Turkish including simple and complex sentences. There are five networks in our grammar: the sentence (S) network, which includes the sentence structures that fall within our scope, the noun phrase (NP) network, the adverbial phrase (ADVP) network and finally the clause (CLAUSE) and gerund (GERUND) networks for handling complex sentences. We present results from parsing a large number of Turkish sentences.
Open Access
Automating information extraction task for Turkish texts
(2011) Tatar, Serhan
Throughout history, mankind has often suffered from a lack of necessary resources. In today’s information world, the challenge can sometimes be a wealth of resources. That is to say, an excessive amount of information implies the need to find and extract necessary information. Information extraction can be defined as the identification of selected types of entities, relations, facts or events in a set of unstructured text documents in a natural language. The goal of our research is to build a system that automatically locates and extracts information from Turkish unstructured texts. Our study focuses on two basic Information Extraction (IE) tasks: Named Entity Recognition and Entity Relation Detection. Named Entity Recognition, finding named entities (persons, locations, organizations, etc.) located in unstructured texts, is one of the most fundamental IE tasks. The Entity Relation Detection task tries to identify relationships between entities mentioned in text documents. Using a supervised learning strategy, the developed systems start with a set of examples collected from a training dataset and generate the extraction rules from the given examples by using a carefully designed coverage algorithm. Moreover, several rule filtering and rule refinement techniques are utilized to maximize generalization and accuracy at the same time. In order to obtain accurate generalization, we use several syntactic and semantic features of the text, including orthographical, contextual, lexical and morphological features. In particular, morphological features of the text are effectively used in this study to increase the extraction performance for Turkish, an agglutinative language. Since the system does not rely on handcrafted rules/patterns, it does not suffer heavily from the domain adaptability problem.
The results of the conducted experiments show that (1) the developed systems are successfully applicable to the Named Entity Recognition and Entity Relation Detection tasks, and (2) exploiting morphological features can significantly improve the performance of information extraction from Turkish, an agglutinative language.
Open Access
Computational situation theory with baby-sit
(1995) Tin, Erkan
Language is an integral part of our everyday experience and encompasses situated activities such as talking, listening, reading, and writing. These activities are situated because they occur in situations and they are about situations. Their primary function, on the other hand, is to convey information. With this vision, situation theory has been developed over the last decade or so and various versions of the theory have been applied to a number of linguistic issues. However, not much work has been done in regard to its computational aspects. Existing approaches towards 'computational situation theory' incorporate only some of the original features of situation theory and hence show conceptual and philosophical divergence from its ontology. This thesis presents a computational account of situation theory that embodies the essentials of the theory and adopts its ontological features. A medium (called BABY-SIT) which is based on the proposed computational foundation is described and its constructs are formally defined. The features of BABY-SIT are compared to those of the existing approaches. In order to demonstrate the appropriateness of BABY-SIT, some examples from the domain of artificial intelligence are given. Resolution of pronominal anaphora in Turkish, which has been chosen as a linguistic test-bed for BABY-SIT, is also demonstrated.
Open Access
Computer-aided analysis of English punctuation on a parsed corpus: the special case of comma
(1996) Bayraktar, Murat
Punctuation, an orthographical component of language, has usually been ignored by most research in computational linguistics over the years. One reason for this is the overall difficulty of the subject, and another is the absence of a good theory. On the other hand, both ‘conventional’ and computational linguistics have increased their attention to punctuation in recent years because it has been realized that true understanding and processing of written language will be almost impossible if punctuation marks are not taken into account. Except for the lists of rules given in style manuals or usage books, we know little about punctuation. These books give us information about how we should punctuate, but they are generally silent about actual punctuation practice. This thesis contains the details of a computer-aided experiment to investigate English punctuation practice, for the special case of the comma (the most significant punctuation mark) in a parsed corpus. The experiment attempts to classify the various uses of the comma according to the syntax-patterns in which it occurs. The corpus (Penn Treebank) consists of syntactically annotated sentences with no part-of-speech tag information about individual words, and this ideally seems to be enough to classify ‘structural’ punctuation marks.
Open Access
Design and implementation of a tactical generator for Turkish, a free constituent order language
(1996) Hakkani, Dilek Zeynep
This thesis describes a tactical generator for Turkish, a free constituent order language, in which the order of the constituents may change according to the information structure of the sentences to be generated. In the absence of any information regarding the information structure of a sentence (i.e., topic, focus, background, etc.), the constituents of the sentence obey a default order, but the order is almost freely changeable, depending on the constraints of the text flow or discourse. We have used a recursively structured finite state machine for handling the changes in constituent order, implemented as a right-linear grammar backbone. Our implementation environment is the GenKit system, developed at Carnegie Mellon University-Center for Machine Translation. Morphological realization has been implemented using an external morphological analysis/generation component which performs concrete morpheme selection and handles morphographemic processes.
Open Access
Design and implementation of a verb lexicon and verb sense disambiguator for Turkish
(1994) Yılmaz, Okan
The lexicon has a crucial role in all natural language processing systems and has special importance in machine translation systems. This thesis presents the design and implementation of a verb lexicon and a verb sense disambiguator for Turkish. The lexicon contains only verbs because verbs encode events in sentences and play the most important role in natural language processing systems, especially in parsing (syntactic analysis) and machine translation. The verb sense disambiguator uses the information stored in the verb lexicon that we developed. The main purpose of this tool is to disambiguate senses of verbs having several meanings, some of which are idiomatic. We also present a tool implemented in Lucid Common Lisp under X-Windows for adding, accessing, modifying, and removing entries of the lexicon, and a semantic concept ontology containing semantic features of commonly used Turkish nouns.
Open Access
Joint source channel coding using sequential decoding
(1997) Doğrusöz, Bekir Ahmet
In systems using conventional source encoding, the source sequence is changed into a series of approximately independent, equally likely binary digits. The performance of a code is bounded by the rate distortion function and improves as the redundancy of the encoder output is decreased. However, decreasing the redundancy implies increasing the block length and hence the complexity. For systems requiring low complexity at the transmitter, joint source channel (JSC) coding can be successfully used for direct encoding of the source into the channel for lossless recovery. In such a system, without any distortion, compression depends on the redundancy of the source, and is bounded by the Rényi entropy of the source. In this thesis we analyze transmission of English text with a JSC coding system. Written English is a good example of a source with natural redundancy. Since we are unable to calculate the Rényi entropy of written English, we obtain estimates and compare them with the experimental results. We also work on an alternative source encoding method for the accuracy-compression trade-off in joint source channel coding systems. The proposed stochastic distortion encoder (SDE) is capable of achieving an accuracy-compression trade-off at any average distortion constraint with very low block lengths, and hence performs better than or as well as an equivalent rate distortion encoder. As the block length approaches infinity, the performance of the stochastic distortion encoder approaches the rate distortion function. Formulations for optimal SDE design and results for block lengths 1, 2 and 3 are also given.
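As a rough illustration of the kind of estimate the abstract mentions, the sketch below computes an empirical Rényi entropy of order alpha from single-character frequencies. The function name, the single-character (memoryless) model, and the default order are assumptions made for this example; the abstract does not say which order or which estimation method the thesis uses.

```python
import math
from collections import Counter


def renyi_entropy(text, alpha=0.5):
    """Empirical Renyi entropy (bits per character) of order alpha.

    Assumption: characters are treated as independent draws, so this
    ignores the inter-character redundancy of real written English.
    """
    if not text:
        raise ValueError("text must be non-empty")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    counts = Counter(text)
    total = len(text)
    probs = [c / total for c in counts.values()]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is the Shannon entropy.
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)
```

For a uniform distribution every order gives the same value, so a string of four distinct characters yields 2 bits per character; skewed distributions separate the orders.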
Open Access
A lexical-functional grammar for Turkish
(1993) Güngördü, Zelal
Natural language processing is a research area which is becoming increasingly popular each day for both academic and commercial reasons. Syntactic parsing underlies most of the applications in natural language processing. Although there have been comprehensive studies of Turkish syntax from a linguistic perspective, this is one of the first attempts to investigate it extensively from a computational point of view. In this thesis, a lexical-functional grammar for Turkish syntax is presented. Our current work deals with regular Turkish sentences that are structurally simple or complex.
Open Access
Noun phrase chunker for Turkish using dependency parser
(2010) Kutlu, Mücahid
Noun phrase chunking is a sub-category of shallow parsing that can be used for many natural language processing tasks. In this thesis, we propose a noun phrase chunker system for Turkish texts. We use a weighted constraint dependency parser to represent the relationship between sentence components and to determine noun phrases. The dependency parser uses a set of hand-crafted rules which can combine morphological and semantic information for constraints. The rules are suitable for handling complex noun phrase structures because of their flexibility. The developed dependency parser can be easily used for shallow parsing of all phrase types by changing the employed rule set. The lack of reliable human-tagged datasets is a significant problem for natural language studies of Turkish. Therefore, we constructed the first noun phrase dataset for Turkish. According to our evaluation results, our noun phrase chunker gives promising results on this dataset. The correctness of the dependency parser depends on correct morphological disambiguation of words. Therefore, in this thesis, we propose a hybrid morphological disambiguation technique which combines statistical information, hand-crafted grammar rules, and transformation-based learning rules. We have also constructed a dataset for testing the performance of our disambiguation system. According to tests, the disambiguation system is highly effective.
Open Access
A performatory analysis of the overt use of the predicate "true"
(2013) Şenol, Mahmut Burak
The deflationary theory has been one of the most influential theories of truth in contemporary philosophy. This theory holds that there is no property of truth at all, and that overt uses of the predicate "true" in our sentences are redundant, having absolutely no effect on what we express. However, all hypothetical examples used by deflationary theorists in exemplifying the theory, in papers and books, have been taken out of context. Thus, there is no way to examine and analyze what the predicate adds to the sentence within context. We oppose this theory not on philosophical grounds, but on empirical grounds, with an "ordinary language philosophy" approach. We computationally collect 7610 occurrences of overt uses of the predicate "true" in the form "it is true that", from 10 influential periodicals (newspapers and a magazine) published in the United States. We classify and annotate these examples with respect to the positions of the coordinating and subordinating conjunctions they contain. We investigate contextual relations of the proposition following the phrase "it is true that" with its surrounding propositions. We encounter 34 different syntactical patterns. We propose that in some occurrences of overt uses of the predicate "true", the presence of the predicate adds emphasis and performs an action in the same manner as a performatory verb does. We provide ordinary language appearances of overt uses of the predicate "true", which have been used in linguistically reliable media and constitute pragmatic 'counter-examples' to the deflationary theory of truth.
Open Access
Segmentation based Ottoman text and matching based Kufic image analysis
(2013) Adıgüzel, Hande
Large archives of historical documents attract many researchers from all around the world. The increasing demand to access those archives makes automatic retrieval and recognition of historical documents crucial. Ottoman archives are one of the largest collections of historical documents. Although Ottoman is not a currently spoken language, many researchers from all around the world are interested in accessing the archived material. This thesis proposes two Ottoman document analysis studies: the first is a crucial pre-processing task for retrieval and recognition, namely segmentation of documents; the second is a more specific retrieval and recognition problem which aims at matching Islamic patterns in Kufic images. For the first task, layout, line and word segmentation are studied. Layout segmentation is obtained via Log-Gabor filtering. Four different algorithms are proposed for line segmentation and finally a simple morphological method is preferred for word segmentation. Datasets are constructed with documents in Ottoman and in other languages (English, Greek and Bangla) to test the script independence of the methods. Experiments show that our segmentation steps give satisfactory results. The second task aims to detect Islamic patterns in Kufic images. The sub-patterns are considered as basic units and matching is used for the analysis. Graphs are preferred to represent sub-patterns, and graph and sub-graph isomorphism are used for matching them. Kufic images are analyzed in three different ways. Given a query pattern, all the instances of the query can be found through retrieval. Going further, images in the entire dataset can be automatically labeled through known patterns. Finally, patterns that repeat inside an image can be automatically discovered. As there is no existing Kufic dataset, a new one is constructed by collecting images from the Internet and promising results are obtained on this dataset.
Open Access
Tagging and morphological disambiguation of Turkish text
(1994) Kuruöz, İlker
A part-of-speech (POS) tagger is a system that uses various sources of information to assign possibly unique POS to words. Automatic text tagging is an important component in higher level analysis of text corpora. Its output can also be used in many natural language processing applications. In languages like Turkish or Finnish, with agglutinative morphology, morphological disambiguation is a crucial process in tagging as the structures of many lexical forms are morphologically ambiguous. This thesis presents a POS tagger for Turkish text based on a full-scale two-level specification of Turkish morphology. The tagger is augmented with a multi-word and idiomatic construct recognizer and, most importantly, a morphological disambiguator based on local lexical neighborhood constraints, heuristics and a limited amount of statistical information. The tagger also has additional functionality for statistics compilation and fine tuning of the morphological analyzer, such as logging erroneous morphological parses, commonly used roots, etc. Test results indicate that the tagger can tag about 97% to 99% of the texts accurately with minimal user intervention. Furthermore, for sentences morphologically disambiguated with the tagger, an LFG parser developed for Turkish generates, on average, 50% fewer ambiguous parses almost 2.5 times faster.
Open Access
A template-independent content extraction approach for news web pages
(2012) Yeniçağ, Ahmet
News web pages contain additional elements such as advertisements, hyperlinks, and reader comments. These elements make the extraction of news contents a challenging task. Current news content extraction (NCE) methods are usually template-dependent. They require regular maintenance, since news providers frequently change their web page templates. Therefore, there is a need for NCE methods that extract news contents accurately without depending on web page templates. In this thesis, a template-independent News content EXTraction approach, called N-EXT, is introduced. It first parses a web page into its blocks according to the HTML tags. Then, it examines all blocks to detect the one that contains the major part of the news content. For this purpose, it assigns weights to the blocks by considering both their textual sizes and similarities to the news title. For quantifying the importance of these two weight components, we use the k-fold cross validation approach; and for assessing the impact of different possible similarity measures, we use a one-way Analysis of Variance (ANOVA) with a Scheffé comparison. The block with the highest weight is considered as the news block. Our approach eliminates the sentences in the news block that are not related to the news content by considering similarities of sentences to the news block. Finally, it also examines other blocks to detect the rest of the news content. The experimental results show the accuracy and robustness of our method by using two test collections whose web pages are obtained from several different news websites.
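The block-weighting step described in this abstract can be sketched roughly as follows. This is not the N-EXT implementation: the function names, the linear combination with a mixing parameter `alpha`, and the choice of Jaccard token overlap as the similarity measure are all assumptions made for illustration (the abstract says only that size and title similarity are combined and that several similarity measures were compared).

```python
import re


def tokens(text):
    """Lowercased word tokens of a text, as a set."""
    return set(re.findall(r"\w+", text.lower()))


def title_similarity(block, title):
    """Jaccard overlap between block and title tokens (assumed measure)."""
    a, b = tokens(block), tokens(title)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_news_block(blocks, title, alpha=0.5):
    """Return the block with the highest combined weight, or None.

    alpha is a hypothetical mixing weight between normalized text size
    and title similarity; the thesis tunes the balance by k-fold cross
    validation, which is not reproduced here.
    """
    longest = max((len(b) for b in blocks), default=0)
    if longest == 0:
        return None

    def weight(block):
        size = len(block) / longest  # normalized textual size in [0, 1]
        return alpha * size + (1.0 - alpha) * title_similarity(block, title)

    return max(blocks, key=weight)
```

Given the text blocks of a page and its title, `pick_news_block(blocks, title)` returns the candidate main block; sentence filtering and the search for the rest of the content, which the abstract also describes, are left out of this sketch.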
Open Access
Turkish text generation with systemic-functional grammar
(1996) Korkmaz, Turgay
Natural Language Generation (NLG) is roughly decomposed into two stages: text planning and text generation. In the text planning stage, the semantic description of the text is produced from the conceptual inputs. Then, the text generation system transforms this semantic description into an actual text. This thesis focuses on the design and implementation of a Turkish text generation system rather than text planning. To develop a text generator, we need a linguistic theory that describes the resources of the desired natural language, and also a software tool that represents and applies these linguistic resources in a computational environment. In this thesis, to meet these requirements, we have used a functional linguistic theory called Systemic-Functional Grammar (SFG), and the FUF text generation system as a software tool. The resulting text generation system takes the semantic description of the text sentence by sentence, and then produces a morphological description for each lexical constituent of the sentence. The morphological descriptions are worded by a Turkish morphological generator. Because of our concentration on text generation, we have not considered the details of text planning. Hence, we assume that the semantic description of the text is produced and lexicalized by an application (currently given by hand).
Open Access
Using multiple sources of information for constraint-based morphological disambiguation
(1996) Tür, Gökhan
This thesis presents a constraint-based morphological disambiguation approach that is applicable to languages with complex morphology, specifically agglutinative languages with productive inflectional and derivational morphological phenomena. For morphologically complex languages like Turkish, automatic morphological disambiguation involves selecting, for each token, the morphological parse(s) with the right set of inflectional and derivational markers. Our system combines corpus-independent hand-crafted constraint rules, constraint rules that are learned via unsupervised learning from a training corpus, and additional statistical information obtained from the corpus to be morphologically disambiguated. The hand-crafted rules are linguistically motivated and tuned to improve precision without sacrificing recall. In certain respects, our approach has been motivated by Brill’s recent work [6], but with the observation that his transformational approach is not directly applicable to languages like Turkish. Our approach also uses a novel method for unknown word processing, employing a secondary morphological processor which recovers any relevant inflectional and derivational information from a lexical item whose root is unknown. With this approach, well below 1% of the tokens remain unknown in the texts we have experimented with. Our results indicate that by combining these hand-crafted, statistical and learned information sources, we can attain a recall of 96 to 97% with a corresponding precision of 93 to 94%, and ambiguity of 1.02 to 1.03 parses per token.
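The three figures reported at the end of this abstract can be related with a small sketch. The definitions used here are assumptions, since the abstract does not spell them out: recall is taken as the fraction of tokens whose correct parse survives disambiguation, precision as correct parses over all remaining parses, and ambiguity as remaining parses per token. The function name and the parse labels in the usage note are hypothetical.

```python
def disambiguation_metrics(remaining, gold):
    """Compute recall, precision and ambiguity for a disambiguated text.

    remaining: one collection of surviving parses per token.
    gold: the correct parse for each token, in the same order.
    The metric definitions are assumed, not taken from the thesis.
    """
    if len(remaining) != len(gold):
        raise ValueError("remaining and gold must have the same length")
    n_tokens = len(gold)
    n_parses = sum(len(parses) for parses in remaining)
    if n_tokens == 0 or n_parses == 0:
        raise ValueError("need at least one token and one remaining parse")
    # A token counts as correct if its gold parse is still among the survivors.
    n_correct = sum(1 for parses, g in zip(remaining, gold) if g in parses)
    return {
        "recall": n_correct / n_tokens,
        "precision": n_correct / n_parses,
        "ambiguity": n_parses / n_tokens,
    }
```

For example, four tokens with surviving parses `[{"A"}, {"B", "C"}, {"D"}, {"E"}]` and gold parses `["A", "B", "X", "E"]` give a recall of 0.75, a precision of 0.6 and an ambiguity of 1.25 parses per token. Under these definitions precision equals recall divided by ambiguity, which is why leaving more parses per token lowers precision even when recall is unchanged.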