Browsing by Subject "Topic segmentation"
Now showing 1 - 2 of 2
Item Open Access
Local context based linear text segmentation (2014-02) Erdem, Hayrettin
Understanding the topical structure of text documents is important for effective retrieval and browsing, automatic summarization, and tasks such as identifying, clustering, and tracking documents by topic. Although documents often display structural organization and contain explicit section markers, some lack such properties, revealing the need for topical text segmentation systems. Examples of such documents are speech transcripts and inherently unstructured texts, such as newspaper columns and blog entries discussing several subjects in one discourse. A novel local-context-based approach relying on lexical cohesion is presented for linear text segmentation, the task of dividing text into a linear sequence of coherent segments. As the lexical cohesion indicator, the proposed technique exploits relationships among terms induced from a semantic space called HAL (Hyperspace Analogue to Language), which is built by passing a fixed-size window over the text and recording term co-occurrences. The proposed algorithm (BTS) iteratively discovers topical shifts by examining the most relevant sentence pairs in the block of sentences considered at each iteration. The technique is evaluated both on error-free speech transcripts of news broadcasts and on documents formed by concatenating different topical regions of text. A new corpus for Turkish is built automatically, where each document is formed by concatenating different news articles. For performance comparison, two state-of-the-art methods, TextTiling and C99, are used, and the results show that the proposed approach performs comparably to both.
The results are also statistically validated using ANOVA and the Tukey post-hoc test.

Item Open Access
A statistical information extraction system for Turkish (2000) Tür, Gökhan
This thesis presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. We have successfully applied statistical methods, using both lexical and morphological information, to the following tasks:
- The Turkish Text Deasciifier task aims to convert the ASCII characters in a Turkish text into the corresponding non-ASCII Turkish characters (i.e., "ü", "ö", "ç", "ş", "ğ", "ı", and their upper cases).
- The Word Segmentation task aims to detect word boundaries, given a sequence of characters without spaces or punctuation.
- The Vowel Restoration task aims to restore the vowels of an input stream from which the vowels have been deleted.
- The Sentence Segmentation task aims to divide a stream of text or speech into grammatical sentences: given a sequence of (written or spoken) words, find the boundaries of the sentences.
- The Topic Segmentation task aims to divide a stream of text or speech into topically homogeneous blocks: given a sequence of (written or spoken) words, find the boundaries where topics change.
- The Name Tagging task aims to mark the names (persons, locations, and organizations) in a text.
For relatively simple tasks, such as Turkish Text Deasciification, Word Segmentation, and Vowel Restoration, lexical information alone is sufficient; to obtain better performance on more complex tasks, such as Sentence Segmentation, Topic Segmentation, and Name Tagging, we use not only lexical information but also morphological and contextual information. For sentence segmentation, we have modeled the final inflectional groups of the words, combined them with the lexical model, and decreased the error rate to 4.34%.
For name tagging, in addition to the lexical and morphological models, we have also employed contextual and tag models, reaching an F-measure of 91.56%. For topic segmentation, the stems of the words (nouns) were found to be more effective than their surface forms, and we achieved a 10.90% segmentation error rate on our test set.
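The HAL construction mentioned in the first abstract — co-occurrence counts gathered by passing a fixed-size window over text — can be sketched as below. This is a minimal illustration only: the function name, the window size, and the exact proximity-weighting scheme are assumptions for demonstration, not the thesis's formulation.

```python
from collections import defaultdict

def hal_matrix(tokens, window=5):
    """Build a simplified HAL-style co-occurrence matrix.

    For each token, the preceding tokens inside a fixed-size window are
    counted as co-occurrences, weighted by proximity (closer words
    contribute more). Direction handling and normalization vary between
    HAL variants; this sketch keeps one weight per ordered word pair.
    """
    matrix = defaultdict(float)  # (word, context_word) -> weight
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        for j in range(start, i):
            distance = i - j
            # Proximity weight in HAL fashion: window - distance + 1,
            # so adjacent words receive the maximum weight.
            matrix[(word, tokens[j])] += window - distance + 1
    return dict(matrix)

tokens = "the cat sat on the mat".split()
m = hal_matrix(tokens, window=3)
# "cat" immediately follows "the", so that pair gets the maximum weight (3)
print(m[("cat", "the")])
```

Segmentation methods built on such a space then compare the resulting term vectors (e.g., between adjacent sentence blocks) to locate drops in lexical cohesion that signal topic boundaries.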