Browsing by Subject "Summarization"

Now showing 1 - 3 of 3

Open Access
Aspect based opinion mining on Turkish tweets
(2012) Akbaş, Esra
Understanding opinions about entities or brands is instrumental in reputation management and decision making. With the advent of social media, more people are willing to publicly share their recommendations and opinions. As the type and amount of such venues increase, automated analysis of sentiment on textual resources has become an essential data mining task. Sentiment classification aims to identify the polarity of sentiment in text. The polarity is predicted on either a binary (positive, negative) or a multi-variant scale as the strength of sentiment expressed. Text often contains a mix of positive and negative sentiments, hence it is often necessary to detect both simultaneously. While classifying text based on sentiment polarity is a major task, analyzing sentiments separately for each aspect can be more useful in many applications. In this thesis, we investigate the problem of mining opinions by extracting aspects of entities/topics on collection of short texts. We focus on Turkish tweets that contain informal short messages. Most of the available resources such as lexicons and labeled corpus in the literature of opinion mining are for the English language. Our approach would help enhance the sentiment analyses to other languages where such rich sources do not exist. After a set of preprocessing steps, we extract the aspects of the product(s) from the data and group the tweets based on the extracted aspects. In addition to our manually constructed Turkish opinion word list, an automated generation of the words with their sentiment strengths is proposed using a word selection algorithm. Then, we propose a new representation of the text according to sentiment strength of the words, which we refer to as sentiment based text representation. The feature vectors of the text are constructed according to this new representation. We adapt machine learning methods to generate classifiers based on the multi-variant scale feature vectors to detect mixture of positive and negative sentiments and to test their performance on Turkish tweets.
Open Access
Lexical cohesion analysis for topic segmentation, summarization and keyphrase extraction
(2012) Ercan, Gönenç
When we express some idea or story, it is inevitable to use words that are semantically related to each other. When this phenomena is exploited from the aspect of words in the language, it is possible to infer the level of semantic relationship between words by observing their distribution and use in discourse. From the aspect of discourse it is possible to model the structure of the document by observing the changes in the lexical cohesion in order to attack high level natural language processing tasks. In this research lexical cohesion is investigated from both of these aspects by first building methods for measuring semantic relatedness of word pairs and then using these methods in the tasks of topic segmentation, summarization and keyphrase extraction. Measuring semantic relatedness of words requires prior knowledge about the words. Two different knowledge-bases are investigated in this research. The first knowledge base is a manually built network of semantic relationships, while the second relies on the distributional patterns in raw text corpora. In order to discover which method is effective in lexical cohesion analysis, a comprehensive comparison of state-of-the art methods in semantic relatedness is made. For topic segmentation different methods using some form of lexical cohesion are present in the literature. While some of these confine the relationships only to word repetition or strong semantic relationships like synonymy, no other work uses the semantic relatedness measures that can be calculated for any two word pairs in the vocabulary. Our experiments suggest that topic segmentation performance improves methods using both classical relationships and word repetition. Furthermore, the experiments compare the performance of different semantic relatedness methods in a high level task. The detected topic segments are used in summarization, and achieves better results compared to a lexical chains based method that uses WordNet. Finally, the use of lexical cohesion analysis in keyphrase extraction is investigated. Previous research shows that keyphrases are useful tools in document retrieval and navigation. While these point to a relation between keyphrases and document retrieval performance, no other work uses this relationship to identify keyphrases of a given document. We aim to establish a link between the problems of query performance prediction (QPP) and keyphrase extraction. To this end, features used in QPP are evaluated in keyphrase extraction using a Naive Bayes classifier. Our experiments indicate that these features improve the effectiveness of keyphrase extraction in documents of different length. More importantly, commonly used features of frequency and first position in text perform poorly on shorter documents, whereas QPP features are more robust and achieve better results.
Open Access
Semantic argument frequency-based multi-document summarization
(IEEE, 2009-09) Aksoy, Cem; Buğdaycı, Ahmet; Gür, Tunay; Uysal, İbrahim; Can, Fazlı
Semantic Role Labeling (SRL) aims to identify the constituents of a sentence, together with their roles with respect to the sentence predicates. In this paper, we introduce and assess the idea of using SRL on generic Multi-Document Summarization (MDS). We score sentences according to their inclusion of frequent semantic phrases and form the summary using the top-scored sentences. We compare this method with a term-based sentence scoring approach to investigate the effects of using semantic units instead of single words for sentence scoring. We also integrate our scoring metric as an auxiliary feature to a cutting edge summarizer with the intention of examining its effects on the performance. The experiments using datasets from the Document Understanding Conference (DUC) 2004 show that the SRL-based summarization outperforms the term-based approach as well as most of the DUC participants. © 2009 IEEE.