Browsing by Subject "Cluster Labeling"
Item Open Access
Cluster labeling improvement by utilizing data fusion and Wikipedia (2017-07) Ayduğan, Gökçe
A cluster is a set of related documents. Cluster labeling is the process of assigning descriptive labels to clusters. This study investigates several cluster labeling approaches and presents novel methods. The first uses the clusters themselves: it extracts important terms that distinguish clusters from each other with different statistical feature selection methods, then applies different data fusion methods to combine their outcomes. Our results show that although it provides statistically significantly better results in some cases, it is not a stable and reliable labeling method. This can be explained by the fact that a good label may not occur in the cluster at all. The second exploits Wikipedia as an external resource and uses its anchor texts and categories to enrich the label pool. Labeling with Wikipedia anchor text fails because the suggested labels tend to focus on minor topics. Although the minor topics are related to the main topic, they do not exactly describe it. After this observation, we use the categories of Wikipedia pages to improve our label pool in two ways. The first fuses important terms and Wikipedia categories with rank-based fusion methods. The second looks at the relatedness of Wikipedia pages to the clusters and uses only the categories of related pages. The experimental results show that both methods provide statistically significantly better results than the other cluster labeling approaches that we examine in this study.

Item Open Access
Enhancing feature selection with contextual relatedness filtering using Wikipedia (2017-08) Baydar, Melih
Feature selection is an important component of information retrieval and natural language processing applications. It is used to extract distinguishing terms for a group of documents; such terms can be used, for example, for clustering, multi-document summarization and classification.
The selected features are not always the best representatives of the documents because of noisy terms. To address this issue, our contribution is twofold. First, we present a novel approach for filtering out noisy, unrelated terms from the feature lists, using contextual relatedness information of terms to their topics in order to enhance feature set quality. Second, we propose a new method to assess the contextual relatedness of terms to the topic of their documents. Our approach automatically decides the contextual relatedness of a term to the topic of a set of documents using co-occurrences with the distinguishing terms of the document set inside an external knowledge source, which is Wikipedia in our work. Deleting unrelated terms from the feature lists gives a better, more related set of features. We evaluate our approach on the cluster labeling problem, where feature sets for clusters can be used as label candidates. We work on the commonly used 20NG and ODP datasets for the cluster labeling problem, finding that the approach successfully detects relevancy information of terms to topics, and that filtering out irrelevant label candidates results in significantly improved cluster labeling quality.
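
The co-occurrence filtering idea described in the abstract above could be sketched roughly as follows. This is an illustration only, not the thesis's method: the abstract does not give a scoring formula, so the `relatedness` score (the share of a term's documents that also contain a distinguishing term), the `threshold` value, the function names and the toy corpus standing in for Wikipedia are all assumptions.

```python
from __future__ import annotations

from typing import Iterable


def relatedness(term: str, topic_terms: Iterable[str], corpus: list[set[str]]) -> float:
    """Share of the documents containing `term` that also contain a topic term.

    Hypothetical score: the abstract only says co-occurrences are used.
    """
    topic = set(topic_terms) - {term}
    docs_with_term = [doc for doc in corpus if term in doc]
    if not docs_with_term or not topic:
        return 0.0
    # Count documents where the term appears together with any topic term.
    hits = sum(1 for doc in docs_with_term if doc & topic)
    return hits / len(docs_with_term)


def filter_candidates(
    candidates: Iterable[str],
    topic_terms: Iterable[str],
    corpus: list[set[str]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> list[str]:
    """Keep only the candidates whose relatedness reaches the threshold."""
    topic_terms = list(topic_terms)
    return [t for t in candidates if relatedness(t, topic_terms, corpus) >= threshold]


# Toy stand-in for the external knowledge source (each set is one page's terms).
corpus = [
    {"hockey", "goalie", "puck", "season"},
    {"hockey", "playoff", "goalie"},
    {"puck", "ice", "playoff"},
    {"email", "header", "subject"},
    {"subject", "email", "reply"},
]
topic_terms = ["hockey", "puck", "playoff"]       # distinguishing terms of a cluster
candidates = ["goalie", "ice", "subject", "email"]  # label candidates to filter

kept = filter_candidates(candidates, topic_terms, corpus)
print(kept)
```

In this toy setup, terms such as "subject" and "email" never appear alongside the cluster's distinguishing terms, so they are dropped from the label candidates, while "goalie" and "ice" are kept.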