
      Cluster labeling improvement by utilizing data fusion and Wikipedia

      Author(s)
      Ayduğan, Gökçe
      Advisor
      Can, Fazlı
      Date
      2017-07
      Publisher
      Bilkent University
      Language
      English
      Type
      Thesis
      Abstract
      A cluster is a set of related documents. Cluster labeling is the process of assigning descriptive labels to clusters. This study investigates several cluster labeling approaches and presents novel methods. The first uses the clusters themselves: it extracts important terms, which distinguish clusters from each other, with different statistical feature selection methods, and then applies different data fusion methods to combine their outcomes. Our results show that although this approach provides statistically significantly better results in some cases, it is not a stable and reliable labeling method. This can be explained by the fact that a good label may not occur in the cluster at all. The second exploits Wikipedia as an external resource and uses its anchor texts and categories to enrich the label pool. Labeling with Wikipedia anchor text fails because the suggested labels tend to focus on minor topics. Although the minor topics are related to the main topic, they do not exactly describe it. After this observation, we use the categories of Wikipedia pages to improve our label pool in two ways. The first fuses important terms and Wikipedia categories with rank-based fusion methods. The second looks at the relatedness of Wikipedia pages to the clusters and uses only the categories of related pages. The experimental results show that both methods provide statistically significantly better results than the other cluster labeling approaches that we examine in this study.
      Keywords
      Cluster Labeling
      Data Fusion
      Wikipedia
      Permalink
      http://hdl.handle.net/11693/33553
      Collections
      • Dept. of Computer Engineering - Master's degree 566