Cluster labeling improvement by utilizing data fusion and Wikipedia

Ayduğan, Gökçe

Files

gokce_tez.pdf (2.95 MB)

Date

2017-07

Authors

Ayduğan, Gökçe

Advisor

Can, Fazlı

Abstract

A cluster is a set of related documents. Cluster labeling is the process of assigning descriptive labels to clusters. This study investigates several cluster labeling approaches and presents novel methods. The first uses the clusters themselves and extracts important terms, which distinguish clusters from each other, with different statistical feature selection methods. It then applies different data fusion methods to combine their outcomes. Our results show that although this approach provides statistically significantly better results in some cases, it is not a stable and reliable labeling method. This can be explained by the fact that a good label may not occur in the cluster at all. The second exploits Wikipedia as an external resource and uses its anchor texts and categories to enrich the label pool. Labeling with Wikipedia anchor text fails because the suggested labels tend to focus on minor topics. Although the minor topics are related to the main topic, they do not exactly describe it. After this observation, we use the categories of Wikipedia pages to improve our label pool in two ways. The first fuses important terms and Wikipedia categories with rank-based fusion methods. The second looks at the relatedness of Wikipedia pages to the clusters and uses only the categories of related pages. The experimental results show that both methods provide statistically significantly better results than the other cluster labeling approaches that we examine in this study.
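
To make the idea of rank-based fusion concrete, the following is a minimal sketch of one common rank-based scheme, a Borda count, applied to two ranked candidate lists (important terms and Wikipedia categories). The abstract does not say which fusion methods the thesis uses, so the choice of Borda count, the function name `borda_fuse`, and the sample labels are all assumptions made for illustration only.

```python
from collections import defaultdict


def borda_fuse(ranked_lists, top_k=5):
    """Fuse ranked label lists with a Borda count (illustrative choice,
    not necessarily the fusion method used in the thesis)."""
    # Candidate pool: every label that appears in at least one list.
    pool = {label for ranking in ranked_lists for label in ranking}
    n = len(pool)

    scores = defaultdict(float)
    for ranking in ranked_lists:
        # The top-ranked label gets n points, the next n - 1, and so on.
        # Labels missing from a list get no points from that list.
        for position, label in enumerate(ranking):
            scores[label] += n - position

    # Highest score first; ties are broken alphabetically so the
    # result is deterministic.
    fused = sorted(pool, key=lambda label: (-scores[label], label))
    return fused[:top_k]


# Hypothetical candidate labels for a single cluster.
important_terms = ["football", "league", "goal"]
wikipedia_categories = ["sports", "football", "league"]

print(borda_fuse([important_terms, wikipedia_categories], top_k=3))
```

In this sketch a label that ranks well in both lists ("football") rises above one that tops only a single list ("sports"), which is the general behavior rank-based fusion is meant to provide.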

Keywords

Cluster Labeling, Data Fusion, Wikipedia

Degree Discipline

Computer Engineering

Degree Level

Master's

Degree Name

MS (Master of Science)

Permalink

http://hdl.handle.net/11693/33553

Collections

Graduate School of Engineering and Science

Language

English

Type

Thesis
