Enhancing feature selection with contextual relatedness filtering using Wikipedia
Author(s)
Advisor
Can, Fazlı
Date
2017-08
Publisher
Bilkent University
Language
English
Type
Thesis
Item Usage Stats
155 views
122 downloads
Abstract
Feature selection is an important component of information retrieval and natural
language processing applications. It is used to extract distinguishing terms
for a group of documents; such terms, for example, can be used for clustering,
multi-document summarization and classification. The selected features are not
always the best representatives of the documents due to some noisy terms. Addressing
this issue, our contribution is twofold. First, we present a novel approach
that filters the noisy, unrelated terms out of the feature lists, using the
contextual relatedness of terms to their topics, in order to enhance
the feature set quality. Second, we propose a new method to assess the contextual
relatedness of terms to the topic of their documents. Our approach automatically
decides the contextual relatedness of a term to the topic of a set of documents
using co-occurrences with the distinguishing terms of the document set inside an
external knowledge source (Wikipedia, in our work). Deletion of unrelated terms
from the feature lists gives a better, more related set of features. We evaluate
our approach on the cluster labeling problem, where feature sets for clusters can be
used as label candidates. We work on the commonly used 20NG and ODP datasets
for the cluster labeling problem and find that our approach successfully detects
the relevance of terms to topics, and that filtering out irrelevant label candidates
results in significantly improved cluster labeling quality.
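
The filtering idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the scoring function (the share of articles mentioning a term that also mention a distinguishing topic term), the 0.5 threshold, the helper names, and the five-sentence toy corpus standing in for Wikipedia. The abstract does not specify the actual relatedness measure used in the thesis, so this should not be read as the author's method.

```python
from __future__ import annotations

from typing import Iterable


def tokenize(text: str) -> set[str]:
    """Lowercase a text and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def relatedness(term: str, topic_terms: Iterable[str], articles: list[set[str]]) -> float:
    """Share of articles containing `term` that also contain another topic term.

    Hypothetical scoring function, used here only to illustrate
    co-occurrence-based relatedness.
    """
    topic = set(topic_terms) - {term}  # a term should not vouch for itself
    containing = [a for a in articles if term in a]
    if not containing:
        return 0.0
    hits = sum(1 for a in containing if a & topic)
    return hits / len(containing)


def filter_features(features: list[str], topic_terms: Iterable[str],
                    articles: list[set[str]], threshold: float = 0.5) -> list[str]:
    """Keep only the features whose relatedness reaches the (assumed) threshold."""
    topic = list(topic_terms)
    return [f for f in features if relatedness(f, topic, articles) >= threshold]


# Toy stand-in for an external knowledge source such as Wikipedia.
ARTICLES = [tokenize(t) for t in [
    "the pitcher threw the baseball to the catcher",
    "a baseball team needs a pitcher and an inning plan",
    "the inning ended when the catcher caught the ball",
    "please reply to this email thread",
    "the email thread had one reply",
]]

TOPIC_TERMS = ["baseball", "pitcher", "inning"]                   # distinguishing terms
FEATURES = ["baseball", "pitcher", "catcher", "reply", "thread"]  # candidate labels

kept = filter_features(FEATURES, TOPIC_TERMS, ARTICLES)
print(kept)
```

In this toy setting, "reply" and "thread" never co-occur with a distinguishing term and are dropped, while "catcher" is kept because it appears alongside "baseball" or "inning". A real system would replace the toy corpus with counts drawn from Wikipedia and would need a tuned threshold.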