A template-independent content extraction approach for news web pages
Author
Yeniçağ, Ahmet
Advisor
Can, Fazlı
Date
2012
Publisher
Bilkent University
Language
English
Type
Thesis
Item Usage Stats
80 views
143 downloads
Abstract
News web pages contain additional elements such as advertisements, hyperlinks,
and reader comments. These elements make the extraction of news contents a
challenging task. Current news content extraction (NCE) methods are usually
template-dependent. They require regular maintenance, since news providers
frequently change their web page templates. Therefore, there is a need for NCE
methods that extract news contents accurately without depending on web page
templates. In this thesis, a template-independent News content EXTraction approach,
called N-EXT, is introduced. It first parses a web page into its blocks
according to the HTML tags. Then, it examines all blocks to detect the one that
contains the major part of the news content. For this purpose, it assigns weights
to the blocks by considering both their textual sizes and similarities to the news
title. For quantifying the importance of these two weight components, we use
the k-fold cross-validation approach; and for assessing the impact of different
possible similarity measures, we use a one-way Analysis of Variance (ANOVA)
with a Scheffé comparison. The block with the highest weight is considered as
the news block. Our approach eliminates the sentences in the news block that
are not related to the news content by considering similarities of sentences to
the news block. Finally, it also examines other blocks to detect the rest of the
news content. The experimental results show the accuracy and robustness of our
method by using two test collections whose web pages are obtained from several
different news websites.
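
The block-weighting and sentence-filtering steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual weighting formula, similarity measure, or thresholds used by N-EXT, so everything here is an assumption made for illustration: a linear combination controlled by a hypothetical parameter `alpha`, cosine similarity over raw term frequencies, a hypothetical sentence `threshold`, and blocks that are already available as plain-text strings (the HTML parsing step is omitted).

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def block_weights(title, blocks, alpha=0.5):
    """Weight each block by relative text size and similarity to the title.

    Assumption: the two components are combined linearly; `alpha` is a
    hypothetical mixing parameter, not a value taken from the thesis.
    """
    total = sum(len(tokens(b)) for b in blocks) or 1
    return [
        alpha * len(tokens(b)) / total + (1 - alpha) * cosine(title, b)
        for b in blocks
    ]


def detect_news_block(title, blocks, alpha=0.5):
    """Return the index of the block with the highest weight."""
    weights = block_weights(title, blocks, alpha)
    return max(range(len(blocks)), key=weights.__getitem__)


def filter_sentences(block, threshold=0.1):
    """Keep sentences whose similarity to the whole block meets the threshold.

    Assumption: sentences are split on ., ! or ? followed by whitespace,
    and `threshold` is a hypothetical cut-off.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", block) if s.strip()]
    return [s for s in sentences if cosine(s, block) >= threshold]
```

In this sketch, `alpha` plays the role of the relative importance of the two weight components, which the abstract says is quantified with k-fold cross-validation; a value could be chosen by evaluating `detect_news_block` on labeled pages for several candidate settings.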
Keywords
Information extraction
News block detection (NBD)
News content extraction (NCE)
News portal
Web information aggregators