A template-independent content extraction approach for new web pages

Date

2012

Editor(s)

Advisor

Can, Fazlı

Supervisor

Co-Advisor

Co-Supervisor

Instructor

Source Title

Print ISSN

Electronic ISSN

Publisher

Volume

Issue

Pages

Language

English

Journal Title

Journal ISSN

Volume Title

Series

Abstract

News web pages contain additional elements such as advertisements, hyperlinks, and reader comments. These elements make the extraction of news contents a challenging task. Current news content extraction (NCE) methods are usually template-dependent. They require regular maintenance, since news providers frequently change their web page templates. Therefore, there is a need for NCE methods that extract news contents accurately without depending on web page templates. In this thesis, a template-independent News content EXTraction approach, called N-EXT, is introduced. It rst parses a web page into its blocks according to the HTML tags. Then, it examines all blocks to detect the one that contains the major part of the news content. For this purpose, it assigns weights to the blocks by considering both their textual sizes and similarities to the news title. For quantifying the importance of these two weight components, we use the k-fold cross validation approach; and for assessing the impact of di erent possible similarity measures, we use a one-way Analysis of Variance (ANOVA) with a Sche e comparison. The block with the highest weight is considered as the news block. Our approach eliminates the sentences in the news block that are not related to the news content by considering similarities of sentences to the news block. Finally, it also examines other blocks to detect the rest of the news content. The experimental results show the accuracy and robustness of our method by using two test collections whose web pages are obtained from several di erent news websites.

Course

Other identifiers

Book Title

Degree Discipline

Computer Engineering

Degree Level

Master's

Degree Name

MS (Master of Science)

Citation

Published Version (Please cite this version)