A template-independent content extraction approach for news web pages

Yeniçağ, Ahmet

buir.advisor	Can, Fazlı
dc.contributor.author	Yeniçağ, Ahmet
dc.date.accessioned	2016-01-08T18:24:49Z
dc.date.available	2016-01-08T18:24:49Z
dc.date.issued	2012
dc.description	Cataloged from PDF version of article.	en_US
dc.description	Includes bibliographical references (leaves 56-63).	en_US
dc.description.abstract	News web pages contain additional elements such as advertisements, hyperlinks, and reader comments. These elements make the extraction of news contents a challenging task. Current news content extraction (NCE) methods are usually template-dependent. They require regular maintenance, since news providers frequently change their web page templates. Therefore, there is a need for NCE methods that extract news contents accurately without depending on web page templates. In this thesis, a template-independent News content EXTraction approach, called N-EXT, is introduced. It first parses a web page into its blocks according to the HTML tags. Then, it examines all blocks to detect the one that contains the major part of the news content. For this purpose, it assigns weights to the blocks by considering both their textual sizes and similarities to the news title. For quantifying the importance of these two weight components, we use the k-fold cross-validation approach; and for assessing the impact of different possible similarity measures, we use a one-way Analysis of Variance (ANOVA) with a Scheffé comparison. The block with the highest weight is considered as the news block. Our approach eliminates the sentences in the news block that are not related to the news content by considering similarities of sentences to the news block. Finally, it also examines other blocks to detect the rest of the news content. The experimental results show the accuracy and robustness of our method by using two test collections whose web pages are obtained from several different news websites.	en_US
dc.description.statementofresponsibility	Yeniçağ, Ahmet	en_US
dc.format.extent	xiii, 82 leaves	en_US
dc.identifier.itemid	B133866
dc.identifier.uri	http://hdl.handle.net/11693/15800
dc.language.iso	English	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Information extraction	en_US
dc.subject	News block detection (NBD)	en_US
dc.subject	News content extraction (NCE)	en_US
dc.subject	News portal	en_US
dc.subject	Web information aggregators	en_US
dc.subject.lcc	Z699 .Y46 2012	en_US
dc.subject.lcsh	Information storage and retrieval systems.	en_US
dc.subject.lcsh	Computational linguistics.	en_US
dc.subject.lcsh	Information retrieval.	en_US
dc.subject.lcsh	Electronic data processing--Distributed processing.	en_US
dc.subject.lcsh	Web publishing.	en_US
dc.subject.lcsh	Online journalism.	en_US
dc.subject.lcsh	World Wide Web.	en_US
dc.title	A template-independent content extraction approach for news web pages	en_US
dc.type	Thesis	en_US
thesis.degree.discipline	Computer Engineering
thesis.degree.grantor	Bilkent University
thesis.degree.level	Master's
thesis.degree.name	MS (Master of Science)

Files

Original bundle

Name: 0006505.pdf
Size: 1.88 MB
Format: Adobe Portable Document Format

Collections

Graduate School of Engineering and Science