A template-independent content extraction approach for new web pages

buir.advisor: Can, Fazlı
dc.contributor.author: Yeniçağ, Ahmet
dc.date.accessioned: 2016-01-08T18:24:49Z
dc.date.available: 2016-01-08T18:24:49Z
dc.date.issued: 2012
dc.department: Department of Computer Engineering [en_US]
dc.description: Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2012. [en_US]
dc.description: Thesis (Master's) -- Bilkent University, 2012. [en_US]
dc.description: Includes bibliographical references leaves 56-63. [en_US]
dc.description.abstract: News web pages contain additional elements such as advertisements, hyperlinks, and reader comments. These elements make the extraction of news contents a challenging task. Current news content extraction (NCE) methods are usually template-dependent. They require regular maintenance, since news providers frequently change their web page templates. Therefore, there is a need for NCE methods that extract news contents accurately without depending on web page templates. In this thesis, a template-independent News content EXTraction approach, called N-EXT, is introduced. It first parses a web page into its blocks according to the HTML tags. Then, it examines all blocks to detect the one that contains the major part of the news content. For this purpose, it assigns weights to the blocks by considering both their textual sizes and their similarities to the news title. To quantify the importance of these two weight components, we use k-fold cross-validation, and to assess the impact of different possible similarity measures, we use a one-way Analysis of Variance (ANOVA) with a Scheffé comparison. The block with the highest weight is considered the news block. Our approach eliminates the sentences in the news block that are not related to the news content by considering the similarities of sentences to the news block. Finally, it also examines other blocks to detect the rest of the news content. The experimental results, obtained on two test collections whose web pages come from several different news websites, show the accuracy and robustness of our method. [en_US]
dc.description.degree: M.S. [en_US]
dc.description.statementofresponsibility: Yeniçağ, Ahmet [en_US]
dc.format.extent: xiii, 82 leaves [en_US]
dc.identifier.itemid: B133866
dc.identifier.uri: http://hdl.handle.net/11693/15800
dc.language.iso: English [en_US]
dc.publisher: Bilkent University [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Information extraction [en_US]
dc.subject: News block detection (NBD) [en_US]
dc.subject: News content extraction (NCE) [en_US]
dc.subject: News portal [en_US]
dc.subject: Web information aggregators [en_US]
dc.subject.lcc: Z699 .Y46 2012 [en_US]
dc.subject.lcsh: Information storage and retrieval systems. [en_US]
dc.subject.lcsh: Computational linguistics. [en_US]
dc.subject.lcsh: Information retrieval. [en_US]
dc.subject.lcsh: Electronic data processing--Distributed processing. [en_US]
dc.subject.lcsh: Web publishing. [en_US]
dc.subject.lcsh: Online journalism. [en_US]
dc.subject.lcsh: World Wide Web. [en_US]
dc.title: A template-independent content extraction approach for new web pages [en_US]
dc.type: Thesis [en_US]
Files
Original bundle
Name: 0006505.pdf
Size: 1.88 MB
Format: Adobe Portable Document Format
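
Illustration: the abstract describes weighting each block by its textual size and its similarity to the news title, then taking the highest-weighted block as the news block. The following is a minimal sketch of that one step, not the thesis implementation. It assumes the page has already been split into text blocks, uses cosine similarity over word counts as a stand-in for the similarity measures the thesis compares, and introduces a hypothetical mixing parameter `alpha`; the function names are also hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_news_block(title, blocks, alpha=0.5):
    """Return (index of the highest-weighted block, list of all weights).

    alpha is an assumed mixing parameter: alpha weights the normalized
    textual size, (1 - alpha) weights the similarity to the title.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    title_tokens = tokenize(title)
    block_tokens = [tokenize(b) for b in blocks]
    # Normalize sizes by the largest block so both components lie in [0, 1].
    max_len = max(len(t) for t in block_tokens) or 1
    weights = [
        alpha * (len(t) / max_len)
        + (1 - alpha) * cosine_similarity(t, title_tokens)
        for t in block_tokens
    ]
    best = max(range(len(blocks)), key=weights.__getitem__)
    return best, weights


# Made-up example data, for illustration only.
title = "City council approves new park"
blocks = [
    "Home Sports Weather Contact",
    "The city council approved a new park on Monday after a long debate about the budget.",
    "Advertisement: buy cheap tickets now",
]
best, weights = pick_news_block(title, blocks)
print(best, weights)
```

In the thesis, the relative importance of the two components is tuned with k-fold cross-validation; here `alpha=0.5` is only a placeholder default.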