Categorization in a hierarchically structured text database

buir.supervisor: Güvenir, H. Altay
dc.contributor.author: Kutlu, Ferhat
dc.date.accessioned: 2016-01-08T18:06:48Z
dc.date.available: 2016-01-08T18:06:48Z
dc.date.issued: 2001
dc.description: Cataloged from PDF version of article. [en_US]
dc.description: Includes bibliographical references (leaves 63-66). [en_US]
dc.description.abstract: Over the past two decades, the amount of data stored in databases and the volume of data flowing on-line have increased enormously, driven by improvements in the Internet. This growth has created a need for intelligent tools to manage data of that size and its flow. A hierarchical approach is the best way to satisfy these needs, and it is widespread among people who work with databases and the Internet. The Usenet newsgroup system is one of the on-line databases that have a built-in hierarchical structure. Our point of departure is this hierarchical structure, which makes categorization tasks easier and faster. In fact, most Internet search engines also exploit the inherent hierarchy of the Internet. The growing size of data makes most traditional categorization algorithms obsolete. We therefore developed a new categorization learning algorithm that constructs an index tree from the Usenet news database and then determines the newsgroups related to a new news posting by categorizing it over the index tree. In the learning phase it takes an agglomerative, bottom-up hierarchical approach. In the categorization phase it performs overlapping, supervised categorization. The k Nearest Neighbor categorization algorithm is used as a baseline against which the complexity and accuracy of our algorithm are compared. This comparison is not only between two algorithms; it also contrasts the hierarchical approach with the flat approach, similarity measures with distance measures, and the importance of accuracy with the importance of speed. Our algorithm uses the hierarchical approach and a similarity measure, and it greatly outperforms the k Nearest Neighbor categorization algorithm in speed with minimal loss of accuracy.
dc.description.provenance: Made available in DSpace on 2016-01-08T18:06:48Z (GMT). No. of bitstreams: 1 0001610.pdf: 864875 bytes, checksum: 970ddc4c6a175ddd021558aea4e31031 (MD5) [en]
dc.description.statementofresponsibility: by Ferhat Kutlu [en_US]
dc.format.extent: 66 leaves ; 30 cm. [en_US]
dc.identifier.itemid: B056068
dc.identifier.uri: http://hdl.handle.net/11693/14741
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Learning
dc.subject: Categorization
dc.subject: Clustering
dc.subject: Usenet
dc.subject: Newsgroup
dc.subject: Top-level
dc.subject: Header-line
dc.subject: Posting
dc.subject: Frequency
dc.subject: Norm-scaling
dc.subject: Similarity measure
dc.subject: Distance measure
dc.subject: Agglomerative
dc.subject: Bottom-up
dc.subject: Stemming
dc.subject: Stopword
dc.subject: Index
dc.title: Categorization in a hierarchically structured text database [en_US]
dc.title.alternative: Hiyerarşik yapıda olan bir veritabanının kategorizasyonu
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Master's
thesis.degree.name: MS (Master of Science)

Files

Original bundle

Name: 0001610.pdf
Size: 844.6 KB
Format: Adobe Portable Document Format