A new representation for matching words

Ataer, Esra

buir.advisor	Şahin, Pınar Duygulu
dc.contributor.author	Ataer, Esra
dc.date.accessioned	2016-01-08T18:02:51Z
dc.date.available	2016-01-08T18:02:51Z
dc.date.issued	2007
dc.description	Cataloged from PDF version of article.	en_US
dc.description	Includes bibliographical references (leaves 77-82).	en_US
dc.description.abstract	Large archives of historical documents pose a challenge to researchers all over the world. These archives remain largely inaccessible because manual indexing and transcription of such a huge volume is difficult. In addition, electronic imaging tools and image processing techniques are gaining importance with the rapid increase in the digitization of materials in libraries and archives. In this thesis, a language-independent method is proposed for the representation of word images, which enables retrieval and indexing of documents. While character recognition methods suffer from preprocessing and overtraining, we make use of another method, which is based on extracting words from documents and representing each word image with the features of invariant regions. The bag-of-words approach, which has been shown to be successful in classifying objects and scenes, is adapted for matching words. Since curvature points, connection points, and dots are important visual features for distinguishing two words from each other, we make use of salient points, which have been shown to be successful in representing such distinctive areas and are heavily used for matching. The Difference of Gaussians (DoG) detector, which finds scale-invariant regions, and the Harris Affine detector, which detects affine-invariant regions, are used to detect such areas, and the detected keypoints are described with Scale Invariant Feature Transform (SIFT) features. Then, each word image is represented by a set of visual terms obtained by vector quantization of the SIFT descriptors, and similar words are matched based on the similarity of these representations using different distance measures. These representations are used both for document retrieval and for word spotting. The experiments are carried out on Arabic, Latin and Ottoman datasets, which include different writing styles and different writers. 
The results show that the proposed method is successful in retrieval and indexing of documents, even across different scripts and different writers, and, since it is language independent, it can be easily adapted to other languages as well. The retrieval performance of the system is comparable to that of state-of-the-art methods in this field. In addition, the system is successful at capturing semantic similarities, which is useful for indexing, and it does not include any supervised step.	en_US
dc.description.statementofresponsibility	Ataer, Esra	en_US
dc.format.extent	xiv, 82 leaves	en_US
dc.identifier.itemid	BILKUTUPB103929
dc.identifier.uri	http://hdl.handle.net/11693/14602
dc.language.iso	English	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Word matching	en_US
dc.subject	Document retrieval	en_US
dc.subject	Bag-of-features	en_US
dc.subject.lcc	CD974.4 .A83 2007	en_US
dc.subject.lcsh	Electronic records.	en_US
dc.subject.lcsh	Archives--Data processing.	en_US
dc.subject.lcsh	Information retrieval.	en_US
dc.title	A new representation for matching words	en_US
dc.type	Thesis	en_US
thesis.degree.discipline	Computer Engineering
thesis.degree.grantor	Bilkent University
thesis.degree.level	Master's
thesis.degree.name	MS (Master of Science)
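
The pipeline summarized in the abstract (local descriptors, vector quantization into visual terms, comparison of the resulting histograms) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it uses NumPy only, substitutes randomly generated 128-dimensional vectors for real SIFT descriptors, uses plain k-means as the vector quantizer, and uses cosine distance as one possible distance measure. All function names and parameter values are hypothetical.

```python
import numpy as np


def build_codebook(descriptors, k, iters=20, seed=0):
    """Vector quantization by plain k-means (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialize the k centers from k distinct descriptors.
    idx = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[idx].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every descriptor to every center.
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return centers


def word_histogram(descriptors, centers):
    """Represent one word image as a normalized histogram of visual terms."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()


def cosine_distance(a, b):
    """One possible distance measure between two histograms."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def fake_descriptors(prototypes, counts, rng):
    """Stand-in for SIFT: noisy copies of a few prototype vectors."""
    dim = prototypes.shape[1]
    parts = [
        prototypes[t] + rng.normal(size=(n, dim))
        for t, n in enumerate(counts)
        if n > 0
    ]
    return np.vstack(parts)


rng = np.random.default_rng(42)
# Four synthetic "keypoint types", well separated in descriptor space.
prototypes = rng.normal(size=(4, 128)) * 10.0

# Two instances of the same word (e.g. different writers) and a different word.
word_a1 = fake_descriptors(prototypes, [20, 20, 0, 0], rng)
word_a2 = fake_descriptors(prototypes, [18, 22, 0, 0], rng)
word_b = fake_descriptors(prototypes, [0, 0, 20, 20], rng)

# The codebook is learned from the pooled descriptors of all word images.
codebook = build_codebook(np.vstack([word_a1, word_a2, word_b]), k=4)

hist_a1 = word_histogram(word_a1, codebook)
hist_a2 = word_histogram(word_a2, codebook)
hist_b = word_histogram(word_b, codebook)

d_same = cosine_distance(hist_a1, hist_a2)
d_diff = cosine_distance(hist_a1, hist_b)
print("same word:", d_same, "different word:", d_diff)
```

In this synthetic setup the two instances of the same word end up closer to each other than to the different word. With real data, the descriptors would come from a keypoint detector and SIFT descriptor run over each segmented word image, and the codebook size would need to be tuned.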

Files

Original bundle

Name: 0003435.pdf
Size: 3.18 MB
Format: Adobe Portable Document Format

Collections

Graduate School of Engineering and Science