Measuring and improving interpretability of word embeddings using lexical resources

buir.advisorÇukur, Tolga
dc.contributor.authorŞenel, Lütfi Kerem
dc.date.accessioned2019-08-29T05:52:24Z
dc.date.available2019-08-29T05:52:24Z
dc.date.copyright2019-08
dc.date.issued2019-08
dc.date.submitted2019-08-27
dc.departmentDepartment of Electrical and Electronics Engineeringen_US
dc.descriptionCataloged from PDF version of article.en_US
dc.descriptionThesis (M.S.): Bilkent University, Department of Electrical and Electronics Engineering, İhsan Doğramacı Bilkent University, 2019.en_US
dc.descriptionIncludes bibliographical references (leaves 76-82).en_US
dc.description.abstractA ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into dense vector representations. They have become increasingly popular due to their state-of-the-art performance in many natural language processing (NLP) tasks. Word embeddings are substantially successful in capturing semantic relations among words, so a meaningful semantic structure must be present in the respective vector spaces. However, in many cases, this semantic structure is broadly and heterogeneously distributed across the embedding dimensions. In other words, vectors corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute meaning, making interpretation of dimensions a major challenge. We propose a statistical method to uncover the underlying latent semantic structure in dense word embeddings. To perform our analysis, we introduce a new dataset (SEMCAT) that contains more than 6,500 words semantically grouped under 110 categories. We further propose a method to quantify the interpretability of word embeddings that is a practical alternative to the classical word intrusion test, which requires human intervention. Moreover, in order to improve the interpretability of word embeddings while leaving the original semantic learning mechanism mostly unaffected, we introduce an additive modification to the objective function of the embedding learning algorithm, GloVe, that promotes the vectors of words that are semantically related to a predefined concept to take larger values along a specified dimension. We use Roget's Thesaurus to extract concept groups and align the words in these groups with embedding dimensions using the modified objective function. By performing detailed evaluations, we show that the proposed method improves interpretability drastically while preserving the semantic structure. We also demonstrate that the imparting method with suitable concept groups can be used to significantly improve performance on benchmark tests and to measure and reduce gender bias present in the word embeddings.en_US
dc.description.degreeM.S.en_US
dc.description.provenanceSubmitted by Betül Özen (ozen@bilkent.edu.tr) on 2019-08-29T05:52:24Z No. of bitstreams: 1 MS_Thesis_Lutfi_Kerem_Senel.pdf: 2198279 bytes, checksum: 8e9d21783b572b020fca04f2ce8d32c5 (MD5)en
dc.description.provenanceMade available in DSpace on 2019-08-29T05:52:24Z (GMT). No. of bitstreams: 1 MS_Thesis_Lutfi_Kerem_Senel.pdf: 2198279 bytes, checksum: 8e9d21783b572b020fca04f2ce8d32c5 (MD5) Previous issue date: 2019-08en
dc.description.statementofresponsibilityby Lütfi Kerem Şenelen_US
dc.embargo.release2020-02-19
dc.format.extentxv, 82 leaves : charts (some colour) ; 30 cm.en_US
dc.identifier.itemidB109837
dc.identifier.urihttp://hdl.handle.net/11693/52377
dc.language.isoEnglishen_US
dc.publisherBilkent Universityen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectWord embeddingsen_US
dc.subjectInterpretabilityen_US
dc.subjectSemanticsen_US
dc.titleMeasuring and improving interpretability of word embeddings using lexical resourcesen_US
dc.title.alternativeSözcüksel kaynaklar kullanarak kelime temsillerinin yorumlanabilirliklerinin ölçülmesi ve iyileştirilmesien_US
dc.typeThesisen_US

Files

Original bundle
Name: MS_Thesis_Lutfi_Kerem_Senel.pdf
Size: 2.1 MB
Format: Adobe Portable Document Format
Description: Full printable version
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: