Browsing by Subject "Word embeddings"
Now showing 1 - 8 of 8
Item Open Access
Analysis of gender bias in legal texts using natural language processing methods (2023-07)
Sevim, Nurullah
Word embeddings have become important building blocks that are used profoundly in natural language processing (NLP). Despite their several advantages, word embeddings can unintentionally encode gender- and ethnicity-based biases that are present in the corpora they are trained on. Ethical concerns have therefore been raised, since word embeddings are extensively used in several high-level algorithms. Furthermore, transformer-based contextualized language models constitute the state of the art in several NLP tasks and applications. Despite their utility, contextualized models can contain human-like social biases, as their training corpora generally consist of human-generated text. Evaluating and removing social biases in NLP models has been an ongoing and prominent research endeavor. In parallel, NLP approaches in the legal area, namely legal NLP or computational law, have also been increasing recently. Eliminating unwanted bias in the legal domain is doubly crucial, since the law has the utmost importance and effect on people. We approach the gender bias problem within the scope of legal text processing. In the first stage of our study, we focus on gender bias in traditional word embeddings, such as Word2Vec and GloVe. Word embedding models trained on corpora composed of legal documents and legislation from different countries are utilized to measure and eliminate gender bias in legal documents. Several methods are employed to reveal the degree of gender bias and to observe its variation across countries. Moreover, a debiasing method is used to neutralize unwanted bias. The preservation of the semantic coherence of the debiased vector space is also demonstrated using high-level tasks. In the second stage, we study the gender bias encoded in BERT-based models. We propose a new template-based bias measurement method with a bias evaluation corpus built from crime words in the FBI database. This method quantifies the gender bias present in BERT-based models for legal applications. Furthermore, we propose a fine-tuning-based debiasing method using the European Court of Human Rights (ECtHR) corpus to debias legal pre-trained models. We test the debiased models on the LexGLUE benchmark to confirm that the underlying semantic vector space is not perturbed during the debiasing process. Finally, the overall results and their implications are discussed within the scope of NLP in the legal domain.
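The thesis abstract above describes measuring and neutralizing gender bias in static legal-domain embeddings without spelling out the exact metric or debiasing step. As a rough illustration of the standard technique family it invokes, the numpy sketch below scores bias as the projection of a word vector onto a he-she gender direction and neutralizes it by removing that component (in the spirit of Bolukbasi et al.'s hard debiasing); the vectors and word lists are toy assumptions, not the thesis's data.

```python
# Minimal sketch, assuming projection-based bias measurement and
# hard-debiasing-style neutralization; the thesis's actual procedure
# is not reproduced here. Toy random vectors stand in for Word2Vec/
# GloVe embeddings trained on legal corpora.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
words = ["he", "she", "judge", "theft", "nurse"]  # hypothetical word list
emb = {w: rng.normal(size=dim) for w in words}

def unit(v):
    return v / np.linalg.norm(v)

# Gender direction from a definitional pair.
g = unit(emb["he"] - emb["she"])

def bias(word):
    """Signed cosine of a word with the gender direction."""
    return float(unit(emb[word]) @ g)

def neutralize(word):
    """Remove the component of a word vector along the gender direction."""
    v = emb[word]
    return v - (v @ g) * g

for w in ["judge", "theft", "nurse"]:
    after = float(unit(neutralize(w)) @ g)  # ~0 after neutralization
    print(f"{w}: bias before = {bias(w):+.3f}, after = {after:+.3f}")
```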
Item Open Access
Imparting interpretability to word embeddings while preserving semantic structure (Cambridge University Press, 2020)
Şenel, L. K.; Utlu, İhsan; Şahinuç, Furkan; Özaktaş, Haldun M.; Koç, Aykut
As a ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into a dense vector representation. They capture semantic and syntactic relations among words, but the vectors corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute, interpretable meaning. We introduce an additive modification to the objective function of the embedding learning algorithm that encourages the embedding vectors of words that are semantically related to a predefined concept to take larger values along a specified dimension, while leaving the original semantic learning mechanism mostly unaffected. In other words, we align words that are already determined to be related along predefined concepts. We thereby impart interpretability to the word embedding by assigning meaning to its vector dimensions. The predefined concepts are derived from an external lexical resource, which in this paper is chosen as Roget's Thesaurus. We observe that alignment along the chosen concepts is not limited to words in the thesaurus and extends to other related words as well. We quantify the extent of interpretability and assignment of meaning from our experimental results. Manual human evaluation results are also presented to further verify that the proposed method increases interpretability. We also demonstrate the preservation of the semantic coherence of the resulting vector space using word-analogy/word-similarity tests and a downstream task. These tests show that the interpretability-imparted word embeddings obtained by the proposed framework do not sacrifice performance on common benchmark tests.

Item Open Access
Learning interpretable word embeddings via bidirectional alignment of dimensions with semantic concepts (Elsevier Ltd, 2022-03-22)
Şenel, L. K.; Şahinuç, Furkan; Yücesoy, V.; Schütze, H.; Çukur, Tolga; Koç, Aykut
We propose bidirectional imparting, or BiImp, a generalized method for aligning embedding dimensions with concepts during the embedding learning phase. While preserving the semantic structure of the embedding space, BiImp makes dimensions interpretable, which plays a critical role in deciphering the black-box behavior of word embeddings. BiImp separately utilizes both directions of a vector-space dimension: each direction can be assigned to a different concept. This increases the number of concepts that can be represented in the embedding space. Our experimental results demonstrate the interpretability of BiImp embeddings without compromising semantic task performance. We also use BiImp to reduce gender bias in word embeddings by encoding gender-opposite concepts (e.g., male-female) in a single embedding dimension. These results highlight the potential of BiImp for reducing biases and stereotypes present in word embeddings. Furthermore, task- or domain-specific interpretable word embeddings can be obtained by adjusting the word groups assigned to embedding dimensions according to the task or domain. As a result, BiImp offers wide latitude in studying word embeddings without any further effort.
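The two entries above both align embedding dimensions with predefined concepts through an additive term in the embedding objective, with BiImp additionally assigning the two signs of a dimension to opposite concepts. The abstracts do not reproduce the objective itself, so the sketch below only illustrates the general shape such an alignment term could take; the function, weighting, and word groups are hypothetical.

```python
# Illustrative sketch of an additive dimension-concept alignment term,
# not the published objective of either paper. It would be added to a
# GloVe-style loss during training.
import numpy as np

def alignment_term(W, vocab, concept_dims, strength=0.1):
    """Penalty that shrinks as concept words grow along their assigned
    dimension and direction.

    W            : (V, d) embedding matrix being learned
    vocab        : word -> row index
    concept_dims : concept -> (dim index, sign, word group); sign=-1
                   lets the negative direction host a second concept,
                   as in BiImp's bidirectional alignment.
    """
    penalty = 0.0
    for _, (dim, sign, group) in concept_dims.items():
        for w in group:
            if w in vocab:
                # Reward (negative penalty) alignment with the concept's
                # direction on its dedicated dimension.
                penalty -= strength * sign * W[vocab[w], dim]
    return penalty

# Toy usage: dimension 0 encodes "male" on + and "female" on -.
rng = np.random.default_rng(1)
vocab = {"king": 0, "queen": 1, "apple": 2}
W = rng.normal(size=(3, 8))
concepts = {"male": (0, +1, ["king"]), "female": (0, -1, ["queen"])}
print(alignment_term(W, vocab, concepts))
```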
Item Unknown
Measuring and improving interpretability of word embeddings using lexical resources (2019-08)
Şenel, Lütfi Kerem
As a ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into dense vector representations. They have become increasingly popular due to their state-of-the-art performance in many natural language processing (NLP) tasks. Word embeddings are substantially successful in capturing semantic relations among words, so a meaningful semantic structure must be present in the respective vector spaces. However, in many cases, this semantic structure is broadly and heterogeneously distributed across the embedding dimensions. In other words, vectors corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute meaning, making interpretation of dimensions a big challenge. We propose a statistical method to uncover the underlying latent semantic structure in dense word embeddings. To perform our analysis, we introduce a new dataset (SEMCAT) that contains more than 6,500 words semantically grouped under 110 categories. We further propose a method to quantify the interpretability of word embeddings that is a practical alternative to the classical word intrusion test, which requires human intervention. Moreover, to improve the interpretability of word embeddings while leaving the original semantic learning mechanism mostly unaffected, we introduce an additive modification to the objective function of the embedding learning algorithm, GloVe, that promotes the vectors of words semantically related to a predefined concept to take larger values along a specified dimension. We use Roget's Thesaurus to extract concept groups and align the words in these groups with embedding dimensions using the modified objective function. Through detailed evaluations, we show that the proposed method improves interpretability drastically while preserving the semantic structure. We also demonstrate that the imparting method, with suitable concept groups, can be used to significantly improve performance on benchmark tests and to measure and reduce the gender bias present in word embeddings.

Item Unknown
Semantic change detection with Gaussian word embeddings (IEEE, 2021-10-20)
Yüksel, Arda; Uğurlu, Berke; Koç, Aykut
The diachronic study of the evolution of languages is of importance in natural language processing (NLP). Recent years have witnessed a surge of computational approaches for the detection and characterization of lexical semantic change (LSC), owing to the availability of diachronic corpora and advances in word representation techniques. We propose a Gaussian word embedding (w2g)-based method and present a comprehensive study of LSC detection. W2g is a probabilistic, distribution-based word embedding model that represents words as Gaussian mixtures, using covariance information along with the existing means (word vectors). We extensively study several aspects of w2g-based LSC detection under the SemEval-2020 Task 1 evaluation framework, as well as on the Google N-gram corpus. In Sub-task 1 (LSC binary classification) of SemEval-2020 Task 1, we report the highest overall ranking as well as the highest ranks for two (German and Swedish) of the four languages (English, Swedish, German, and Latin). We also report the highest Spearman correlation in Sub-task 2 (LSC ranking) for Swedish. Our overall rankings in the LSC classification and ranking sub-tasks are 1st and 7th, respectively. A qualitative analysis is also presented.
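The semantic change entry above represents words as probability distributions rather than points, but the abstract does not give its scoring function. As a simplified, assumed illustration, the sketch below models a word in two time periods as diagonal Gaussians and scores change with a symmetrized KL divergence; the actual w2g model uses Gaussian mixtures and the paper's own ranking procedure.

```python
# Simplified sketch: one diagonal Gaussian per period instead of the
# paper's Gaussian mixtures; symmetrized KL is one plausible change
# score, not necessarily the one used in the paper.
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, diag var0) || N(mu1, diag var1) ) in closed form."""
    d = mu0.shape[0]
    return 0.5 * (np.sum(var0 / var1)
                  + np.sum((mu1 - mu0) ** 2 / var1)
                  - d
                  + np.sum(np.log(var1 / var0)))

def change_score(mu0, var0, mu1, var1):
    """Symmetrized KL (Jeffreys divergence) between the two periods."""
    return (kl_diag_gauss(mu0, var0, mu1, var1)
            + kl_diag_gauss(mu1, var1, mu0, var0))

# Toy example: a word whose mean drifts and variance grows over a century.
rng = np.random.default_rng(2)
d = 20
mu_1900, var_1900 = rng.normal(size=d), np.full(d, 0.5)
mu_2000, var_2000 = mu_1900 + 0.8, np.full(d, 1.0)
print(change_score(mu_1900, var_1900, mu_2000, var_2000))
# For SemEval-2020 Task 1, Sub-task 2 would rank words by such a score
# and Sub-task 1 would threshold it into changed/unchanged.
```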
Item Unknown
Semantic similarity between Turkish and European languages using word embeddings (IEEE, 2017)
Şenel, Lütfü Kerem; Yücesoy, V.; Koç, A.; Çukur, Tolga
The representation of the words in a language's vocabulary as real vectors in a high-dimensional space is called a word embedding. Word embeddings have proven successful in modeling semantic relations between words and in numerous natural language processing applications. Although developed mainly for English, word embeddings perform well for many other languages. In this study, the semantic similarity between Turkish (two different corpora) and five major European languages (English, German, French, Spanish, and Italian) is calculated using word embeddings over a fixed vocabulary, and the obtained results are verified using statistical testing. The effects of using different corpora and of additional preprocessing steps on the performance of word embeddings are also studied on similarity and analogy test sets prepared for Turkish.

Item Unknown
Semantic structure and interpretability of word embeddings (Institute of Electrical and Electronics Engineers, 2018)
Şenel, Lütfi Kerem; Utlu, İhsan; Yücesoy, Veysel; Koç, Aykut; Çukur, Tolga
Dense word embeddings, which encode the meanings of words in low-dimensional vector spaces, have become very popular in natural language processing (NLP) research due to their state-of-the-art performance in many NLP tasks. Word embeddings are substantially successful in capturing semantic relations among words, so a meaningful semantic structure must be present in the respective vector spaces. However, in many cases this semantic structure is broadly and heterogeneously distributed across the embedding dimensions, making interpretation of the dimensions a big challenge. In this study, we propose a statistical method to uncover the underlying latent semantic structure in dense word embeddings. To perform our analysis, we introduce a new dataset (SEMCAT) that contains more than 6,500 words semantically grouped under 110 categories. We further propose a method to quantify the interpretability of word embeddings. The proposed method is a practical alternative to the classical word intrusion test, which requires human intervention.

Item Unknown
Türkçe kelime temsillerinde cinsiyetçi ön yargının incelenmesi [Investigation of gender bias in Turkish word embeddings] (IEEE, 2021-07-19)
Sevim, Nurullah; Koç, Aykut
The study of gender bias in natural language processing applications has recently gained importance due to the negative consequences of potentially sexist behavior. Such biases have been examined in various contexts in many studies, particularly for English word embeddings. In this work, Turkish word embeddings are examined with respect to gender bias, and the structure of the Turkish language is compared with that of English in this regard. Measurements of gender bias in the word embeddings lead to the conclusion that Turkish harbors less gender bias in its language structure than English does.
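The last entry compares the amount of gender bias encoded in Turkish and English embeddings but does not name its measurement method. One standard family of measures for such comparisons is a WEAT-style association score; the sketch below is an assumed illustration with toy vectors, not the paper's metric or data.

```python
# Assumed WEAT-style sketch: mean cosine association of target words
# with male vs. female attribute words, averaged per language. Random
# vectors stand in for real Turkish and English embeddings.
import numpy as np

rng = np.random.default_rng(3)
dim = 50
words = ["erkek", "kadın", "doktor", "hemşire",
         "man", "woman", "doctor", "nurse"]  # hypothetical word lists
emb = {w: rng.normal(size=dim) for w in words}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def assoc(word, male_attrs, female_attrs):
    """Mean cosine to male attributes minus mean cosine to female ones."""
    v = emb[word]
    return (np.mean([cos(v, emb[a]) for a in male_attrs])
            - np.mean([cos(v, emb[a]) for a in female_attrs]))

# Average absolute association of occupation words, per language.
tr = np.mean([abs(assoc(w, ["erkek"], ["kadın"])) for w in ["doktor", "hemşire"]])
en = np.mean([abs(assoc(w, ["man"], ["woman"])) for w in ["doctor", "nurse"]])
print(f"Turkish bias: {tr:.3f}  English bias: {en:.3f}")
```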