Browsing by Subject "Natural language processing"

Open Access
A transformer-based prior legal case retrieval method
(IEEE - Institute of Electrical and Electronics Engineers, 2023-08-28) Öztürk, Ceyhun Emre; Özçelik, Şemsi Barış; Koç, Aykut
In this work, BERTurk-Legal, a transformer-based language model, is introduced to retrieve prior legal cases. BERTurk-Legal is pre-trained on a dataset from the Turkish legal domain. This dataset does not contain any labels related to the prior court case retrieval task. Masked language modeling is used to train BERTurk-Legal in a self-supervised manner. With zero-shot classification, BERTurk-Legal provides state-of-the-art results on the dataset consisting of legal cases of the Court of Cassation of Turkey. The results of the experiments show the necessity of developing language models specific to the Turkish law domain.
Open Access
Analysis of speech content and voice for deceit detection
(2024-09) Eskin, Maria Raluca
Deceptive behavior is part of daily life, often without being recognized, leading to severe repercussions. With the recent improvements in machine learning, more reliable detection of deceit appears to be possible. Although current visual and multimodal models can identify deception with adequate precision, the individual use of speech content or voice still performs poorly. Therefore, we systematically analyze such essential communication forms focusing on feature extraction and optimization for deceit detection. To this end, we assess the reliability of employing transformers, spatial and temporal architectures, state-of-the-art pre-trained models, and handcrafted representations to detect deceit patterns. Furthermore, we conduct a thorough analysis to comprehend the distinct properties and discriminative power of the evaluated methods. The results demonstrate that speech content (transcribed text) provides more information than vocal characteristics. In addition, transformer architectures are found to be effective in representation learning and modeling, providing insights into optimal model configurations for deceit detection.
Open Access
Automatic categorization and summarization of documentaries
(Sage Publications Ltd., 2010) Demirtas, K.; Cicekli, N. K.; Cicekli, I.
In this paper, we propose automatic categorization and summarization of documentaries using subtitles of videos. We propose two methods for video categorization. The first makes unsupervised categorization by applying natural language processing techniques on video subtitles and uses the WordNet lexical database and WordNet domains. The second has the same extraction steps but uses a learning module to categorize. Experiments with documentary videos give promising results in discovering the correct categories of videos. We also propose a video summarization method using the subtitles of videos and text summarization techniques. Significant sentences in the subtitles of a video are identified using these techniques and a video summary is then composed by finding the video parts corresponding to these summary sentences. © 2010 The Author(s).
Open Access
Design and evaluation of an ontology based information extraction system for radiological reports
(Pergamon Press, 2010) Soysal, E.; Cicekli, I.; Baykal, N.
This paper describes an information extraction system that extracts and converts the available information in free text Turkish radiology reports into a structured information model using manually created extraction rules and domain ontology. The ontology provides flexibility in the design of extraction rules, and determines the information model for the extracted semantic information. Although our information extraction system mainly concentrates on abdominal radiology reports, the system can be used in another field of medicine by adapting its ontology and extraction rule set. We achieved very high precision and recall results during the evaluation of the developed system with unseen radiology reports. © 2010 Elsevier Ltd.
Open Access
Design and implementation of a computational lexicon for Turkish
(1997) Yorulmaz, Abdullah Kurtuluş
All natural language processing systems (such as parsers, generators, taggers) need to have access to a lexicon about the words in the language. This thesis presents a lexicon architecture for natural language processing in Turkish. Given a query form consisting of a surface form and other features acting as restrictions, the lexicon produces feature structures containing morphosyntactic, syntactic, and semantic information for all possible interpretations of the surface form satisfying those restrictions. The lexicon is based on contemporary approaches like feature-based representation, inheritance, and unification. It makes use of two information sources: a morphological processor and a lexical database containing all the open and closed-class words of Turkish. The system has been implemented in SICStus Prolog as a standalone module for use in natural language processing applications.
Restricted
Dil yetmiyor
(1981) Eyüboğlu, İsmet Zeki
Restricted
Dil yetmiyor
(1985) Eyüboğlu, İsmet Zeki
Open Access
Diyalog tabanlı metinlerde konu değişimi tespiti (Topic change detection in dialogue-based texts)
(IEEE, 2019-04) Şenel, Lütfi Kerem; Yücesoy, Veysel; Koç, A.; Çukur, Tolga
Thanks to communication technologies that have been developing exponentially in recent years (the internet, social media, smartphones, etc.), accessing and sharing data has become easier. In recent years in particular, spoken and written sharing platforms have developed rapidly. Social media sites and forums stand out among the areas where written sharing takes place most rapidly. Unlike social media, forums are expected to host, under each thread, only conversations related to that thread. Call centers are another area with topic restrictions, and one where spoken communication has grown most rapidly in recent years. Automatically detecting when a conversation strays from the designated topics or when the main topic changes is important for evaluating and automatically managing the communication performance of platforms such as call centers and technical forums. In this study, classifiers that can automatically detect topic change in dialogue-based Turkish texts are developed. To develop these classifiers, topic-based conversation data were first compiled from Turkish forums to obtain a raw dataset. On this dataset, a classical method (TF-IDF) and a deep learning model (LSTM) are compared on the automatic topic change detection problem. While the classical method achieves a success rate of up to about 80% on the test set, the performance of the deep learning method is observed to remain at about 76%.
Open Access
Elicitation and use of relevance feedback information
(Elsevier Ltd, 2006-01) Vechtomova, O.; Karamuftuoglu, M.
The paper presents two approaches to interactively refining user search formulations and their evaluation in the new High Accuracy Retrieval from Documents (HARD) track of TREC-12. The first method consists of asking the user to select a number of sentences that represent documents. The second method consists of showing to the user a list of noun phrases extracted from the initial document set. Both methods then expand the query based on the user feedback. The TREC results show that one of the methods is an effective means of interactive query expansion and yields significant performance improvements. The paper presents a comparison of the methods and detailed analysis of the evaluation results. © 2004 Elsevier Ltd. All rights reserved.
Open Access
Feedforward neural network based case prediction in Turkish higher courts
(2022-08-29) Aras, Arda C.; Öztürk, Ceyhun E.; Koç, Aykut
Thanks to natural language processing (NLP) methods, legal texts can be processed by computers and decision prediction applications can be developed in the legal tech field. The increase in the available data sources in the Turkish legal system provides an opportunity to develop NLP applications as well. In order to develop these applications, the necessary corpora and datasets should be created. In this work, legal case texts from the Turkish Higher Courts that are open to public access and free from personal data are used to develop decision prediction methods. Feedforward neural networks (FFNN) are deployed using word embeddings and the features extracted from texts via the Principal Component Analysis (PCA) algorithm. A Macro F1 score of 85.4% is achieved.
Open Access
Generating semantic similarity atlas for natural languages
(IEEE, 2018-12) Şenel, Lütfi Kerem; Utlu, İhsan; Yücesoy, V.; Koç, A.; Çukur, Tolga
Cross-lingual studies attract a growing interest in natural language processing (NLP) research, and several studies showed that similar languages are more advantageous to work with than fundamentally different languages in transferring knowledge. Different similarity measures for the languages are proposed by researchers from different domains. However, a similarity measure focusing on semantic structures of languages can be useful for selecting pairs or groups of languages to work with, especially for the tasks requiring semantic knowledge such as sentiment analysis or word sense disambiguation. For this purpose, in this work, we leverage a recently proposed word embedding based method to generate a language similarity atlas for 76 different languages around the world. This atlas can help researchers select similar language pairs or groups in cross-lingual applications. Our findings suggest that semantic similarity between two languages is strongly correlated with the geographic proximity of the countries in which they are used.
Open Access
Generic text summarization for Turkish
(Oxford University Press, 2010) Kutlu, M.; Cığır, C.; Cicekli, I.
In this paper, we propose a generic text summarization method that generates summaries of Turkish texts by ranking sentences according to their scores. Sentence scores are calculated using their surface-level features, and summaries are created by extracting the highest ranked sentences from the original documents. To extract sentences which form a summary with an extensive coverage of the main content of the text and less redundancy, we use features such as term frequency, key phrase (KP), centrality, title similarity and sentence position. The sentence rank is computed using a score function that uses its feature values and the weights of the features. The best feature weights are learned using machine-learning techniques with the help of human-constructed summaries. Performance evaluation is conducted by comparing summarization outputs with manual summaries of two newly created Turkish data sets. This paper presents one of the first Turkish summarization systems, and its results are promising. We introduce the usage of KP as a surface-level feature in text summarization, and we show the effectiveness of the centrality feature in text summarization. The effectiveness of the features in Turkish text summarization is also analyzed in detail. © The Author 2008. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
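The ranking step described in this abstract can be illustrated with a minimal sketch. It assumes the score function is a plain weighted sum of feature values, which the abstract does not confirm; the feature names, the weight values, and the helper names `sentence_score` and `summarize` are hypothetical placeholders, not the paper's learned weights or code.

```python
from typing import Dict, List

# Hypothetical feature weights. The paper learns its weights from
# human-constructed summaries; these values are placeholders only.
WEIGHTS: Dict[str, float] = {
    "term_frequency": 0.3,
    "key_phrase": 0.2,
    "centrality": 0.2,
    "title_similarity": 0.2,
    "position": 0.1,
}


def sentence_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of a sentence's feature values (assumed score form)."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


def summarize(
    sentences: List[str],
    feature_rows: List[Dict[str, float]],
    k: int,
    weights: Dict[str, float] = WEIGHTS,
) -> List[str]:
    """Return the k highest-scoring sentences, kept in document order."""
    if len(sentences) != len(feature_rows):
        raise ValueError("one feature row is required per sentence")
    scores = [sentence_score(row, weights) for row in feature_rows]
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore original order so the extract reads like the source text.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]
```

Returning the selected sentences in document order is a design choice of this sketch, made so that the extract stays readable; the abstract does not say how the original system orders its output.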
Open Access
Language ability in schizophrenia patients and genetic high-risk individuals: neuropsychological and computational investigation
(2024-05) Çabuk, Tuğçe
The study aimed to define language-related phenotypes in schizophrenia and analyze language in schizophrenia patients (SZ) and their unaffected siblings (SIB) as a possible endophenotype. For experiment 1, language was evaluated with the Thought and Language Disorder Scale (TALD), Thought and Language Index (TLI), phonemic and semantic verbal fluency, Boston Naming Test, and Scale for Scoring the Inclusion and the Quality of the Parts of the Story. Language skills of SIB were higher than those of SZ, but lower than those of healthy controls (HC). The best predictor of SZ and SIB was TLI score in the main regression model compared to HC. For experiment 2, I utilized Natural Language Processing (NLP) to explore whether there are altered linguistic features in Turkish-speaking SZ and whether these possible features as phenotypes are language-dependent or language-independent. Analyses were conducted in two parts. First, mean sentence length (MSL), total completed words (TCW), moving average type-token ratio (MATTR), and first-person singular pronoun usage (FPSP) were calculated. Second, I used part-of-speech (POS) tagging and Word2Vec. I found that SZ had lower MSL and MATTR but higher use of FPSP. Results were correlated with the TALD. POS tagging demonstrated that SZ used fewer coordinating conjunctions. Word2Vec detected that SZ had higher semantic similarity than HC, and K-Means could separate SZ and HC into two distinct groups with high accuracy (86.84%). My findings suggest that semantics as subparts of language could be a possible endophenotype in schizophrenia. Thus, their assessment may improve the early diagnosis of the illness. The findings also show that altered linguistic features in SZ are mostly language-independent.
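For readers unfamiliar with the moving average type-token ratio (MATTR) mentioned in this abstract, the following is a minimal sketch of the measure as it is commonly defined: the type-token ratio is computed over every window of consecutive tokens and then averaged. The window size, the fallback for short texts, and the function name `mattr` are assumptions of this sketch; the thesis's own settings are not given in the abstract.

```python
from typing import List


def mattr(tokens: List[str], window: int = 50) -> float:
    """Moving average type-token ratio over sliding windows of tokens.

    The default window of 50 is an illustrative assumption.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not tokens:
        return 0.0
    if len(tokens) < window:
        # Assumed fallback: plain type-token ratio for short texts.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

A lower value indicates more repetition within each window, which is the direction of the MATTR difference the abstract reports for SZ.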
Open Access
A lexical-functional grammar for Turkish
(1993) Güngördü, Zelal
Natural language processing is a research area which is becoming increasingly popular each day for both academic and commercial reasons. Syntactic parsing underlies most of the applications in natural language processing. Although there have been comprehensive studies of Turkish syntax from a linguistic perspective, this is one of the first attempts to investigate it extensively from a computational point of view. In this thesis, a lexical-functional grammar for Turkish syntax is presented. Our current work deals with regular Turkish sentences that are structurally simple or complex.
Open Access
Local context based linear text segmentation
(2014-02) Erdem, Hayrettin
Understanding the topical structure of text documents is important for effective retrieval and browsing, automatic summarization, and tasks related to identifying, clustering and tracking documents by topic. Although documents often display structural organization and contain explicit section markers, some lack such properties, which reveals the need for topical text segmentation systems. Examples of such documents are speech transcripts and inherently unstructured texts like newspaper columns and blog entries that discuss several subjects in a single discourse. A novel local-context based approach depending on lexical cohesion is presented for linear text segmentation, which is the task of dividing text into a linear sequence of coherent segments. As the lexical cohesion indicator, the proposed technique exploits relationships among terms induced from a semantic space called HAL (Hyperspace Analogue to Language), which is built by examining the co-occurrence of terms while passing a fixed-size window over text. The proposed algorithm (BTS) iteratively discovers topical shifts by examining the most relevant sentence pairs in a block of sentences considered at each iteration. The technique is evaluated on both error-free speech transcripts of news broadcasts and documents formed by concatenating different topical regions of text. A new corpus for Turkish is automatically built where each document is formed by concatenating different news articles. For performance comparison, two state-of-the-art methods, TextTiling and C99, are used, and the results show that the proposed approach has comparable performance with these two techniques. The results are also statistically validated by applying ANOVA and the Tukey post hoc test.
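The window-based co-occurrence counting that underlies a HAL-style semantic space can be sketched as follows. This is only an illustration of the general idea, under the assumption that nearer words receive larger weights (window size minus distance plus one), as HAL is commonly described; the thesis's exact weighting, its window size, and the BTS segmentation algorithm itself are not reproduced here, and the function name `hal_cooccurrence` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hal_cooccurrence(
    tokens: List[str], window: int = 5
) -> Dict[Tuple[str, str], int]:
    """Accumulate distance-weighted co-occurrence counts.

    counts[(target, context)] sums the weights of `context` words that
    appear before `target` within `window` positions. The weighting
    (window - distance + 1) is an assumption of this sketch.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for i, target in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break  # ran off the start of the text
            counts[(target, tokens[j])] += window - distance + 1
    return dict(counts)
```

Each term's row of counts can then serve as its vector in the semantic space, and relationships between terms can be estimated by comparing those vectors.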
Open Access
Measuring cross-lingual semantic similarity across European languages
(IEEE, 2017) Şenel, Lütfü Kerem; Yücesoy, V.; Koç, A.; Çukur, Tolga
This paper studies cross-lingual semantic similarity (CLSS) between five European languages (English, French, German, Spanish and Italian) via unsupervised word embeddings from a cross-lingual lexicon. The vocabulary in each language is projected onto a separate high-dimensional vector space, and these vector spaces are then compared using several different distance measures (e.g., correlation and cosine) to measure the pairwise semantic similarities between the languages. A substantial degree of similarity is observed between the vector spaces learned from corpora of the European languages. Null hypothesis testing and bootstrap methods (by resampling without replacement) are utilized to verify the results.
Open Access
Natural language processing in law: Prediction of outcomes in the higher courts of Turkey
(Elsevier Ltd, 2021-09) Mumcuoğlu, Emre; Öztürk, Ceyhun E.; Özaktaş, Haldun Memduh; Koç, Aykut
Natural language processing (NLP) based approaches have recently received attention for legal systems of several countries. It is of interest to study the wide variety of legal systems that have so far not received any attention. In particular, for the legal system of the Republic of Turkey, codified in Turkish, no works have been published. We first review the state of the art of NLP in law, and then study the problem of predicting verdicts for several different courts, using several different algorithms. This study is much broader than earlier studies in the number of different courts and the variety of algorithms it includes. Therefore, it provides a reference point and baseline for further studies in this area. We further hope the scope and systematic nature of this study can set a framework that can be applied to the study of other legal systems. We present novel results on predicting the rulings of the Turkish Constitutional Court and Courts of Appeal, using only fact descriptions, and without seeing the actual rulings. The methods that are utilized are based on Decision Trees (DTs), Random Forests (RFs), Support Vector Machines (SVMs) and state-of-the-art deep learning (DL) methods, specifically Gated Recurrent Units (GRUs), Long Short-Term Memory networks (LSTMs) and bidirectional LSTMs (BiLSTMs), with the integration of an attention mechanism for each model. The prediction results for all algorithms are given in a comparative and detailed manner. We demonstrate that outcomes of the courts of the Turkish legal system can be predicted with high accuracy, especially with deep learning based methods. The presented results exhibit similar performance to earlier work in the literature for other languages and legal systems.
Open Access
A natural language-based interface for querying a video database
(Institute of Electrical and Electronics Engineers, 2007-01) Küçüktunç, O.; Güdükbay, U.; Ulusoy, Özgür
The authors developed a video database system called BilVideo that provides integrated support for spatiotemporal, semantic, and low-level feature queries. As a further development for this system, the authors present a natural language processing-based interface that lets users formulate queries in English and discuss the advantage of using such an interface. © 2007 IEEE.
Open Access
Online text classification for real life tweet analysis
(IEEE, 2016) Yar, Ersin; Delibalta, İ.; Baruh, L.; Kozat, Süleyman Serdar
In this paper, we study multi-class classification of tweets, where we introduce highly efficient dimensionality reduction techniques suitable for online processing of high dimensional feature vectors generated from freely-worded text. As the real life case study, we work on tweets in the Turkish language; however, our methods are generic and can be used for other languages, as clearly explained in the paper. Since we work on a real life application and the tweets are freely worded, we introduce text correction, normalization and root finding algorithms. Although text processing and classification are highly important due to many applications such as emotion recognition, advertisement selection, etc., online classification and regression algorithms over text are limited due to the need for high dimensional vectors to represent natural text inputs. We overcome such limitations by showing that randomized projections and piecewise linear models can be efficiently leveraged to significantly reduce the computational cost for feature vector extraction from the tweets. Hence, we can perform multi-class tweet classification and regression in real time. We demonstrate our results over tweets collected from a real life case study where the tweets are freely-worded, e.g., with emoticons, shortened words, special characters, etc., and are unstructured. We implement several well-known machine learning algorithms as well as novel regression methods and demonstrate that we can significantly reduce the computational complexity with insignificant change in the classification and regression performance.
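As a rough illustration of how a randomized projection reduces the dimension of a high dimensional feature vector, the sketch below multiplies the vector by a fixed random Gaussian matrix. The abstract does not state which projection scheme the paper uses, so the Gaussian entries, the scaling, and the function names `make_projection` and `project` are all assumptions of this example.

```python
import math
import random
from typing import List


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> List[List[float]]:
    """Build an out_dim x in_dim matrix of scaled Gaussian entries.

    Gaussian entries scaled by 1/sqrt(out_dim) are an assumed, common
    choice; the paper's actual scheme may differ.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def project(vector: List[float], matrix: List[List[float]]) -> List[float]:
    """Map a high dimensional vector to the lower dimensional space."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("matrix width must match vector length")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]
```

Because the matrix is generated once and reused, each incoming feature vector costs only one matrix-vector product, which is what makes this kind of reduction attractive for online processing.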
Open Access
Predicting outcomes of the court of cassation of Turkey with recurrent neural networks
(IEEE, 2022-08-29) Öztürk, Ceyhun E.; Özçelik, Ş. Barış; Koç, Aykut
Natural Language Processing (NLP) based approaches have recently become very popular for studies in the legal domain. In this work, the outcomes of the cases of the Court of Cassation of Turkey were predicted with the use of Deep Learning models. These models are GRU, LSTM and BiLSTM, which are variants of the recurrent neural network. The models saw only the fact description parts of the case decision texts during training. First, the models were trained with word embeddings created from everyday-language texts. Then, the models were trained with word embeddings created from legal cases downloaded from Turkish courts. The results of the experiments on the models are given in a comparative and detailed manner. Based on this study and past studies, it is observed that the outcomes of the Court of Cassation can be predicted with higher accuracy than those of most of the courts investigated in previous legal NLP studies. The best-performing model is the GRU, which achieves 96.8% accuracy in the decision prediction task.