Browsing by Subject "Text classification"

Open Access
Addressing encoder-only transformer limitations with graph neural networks for text classification
(2025-01) Aras, Arda Can
Recent advancements in NLP have been primarily driven by transformer-based models, which capture contextual information within sequences, revolutionizing tasks such as text classification and natural language understanding. In parallel, GNNs have emerged as powerful tools for modeling structured data, leveraging graph representations to capture global relationships across entities. However, significant challenges persist at the intersection of these fields, limiting the efficacy and scalability of existing models. These challenges include the inability to seamlessly integrate contextual and structural information, computational inefficiencies associated with static graph construction and transductive learning, and the underperformance of models in low-labeled data scenarios. This thesis explores and addresses these challenges by developing novel methodologies that unify transformers and GNNs, leveraging their complementary strengths. The first contribution, GRTE, introduces an architecture that combines pre-trained transformer models with heterogeneous and homogeneous graph representations to enhance text classification in both inductive and transductive settings. Compared to state-of-the-art models, GRTE achieves significant computational efficiency, reducing training overhead by up to 100 times. The second contribution, Text-RGNN, proposes a relational modeling framework for heterogeneous text graphs, enabling the nuanced representation of diverse interactions between nodes and demonstrating substantial accuracy improvements of up to 10.61% over existing models, particularly in low-labeled data settings. Finally, the third contribution, VISPool, introduces a scalable architecture that dynamically constructs vector visibility graphs from transformer outputs, enabling seamless integration of graph-based reasoning into transformer pipelines while improving performance on NLP benchmarks such as GLUE, with performance improvements of up to 13% in specific tasks. 
Through comprehensive experimentation and benchmarking against state-of-the-art models, this thesis establishes the efficacy of these proposed methodologies. The results demonstrate the potential for improved performance, scalability, and the ability to address long-standing challenges in NLP and GNN integration. These contributions lay a robust foundation for future research and applications at the intersection of graph-based and transformer-based approaches, advancing the state of the art in text representation and classification.
Open Access
Ai-assisted text composition for automated content authoring using transformer-based language models
(IEEE, 2024-06-23) Alpdemir, Yusuf; Alpdemir, Mahmut Nedim
In this paper, we introduce a hybrid method that combines Controllable Text Generation (CTG) via Large Language Models (LLMs), fine-tuned language models, and sentence transformers in a single framework to generate real-author-styled articles in the Turkish language. As such, we seek to exemplify hybrid solutions that produce high-quality, human-styled content, given limited resources and relatively short text prompts as inputs. To achieve this, we introduce a novel method to assemble an author-specific article at different coherence and fluency levels, based on phrasal control of the CTG process. Control phrases are automatically assembled based on a semantic correlation measure calculated using sentence embeddings corresponding to author articles, which are obtained from pre-trained sentence transformers.
Open Access
Architecture of a grid-enabled Web search engine
(Elsevier Ltd, 2007) Cambazoglu, B. B.; Karaca, E.; Kucukyilmaz, T.; Turk, A.; Aykanat, Cevdet
Search Engine for South-East Europe (SE4SEE) is a socio-cultural search engine running on the grid infrastructure. It offers a personalized, on-demand, country-specific, category-based Web search facility. The main goal of SE4SEE is to attack the page freshness problem by performing the search on the original pages residing on the Web, rather than on the previously fetched copies as done in the traditional search engines. SE4SEE also aims to obtain high download rates in Web crawling by making use of the geographically distributed nature of the grid. In this work, we present the architectural design issues and implementation details of this search engine. We conduct various experiments to illustrate performance results obtained on a grid infrastructure and justify the use of the search strategy employed in SE4SEE. © 2006 Elsevier Ltd. All rights reserved.
Open Access
ARG: A tool for automatic report generation
(Aves Yayıncılık, 2004) Karakaya, K. Murat; Güvenir, H. Altay
The expansion of online text with the rapid growth of the Internet calls for data mining techniques to reveal the information embedded in these documents. Text classification and text summarization are therefore two of the most important application areas. In this work, we attempt to integrate these two techniques to help the user compile and extract the information that is needed. We propose a two-phase algorithm in which the paragraphs in the documents are first classified according to given topics, and then each topic is summarized to constitute the automatically generated report.
Open Access
Chat mining: predicting user and message attributes in computer-mediated communication
(Elsevier Ltd, 2008-07) Kucukyilmaz, T.; Cambazoglu, B. B.; Aykanat, Cevdet; Can, F.
The focus of this paper is to investigate the possibility of predicting several user and message attributes in text-based, real-time, online messaging services. For this purpose, a large collection of chat messages is examined. The applicability of various supervised classification techniques for extracting information from the chat messages is evaluated. Two competing models are used for defining the chat mining problem. A term-based approach is used to investigate the user and message attributes in the context of vocabulary use while a style-based approach is used to examine the chat messages according to the variations in the authors' writing styles. Among 100 authors, the identity of an author is correctly predicted with 99.7% accuracy. Moreover, the reverse problem is exploited, and the effect of author attributes on computer-mediated communications is discussed. © 2008 Elsevier Ltd. All rights reserved.
Open Access
Graph receptive transformer encoder for text classification
(IEEE, 2024) Aras, Arda Can; Alikaşifoğlu, Tuna; Koç, Aykut
By employing attention mechanisms, transformers have made great improvements in nearly all NLP tasks, including text classification. However, the context of the transformer's attention mechanism is limited to single sequences, and their fine-tuning stage can utilize only inductive learning. Focusing on broader contexts by representing texts as graphs, previous works have generalized transformer models to graph domains to employ attention mechanisms beyond single sequences. However, these approaches either require exhaustive pre-training stages, learn only transductively, or can learn inductively without utilizing pre-trained models. To address these problems simultaneously, we propose the Graph Receptive Transformer Encoder (GRTE), which combines graph neural networks (GNNs) with large-scale pre-trained models for text classification in both inductive and transductive fashions. By constructing heterogeneous and homogeneous graphs over given corpora and not requiring a pre-training stage, GRTE can utilize information from both large-scale pre-trained models and graph-structured relations. Our proposed method retrieves global and contextual information in documents and generates word embeddings as a by-product of inductive inference. We compared the proposed GRTE with a wide range of baseline models through comprehensive experiments. Compared to the state-of-the-art, we demonstrated that GRTE improves model performances and offers computational savings up to ~100×.
Open Access
Models and algorithms for parallel text retrieval
(2006) Cambazoğlu, Berkant Barla
In the last decade, search engines became an integral part of our lives. The current state-of-the-art in search engine technology relies on parallel text retrieval. Basically, a parallel text retrieval system is composed of three components: a crawler, an indexer, and a query processor. The crawler component aims to locate, fetch, and store the Web pages in a local document repository. The indexer component converts the stored, unstructured text into a queryable form, most often an inverted index. Finally, the query processing component performs the search over the indexed content. In this thesis, we present models and algorithms for efficient Web crawling and query processing. First, for parallel Web crawling, we propose a hybrid model that aims to minimize the communication overhead among the processors while balancing the number of page download requests and storage loads of processors. Second, we propose models for document- and term-based inverted index partitioning. In the document-based partitioning model, the number of disk accesses incurred during query processing is minimized while the posting storage is balanced. In the term-based partitioning model, the total amount of communication is minimized while, again, the posting storage is balanced. Finally, we develop and evaluate a large number of algorithms for query processing in ranking-based text retrieval systems. We test the proposed algorithms over our experimental parallel text retrieval system, Skynet, currently running on a 48-node PC cluster. In the thesis, we also discuss the design and implementation details of another, somewhat untraditional, grid-enabled search engine, SE4SEE. Among our practical work, we present the Harbinger text classification system, used in SE4SEE for Web page classification, and the K-PaToH hypergraph partitioning toolkit, to be used in the proposed models.
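As a rough illustration of the two partitioning schemes named in this abstract, the sketch below builds a toy inverted index and splits it by document and by term. The round-robin assignment, the function names, and the sample documents are assumptions made for illustration only; the thesis describes optimization-based partitioning models, which this sketch does not implement.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def partition_by_document(docs, num_parts):
    """Document-based: each processor indexes only its own subset of documents.

    Round-robin assignment is a placeholder (assumption), not the thesis model.
    """
    parts = [dict() for _ in range(num_parts)]
    for i, doc_id in enumerate(sorted(docs)):
        parts[i % num_parts][doc_id] = docs[doc_id]
    return [build_inverted_index(part) for part in parts]


def partition_by_term(docs, num_parts):
    """Term-based: each processor holds complete posting lists for a subset of terms.

    Round-robin assignment is a placeholder (assumption), not the thesis model.
    """
    full_index = build_inverted_index(docs)
    parts = [dict() for _ in range(num_parts)]
    for i, term in enumerate(sorted(full_index)):
        parts[i % num_parts][term] = full_index[term]
    return parts


# Hypothetical three-document collection.
docs = {
    0: "parallel text retrieval",
    1: "text classification",
    2: "parallel web crawling",
}
doc_parts = partition_by_document(docs, 2)
term_parts = partition_by_term(docs, 2)
```

With document-based partitioning, a query term such as "text" must be looked up on every processor, each of which returns a partial posting list. With term-based partitioning, the full posting list for "text" lives on a single processor, so the cost shifts from disk accesses to communication between processors.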
Open Access
Online text classification for real life tweet analysis
(IEEE, 2016) Yar, Ersin; Delibalta, İ.; Baruh, L.; Kozat, Süleyman Serdar
In this paper, we study multi-class classification of tweets, where we introduce highly efficient dimensionality reduction techniques suitable for online processing of high-dimensional feature vectors generated from freely worded text. As a real-life case study, we work on tweets in the Turkish language; however, our methods are generic and can be used for other languages, as clearly explained in the paper. Since we work on a real-life application and the tweets are freely worded, we introduce text correction, normalization, and root finding algorithms. Although text processing and classification are highly important due to many applications such as emotion recognition, advertisement selection, etc., online classification and regression algorithms over text are limited due to the need for high-dimensional vectors to represent natural text inputs. We overcome such limitations by showing that randomized projections and piecewise linear models can be efficiently leveraged to significantly reduce the computational cost of feature vector extraction from the tweets. Hence, we can perform multi-class tweet classification and regression in real time. We demonstrate our results over tweets collected from a real-life case study where the tweets are freely worded, e.g., with emoticons, shortened words, special characters, etc., and are unstructured. We implement several well-known machine learning algorithms as well as novel regression methods and demonstrate that we can significantly reduce the computational complexity with insignificant change in the classification and regression performance.
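The randomized projections mentioned in this abstract can be sketched in general terms as follows. This is a generic Gaussian random projection, assuming a bag-of-words style input vector; the dimensions, the seed, and the function names are hypothetical, and the paper's actual projection scheme may differ.

```python
import math
import random


def make_projection(input_dim, output_dim, seed=0):
    """Draw a fixed output_dim x input_dim Gaussian matrix, scaled by 1/sqrt(output_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(output_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
        for _ in range(output_dim)
    ]


def project(matrix, vector):
    """Map a high-dimensional vector to the low-dimensional space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical sizes: a 1000-term vocabulary reduced to 16 dimensions.
matrix = make_projection(1000, 16, seed=42)

# A sparse term-count vector standing in for one tweet (assumed indices).
tweet_vector = [0.0] * 1000
for term_index in (3, 57, 412):
    tweet_vector[term_index] = 1.0

reduced = project(matrix, tweet_vector)
```

The matrix is drawn once and reused for every incoming tweet, so each new input costs a single matrix-vector product, which is what makes this kind of reduction suitable for online processing.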
Open Access
Text-RGNNs: relational modeling for heterogeneous text graphs
(IEEE, 2024) Aras, Arda Can; Alikaşifoğlu, Tuna; Koç, Aykut
Text Graph Convolutional Network (TextGCN) is the foundational work on representing a corpus with heterogeneous text graphs. Its innovative application of GCNs to text classification has garnered widespread recognition. However, GCNs are inherently designed to operate on homogeneous graphs, potentially limiting their performance. To address this limitation, we present Text Relational Graph Neural Networks (Text-RGNNs), which offer a novel methodology that uses heterogeneous GNNs to assign a dedicated weight matrix to each relation within the graph. This approach leverages RGNNs, enabling nuanced and compelling modeling of the relationships inherent in heterogeneous text graphs, ultimately resulting in performance enhancements. We present a theoretical framework for the relational modeling of GNNs for text classification within the context of document classification and demonstrate its effectiveness through extensive experimentation on benchmark datasets. The experiments reveal that, by incorporating relational modeling into heterogeneous text graphs, Text-RGNNs outperform the existing state-of-the-art both with completely labeled nodes and with minimal proportions of labeled training data. Text-RGNNs outperform the second-best models by up to 10.61% for the corresponding evaluation metric.
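The idea of a dedicated weight matrix per relation can be illustrated with a minimal relational message-passing update. This is a generic R-GCN-style layer with mean normalization and ReLU, written as an assumption-laden sketch; the relation names, node names, and weights are hypothetical, and it is not the exact formulation used in the paper.

```python
def matvec(weight, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def relational_layer(features, edges_by_relation, relation_weights, self_weight):
    """One relational update: each relation transforms messages with its own matrix.

    features: {node: vector}
    edges_by_relation: {relation: [(source, destination), ...]}
    relation_weights: {relation: weight matrix}
    """
    # Self-connection term, shared across all nodes.
    updated = {node: matvec(self_weight, h) for node, h in features.items()}
    for relation, edges in edges_by_relation.items():
        # Count incoming edges per destination to average messages per relation.
        in_degree = {}
        for _, dst in edges:
            in_degree[dst] = in_degree.get(dst, 0) + 1
        for src, dst in edges:
            message = matvec(relation_weights[relation], features[src])
            updated[dst] = [
                u + m / in_degree[dst] for u, m in zip(updated[dst], message)
            ]
    # ReLU non-linearity.
    return {node: [max(0.0, v) for v in h] for node, h in updated.items()}


# Hypothetical heterogeneous text graph: one document node and two word nodes.
features = {"doc0": [0.0, 0.0], "word_a": [1.0, 0.0], "word_b": [0.0, 1.0]}
edges_by_relation = {
    "word_doc": [("word_a", "doc0"), ("word_b", "doc0")],
    "word_word": [("word_a", "word_b")],
}
identity = [[1.0, 0.0], [0.0, 1.0]]
relation_weights = {"word_doc": identity, "word_word": [[2.0, 0.0], [0.0, 2.0]]}

updated = relational_layer(features, edges_by_relation, relation_weights, identity)
```

A plain GCN would apply one shared matrix to every edge. Here, word-document and word-word edges are transformed separately, which is the distinction between homogeneous and relational modeling that the abstract draws.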