Browsing by Author "Dibeklioğlu, Hamdi"
Now showing 1 - 18 of 18
Item Open Access
Assessment of Parkinson's disease severity from videos using deep architecture (IEEE, 2021-07-26)
Yin, Z.; Geraedts, V. J.; Wang, Z.; Contarino, M. F.; Dibeklioğlu, Hamdi; Gemert, J. V.
Parkinson's disease (PD) diagnosis is based on clinical criteria, i.e., bradykinesia, rest tremor, rigidity, etc. Assessment of the severity of PD symptoms with clinical rating scales, however, is subject to inter-rater variability. In this paper, we propose a deep learning based automatic PD diagnosis method using videos to assist diagnosis in clinical practice. We deploy a 3D Convolutional Neural Network (CNN) as the baseline approach for PD severity classification and show its effectiveness. Due to the lack of data in the clinical field, we explore the possibility of transfer learning from non-medical datasets and show that PD severity classification can benefit from it. To bridge the domain discrepancy between medical and non-medical datasets, we let the network focus more on subtle temporal visual cues, i.e., the frequency of tremors, by designing a Temporal Self-Attention (TSA) mechanism. Seven tasks from the Movement Disorders Society - Unified PD Rating Scale (MDS-UPDRS) Part III are investigated, which reveal the symptoms of bradykinesia and postural tremors. Furthermore, we propose a multi-domain learning method to predict patient-level PD severity through task-assembling. We empirically show the effectiveness of the TSA and task-assembling methods on our PD video dataset, achieving a best MCC of 0.55 on binary task-level classification and 0.39 on three-class patient-level classification.
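For concreteness, the kind of temporal re-weighting a temporal self-attention mechanism performs can be sketched as below. This is a minimal PyTorch illustration over per-frame features; the layer sizes and the soft-pooling scheme are assumptions, not the paper's exact TSA design.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Minimal temporal attention: score each frame embedding and pool the
    sequence with the resulting weights (a sketch, not the authors' TSA)."""
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, time, feat_dim), e.g. pooled 3D-CNN features
        scores = self.scorer(frame_feats)          # (batch, time, 1)
        weights = torch.softmax(scores, dim=1)     # attention over time
        return (weights * frame_feats).sum(dim=1)  # (batch, feat_dim)

# Toy usage: 2 clips, 16 frames, 512-dimensional per-frame features
feats = torch.randn(2, 16, 512)
print(TemporalSelfAttention(512)(feats).shape)  # torch.Size([2, 512])
```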
Item Open Access
Attended end-to-end architecture for age estimation from facial expression videos (IEEE, 2020)
Pei, W.; Dibeklioğlu, Hamdi; Baltrušaitis, T.
The main challenges of age estimation from facial expression videos lie not only in modeling the static facial appearance, but also in capturing the temporal facial dynamics. Traditional techniques for this problem focus on constructing handcrafted features to explore the discriminative information contained in facial appearance and dynamics separately, which relies on sophisticated feature refinement and framework design. In this paper, we present an end-to-end architecture for age estimation, called the Spatially-Indexed Attention Model (SIAM), which is able to simultaneously learn both the appearance and dynamics of age from raw videos of facial expressions. Specifically, we employ convolutional neural networks to extract effective latent appearance representations and feed them into recurrent networks to model the temporal dynamics. More importantly, we propose to leverage attention models for salience detection in both the spatial domain for each single image and the temporal domain for the whole video. We design a specific spatially-indexed attention mechanism among the convolutional layers to extract the salient facial regions in each individual image, and a temporal attention layer to assign attention weights to each frame. This two-pronged approach not only improves performance by allowing the model to focus on informative frames and facial areas, but also offers an interpretable correspondence between spatial facial regions and temporal frames on the one hand, and the task of age estimation on the other. We demonstrate the strong performance of our model in experiments on a large, gender-balanced database of 400 subjects with ages spanning from 8 to 76 years. Experiments reveal that our model exhibits significant superiority over state-of-the-art methods given sufficient training data.
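As a rough illustration of spatially-indexed attention over convolutional feature maps, the sketch below re-weights each spatial location of a feature map with a learned saliency map. The shapes and the single 1x1-convolution scorer are illustrative assumptions, not the SIAM architecture itself.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Per-location attention over a convolutional feature map; a rough
    analogue of spatially-indexed attention (the exact SIAM design differs)."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (batch, channels, H, W) taken from an intermediate CNN layer
        b, c, h, w = fmap.shape
        attn = torch.softmax(self.score(fmap).view(b, 1, h * w), dim=-1)
        attn = attn.view(b, 1, h, w)   # normalized spatial saliency map
        return fmap * attn             # salient facial regions emphasized

# Toy usage: attend over an 8x8 feature map with 256 channels
x = torch.randn(4, 256, 8, 8)
print(SpatialAttention(256)(x).shape)  # torch.Size([4, 256, 8, 8])
```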
Item Open Access
Augmentation of virtual agents in real crowd videos (Springer, 2019)
Doğan, Yalım; Demirci, Serkan; Güdükbay, Uğur; Dibeklioğlu, Hamdi
Augmenting virtual agents in real crowd videos is an important task for different applications, from simulations of social environments to modeling abnormalities in crowd behavior. We propose a framework for this task, namely for augmenting virtual agents in real crowd videos. We utilize pedestrian detection and tracking algorithms to automatically locate the pedestrians in video frames and project them into our simulated environment, where the navigable area of the simulated environment is available as a navigation mesh. We represent the real pedestrians in the video as simple three-dimensional (3D) models in our simulation environment. The 3D models representing real agents and the augmented virtual agents are simulated using local path planning coupled with a collision avoidance algorithm. The virtual agents augmented into the real video move plausibly without colliding with static and dynamic obstacles, including other virtual agents and the real pedestrians.

Item Open Access
Automatic deceit detection through multimodal analysis of high-stake court-trials (Institute of Electrical and Electronics Engineers, 2023-10-05)
Biçer, Berat; Dibeklioğlu, Hamdi
In this article we propose the use of convolutional self-attention for attention-based representation learning, while replacing traditional vectorization methods with a transformer as the backbone of our speech model for transfer learning within our automatic deceit detection framework. This design performs multimodal data analysis and applies fusion to merge the visual, vocal, and speech (textual) channels, reporting deceit predictions. Our experimental results show that the proposed architecture improves the state of the art on the popular Real-Life Trial (RLT) dataset in terms of correct classification rate. To further assess the generalizability of our design, we experiment on the low-stakes Box of Lies (BoL) dataset, achieve state-of-the-art performance, and provide cross-corpus comparisons. Following our analysis, we report that (1) convolutional self-attention learns meaningful representations while performing joint attention computation for deception, (2) apparent deceptive intent is a continuous function of time and subjects can display varying levels of apparent deceptive intent throughout recordings, and (3), in support of criminal psychology findings, studying abnormal behavior out of context can be an unreliable way to predict deceptive intent.

Item Open Access
Behavior and usability analysis for multimodal user interfaces (Springer, 2021-12)
Dibeklioğlu, Hamdi; Surer, Elif; Salah, Albert Ali; Dutoit, Thierry
Multimodal interfaces offer ever-changing tasks and challenges for designers to accommodate newer technologies, and as these technologies become more accessible, newer application scenarios emerge. Prototype development and user evaluation are important steps in the creation of solutions to these challenges. Furthermore, playful interactions and games are shown to be important settings for studying the social signals of interacting people. Research in multimodal analysis brings together people with diverse skills and specializations to integrate tools in different modalities, to collect and annotate data, and to exchange ideas and skills; this special issue is a reflection of that collective effort.

Item Open Access
Detection and elimination of systematic labeling bias in code reviewer recommendation systems (Association for Computing Machinery, 2021-06-21)
Tecimer, K. Ayberk; Tüzün, Eray; Dibeklioğlu, Hamdi; Erdoğmuş, Hakan
Reviewer selection in modern code review is crucial for effective code reviews. Several techniques exist for recommending reviewers appropriate for a given pull request (PR). Most code reviewer recommendation techniques in the literature build and evaluate their models based on datasets collected from real projects using open-source or industrial practices. The techniques invariably presume that these datasets reliably represent the "ground truth." In the context of a classification problem, ground truth refers to the objectively correct labels of a class used to build models from a dataset or evaluate a model's performance. In a project dataset used to build a code reviewer recommendation system, the code reviewer picked for a PR is usually assumed to be the best code reviewer for that PR. However, in practice, the recommended code reviewer may not be the best possible code reviewer, or even a qualified one. Recent code reviewer recommendation studies suggest that the datasets used tend to suffer from systematic labeling bias, making the ground truth unreliable. Therefore, models and recommendation systems built on such datasets may perform poorly in real practice. In this study, we introduce a novel approach to automatically detect and eliminate systematic labeling bias in code reviewer recommendation systems. The bias that we remove results from selecting reviewers that do not ensure a permanently successful fix for a bug-related PR. To demonstrate the effectiveness of our approach, we evaluated it on two open-source project datasets, HIVE and QT Creator, and with five code reviewer recommendation techniques: Profile-Based, RSTrace, Naive Bayes, k-NN, and Decision Tree. Our debiasing approach appears promising, since it improved the Mean Reciprocal Rank (MRR) of the evaluated techniques by up to 26% in the datasets used.
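The Mean Reciprocal Rank metric used in that evaluation can be computed as in the following sketch. The PR identifiers, reviewer names, and helper function are hypothetical; this is an illustration of the metric, not the study's evaluation code.

```python
def mean_reciprocal_rank(recommendations, ground_truth):
    """MRR over pull requests: for each PR, take the reciprocal of the rank
    at which the ground-truth reviewer appears in the recommended list
    (0 if absent), then average over all PRs."""
    total = 0.0
    for pr_id, ranked_reviewers in recommendations.items():
        actual = ground_truth[pr_id]
        if actual in ranked_reviewers:
            total += 1.0 / (ranked_reviewers.index(actual) + 1)
    return total / len(recommendations)

# Hypothetical example: two PRs with ranked reviewer suggestions
recs = {"PR-1": ["alice", "bob", "carol"], "PR-2": ["dave", "erin"]}
truth = {"PR-1": "bob", "PR-2": "frank"}
print(mean_reciprocal_rank(recs, truth))  # (1/2 + 0) / 2 = 0.25
```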
Item Open Access
Do Alzheimer's disease patients appear younger than their real age? (S. Karger AG, 2020-10)
Tüfekçioğlu, Z.; Bilgiç, B.; Zeylan, A. E.; Salah, A. A.; Dibeklioğlu, Hamdi; Emre, M.
Introduction: The most prominent risk factor of Alzheimer's disease (AD) is aging. Aging also influences physical appearance. Our clinical experience suggests that patients with AD may appear younger than their actual age. Based on this empirical observation, we set out to test the hypothesis with human and computer-based estimation systems. Method: We compared 50 early-stage AD patients with 50 age- and sex-matched controls. Facial images of all subjects were recorded using a high-resolution video camera with a frontal view and clear lighting. Subjects were recorded during natural conversation while performing the Mini-Mental State Examination, including spontaneous smiles in addition to static images. The images were used for age estimation by two methods: (1) computer-based age estimation and (2) human-based age estimation. The computer-based system used a state-of-the-art deep convolutional neural network classifier to process the facial images contained in a single video session and performed frame-based age estimation. Individuals who estimated the age by visual inspection of video sequences were chosen following a pilot selection phase. The mean error (ME) of estimations was the main end point of this study. Results: There was no statistically significant difference between the ME scores for AD patients and healthy controls (p = 0.33); however, the difference was in favor of younger estimation of the AD group. The average ME score for AD patients was lower than that for healthy controls in the computer-based estimation system, indicating that AD patients were on average estimated to be younger than their actual age as compared to controls. This difference was statistically significant (p = 0.007). Conclusion: There was a tendency for humans to estimate AD patients as younger, and computer-based estimations showed that AD patients were estimated to be younger than their real age as compared to controls. The underlying mechanisms for this observation are unclear.

Item Open Access
Enforcing multilabel consistency for automatic spatio-temporal assessment of shoulder pain intensity (Association for Computing Machinery, 2020)
Erekat, Diyala; Hammal, Z.; Siddiqui, M.; Dibeklioğlu, Hamdi
The standard clinical assessment of pain is limited primarily to self-reported pain or clinician impression. While the self-reported measurement of pain is useful, in some circumstances it cannot be obtained. Automatic facial expression analysis has emerged as a potential solution for an objective, reliable, and valid measurement of pain. In this study, we propose a video-based approach for the automatic measurement of self-reported pain and observer pain intensity. To this end, we explore the added value of three self-reported pain scales, i.e., the Visual Analog Scale (VAS), the Sensory Scale (SEN), and the Affective Motivational Scale (AFF), as well as the Observer Pain Intensity (OPI) rating, for a reliable assessment of pain intensity from facial expression. Using a spatio-temporal Convolutional Neural Network - Recurrent Neural Network (CNN-RNN) architecture, we propose to jointly minimize the mean absolute error of pain score estimation for each of these scales while maximizing the consistency between them. The reliability of the proposed method is evaluated on the benchmark database for pain measurement from videos, namely, the UNBC-McMaster Pain Archive. Our results show that enforcing the consistency between different self-reported pain intensity scores collected using different pain scales enhances the quality of predictions and improves the state of the art in automatic self-reported pain estimation. The obtained results suggest that automatic assessment of self-reported pain intensity from videos is feasible, and could be used as a complementary instrument to unburden caregivers, especially for vulnerable populations that need constant monitoring.
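A minimal sketch of the joint objective described above is given below, assuming pain scores normalized to [0, 1] and using the variance of predictions across scales as the consistency term; the paper's exact consistency formulation may differ.

```python
import torch

def pain_multiscale_loss(preds, targets, consistency_weight=0.5):
    """Per-scale mean absolute error plus a penalty on disagreement between
    the (normalized) scale predictions, so the scales stay consistent.
    preds / targets: dicts mapping scale name -> (batch,) tensors in [0, 1]."""
    scales = list(preds.keys())
    mae = sum(torch.mean(torch.abs(preds[s] - targets[s])) for s in scales) / len(scales)
    stacked = torch.stack([preds[s] for s in scales], dim=0)  # (scales, batch)
    consistency = torch.mean(torch.var(stacked, dim=0))       # cross-scale spread
    return mae + consistency_weight * consistency

# Toy usage with the three self-report scales and the observer rating
batch = 8
preds = {s: torch.rand(batch, requires_grad=True) for s in ["VAS", "SEN", "AFF", "OPI"]}
targets = {s: torch.rand(batch) for s in preds}
print(pain_multiscale_loss(preds, targets))
```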
Item Open Access
Face inpainting with pre-trained image transformers (IEEE, 2022-08-29)
Gönç, Kaan; Sağlam, Baturay; Kozat, Süleyman S.; Dibeklioğlu, Hamdi
Image inpainting is an underdetermined inverse problem that allows various contents to fill in the missing or damaged regions realistically. Convolutional neural networks (CNNs) are commonly used to create aesthetically pleasing content, yet CNNs have restricted perception fields for collecting global characteristics. Transformers enable long-range relationships to be modeled and diverse content to be generated through autoregressive modeling of pixel-sequence distributions using an image-level attention mechanism. However, current approaches to inpainting with transformers are limited to task-specific datasets and require larger-scale data. We introduce an approach to image inpainting that leverages pre-trained vision transformers to remedy this issue. Experiments show that our approach can outperform CNN-based approaches and achieves performance close to that of task-specific transformer methods.

Item Open Access
Facial feedback for reinforcement learning: A case study and offline analysis using the TAMER framework (Springer, 2020-02)
Li, G.; Dibeklioğlu, Hamdi; Whiteson, S.; Hung, H.
Interactive reinforcement learning provides a way for agents to learn to solve tasks from evaluative feedback provided by a human user. Previous research showed that humans give copious feedback early in training but very sparsely thereafter. In this article, we investigate the potential of agents learning from trainers' facial expressions by interpreting them as evaluative feedback. To do so, we implemented TAMER, a popular interactive reinforcement learning method, in a reinforcement-learning benchmark problem—Infinite Mario—and conducted the first large-scale study of TAMER, involving 561 participants. Using a designed CNN-RNN model, our analysis shows that telling trainers to use facial expressions, together with competition, can improve the accuracy of estimating positive and negative feedback from facial expressions. In addition, our results from a simulation experiment show that learning solely from predicted feedback based on facial expressions is possible, and that with strong/effective prediction models or a regression method, facial responses would significantly improve the performance of agents. Furthermore, our experiment supports previous studies demonstrating the importance of bi-directional feedback and competitive elements in the training interface.

Item Open Access
Identity unbiased deception detection by 2D-to-3D face reconstruction (IEEE, 2021-06-14)
Ngô, Le Minh; Wang, Wei; Mandıra, Burak; Karaoğlu, Sezer; Bouma, Henri; Dibeklioğlu, Hamdi; Gevers, Theo
Deception is a common phenomenon in society, both in our private and professional lives. However, humans are notoriously bad at accurate deception detection. Based on the literature, human accuracy in distinguishing between lies and truthful statements is 54% on average; in other words, it is only slightly better than a random guess. While this is of little concern in everyday situations, in high-stakes settings such as interrogations for serious crimes and the evaluation of testimonies in court cases, accurate deception detection methods are highly desirable. To achieve reliable, covert, and non-invasive deception detection, we propose a novel method that disentangles facial expression and head pose related features using a 2D-to-3D face reconstruction technique from a video sequence and uses them to learn the characteristics of deceptive behavior. We evaluate the proposed method on the Real-Life Trial (RLT) dataset, which contains high-stakes deceits recorded in courtrooms. Our results show that the proposed method (with an accuracy of 68%) improves the state of the art. In addition, a new dataset has been collected, for the first time, for low-stakes deceit detection, and we compare high-stakes deceit detection methods on this newly collected low-stakes data.
Item Open Access
Multi-label sentiment analysis on 100 languages with dynamic weighting for label imbalance (Institute of Electrical and Electronics Engineers, 2021-07-19)
Yılmaz, Selim Fırat; Kaynak, Ergün Batuhan; Koç, Aykut; Dibeklioğlu, Hamdi; Kozat, Süleyman Serdar
We investigate cross-lingual sentiment analysis, which has attracted significant attention due to its applications in various areas including market research, politics, and social sciences. In particular, we introduce a sentiment analysis framework in a multi-label setting that follows Plutchik's wheel of emotions. We introduce a novel dynamic weighting method that balances the contribution from each class during training, unlike previous static weighting methods that assign non-changing weights based on class frequency. Moreover, we adapt the focal loss, which favors harder instances, from the single-label object recognition literature to our multi-label setting. Furthermore, we derive a method to choose optimal class-specific thresholds that maximize the macro-F1 score in linear time complexity. Through an extensive set of experiments, we show that our method obtains state-of-the-art performance in seven of nine metrics in three different languages using a single model, compared with the common baselines and the best performing methods in the SemEval competition. We publicly share the code for our model, which can perform sentiment analysis in 100 languages, to facilitate further research.

Item Open Access
Multi-label sentiment analysis on 100 languages with dynamic weighting for label imbalance (Institute of Electrical and Electronics Engineers Inc., 2023-01-01)
Yılmaz, Selim Fırat; Kaynak, Ergün Batuhan; Koç, Aykut; Dibeklioğlu, Hamdi; Kozat, Süleyman Serdar
We investigate cross-lingual sentiment analysis, which has attracted significant attention due to its applications in various areas including market research, politics, and social sciences. In particular, we introduce a sentiment analysis framework in a multi-label setting that follows Plutchik's wheel of emotions. We introduce a novel dynamic weighting method that balances the contribution from each class during training, unlike previous static weighting methods that assign non-changing weights based on class frequency. Moreover, we adapt the focal loss, which favors harder instances, from the single-label object recognition literature to our multi-label setting. Furthermore, we derive a method to choose optimal class-specific thresholds that maximize the macro-F1 score in linear time complexity. Through an extensive set of experiments, we show that our method obtains state-of-the-art performance in seven of nine metrics in three different languages using a single model, compared with the common baselines and the best performing methods in the SemEval competition. We publicly share the code for our model, which can perform sentiment analysis in 100 languages, to facilitate further research.
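The general recipe of a per-class-weighted focal loss in a multi-label setting can be sketched as follows. This is PyTorch; the dynamic update of the class weights is omitted and the weights shown are placeholders, so this is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits, targets, class_weights, gamma=2.0):
    """Binary cross-entropy focal loss applied independently per label and
    scaled by per-class weights (which could be updated dynamically during
    training to counter label imbalance).
    logits, targets: (batch, num_labels); class_weights: (num_labels,)."""
    probs = torch.sigmoid(logits)
    # p_t is the probability assigned to the correct decision for each label
    p_t = torch.where(targets > 0.5, probs, 1.0 - probs)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    focal = ((1.0 - p_t) ** gamma) * bce
    return (class_weights * focal).mean()

# Toy usage: 4 samples, 8 emotion labels, weights favoring rarer classes
logits = torch.randn(4, 8, requires_grad=True)
targets = (torch.rand(4, 8) > 0.7).float()
weights = torch.linspace(0.5, 2.0, steps=8)
print(multilabel_focal_loss(logits, targets, weights))
```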
Item Unknown
Multimodal analysis of personality traits on videos of self-presentation and induced behavior (Springer, 2020)
Giritlioğlu, Dersu; Mandira, Burak; Yılmaz, Selim Fırat; Ertenli, C. U.; Akgür, Berhan Faruk; Kınıklıoğlu, Merve; Kurt, Aslı Gül; Mutlu, E.; Dibeklioğlu, Hamdi
Personality analysis is an important area of research in several fields, including psychology, psychiatry, and neuroscience. With the recent dramatic improvements in machine learning, it has also become a popular research area in computer science. While current computational methods are able to interpret behavioral cues (e.g., facial expressions, gestures, and voice) to estimate the level of (apparent) personality traits, accessible assessment tools are still substandard for practical use, not to mention the need for fast and accurate methods for such analyses. In this study, we present multimodal deep architectures to estimate the Big Five personality traits from (temporal) audio-visual cues and transcribed speech. Furthermore, for a detailed analysis of personality traits, we have collected a new audio-visual dataset, namely the Self-presentation and Induced Behavior Archive for Personality Analysis (SIAP). In contrast to the available datasets, SIAP introduces recordings of induced behavior in addition to self-presentation (speech) videos. With thorough experiments on the SIAP and ChaLearn LAP First Impressions datasets, we systematically assess the reliability of different behavioral modalities and their combined use. Furthermore, we investigate the characteristics and discriminative power of induced behavior for personality analysis, showing that induced behavior indeed includes signs of personality traits.

Item Unknown
Multimodal assessment of apparent personality using feature attention and error consistency constraint (Elsevier BV, 2021-06)
Aslan, Süleyman; Güdükbay, Uğur; Dibeklioğlu, Hamdi
Personality computing and affective computing, where the recognition of personality traits is essential, have recently gained increasing interest and attention in many research areas. We propose a novel approach to recognize the Big Five personality traits of people from videos. To this end, we use four different modalities, namely, ambient appearance (scene), facial appearance, voice, and transcribed speech. Through a specialized subnetwork for each of these modalities, our model learns reliable modality-specific representations and fuses them using an attention mechanism that re-weights each dimension of these representations to obtain an optimal combination of multimodal information. A novel loss function is employed to enforce the proposed model to give equal importance to each of the personality traits to be estimated, through a consistency constraint that keeps the trait-specific errors as close as possible. To further enhance the reliability of our model, we employ (pre-trained) state-of-the-art architectures (i.e., ResNet, VGGish, ELMo) as the backbones of the modality-specific subnetworks, which are complemented by multilayered Long Short-Term Memory networks to capture temporal dynamics. To minimize the computational complexity of multimodal optimization, we use two-stage modeling, where the modality-specific subnetworks are first trained individually, and the whole network is then fine-tuned to jointly model multimodal data. On the large-scale ChaLearn First Impressions V2 challenge dataset, we evaluate the reliability of our model as well as investigate the informativeness of the considered modalities. Experimental results show the effectiveness of the proposed attention mechanism and the error consistency constraint. While the best performance is obtained using facial information among individual modalities, with the use of all four modalities our model achieves a mean accuracy of 91.8%, improving the state of the art in automatic personality analysis.
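The dimension-wise re-weighting fusion of modality representations can be illustrated with the following sketch. The modality count, embedding size, and the single linear gating layer are assumptions rather than the paper's layer design.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses modality-specific embeddings by predicting a per-dimension
    weight for each modality and taking the weighted sum; a sketch of
    re-weighting fusion, not the paper's exact layer."""
    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        self.gate = nn.Linear(num_modalities * dim, num_modalities * dim)
        self.num_modalities = num_modalities
        self.dim = dim

    def forward(self, embeddings):
        # embeddings: list of num_modalities tensors, each (batch, dim)
        stacked = torch.stack(embeddings, dim=1)             # (B, M, D)
        flat = stacked.flatten(start_dim=1)                   # (B, M*D)
        weights = self.gate(flat).view(-1, self.num_modalities, self.dim)
        weights = torch.softmax(weights, dim=1)               # modalities compete per dimension
        return (weights * stacked).sum(dim=1)                 # (B, D)

# Toy usage: scene, face, voice, and text embeddings of size 256
mods = [torch.randn(2, 256) for _ in range(4)]
print(AttentionFusion(4, 256)(mods).shape)  # torch.Size([2, 256])
```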
Item Unknown
Multimodal interaction in psychopathology (Association for Computing Machinery, 2020)
Onal-Ertugrul, I.; Cohn, J. F.; Dibeklioğlu, Hamdi
This paper presents an introduction to the Multimodal Interaction in Psychopathology workshop, held virtually in conjunction with the 22nd ACM International Conference on Multimodal Interaction on October 25th, 2020. The workshop has attracted submissions in the context of investigating multimodal interaction to reveal mechanisms and to assess, monitor, and treat psychopathology. Keynote speakers from diverse disciplines present an overview of the field from different vantage points and comment on future directions. Here we summarize the goals and the content of the workshop.

Item Unknown
User feedback-based online learning for intent classification (Association for Computing Machinery, 2023-10-09)
Gönç, Kaan; Sağlam, Baturay; Dalmaz, Onat; Çukur, Tolga; Kozat, Serdar; Dibeklioğlu, Hamdi
Intent classification is a key task in natural language processing (NLP) that aims to infer the goal or intention behind a user's query. Most existing intent classification methods rely on supervised deep models trained on large annotated datasets of text-intent pairs. However, obtaining such datasets is often expensive and impractical in real-world settings. Furthermore, supervised models may overfit or face distributional shifts when new intents, utterances, or data distributions emerge over time, requiring frequent retraining. Online learning methods based on user feedback can overcome this limitation, as they do not need access to intents while collecting data and adapt the model continuously. In this paper, we propose a novel multi-armed contextual bandit framework that leverages a text encoder based on a large language model (LLM) to extract the latent features of a given utterance and jointly learn multimodal representations of encoded text features and intents. Our framework consists of two stages: offline pretraining and online fine-tuning. In the offline stage, we train the policy on a small labeled dataset using a contextual bandit approach. In the online stage, we fine-tune the policy parameters using the REINFORCE algorithm with a user feedback-based objective, without relying on the true intents. We further introduce a sliding window strategy for simulating the retrieval of data samples during online training. This novel two-phase approach enables our method to efficiently adapt to dynamic user preferences and data distributions with improved performance. An extensive set of empirical studies indicates that our method significantly outperforms policies that omit either offline pretraining or online fine-tuning, while achieving competitive performance with a supervised benchmark trained on an order of magnitude more labeled data.
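A minimal sketch of the online fine-tuning idea, i.e., REINFORCE updates driven by binary user feedback on top of fixed utterance embeddings, is shown below. The encoder, reward definition, and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class IntentPolicy(nn.Module):
    """Softmax policy over intents on top of precomputed utterance embeddings."""
    def __init__(self, embed_dim: int, num_intents: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_intents)

    def forward(self, utterance_embedding):
        return torch.distributions.Categorical(logits=self.head(utterance_embedding))

policy = IntentPolicy(embed_dim=768, num_intents=10)
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)

def online_step(embedding, user_feedback_fn):
    """One online update: sample an intent, observe binary user feedback
    (1 = accepted, 0 = rejected), and apply the REINFORCE gradient.
    No true intent label is used."""
    dist = policy(embedding)
    action = dist.sample()
    reward = user_feedback_fn(action.item())
    loss = -dist.log_prob(action) * reward
    optim.zero_grad()
    loss.backward()
    optim.step()

# Toy usage with a random embedding and a stubbed feedback function
online_step(torch.randn(768), lambda intent: 1.0 if intent == 3 else 0.0)
```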
Item Unknown
Visual transformation aided contrastive learning for video-based kinship verification (IEEE, 2017-10)
Dibeklioğlu, Hamdi
Automatic kinship verification from facial information is a relatively new and open research problem in computer vision. This paper explores the possibility of learning an efficient facial representation for video-based kinship verification by exploiting the visual transformation between the facial appearance of kin pairs. To this end, a Siamese-like coupled convolutional encoder-decoder network is proposed. To reveal resemblance patterns of kinship while discarding the similarity patterns that can also be observed between people who do not have a kin relationship, a novel contrastive loss function is defined in the visual appearance space. For further optimization, the learned representation is fine-tuned using a feature-based contrastive loss. An expression matching procedure is employed in the model to minimize the negative influence of expression differences between kin pairs. Each kin video is analyzed using a sliding temporal window to leverage short-term facial dynamics. The effectiveness of the proposed method is assessed on seven different kin relationships using smile videos of kin pairs. On average, 93.65% verification accuracy is achieved, improving the state of the art.
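A generic contrastive loss of the kind referred to above can be sketched as follows: a standard margin-based formulation over paired embeddings that pulls kin pairs together and pushes non-kin pairs apart. This is a textbook sketch, not the paper's appearance-space or feature-based loss verbatim.

```python
import torch
import torch.nn.functional as F

def kinship_contrastive_loss(emb_a, emb_b, is_kin, margin=1.0):
    """Margin-based contrastive loss over paired embeddings.
    emb_a, emb_b: (batch, dim); is_kin: (batch,) with 1.0 for kin pairs."""
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = is_kin * dist.pow(2)                         # pull kin pairs together
    neg = (1.0 - is_kin) * F.relu(margin - dist).pow(2)  # push non-kin apart
    return 0.5 * (pos + neg).mean()

# Toy usage: 4 pairs of 128-dimensional embeddings, first two labeled as kin
a, b = torch.randn(4, 128, requires_grad=True), torch.randn(4, 128)
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])
print(kinship_contrastive_loss(a, b, labels))
```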