Attended end-to-end architecture for age estimation from facial expression videos

buir.contributor.author: Dibeklioğlu, Hamdi
dc.citation.epage: 1984
dc.citation.spage: 1972
dc.citation.volumeNumber: 29
dc.contributor.author: Pei, W.
dc.contributor.author: Dibeklioğlu, Hamdi
dc.contributor.author: Baltrušaitis, T.
dc.date.accessioned: 2021-02-18T10:21:54Z
dc.date.available: 2021-02-18T10:21:54Z
dc.date.issued: 2020
dc.department: Department of Computer Engineering
dc.description.abstract: The main challenges of age estimation from facial expression videos lie not only in modeling the static facial appearance, but also in capturing the temporal facial dynamics. Traditional approaches to this problem construct handcrafted features to separately capture the discriminative information in facial appearance and in facial dynamics, which requires sophisticated feature refinement and framework design. In this paper, we present an end-to-end architecture for age estimation, called the Spatially-Indexed Attention Model (SIAM), which simultaneously learns both the appearance and the dynamics of age from raw videos of facial expressions. Specifically, we employ convolutional neural networks to extract effective latent appearance representations and feed them into recurrent networks to model the temporal dynamics. More importantly, we leverage attention models for salience detection in both the spatial domain, within each individual frame, and the temporal domain, across the whole video. We design a spatially-indexed attention mechanism among the convolutional layers to extract the salient facial regions in each frame, and a temporal attention layer to assign an attention weight to each frame. This two-pronged approach not only improves performance by allowing the model to focus on informative frames and facial areas, but also offers an interpretable correspondence between spatial facial regions, temporal frames, and the task of age estimation. We demonstrate the strong performance of our model on a large, gender-balanced database of 400 subjects with ages spanning 8 to 76 years. Experiments show that, given sufficient training data, our model significantly outperforms state-of-the-art methods.
dc.identifier.doi: 10.1109/TIP.2019.2948288
dc.identifier.issn: 1057-7149
dc.identifier.uri: http://hdl.handle.net/11693/75442
dc.language.iso: English
dc.publisher: IEEE
dc.relation.isversionof: https://dx.doi.org/10.1109/TIP.2019.2948288
dc.source.title: IEEE Transactions on Image Processing
dc.subject: Age estimation
dc.subject: End-to-end
dc.subject: Attention
dc.subject: Facial dynamics
dc.title: Attended end-to-end architecture for age estimation from facial expression videos
dc.type: Article
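
To make the pipeline described in the abstract concrete, the sketch below illustrates its overall structure in PyTorch: per-frame convolutional features, a spatial attention map over feature-map locations, a recurrent network over the frame sequence, and a temporal attention layer that weights frames before age regression. All layer sizes, the GRU choice, and the 1x1-convolution attention scoring are illustrative assumptions for this sketch; the paper's actual SIAM architecture and attention formulations differ in detail.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SIAMSketch(nn.Module):
    """Minimal sketch of a CNN + RNN age estimator with spatial and
    temporal attention, loosely following the abstract's description.
    Layer sizes and attention scoring are assumptions, not the paper's."""

    def __init__(self, hidden_dim=128):
        super().__init__()
        # Convolutional feature extractor for per-frame appearance.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Spatial attention: a 1x1 conv scores each feature-map location
        # (a stand-in for the paper's spatially-indexed attention).
        self.spatial_score = nn.Conv2d(64, 1, kernel_size=1)
        # Recurrent network modeling temporal facial dynamics.
        self.rnn = nn.GRU(input_size=64, hidden_size=hidden_dim, batch_first=True)
        # Temporal attention: one scalar weight per frame.
        self.temporal_score = nn.Linear(hidden_dim, 1)
        # Scalar age regressor on the attention-pooled video summary.
        self.regressor = nn.Linear(hidden_dim, 1)

    def forward(self, video):                            # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.conv(video.flatten(0, 1))           # (B*T, 64, h, w)
        # Spatial attention over locations, then weighted pooling.
        attn = F.softmax(self.spatial_score(feats).flatten(2), dim=-1)   # (B*T, 1, h*w)
        pooled = (feats.flatten(2) * attn).sum(-1)       # (B*T, 64)
        states, _ = self.rnn(pooled.view(B, T, -1))      # (B, T, hidden_dim)
        # Temporal attention over frames, then weighted summary.
        w = F.softmax(self.temporal_score(states), dim=1)   # (B, T, 1)
        summary = (states * w).sum(1)                    # (B, hidden_dim)
        return self.regressor(summary).squeeze(-1)       # (B,) age estimates

# Example: a batch of two 16-frame face videos at 64x64 resolution.
model = SIAMSketch()
ages = model(torch.randn(2, 16, 3, 64, 64))
print(ages.shape)  # torch.Size([2])

Because the attention weights are explicit softmax distributions over locations and over frames, they can be inspected directly, which is the source of the interpretable correspondence the abstract describes.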

Files

Original bundle

Name: Attended_End-to-End_Architecture_for_Age_Estimation_From_Facial_Expression_Videos.pdf
Size: 2.7 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon at submission