Browsing by Subject "Speech recognition"

Open Access
Capacity analysis of a PMR system with DAB downlink
(IEEE, 2003) Şengül, Ersin; Can, B.; Akar, Nail; İder, Yusuf Ziya; Köymen, Hayrettin
Several trunked private mobile radio (PMR) systems have been designed over the last decade, most of which have symmetric downlink and uplink channel capacities. These systems may not be spectrally efficient for group or broadcast-based voice and data calls, a common feature of PMR systems. We propose a new asymmetric PMR system comprising a wideband OFDM-based downlink and a narrowband uplink, which not only achieves better spectral efficiency but can also support high bit rate multimedia applications. The system is shown to have high trunking efficiency since all users are assumed to use the pool of channels available in the wideband downlink. In this paper, we study the performance and capacity of a private mobile radio system using a digital audio broadcasting (DAB) downlink. In particular, we study the efficiency of such a system for voice calls using voice activity detection and statistical multiplexing. Moreover, we show that the efficiency of the system can increase significantly if incoming calls that cannot find an available channel are allowed to wait a certain amount of time before occupying a channel.
Open Access
Classification of closed-and open-shell pistachio nuts using voice-recognition technology
(American Society of Agricultural and Biological Engineers, 2004) Çetin, A. Enis; Pearson, T. C.; Tewfik, A. H.
An algorithm using speech recognition technology was developed to distinguish pistachio nuts with closed shells from those with open shells. It was observed that upon impact with a steel plate, nuts with closed shells emit different sounds than nuts with open shells. Features extracted from the sound signals consisted of mel-cepstrum coefficients and eigenvalues obtained from the principal component analysis (PCA) of the autocorrelation matrix of the sound signals. Classification of a sound signal was performed by linearly combining the mel-cepstrum and PCA feature vectors. An important property of the algorithm is that it is easily trainable, as are most speech-recognition algorithms. During the training phase, sounds of nuts with closed shells and with open shells were used to obtain a representative vector of each class. During the recognition phase, the feature vector from the sample in question was compared with the representative vectors. The classification accuracy of closed-shell nuts was more than 99% on the validation set, which did not include the training set.
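The training and recognition phases described in this abstract can be illustrated with a minimal sketch. It assumes that the representative vector of a class is the mean of its training feature vectors and that comparison uses Euclidean distance; the abstract states neither, and the feature values below are made up for illustration.

```python
import math

def mean_vector(vectors):
    """Average equal-length feature vectors into one representative vector (assumed choice)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(sample, representatives):
    """Return the label whose representative vector is closest to the sample."""
    return min(representatives, key=lambda label: euclidean(sample, representatives[label]))

# Toy 3-dimensional feature vectors; hypothetical numbers, not data from the paper.
training = {
    "closed": [[1.0, 0.2, 0.1], [0.9, 0.3, 0.0]],
    "open": [[0.1, 0.8, 0.7], [0.2, 0.9, 0.6]],
}
representatives = {label: mean_vector(vs) for label, vs in training.items()}
print(classify([0.92, 0.28, 0.04], representatives))
```

In practice the feature vectors would be the combined mel-cepstrum and PCA features mentioned in the abstract rather than hand-written numbers.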
Open Access
Finding people frequently appearing in news
(Springer, 2006-07) Özkan, Derya; Duygulu, Pınar
We propose a graph-based method to improve the performance of person queries in large news video collections. The method benefits from the multi-modal structure of videos and integrates text and face information. Using the idea that a person appears more frequently when his/her name is mentioned, we first use the speech transcript text to limit our search space for a query name. Then, we construct a similarity graph with nodes corresponding to all of the faces in the search space, and the edges corresponding to similarity of the faces. With the assumption that the images of the query name will be more similar to each other than to other images, the problem is then transformed into finding the densest component in the graph corresponding to the images of the query name. The same graph algorithm is applied for detecting and removing the faces of the anchorpeople in an unsupervised way. The experiments are conducted on 229 news videos provided by NIST for TRECVID 2004. The results show that the proposed method outperforms text-only methods and provides cues for recognition of faces on a large scale. © Springer-Verlag Berlin Heidelberg 2006.
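Finding the densest component of a similarity graph can be approximated with greedy peeling: repeatedly remove the node with the fewest remaining edges and keep the intermediate node set with the highest density (edges divided by nodes). The sketch below is one standard approximation, not necessarily the authors' exact procedure. It assumes the similarity graph has already been thresholded into unweighted edges without self-loops, and the face labels are hypothetical.

```python
def densest_subgraph(edges):
    """Greedy peeling approximation of the densest subgraph (density = edges / nodes)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return set(), 0.0
    num_edges = sum(len(neighbours) for neighbours in adj.values()) // 2
    best_nodes, best_density = set(adj), num_edges / len(adj)
    while len(adj) > 1:
        # Drop the node with the smallest degree (ties broken by label).
        victim = min(adj, key=lambda node: (len(adj[node]), node))
        for neighbour in adj.pop(victim):
            adj[neighbour].discard(victim)
            num_edges -= 1
        density = num_edges / len(adj)
        if density > best_density:
            best_nodes, best_density = set(adj), density
    return best_nodes, best_density

# Hypothetical faces: a, b, c, d resemble each other; e and f are loosely attached.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("a", "e"), ("e", "f")]
nodes, density = densest_subgraph(edges)
print(sorted(nodes), density)
```

The mutually similar faces survive the peeling while the loosely connected ones are removed first, which mirrors the assumption that images of the query person resemble each other more than they resemble other faces.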
Open Access
Interframe differential vector coding of line spectrum frequencies
(IEEE, 1993-04) Erzin, Engin; Çetin, A. Enis
Line Spectrum Frequencies (LSF's) uniquely represent the Linear Predictive Coding (LPC) filter of a speech frame. In many vocoders LSF's are used to encode the LPC parameters. In this paper, an interframe differential coding scheme is presented for the LSF's. The LSF's of the current speech frame are predicted by using both the LSF's of the previous frame and some of the LSF's of the current frame. Then, the difference vector resulting from prediction is vector quantized.
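A simplified sketch of interframe differential coding follows. It makes several assumptions that depart from the scheme in the abstract: the prediction uses only the previous frame (not LSFs of the current frame), the predictor is the previous reconstructed value itself, and a uniform scalar quantizer stands in for the vector quantizer. The LSF values and step size are made up.

```python
def quantize(value, step):
    """Uniform scalar quantizer: round to the nearest multiple of step."""
    return round(value / step) * step

def encode_frames(frames, step=0.01):
    """Return quantized prediction residuals, one list per frame."""
    prev = [0.0] * len(frames[0])  # the decoder starts from the same state
    residuals = []
    for frame in frames:
        res = [quantize(x - p, step) for x, p in zip(frame, prev)]
        residuals.append(res)
        # Predict from the reconstructed frame so encoder and decoder stay in sync.
        prev = [p + r for p, r in zip(prev, res)]
    return residuals

def decode_frames(residuals):
    """Rebuild the frames by accumulating the quantized residuals."""
    prev = [0.0] * len(residuals[0])
    frames = []
    for res in residuals:
        prev = [p + r for p, r in zip(prev, res)]
        frames.append(prev)
    return frames

# Hypothetical LSF frames in radians (three LSFs per frame for brevity).
frames = [[0.30, 0.62, 1.10], [0.32, 0.60, 1.13], [0.31, 0.65, 1.12]]
decoded = decode_frames(encode_frames(frames, step=0.01))
print(decoded)
```

Because consecutive speech frames are similar, the residuals after the first frame are small, which is what makes differential coding attractive.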
Open Access
Large vocabulary speech recognition in noisy environments
(1998) Jabloun, Firas
A new set of speech feature parameters based on multirate subband analysis and the Teager Energy Operator (TEO) is developed. The speech signal is first divided into nonuniform subbands in mel-scale using a multirate filter-bank, then the Teager energies of the subsignals are estimated. Finally, the feature vector is constructed by log-compression and inverse DCT computation. The new feature parameters (TEOCEP) have a robust speech recognition performance in car engine noise which has a low pass nature. In this thesis, we also present some solutions to the problem of large vocabulary speech recognition. Triphone-based Hidden Markov Models (HMM) are used to model the vocabulary words. Although the straightforward parallel search strategy gives good recognition performance, the processing time required is found to be long and impractical. Therefore another search strategy with similar performance is described. Subvocabularies are developed during the training session to reduce the total number of words considered in the search process. The search is then performed in a tree structure by investigating one subvocabulary instead of all the words.
Open Access
A large vocabulary speech recognition system for Turkish
(1999) Yılmaz, Cemal
This thesis presents a large vocabulary isolated word speech recognition system for Turkish. The triphones modeled by three-state Hidden Markov Models (HMM) are used as the smallest unit for the recognition. The HMM model of a word is constructed by using the HMM models of the triphones which make up the word. In the training stage, the word model is trained as a whole and then each HMM model of the triphones is extracted from the word model and it is stored individually. In the recognition stage, HMM models of triphones are used to construct the HMM models of the words in the dictionary. In this way, the words that are not trained can be recognized in the recognition stage. A new dictionary model based on trie structure is introduced for Turkish with a new search strategy for a given word. This search strategy performs breadth-first traversal on the trie and uses the appropriate region of the speech signal at each level of the trie. Moreover, it is integrated with a pruning strategy to improve both the system response time and recognition rate.
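A trie dictionary with breadth-first traversal can be sketched as follows. This is only an illustration of the data structure: it assumes letters as the trie units and omits the acoustic scoring and pruning that the thesis integrates into the search. The word list is a small hypothetical example.

```python
from collections import deque

class TrieNode:
    def __init__(self):
        self.children = {}    # maps a unit (here a letter) to the child node
        self.is_word = False  # True if the path from the root spells a dictionary word

def build_trie(words):
    """Insert every word into a trie and return the root node."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def words_breadth_first(root):
    """Visit the trie level by level and yield stored words in that order."""
    queue = deque([(root, "")])
    while queue:
        node, prefix = queue.popleft()
        if node.is_word:
            yield prefix
        for ch in sorted(node.children):
            queue.append((node.children[ch], prefix + ch))

root = build_trie(["ev", "evde", "evler", "el"])
print(list(words_breadth_first(root)))
```

In a recognizer, a pruning step would discard low-scoring nodes at each level before their children are queued, so that only promising prefixes are expanded.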
Open Access
Line spectral frequency representation of subbands for speech recognition
(1995) Erzin, E.; Çetin, A.E.
In this paper, a new set of speech feature parameters is constructed from subband analysis based Line Spectral Frequencies (LSFs). The speech signal is divided into several subbands and the resulting subsignals are represented by LSFs. The performance of the new speech feature parameters, SUBLSFs, is compared with the widely used Mel Scale Cepstral Coefficients (MELCEPs). SUBLSFs are observed to be more robust than the MELCEPs in the presence of car noise. © 1995.
Open Access
Mel-cepstral methods for image feature extraction
(IEEE, 2010) Çakır, Serdar; Çetin, A. Enis
A feature extraction method based on two-dimensional (2D) mel-cepstrum is introduced. The concept of one-dimensional (1D) mel-cepstrum which is widely used in speech recognition is extended to 2D in this article. Feature matrices resulting from the 2D mel-cepstrum, Fourier LDA, 2D PCA and original image matrices are converted to feature vectors and individually applied to a Support Vector Machine (SVM) classification engine for comparison. The AR face database, ORL database, Yale database and FRGC version 2 database are used in experimental studies, which indicate that recognition rates obtained by the 2D mel-cepstrum method are superior to the recognition rates obtained using Fourier LDA, 2D PCA and ordinary image matrix based face recognition. This indicates that 2D mel-cepstral analysis can be used in image feature extraction problems. © 2010 IEEE.
Open Access
New methods for robust speech recognition
(1995) Erzin, Engin
New methods of feature extraction, end-point detection and speech enhancement are developed for a robust speech recognition system. The methods of feature extraction and end-point detection are based on wavelet analysis or subband analysis of the speech signal. Two new sets of speech feature parameters, SUBLSF’s and SUBCEP’s, are introduced. Both parameter sets are based on subband analysis. The SUBLSF feature parameters are obtained via linear predictive analysis on subbands. These speech feature parameters can produce better results than the full-band parameters when the noise is colored. The SUBCEP parameters are based on wavelet analysis or equivalently the multirate subband analysis of the speech signal. The SUBCEP parameters also provide robust recognition performance by appropriately deemphasizing the frequency bands corrupted by noise. It is experimentally observed that the subband analysis based feature parameters are more robust than the commonly used full-band analysis based parameters in the presence of car noise. The α-stable random processes can be used to model the impulsive nature of the public network telecommunication noise. Adaptive filtering methods are developed for α-stable random processes. Adaptive noise cancelation techniques are used to reduce the mismatch between training and testing conditions of the recognition system over telephone lines. Another important problem in isolated speech recognition is to determine the boundaries of the speech utterances or words. Precise boundary detection of utterances improves the performance of speech recognition systems. A new distance measure based on the subband energy levels is introduced for endpoint detection.
Open Access
Object tracking under illumination variations using 2D-cepstrum characteristics of the target
(IEEE, 2010) Cogun, Fuat; Çetin, A. Enis
Most video processing applications require object tracking as it is the base operation for real-time implementations such as surveillance, monitoring and video compression. Therefore, accurate tracking of an object under varying scene conditions is crucial for robustness. It is well known that illumination variations on the observed scene and target are an obstacle to robust object tracking, causing the tracker to lose the target. In this paper, a 2D-cepstrum based approach is proposed to overcome this problem. Cepstral domain features extracted from the target region are introduced into the covariance tracking algorithm and it is experimentally observed that 2D-cepstrum analysis of the target object provides robustness to varying illumination conditions. Another contribution of the paper is the development of the co-difference matrix based object tracking instead of the recently introduced covariance matrix based method. ©2010 IEEE.
Open Access
Prefix-suffix based statistical language models of Turkish
(2001-07) Topkara, Umut
Open Access
Recognition of vessel acoustic signatures using non-linear teager energy based features
(IEEE, 2016-10) Can, Gökmen; Akbaş, Cem Emre; Çetin, A. Enis
This paper proposes a vessel recognition and classification system based on vessel acoustic signatures. Teager Energy Operator (TEO) based Mel Frequency Cepstral Coefficients (MFCC) are used for the first time in Underwater Acoustic Signal Recognition (UASR) to identify platforms from the acoustic noise they generate. TEO based MFCC (TEO-MFCC), being more robust in noisy conditions than conventional MFCC, provides a better estimate of platform energy. Conventionally, acoustic noise is recognized by sonar operators who listen to audio signals received by ship sonars. The aim of this work is to replace this conventional human-based recognition system with a TEO-MFCC features-based classification system. TEO is applied to the short-time Fourier transform (STFT) of acoustic signal frames and a Mel-scale filter bank is used to obtain the Mel Teager-energy spectrum. The feature vector is constructed by the discrete cosine transform (DCT) of the logarithmic Mel Teager-energy spectrum. The obtained spectrum is transformed into cepstral coefficients that are labeled as TEO-MFCC. This analysis and implementation are carried out with datasets of 24 different noise recordings that belong to 10 separate classes of vessels. These datasets are partially provided by the National Park Service (NPS). Artificial Neural Networks (ANN) are used as a classification method. Experimental results demonstrate that TEO-MFCC achieves 99.5% accuracy in classification of vessel noises. © 2016 IEEE.
Open Access
Recognizing objects and scenes in news videos
(Springer, 2006-07) Baştan, Muhammet; Duygulu, Pınar
We propose a new approach to recognize objects and scenes in news videos motivated by the availability of large video collections. This approach considers the recognition problem as the translation of visual elements to words. The correspondences between visual elements and words are learned using methods adapted from statistical machine translation and used to predict words for particular image regions (region naming), for entire images (auto-annotation), or to associate the automatically generated speech transcript text with the correct video frames (video alignment). Experimental results are presented on the TRECVID 2004 data set, which consists of about 150 hours of news videos associated with manual annotations and speech transcript text. The results show that the retrieval performance can be improved by associating visual and textual elements. Also, an extensive analysis of features is provided and a method to combine features is proposed. © Springer-Verlag Berlin Heidelberg 2006.
Open Access
Statistical modeling of agglutinative languages
(2000) Hakkani-Tür, Dilek Z.
Recent advances in computer hardware and availability of very large corpora have made the application of statistical techniques to natural language processing a possible, and a very appealing research area. Many good results have been obtained by applying these techniques to English (and similar languages) in parsing, word sense disambiguation, part-of-speech tagging, and speech recognition. However, languages like Turkish, which have a number of characteristics that differ from English have mainly been left unstudied. Turkish presents an interesting problem for statistical modeling. In contrast to languages like English, for which there is a very small number of possible word forms with a given root word, for languages like Turkish or Finnish with very productive agglutinative morphology, it is possible to produce thousands of forms for a given root word. This causes a serious data sparseness problem for language modeling. This Ph.D. thesis presents the results of research and development of statistical language modeling techniques for Turkish, and tests such techniques on basic applications of natural language and speech processing like morphological disambiguation, spelling correction, and n-best list rescoring for speech recognition. For all tasks, the use of units smaller than a word for language modeling was tested in order to reduce the impact of the data sparsity problem. For morphological disambiguation, we examined n-gram language models and maximum entropy models using inflectional groups as modeling units. Our results indicate that using smaller units is useful for modeling languages with complex morphology and n-gram language models perform better than maximum entropy models. For n-best list rescoring and spelling correction, the n-gram language models that were developed for morphological disambiguation, and their approximations via prefix-suffix models, were used. The prefix-suffix models performed very well for n-best list rescoring, but for spelling correction, they could not beat word-based models in terms of accuracy.
Open Access
Subband analysis for robust speech recognition in the presence of car noise
(IEEE, 1995-05) Çetin, A. Enis; Yardımcı, Y.; Erzin, Engin
In this paper, a new set of speech feature representations for robust speech recognition in the presence of car noise is proposed. These parameters are based on subband analysis of the speech signal. Line Spectral Frequency (LSF) representation of the Linear Prediction (LP) analysis in subbands and cepstral coefficients derived from subband analysis (SUBCEP) are introduced, and the performances of the new feature representations are compared to mel scale cepstral coefficients (MELCEP) in the presence of car noise. Subband analysis based parameters are observed to be more robust than the commonly employed MELCEP representations.
Open Access
Teager energy based feature parameters for robust speech recognition in car noise
(IEEE, Piscataway, NJ, United States, 1999) Jabloun, F.; Çetin, A. Enis
In this paper, a new set of speech feature parameters based on multirate signal processing and the Teager Energy Operator is developed. The speech signal is first divided into nonuniform subbands in mel-scale using a multirate filter-bank, then the Teager energies of the subsignals are estimated. Finally, the feature vector is constructed by log-compression and inverse DCT computation. The new feature parameters have a robust speech recognition performance in car engine noise which is low pass in nature.
Open Access
Teager energy based feature parameters for speech recognition in car noise
(Institute of Electrical and Electronics Engineers, 1999-10) Jabloun, F.; Çetin, A. Enis; Erzin, E.
In this letter, a new set of speech feature parameters based on multirate signal processing and the Teager energy operator is introduced. The speech signal is first divided into nonuniform subbands in mel-scale using a multirate filterbank, then the Teager energies of the subsignals are estimated. Finally, the feature vector is constructed by log-compression and inverse discrete cosine transform (DCT) computation. The new feature parameters have robust speech recognition performance in the presence of car engine noise.
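The last two stages of this feature pipeline can be sketched in a few lines. The discrete Teager energy operator is ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The sketch assumes the subband signals are already available (the multirate mel-scale filterbank is omitted), averages the Teager energy per band, and uses the cosine transform form common in mel-cepstrum computation; the exact transform convention of the letter is an assumption. The two test tones are hypothetical stand-ins for subband signals.

```python
import math

def teager_energy(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n-1]*x[n+1] for interior samples."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def log_dct_features(band_energies, num_coeffs):
    """Log-compress per-band energies, then apply a cosine transform across bands."""
    logs = [math.log(energy) for energy in band_energies]
    n_bands = len(logs)
    return [
        sum(value * math.cos(math.pi * k * (i + 0.5) / n_bands)
            for i, value in enumerate(logs))
        for k in range(num_coeffs)
    ]

# Hypothetical subband signals: two pure tones at different digital frequencies.
subbands = [
    [math.cos(0.3 * n) for n in range(50)],
    [0.5 * math.cos(1.0 * n) for n in range(50)],
]
band_energies = []
for signal in subbands:
    psi = teager_energy(signal)
    band_energies.append(sum(psi) / len(psi))  # mean Teager energy of the band
print(log_dct_features(band_energies, num_coeffs=2))
```

For a pure tone A·cos(Ωn) the operator returns the constant A²·sin²(Ω), so the energy estimate reflects both amplitude and frequency rather than amplitude alone.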
Open Access
Time-scale wavelet scattering using hyperbolic tangent function for vessel sound classification
(IEEE, 2017-08-09) Can, Gökmen; Akbaş, Cem Emre; Çetin, A. Enis
We introduce a time-frequency scattering method using the hyperbolic tangent function for vessel sound classification. The sound data is wavelet transformed using a two-channel filter-bank and the filter-bank outputs are scattered using the tanh function. A feature vector similar to the mel-scale cepstrum is obtained after a wavelet packet transform-like structure approximating the mel-frequency scale. Feature vectors of vessel sounds are classified using a support vector machine (SVM). Experimental results are presented and the new feature extraction method produces better classification results than the ordinary Mel-Frequency Cepstral Coefficients (MFCC) vectors. © EURASIP 2017.
Open Access
Two-dimensional Mellin and mel-cepstrum for image feature extraction
(Springer, Dordrecht, 2010) Çakır, Serdar; Çetin, A. Enis
An image feature extraction method based on two-dimensional (2D) Mellin cepstrum is introduced. The concept of one-dimensional (1D) mel-cepstrum which is widely used in speech recognition is extended to two dimensions using both the ordinary 2D Fourier Transform and the Mellin transform in this article. The resultant feature matrices are applied to two different classifiers (Common Matrix Approach and Support Vector Machine) to test the performance of the mel-cepstrum and Mellin-cepstrum based features. Experimental studies indicate that recognition rates obtained by the 2D mel-cepstrum based method are superior to the recognition rates obtained using 2D PCA and ordinary image matrix based face recognition in both classifiers. © 2011 Springer Science+Business Media B.V.