Citation: Vani H Y, Anusuya M A. Improving speech recognition using bionic wavelet features[J]. AIMS Electronics and Electrical Engineering, 2020, 4(2): 200-215. doi: 10.3934/ElectrEng.2020.2.200
One of the most important branches of speech processing is improving recognition for noisy signals, i.e., speech enhancement and speech recognition. Reducing noise in a speech signal is a complex process. The main objective of speech enhancement is to find optimal estimates of speech features. Wavelet transforms are especially useful for obtaining efficient features because they are among the most prominent techniques for analyzing non-stationary speech signals in both the time and frequency domains.
Using wavelets [1], noise can be reduced by appropriately selecting a threshold for the wavelet coefficients. The threshold is subtracted from the noisy wavelet coefficients to obtain a noise-reduced signal. Since the features are computed from scalograms, they are more prominent than features obtained with the short-time Fourier transform.
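As an illustrative sketch (not the paper's exact procedure), soft thresholding of wavelet detail coefficients with the universal threshold can be written as follows in Python using PyWavelets; the wavelet name, decomposition level and threshold rule are assumptions:

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db11", level=4):
    """Soft-threshold wavelet denoising with the universal (sqtwolog) threshold."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Noise level estimated from the finest detail band via the MAD rule.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(signal)))
    # Shrink every detail band; keep the approximation band untouched.
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)
```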
There are two types of wavelet transform: the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT).
The discrete wavelet transform decomposes a signal into approximation and detail components by shifting and scaling copies of a basic wavelet to the required level. The bionic wavelet transform (BWT) is proposed and used in the present work because it resembles the auditory model of the human cochlea [2,3,4,5,6,7] and can be easily correlated with the MFCC feature-extraction process. This helps in extracting the prominent features of a noisy speech signal.
The CWT is used to obtain simultaneous time-frequency analysis. It is preferred because it is based on the auditory model of the human cochlea [2,3,4,5,6,7].
In this paper, we propose an optimal feature-selection procedure using the BWT and MFCC for convolutionally noisy speech data, applied to word recognition. To calculate the optimal features, the central frequencies of the Morlet [7], Daubechies, biorthogonal and Coiflet mother wavelets are adapted to the BWT together with thresholding.
Thresholding on the BWT is calculated using the following selection methods [8]: ⅰ) Stein's unbiased risk estimate (SURE), ⅱ) the heuristic threshold selection rule, ⅲ) the fixed selection rule, ⅳ) minimax, and ⅴ) the sqtwolog threshold. To handle noise in the signal, the SURE threshold selection procedure has been adopted with the BWT to estimate recognition accuracy.
The paper is organized as follows: Section 2 discusses the work carried out in the literature using bionic wavelets. Section 3 introduces the continuous bionic wavelet. Section 4 presents the procedure adopted for converting the continuous wavelet to a discrete one. Section 5 discusses the dataset used for experimentation. The proposed system model is discussed in Section 6 with results. The performance analysis of the different classifiers is discussed in Section 7. Section 8 presents observations made during the simulation process. The last section discusses the conclusion and future enhancements.
Extracting optimal features plays a major role in classification and recognition. Many studies show that bionic wavelets with the Morlet mother wavelet are used for de-noising the speech signal by enhancing the signal component. At present, features can be extracted in three ways: 1) time-domain features, 2) frequency-domain features, and 3) features from the raw waveform. MFCC is the most popular frequency-domain method, and the raw-waveform approach is now gaining ground in machine-learning models. MFCC is well suited to clean speech; making it more robust to noisy data is also addressed in this paper. To this end, bionic wavelets are used for de-noising, and MFCC is made robust to convolutionally noisy speech data.
The bionic wavelet is made adaptive by various means, viz. changing the 'K' factor, using different hard/soft thresholding methods, and applying various base/central frequencies. The following related work applies bionic wavelets to de-noising speech data. A. Garg and O. P. Sahu [9] proposed a method to discretize the bionic wavelet using the CWT and ICWT with Morlet as the mother wavelet.
Fei Chen [10] proposed an adaptive DBWT by changing the T-function of the BWT and splitting the dyadic tiling map of the DWT, which uses quadrature-mirror filters, into a DBWT tiling map for decomposition. M. Talbi [11] proposed an entropy technique for the BWT to identify, for each coefficient, the two sub-bands with minimal entropy.
Cao Bin-Fang [12] proposed a hierarchical-threshold bionic wavelet method based on particle swarm optimization (PSO). The noisy speech signal is decomposed using the bionic wavelet transform, and PSO optimizes the thresholds. High-frequency noise separated by the bionic wavelet transform is fed as input to an adaptive filter. The experimental work illustrates speech enhancement under various SNR conditions.
A detailed analysis was made by Yang Xi et al. to understand the behavior of the bionic wavelet with additive noise at various dB levels. It clearly explains the use of the bionic wavelet with Morlet as the mother wavelet for removing noise of various dB levels from a speech signal. Yao and Zhang proposed an adaptive bionic wavelet with a Morlet base frequency ω0 = 15165.4 Hz for the mother wavelet, which is suited to the human auditory system.
Mourad [13] used MSS-MAP estimation with the wavelet transform and evaluated it with four tests (SNR, segmental SNR, the Itakura measure, and perceptual evaluation) for various noise types and levels. A new speech-enhancement procedure based on improved correlation-function processing of bionic wavelet coefficients was proposed by Wu Li-ming [14].
Speech recognition for Arabic words is demonstrated by Ben-Nasr [15]. Feature extraction uses MFCC with the bionic wavelet; delta-delta coefficients are used to increase the recognition rate, and classification is done with a feedforward back-propagation neural network. Zehtabian [16] proposed a speech-enhancement technique using the BWT and singular value decomposition, and showed that SVD outperforms the BWT at higher SNRs.
Liu Yan [17] proposed a de-noising algorithm based on sub-band spectral entropy with the bionic wavelet transform. They showed that the sub-band spectrum is effective at detecting the end point of a speech signal and hence at distinguishing speech from noise. Their experiments demonstrate that the sub-band entropy de-noising method is superior to the Wiener filter algorithm. Pritamdas [18] focused on the continuous wavelet transform and coefficient thresholding for speech enhancement, adapting the thresholds and wavelet transform scales.
From the literature survey, it is observed that much work has been reported on the bionic wavelet for speech enhancement, with thresholding and rescaling procedures used for converting continuous to discrete wavelet coefficients, but for additive noise only. In this paper, a procedure to convert the continuous wavelet to a discrete one based on the central frequency is proposed, together with a new feature-extraction technique and a procedure to reduce convolutional noise.
To the best of our knowledge, this work is unique in de-noising convolutional noise at various levels. The next section describes the characteristics of the bionic wavelet.
The WT technique [19,20,21,22] is an alternative to the STFT. When the two are compared visually, the scalograms of the WT represent the formant frequencies and harmonic structure of speech better; hence the WT is identified as one of the prominent methods for handling non-stationary signals. The CWT is fixed with a base scale [23] of $2^{1/m}$, where m is an integer greater than 1 giving the number of "voices per octave". Different scales are obtained by raising this base scale to positive integer powers, i.e., $2^{k/m}$ with k = 1, 2, 3, .... The translation parameter of the CWT is discretized to integer values, represented by l. The resulting discretized wavelets for the CWT are given by Eq. 1.
$$\frac{1}{2^{k/m}}\,\Psi\!\left(\frac{n-l}{2^{k/m}}\right) \qquad (1)$$
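For example, the scale grid of Eq. 1 can be generated as follows; the choices of m (voices per octave) and the number of octaves are illustrative:

```python
import numpy as np

m = 8                         # voices per octave (illustrative choice)
k = np.arange(1, 4 * m + 1)   # four octaves, for example
scales = 2.0 ** (k / m)       # base scale 2^(1/m) raised to integer powers
```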
The bionic wavelet transform (BWT) is an adaptive wavelet transform based on a model of the active biological auditory system [24]. The BWT decomposition [2] is perceptually scaled and adaptive, and has the following properties:
ⅰ) High sensitivity and selectivity
ⅱ) Signal with determined energy distribution
ⅲ) Can be reconstructed
The resolution of the bionic wavelet transform is adjusted according to the signal frequency and the instantaneous amplitude together with its first-order differential.
This section discusses the mechanism adopted to convert the continuous wavelet to a discrete one. The conversion uses discrete thresholding together with the central (base) frequencies of different mother wavelets.
The db11, coif5 and bior3.5 wavelets are considered, with central frequencies of 0.67, 0.68 and 1.04 Hz respectively. The center frequency is calculated using MATLAB's centfrq function [25].
For the Morlet wavelet, $\omega_m = \omega_0/(1.1623)^m$, with m varying from 1 to 22. For the other wavelets, MATLAB's centfrq function is used.
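As a sketch, the BWT scale centers and a PyWavelets analogue of centfrq can be computed as follows; note that pywt.central_frequency returns a normalized (dimensionless) center frequency, and ω0 is the Morlet base frequency quoted earlier:

```python
import numpy as np
import pywt

# BWT scale centers for the Morlet-based decomposition.
omega0 = 15165.4                                  # Hz, from the BWT literature
centers = omega0 / 1.1623 ** np.arange(1, 23)     # m = 1 .. 22

# PyWavelets analogue of MATLAB's centfrq for the other mother wavelets.
for name in ("db11", "coif5", "bior3.5"):
    print(name, pywt.central_frequency(name))     # normalized center frequency
```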
All the wavelets possess different characteristics; hence, in addition to Morlet, the following three wavelets are considered:
db11: asymmetric, orthogonal, biorthogonal.
coif5: symmetric, orthogonal, biorthogonal.
bior3.5: symmetric, non-orthogonal, biorthogonal.
22 scales are considered for the BWT irrespective of the center frequency. These wavelets are preferred because they mimic the mel-scale mapping of the MFCC procedure [26] and are designed to match the basilar-membrane spacing, i.e., they are based on a nonlinear perceptual model of the auditory system.
This parameter determines the number of levels used to reduce redundant information in the CWT during discretization of the wavelet. The following thresholding mechanisms are considered, with the levels chosen by trial and error as listed in Table 1. The levels are fixed based on the thresholds obtained for the signal. The ways of calculating the thresholds are discussed below:
Sl. No | Thresholding | No. of levels based on thresholding | SNR before (dB) | SNR after (dB)
--- | --- | --- | --- | ---
1 | SURE | 2 | -12.52 | -3.30
2 | Heuristic variant | 5 | -12.52 | -11.96
3 | Sqtwolog | 4 | -12.62 | -11.96
4 | Minimax | 4 | -12.52 | -10.24
Sqtwolog:
$$thr = \sigma_k\sqrt{2\log(p)}$$
where $\sigma_k$ is derived from the mean absolute deviation (MAD) and p is the length of the noisy signal. The MAD-based estimate is
$$\sigma_k = \frac{MAD_k}{0.6745} = \frac{\operatorname{median}|\omega|}{0.6745}$$
where ω is the wavelet coefficient vector and k the scale index of the wavelet coefficients.
Rigrsure:
$$th_k = \sigma_k\sqrt{\omega_c}$$
where $\omega_c$ is the c-th squared wavelet coefficient (the coefficient at minimal risk) chosen from the vector ω = [ω1, ω2, ..., ωc], and σ is the standard deviation of the noisy signal.
Heursure: the Heursure threshold selection rule is a combination of the Sqtwolog and Rigrsure methods.
Minimax:
$$th_k = \begin{cases} \sigma\,(0.3936 + 0.1829\,\log_2 M), & M > 32 \\ 0, & M \le 32 \end{cases}$$
where ω is a vector of wavelet coefficients at a given scale and M is the length of the signal vector.
Algorithm
Steps for Discretizing Bionic Wavelet.
Step 1: Read the speech signal
Step 2: Multiply each wavelet coefficient by 'K' as shown in Eq. 2
$$BWT_f(a,\tau) = K \cdot WT_f(a,\tau) \qquad (2)$$
Step 3: Select the thresholding function with the highest SNR using the MATLAB function thselect.
Step 4: Apply the base/central frequencies of the various mother wavelets using centfrq(wname).
Step 5: Divide the modified bionic wavelet coefficients by the 'K' factor to recover the coefficients, and reconstruct by taking the inverse continuous wavelet transform, where K is approximated by Eq. 3
$$\frac{1.7772\,T_0}{\sqrt{T^2+1}} \qquad (3)$$
Step 6: Compute the inverse continuous wavelet transform.
Step 7: Obtain the mel-frequency cepstral coefficients of the de-noised signal [26].
Step 8: The Bionic-MFCC features obtained are listed below.
Step 9: Classify these features using the SVM, ANN and LSTM classifiers (a pipeline sketch of steps 1-8 follows).
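A hedged end-to-end sketch of Steps 1-8 in Python; a discrete wavelet pair stands in for the paper's CWT/ICWT (PyWavelets has no inverse CWT), and K, the level and the wavelet name are illustrative:

```python
import numpy as np
import pywt
import librosa

def bionic_mfcc(path, K=1.5, wavelet="db11", level=4, sr=16000, n_mfcc=12):
    """Sketch of Steps 1-8: K-scaled wavelet thresholding followed by MFCC."""
    x, sr = librosa.load(path, sr=sr)                    # Step 1: read signal
    coeffs = pywt.wavedec(x, wavelet, level=level)
    coeffs = [K * c for c in coeffs]                     # Step 2: BWT = K * WT
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))          # Step 3: sqtwolog threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    coeffs = [c / K for c in coeffs]                     # Step 5: undo the K factor
    denoised = pywt.waverec(coeffs, wavelet)             # Step 6: reconstruct
    return librosa.feature.mfcc(y=denoised, sr=sr, n_mfcc=n_mfcc)  # Steps 7-8
```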
The following sample features show that the Bionic-MFCC features are better than bare MFCC features.
Noisy speech signal: the MFCC coefficients computed without wavelets for the noisy signal are presented first.
Coefficients after applying Wavelets
Clean speech signal: the MFCC coefficients computed without wavelets for the clean signal are presented next.
Coefficients after applying Wavelets
The above presents the weighted features obtained from steps 1 to 8 of the algorithm. It is clear that the wavelet-weighted feature values are better for both clean and noisy speech signals.
Two datasets are considered: the free spoken digit dataset (FSDD) [27] and a Kannada dataset (Table 2), with recordings of spoken digits and words sampled at 8 kHz and 16 kHz respectively. The recordings are trimmed so that they have near-minimal silence at the beginning and end. FSDD consists of English pronunciations of the numbers one to nine from four different speakers; in total, 900 signals (100 per digit) are collected. The second dataset consists of isolated Kannada words, listed in Table 3. These signals are sampled at 16 kHz from 30 speakers (20 male, 10 female); 1000 word samples are collected across both genders. The signals are artificially convolved with street noise [28] at SNRs of 5, 10 and 15 dB to create convolutionally noisy speech signals (a sketch of this mixing follows Table 3).
Table 2. Kannada dataset details.
Table 3. Kannada words considered.
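A hedged sketch of the noise mixing; scaling the noise to the target SNR before convolving it with the speech is an assumption, since the paper does not spell out the mixing procedure:

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_at_snr(speech, noise, snr_db):
    """Corrupt speech with noise scaled to a target SNR, then convolve
    (convolutional rather than additive corruption); a sketch only."""
    noise = noise[: len(speech)]
    power_ratio = np.mean(speech ** 2) / np.mean(noise ** 2)
    scale = np.sqrt(power_ratio / 10 ** (snr_db / 10.0))
    return fftconvolve(speech, scale * noise, mode="same")
```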
The obtained features are modeled for classification and recognition using machine-learning models, namely SVM [29,30,31,32], ANN [15,33] and LSTM [34,35]. The overall data-flow diagram covering all the models is shown in Figure 1.
General experimental setup:
The obtained features of all the signals are grouped into training and testing samples. The signals are convolved with 5 dB, 10 dB and 15 dB street noise [28]. The same dataset is used by all the models for training and testing so that their recognition accuracies can be compared. The results are discussed at two levels: ⅰ) the signal-to-noise ratio before and after applying the bionic wavelet, and ⅱ) the recognition accuracies of the models compared with existing models where available.
SNR is a good indicator of noise interference in a given signal. It is computed using the following formulas.
$$\mathrm{SNR}_{\mathrm{before}} = \frac{\operatorname{mean}\!\big(s_{\mathrm{org}}^{2}\big)}{\operatorname{mean}\!\big(n^{2}\big)}, \qquad \mathrm{SNR}_{\mathrm{before,dB}} = 10\log_{10}\!\big(\mathrm{SNR}_{\mathrm{before}}\big)$$
$$\mathrm{SNR}_{\mathrm{after}} = \frac{\operatorname{mean}\!\big(s_{\mathrm{enh}}^{2}\big)}{\operatorname{mean}\!\big(n_{\mathrm{res}}^{2}\big)}, \qquad \mathrm{SNR}_{\mathrm{after,dB}} = 10\log_{10}\!\big(\mathrm{SNR}_{\mathrm{after}}\big)$$
where $s_{\mathrm{org}}$ is the original signal, $n$ the noise, $s_{\mathrm{enh}}$ the enhanced speech and $n_{\mathrm{res}}$ the residual noise.
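A direct Python sketch of these formulas (array names are illustrative):

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB: mean signal power over mean noise power."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

# Before enhancement: snr_db(original, noise)
# After enhancement:  snr_db(enhanced, enhanced - clean)  # residual noise
```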
Table 4 presents the effect of applying different central frequencies to the bionic wavelet on the noise level. It is clear that, on average, about 2 dB of noise is removed.
Table 4. SNR levels after applying different central frequencies to the bionic wavelet.
Table 5 depicts the application of bionic wavelets to convolutional noise, considering the 22 scales mentioned in the literature.
Table 5. SNR levels for convoluted noise using the 22 scales from the literature.
Comparing Tables 4 and 5, the SNR after de-noising is better in Table 4; hence the central-frequency approach reduces noise better than the 22-scale approach.
In our earlier works [36,37], the experiments were carried out on clean and noisy speech datasets with plain MFCC features. The current feature-extraction procedure applies bionic wavelets to extract better features from the datasets specified in Section 5. Hence, in this paper, the new Bionic-MFCC features, obtained after reducing the noise with the discrete bionic wavelet, are used for recognition. Experiments are performed on the standard benchmark dataset (FSDD) and the Kannada dataset. The models and their parameters are as follows:
Since speech features are non-linear in nature, they need to be mapped to a high-dimensional space. The basic idea is that the input space is mapped into a high-dimensional feature space by a nonlinear transformation, and the optimal hyperplane is found in the new space. The optimal hyperplane must not only discriminate the different categories correctly but also maximize the margin between them; this strengthens the generalization capability of the support vector machine. The objective function of the nonlinearly separable support vector machine is:
$$\min\left(\frac{1}{2}\omega^{T}\omega + C\sum_{i=1}^{N}\xi_i\right) \quad \text{s.t.} \quad y_i(\omega^{T}x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,2,\ldots,N$$
where ω represents the weight coefficient vector and b is a constant. C denotes the penalty coefficient, controlling the penalty for misclassified samples and balancing model complexity against loss error. ξi is the relaxation (slack) factor that adjusts the number of misclassified samples allowed during classification.
When the SVM is used for classification, two strategies can be adopted: one-vs-all and one-vs-one. In this paper the one-vs-all method is applied for multi-class classification. The kernel function is also key for the SVM; hence polynomial and radial basis function (RBF) kernels are considered.
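A hedged scikit-learn sketch of the one-vs-all setup with the two kernels; C, degree and the scaling step are illustrative, not the paper's tuned values:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One-vs-all SVMs with the two kernels compared in the paper.
svm_rbf = OneVsRestClassifier(
    make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)))
svm_poly = OneVsRestClassifier(
    make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=1.0)))

# Usage: svm_rbf.fit(X_train, y_train); acc = svm_rbf.score(X_test, y_test)
```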
Table 6 and Figure 2 depict the recognition performance of the SVM model, implemented with RBF (r) and polynomial (p) kernel functions. It is observed that the Bionic-MFCC features classify the noisy signal well compared with clean speech [30]. The SVM performs better with the RBF kernel on the standard dataset, whereas it fails on the Kannada dataset; the polynomial kernel performs better for the Kannada dataset, as shown in Figure 2. This indicates that kernel performance depends on the dataset.
Table 6. Recognition performance of the SVM model with RBF and polynomial kernels.
In the literature, bionic wavelets have been applied with the Morlet base frequency to additive-noise Arabic speech recognition [15,34,38] using a neural network. Hence, in this paper, bionic wavelets are applied to convolutionally noisy speech data to determine the level of noise reduction and the feature weights for recognition accuracy. The standard dataset has a better recognition rate than the Kannada dataset; the lower performance is due to variable word lengths and ambiguity in the speakers' utterances. Table 7 and Figure 3 show the recognition accuracies obtained.
Table 7. Recognition accuracies of the ANN model.
NN implementation procedure:
The neural network model takes the 12 Bionic-MFCC features at the input layer. Two hidden layers are used, and the output layer has 9 nodes, one for each word/digit. A softmax activation function on top of the network yields output class-label probabilities. The model is optimized with the Adadelta optimizer, which adapts the learning rate over a moving window.
Learning is continued so that the network is trained on all updates. The model uses categorical cross-entropy for multi-class classification.
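A hedged PyTorch sketch of this architecture; the hidden widths are assumptions since the paper does not state them, and "adam-delta" is read as Adadelta here:

```python
import torch
import torch.nn as nn

# 12 Bionic-MFCC inputs, two hidden layers, 9 outputs (one per word/digit).
model = nn.Sequential(
    nn.Linear(12, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 9),            # class scores; softmax is folded into the loss
)
optimizer = torch.optim.Adadelta(model.parameters())
loss_fn = nn.CrossEntropyLoss()  # categorical cross-entropy

def train_step(x, y):
    """One optimization step on a batch (x: float features, y: class indices)."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```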
Procedure: The MFCC features are fed to the input layer to build the basic LSTM cell. Each layer is wrapped in a dropout layer with probability 0.5 for learning in each iteration. The group of dropout-wrapped LSTMs is fed to a MultiRNN cell to stack the layers together.
The CTC model learns to label a variable-length sequence when the input-output alignment is unknown. Consider the features m = (m1, m2, ..., mT) and the label n = (n1, n2, ..., nU). The CTC is trained by maximum probability; its loss function is computed as
$$l_{CTC} = -\ln P(n\,|\,m) \approx -\ln \sum_{\pi\in\Phi}\prod_{t=1}^{T} P(k=\pi_t\,|\,m)$$
where π ranges over all possible expanded CTC path alignments Φ of length T, and P(k = πt|m) is the label distribution at time step t.
Finally, the stacked LSTM layers are embedded. The CTC [39,40] loss function and Adadelta optimizer are used to define the model, with a single fully connected layer and softmax activation producing the labeled predictions. The activation function is given below:
$$P_t(k\,|\,m) = \frac{\exp\big(h_t^{L}(k)\big)}{\sum_{i=1}^{K+1}\exp\big(h_t^{L}(i)\big)}$$
The Adadelta optimizer is used to minimize the loss by feeding the predictions to a mean-squared-error loss function, and an accuracy metric is used for both training and testing.
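A hedged PyTorch sketch of the stacked LSTM with dropout and CTC loss (layer sizes are assumptions; torch's nn.CTCLoss stands in for the CTC formulation above):

```python
import torch
import torch.nn as nn

class LSTMCTC(nn.Module):
    """Stacked-LSTM + CTC recognizer sketch; sizes are illustrative."""
    def __init__(self, n_feats=12, hidden=128, n_labels=10):  # 9 digits + blank
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=2,
                            dropout=0.5, batch_first=True)
        self.fc = nn.Linear(hidden, n_labels)

    def forward(self, x):                         # x: (batch, time, n_feats)
        out, _ = self.lstm(x)
        return self.fc(out).log_softmax(dim=-1)   # CTC expects log-probs

model = LSTMCTC()
ctc = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adadelta(model.parameters())
# nn.CTCLoss wants log-probs shaped (time, batch, n_labels):
# loss = ctc(model(x).transpose(0, 1), targets, input_lengths, target_lengths)
```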
In the literature, Bi-LSTM and LSTM models have been used for speech classification [34], with 95% and 96.58% accuracy respectively on clean speech. T. Goehring et al. [41] used a recurrent neural network for feature extraction on babble noise at 5 dB and 10 dB, with recognition accuracies of 78% and 82%, as illustrated in Table 8.
Table 8. Recognition performance of the LSTM model compared with the literature.
In the proposed work, the LSTM model applied to convolutionally noisy speech data demonstrates better results than those identified in the literature, as shown in Table 8 and Figure 4. Using Bionic-MFCC features, recognition accuracy improves by 1% over the Bi-LSTM model. Among the SVM, ANN and LSTM models, the LSTM is best at modeling the convoluted speech data using the db11 mother wavelet.
Performance measures:
Word classification accuracy and error rate are computed by
$$\text{Classification Accuracy} = \frac{\text{No. of correctly classified audio files}}{\text{Total no. of audio files}}, \qquad \text{Classification Error Rate} = \frac{\text{No. of incorrectly classified audio files}}{\text{Total no. of audio files}}$$
This section discusses the observations made on the models used for classification and recognition.
SVM:
● In SVM the classification rate can be improved by applying different normalization methods.
● SVM performance varies with the choice of kernel function
● Non-linear SVM kernels are well suited for classification of speech data
ANN:
● Recognition accuracy can be increased by using large data set and the selection of appropriate optimizer function
● Increasing the number of hidden nodes improves the learning phase
LSTM:
The LSTM works on par with the ANN, given a proper choice of the CTC loss function; a suitable cost function also helps yield a good recognition rate. The LSTM requires fewer features than the SVM and ANN to model the data.
In general, the SVM and ANN perform equally well compared with other models, but not as well as the LSTM. This is due to the optimality of the weighted Bionic-MFCC features. The LSTM results on the FSDD dataset are better with db11 than the other models because FSDD is a fine-tuned dataset, and the db11 results at 15 dB are best because of the higher signal-to-noise ratio of that noisy data.
In this work, a discretization procedure for the continuous bionic wavelet has been proposed for convolutionally noisy speech recognition. The obtained bionic wavelet features are used to reduce the noise level in the speech data, and they are then used in MFCC computation to obtain the Bionic-MFCC speech features; this also demonstrates the improvement of MFCC features using continuous wavelets. From the results, it is clear that the LSTM with the db11 wavelet at 15 dB SNR outperforms the other configurations. It is also observed that recognition accuracy depends on the nature of the dataset.
This is, to our knowledge, a unique application of the continuous bionic wavelet to feature extraction using the central frequencies of the db11, coif5 and bior3.5 wavelets on convolutionally noisy speech data. The work also demonstrates that even basic mother-wavelet features can be adopted when converting continuous to discrete wavelets. Convolutionally noisy speech is very tedious to handle because the noise overlaps with the original data, and its frequency content must be identified within the convolution of signal and noise. According to our study, additive noise can be removed almost completely using filters, but convolutional noise cannot; hence this approach reduces the noise using the continuous wavelet for isolated-word recognition. The LSTM model classifies best, improving recognition accuracy up to 96% with a 4% word error rate compared with the other models. The bionic wavelet thus holds up well and can be made adaptive by applying various thresholding concepts. The study also shows that the central frequency and thresholding play major roles both in noise reduction and in converting the continuous wavelet to a discrete one. For the Kannada dataset, the word error rate is high because of variation in speakers' pronunciations, whereas FSDD has a good recognition rate because it is a fine-tuned dataset.
Future enhancements:
Besides thresholding, genetic algorithms can be adopted for feature reduction. The central frequencies of other wavelets can also be tried for discretization. The performance of the above models can be verified for different noise types at various levels to assess the SNR, and can be extended to sentence-level recognition. DWT trees can also be used for speech enhancement by noise reduction.
We are thankful to all the persons who helped in formulating this paper. The authors remain grateful to Dr. S.K. Katti for all his support.
All authors declare no conflicts of interest in this paper.
[1] Donoho DL (1995) De-noising by soft-thresholding. IEEE T Inform Theory 41: 613-627. doi: 10.1109/18.382009
[2] Yao J, Zhang YT (2001) Bionic wavelet transform: A new time-frequency method based on an auditory model. IEEE T Biomed Eng 48: 856-863. doi: 10.1109/10.936362
[3] Yao J, Zhang YT (2002) The application of bionic wavelet transforms to speech signal processing in cochlear implants using neural network simulations. IEEE T Biomed Eng 49: 1299-1309. doi: 10.1109/TBME.2002.804590
[4] Yuan XL (2003) Auditory Model-based Bionic Wavelet Transform for Speech Enhancement. Master's thesis.
[5] Gold T (1948) Hearing II: The physical basis of the action of the cochlea. Proc Roy Soc B-Biol Sci 135: 492-498. doi: 10.1098/rspb.1948.0025
[6] Xi Y, Bing-wu L, Fang Y (2010) Speech enhancement using bionic wavelet transform and adaptive threshold function. Second International Conference on Computational Intelligence and Natural Computing 1: 265-268.
[7] Cohen MX (2019) A better way to define and describe Morlet wavelets for time-frequency analysis. Neuroimage 199: 81-86. doi: 10.1016/j.neuroimage.2019.05.048
[8] Valencia D, Orejuela D, Salazar J, et al. (2016) Comparison analysis between rigrsure, sqtwolog, heursure and minimaxi techniques using hard and soft thresholding methods. XXI Symposium on Signal Processing, Images and Artificial Vision (STSIVA), 1-5.
[9] Garg A, Sahu OP (2019) A hybrid approach for speech enhancement using Bionic wavelet transform and Butterworth filter. International Journal of Computers and Applications, 1-11.
[10] Chen F, Zhang YT (2006) A new implementation of discrete bionic wavelet transform: Adaptive tiling. Digit Signal Process 16: 233-246. doi: 10.1016/j.dsp.2005.05.002
[11] Mourad T, Lotfi S, Sabeur A, et al. (2009) Speech Enhancement with Bionic Wavelet Transform and Recurrent Neural Network. 5th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications SETIT, 22-26.
[12] Bin-Fang C, Jian-Qi L, Peixin Q, et al. (2014) An Optimization Adaptive BWT Speech Enhancement Method. Information Technology Journal 13: 1730-1736. doi: 10.3923/itj.2014.1730.1736
[13] Mourad T (2016) Speech enhancement based on stationary bionic wavelet transform and maximum a posterior estimator of magnitude-squared spectrum. International Journal of Speech Technology 20: 75-88.
[14] Wu LM, Li YF, Li FJ, et al. (2014) Speech Enhancement Based on Bionic Wavelet Transform and Correlation Denoising. Advanced Materials Research, 1386-1390.
[15] Ben-Nasr M, Talbi M, Cherif A (2012) Arabic Speech Recognition by MFCC and Bionic Wavelet Transform using a Multi-Layer Perceptron for Voice Control. International Journal of Software Engineering and Technology.
[16] Zehtabian A, Hassanpour H, Zehtabian S, et al. (2010) A novel speech enhancement approach based on Singular Value Decomposition and Genetic Algorithm. 2010 International Conference of Soft Computing and Pattern Recognition.
[17] Liu Y, Ni W (2015) Speech enhancement based on bionic wavelet transform of subband spectrum entropy. Journal of Computer Applications 3: 58.
[18] Singh RA, Pritamdas K (2015) Enhancement of Speech Signal by Transform Method and Various Threshold Techniques: A Literature Review. Advanced Research in Electrical and Electronic Engineering 2: 5-10.
[19] Tan BT, Fu M, Spray A, et al. (1996) The use of wavelet transforms in phoneme recognition. Proceedings of the Fourth International Conference on Spoken Language Processing ICSLP'96 4: 2431-2434. doi: 10.1109/ICSLP.1996.607300
[20] Rioul O, Vetterli M (1991) Wavelets and Signal Processing. IEEE SP Magazine.
[21] Kobayashi M, Sakamoto M (1993) Wavelets analysis of acoustic signals. Japan SIAM Wavelet Seminars II.
[22] Jones DL, Baraniuk RG (1991) Efficient approximation of continuous wavelet transform. Electron Lett 27: 748-750. doi: 10.1049/el:19910465
[23] Swami PD, Sharma R, Jain A, et al. (2015) Speech enhancement by noise driven adaptation of perceptual scales and thresholds of continuous wavelet transform coefficients. Speech Communication 70: 1-12. doi: 10.1016/j.specom.2015.02.007
[24] Johnson MT, Yuan X, Ren Y (2007) Speech signal enhancement through adaptive wavelet thresholding. Speech Communication 49: 123-133. doi: 10.1016/j.specom.2006.12.002
[25] Wavelet center frequency, MATLAB centfrq, MathWorks. Available from: https://in.mathworks.com/help/wavelet/ref/centfrq.html.
[26] Anusuya MA, Katti SK (2011) Front end analysis of speech signal processing: A review. International Journal of Speech Technology, Springer.
[27] GitHub. Available from: https://github.com/Jakobovski/decoupled-multimodal-learning.
[28] Hu Y, Loizou P (2007) Subjective evaluation and comparison of speech enhancement algorithms. Speech Communication 49: 588-601. doi: 10.1016/j.specom.2006.12.006
[29] Aida-zade K, Xocayev A, Rustamov S (2016) Speech recognition using Support Vector Machines. 2016 IEEE 10th International Conference on Application of Information and Communication Technologies (AICT), 1-4.
[30] Mini PP, Thomas T, Gopikakumari R (2018) Feature Vector Selection of Fusion of MFCC and SMRT Coefficients for SVM Classifier Based Speech Recognition System. 2018 8th International Symposium on Embedded Computing and System Design, 153-157.
[31] Thiruvengatanadhan R (2018) Speech Recognition using SVM. International Research Journal of Engineering and Technology (IRJET) 5: 918-921.
[32] Zou YX, Zheng WQ, Shi W, et al. (2014) Improved Voice Activity Detection based on support vector machine with high separable speech feature vectors. 2014 19th International Conference on Digital Signal Processing.
[33] Gupta A, Joshi A (2018) Speech Recognition using Artificial Neural Network. International Conference on Communication and Signal Processing, India.
[34] Al-Rababah MAA, Al-Marghilani A, Hamarshi AA (2018) Automatic Detection Technique for Speech Recognition based on Neural Networks Inter-Disciplinary. (IJACSA) International Journal of Advanced Computer Science and Applications 9: 179-184.
[35] Swedia ER, Mutiara AB, Subali M (2018) Deep Learning Long-Short Term Memory (LSTM) for Indonesian Speech Digit Recognition using LPC and MFCC Feature. Third International Conference on Informatics and Computing (ICIC).
[36] Vani HY, Anusuya MA (2015) Isolated Speech Recognition using K-means and FCM Technique. International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT).
[37] Vani HY, Anusuya MA (2017) Noisy speech recognition using KFCM. International Conference on Cognitive Computing and Information Processing.
[38] ICSI Speech FAQ. Available from: https://www1.icsi.berkeley.edu/Speech/faq/speechSNR.html.
[39] Shi Y, Hwang MY, Lei X (2019) End-to-end speech recognition using a high rank LSTM-CTC based model. ICASSP 2019 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[40] Graves A, Jaitly N (2014) Towards End-to-End Speech Recognition with Recurrent Neural Networks. Proceedings of the 31st International Conference on Machine Learning, Beijing, China.
[41] Goehring T, Keshavarzi M, Carlyon RP, et al. (2019) Recurrent neural networks to improve the perception of speech in non-stationary noise by people with cochlear implants. J Acoust Soc Am 146: 705-718. doi: 10.1121/1.5119226