
The soft-max function, a well-known extension of the logistic function, is widely used in probabilistic classification methods such as linear discriminant analysis, soft-max (multinomial logistic) regression, naive Bayes classifiers, and neural networks. The focus of this study is the development of soft-max based fuzzy aggregation operators (AOs) for Pythagorean fuzzy sets (PyFS), capitalizing on the benefits provided by the soft-max function. In addition to introducing these novel AOs, we present a comprehensive approach to multi-attribute decision-making (MADM) that employs the proposed operators. To demonstrate the efficacy and applicability of our MADM method, we applied it to a real-world problem involving Pythagorean fuzzy data. Supplier selection has been extensively examined in the literature as a crucial component of supply chain management (SCM) and is recognised as a significant MADM challenge; the choice of healthcare suppliers in particular can greatly influence the efficacy and calibre of healthcare provision. In addition, we give a numerical example to rigorously evaluate the accuracy and dependability of the proposed procedures. This examination demonstrates the effectiveness and potential of our proposed soft-max based AOs and their applicability in Pythagorean fuzzy environments.
Citation: Sana Shahab, Mohd Anjum, Ashit Kumar Dutta, Shabir Ahmad. Gamified approach towards optimizing supplier selection through Pythagorean Fuzzy soft-max aggregation operators for healthcare applications[J]. AIMS Mathematics, 2024, 9(3): 6738-6771. doi: 10.3934/math.2024329
As a key technology of the Internet of Things, speech recognition plays an important role in electronic products such as smart homes and vehicle-mounted equipment. However, interference from surrounding environmental noise can seriously degrade the quality and intelligibility of the speech signal. In response to this problem, speech enhancement technology has emerged, aiming to improve signal quality, reduce noise, and preserve speech information [1,2].
In the last century, owing to limited computing resources and immature technology, researchers relied mainly on traditional signal processing methods. Boll et al. [3] tried to obtain clean speech by subtracting an estimate of the noise spectrum from the noisy spectrum, but spectral subtraction does not work well for nonstationary noise. To address this issue, Ephraim et al. [4] reduced the impact of noise by averaging the samples within a sliding window, and their experimental results showed that the quality and intelligibility of the speech signal improved significantly compared with other models. To further improve performance under nonstationary noise, some researchers replaced each sample with the median of the values in the window, which further improved denoising of nonstationary and sudden noise [5,6]. To overcome the limitations of median filtering, Widrow et al. [7] used adaptive filtering, which automatically adjusts its parameters according to the signal and noise, improves signal quality, effectively suppresses various noises, and is suitable for complex noise environments and real-time processing. Although traditional methods have achieved much in the field of speech enhancement, their scope remains limited, for example in recovering the fine details of the speech signal and in the environments where they can be used. Deep learning methods are able to compensate for these deficiencies through data-driven feature learning, thereby achieving better noise suppression and speech enhancement [8,9].
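To make the classical baseline concrete, the following is a minimal sketch of magnitude spectral subtraction in the spirit of [3]. The frame length, hop size, and the use of the leading frames as a noise-only estimate are our assumptions for illustration, not parameters from the cited work:

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=5, frame_len=256, hop=128):
    """Minimal magnitude spectral subtraction sketch. Assumes the first
    `noise_frames` frames are noise-only and uses them to estimate the
    noise magnitude spectrum; window-gain normalization is ignored."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)     # noise estimate
    clean_mag = np.maximum(mag - noise_mag, 0.0)    # subtract, floor at zero
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i, f in enumerate(clean):                   # overlap-add synthesis
        out[i * hop:i * hop + frame_len] += f * window
    return out
```

The residual musical noise left by this flooring step is one reason the method degrades on nonstationary noise, motivating the learned approaches below.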
To date, speech enhancement technology has completed the transition from traditional signal processing methods to deep learning methods [10,11]. Grais et al. [12,13] used a deep neural network (DNN) to process speech signals, modeling the spectral or time-domain characteristics of the signal and learning the nonlinear relationship between speech and noise. Subsequently, as speech enhancement tasks grew more complex, Strake et al. [14,15] introduced the convolutional neural network (CNN) into speech enhancement. CNNs are popular with researchers for their efficient feature extraction and small parameter counts. Nonetheless, a CNN still cannot learn features directly from the raw signal when processing speech, which limits its ability to model time series data [16]. To address this, Choi et al. [17,18] introduced recurrent neural networks (RNN) into speech enhancement models to improve the modeling of speech and noise. Hsieh et al. [19,20] combined CNN and RNN, improving both the model's handling of time series data and its training and inference speed. In recent years, under the data-driven paradigm, autoencoders (AED) [21] and generative adversarial networks (GAN) [22] have emerged. The AED can learn low-dimensional representations of data without supervision, reducing the need for labels and making training more flexible. The GAN, consisting of a generator and a discriminator, is likewise an unsupervised method that achieves enhancement through adversarial training. Pascual et al. [23,24] demonstrated for the first time that GANs significantly improve performance in speech enhancement compared with other models. However, GANs still face many problems in practice [25,26]. To further improve performance, Hao et al. [27] introduced attention mechanisms into the GAN model, and their experiments showed that the model can effectively capture local features and establish long-range dependencies in the data. To further strengthen feature extraction and data generation, Pandey et al. [28] combined the AED and GAN models to implement a more flexible enhancement strategy.
GAN-based models perform well on speech signals: the generator can synthesize samples similar to real speech and improve them through adversarial training; a GAN can learn complex speech characteristics, including speaking rate, pitch, and noise, bringing its output closer to real speech; as an unsupervised method, it does not require large amounts of labeled speech data, reducing the difficulty of data acquisition; and the generator can simulate multiple noise types, making the model robust in different environments. These properties make GANs a powerful tool for speech enhancement. Nevertheless, such models have a notable drawback: the absence of aggregated feature information. Structural reasons why a network may produce discrete, non-aggregated features include mismatched hierarchies between encoder and decoder and the lack of an effective information transmission mechanism in the hierarchical design. An overly simple network structure cannot fully capture and transmit the correlations in complex data, so the continuity and integrity of feature information are lost during transmission. As a result, the above models still fall short of the best possible enhancement. Our investigation found that they ignore the impact of aggregating feature information between the encoder and decoder on performance. This article therefore focuses on the non-aggregation of the generator's feature information in the GAN.
Considering the above factors, this paper exploits the advantages of the temporal convolutional network (TCN) [29]. By introducing the TCN's multilayer convolutional layers, dilated causal convolutions, and residual connections to aggregate and exchange feature information effectively, the goal is to capture the feature information between the encoder and decoder and improve the feature expression ability of the overall network. The main contributions of this article are summarized as follows:
● A novel speech enhancement model is proposed. We extend the Self-Attention Generative Adversarial Network for Speech Enhancement (SASEGAN) model [30]: by integrating the TCN with the generator, the model captures both local and long-distance feature information, solving the problem of non-aggregated features, and clearly improves speech signal quality and intelligibility.
● Experiments are conducted on Chinese and English datasets against the SEGAN and SASEGAN baselines, respectively. The strong results validate the effectiveness and generalization of the model. During training, the model exhibits relatively smooth, stable loss curves, indicating that it is more stable and fits better than the compared models.
The remainder of this paper is organized as follows. Section 2 introduces the two baseline models, SEGAN and SASEGAN. Section 3 proposes the SASEGAN-TCN model. Section 4 describes the experimental configuration and analyzes and discusses several sets of experimental results in depth.
Assume the speech signal input to the GAN model is $\tilde{X} = X + N$, where $X$ and $N$ denote the clean speech signal and the additive noise, respectively. As shown in Figure 1, the goal of speech enhancement is to recover the clean signal $X$ from the noisy signal $\tilde{X}$. The SEGAN method generates enhanced data $\hat{X} = G(\tilde{X}, Z)$ with a generator $G$, where $Z$ is the latent variable passed from the encoder to the decoder. The task of the discriminator $D$ is to distinguish the enhanced data from the real clean signal, learning to classify them as true or false, while the generator $G$ learns to produce enhanced signals that the discriminator classifies as true. SEGAN is trained through this adversarial scheme with the least-squares loss. The least-squares objective functions of $D$ and $G$ are:
$$\min_{D} L_{LS}(D) = \frac{1}{2}\,\mathbb{E}_{X,\tilde{X}\sim p_{data}(X,\tilde{X})}\big(D(X,\tilde{X})-1\big)^2 + \frac{1}{2}\,\mathbb{E}_{Z\sim p_Z(Z),\,\tilde{X}\sim p_{data}(\tilde{X})}\,D\big(G(Z,\tilde{X}),\tilde{X}\big)^2, \tag{2.1}$$

$$\min_{G} L_{LS}(G) = \frac{1}{2}\,\mathbb{E}_{Z\sim p_Z(Z),\,\tilde{X}\sim p_{data}(\tilde{X})}\big(D(G(Z,\tilde{X}),\tilde{X})-1\big)^2 + \lambda\,\big\|G(Z,\tilde{X})-X\big\|_1, \tag{2.2}$$
where $p_{data}$ and $p_Z$ denote the distribution of the real data and of the latent variable, respectively; $X$, $N$, and $\mathbb{E}$ denote the clean speech signal, the additive background noise, and the expectation over the distribution specified in the subscript.
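For concreteness, a minimal sketch of these two losses follows, written as plain NumPy functions of the discriminator outputs. The weight $\lambda = 100$ is the value commonly used for SEGAN and is an assumption here, as is the use of the mean in place of the sum in the L1 term:

```python
import numpy as np

def d_loss_ls(d_real, d_fake):
    """Least-squares discriminator loss, Eq (2.1): push D toward 1 on
    real (clean, noisy) pairs and toward 0 on enhanced pairs."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_loss_ls(d_fake, enhanced, clean, lam=100.0):
    """Least-squares generator loss, Eq (2.2): fool the discriminator
    (push its output on enhanced pairs toward 1) plus an L1 term tying
    the enhanced waveform to the clean target."""
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)
    l1 = lam * np.mean(np.abs(enhanced - clean))
    return adv + l1
```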
When a traditional GAN performs speech enhancement, it relies entirely on the convolution operations of each CNN layer, which can blur the correlations between events across the whole sequence; a mechanism is needed to capture correlations between distant parts of the speech data. The SASEGAN model therefore combines a self-attention layer, which can adapt to nonlocal features, with the convolutional layers of SEGAN, and the effect improves significantly.
The structure of the self-attention layer is shown in Figure 2, where conv and pooling denote the convolutional layer and the max pooling layer, respectively. Assume the input speech feature data is $F \in \mathbb{R}^{L \times C}$, processed with one-dimensional convolutions. The query ($Q$), key ($K$), and value ($V$) matrices are derived as follows:
$$Q = FW_Q, \quad K = FW_K, \quad V = FW_V, \tag{2.3}$$
where $L$ and $C$ are the time dimension and the number of channels, respectively, and $W_Q, W_K, W_V \in \mathbb{R}^{C \times C_k}$ are weight matrices whose values are produced by $(1 \times 1)$ convolution layers with $C_k$ output channels. The feature dimension is controlled through the variable $k$. Keys and values of appropriately reduced dimension are selected by introducing the pooling factor $p$, so the attention map $A$ and output $O$ are computed with relatively low complexity as follows:
$$A = \mathrm{softmax}\big(Q\bar{K}^{T}\big), \quad A \in \mathbb{R}^{L \times L/p}, \tag{2.4}$$

$$O = (A\bar{V})\,W_O, \quad W_O \in \mathbb{R}^{C_k \times C}, \tag{2.5}$$
where $\bar{K}$ and $\bar{V}$ denote the keys and values after max pooling by the factor $p$ (Figure 2 illustrates the layer with $k = 2$, $p = 3$, $C = 4$, and $L = 6$). A learnable scalar $\beta$ is introduced to weight the attention output, and convolution and other nonlinear operations produce the output $O_{out}$, which can be expressed as:
$$O_{out} = \beta O + F. \tag{2.6}$$
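A compact NumPy sketch of Eqs (2.3)–(2.6) follows. The non-overlapping max-pooling windows and the truncation of any remainder shorter than $p$ are our assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, WQ, WK, WV, WO, beta, p=3):
    """Sketch of the attention layer. F: (L, C) features; WQ/WK/WV:
    (C, Ck) projections standing in for the (1x1) convolutions;
    WO: (Ck, C); beta: learnable scalar."""
    Q, K, V = F @ WQ, F @ WK, F @ WV                    # Eq (2.3)
    Lp = K.shape[0] // p
    Kbar = K[:Lp * p].reshape(Lp, p, -1).max(axis=1)    # pooled keys
    Vbar = V[:Lp * p].reshape(Lp, p, -1).max(axis=1)    # pooled values
    A = softmax(Q @ Kbar.T)                             # Eq (2.4): (L, L/p)
    O = (A @ Vbar) @ WO                                 # Eq (2.5)
    return beta * O + F                                 # Eq (2.6): residual
```

Pooling the keys and values by $p$ shrinks the attention map from $L \times L$ to $L \times L/p$, which is what keeps the complexity manageable on long waveform feature maps.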
In the generator, existing approaches often ignore the aggregation of feature information between the encoder and the decoder, so the model cannot capture long-distance feature dependencies, which weakens the feature representation between the two. To this end, this paper proposes the SASEGAN-TCN model, whose generator structure is presented in Figure 3.
In Figure 3, the speech signal is first turned into a matrix of dimension $(8192 \times 16)$ by feature extraction. Next, downsampling through a multilayer CNN compresses the feature information, and the self-attention layer captures long-distance feature dependencies, until the latent variable $Z$ between the encoder and decoder is extracted. Finally, the obtained feature information is aggregated again through the TCN layers. By virtue of the dilated causal convolutions and residual connection modules in the TCN, the network not only avoids problems such as vanishing gradients and long-term dependence found in traditional CNNs, but also aggregates the feature information between the encoder and the decoder.
Although the SASEGAN encoder generates feature vectors at each time step, these features describe only local information of the input sequence, and each decoder output depends only on the preceding inputs. This leads to non-aggregated feature data in the variable $Z$. We take the SASEGAN model with the self-attention mechanism at the 10th layer for analysis. When processing time series data, a traditional CNN has limitations: with a fixed kernel size, the receptive field is restricted, so the model cannot capture time dependencies beyond a limited range. In view of these challenges, dilated causal convolution combines the characteristics of dilated convolution and causal convolution to enlarge the receptive field while improving parameter efficiency and parallelism. It handles long-term trends and periodic patterns well and achieves the aggregation of feature information. Its structure is shown in Figure 4.
In Figure 4, assuming the input time series is $z = [z[0], z[1], z[2], z[3], z[4], z[5], \dots, z[i]]$, the output of the dilated causal convolution is computed as:
$$l[t] = \sum_{c} z[t - d \cdot c] \cdot w[c], \tag{3.1}$$
where $i$, $d$, and $c$ denote the time-step index, the dilation rate, and the index within the convolution kernel, respectively; $l[t]$ is the output at time step $t$, $z[t - d \cdot c]$ is the input at time step $t - d \cdot c$, and $w[c]$ is the kernel weight at index $c$.
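The following short sketch implements Eq (3.1) directly; treating positions before the start of the sequence as zero is an assumption on our part:

```python
import numpy as np

def dilated_causal_conv(z, w, d):
    """Direct implementation of Eq (3.1): l[t] = sum_c z[t - d*c] * w[c].
    Out-of-range (pre-sequence) samples are treated as zero, so the
    output at time t depends only on the present and the past."""
    T, k = len(z), len(w)
    l = np.zeros(T)
    for t in range(T):
        for c in range(k):
            if t - d * c >= 0:
                l[t] += z[t - d * c] * w[c]
    return l
```

A kernel of size $k$ with dilation $d$ covers $d(k-1)+1$ past samples, so stacking layers with $d = 1, 2, 4, \dots$ grows the receptive field exponentially with depth, which is the source of the efficiency claimed above.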
This paper takes into account the vanishing and exploding gradient problems that traditional recurrent neural networks face when processing time series data. The TCN therefore uses residual connections to let feature information bypass the convolutional layers and pass directly to the output. To alleviate the vanishing-gradient problem and improve information flow through the network, assume the input is $x$ and the output after the Rectified Linear Unit (ReLU) nonlinearity is $F$; the final output $o$ of the residual network is then:
$$o = x + F(x, W), \tag{3.2}$$
where $F(x, W)$ and $W$ denote the nonlinear mapping of the residual branch and the network weights, respectively.
The residual connection module of the TCN is shown in Figure 5. Through multilayer convolutions, dilated causal convolutions, and residual connections, the TCN aggregates feature information well and enables its interaction, improving the overall performance and feature expression ability of the network. Accordingly, we integrate the SASEGAN model with the TCN and process the encoder's final output (the latent variable $Z$) in the generator through a two-layer TCN to aggregate feature information and improve the speech enhancement effect.
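A sketch of one such residual block follows. The single-channel simplification and the absence of normalization and dropout are our assumptions; the paper's TCN uses two such layers with 32 and 16 channels:

```python
import numpy as np

def causal_conv1d(z, w, d):
    """Dilated causal convolution, Eq (3.1), via a zero-stuffed kernel."""
    wd = np.zeros(d * (len(w) - 1) + 1)
    wd[::d] = w                          # insert d-1 zeros between taps
    return np.convolve(z, wd)[:len(z)]   # truncate so t never sees the future

def tcn_residual_block(x, w1, w2, d):
    """Two dilated causal convolutions with ReLU, closed by the skip
    connection of Eq (3.2), o = x + F(x, W)."""
    h = np.maximum(causal_conv1d(x, w1, d), 0.0)   # conv + ReLU
    h = np.maximum(causal_conv1d(h, w2, d), 0.0)   # conv + ReLU
    return x + h                                   # residual connection

# Aggregating the latent variable Z with two stacked blocks, e.g.:
# z_agg = tcn_residual_block(tcn_residual_block(z, w1, w2, d=1), w3, w4, d=2)
```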
This article uses the Valentini English dataset [31] and the THCHS30 Chinese dataset [32], both with a 16 kHz audio sampling rate. The Valentini dataset contains audio from 30 speakers in the Voice Bank corpus; the training set was recorded by 28 speakers and mixed with 10 different noise types at signal-to-noise ratios of 15, 10, 5, and 0 dB, while the test set was recorded by 2 speakers and mixed with 5 noise types from the Demand audio library at signal-to-noise ratios of 17.5, 12.5, 7.5, and 2.5 dB. For the Chinese data, we first resample the 15 audio signals in NoiseX-92 and concatenate them into one long noisy recording. We then traverse the training and test sets of the THCHS30 dataset and mix each utterance with a randomly selected segment of the long noise recording at one of the four signal-to-noise ratios 0, 5, 10, and 15 dB. Table 1 shows the output data dimensions of each layer of the generator in this experiment.
Layer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Encoder | (8192×16) | (4096×32) | (2048×32) | (1024×64) | (512×64) | (256×128) | (128×128) | (64×256) | (32×256) | (16×512) | (8×1024) |
Decoder | (16×512) | (32×256) | (64×256) | (128×128) | (256×128) | (512×64) | (1024×64) | (2048×32) | (4096×32) | (8192×16) | (16,384×1) |
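The SNR-controlled mixing described above can be sketched as follows; the random segment selection and power-based scaling are standard practice and assumptions on our part rather than code from the paper:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a random segment of a long noise recording into clean speech
    at a target SNR (dB): the segment is scaled so that
    10*log10(P_speech / P_noise) equals snr_db."""
    start = np.random.randint(0, len(noise) - len(speech))
    seg = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(seg ** 2)
    alpha = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + alpha * seg

# e.g., for THCHS30:
# noisy = mix_at_snr(clean, long_noise, snr_db=np.random.choice([0, 5, 10, 15]))
```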
The experiments are conducted on an NVIDIA RTX 2060 graphics card with 6 GB of memory under Windows, implemented in Python 3.7 with TensorFlow 1.13. At training time, raw audio segments in each batch are sampled from the training data with 50% overlap and passed through a high-frequency pre-emphasis filter with coefficient 0.95. Because the hardware configuration is limited, the TCN used in this article has only two layers, with 32 and 16 channels, respectively. The models are trained for 10 epochs with a batch size of 10, and the learning rates of the generator and discriminator are both 0.0002.
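The pre-emphasis step is a first-order filter; a sketch follows. The inverse de-emphasis at inference time is an assumption based on common SEGAN-style practice, not something stated in the text:

```python
import numpy as np

def preemphasis(x, coeff=0.95):
    """High-frequency pre-emphasis: y[t] = x[t] - coeff * x[t-1]."""
    return np.append(x[0], x[1:] - coeff * x[:-1])

def deemphasis(y, coeff=0.95):
    """Inverse filter x[t] = y[t] + coeff * x[t-1], applied to the
    enhanced output (assumed here)."""
    x = np.zeros_like(y, dtype=float)
    for t in range(len(y)):
        x[t] = y[t] + (coeff * x[t - 1] if t > 0 else 0.0)
    return x
```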
To evaluate the models, this article uses several indicators. Perceptual Evaluation of Speech Quality (PESQ) is an objective measure of speech quality, typically ranging from -0.5 to 4.5; a higher PESQ score indicates better quality, and it is a pivotal metric for assessing speech coding, decoding, and communication systems. CSIG, a mean opinion score (MOS) prediction of signal distortion, evaluates the fidelity of the speech signal itself; a higher CSIG score reflects better signal quality. CBAK, the MOS prediction of the intrusiveness of background noise, measures the extent of noise reduction in the speech signal; a higher CBAK score signifies more effective background noise suppression. COVL, the MOS prediction of the overall effect, offers a more holistic evaluation of system performance across quality levels. Segmental Signal-to-Noise Ratio (SSNR) assesses the ratio between speech signal and noise computed over short segments. Finally, Short-Time Objective Intelligibility (STOI), reported as a percentage in the tables below, measures how intelligible the enhanced speech is, with higher values being better.
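For reference, two of these metrics can be computed with the third-party `pesq` and `pystoi` packages (`pip install pesq pystoi`); the calls below are illustrative of those libraries, not code from this paper:

```python
from pesq import pesq
from pystoi import stoi

FS = 16000  # sampling rate used throughout this paper

def evaluate_pair(clean, enhanced):
    """PESQ and STOI for one clean/enhanced utterance pair."""
    pesq_score = pesq(FS, clean, enhanced, 'wb')    # wideband PESQ, -0.5..4.5
    stoi_score = stoi(clean, enhanced, FS) * 100.0  # STOI as a percentage
    return pesq_score, stoi_score
```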
To verify the effectiveness of the method, experiments are first conducted on the Valentini dataset. Table 2 shows that SEGAN-TCN improves on PESQ, STOI, SSNR, and other indicators compared with the SEGAN model. Specifically, PESQ, CBAK, COVL, and STOI reach 2.1476, 2.8472, 2.7079, and 92.61%, improvements of 9.0, 16.7, 3.0, and 0.5% over the noisy data, and the SSNR increases by 5.3724 dB. However, CSIG is slightly reduced, owing to the choice of data processing method and insufficient model training, which is elaborated later.
Model | PESQ | CSIG | CBAK | COVL | SSNR (dB) | STOI (%)
NOISY | 1.97 | 3.35 | 2.44 | 2.63 | 1.68 | 92.11 |
SEGAN [23] | 1.8176 | 3.0043 | 2.4423 | 2.3691 | 3.4108 | 91.24 |
SEGAN-TCN | 2.1476 | 3.3388 | 2.8472 | 2.7079 | 7.0524 | 92.61 |
During the training of the SEGAN and SEGAN-TCN models, the curves of the discriminator's fake-sample loss (d_fk_loss), the discriminator's real-sample loss (d_rl_loss), the generator's adversarial loss (g_adv_loss), and the generator's L1 loss (g_l1_loss) are shown in Figure 6; data are recorded and plotted every 100 steps. As can be seen from Figure 6, the loss curves of SEGAN-TCN decline more smoothly than those of SEGAN, and the training process is relatively stable. A decline in d_fk_loss denotes the discriminator's increased proficiency in recognizing generated samples as counterfeit, while a reduction in d_rl_loss indicates its heightened ability to classify genuine samples as authentic. A diminishing g_adv_loss suggests the generator succeeds in outsmarting the discriminator with realistic samples, and a decreasing g_l1_loss signifies point-wise similarity between the generated and authentic waveforms.
To further verify the generalization and effectiveness of the network, we continue with experiments based on the SASEGAN model. Table 3 shows that SASEGAN-TCN achieves 2.1636, 3.4132, 2.8272, 2.7631, and 92.78% on PESQ, CSIG, CBAK, COVL, and STOI on the Valentini dataset, improvements of 9.83, 1.9, 15.9, 5.1, and 0.7% over the noisy data, with SSNR improved by 4.4907 dB. The data reveal that SASEGAN-TCN performs well on CSIG, but its processing slightly degrades the speech signal and introduces some external noise, leading to slightly lower PESQ, CBAK, and SSNR than the SASEGAN baseline. To confront and resolve these issues, we continue with further experiments and analysis.
Model | PESQ | CSIG | CBAK | COVL | SSNR (dB) | STOI (%)
NOISY | 1.97 | 3.35 | 2.44 | 2.63 | 1.68 | 92.11 |
SASEGAN [30] | 2.2027 | 3.3331 | 2.9883 | 2.7441 | 8.3832 | 92.56 |
SASEGAN-TCN | 2.1636 | 3.4132 | 2.8272 | 2.7631 | 6.1707 | 92.78 |
As can be seen from Figure 7, during the training phase the SASEGAN-TCN model not only fits successfully to a good state but also exhibits more stable loss curves than the SASEGAN model, confirming its higher stability and easier convergence during training and underlining its advantage in processing the training data. The reduction in the discriminator losses (d_fk_loss, d_rl_loss) indicates improved recognition of fake and real samples; a lower g_adv_loss indicates successful deception by the generator, and a lower g_l1_loss represents point-wise similarity between generated and real samples.
To examine the issue that the SASEGAN model degrades speech quality and introduces external noise on the Valentini data, this article again verifies the effectiveness and applicability of the network on the THCHS30 Chinese dataset. The experimental results are shown in Table 4: PESQ, CSIG, CBAK, COVL, and STOI reach 1.8077, 2.9350, 2.4360, 2.3009, and 83.54%, and the SSNR increases to 4.6332 dB. The data show that the SASEGAN model attains a higher SSNR but lower PESQ and STOI, indicating that it introduces additional noise during training and distorts the signal. The SASEGAN-TCN model proposed in this article not only keeps the SSNR from attenuating too much, but also effectively improves the PESQ and STOI.
Model | PESQ | CSIG | CBAK | COVL | SSNR (dB) | STOI (%)
NOISY | 1.3969 | 2.3402 | 1.9411 | 1.78 | 1.3101 | 80.33 |
SASEGAN [30] | 1.7212 | 2.8051 | 2.3813 | 2.1815 | 4.9159 | 83.07 |
SASEGAN-TCN | 1.8077 | 2.9350 | 2.4360 | 2.3009 | 4.6332 | 83.54 |
During the training phase, the loss curves of the SASEGAN and SASEGAN-TCN models on the THCHS30 dataset are shown in Figure 8. The SASEGAN-TCN model remains very stable and fits better than the other models during training, indicating that it improves the discriminator's ability to distinguish fake from real samples and enhances the generator's ability to produce fake samples that closely resemble real ones. The experiments also reveal some issues worth noting: integrating the TCN module increases the number of model parameters, which raises the hardware cost of the experiments, and the model performs well on long speech data but may perform poorly on short speech data.
Finally, this article verifies the recognition effect of the enhanced audio in the field of speech recognition. First, the last five model checkpoints saved during SASEGAN-TCN training are used for testing, yielding enhanced audio corresponding to each of the five checkpoints. Second, the enhanced test audio is fed to the multi-core two-dimensional causal convolution fusion network with attention mechanism for end-to-end speech recognition (ASKCC-DCNN-CTC) model [33]. The recognition results are shown in Table 5. The model proposed in this article clearly improves the quality and intelligibility of the speech signals and significantly reduces the recognition error rate.
Type | Test WER (%)
Noisy audio data | 60.8189 |
First | 50.9427 |
Second | 51.5100 |
Third | 51.3780 |
Fourth | 52.5014 |
Fifth | 50.2238 |
Average | 51.3112 |
To enhance the quality and intelligibility of speech signals effectively, this paper analyzed the characteristics of the TCN and used its multilayer convolutions, dilated causal convolutions, and residual connections to avoid problems such as vanishing gradients, while aggregating the feature information between encoder and decoder and thereby improving the network's speech enhancement performance and feature expression ability. Experimental results show that the proposed model yields clear improvements on the Valentini and THCHS30 datasets and remains stable during training. In addition, applying the enhanced speech to speech recognition reduces the word error rate by 17.4% compared with the original noisy audio. These results indicate that the SASEGAN-TCN model uses the TCN's characteristics to solve the non-aggregation problem, improves speech enhancement performance and feature expression, and effectively elevates the quality and intelligibility of noisy speech. Moreover, the proposed speech recognition scheme maintains high recognition accuracy in noisy environments.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported by the National Natural Science Foundation of China (NSFC, No. 61702320).
There are no conflicts of interest to report in this study.