
Skin cancer is now a widespread disease worldwide and is responsible for numerous deaths. Early-stage detection is essential for controlling the spread of tumours throughout the body. However, existing algorithms for skin cancer severity detection still have drawbacks: the analysis of skin lesions is non-trivial, their accuracy is slightly worse than that of dermatologists, and they are costly and time-consuming. Various machine learning algorithms have been used to determine disease severity, but detection remains complex. To overcome these issues, a modified Probabilistic Neural Network (MPNN) classifier is proposed to determine the severity of skin cancer. The proposed method contains two phases: training and testing. The features collected from the data of affected people are used as input to the modified PNN classifier in the current model. The neural network is also trained using the Spider Monkey Optimization (SMO) approach. To analyze the severity level, the classifier predicts four classes, and the degree of skin cancer is determined from the classification. According to the findings, the system achieved a False Positive Rate (FPR) of 0.10, an error of 0.03 and an accuracy of 0.98, while previous methods such as KNN, NB, RF and SVM achieve accuracies of 0.90, 0.70, 0.803 and 0.86 respectively, lower than the proposed approach.
Citation: J. Rajeshwari, M. Sughasiny. Modified PNN classifier for diagnosing skin cancer severity condition using SMO optimization technique[J]. AIMS Electronics and Electrical Engineering, 2023, 7(1): 75-99. doi: 10.3934/electreng.2023005
Singing voice conversion (SVC) is a technique that modifies the singing voice of a reference singer to sound like the voice of the target singer while keeping the phonetic content unchanged [1]. The converted singing voice sounds as if the target is performing the same lyrical composition as the source. SVC finds applications in the entertainment industry.
Voice conversion systems are mainly categorized as parallel [2] and nonparallel [3,4]. In a parallel voice conversion system, the training data consist of both the reference and target speakers (singers) performing the same lyrical content. In contrast, nonparallel voice conversion systems do not require both speakers to utter the same sentence set during training. Conventionally, many voice conversion methods rely on parallel training data, which necessitates frame-level alignment between source and target utterances. However, collecting perfectly aligned parallel training data in real-life scenarios can be extremely challenging. To tackle this problem, recent studies focus on unsupervised voice conversion techniques that utilize nonparallel training data [5]. The SVC Challenge (SVCC) [6] focuses on comparing and understanding various SVC systems based on a shared dataset.
Recently, neural network (NN) based voice conversion methods, such as deep neural networks (DNN) [7], recurrent neural networks (RNN) [8], and convolutional neural networks [9], have been proposed [10,11]. Additionally, there exist different approaches for style transfer in images. Style transfer involves transferring the style of one image onto another image while keeping the latter's content unchanged. Interestingly, these techniques can also be applied to nonparallel voice conversion tasks. Generative adversarial networks (GANs) [12,13], originally developed for image translation, can be effectively applied to audio data for voice conversion. Notable nonparallel many-to-many GAN-based approaches include the Wasserstein generative adversarial network (WD-GAN) [14], cycle-consistent generative adversarial networks (CycleGAN) [15] and StarGAN [16]. In these many-to-many nonparallel techniques, both source and target singing voices must be included in the training process.
Recent research has also explored image enhancement tasks, specifically super-resolution and inpainting, using deep learning, attention mechanisms, and multi-scale features [17,18,19]. DNNs play a central role in these works, learning complex mappings from low-resolution or incomplete images to high-resolution or complete versions [20,21]. In the voice domain, Chen et al. [22] propose a noise-robust voice conversion model in which users can choose whether to retain or remove background sounds during conversion; the model combines speech separation and voice conversion modules. Tomasz et al. [23] discuss techniques for seamlessly transferring a speaker's identity to another speaker while preserving speech content. Another study explores using cosine similarity between x-vector speaker embeddings as an objective metric for evaluating SVC [24]; that system preprocesses the source singer's audio to obtain melody features via the F0 contour, loudness curve, and phonetic posteriorgram.
To disentangle singer identity from linguistic information, auto-encoders can be used; thus, encoder-decoder-based networks have also been proposed for unsupervised voice conversion. These auto-encoder-based techniques include the variational auto-encoder (VAE) [25], cycle-consistent variational auto-encoder (CycleVAE) [26,27], and variational auto-encoding Wasserstein generative adversarial network (VAWGAN) [28]. Although these systems generate high-quality singing voices, their major limitation is that they can only be applied to many-to-many voice conversion tasks; they are inefficient for SVC when the reference and target singers are absent from the training process. The robust one-shot SVC model [29] relies on GANs to accurately recover the target pitch by matching pitch distributions, and adaptive instance normalization (AdaIN)-skip conditioning further enhances its performance. Unlike traditional voice cloning tasks, which modify an audio waveform to match a desired voice specified by reference audio, a novel task called visual voice cloning (V2C) [30] bridges the gap between voice cloning and visual information.
Some recent works on voice conversion have focused on one-shot voice conversion. Examples include zero-shot voice style transfer with only auto-encoder loss (AutoVC) [31], AdaGAN-VC [32], and two-level nested U-structure VC (U2VC) [33,34]. In one-shot voice conversion, either or both of the speakers may be unseen in the training data at inference time. Recently, voice conversion in the speech domain has adopted one-shot techniques that focus on separating speaker and content information. Audio features such as mel-cepstral coefficients (MCEPS), aperiodicities and fundamental frequencies are extracted using vocoders. The voice conversion process removes speaker-dependent features from the source speaker's utterance and incorporates the target speaker's attributes. AutoVC introduces a vanilla auto-encoder with a carefully designed information bottleneck and a pretrained speaker encoder for decoupling speaker and content information. Meanwhile, a U-Net architecture combining vector quantization (VQ) and instance normalization (IN) (VQVC+) [35] successfully separates linguistic information using vector quantization. However, content leakage remains a significant limitation.
AdaGAN-VC requires adversarial training to separate speech attributes, which causes instability during training. This issue is resolved in AdaIN-VC [36] and activation guidance and adaptive instance normalization (AGAIN)-VC, which use instance normalization techniques to remove the speaker information. AdaIN was first introduced in [37] for style transfer in image translation networks, addressing the limitations of existing normalization methods in deep learning. AdaIN extends the popular IN technique by incorporating style information from a reference image: it maps the normalized mean and standard deviation of the content image to match those of the style image. AdaIN-VC comprises two encoders with AdaIN for the disentanglement of information and one decoder. Lian et al. [38] proposed arbitrary voice conversion without any supervision, achieved using instance normalization and adversarial learning. Meanwhile, the masked auto-encoder (MAE-VC) [39] is an end-to-end masked auto-encoder that converts the speaker style of the source speech to that of the target speech while maintaining content consistency. In AGAIN-VC [40], speaker and content information are disentangled with the help of a single encoder. These approaches were introduced for voice conversion in the speech domain. In this paper, one-shot SVC is proposed using a combination of convolutional layers and an LSTM architecture, with AdaIN and AGAIN applied to SVC as baseline architectures.
These methods were designed for voice conversion in the speech domain; only a few works address the conversion of singing voices. This paper proposes a hybrid convolutional neural network with long short-term memory (CNN-LSTM) model for a one-shot SVC system based on the AGAIN technique. The proposed AGAIN-SVC method requires a single encoder to separate the vocal timbre of the singer from the phonetic information. Remarkably, without any frame-alignment procedure, one singer's voice can be converted into another's using nonparallel data, even if both singers are unavailable during training. In this paper, two recent voice conversion approaches, AdaIN and AGAIN, are applied to one-shot SVC and also used as baseline models. To improve conversion performance, a combination of convolution layers and LSTM layers is used for the encoder and decoder architecture. Simultaneously achieving high synthesis quality and maintaining singer similarity is challenging, as enhancing one often comes at the cost of the other. Many VC systems employ disentanglement techniques to separate singer and linguistic content information. However, some methods reduce the dimension or quantize content embeddings to prevent singer information leakage, inadvertently compromising synthesis quality. Activation guidance serves as an information bottleneck on content embeddings, allowing better control over singer-related features. Additionally, AdaIN dynamically adjusts normalization statistics during training, enhancing the model's flexibility.
The proposed technique significantly improves the delicate balance between synthesis quality and singer similarity. Its one-shot voice conversion relies on learned content features and adaptation mechanisms to transform the source singer's voice into the desired target singer's style, even when the target singer is unseen during training. This simplifies the architecture, making it efficient and practical. CNNs excel at extracting local features from audio data but fall short in modeling long-term dependencies over extended sequences. Recurrent architectures, such as LSTM and the gated recurrent unit (GRU), address this limitation by maintaining hidden states across time steps. Adding them after the CNN blocks introduces the ability to learn contextual information beyond local features, improving the model's understanding of speech patterns, intonation, and context.
The main contributions of this paper are as follows: (i) rapid voice transformation from an unseen target singer is used to directly convert the source singer's voice to the desired target singer's style; (ii) the one-shot voice conversion capability simplifies the architecture, making it efficient and practical; (iii) AdaIN and activation mechanisms achieve a balance between synthesis quality and singer similarity; (iv) incorporating a recurrent architecture enhances contextual learning and improves temporal modeling.
The paper is organized into four sections. The network architecture of the proposed SVC technique is described in Section 2. The experimental setup and evaluation results are discussed in Section 3. Finally, the conclusion is presented in Section 4.
The AGAIN technique was first introduced for voice conversion. This paper combines the AGAIN technique with the proposed hybrid CNN-LSTM architecture for SVC.
Internal covariate shift is defined as the variation in the distribution of each network layer's input due to alterations in the parameters of the previous layer. The training of earlier DNNs was affected by internal covariate shift, which slowed down the training process. To alleviate this issue, batch normalization (BN) [41,42] was introduced. Let N be the batch size, C the number of channels, H the height of each activation map, and W the width of the activation input; the activation layer input dimensions can then be represented as N × C × H × W. Generally, normalization is applied to activation layers by shifting the mean and scaling the variance. In BN, the mean and variance for each channel are computed across all samples and both spatial dimensions; thus, BN normalizes the activations across the N, H, and W axes. The statistics of BN layers include the batch-wise mean and variance. Interestingly, the style of an image can be represented using the BN statistics. Consequently, neural style transfer between two feature maps becomes possible by replacing the batch-wise mean and variance of the source image with those of the target image [43].
Instead of BN, IN was introduced in [44] for feed-forward stylization tasks, achieving noticeable improvements in the generated stylized images. Unlike BN, IN can be applied at both training and testing time, ensuring consistency during training, transfer, and testing. IN normalizes across the H and W axes by calculating the mean and variance over both spatial dimensions for each channel and each sample.
Let X and Y represent the feature maps of the reference and target waveforms, respectively. IN can be computed as follows:
$$\mathrm{IN}(X)=\gamma\left(\frac{X-\mu(X)}{\sigma(X)}\right)+\beta \qquad (2.1)$$
where β and γ are the learned affine parameters. Here, μ(X) is the mean and σ(X) is the standard deviation computed across spatial dimensions independently for each feature channel and each sample.
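As a concrete illustration, the following Python sketch applies Eq (2.1) to audio feature maps shaped (N, C, T); the function name, shapes, and parameter values are illustrative rather than taken from the paper's implementation. The statistics are reduced only over time, per sample and per channel, whereas BN would also reduce over the batch axis.

```python
import torch

def instance_norm(x, gamma, beta, eps=1e-5):
    """Eq (2.1): normalize each (sample, channel) over time, then apply the learned
    affine parameters gamma and beta. x has shape (N, C, T)."""
    mu = x.mean(dim=-1, keepdim=True)        # mu(X), one value per sample and channel
    sigma = x.std(dim=-1, keepdim=True)      # sigma(X)
    return gamma * (x - mu) / (sigma + eps) + beta

# BN, by contrast, would reduce over the batch and time axes as well: x.mean(dim=(0, 2)).
x = torch.randn(4, 80, 128)                  # 4 mel-spectrogram segments, 80 channels, 128 frames
y = instance_norm(x, torch.ones(1, 80, 1), torch.zeros(1, 80, 1))
```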
AdaIN is simply an extension of IN. Unlike IN, it has no learnable affine transformations. While IN normalizes the input to a single style, AdaIN normalizes the input to an arbitrarily given style. Thus, AdaIN imposes the target's style on the reference by adjusting the mean and standard deviation of the reference input to match those of the target input. AdaIN can be represented as:
$$\mathrm{AdaIN}(X,\mu(Y),\sigma(Y))=\sigma(Y)\left(\frac{X-\mu(X)}{\sigma(X)}\right)+\mu(Y) \qquad (2.2)$$
Here, X and Y are the content input and style input, respectively. The AdaIN layer thus re-styles X with the style of Y by scaling the normalized content input with the standard deviation of the style input and shifting it by μ(Y).
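A minimal sketch of Eq (2.2), again with illustrative shapes: the content features keep their own temporal structure but take on the channel-wise statistics of the style features, and the two inputs may have different lengths because the statistics are averaged over time.

```python
import torch

def adain(x_content, y_style, eps=1e-5):
    """Eq (2.2): strip the content features' own statistics and impose the style
    features' channel-wise mean and standard deviation. Shapes are (N, C, T)."""
    mu_x, sigma_x = x_content.mean(-1, keepdim=True), x_content.std(-1, keepdim=True)
    mu_y, sigma_y = y_style.mean(-1, keepdim=True), y_style.std(-1, keepdim=True)
    return sigma_y * (x_content - mu_x) / (sigma_x + eps) + mu_y

# Content and style may have different time lengths, since statistics are averaged over time.
converted = adain(torch.randn(1, 80, 200), torch.randn(1, 80, 150))
```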
AdaIN was originally proposed for arbitrary style transfer in real time [37], and the idea was later extended to voice conversion of speech signals. In audio, the statistics (mean and variance) of the content features might represent phonemes or musical notes, while the style features could correspond to timbres or specific audio effects. By applying AdaIN, the style characteristics of one audio signal can be transferred onto the content of another. In Eq (2.1), the γ parameter scales the normalized features to adjust their style, whereas the β parameter shifts them by an offset. AdaIN-SVC employs a VAE with distinct encoders to encode both the singer-specific information and the phonetic content information. The AdaIN approach can thus also be applied to the conversion of singing voices.
Consider a singing voice as phonetic information sung by a singer whose unique characteristics constitute the singer identity information. While the lyrical information changes drastically from frame to frame, the singer information varies only slightly. Since the mean and variance over an entire singing waveform rarely change, they are used to represent the singer information.
AdaIN-SVC employs two encoders to encode the singer information and the phonetic content information. Each encoder takes mel-spectrogram frames of the singing waveform as input, and IN layers are added in each encoder block. These IN layers effectively separate the singer information from the lyrical content; to achieve this, the channel-wise mean (μ) and standard deviation (σ) are computed. The IN layer detaches the singer representation from the encoded phonetic content information, and during decoding the detached singer information is reintroduced at the AdaIN layer.
AdaIN-SVC uses a VAE with separate content and speaker encoders. It also leverages AdaIN to separate singer and content representations, but it relies on separate encoders and either reduces the dimension or quantizes the content embeddings to prevent singer information leakage. However, this strong information bottleneck can harm synthesis quality. AGAIN-SVC instead introduces a suitable activation as an information bottleneck on the content embeddings and simplifies the architecture to an auto-encoder with a single encoder and a decoder. By using activation guidance, it can drastically improve the trade-off between synthesis quality and singer similarity in the converted voice.
The proposed AGAIN-SVC is an auto-encoder-based model that disentangles the singer identity and phonetic information. The framework comprises a single encoder and a decoder; unlike AdaIN-SVC, which utilizes separate encoders for phonetic content and singer information, AGAIN-SVC uses only one encoder. The singer information is considered time-independent, whereas the phonetic content information is time-dependent; by separating these components, the style transfer of singing data becomes easier. To prevent information leakage without affecting conversion performance, activation guidance is used as an information bottleneck. The training procedure of the proposed AGAIN-SVC system is shown in Figure 1. Instead of modeling raw singing audio, mel-spectrograms are used: these lower-resolution representations are easier to model than raw temporal audio and can be faithfully regenerated back to audio. The IN layer detaches the linguistic content from the singer-dependent features, and the singer embeddings are later added at the AdaIN layer in the decoding section. The evaluation procedure of AGAIN-SVC is illustrated in Figure 2. Given a reference X and target Y, the AdaIN layer replaces the singer features (channel-wise mean and variance) of the reference with those of the target singer; the content of the reference remains unchanged, and only the singer identity is transferred.
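The conversion flow of Figure 2 can be sketched as below. The `encoder` and `decoder` arguments are stand-ins for the trained networks; the assumption that the encoder returns the content embedding together with the stripped channel statistics, and that the decoder accepts those statistics for its AdaIN layers, is made only for illustration.

```python
import torch

def convert_one_shot(encoder, decoder, mel_source, mel_target):
    """Conversion flow of Figure 2 (illustrative wiring only).

    Assumed interfaces: encoder(mel) -> (content, mu, sigma), where mu and sigma are
    the channel statistics stripped by its IN layers; decoder(content, mu, sigma)
    re-injects the given statistics through its AdaIN layers. mel_* are (1, 80, T) tensors.
    """
    content, _, _ = encoder(mel_source)          # singer-independent phonetic content
    _, mu_t, sigma_t = encoder(mel_target)       # target singer's channel statistics
    return decoder(content, mu_t, sigma_t)       # converted mel-spectrogram
```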
The encoder acts as a content encoder, responsible for encoding the phonetic content information from the singing data. The mean and variance, being time-independent features, carry little phonetic content information, so IN layers in each encoder block remove this time-invariant information, which represents the singer identity. To prevent any leakage of singer information along with the phonetic content embedding, an activation function is included as an information bottleneck. The sigmoid function with the hyperparameter value α = 0.1 is chosen; it proves effective in disentangling the singer identity from the phonetic content embedding while minimizing the reconstruction error. This activation function ensures that the encoding contains only the phonetic information.
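A hedged sketch of this activation bottleneck follows. The text fixes the sigmoid hyperparameter at α = 0.1 but does not state how α enters, so treating it as a slope (temperature) on the pre-activation is an assumption made for illustration.

```python
import torch

def activation_bottleneck(content_embedding, alpha=0.1):
    """Sigmoid activation guidance on the content embedding: squash the values so the
    embedding cannot carry fine-grained singer statistics. Using alpha as a slope on
    the pre-activation is an illustrative assumption, not the paper's stated formula."""
    return torch.sigmoid(content_embedding / alpha)
```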
The basic architecture is based on the AGAIN voice conversion system and employs a U-net structure in which both the encoder and decoder utilize 1D convolutional layers. A CNN [45,46,47], or ConvNet, is a feed-forward neural network containing a stack of convolutional layers as its hidden layers. The proposed hybrid architecture combines a CNN with an RNN. In DNNs, especially those with many layers, gradients can diminish exponentially as they propagate backward; this hinders effective training and can significantly prolong learning or even cause convergence failure. The vanishing gradient problem is particularly associated with activation functions such as the sigmoid. The specialized RNN architectures LSTM and GRU address the vanishing gradient problem by introducing gating mechanisms and maintaining cell states. LSTMs [48] introduce input, forget, and output gates that control the flow of information within the cell, and they maintain a cell state that allows information to flow across time steps without vanishing gradients. GRUs [49] introduce two essential gates, the update gate and the reset gate, and compute a new hidden state by blending the previous hidden state with the updated input; this mechanism allows them to capture temporal dependencies effectively. The hybrid models integrate an RNN (either LSTM or GRU), acting as a feedback network, with the feed-forward CNN. In these hybrid networks, CNN layers extract complex features, while LSTM or GRU layers capture long-term dependencies.
The encoder and decoder shown in Figures 1 and 2 form a U-net architecture, as depicted in Figure 3. The input mel-spectrogram features pass through 1D convolution layers. Both the encoder and decoder use 1D convolution blocks consisting of 1D convolution layers followed by BN and the leaky rectified linear unit (LeakyReLU) activation function. Skip connections between the encoder and decoder are formed by transferring the singer embeddings μ and σ from the IN layers to the AdaIN layers. These skip connections allow encoded features from the encoder to be forwarded directly to the decoder; by connecting the encoder and decoder, information from earlier layers can bypass the bottleneck (latent space) and directly influence the reconstruction process, which mitigates the vanishing gradient problem and facilitates better information flow. By allowing gradients to flow more freely, skip connections stabilize training and improve convergence. The content embeddings from the encoder are passed to the decoder, and the converted mel-spectrogram features are extracted from the decoder. LSTM layers are placed after the convolution layers at both the encoder and decoder ends; to obtain the hybrid CNN-GRU model, the LSTM layers are replaced with GRU layers. Both LSTM and GRU are used in the same way and have similar effects. The pseudocode for the proposed system is given in Algorithm 1 below, and a sketch of the corresponding building blocks follows the algorithm.
Algorithm 1: Enhanced AGAIN-VC with ConvLSTM pseudocode
1: Load preprocessed data (mel-spectrograms) for source and target speakers
2: Initialize model parameters (encoder, ConvLSTM layers, and decoder)
3: Encoder:
4:   Input: Mel-spectrogram of source speaker's speech
5:   Output: Content embeddings
6:   Encode the input mel-spectrogram using the encoder (including ConvLSTM layers)
7: Activation Guidance:
8:   Apply activation guidance to the content embeddings
9:   Restrict the flow of information to essential features
10: Adaptive Instance Normalization (AdaIN):
11:   Adapt the statistics (mean and variance) of content embeddings
12:   to match those of the target speaker
13:   Preserve linguistic content while adjusting for style
14: Decoder:
15:   Input: Adapted content embeddings and target speaker's style
16:   Output: Converted mel-spectrogram
17:   Decode the adapted content embeddings using the decoder
18: Loss Functions:
19:   L1 Loss: Minimize difference between input and output mel-spectrograms
20:   Perceptual Loss: Encourage perceptual similarity with target speaker's speech
21:   Total loss: Combination of L1 and perceptual losses
22: Training:
23:   Optimize model parameters using backpropagation and gradient descent
24:   Monitor training progress (e.g., total steps, evaluation steps)
25: Inference:
26:   Given a source mel-spectrogram, convert it to target speaker's style
27:   Output the converted mel-spectrogram
28: ConvLSTM Layers:
29:   Incorporate ConvLSTM layers within the encoder and decoder
30:   Enhance the temporal modeling capabilities by capturing spatio-temporal features
31:   Adjust hyperparameters (e.g., kernel size, hidden units) as needed
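The building blocks referred to above can be sketched in PyTorch as follows; channel counts, kernel sizes, and the single-layer LSTM are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvBlock1D(nn.Module):
    """1D convolution block used in the encoder and decoder: Conv1d -> BN -> LeakyReLU."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.net(x)

class EncoderSketch(nn.Module):
    """Illustrative encoder stage: conv blocks extract local features, an IN step strips
    the channel statistics (singer information), and an LSTM adds temporal context."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.convs = nn.Sequential(ConvBlock1D(n_mels, hidden), ConvBlock1D(hidden, hidden))
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)   # replace with nn.GRU for CNN-GRU

    def forward(self, mel, eps=1e-5):                 # mel: (N, 80, T)
        h = self.convs(mel)                           # (N, hidden, T)
        mu = h.mean(-1, keepdim=True)                 # singer statistics for the skip connection
        sigma = h.std(-1, keepdim=True)
        content = (h - mu) / (sigma + eps)            # instance-normalized content
        content, _ = self.lstm(content.transpose(1, 2))
        return content.transpose(1, 2), mu, sigma     # content plus stripped singer stats
```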
MelGAN [50] is specifically designed for efficient audio waveform synthesis. Unlike autoregressive models (such as WaveNet and WaveRNN), which suffer from high computational complexity, MelGAN focuses on reducing this complexity. It achieves this by decomposing raw audio samples into learned basis functions and associated weights. As a result, MelGAN significantly reduces computational demands compared to other GAN-based neural vocoders, such as HiFi-GAN [51]. Moreover, MelGAN consistently produces high-quality audio that rivals other GAN-based vocoders. MelGAN comprises two main components: the generator and the discriminator. The primary objective of MelGAN is to generate high-quality audio waveforms from mel-spectrograms. The generator, being lightweight, efficiently converts mel-spectrograms into raw audio waveforms. Conversely, the discriminator plays the crucial role of distinguishing between real and generated audio. Through an adversarial training process, the generator continually improves the quality of the synthesized audio. Additionally, by conditioning the generator on specific attributes (such as speaker identity or emotion), MelGAN can synthesize audio with desired characteristics. To enable conditional synthesis, MelGAN requires paired data during training, allowing it to learn the mapping from source mel-spectrograms to target audio waveforms.
The model is evaluated using the National University of Singapore (NUS) sung and spoken lyrics corpus (NUS-48E) [52]. The corpus comprises the speaking and singing voices of six male and six female subjects performing 48 English songs, of which 20 are unique. The total length of the audio recordings is approximately 169 minutes, with 115 minutes of singing data and the remaining 54 minutes of speech data. From the NUS-48E database, nine singers are selected for training, while the remaining three singers (two male and one female) are reserved for testing. The audio is resampled to 22050 Hz, and silent frames are removed. 80-bin mel-scale spectrograms are generated from the audio data as the acoustic feature. To regenerate the waveform from the mel-spectrogram, MelGAN is used as the vocoder; MelGAN is a fully convolutional architecture within a GAN setup, employed for conditional audio waveform synthesis.
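A preprocessing sketch matching this description (resampling to 22050 Hz, silence removal, 80-bin mel features) is shown below using librosa; the FFT and hop sizes are assumptions, and the vocoder call at the end assumes a pretrained MelGAN object exposing an `inverse` method, which is hypothetical here.

```python
import librosa
import numpy as np

def extract_mel(path, sr=22050, n_mels=80, n_fft=1024, hop_length=256):
    """Load a recording, resample to 22050 Hz, trim silence, and compute 80-bin
    log-mel features of shape (80, T). FFT and hop sizes are illustrative choices."""
    y, _ = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y)                       # remove leading/trailing silence
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-9)                            # log-mel acoustic features

# Waveform reconstruction with a pretrained MelGAN vocoder (interface assumed, not a real API):
# wav = melgan.inverse(torch.from_numpy(mel_converted).unsqueeze(0))
```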
The adaptive moment estimation (Adam) optimizer is used for training. The model is trained for 50,000 steps with a learning rate of 0.0005 and a batch size of 32. For performance evaluation of the proposed architecture, AdaIN-SVC and AGAIN-SVC are used as baseline models.
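The training setup can be reproduced roughly as in the following sketch, where `model` and `dataloader` are placeholders and only the L1 reconstruction term from Algorithm 1 is shown; any additional loss terms are omitted.

```python
import torch
import torch.nn.functional as F

def train(model, dataloader, steps=50_000, lr=5e-4, device="cuda"):
    """Training loop matching the stated setup: Adam optimizer, learning rate 0.0005,
    50,000 steps, batch size 32 (configured in the dataloader)."""
    model.to(device).train()
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < steps:
        for mel in dataloader:                 # mel: (32, 80, T) mel-spectrogram batch
            mel = mel.to(device)
            recon = model(mel)                 # auto-encoder reconstruction
            loss = F.l1_loss(recon, mel)       # L1 reconstruction term only
            optim.zero_grad()
            loss.backward()
            optim.step()
            step += 1
            if step >= steps:
                break
    return model
```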
SVC is primarily associated with the transformation of MCEPS features. For evaluation, the distortion between the original and transformed MCEPS features must be calculated. Mel-cepstral distortion (MCD) [53,54] is a widely used measure for evaluating synthesized voices; it represents the Euclidean distance between the real and converted MCEPS features and is defined as follows:
$$\mathrm{MCD\,(dB)}=\frac{10}{N\ln 10}\sum_{n=1}^{N}\sqrt{2\sum_{d=1}^{D}\big(m_{c}(n,d)-m_{t}(n,d)\big)^{2}} \qquad (3.1)$$
where mc(n,d) and mt(n,d) are the dth coefficients of the converted and target MCEPS features, respectively, at the nth frame, D = 24 is the dimension of the MCEPS, and N is the total number of frames (a computation sketch is given after Table 1). MCD values are tabulated in Table 1 for the four possible conversions: male to male (MM), male to female (MF), female to male (FM), and female to female (FF). Conversions with lower MCD values exhibit smaller distortion and better performance. Interestingly, the proposed method yields the best results for FF conversion. Female voices often share more similar acoustic characteristics with each other than male voices do, including pitch range, formant frequencies, and overall timbre; when converting from one female voice to another, the transformation is less drastic, which might lead to better results.
System | MCD (dB) | ||||
MM | MF | FM | FF | Average | |
AdaIN-SVC | 8.20 | 7.92 | 8.30 | 8.79 | 8.30 |
AGAIN-SVC | 6.32 | 6.31 | 6.37 | 6.92 | 6.48 |
Hybrid CNN-GRU | 5.92 | 5.80 | 5.95 | 5.33 | 5.75 |
Hybrid CNN-LSTM | 5.99 | 5.56 | 5.94 | 4.91 | 5.60 |
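Eq (3.1) can be evaluated as in the sketch below, assuming the converted and target MCEPS sequences have already been time-aligned frame by frame.

```python
import numpy as np

def mel_cepstral_distortion(mc_converted, mc_target):
    """MCD in dB following Eq (3.1). Inputs are time-aligned MCEPS matrices of shape
    (N, D) with N frames and D = 24 coefficients per frame."""
    diff = mc_converted - mc_target
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))    # sqrt(2 * sum over d)
    return (10.0 / np.log(10.0)) * np.mean(per_frame)        # (10 / (N ln 10)) * sum over n
```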
For a comparative study with state-of-the-art techniques, experiments are conducted using various machine learning methods in the field of voice conversion, including VAE, VAWGAN, CycleGAN, and the cycle-consistent boundary equilibrium generative adversarial network (CycleBEGAN), along with the baseline models (Table 2). The evaluation measures include MCD, the root mean square error (RMSE) between the mel-spectrogram features of the real and converted voices, and the log spectral distortion (LSD). LSD computes the difference between the linear prediction coding (LPC) log power spectra of the original and converted singing voices; the converted singing voice retains spectral features similar to the original if this distortion is minimized (a computation sketch is given after Table 2).
System | MCD(dB) | RMSE(dB) | LSD(dB) |
VAE | 8.20 | 3.14 | 11.13 |
VAWGAN | 6.32 | 3.36 | 10.14 |
CycleGAN | 5.92 | 3.32 | 10.91 |
CycleBEGAN | 5.99 | 3.16 | 9.95 |
AdaIN-SVC | 8.30 | 4.5 | 11.24 |
AGAIN-SVC | 6.48 | 3.14 | 10.29 |
Hybrid CNN-GRU | 5.75 | 2.47 | 10.11 |
Hybrid CNN-LSTM | 5.6 | 2.04 | 9.80 |
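For reference, one standard way to compute the LSD described above is sketched below; the paper does not specify its exact settings, so the LPC order, frame length, and hop size here are assumptions.

```python
import numpy as np
import librosa
import scipy.signal

def log_spectral_distortion(y_ref, y_conv, order=24, frame_len=1024, hop=256):
    """Frame-wise LSD in dB between LPC log power spectra of the original and
    converted waveforms. LPC order, frame, and hop sizes are illustrative."""
    def lpc_log_spectrum(frame):
        a = librosa.lpc(frame, order=order)               # LPC coefficients A(z)
        _, h = scipy.signal.freqz([1.0], a, worN=512)     # spectrum of 1 / A(z)
        return 20.0 * np.log10(np.abs(h) + 1e-9)
    dists = []
    n = min(len(y_ref), len(y_conv))
    for start in range(0, n - frame_len, hop):
        s_ref = lpc_log_spectrum(y_ref[start:start + frame_len])
        s_conv = lpc_log_spectrum(y_conv[start:start + frame_len])
        dists.append(np.sqrt(np.mean((s_ref - s_conv) ** 2)))
    return float(np.mean(dists))
```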
From the results, it can be seen that the hybrid models yield more promising outcomes than the baseline methods. Specifically, the hybrid CNN-RNN systems demonstrate superior performance compared to the baseline AdaIN-SVC and AGAIN-SVC models. Although the GRU model is less complex and faster to train, the LSTM-based fusion produces a converted voice that is more similar to the target. This arises because the LSTM network has more trainable parameters, which in turn increase the model's capacity during training.
Additionally, the global variance (GV) distribution [55] plot for synthesized and target singing voices serves as a valuable measure to assess their closeness. GV visualizes spectral features in terms of variance distribution. For instance, Figure 4 illustrates the variance distribution plot for target singing voice and converted singing voice across four conversions. The results affirm that the converted singing voices closely match the target singing voice.
Incorporating LSTM or GRU with a CNN can enhance voice conversion models by capturing both local and contextual features; the choice depends on the specific requirements and available resources. LSTM has more parameters and can capture intricate long-term dependencies, while GRU simplifies the architecture with fewer gates and parameters. If the system prioritizes accuracy and sufficient data are available, LSTM might provide better results; if efficiency is crucial, GRU could be the better choice, since it typically trains faster due to its simpler structure and is preferable when training resources are limited. LSTM may be more suitable for tasks requiring precise modeling of long-range dependencies (e.g., music composition), whereas GRU could be preferable for real-time applications where efficiency matters (e.g., voice-controlled systems).
Objective evaluation measures alone have limitations, because the quality of a synthesized singing voice ultimately depends on human perception. To ensure the quality of the synthesized singing voice, subjective evaluation is also performed. In this evaluation, 24 participants (12 male and 12 female), all without hearing problems, took part in listening tests. Each participant provided a mean opinion score (MOS) for three attributes, naturalness, melodic similarity, and phoneme intelligibility, along with XAB preference tests for singer similarity. XAB preference testing is a method used to compare two different stimuli (A and B) against a third stimulus (X). During the listening tests, participants listened to two randomly chosen songs from each conversion.
For the XAB test on singer similarity, participants listened to a target song and the corresponding converted singing voices from two systems, and then chose the preferred system based on singer similarity with the target. The experiments included intra-gender (MM or FF) and inter-gender (MF or FM) conversions. Initially, the preference test involved converted songs from the AdaIN-SVC and AGAIN-SVC models; subsequently, comparisons were conducted between AdaIN-SVC and the hybrid CNN-LSTM, as well as between AGAIN-SVC and the hybrid CNN-LSTM. The results of the XAB preference tests are depicted in Figures 5, 6, and 7.
Additionally, MOS tests were conducted, where each participant provided a score on a scale of 1 to 5 for each system (Figure 8). A score of 1 corresponds to the worst case, while a score of 5 represents the best case. Naturalness reflects how closely the converted singing voice resembles a natural human voice. The melodic similarity attribute is evaluated based on the resemblance in melody between the target voice and the converted voice. Finally, MOS on phoneme intelligibility provides insights into the clarity of phonemes in the converted singing voice.
The singer similarity between AdaIN-SVC and AGAIN-SVC exhibits roughly equal preferences. However, in the preference test comparing AdaIN-SVC with the proposed hybrid CNN-LSTM system, the latter shows greater resemblance to the target singer's voice. Specifically, for inter-gender conversions, the hybrid CNN-LSTM system achieves an 86.67% preference, whereas AdaIN-SVC achieves only 13.3%; for intra-gender conversions, the hybrid CNN-LSTM system achieves 93.3%, while AdaIN-SVC lags behind at 6.7%. Additionally, the hybrid CNN-LSTM model is preferred over AGAIN-SVC. Analyzing the percentage preferences reveals substantial differences: the AdaIN-SVC versus hybrid comparison exhibits large disparities, indicating that AdaIN-SVC is the least preferred, while the AGAIN-SVC versus hybrid comparison shows that the hybrid model is highly preferred. Furthermore, the melodic similarity and phoneme intelligibility of the converted singing voice in the proposed hybrid systems outperform the baseline AdaIN-SVC and AGAIN-SVC models (Figure 8). This improvement is attributed to the activation guidance, which mitigates the phonetic content loss associated with the baselines. However, there is room for improvement in the naturalness of the overall system.
In the proposed work, a combination of CNN and RNN models is used for one-shot SVC, employing adaptive normalization layers and activation guidance. IN layers remove singer information from the source singing mel-spectrogram, while AdaIN layers add target singer information to the content representation. Activation guidance prevents singer information leakage, ensuring better control over singer-related features; this technique disentangles singer and content information without leaking singer information into the content embeddings. By incorporating recurrent architectures such as LSTM and GRU, which excel at capturing long-term dependencies in sequential data, the proposed system achieves a high-quality converted singing voice while maintaining singer characteristics. The one-shot capability simplifies the architecture, making it efficient and practical. The conversion performance is assessed through objective and subjective evaluation, with the AdaIN and AGAIN voice conversion techniques used as baseline architectures. The fusion of CNN and LSTM models consistently yields better results across all experiments; the MCD is lowest for the proposed hybrid CNN-LSTM model (5.6 dB), ensuring superior conversion. The majority of listeners preferred the proposed conversion technique, with AdaIN-SVC being the least preferred, followed by AGAIN-SVC and the hybrid CNN-LSTM. The proposed technique achieves MOSs of 2.93, 3.35, and 3.21 for naturalness, melodic similarity, and phoneme intelligibility, respectively.
Although the proposed hybrid CNN-LSTM model aims to balance content preservation with speaker identity transformation, achieving the perfect trade-off remains challenging. Formulating SVC as a multi-objective optimization problem or introducing a latent space and adjusting specific dimensions within this space allows users to control the balance between similarity and quality. One-shot methods are more susceptible to background noise, accent variations, and emotional fluctuations. Robustness to such factors is an ongoing research area. These modifications represent exciting avenues for future research.
Assila Yousuf: Resources, Investigation, Validation, Formal Analysis, Writing ‒ original draft; David Solomon George: Supervision; All authors: Conceptualization, Methodology. All authors have read and approved the final version of the manuscript for publication.
The authors would like to thank Directorate of Technical Education (DTE) and A.P.J Abdul Kalam Technological University, Kerala, India for the support provided.
All authors declare no conflicts of interest in this paper.