Research article

Alternating current servo motor and programmable logic controller coupled with a pipe cutting machine based on human-machine interface using dandelion optimizer algorithm - attention pyramid convolution neural network


  • Received: 21 November 2023 Revised: 03 January 2024 Accepted: 05 January 2024 Published: 02 February 2024
  • The proposed research addresses the optimization challenges in servo motor control for pipe-cutting machines, aiming to enhance performance and efficiency. Recognizing the existing limitations in parameter optimization and system behavior prediction, a novel hybrid approach is introduced. The methodology combines a Dandelion optimizer algorithm (DOA) for servo motor parameter optimization and an Attention pyramid convolution neural network (APCNN) for system behavior prediction. Integrated with a Programmable Logic Controller (PLC) and human-machine interface (HMI), this approach offers a comprehensive solution. Our research identifies a significant gap in the efficiency of existing methods, emphasizing the need for improved control parameter optimization and system behavior prediction to reduce cost and enhance efficiency. Through implementation on the MATLAB platform, the proposed DOA-APCNN approach demonstrates a noteworthy 30% reduction in computation time compared to existing methods such as the Heap-based optimizer (HBO), Cuckoo Search Algorithm (CSA), and Salp Swarm Algorithm (SSA). These findings pave the way for faster and more efficient pipe-cutting operations, contributing to advancements in industrial automation and control systems.

    Citation: Santosh Prabhakar Agnihotri, Mandar Padmakar Joshi. Alternating current servo motor and programmable logic controller coupled with a pipe cutting machine based on human-machine interface using dandelion optimizer algorithm - attention pyramid convolution neural network[J]. AIMS Electronics and Electrical Engineering, 2024, 8(1): 1-27. doi: 10.3934/electreng.2024001




    Singing voice conversion (SVC) is a technique that modifies the singing voice of a reference singer to sound like the voice of the target singer while keeping the phonetic content unchanged [1]. The converted singing voice sounds as if the target is performing the same lyrical composition as the source. SVC finds applications in the entertainment industry.

    Voice conversion systems are mainly categorized as parallel [2] and nonparallel voice conversion systems [3,4]. In a parallel voice conversion system, the training data consist of both the reference and target speakers (singers) performing the same lyrical content. In contrast, nonparallel voice conversion systems do not require both speakers to utter the same sentence set during training. Conventionally, many voice conversion methods rely on parallel training data, which necessitates frame-level alignment between source and target utterances. However, collecting perfectly aligned parallel training data in real-life scenarios can be extremely challenging. To tackle this problem, recent studies have focused on unsupervised voice conversion techniques that utilize nonparallel training data [5]. The SVC Challenge (SVCC) [6] focuses on comparing and understanding various voice conversion systems for SVC based on a shared dataset.

    Recently, neural network (NN) based voice conversion methods such as the deep neural network (DNN) [7], recurrent neural network (RNN) [8], and convolutional neural network (CNN) [9] have been proposed [10,11]. Additionally, there exist different approaches for style transfer in images. Style transfer involves transferring the style of an input image to another image without changing the content of the latter. Interestingly, these techniques can also be applied to nonparallel voice conversion tasks. Generative adversarial networks (GANs) [12,13], originally developed for image translation, can be effectively applied to audio data for voice conversion. Notable nonparallel many-to-many GAN-based approaches include the Wasserstein generative adversarial network (WD-GAN) [14], cycle-consistent generative adversarial networks (CycleGAN) [15], and StarGAN [16]. In these many-to-many nonparallel techniques, both source and target singing voices must be included in the training process.

    Recent research has also explored image enhancement tasks, specifically super-resolution and inpainting, using deep learning, attention mechanisms, and multi-scale features [17,18,19]. DNNs play a central role in these works, learning complex mappings from low-resolution or incomplete images to high-resolution or complete versions [20,21]. Turning to the audio domain, Chen et al. [22] propose a noise-robust voice conversion model in which users can choose whether to retain or remove background sounds during conversion; the model combines speech separation and voice conversion modules. Tomasz et al. [23] discuss techniques for seamlessly transferring a speaker's identity to another speaker while preserving speech content. Another study explores using the cosine similarity between x-vector speaker embeddings as an objective metric for evaluating SVC [24]. That system preprocesses the source singer's audio to obtain melody features via the F0 contour, loudness curve, and phonetic posteriorgram.

    To disentangle singer identity from linguistic information, auto-encoders can be used. Thus, encoder-decoder networks have also been proposed for unsupervised voice conversion. These auto-encoder-based techniques include the variational auto-encoder (VAE) [25], cycle-consistent variational auto-encoder (CycleVAE) [26,27], and variational auto-encoding Wasserstein generative adversarial network (VAWGAN) [28]. Although these systems generate high-quality singing voices, their major limitation is that they can only be applied to many-to-many voice conversion tasks. These methods are inefficient for SVC when the reference and target singers are absent from the training process. The robust one-shot SVC model [29] relies on GANs to accurately recover the target pitch by matching pitch distributions, and adaptive instance normalization (AdaIN)-skip conditioning further enhances the model's performance. Unlike traditional voice cloning tasks, which modify the audio waveform to match a desired voice specified by reference audio, a novel task called visual voice cloning (V2C) [30] bridges the gap between voice cloning and visual information.

    Some recent works on voice conversion have focused on one-shot voice conversion. Examples include zero-shot voice style transfer with only auto-encoder loss (AutoVC) [31], AdaGAN-VC [32], and the two-level nested U-structure VC (U2VC) [33,34]. In one-shot voice conversion, inference can be performed even when either or both speakers are unseen in the training data. Recently, voice conversion in the speech domain has adopted one-shot techniques that focus on the separation of speaker and content information. Audio features such as mel-cepstral coefficients (MCEPS), aperiodicities, and fundamental frequencies are extracted using vocoders. The voice conversion process involves removing speaker-dependent features from the source speaker's utterance and incorporating the target speaker's attributes. AutoVC introduces a vanilla auto-encoder with a carefully designed information bottleneck and a pretrained speaker encoder for decoupling the speaker and content information. Meanwhile, a U-Net architecture combining vector quantization (VQ) and instance normalization (IN) (VQVC+) [35] successfully separates linguistic information using vector quantization. However, content leakage remains a significant limitation.

    AdaGAN-VC requires adversarial training for the separation of speech attributes, which causes instability during training. This issue is resolved in AdaIN-VC [36] and activation guidance and adaptive instance normalization (AGAIN)-VC, which utilize instance normalization techniques to remove the speaker information. AdaIN was first introduced in [37] for style transfer in image translation networks. It addresses the limitations of existing normalization methods in deep learning by extending the popular normalization technique IN to incorporate style information from a reference image. Specifically, it maps the normalized mean and standard deviation of the content image to match those of the style image. AdaIN-VC comprises two encoders with AdaIN for the disentanglement of information and one decoder. Lian et al. [38] proposed arbitrary voice conversion without any supervision, achieved using instance normalization and adversarial learning. Meanwhile, the masked auto-encoder (MAE-VC) [39] is an end-to-end masked auto-encoder that converts the speaker style of the source speech to that of the target speech while maintaining content consistency. In AGAIN-VC [40], speaker and content information are disentangled with the help of a single encoder. These approaches were introduced for voice conversion in the speech domain. In this paper, one-shot SVC is proposed using a combination of convolutional layers and an LSTM architecture, with AdaIN and AGAIN applied to SVC as baseline architectures.

    These methods are designed for voice conversion in the speech domain; only a few works have addressed the conversion of singing voices. This paper proposes a hybrid convolutional neural network with long short-term memory (CNN-LSTM) model for a one-shot SVC system using the AGAIN technique. The proposed AGAIN-SVC method requires a single encoder to separate the vocal timbre of the singer from the phonetic information. Remarkably, without any frame alignment procedures, one's singing voice can be converted into another person's voice using nonparallel data, even if both singers are unavailable during the training process. In this paper, two recent voice conversion techniques, AdaIN and AGAIN, are applied to one-shot SVC and also used as baseline models. To improve conversion performance, a combination of convolution layers and LSTM layers is used for the encoder and decoder architecture. Simultaneously achieving high synthesis quality and maintaining singer similarity poses a challenge, as enhancing one often comes at the cost of the other. Many VC systems employ disentanglement techniques to separate singer and linguistic content information. However, some methods reduce the dimension or quantize content embeddings to prevent singer information leakage, inadvertently compromising synthesis quality. Activation guidance serves as an information bottleneck on content embeddings, allowing better control over singer-related features. Additionally, AdaIN dynamically adjusts normalization statistics during training, enhancing the model's flexibility.

    The proposed technique significantly improves the delicate balance between synthesis quality and singer similarity. Its one-shot voice conversion relies on learned content features and adaptation mechanisms to transform the source singer's voice into the desired target singer's style, even when the target singer is unseen during training. This simplifies the architecture, making it efficient and practical. CNNs excel at extracting local features from audio data, but they fall short in modeling long-term dependencies over extended sequences. Recurrent architectures, such as LSTM and the gated recurrent unit (GRU), address this limitation by maintaining hidden states across time steps. Adding them after the CNN blocks introduces the ability to learn contextual information beyond local features, improving the model's understanding of speech patterns, intonation, and context.

    The main contributions of this paper are as follows: (i) Rapid voice transformation from an unseen target singer is used to directly convert the source singer's voice to the desired target singer's style. (ii) The one-shot voice conversion capability simplifies the architecture, making it efficient and practical. (iii) AdaIN and activation mechanisms achieve a balance between synthesis quality and singer similarity. (iv) Incorporating a recurrent architecture enhances contextual learning and improves temporal modeling.

    The paper is organized into four sections. The network architecture of the proposed SVC technique is described in Section 2. The experimental setup and evaluation results are discussed in Section 3. Finally, the conclusion is presented in Section 4.

    The AGAIN technique was first introduced for voice conversion. This paper combines the AGAIN technique with the proposed hybrid CNN-LSTM architecture for SVC.

    Internal covariate shift is defined as the variation in the distribution of each network layer's input due to alterations in the parameters of the previous layer. The training of earlier DNNs was affected by internal covariate shift, which slowed down the training process. To alleviate this issue, batch normalization (BN) [41,42] was introduced. Let N be the batch size, C the number of channels, H the height of each activation map, and W the width of the activation input. The activation layer input dimensions can then be represented as NCHW. Generally, normalization is applied to activation layers by shifting the mean and scaling the variance. In BN, the mean and variance for each channel are computed across all samples and both spatial dimensions; thus, BN normalizes the activations across the N, H, and W axes. The statistics of BN layers are therefore batch-wise means and variances. Interestingly, the style of an image can be represented using the BN statistics. Consequently, neural style transfer between two feature maps becomes possible by replacing the batch-wise mean and variance of the source image with those of the target image [43].

    Instead of BN, IN was introduced in [44] for feed-forward stylization tasks, achieving noticeable improvements in the generated stylized images. Unlike BN, IN can be applied in the same way at both training and testing time, ensuring consistency during training, transfer, and testing. IN normalizes across the H×W axes by calculating the mean and variance over both spatial dimensions for each channel and each sample.
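    As an illustration of this difference, the following minimal NumPy sketch computes the BN and IN statistics on an NCHW activation tensor; the tensor shape and values are illustrative assumptions only:

    import numpy as np

    N, C, H, W = 8, 16, 32, 32           # batch, channels, height, width
    x = np.random.randn(N, C, H, W)

    # Batch normalization: one mean/variance per channel, computed over N, H, and W.
    bn_mean = x.mean(axis=(0, 2, 3))     # shape (C,)
    bn_var = x.var(axis=(0, 2, 3))       # shape (C,)

    # Instance normalization: one mean/variance per sample and per channel,
    # computed over the spatial axes H and W only.
    in_mean = x.mean(axis=(2, 3))        # shape (N, C)
    in_var = x.var(axis=(2, 3))          # shape (N, C)

    print(bn_mean.shape, in_mean.shape)  # (16,) (8, 16)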

    Let X and Y represent the feature maps of the reference and target waveforms, respectively. IN can be computed as follows:

    $$\mathrm{IN}(X)=\gamma\left(\frac{X-\mu(X)}{\sigma(X)}\right)+\beta \qquad (2.1)$$

    where β and γ are the learned affine parameters. Here, μ(X) is the mean and σ(X) is the standard deviation computed across spatial dimensions independently for each feature channel and each sample.

    AdaIN is simply an extension of IN. Unlike IN, it has no learnable affine transformations. While IN normalizes the input to a single learned style, AdaIN normalizes the input to an arbitrarily given style. Thus, AdaIN imposes the style of the target on the reference by simply adjusting the mean and standard deviation of the reference input to match those of the target input. AdaIN can be represented as:

    $$\mathrm{AdaIN}(X,\mu(Y),\sigma(Y))=\sigma(Y)\left(\frac{X-\mu(X)}{\sigma(X)}\right)+\mu(Y) \qquad (2.2)$$

    Here, X and Y are the content input and style input, respectively. The AdaIN layer thus replaces the style of X with the style of Y by scaling the normalized content input with the standard deviation of the style input and shifting it by μ(Y).
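    A small sketch of Eqs 2.1 and 2.2 for 1D audio feature maps of shape (channels, frames) is given below; the epsilon term and the feature shapes are illustrative assumptions:

    import numpy as np

    def instance_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        """Eq 2.1: normalize each channel over time, then apply the affine parameters."""
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return gamma * (x - mu) / (sigma + eps) + beta

    def adain(x_content, y_style, eps=1e-5):
        """Eq 2.2: give the content features the channel-wise statistics of the style input."""
        mu_x = x_content.mean(axis=-1, keepdims=True)
        sigma_x = x_content.std(axis=-1, keepdims=True)
        mu_y = y_style.mean(axis=-1, keepdims=True)
        sigma_y = y_style.std(axis=-1, keepdims=True)
        return sigma_y * (x_content - mu_x) / (sigma_x + eps) + mu_y

    X = np.random.randn(80, 128)   # e.g., 80 mel channels, 128 frames (content/reference)
    Y = np.random.randn(80, 96)    # target singer's features (style)
    converted = adain(X, Y)        # content of X with the channel statistics of Y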

    AdaIN was proposed in [37] for arbitrary style transfer in real time, and the idea was later extended to the voice conversion of speech signals. In audio, the statistics (mean and variance) of the content features might represent phonemes or musical notes, while the style features could correspond to timbres or specific audio effects. By applying AdaIN, the style characteristics of one audio signal can be transferred onto the content of another. In Eq 2.1, the γ parameter scales the style features, whereas the β parameter shifts the content features by adding an offset. AdaIN-SVC employs a VAE with distinct encoders to encode the singer-specific information and the phonetic content information. The AdaIN approach can thus also be applied to the conversion of singing voices.

    Consider the singing voice as phonetic information sung by a singer whose unique characteristics constitute the singer identity information. While the lyrical information changes drastically from frame to frame, the singer information experiences only slight variations. Since the mean and variance over an entire singing waveform rarely change, the singer information can be represented by these statistics.

    AdaIN-SVC employs two encoders to encode the singer information and the phonetic content information. The encoder takes mel-spectrogram frames of the singing waveform as input. IN layers are added in each encoder block; these layers effectively separate the singer information from the lyrical content. To achieve this, the channel-wise mean (μ) and standard deviation (σ) are computed. The IN layer detaches the singer representation from the encoded phonetic content information. During the decoding process, the detached singer information is reintroduced through the AdaIN layer.

    AdaIN-SVC thus relies on a VAE with separate content and singer encoders and leverages AdaIN to separate singer and content representations. It either reduces the dimension or quantizes the content embeddings to prevent singer information leakage; however, this strong information bottleneck can harm synthesis quality. AGAIN-SVC instead introduces a proper activation as an information bottleneck on the content embeddings and simplifies the architecture to an auto-encoder with a single encoder and a decoder. By using activation guidance, it can drastically improve the trade-off between synthesis quality and singer similarity in the converted voice.

    The proposed AGAIN-SVC is an auto-encoder-based model that disentangles the singer identity and the phonetic information. The framework comprises a single encoder and a decoder. Unlike AdaIN-SVC, which utilizes separate encoders for phonetic content and singer information, AGAIN-SVC uses only one encoder. The singer information is considered time-independent, whereas the phonetic content information is time-dependent. Because these two components can be separated, the style transfer of singing data becomes easier. To prevent information leakage without affecting conversion performance, activation guidance is used as an information bottleneck. The training procedure of the proposed AGAIN-SVC system is shown in Figure 1. Instead of modeling raw singing audio, mel-spectrograms are used; these lower-resolution representations are easier to model than raw temporal audio and can be faithfully regenerated back to audio. The IN layer detaches the linguistic content from the singer-dependent features, and the singer embeddings are later added back through the AdaIN layer in the decoding section. The evaluation procedure of AGAIN-SVC is illustrated in Figure 2. Given reference X and target Y, the AdaIN layer replaces the singer features (channel-wise mean and variance) of the reference with those of the target singer. Meanwhile, the content of the reference remains unchanged, and only the singer identity is transferred.

    Figure 1.  Training phase procedure of AGAIN SVC. Here, μ represents mean and σ is the standard deviation. P denotes linguistic information.
    Figure 2.  Diagram of the evaluation phase of AGAIN SVC. The linguistic content, denoted by Ps, and the singer representation (μs and σs) of the source data are separated; Ps is passed to the decoder unchanged, while μs and σs are replaced with the target singer representation (μt and σt) and fed into the AdaIN layer of the decoder.

    The encoder acts as a content encoder, responsible for encoding the phonetic content information from the singing data. The mean and variance, which are time-independent features, seldom carry any phonetic content information. Accordingly, IN layers in each encoder block remove this time-invariant information that represents the singer identity. To prevent any leakage of singer information along with the phonetic content embedding, an activation function is included as an information bottleneck. The sigmoid function with the hyperparameter value α=0.1 is chosen; it proves effective in disentangling the singer identity from the phonetic content embedding while minimizing the reconstruction error. This activation function ensures that the encoding contains only the phonetic information.
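    The following sketch shows one plausible form of this activation bottleneck; treating α as a slope on the pre-activation is an assumption here, not necessarily the exact formulation used in the paper:

    import torch

    def activation_guidance(content_embedding: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
        """Squash the IN-normalized content embedding into (0, 1) so that little
        singer information can leak through the content path."""
        return torch.sigmoid(alpha * content_embedding)

    content = torch.randn(1, 4, 128)            # (batch, channels, frames), illustrative shape
    bottlenecked = activation_guidance(content)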

    The basic architecture is based on the AGAIN voice conversion system. It employs a U-net architecture, where both the encoder and decoder utilize 1D convolutional layers. A CNN [45,46,47], or ConvNet, is a feed-forward neural network containing a stack of convolutional layers, typically hidden layers that perform convolutions. The proposed hybrid architecture combines a CNN with an RNN. In DNNs, especially those with many layers, gradients can diminish exponentially as they propagate backward. This phenomenon hinders effective training and can significantly prolong the learning process or even cause convergence failure. The vanishing gradient problem is particularly associated with activation functions like the sigmoid. The specialized RNN architectures LSTM and GRU address the vanishing gradient problem by introducing gating mechanisms and maintaining cell states. LSTMs [48] introduce gates (input, forget, and output gates) that control the flow of information within the cell and preserve it over longer sequences; the cell state allows information to flow across time steps without vanishing gradients. GRUs [49] introduce two essential gates, the update gate and the reset gate, and compute a new hidden state by blending the previous hidden state and the current input. This mechanism allows them to capture temporal dependencies effectively. The hybrid models integrate the RNN (either LSTM or GRU), which acts as a feedback network, with the feed-forward CNN. In these hybrid networks, CNN layers extract complex features, while LSTM or GRU layers capture long-term dependencies.

    The encoder and decoder shown in Figures 1 and 2 form a U-net architecture, as depicted in Figure 3. The input mel-spectrogram features pass through 1D convolution layers. Both the encoder and decoder use 1D convolution blocks, which consist of 1D convolution layers followed by BN and the leaky rectified linear unit (LeakyReLU) activation function. Skip connections between the encoder and decoder are formed by transferring the singer embeddings μ and σ from the IN layers to the AdaIN layers. These skip connections allow encoded features from the encoder to be forwarded directly to the decoder. By connecting the encoder and decoder, information from earlier layers can bypass the bottleneck (latent space) and directly influence the reconstruction process. This helps mitigate the vanishing gradient problem and facilitates better information flow; by allowing gradients to flow more freely, skip connections stabilize training and improve convergence. The content embeddings from the encoder are passed to the decoder, and the converted mel-spectrogram features are extracted from the decoder. LSTM layers are placed after the convolution layers at both the encoder and decoder ends. To obtain the hybrid CNN-GRU model, the LSTM layers are replaced with GRU layers; both LSTM and GRU are used in the same way and have similar effects. The pseudocode for the proposed system is given in Algorithm 1.

    Figure 3.  The network architecture of the proposed hybrid CNN-LSTM model.

    Algorithm 1 Enhanced AGAIN-VC with ConvLSTM Pseudocode
    1: Load preprocessed data (mel-spectrograms) for source and target speakers
    2: Initialize model parameters (encoder, ConvLSTM layers, and decoder)
    3: Encoder:
    4:  Input: Mel-spectrogram of source speaker's speech
    5:  Output: Content embeddings
    6:  Encode the input mel-spectrogram using the encoder (including ConvLSTM layers)
    7: Activation Guidance:
    8:  Apply activation guidance to the content embeddings
    9:  Restrict the flow of information to essential features
    10: Adaptive Instance Normalization (AdaIN):
    11:  Adapt the statistics (mean and variance) of content embeddings
    12:  to match those of the target speaker
    13:  Preserve linguistic content while adjusting for style
    14: Decoder:
    15:  Input: Adapted content embeddings and target speaker's style
    16:  Output: Converted mel-spectrogram
    17:  Decode the adapted content embeddings using the decoder
    18: Loss Functions:
    19:  L1 Loss: Minimize difference between input and output mel-spectrograms
    20:  Perceptual Loss: Encourage perceptual similarity with target speaker's speech
    21:  Total loss: Combination of L1 and perceptual losses
    22: Training:
    23:  Optimize model parameters using backpropagation and gradient descent
    24:  Monitor training progress (e.g., total steps, evaluation steps)
    25: Inference:
    26:  Given a source mel-spectrogram, convert it to target speaker's style
    27:  Output the converted mel-spectrogram
    28: ConvLSTM Layers:
    29:  Incorporate ConvLSTM layers within the encoder and decoder
    30:  Enhance the temporal modeling capabilities by capturing spatio-temporal features
    31:  Adjust hyperparameters (e.g., kernel size, hidden units) as needed
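    To make the structure above concrete, the following is a minimal PyTorch-style sketch of the encoder path: 1D convolution blocks (Conv1d, BN, LeakyReLU), an IN step that strips out and returns the singer statistics, an LSTM for temporal context, and the sigmoid activation bottleneck. Layer sizes and the exact ordering are illustrative assumptions rather than the authors' exact configuration:

    import torch
    import torch.nn as nn

    class ConvBlock(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm1d(out_ch),
                nn.LeakyReLU(0.2),
            )

        def forward(self, x):
            return self.net(x)

    class HybridEncoder(nn.Module):
        def __init__(self, n_mels=80, hidden=256, alpha=0.1):
            super().__init__()
            self.convs = nn.Sequential(ConvBlock(n_mels, hidden), ConvBlock(hidden, hidden))
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.alpha = alpha

        def forward(self, mel):                        # mel: (batch, n_mels, frames)
            h = self.convs(mel)
            mu = h.mean(dim=-1, keepdim=True)          # channel-wise singer statistics
            sigma = h.std(dim=-1, keepdim=True)
            h = (h - mu) / (sigma + 1e-5)              # instance normalization
            h, _ = self.lstm(h.transpose(1, 2))        # temporal modeling after the convs
            content = torch.sigmoid(self.alpha * h)    # activation bottleneck
            return content.transpose(1, 2), mu, sigma  # content embedding plus singer stats

    encoder = HybridEncoder()
    content, mu, sigma = encoder(torch.randn(2, 80, 128))

    A decoder would then consume the content embedding and apply AdaIN with the target singer's μ and σ, mirroring the evaluation flow in Figure 2.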

    MelGAN [50] is specifically designed for efficient audio waveform synthesis. Unlike autoregressive models (such as WaveNet and WaveRNN), which suffer from high computational complexity, MelGAN focuses on reducing this complexity. It achieves this by decomposing raw audio samples into learned basis functions and associated weights. As a result, MelGAN significantly reduces computational demands compared to other GAN-based neural vocoders, such as HiFi-GAN [51]. Moreover, MelGAN consistently produces high-quality audio that rivals other GAN-based vocoders. MelGAN comprises two main components: the generator and the discriminator. The primary objective of MelGAN is to generate high-quality audio waveforms from mel-spectrograms. The generator, being lightweight, efficiently converts mel-spectrograms into raw audio waveforms. Conversely, the discriminator plays the crucial role of distinguishing between real and generated audio. Through an adversarial training process, the generator continually improves the quality of the synthesized audio. Additionally, by conditioning the generator on specific attributes (such as speaker identity or emotion), MelGAN can synthesize audio with desired characteristics. To enable conditional synthesis, MelGAN requires paired data during training, allowing it to learn the mapping from source mel-spectrograms to target audio waveforms.

    The model is evaluated using the National University of Singapore (NUS) sung and spoken lyrics corpus (NUS-48E corpus) [52]. The corpus comprises speaking and singing voices from 48 English songs performed by six male and six female subjects, of which 20 songs are unique. The total length of the audio recordings is approximately 169 minutes, with 115 minutes of singing data and the remaining 54 minutes of speech data. From the NUS-48E database, nine singers are selected for training, while the remaining three singers (two male and one female) are reserved for testing. The audio is resampled to 22,050 Hz, and silent frames are removed. 80-bin mel-scale spectrograms are generated from the audio data as acoustic features. To generate the waveform back from the mel-spectrogram, MelGAN, a fully convolutional architecture within a GAN setup employed for conditional audio waveform synthesis, is used as the vocoder.
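    A sketch of this preprocessing step is shown below, assuming librosa defaults for the parameters not stated in the text (FFT size, hop length, silence threshold); the file name is hypothetical:

    import numpy as np
    import librosa

    def extract_mel(path, sr=22050, n_mels=80, top_db=30):
        y, _ = librosa.load(path, sr=sr)                       # resample to 22,050 Hz
        intervals = librosa.effects.split(y, top_db=top_db)    # locate non-silent regions
        y = np.concatenate([y[s:e] for s, e in intervals])     # drop the silent frames
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return np.log(mel + 1e-9)                              # 80-bin log-mel features

    # mel = extract_mel("nus48e_singer01_song01.wav")          # hypothetical file name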

    The adaptive moment estimation (Adam) optimizer is used for training. The model is trained for 50,000 steps with a learning rate of 0.0005 and a batch size of 32. For performance evaluation of the proposed architecture, AdaIN-SVC and AGAIN-SVC are used as baseline models.
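    A minimal sketch of this training configuration is given below; the stand-in auto-encoder and the random batch are placeholders for the actual model and data loader, and only the L1 reconstruction term from Algorithm 1 is shown:

    import torch
    import torch.nn as nn

    model = nn.Sequential(                                       # stand-in auto-encoder
        nn.Conv1d(80, 256, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv1d(256, 80, kernel_size=3, padding=1),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)    # Adam, learning rate 0.0005

    for step in range(50_000):                                   # 50,000 training steps
        mel = torch.randn(32, 80, 128)                           # placeholder batch (batch size 32)
        recon = model(mel)
        loss = torch.nn.functional.l1_loss(recon, mel)           # L1 reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()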

    SVC is primarily associated with the transformation of MCEPS. For evaluation purposes, it is necessary to calculate the distortion between the original and transformed MCEPS features. Mel-cepstral distortion (MCD) [53,54] is a widely used measure for evaluating synthesized voices. MCD represents the Euclidean distance between the real and converted MCEPS features and is defined as follows:

    $$\mathrm{MCD\,(dB)}=\frac{10}{N\ln 10}\sum_{n=1}^{N}\sqrt{2\sum_{d=1}^{D}\bigl(m_{c}(n,d)-m_{t}(n,d)\bigr)^{2}} \qquad (3.1)$$

    where m_c(n,d) and m_t(n,d) are the dth coefficients of the converted and target MCEPS features, respectively, at the nth frame, D = 24 is the dimension of the MCEPS, and N is the total number of frames. MCD values are tabulated in Table 1 for the four possible conversions: male to male (MM), male to female (MF), female to male (FM), and female to female (FF). Conversions with lower MCD values exhibit smaller distortion and better performance. Interestingly, the proposed method yields the best results for FF conversion. Female voices often share more similar acoustic characteristics with each other than male voices do, including pitch range, formant frequencies, and overall timbre. When converting from one female voice to another, the transformation is therefore less drastic, which might lead to better results.
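    Eq 3.1 translates directly into code; the sketch below assumes the converted and target MCEPS sequences have already been time-aligned (e.g., by dynamic time warping) into arrays of shape (N, D):

    import numpy as np

    def mel_cepstral_distortion(mc_converted, mc_target):
        """MCD in dB between converted and target MCEPS, both of shape (N, D)."""
        diff = mc_converted - mc_target
        per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return (10.0 / np.log(10.0)) * per_frame.mean()     # mean over frames = (1/N) * sum

    mc_c = np.random.rand(200, 24)     # 200 frames, D = 24 coefficients (converted)
    mc_t = np.random.rand(200, 24)     # target
    print(f"MCD: {mel_cepstral_distortion(mc_c, mc_t):.2f} dB")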

    Table 1.  MCD results for MM, MF, FM, and FF conversions.
    System             MCD (dB)
                       MM      MF      FM      FF      Average
    AdaIN-SVC          8.20    7.92    8.30    8.79    8.30
    AGAIN-SVC          6.32    6.31    6.37    6.92    6.48
    Hybrid CNN-GRU     5.92    5.80    5.95    5.33    5.75
    Hybrid CNN-LSTM    5.99    5.56    5.94    4.91    5.60


    For a comparative study with state-of-the-art techniques, experiments are conducted using several machine learning methods from the field of voice conversion, including VAE, VAWGAN, CycleGAN, and the cycle-consistent boundary equilibrium generative adversarial network (CycleBEGAN), along with the baseline models (Table 2). The evaluation measures include MCD, the root mean square error (RMSE) between mel-spectrogram features of the real and converted voices, and the log spectral distortion (LSD). LSD computes the difference between the linear prediction coding (LPC) log power spectra of the original and converted singing voices. The converted singing voice retains spectral features similar to the original when this distortion is minimized.
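    For reference, simple versions of the RMSE and LSD computations are sketched below; note that the LSD here is computed from STFT power spectra as a simplified stand-in for the LPC-based log power spectra used in the paper:

    import numpy as np

    def rmse(a, b):
        """Root mean square error between two equally shaped feature arrays."""
        return np.sqrt(np.mean((a - b) ** 2))

    def log_spectral_distortion(p_ref, p_conv, eps=1e-10):
        """p_ref, p_conv: (frames, freq_bins) power spectra of original and converted voices."""
        d = 10.0 * np.log10(p_ref + eps) - 10.0 * np.log10(p_conv + eps)
        return np.mean(np.sqrt(np.mean(d ** 2, axis=1)))    # per-frame LSD averaged over frames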

    Table 2.  Comparative study with state-of-the-art techniques.
    System             MCD (dB)   RMSE (dB)   LSD (dB)
    VAE                8.20       3.14        11.13
    VAWGAN             6.32       3.36        10.14
    CycleGAN           5.92       3.32        10.91
    CycleBEGAN         5.99       3.16        9.95
    AdaIN-SVC          8.30       4.50        11.24
    AGAIN-SVC          6.48       3.14        10.29
    Hybrid CNN-GRU     5.75       2.47        10.11
    Hybrid CNN-LSTM    5.60       2.04        9.80


    From the results, it can be seen that the hybrid models yield more promising outcomes than the baseline methods; the hybrid CNN-RNN systems demonstrate superior performance compared to the baseline AdaIN-SVC and AGAIN-SVC models. Although the GRU model is less complex and faster to train, the fusion with the LSTM architecture produces a converted voice that is more similar to the target. This arises because the LSTM network has more trainable parameters, which in turn enhance its modeling capability during training.

    Additionally, the global variance (GV) distribution [55] plot for synthesized and target singing voices serves as a valuable measure to assess their closeness. GV visualizes spectral features in terms of variance distribution. For instance, Figure 4 illustrates the variance distribution plot for target singing voice and converted singing voice across four conversions. The results affirm that the converted singing voices closely match the target singing voice.

    Figure 4.  Global variance distribution of target and converted singing voice for each frequency index.
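    The global variance plotted in Figure 4 can be computed as the per-frequency variance of the mel-spectrogram over all frames of an utterance; a short sketch with illustrative shapes follows:

    import numpy as np

    def global_variance(mel):
        """mel: (n_mels, frames) spectrogram -> variance of each frequency index."""
        return np.var(mel, axis=1)

    gv_target = global_variance(np.random.randn(80, 500))      # target singing voice
    gv_converted = global_variance(np.random.randn(80, 480))   # converted singing voice
    # Plotting gv_target and gv_converted against the frequency index gives the
    # kind of comparison shown in Figure 4.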

    Incorporating LSTM or GRU with a CNN can enhance voice conversion models by capturing both local and contextual features; the choice depends on the specific requirements and available resources. LSTM has more parameters and can capture intricate long-term dependencies, while GRU simplifies the architecture with fewer gates and parameters. If the system prioritizes accuracy and sufficient data are available, LSTM might provide better results; if efficiency is crucial, GRU could be a better choice, since it typically trains faster owing to its simpler structure and is preferable when training resources are limited. LSTM may be more suitable for tasks requiring precise modeling of long-range dependencies (e.g., music composition), whereas GRU could be preferable for real-time applications where efficiency matters (e.g., voice-controlled systems).

    Objective evaluation measures alone have limitations because the perceived quality of the synthesized singing voice depends on human listeners. To ensure the quality of the synthesized singing voice, subjective evaluation is also performed. In this subjective evaluation, 24 participants (12 male and 12 female), all without any hearing problems, took part in listening tests. Each participant provided a mean opinion score (MOS) for three attributes, naturalness, melodic similarity, and phoneme intelligibility, and took part in XAB preference tests on singer similarity. XAB preference testing is a method used to compare two different stimuli (A and B) against a reference stimulus (X). During the listening tests, participants listened to two randomly chosen songs from each conversion.

    For the XAB test on singer similarity, participants listened to a target song and the corresponding converted singing voices from two systems, and then chose their preferred system based on singer similarity with the target. The experiments included intra-gender (MM or FF) and inter-gender (MF or FM) conversions. Initially, the preference test involved converted songs from the AdaIN-SVC and AGAIN-SVC models. Subsequently, comparisons were conducted between AdaIN-SVC and the hybrid CNN-LSTM, as well as between AGAIN-SVC and the hybrid CNN-LSTM. The results of the XAB preference tests are depicted in Figures 5, 6, and 7.

    Figure 5.  XAB preference test results between AdaIN-SVC and AGAIN-SVC.
    Figure 6.  XAB preference test results between AdaIN-SVC and the proposed hybrid CNN-LSTM.
    Figure 7.  XAB preference test results between AGAIN-SVC and the proposed hybrid CNN-LSTM.

    Additionally, MOS tests were conducted, where each participant provided a score on a scale of 1 to 5 for each system (Figure 8). A score of 1 corresponds to the worst case, while a score of 5 represents the best case. Naturalness reflects how closely the converted singing voice resembles a natural human voice. The melodic similarity attribute is evaluated based on the resemblance in melody between the target voice and the converted voice. Finally, MOS on phoneme intelligibility provides insights into the clarity of phonemes in the converted singing voice.

    Figure 8.  MOS for three attributes; naturalness, melodic similarity, and phoneme intelligibility.

    The singer similarity between AdaIN-SVC and AGAIN-SVC exhibits somewhat equal preferences. However, from the preference test results comparing AdaIN-SVC with the proposed hybrid CNN-LSTM system, the latter shows greater resemblance to the target singer's voice. Specifically, for inter-gender conversions, the hybrid CNN-LSTM system achieves 86.67% similarity, whereas AdaIN-SVC achieves only 13.3%. For intra-gender conversions, the hybrid CNN-LSTM system achieves 93.3% similarity, while AdaIN-SVC lags behind at 6.7%. Additionally, the hybrid CNN-LSTM model is preferred over AGAIN-SVC. Analyzing the percentage preferences reveals substantial differences: AdaIN-SVC vs. the hybrid model exhibits large disparities, indicating that AdaIN-SVC has the least preference. Meanwhile, AGAIN-SVC vs. the hybrid model comparison shows that the hybrid model is highly preferred. Furthermore, the melodic similarity and phoneme intelligibility of the converted singing voice in the proposed hybrid systems outperform the baseline AdaIN-SVC and AGAIN-SVC models (Figure 8). This improvement is attributed to the activation guidance, which mitigates the phonetic content loss associated with the latter. However, there is room for improvement in the naturalness of the overall system.

    In the proposed work, a combination of CNN and RNN models is used for one-shot SVC, employing adaptive normalization layers and activation guidance. IN layers remove singer information from the source singing mel-spectrogram, while AdaIN layers add target singer information to the content representation. Activation guidance prevents singer information leakage, ensuring better control over singer-related features; it disentangles singer and content information without leaking singer information into the content embeddings. By incorporating recurrent architectures such as LSTM and GRU, which excel at capturing long-term dependencies in sequential data, this work achieves a high-quality converted singing voice while maintaining singer characteristics. The one-shot capability simplifies the architecture, making it efficient and practical. The conversion performance is assessed through objective and subjective evaluation, with AdaIN and AGAIN used as baseline architectures. The fusion of the CNN and LSTM models consistently yields better results across all experiments. The MCD is lowest for the proposed hybrid CNN-LSTM model (5.6 dB), confirming superior conversion. In the preference tests, the majority of listeners preferred the proposed technique: AdaIN-SVC was the least preferred, followed by AGAIN-SVC, with the hybrid CNN-LSTM most preferred. The proposed technique achieves MOSs of 2.93, 3.35, and 3.21 for naturalness, melodic similarity, and phoneme intelligibility, respectively.

    Although the proposed hybrid CNN-LSTM model aims to balance content preservation with speaker identity transformation, achieving the perfect trade-off remains challenging. Formulating SVC as a multi-objective optimization problem or introducing a latent space and adjusting specific dimensions within this space allows users to control the balance between similarity and quality. One-shot methods are more susceptible to background noise, accent variations, and emotional fluctuations. Robustness to such factors is an ongoing research area. These modifications represent exciting avenues for future research.

    Assila Yousuf: Resources, Investigation, Validation, Formal Analysis, Writing ‒ original draft; David Solomon George: Supervision; All authors: Conceptualization, Methodology. All authors have read and approved the final version of the manuscript for publication.

    The authors would like to thank Directorate of Technical Education (DTE) and A.P.J Abdul Kalam Technological University, Kerala, India for the support provided.

    All authors declare no conflicts of interest in this paper.



    [1] Franco JT, Buzatto HK, de Carvalho MA (2023) TRIZ as a Strategy for Improvement of Process Control in the Wood Industry. In TRIZ in Latin America: Case Studies 175‒191. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-20561-3_8
    [2] Ding Y, Ma Y, Liu T, Zhang J, Yang C (2023) Experimental Study on the Dynamic Stability of Circular Saw Blades during the Processing of Bamboo-Based Fiber Composite Panels. Forests 14: 1855. https://doi.org/10.3390/f14091855 doi: 10.3390/f14091855
    [3] Kim M, Lee SU, Kim SS (2021) Real-time simulator of a six degree-of-freedom hydraulic manipulator for pipe-cutting applications. IEEE Access 9: 153371‒153381. https://doi.org/10.1109/ACCESS.2021.3127502 doi: 10.1109/ACCESS.2021.3127502
    [4] Chen Z, Zhang Y, Wang C, Chen B (2021) Understanding the cutting mechanisms of composite structured soft tissues. Int J Mach Tool Manu 161: 103685. https://doi.org/10.1016/j.ijmachtools.2020.103685 doi: 10.1016/j.ijmachtools.2020.103685
    [5] Wang H, Satake U, Enomoto T (2023) Serrated chip formation mechanism in orthogonal cutting of cortical bone at small depths of cut. Journal of Materials Processing Technology 319: 118097. https://doi.org/10.1016/j.jmatprotec.2023.118097 doi: 10.1016/j.jmatprotec.2023.118097
    [6] Wang Z, Li T (2022) Design of gas drainage system based on PLC redundancy control technology. https://doi.org/10.21203/rs.3.rs-2361008/v1
    [7] Paul A, Biradar B, Salkar T, Mahajan S (2023) Study on Various Type of Packaging Machines. Grenze International Journal of Engineering & Technology (GIJET) 9(2).
    [8] Peng C, Zhang Z, Liu W, Li D (2022) Ranking of Key Components of CNC Machine Tools Based on Complex Network. Math Probl Eng 2022. https://doi.org/10.1155/2022/6031626 doi: 10.1155/2022/6031626
    [9] Syufrijal S, Rif'an M, Prabumenang AW, Wicaksono R (2021) Design and implementation of pipe cutting machine with AC servo motor and PLC based on HMI. In IOP Conference Series: Materials Science and Engineering 1098: 042082. IOP Publishing. https://doi.org/10.1088/1757-899X/1098/4/042082
    [10] Wang G, Sun X, Luo Z, Ye T (2021) Cutting Device for Production of Spiral Submerged Arc Welded Pipe Based on PLC Control System. In Big Data Analytics for Cyber-Physical System in Smart City: BDCPS, 28-29 2020, Shanghai, China 901‒906. Springer Singapore. https://doi.org/10.1007/978-981-33-4572-0_129
    [11] Rallabandi SR, Yanda S, Rao CJ, Ramakrishna B, Apparao D (2023) Development of a color-code sorting machine operating with a pneumatic and programmable logic control. Materials Today: Proceedings. https://doi.org/10.1016/j.matpr.2023.05.150
    [12] Attar KA, Patil SA, Patil PD, Sutar AD, Patil SA, Bartake G (2020) Development and Fabrication of Automatic Chakali Making Machine.
    [13] Talahma M, Isied O (2022) Design and Implement of a Prototyping Automatic Sweet Forming Machine.
    [14] Wu Q, Mao Y, Chen J, Wang C (2021) Application research of digital twin-driven ship intelligent manufacturing system: Pipe machining production line. J Mar Sci Eng 9: 338. https://doi.org/10.3390/jmse9030338 doi: 10.3390/jmse9030338
    [15] Yu M, Wang B, Ji P, Li B, Zhang L, Zhang Q (2023) Simulation analysis of the circular sawing process of medium density fiberboard (MDF) based on the Johnson–Cook model. Eur J Wood Wood Prod 1‒3. https://doi.org/10.1007/s00107-023-02007-5
    [16] Shaheed BN, Selman NH (2023) Design and implementation of a control system for a steel plate cutting production line using programmable logic controller. International Journal of Electrical and Computer Engineering (IJECE) 13: 3969‒3976. https://doi.org/10.11591/ijece.v13i4.pp3969-3976 doi: 10.11591/ijece.v13i4.pp3969-3976
    [17] Zhu H, Chen JW, Ren ZQ, Zhang PH, Gao QL, Le XL, et al. (2022) A new technique for high-fidelity cutting technology for hydrate samples. Journal of Zhejiang University-SCIENCE A 23: 40‒54. https://doi.org/10.1631/jzus.A2100188 doi: 10.1631/jzus.A2100188
    [18] Zhu X, Zhong J, Jing J, Ye W, Zhou B, Shan H (2023) Fuzzy proportional–integral–derivative control system of electric drive downhole cutting tool based on genetic algorithm. Proceedings of the Institution of Mechanical Engineers, Part E: Journal of Process Mechanical Engineering 09544089231172608. https://doi.org/10.1177/09544089231172608
    [19] Du T, Dong F, Xu R, Zou Y, Wang H, Jiang X, et al. (2022) A Drill Pipe‐Embedded Vibration Energy Harvester and Self‐Powered Sensor Based on Annular Type Triboelectric Nanogenerator for Measurement while Drilling System. Adv Mater Technol 7: 2200003. https://doi.org/10.1002/admt.202200003 doi: 10.1002/admt.202200003
    [20] Mu X, Xue Y, Jia YB (2023) Dexterous Robotic Cutting Based on Fracture Mechanics and Force Control. IEEE T Autom Sci Eng. https://doi.org/10.1109/TASE.2023.3309784
    [21] Limonov L, Osichev A, Tkachenko A, Kunchenko T (2022) Dynamics of the Electric Drive of the Flying Saw of an Electric Pipe Welding Mill. In 2022 IEEE 3rd KhPI Week on Advanced Technology (KhPIWeek) 1‒6. IEEE. https://doi.org/10.1109/KhPIWeek57572.2022.9916384
    [22] Salem ME, El-Batsh HM, El-Betar AA, Attia AM (2021) Application of neural network fitting for pitch angle control of small wind turbines. IFAC-PapersOnLine 54: 185‒190. https://doi.org/10.1016/j.ifacol.2021.10.350 doi: 10.1016/j.ifacol.2021.10.350
    [23] Vadi S, Bayindir R, Toplar Y, Colak I (2022) Induction motor control system with a Programmable Logic Controller (PLC) and Profibus communication for industrial plants—An experimental setup. ISA T 122: 459‒471. https://doi.org/10.1016/j.isatra.2021.04.019 doi: 10.1016/j.isatra.2021.04.019
    [24] Barbosa AF, Campilho RD, Silva FJ, Sánchez-Arce IJ, Prakash C, Buddhi D (2022) Design of a Spiral Double-Cutting Machine for an Automotive Bowden Cable Assembly Line. Machines 10: 811. https://doi.org/10.3390/machines10090811 doi: 10.3390/machines10090811
    [25] Ali AW, Alquhali AH (2020) Improved Internal Model Control Technique for Position Control of AC Servo Motors. ELEKTRIKA-Journal of Electrical Engineering 19: 33‒40. https://doi.org/10.11113/elektrika.v19n1.179 doi: 10.11113/elektrika.v19n1.179
    [26] Vaishnavi Patil DP, Manjunath TC (2023) Design, Development of a Diversified Implemention of a Supervisory Control And Data Acquisition based VLSI System (SCADA) framework Utilizing Microcontroller based Programmable Logic Controllers. Tuijin Jishu/Journal of Propulsion Technology 44: 879‒890. https://doi.org/10.52783/tjjpt.v44.i3.389 doi: 10.52783/tjjpt.v44.i3.389
    [27] Acosta PC, Terán HC, Arteaga O, Terán MB (2020) Machine learning in intelligent manufacturing system for optimization of production costs and overall effectiveness of equipment in fabrication models. Journal of Physics: Conference Series 1432: 012085. IOP Publishing. https://doi.org/10.1088/1742-6596/1432/1/012085 doi: 10.1088/1742-6596/1432/1/012085
    [28] Chengye L, Rui W, Wanjin W, Yajun W, Hao L, Xianmeng Z (2021) Development of Special Device for Cutting Irradiation Test Tube. Nuclear Power Engineering 42: 10‒14.
    [29] Srinivas GL, Singh SP, Javed A (2021) Experimental evaluation of topologically optimized manipulator-link using PLC and HMI based control system. Materials Today: Proceedings 46: 9690‒9696. https://doi.org/10.1016/j.matpr.2020.08.023 doi: 10.1016/j.matpr.2020.08.023
    [30] Zuo Y, Mei J, Zhang X, Lee CH (2020) Simultaneous identification of multiple mechanical parameters in a servo drive system using only one speed. IEEE T Power Electr 36: 716‒726. https://doi.org/10.1109/TPEL.2020.3000656 doi: 10.1109/TPEL.2020.3000656
    [31] Eshraghian JK, Wang X, Lu WD (2022) Memristor-based binarized spiking neural networks: Challenges and applications. IEEE Nanotechnol Mag 16: 14‒23. https://doi.org/10.1109/MNANO.2022.3141443 doi: 10.1109/MNANO.2022.3141443
    [32] Fuller A, Fan Z, Day C, Barlow C (2020) Digital twin: Enabling technologies, challenges and open research. IEEE access 8: 108952‒108971. https://doi.org/10.1109/ACCESS.2020.2998358 doi: 10.1109/ACCESS.2020.2998358
    [33] Roozbahani H, Alizadeh M, Ahomäki A, Handroos H (2021) Coordinate-based control for a materials handling equipment utilizing real-time simulation. Automat Constr 122: 103483. https://doi.org/10.1016/j.autcon.2020.103483 doi: 10.1016/j.autcon.2020.103483
    [34] Park CG, Yoo S, Ahn H, Kim J, Shin D (2020) A coupled hydraulic and mechanical system simulation for hydraulic excavators. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering 234: 527‒549. https://doi.org/10.1177/0959651819861612 doi: 10.1177/0959651819861612
    [35] Zhang Y, Ding W, Deng H (2020) Reduced dynamic modeling for heavy-duty hydraulic manipulators with multi-closed-loop mechanisms. IEEE Access 8: 101708‒101720. https://doi.org/10.1109/ACCESS.2020.2998058 doi: 10.1109/ACCESS.2020.2998058
    [36] Zhu Z, Buck D, Guo X, Cao P, Wang J (2020) Cutting performance in the helical milling of stone-plastic composite with diamond tools. CIRP J Manuf Sci Tech 31: 119‒129. https://doi.org/10.1016/j.cirpj.2020.10.005 doi: 10.1016/j.cirpj.2020.10.005
    [37] Wang BJ, Lin CH, Lee WC, Hsiao CC (2023) Development of a Bamboo Toothbrush Handle Machine with a Human–Machine Interactive Interface for Optimizing Process Conditions. Sustainability 15: 11459. https://doi.org/10.3390/su151411459 doi: 10.3390/su151411459
    [38] Ding Y, Ma Z, Wen S, Xie J, Chang D, Si Z, et al. (2021) AP-CNN: Weakly supervised attention pyramid convolutional neural network for fine-grained visual classification. IEEE T Image Process 30: 2826‒2836. https://doi.org/10.1109/TIP.2021.3055617 doi: 10.1109/TIP.2021.3055617
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)