1. Introduction
Modulation classification (MC) is a fundamental task in wireless communication systems that involves identifying the modulation scheme used to encode information in a received signal. It plays a crucial role in various applications, including signal analysis, spectrum monitoring [1] and cognitive radio (CR) [2]. The primary goal of MC is to blindly recognize the modulation type of the received signal from characteristics such as amplitude, phase and frequency variations, enabling the receiver to properly demodulate the signal and extract the transmitted information. Several techniques and algorithms have been developed to perform MC [3]. These methods leverage different features extracted from the received signal and employ various classification algorithms, such as machine learning and statistical analysis. The key steps involved in MC are as follows [4]:
● Signal Preprocessing: The received signal is typically processed to remove noise, interference and other unwanted effects that may degrade classification accuracy. This stage involves filtering, signal normalization and sampling-rate adjustment.
● Feature Extraction: Relevant features are extracted from the preprocessed signal to capture its unique characteristics. Commonly used features include statistical moments, power spectral density, cyclic auto-correlation and higher-order cumulants. These features aim to capture the temporal and spectral properties of the received signal.
● MC: Once the feature set is obtained, the model generation process is carried out by either a likelihood-based or a feature-based method. Various machine learning (ML) / deep learning (DL) algorithms have been adopted to perform the classification task. These algorithms are trained on labeled data to learn the patterns and characteristics associated with each modulation class.
● Performance Evaluation: The accuracy of the MC system is assessed using performance metrics such as classification accuracy, confusion matrix and detection probability. These metrics provide insight into the system's ability to correctly classify different modulation classes. A minimal end-to-end sketch of these four steps is given after this list.
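The following sketch illustrates the four steps on synthetic data. The normalization step, the chosen statistics (including a cumulant-style moment) and the random forest classifier are illustrative assumptions, not the exact pipeline used in this paper.

```python
# Minimal sketch of the four MC steps on a synthetic BPSK-vs-QPSK task;
# the features and classifier are illustrative choices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

def preprocess(iq):
    # Step 1: normalize each complex baseband burst to unit average power.
    return iq / np.sqrt(np.mean(np.abs(iq) ** 2, axis=-1, keepdims=True))

def extract_features(iq):
    # Step 2: simple amplitude statistics plus |E[s^2]|, a cumulant-style
    # moment that separates BPSK (|E[s^2]| ~ 1) from QPSK (|E[s^2]| ~ 0).
    amp = np.abs(iq)
    c20 = np.abs((iq ** 2).mean(axis=-1))
    return np.stack([amp.mean(axis=-1), amp.std(axis=-1), c20], axis=-1)

# Synthetic labeled dataset: 200 bursts of 256 symbols, BPSK (0) or QPSK (1).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
symbols = np.array([rng.choice([1, -1], 256) if y == 0
                    else rng.choice([1, 1j, -1, -1j], 256) for y in labels])
noise = 0.1 * (rng.standard_normal((200, 256)) + 1j * rng.standard_normal((200, 256)))
X, y = extract_features(preprocess(symbols + noise)), labels

# Steps 3-4: train a classifier on labeled data and evaluate it.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:150], y[:150])
pred = clf.predict(X[150:])
print(accuracy_score(y[150:], pred))
print(confusion_matrix(y[150:], pred))
```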
In [5], the authors suggested that MC can be performed using ML methods such as the random forest (RF), support vector machine (SVM) and neural network (NN). Further, in DL, salient features such as power spectral density, instantaneous voltage, phase and frequency have been extracted from the received signals for MC [6]. These features suffer from channel impairments and fading environments, which makes recognizing the modulation type under different channel conditions a major challenge [7].
Traditional convolutional neural network (CNN) [8] models such as AlexNet, Inception [9] and GoogLeNet [10] improve classification accuracy at the expense of increased computational complexity. Similarly, channel state information (CSI) parameters, such as rank indicator (RI), delay spread (DS) and signal-to-interference-plus-noise ratio (SINR), have been classified with higher accuracy than the third generation partnership project (3GPP) recommended value by using a CNN model [11]. The authors in [9] suggest that combining the Inception and ResNet models yields the fastest convergence rate among the traditional CNN models. Similarly, a hybrid residual and long short-term memory (LSTM)-based model has been developed to improve the classification accuracy in low signal-to-noise ratio (SNR) scenarios [12]. Automatic modulation recognition (AMR)-based NN [13] and CNN [14] models have been implemented on field programmable gate arrays (FPGAs) for dynamic spectrum access (DSA) and CR systems with a latency of 8 μs. Moreover, an LSTM-based channel estimator has been demonstrated to detect received orthogonal frequency division multiplexing (OFDM) signals [15]. In [16], the authors proposed an adversarial transfer learning architecture (ATLA), which performs domain-level asymmetric mapping with an accuracy 17.3% higher than parameter-based transfer methods. ATLA also enables knowledge transfer between parameters with reduced training data and helps the CNN model improve its accuracy. Although ATLA provides optimal performance, it is sensitive to parameter variations and becomes impractical for real-time prediction scenarios.
To recognize radio signals, various datasets are available, as listed in Table 1. It has been noticed that the graphics processing unit (GPU) has been used to create the models, which reduces the training overhead compared with the central processing unit (CPU). Moreover, the Hisarmod2019.1 and RadioML2018.10.a datasets include channel impairments such as multipath fading and frequency offset. These simulation-based datasets may not be sufficient for real-world problems. Moreover, the literature still lacks studies of MC models under small variations in decimation rate, number of signals and number of layers.
In resource-constrained applications, the above-mentioned parameters play an important role in decoding the physical resource block (PRB), primary synchronization signal (PSS) and secondary synchronization signal (SSS) in the physical channels. In this work, we utilize the radio frequency signal classification (RFSC) dataset [21] to analyze the performance of a multilayer CNN model. We leverage the benefits of CNN models to tackle the issues of MC that severely hinder practical resource-constrained scenarios. The main contributions of this paper are summarized as follows:
● We propose a multilayer CNN model for MC using the RFSC dataset.
● We investigate the performance analysis of the multilayer CNN model for different decimation rates (D), input signals (K), number of input frames (F) and number of CNN layers (l).
● We evaluate the proposed model performance in terms of confusion matrix, training/testing loss and accuracy.
● We demonstrate the recognition accuracy comparison among different DL models and cumulant-based models for RFSC dataset.
● We also analyze the performance of generated models using RadioML, Hisarmod and RFSC datasets.
The paper is organized as follows: Section 2 presents the proposed CNN architecture with its hyperparameters in detail. Section 3 discusses the simulation results, and Section 4 draws the conclusions.
2. Proposed model
The proposed CNN architecture incorporates feature extraction capabilities to learn discriminating representations directly from the raw data. The CNN model receives its input in a time-domain representation, which is translated into a 2D matrix containing the amplitude and phase information. The architecture is made up of multiple convolutional layers of increasing complexity that allow the extraction of both low-level and high-level features from the input signal. After each convolutional layer, a pooling layer down-samples the feature map while retaining the most important information.
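The layer stack described above can be outlined as follows. This is a minimal Keras sketch, assuming a 28×28×2 input and seven output classes as used later in Section 3; the filter counts, kernel sizes and optimizer are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of the described conv/pool stack; filter counts, kernel
# sizes and the optimizer are illustrative assumptions.
from tensorflow.keras import layers, models

def build_cnn(input_shape=(28, 28, 2), num_classes=7, num_conv_layers=3):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    filters = 16
    for _ in range(num_conv_layers):
        # Each stage: convolution extracts features, max-pooling down-samples.
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=2))
        filters *= 2  # deeper layers extract higher-level features
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_cnn()
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```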
2.1. Convolutional layer
The received signal $r(t)$ is modeled as

$$ r(t) = s(t) \ast h(t, \tau) + n(t). $$

Here, $s(t)$ denotes the transmitted signal, $n(t)$ is the additive noise and $h(t,\tau)$ is the channel response, described as a complex channel finite impulse response (FIR) filter. The received signal $r(t)$ is down-converted and sampled to derive the baseband digital samples $r_b[n] = r[nT_s] = r_{b,I}[n] + j\, r_{b,Q}[n]$. The $k$-th dataset vector of the $r(t)$ samples is denoted as

$$ \mathbf{r}^k = \big[\, r_b[0],\; r_b[1],\; \ldots,\; r_b[N-1] \,\big]. $$

The $k$-th feature vector of the $m$-th class, $x_m^k$, derived from a dataset of $N$ samples and decimated by a factor $D$, is given by

$$ x_m^k = \big[\, r_b[0],\; r_b[D],\; r_b[2D],\; \ldots,\; r_b[N-D] \,\big]. $$

Here, $m = 1, 2, \ldots, M$.
The $k$-th feature vector is further translated into a matrix to train the proposed CNN model, $x_m^k \in \mathbb{R}^{f_H \times f_W \times C_{in}}$, with size $(f_H \times f_W \times C_{in})$, where $f_H$ is the height, $f_W$ is the width of the image and $C_{in}$ is the number of channels. This matrix is fed as input to the CNN model, which performs feature extraction through the activation function. The features are extracted by consecutive convolutional layers containing multiple kernels that perform the convolution operation. Kernels act as feature detectors, essentially FIR filters, which convolve the input image samples and produce a transformed version as output. The proposed CNN architecture with $l$ convolutional layers and two dense layers is shown in Figure 1.
The mathematical representation of the convolutional layer is as follows. For the $l$-th convolutional layer with the $n$-th filter, the convolution operation is performed as

$$ z^{[l]}_{x,y,n} = \mathrm{conv}\big(a^{[l-1]}, K^{(n)}\big)_{x,y} = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} \sum_{c=1}^{C_{in}} K^{[l]}_{i,j,c,n}\, a^{[l-1]}_{x+i,\, y+j,\, c} + b^{[l]}_n, $$

where $z^{[l]}_{x,y,n}$ is the output value at position $(x,y)$ in the $n$-th feature map (output channel) of the $l$-th convolutional layer; $f_H$ and $f_W$ are the height and width of the filter $K$, respectively; $C_{in}$ is the number of input channels of the previous layer $a^{[l-1]}$, where for the first layer $a^{[0]} = x_m^k,\ \forall m,k$; $K^{[l]}_{i,j,c,n}$ is the weight (filter) value at position $(i,j)$ in the $c$-th input channel and $n$-th output channel of the $l$-th convolutional layer; $a^{[l-1]}_{x+i,\, y+j,\, c}$ is the input value at position $(x+i, y+j)$ in the $c$-th feature map of the $(l-1)$-th layer; and $b^{[l]}_n$ is the bias term for the $n$-th output channel.
Here, $s$ is the stride parameter, which defines the step size of the convolution operation; $p$ is the amount of zero padding; $K^{(n)}$ is a kernel of size $(f^{[l]}, f^{[l]}, C^{[l-1]}_{in})$, where $f^{[l]}$ denotes the size of the filter in the $l$-th layer. The activation applied at the $l$-th layer can be defined as

$$ a^{[l]}_{x,y,n} = \psi^{[l]}\big(z^{[l]}_{x,y,n}\big), $$

where $a^{[l]}_{x,y,n}$ is the output value at position $(x,y)$ in the $n$-th feature map of the $l$-th activation layer, and the activation function $\psi^{[l]}$ is applied element-wise to the output of the convolutional layer.
The output of the $l$-th convolutional layer can thus be obtained as

$$ a^{[l]} = \psi^{[l]}\big(\mathrm{conv}\big(a^{[l-1]}, K\big) + b^{[l]}\big), $$

with dimension $\big[a^{[l]}\big] = \big[f^{[l]}_H,\, f^{[l]}_W,\, C^{[l]}_{in}\big]$, where $f^{[l]}_H = \big\lfloor (f^{[l-1]}_H + 2p - f^{[l]})/s \big\rfloor + 1$ and $f^{[l]}_W$ follows analogously. The number of learnable parameters at the $l$-th layer is given by

$$ P^{[l]} = \big(f^{[l]} \times f^{[l]} \times C^{[l-1]}_{in} + 1\big) \times C^{[l]}_{in}. $$
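To make the convolution, activation and parameter-count formulas concrete, the following numpy sketch implements them directly for a single layer; the input size, filter size and filter count are illustrative assumptions, and ReLU is assumed for $\psi$.

```python
# Minimal numpy sketch of one convolutional layer (stride s=1, padding p=0);
# shapes and the ReLU activation are illustrative assumptions.
import numpy as np

def conv_layer(a_prev, K, b):
    # a_prev: (H, W, C_in); K: (fH, fW, C_in, C_out); b: (C_out,)
    H, W, C_in = a_prev.shape
    fH, fW, _, C_out = K.shape
    z = np.zeros((H - fH + 1, W - fW + 1, C_out))
    for x in range(z.shape[0]):
        for y in range(z.shape[1]):
            for n in range(C_out):
                # z[x,y,n] = sum_{i,j,c} K[i,j,c,n] * a[x+i, y+j, c] + b[n]
                z[x, y, n] = np.sum(K[:, :, :, n] * a_prev[x:x+fH, y:y+fW, :]) + b[n]
    return np.maximum(z, 0)  # element-wise activation psi (ReLU assumed)

rng = np.random.default_rng(0)
a0 = rng.standard_normal((28, 28, 2))    # input x_m^k of size 28x28x2
K = rng.standard_normal((3, 3, 2, 16))   # 16 filters of size 3x3
b = rng.standard_normal(16)
print(conv_layer(a0, K, b).shape)        # (26, 26, 16): (28 + 0 - 3)/1 + 1 = 26
print((3 * 3 * 2 + 1) * 16)              # learnable parameters: 304
```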
2.2. Pooling layer
At each convolutional stage, the CNN uses a pooling layer for feature-map dimensionality reduction, which helps avoid overfitting and improves the model accuracy. The down-sampling is performed using a 2×2 kernel with a stride of two; here, a max-pooling operation takes the maximum of the four values covered by the window at each stride. The output of the $l$-th pooling layer is given by

$$ p^{[l]}_{x,y,z} = \max_{\substack{0 \le i < p_h \\ 0 \le j < p_w}} \; a^{[l-1]}_{x \cdot s_x + i,\; y \cdot s_y + j,\; z}. $$

Here, $p^{[l]}_{x,y,z}$ is the output value at position $(x,y)$ in the $z$-th feature map of the $l$-th pooling layer; $p_h$ and $p_w$ are the height and width of the pooling window, which is applied with these dimensions to the input feature map; $s_x$ and $s_y$ are the stride values in the horizontal and vertical directions with which the pooling window moves over the input feature map; and $a^{[l-1]}_{x \cdot s_x + i,\; y \cdot s_y + j,\; z}$ is the input value at position $(x \cdot s_x + i,\, y \cdot s_y + j)$ in the $z$-th feature map of the $(l-1)$-th layer. The max-pooling operation selects the maximum value within the local region determined by the pooling window size and stride over the previous layer $a^{[l-1]}$ and assigns it to the corresponding position in the pooling layer.
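A short numpy sketch of this 2×2, stride-two max-pooling, on an illustrative 4×4 feature map:

```python
# Minimal numpy sketch of 2x2 max-pooling with stride 2, matching the
# formula above; the input feature map is illustrative.
import numpy as np

def max_pool(a_prev, ph=2, pw=2, sx=2, sy=2):
    H, W, C = a_prev.shape
    out = np.zeros((H // sy, W // sx, C))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # p[x,y,z] = max over the ph x pw window of the previous layer
            out[x, y, :] = a_prev[x*sy:x*sy+ph, y*sx:y*sx+pw, :].max(axis=(0, 1))
    return out

a = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool(a)[:, :, 0])  # [[ 5.  7.] [13. 15.]]
```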
2.3. Flatten and dense layers
The features extracted by the convolutional layers are passed to the flatten layer to create a 1D vector: the flatten layer converts the 2D feature-map tensor into a 1D vector of size $\big[f^{[l-1]}_H \times f^{[l-1]}_W \times C^{[l-1]}_{in},\, 1\big]$. This 1D vector is fed into the fully connected part of the network, which contains the dense layers for the MC task. The output of the $n$-th neuron of the fully connected ($l$-th) layer is

$$ o^{[l]}_n = \sum_{i=1}^{H_{l-1}} \sum_{j=1}^{W_{l-1}} \sum_{c=1}^{C_{l-1}} W^{[l]}_{i,j,c,n}\, a^{[l-1]}_{i,j,c} + b^{[l]}_n, $$

where $o^{[l]}_n$ is the output value of the $n$-th neuron in the output vector of the fully connected layer; $H_{l-1}$ and $W_{l-1}$ are the height and width of the feature map in the $(l-1)$-th layer $a^{[l-1]}$, respectively; $C_{l-1}$ is the number of input channels in the $(l-1)$-th layer; $W^{[l]}_{i,j,c,n}$ is the weight connecting position $(i,j,c)$ in the $(l-1)$-th layer to the $n$-th neuron in the fully connected layer; the input $a^{[l-1]}$ is the output of the flatten layer or of the previous dense layer; and $b^{[l]}_n$ is the bias term. The $n$-th node of the $l$-th layer of a fully connected network is then given by

$$ a^{[l]}_n = \phi^{[l]}\big(o^{[l]}_n\big), $$
where $\phi^{[l]}$ is an activation function. The last dense layer computes the maximum-likelihood probability over the $M$ modulation classes using a softmax classifier.
In the output softmax layer, $m$ represents the class index. In a classification problem, the softmax layer computes the probability of the input belonging to each class; if there are $M$ classes in the classification task, $m$ ranges from zero to $M-1$, representing the $M$ class labels. The softmax function takes the raw scores $o^{[l]}_n$ from the previous fully connected layer and converts them into a probability distribution over the classes. The probability of the input belonging to class $m$ is denoted by $p^{[l]}_m$, and the softmax function is defined as

$$ p^{[l]}_m = \frac{e^{o^{[l]}_m}}{\sum_{j=0}^{M-1} e^{o^{[l]}_j}}, $$

where $o^{[l]}_m$ is the raw score for class $m$ in the output vector and $M$ is the total number of classes. Each $p^{[l]}_m$ represents the probability of the input being classified into class $m$, and the $p^{[l]}_m$ values sum to one, ensuring that they form a valid probability distribution. The class with the highest probability, $\max_m p^{[l]}_m$, is taken as the predicted class $\hat{y}_i$ for the applied input during inference.
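A brief numpy sketch of this classification step, with illustrative raw scores for $M = 7$ classes:

```python
# Minimal numpy sketch of the softmax step; the raw scores are illustrative.
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())       # subtract the max for numerical stability
    return e / e.sum()

o = np.array([2.1, 0.3, -1.2, 0.8, 4.0, -0.5, 1.1])  # raw scores o[l]_m
p = softmax(o)
print(p.sum())            # 1.0: a valid probability distribution
print(int(np.argmax(p)))  # predicted class (here, index 4)
```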
The likelihood that the input signal $x$ belongs to a particular modulation type is denoted as $p(y = k \mid x; \Theta)$, where $k$ is a one-dimensional tensor, $k \in \mathbb{R}^M$, representing the different modulation types of the classification task. The CNN parameters $\Theta$ are determined so as to minimize the training loss over the dataset $\{(x_i, y_i)\}_{i \in S}$ with training dataset size $S$,

$$ \Theta^{*} = \arg\min_{\Theta} \; \frac{1}{S} \sum_{i \in S} l_p\big(\hat{y}_i, y_i\big), $$

where $l_p(\cdot)$ denotes the categorical cross-entropy. It is defined as a negative log-likelihood function and represented by

$$ l_p\big(\hat{y}, y\big) = -\sum_{i=1}^{M} y_i \log\big(\hat{y}_i\big), $$

where $i = 1, 2, \ldots, M$ indexes the classes. The loss function $l_p$ is evaluated at the last dense (softmax) layer and measures the error between the predicted labels $\hat{y}_i$ and the actual modulation labels $y_i$.
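As a concrete check of the loss formula, the following sketch evaluates the categorical cross-entropy between a one-hot label and an illustrative softmax output:

```python
# Minimal numpy sketch of categorical cross-entropy; values are illustrative.
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # l_p(y_hat, y) = -sum_i y_i * log(y_hat_i)
    return -np.sum(y_true * np.log(y_pred + eps))

y = np.array([0, 0, 0, 0, 1, 0, 0])                   # true class: index 4
y_hat = np.array([0.05, 0.02, 0.01, 0.04, 0.80, 0.03, 0.05])
print(categorical_cross_entropy(y, y_hat))            # -log(0.8) ~ 0.223
```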
The classification testing accuracy ($A_{test}$) for the given testing dataset can be calculated as

$$ A_{test} = \frac{\text{number of correctly classified test samples}}{\text{total number of test samples}} \times 100\%. $$
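Equivalently, with integer class labels, the accuracy is the mean of the correct-prediction indicator; a short numpy sketch with illustrative labels:

```python
# Minimal sketch of the testing-accuracy computation; labels are illustrative.
import numpy as np

y_true = np.array([0, 1, 2, 2, 4, 5, 6, 3])
y_pred = np.array([0, 1, 2, 1, 4, 5, 6, 3])
print(100.0 * np.mean(y_pred == y_true))  # A_test = 87.5
```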
The layer type, dimension and learning parameters of each layer for the five-convolutional-layer CNN model are shown in Table 2. The learning parameters are calculated for each convolutional and dense layer. The total number of learnable parameters for the five-layer model (5-CNN) is 272,391, whereas for the three-layer (3-CNN) model it is 321,415; the shallower model has more parameters largely because fewer pooling stages leave a larger flattened feature vector at the input of the dense layers.
3. Results
The simulations have been conducted on an Intel® Core™ i5 processor. Ubuntu OS and Python 3.5 have been used to carry out the simulations with the parameters listed in Table 3.
In this work, we choose samples of different modulation types, namely binary phase-shift keying (BPSK), quadrature phase-shift keying (QPSK), continuous-phase frequency-shift keying (CPFSK), 16 quadrature amplitude modulation (16QAM), 64 quadrature amplitude modulation (64QAM), Gaussian minimum shift keying (GMSK) and Gaussian frequency shift keying (GFSK), at a high SNR (10 dB) from the RFSC dataset [21] for the generation of the multilayer CNN model. The RFSC dataset, created in a laboratory environment, includes real channel conditions as well as controlled channel parameters. The generated CNN models are verified using the testing dataset collected at the assumed receiver position.
3.1. Model training performance analysis
The dataset split-up framework for CNN model generation, validation and classification is depicted in Figure 2. The partition set {P input data samples} represents the number of samples used for the training and testing processes, and {Q training data samples} denotes the number of samples used for the model generation and validation processes, respectively. For the adopted RFSC dataset, we have 1,500 and 500 samples (75:25) per modulation type for the training and testing processes, respectively. Further, the dataset for the training process (10,500 samples) has been divided into 7,350 and 3,150 samples (70:30) for model generation and validation, respectively. Note that the models are generated with 70 training epochs using samples collected at locations very close to the transmitter. Here, the training samples are collected at the chosen receiver position with the transmitter present at location one (TX1). The proposed CNN model is trained on the RFSC dataset, which comprises labeled signals with known modulation classes. The back-propagation and stochastic gradient descent algorithms are used to optimize the model parameters and minimize the classification loss.
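A minimal sketch of this two-stage split using scikit-learn, with a zero-valued stand-in for the RFSC samples (seven classes × 2,000 samples of 28×28×2 frames; the array shapes are illustrative):

```python
# Minimal sketch of the 75:25 train/test and 70:30 generation/validation
# splits; the zero arrays stand in for the RFSC dataset.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((14000, 28, 28, 2), dtype=np.float32)  # 7 classes x 2,000 samples
y = np.repeat(np.arange(7), 2000)

# 75:25 split -> 1,500 training and 500 testing samples per class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 70:30 split of the 10,500 training samples -> 7,350 for model generation
# and 3,150 for validation.
X_gen, X_val, y_gen, y_val = train_test_split(
    X_train, y_train, test_size=0.30, stratify=y_train, random_state=0)

print(len(X_gen), len(X_val), len(X_test))  # 7350 3150 3500
```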
3.1.1. Training loss and accuracy
Figure 3(a) shows that the models train with minimal training loss. It can be noticed that the CNN models with l > 2 provide similar performance throughout the epochs. Figure 3(b) depicts the training accuracy for all four models. It has been observed that the overshoot is reduced after 60 epochs, where all the models converge to similar performance.
3.1.2. Visualization of feature maps, weights and bias value of the layer
In the training process, the CNN model extracts features using the filters in each convolutional layer, which capture and activate similar patterns. The captured pattern is visualized as a feature map corresponding to each layer of the CNN model. When an input image is passed into the CNN model, each layer generates feature maps that are passed on to the next layer. The initial convolutional layer extracts the horizontal/diagonal edges of an image, and the subsequent layers detect its corners. Moving deeper into the network, we can identify even more complex features. The features extracted by the filters of the third convolutional layer are depicted in Figure 4.
The neurons in the convolutional and dense layers compute a weighted sum of their inputs. For each neuron, the weights are real values associated with each feature and indicate the significance of that feature in predicting the modulation type. The weights assigned to each feature therefore play an important role in selecting the output class: weights close to zero tend to be less important in the classification process than weights with higher magnitudes. The weight values for different indices are shown in Figure 5(a). The bias value of a neuron shifts the activation function to the left or right, as shown in Figure 5(b).
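A short Keras sketch of how such feature maps and layer weights can be pulled out for visualization; the two-layer probe model and the random input are illustrative, not the paper's trained network.

```python
# Minimal sketch of extracting feature maps and weights/biases from a conv
# layer; the model, layer names and random input are illustrative.
import numpy as np
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 2)),
    layers.Conv2D(16, 3, activation="relu", name="conv1"),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation="relu", name="conv2"),
])

# Probe model that outputs the feature maps of a chosen layer.
probe = models.Model(model.input, model.get_layer("conv2").output)
maps = probe.predict(np.random.rand(1, 28, 28, 2))
print(maps.shape)                 # (1, 11, 11, 32) feature maps

# Weights and biases of the same layer, e.g., for the plots in Figure 5.
w, b = model.get_layer("conv2").get_weights()
print(w.shape, b.shape)           # (3, 3, 16, 32) weights, (32,) biases
```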
3.2. Model testing analysis
In this section, we analyze the multilayer CNN model performance for different input parameters such as decimation factor, image frame size, sample size and the number of convolutional layers.
3.2.1. Different decimation factor
Table 4 shows the testing accuracy of the CNN model for different numbers of input frames (F = 1, 2, 3, 4) and various decimation rates (D = 6, 12, 24, 48). It has been observed that the model generated with four input frames and D = 12 yields the highest testing accuracy among the tested configurations. It has also been noticed that the accuracy decreases as the number of modulation classes increases.
3.2.2. Different input sample size
In the 1D case, the raw samples are converted into images with a single frame and a single channel. In the 2D case, the raw samples are converted into images with four frames, where each frame consists of two channels. For both the 1D and 2D cases, the generated CNN model has been tested with K = 500 and 1,000 signals. Table 5 compares the model testing accuracy and loss in both cases. It has been observed that the accuracy increases with the sample size for both the 1D and 2D formats.
For K = 500 signals, the model classification accuracies ($A_{test}$) for images of size ($f_H \times f_W \times C_{in}$) 14×14×2, 28×28×2 and 56×56×2 are 88.2%, 96.4% and 99%, respectively; the corresponding model testing losses ($l_p$) are 0.655, 0.22 and 0.084. Similarly, for K = 1,000 signals, the model classification accuracies for 14×14×2, 28×28×2 and 56×56×2 are 87.8%, 96.7% and 99.3%, respectively, with corresponding testing losses of 0.66, 0.2 and 0.0403. It has been noticed that the image size 28×28×2 gives an optimum classification accuracy with moderate computational complexity compared with the other two image sizes. As K increases from 100 to 500 signals, the testing accuracy increases correspondingly, reaching 96.3% as shown in Figure 6. A gradual increase in the number of signals leads to increased testing accuracy, so we have chosen K = 500 signals for this work to save computation time.
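For reference, the framing itself is a simple reshape of the I/Q samples; a minimal numpy sketch for the 28×28×2 case, assuming 784 complex baseband samples per signal (the geometry is an illustrative assumption):

```python
# Minimal sketch of converting complex baseband samples into 28x28x2 image
# frames (I and Q as the two channels); shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
K = 500                                   # number of signals
iq = rng.standard_normal((K, 784)) + 1j * rng.standard_normal((K, 784))

frames = np.stack([iq.real, iq.imag], axis=-1)  # (K, 784, 2): I/Q channels
frames = frames.reshape(K, 28, 28, 2)           # (K, 28, 28, 2) CNN input
print(frames.shape)
```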
3.2.3. Different numbers of CNN layers
Table 6 compares the testing loss and testing accuracy of the CNN model for varying numbers of convolutional layers (l) in the feature extraction phase. It can be observed that increasing the number of convolutional layers yields a reduced testing loss ($l_p$) and improved accuracy. Moreover, the testing accuracy saturates after three convolutional layers.
To minimize the computational complexity, we carry out the work with the 3-CNN architecture. The classification accuracy increases with the number of convolutional layers, as shown in Figure 6. At the same time, increasing the number of layers and signals too far leads to overfitting, and noise may then be misinterpreted as signal and vice versa. The designer must therefore properly tune the number of convolutional layers to the number of input signals (K). Hence, we have chosen the 3-CNN model, which provides better testing accuracy and reduced computation time compared with the other models.
3.3. Confusion matrix
The confusion matrix for the three-layer CNN at the selected receiver position in the presence of TX1 is shown in Figure 7. The probability values in each box represent the classification accuracy for each modulation type. It has been observed that the model performs well at the chosen receiver position, and the misclassification rate among the different modulation classes is low. Owing to the nature of MC systems, the multilayer CNN model performs best at higher SNR values. The proposed model shows good classification performance for all seven adopted modulation classes.
3.4. Analysis of RFSC dataset classification results
From the RFSC dataset [21], we selected seven modulation types to test the proposed CNN model at different SNR values. Different DL models have also been generated and compared across SNR values, as shown in Figure 8(a). The proposed CNN model outperforms the other models at SNR levels below 2 dB. The cumulant-based model reaches a maximum accuracy of 88.5%, whereas the proposed CNN model achieves a maximum accuracy of 95.7%. The other models provide up to 93% accuracy, at the cost of high computational complexity.
Table 7 shows that the proposed CNN model provides the best performance with the fewest convolutional layers and learnable parameters. Furthermore, the proposed 3-CNN model provides a peak training accuracy of 98% with the least time per epoch among the models adopted for comparison, and it has been generated with the lowest number of trainable parameters in a short duration of 15 minutes. The proposed CNN model achieves the lowest complexity and CPU inference time at the expense of the memory allocated for feature extraction.
3.5. Comparison of model performance among the RadioML, Hisarmod and RFSC datasets
In this section, we compare the proposed CNN models generated using the RadioML, Hisarmod and RFSC datasets. The RadioML and Hisarmod datasets consist of different modulation classes created using GNU Radio [22] and include a variety of channel impairments such as fading, frequency offset and sample rate offset [17]. We have analyzed the model performance for SNR values ranging from -5 dB to 10 dB. From Figure 8(b), we can observe that the proposed CNN achieves better recognition accuracy in the low-SNR scenario, while the RFSC dataset produces the best classification accuracy in the medium-SNR scenario. The proposed multilayer CNN model has been tested only for a limited SNR range (-5 dB to 10 dB), and the modulation recognition has been restricted to single-carrier modulation types. Moreover, it has been noticed that the CNN models generated using the RadioML and Hisarmod datasets provide less than 80% recognition accuracy for SNR values between 4 dB and 10 dB.
In ATLA [16], the authors freeze the weights of the convolutional layers, which reduces the feature extraction capability. In our proposed 3-CNN model, the convolutional layers are not frozen; instead, the weights and biases of the convolutional kernels/filters are adjusted (properly selected/tuned) to achieve the highest classification accuracy. This improves the classification accuracy compared with existing DL-based models. Additionally, only two convolutional layers are adopted in ATLA, which extract fewer features than the 3-CNN model. Moreover, the authors vary the dataset size only with the sampling frequency, whereas in our work we vary not only the sampling rate but also the number of signals and the decimation rate. This greatly reduces the issues associated with decoding the PRB/master information block (MIB), particularly in resource-constrained Internet of Things (IoT) applications.
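The difference between the two training strategies can be sketched in Keras as follows; the small two-conv-layer model is illustrative, not ATLA's or our exact architecture.

```python
# Minimal sketch contrasting ATLA-style frozen convolutional layers with the
# fully trainable layers used here; the model itself is illustrative.
import numpy as np
from tensorflow.keras import layers, models

def small_cnn():
    return models.Sequential([
        layers.Input(shape=(28, 28, 2)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(7, activation="softmax"),
    ])

frozen = small_cnn()
for layer in frozen.layers:
    if isinstance(layer, layers.Conv2D):
        layer.trainable = False  # ATLA-style: convolution weights stay fixed

tuned = small_cnn()              # proposed approach: all layers trainable

count = lambda ws: sum(int(np.prod(w.shape)) for w in ws)
print(count(frozen.trainable_weights), count(tuned.trainable_weights))
```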
4. Conclusions and future enhancements
MC is used extensively in military, civilian and future-generation wireless systems to detect and classify modulation types, thereby helping to preserve the radio spectrum. A detailed mathematical framework for the proposed multilayer CNN model has been presented. We adopted the RFSC dataset for the generation of the proposed multilayer CNN model, and we discussed the improvements in classification accuracy across decimation rates, frame sizes and input sample sizes for models with different numbers of convolutional layers. The 3-CNN model achieved nearly 98% classification accuracy. We conclude that, when choosing a CNN model, there is a trade-off between model parameter size and computational complexity, which is especially relevant in resource-limited scenarios. We presented a recognition performance comparison between the DL models and the cumulant-based model, and we showed that the model generated from the RFSC dataset achieves higher accuracy than the models generated from the RadioML and Hisarmod datasets. In the near future, we will investigate MC for various modulation orders by identifying each modulation class. Extensive study of varying the number and size of convolutional filters remains an interesting topic for resource-constrained applications.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Conflict of interest
The authors declare that they have no conflict of interest.