
Citation: Frances A. Maratos, Matthew Garner, Alexandra M. Hogan, Anke Karl. When is a Face a Face? Schematic Faces, Emotion, Attention and the N170[J]. AIMS Neuroscience, 2015, 2(3): 172-182. doi: 10.3934/Neuroscience.2015.3.172
With the advent of the era of big data, deep learning has become a research hotspot in the field of artificial intelligence. It has shown great advantages in image recognition, speech recognition, natural language processing, and other fields. Sequence labeling is one of the most common problems in natural language processing. Shao et al. [1] assign semantic labels to input sequences, exploiting encoding patterns in the form of latent variables in conditional random fields to capture latent structure in the observed data. Lin et al. [2] proposed an attentional segmentation recurrent neural network (ASRNN), which relies on a hierarchical attentional neural semi-Markov conditional random field (semi-CRF) model for sequence labeling tasks.
Convolutional neural networks (CNNs) have been widely used in computer vision recognition tasks. Djenouri et al. [3] proposed a technique for particle clustering for object detection (CPOD), built on top of region-based methods, using outlier detection, clustering, particle swarm optimization (PSO), and deep convolutional networks to identify smart object data. Shao et al. [4] proposed an end-to-end multi-objective neuroevolution algorithm based on decomposition and dominance (MONEADD) for combinatorial optimization problems to improve the performance of the model in inference. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was held annually from 2010 to 2017, and the image classification accuracy of the champions increased from 71.8% to 97.3%. The emergence of AlexNet in 2012 was a milestone in the field of deep learning. After that, accuracy on the ImageNet dataset was significantly improved by novel CNNs such as VGG [5], GoogLeNet [6], ResNet [7,8], DenseNet [9], and SE-Net [10], and by automatic neural architecture search [11,12,13].
However, real-world applications such as autonomous driving systems, intelligent robot systems, and mobile device applications must balance high accuracy against platform resources and system efficiency. Moreover, most of the best-performing CNNs need to run on a high-performance graphics processing unit (GPU). Real-world tasks have therefore driven the development of more lightweight CNNs that can run on low-performance devices [14,15], such as Xception [16], MobileNet [17], MobileNet V2 [18,19], ShuffleNet [20], ShuffleNet V2 [21], and CondenseNet [22]. Group convolution and depth-wise separable convolution [23] are crucial in these works.
DenseNet, the best paper at the CVPR 2017 conference, beat the best-performing ResNet on ImageNet without group convolution or depth-wise separable convolution. Subsequently, SE-Net achieved the best result in the history of ImageNet at ILSVRC 2017, but SE-Net still has too many parameters. Following these works, Huang et al. [9] proposed learned group convolutions to improve the connection and convolution methods of DenseNet. Inspired by these works, we study using the Squeeze-and-Excitation block (SE block) to improve lightweight CNNs. Furthermore, we explore how to design the structure of the convolutional layer to enhance the network's performance.
We propose a more efficient network, CED-Net, which combines the bottleneck layer with learned group convolution and the SE block. Learned group convolution prunes network channels during the training phase, and the SE block recalibrates the feature channels to enhance those that benefit the network. Through experiments, we demonstrate that CED-Net is superior to other lightweight networks in terms of accuracy, number of parameters, and FLOPs.
In the past few years, designing CNNs with a depth that balances accuracy and performance has been a very active field. Recent work has made considerable progress in exploring algorithmic optimizations, including pruning redundant connections [24,25,26,27], using low-precision or quantized weights [28,29], and designing efficient network architectures.
Early research showed that pruning and quantization are effective because deep neural networks often have a substantial number of redundant weights that can be pruned or quantized without sacrificing (and sometimes even improving) accuracy. For CNNs, different pruning techniques lead to different levels of granularity [30]. Fine-grained pruning, e.g., independent weight pruning [31], generally achieves a high degree of sparsity. Coarse-grained pruning methods such as filter-level pruning achieve a lower degree of sparsity, but the resulting networks are much more regular, which facilitates efficient implementations.
Recently, researchers have explored efficient network structures that can be applied on mobile devices, such as MobileNet V2, ShuffleNet V2, and NASNet. In these networks, depth-wise separable convolutions play a vital role: they remove a large number of network parameters without significantly reducing accuracy. However, according to Howard et al. [17,18], a large number of depth-wise separable convolutions decreases the computational speed of the network. Therefore, CED-Net uses the more efficient group convolution and a densely connected architecture to reduce the number of parameters of the network. Furthermore, because many deep-learning libraries implement group convolutions efficiently, they save a lot of computational time in theory and practice.
In addition, the bottleneck layer proposed in ResNet can effectively reduce the number of parameters in multilayer networks. Our experiments show that, when the network is deeper, CED-Net achieves higher accuracy with fewer parameters than a CondenseNet of the same structure.
DenseNet, proposed by Huang et al. [9] and awarded best paper at CVPR 2017, is a densely connected network that outperforms the previous champion, ResNet, on ImageNet. After that, CondenseNet achieved the same accuracy with only half the number of parameters of DenseNet. In CondenseNet, learned group convolution plays a key role: the network is first trained with sparsity-inducing regularization for a fixed number of iterations, and unimportant filters with low-magnitude weights are subsequently pruned away.
Moreover, the Squeeze-and-Excitation structure, which stood out at ILSVRC 2017, has been tested on most well-known networks. Squeeze and Excitation are its two critical operations. The structure explicitly models the interdependencies between feature channels and acts as a new "channel recalibration" strategy. Specifically, by automatically learning the importance of each feature channel, SE-Net enhances useful channels and suppresses useless ones. Most current mainstream networks are built by stacking basic blocks, so the SE module can be embedded in almost all network structures. CED-Net therefore achieves more efficient performance by embedding the SE module.
In this section, we first introduce the structure and function of the bottleneck layer. Next, we explore how SE Block as a channel enhancement block can improve the performance of CED-Net. Finally, we describe the network details of CED-Net for CIFAR dataset.
As shown in Figure 1, H, W, and $ {C}_{in} $ are the height, width, and number of channels of the input image, respectively, and g is the growth coefficient of the channels. CED-Net consists of multiple dense blocks for feature extraction. The dense block, shown in Figure 2(c), consists of two 1 × 1 LG-Conv (Learned Group Convolution) layers and one 3 × 3 G-Conv (Group Convolution) layer. Each 1 × 1 LG-Conv layer uses a permute operation that shuffles channels to reduce the loss of accuracy. BN-ReLU nonlinearly activates the input and output of the dense block, and an AvgPool layer is used for downsampling.
The bottleneck layer was proposed in ResNet, and its detailed structure is shown in Figure 2(a). The three-layer bottleneck structure consists of 1 × 1, 3 × 3, and 1 × 1 convolutional layers, where the two 1 × 1 convolutions are used to reduce and then increase (restore) the dimensions. The 3 × 3 convolutional layer can be seen as a bottleneck with a smaller input/output dimension. We replace the 1 × 1 standard convolution with the learned group convolution and the 3 × 3 standard convolution with the group convolution. Unlike ResNet, CED-Net replaces element-wise addition with channel concatenation. Concatenation lets the network use the semantic information of feature maps at different scales and achieves better performance by increasing the number of channels. Element-wise addition takes up less memory during network transmission, but it may introduce extra noise that causes some feature map information to be lost.
Figure 2(b) shows the structure used in CondenseNet. The Permute layer, which shuffles channels, is designed to reduce the adverse effects introduced by the 1 × 1 LG-Conv. However, a deep network built with this bottleneck layer still has many parameters. Figure 2(c) shows part of the structure used by CED-Net, which has fewer parameters than the structure in Figure 2(b). Specifically, the condense factor and bottleneck factor in CED-Net are set to 4, a reduction by half compared with CondenseNet. This offsets the parameters added by the extra 1 × 1 LG-Conv layer.
One dense layer used in CED-Net has quadratic time complexity, $ \Theta (25{G}^{2}/4+4CG) $, in the number of input channels C and the number of output channels G. An ordinary 3 × 3 convolution costs $ \Theta (9CG) $. Because C becomes much greater than G as the network deepens, the 4CG term dominates, and CED-Net reduces the time complexity by roughly half.
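As a quick numerical check of the comparison above, the sketch below evaluates the two cost expressions exactly as they are stated in the text and prints their ratio for growing C. The function names and the channel counts are illustrative assumptions, not values taken from the paper.

```python
def dense_layer_cost(c, g):
    """Cost of one CED-Net dense layer as stated in the text: 25*G^2/4 + 4*C*G."""
    return 25 * g * g / 4 + 4 * c * g


def standard_conv3x3_cost(c, g):
    """Cost of an ordinary 3x3 convolution as stated in the text: 9*C*G."""
    return 9 * c * g


def cost_ratio(c, g):
    """Dense-layer cost divided by the ordinary 3x3 convolution cost."""
    return dense_layer_cost(c, g) / standard_conv3x3_cost(c, g)


if __name__ == "__main__":
    g = 8  # hypothetical number of output channels
    for c in (16, 64, 256, 1024):  # hypothetical input channel counts
        print(f"C={c:5d}  ratio={cost_ratio(c, g):.3f}")
```

Algebraically the ratio is 25G/(36C) + 4/9, so it approaches 4/9 (about 0.44) as C grows relative to G, which is consistent with the "roughly half" statement.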
Figure 3 shows how the number of channels changes through the bottleneck layer based on learned group convolution. The number of parameters and the amount of calculation are 1/4 of those of the standard bottleneck layer. Based on the image classification experiments on the CIFAR datasets, our structure increases the classification accuracy by about 0.4% when the number of parameters and the amount of calculation are almost the same as those of CondenseNet (see Section 4). When the network is deeper (depth 272), the number of parameters and the amount of calculation of CED-Net are smaller than those of a CondenseNet of the same depth, yet the classification accuracy is higher.
In CED-Net, since the network is densely connected, the input of each convolution layer carries a large amount of channel information, and the output after convolution combines the information of all previous channels. This leads to the entanglement of channel information and spatial relevance. Furthermore, in lightweight networks, group convolution significantly reduces the amount of computation by restricting each convolution to its own group of input channels. However, stacking multiple group convolutions has a side effect: each output channel is derived from only a small number of input channels. This reduces the information flow between channel groups and weakens the representation ability of the network.
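The side effect described above, and the way a channel permutation counteracts it, can be illustrated with channel indices alone. The sketch below assumes a transpose-style shuffle (view the channels as groups × n, transpose, flatten), which is the common ShuffleNet-style formulation; the exact permutation used by the Permute layer in CED-Net is not specified here, so `channel_permute` is a hypothetical helper.

```python
def channel_permute(channels, groups):
    """Interleave channels across groups.

    The channel list is viewed as a (groups, n) grid, transposed to
    (n, groups), and flattened again, so that every new group contains
    channels that came from every original group.
    """
    n, remainder = divmod(len(channels), groups)
    if remainder:
        raise ValueError("number of channels must be divisible by groups")
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


def split_into_groups(channels, groups):
    """Split a channel list into consecutive, equally sized groups."""
    n = len(channels) // groups
    return [channels[g * n:(g + 1) * n] for g in range(groups)]


if __name__ == "__main__":
    original = list(range(8))                     # channels 0..7, two groups of four
    print(split_into_groups(original, 2))         # groups before the permutation
    shuffled = channel_permute(original, 2)
    print(split_into_groups(shuffled, 2))         # groups after the permutation
```

Before the permutation, the second group convolution would only ever see channels 0 to 3 or 4 to 7; afterwards each group mixes channels from both original groups.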
Therefore, we use the channel permute (see Figure 2(c)) and the Squeeze-and-Excitation block to let information circulate between the groups and to allow the network to focus on more helpful information. As shown in Figure 4, Squeeze-and-Excitation blocks can improve the representation of the network by modeling the interdependence between convolution feature channels. The detailed process is divided into two steps: Squeeze and Excitation.
Squeeze. Because of the nature of convolution, each convolution filter in a CNN can only focus on specific local spatial information. To alleviate this problem, the Squeeze step, as a global description operation, encodes the global spatial information into a channel descriptor by calculating the mean of each channel through global average pooling.
$ {z}_{c} = {F}_{sq}\left({u}_{c}\right) = \frac{1}{W\times H}\sum _{i = 1}^{W}\sum _{j = 1}^{H}{u}_{c}(i, j) $ | (1) |
In Eq (1), $ {z}_{c} $ is the output of the squeeze layer for channel c, W and H are the width and height of the input feature map of the current layer, $ {u}_{c} $ is the input feature map of channel c, and $ {F}_{sq}\left(\cdot \right) $ captures the global information of the entire feature map. The global average pooling used in this paper squeezes each feature map into a single value that indicates the importance of the corresponding channel.
Excitation. To take advantage of the information obtained by the squeeze operation, the excitation operation needs to meet two criteria to fully capture channel dependencies. First, it must be able to learn nonlinear interactions between channels. Second, it must learn a non-mutually-exclusive relationship, so that multiple channels can be emphasized at once. Specifically, the gate mechanism is parameterized by two fully connected (FC) layers placed before and after a ReLU nonlinearity, and its output is activated with the sigmoid function.
$ s = {F}_{ex}\left(z, W\right) = \sigma \left(g\left(z, W\right)\right) = \sigma \left({W}_{2}\theta \left({W}_{1}z\right)\right) $ | (2) |
where $ \sigma $ is the sigmoid function and $ \theta $ is the ReLU function. $ {W}_{1}\in {\mathbb{R}}^{\frac{C}{r}\times C} $ and $ {W}_{2}\in {\mathbb{R}}^{C\times \frac{C}{r}} $ are the weights of the dimensionality-reduction layer and the dimensionality-increase layer, respectively, where r is the dimensionality reduction rate and C is the number of channels. To limit the complexity of the model and improve generalization, the two FC layers form a "bottleneck" around the nonlinear map, with r set to 16. Finally, the resulting gate values are multiplied by the corresponding feature maps, which controls the flow of information for each feature map.
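The two steps can be sketched in a few lines of plain Python. This is a minimal illustration of Eqs (1) and (2) under simplifying assumptions: feature maps are nested lists, the FC layers have no bias, and the toy weights use a reduction rate of r = 2 rather than the r = 16 used in the paper. All function names are hypothetical.

```python
import math


def squeeze(feature_maps):
    """Eq (1): global average pooling, one mean value per channel."""
    return [sum(sum(row) for row in u) / (len(u) * len(u[0])) for u in feature_maps]


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excitation(z, w1, w2):
    """Eq (2): s = sigmoid(W2 * ReLU(W1 * z)); W1 is (C/r) x C, W2 is C x (C/r)."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(w2, hidden)]


def se_block(feature_maps, w1, w2):
    """Rescale every channel by its gate value."""
    gates = excitation(squeeze(feature_maps), w1, w2)
    return [[[s * x for x in row] for row in u] for s, u in zip(gates, feature_maps)]


if __name__ == "__main__":
    # Two channels of size 2 x 2 (C = 2), toy reduction rate r = 2.
    u = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
    w1 = [[0.5, -1.0]]      # shape (C/r) x C = 1 x 2, illustrative values
    w2 = [[1.0], [-1.0]]    # shape C x (C/r) = 2 x 1, illustrative values
    print(squeeze(u))
    print(se_block(u, w1, w2))
```

Because the sigmoid gates are computed independently per channel, several channels can receive a high weight at the same time, which is the non-mutually-exclusive property mentioned above.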
We embed the Squeeze-and-Excitation block into the 3 × 3 G-Conv layer because the number of input/output feature channels in the first 3 × 3 G-Conv is the same and smaller. The Squeeze-and-Excitation block can enhance the useful channels after feature extraction with only a small number of extra parameters. According to the research results of Hu et al., this method can balance the accuracy of the model and the number of parameters.
Algorithm 1 Image classification based on CED-Net |
Input: In = datasets ($ {x}_{1}, {y}_{1} $), ($ {x}_{2}, {y}_{2} $), …, ($ {x}_{m}, {y}_{m} $) |
Output: Op = Classification accuracy: ($ {y}_{1}, {y}_{2}, \dots, {y}_{n} $) |
Set: CED-Net feature extraction: $ {G}_{k}(\cdot ) $, k $ \in $ (0, n) |
for i = 1 : m do |
$ \mathrm{Softmax}\left({G}_{k}\left({x}_{i}\right)\right) = \frac{{e}^{{g}_{i}}}{\sum _{k = 1}^{n}{e}^{{g}_{k}}} $ |
i $ \in $ [1, m], where $ {g}_{i} $ is one class value in $ {G}_{k}(\cdot ) $. |
end for |
Return Op |
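The classification step of Algorithm 1 can be sketched as follows. The class scores stand in for the output of the CED-Net feature extractor, which is not implemented here; `softmax`, `predict`, and `classification_accuracy` are hypothetical helper names, and the scores in the usage example are made-up values.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow, result is unchanged
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Index of the class with the highest softmax probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def classification_accuracy(all_scores, labels):
    """Fraction of samples whose predicted class matches the label."""
    correct = sum(1 for s, y in zip(all_scores, labels) if predict(s) == y)
    return correct / len(labels)


if __name__ == "__main__":
    scores = [[1.0, 3.0, 2.0], [5.0, 0.0, 0.0]]   # stand-in network outputs
    labels = [1, 1]
    print(classification_accuracy(scores, labels))
```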
CED-Net guarantees good performance while remaining lightweight because it effectively combines the bottleneck layer structure with channel enhancement blocks. An important difference between CED-Net and other network architectures is that CED-Net has very narrow layers. A relatively small channel growth rate is sufficient to obtain state-of-the-art results on the test datasets. This increases the proportion of features from the later layers relative to features from the earlier layers. So, we set the channel growth rate of a dense connection layer to 4. We also found that making the early stages too deep significantly increases the FLOPs of the network.
Architectural details. The model used in our experiments has three dense blocks. Before the data enters the first dense block, the input image goes through a 3 × 3 standard convolution with 16 output channels and a stride of 1 (see Table 1). In the dense layer, the number of channel enhancement blocks is set according to the growth rate, the input channels, and the output channels, see Eq (3).
$ n = \frac{{C}_{out}-{C}_{in}}{g} $ | (3) |
where g is the growth rate, $ {C}_{in} $ is the number of input channels, n is the number of channel enhancement blocks, and $ {C}_{out} $ is the number of output channels. For example, in the experiment, we set the growth rates to 8, 16, and 32, and the output channels of the three dense layers are 256, 736, and 1696, respectively, so the number of channel enhancement blocks in each dense layer is 30.
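Eq (3) can be checked against the network configuration tables with a short calculation. The sketch below assumes that the output channels of one stage become the input channels of the next, which is what the channel counts in Tables 1 and 2 imply; the function names are hypothetical.

```python
def num_blocks(c_in, c_out, g):
    """Eq (3): n = (C_out - C_in) / g, required to be a whole number."""
    n, remainder = divmod(c_out - c_in, g)
    if remainder:
        raise ValueError("C_out - C_in must be divisible by the growth rate")
    return n


def stage_output_channels(c_in, stages):
    """Output channels after each stage, given (growth_rate, num_blocks) pairs."""
    outputs = []
    for g, n in stages:
        c_in = c_in + g * n  # every block adds g channels
        outputs.append(c_in)
    return outputs


if __name__ == "__main__":
    # CIFAR configuration: 16 channels after the first convolution.
    print(stage_output_channels(16, [(8, 30), (16, 30), (32, 30)]))
    # ImageNet configuration: 64 channels after the first convolution.
    print(stage_output_channels(64, [(8, 4), (16, 6), (32, 8), (64, 10), (128, 8)]))
```

For the CIFAR configuration this yields 256, 736, and 1696 channels, matching the channel counts listed in Table 1.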
For each convolutional layer with a kernel size of 3 × 3, each side of the input is zero-padded to keep the feature size fixed. We add a batch normalization layer and the ReLU function after the last dense layer and then use global average pooling to compress the feature map into one dimension as the input of the Softmax layer. The exact network configuration is shown in Tables 1 and 2.
Table 1. CED-Net network configuration for the CIFAR-10 dataset.
Layers | Output Size | Output Channels | Repeat | Stride |
3 × 3 Convolution | 32 × 32 | 16 | 1 | 1 |
Dense bottleneck block | 32 × 32 | 256 (g = 8) | 30 | 1 |
Avg pooling | 16 × 16 | - | 1 | 2 |
Dense bottleneck block | 16 × 16 | 736 (g = 16) | 30 | 1 |
Avg pooling | 8 × 8 | - | 1 | 2 |
Dense bottleneck block | 8 × 8 | 1696 (g = 32) | 30 | 1 |
Global avg pooling | 1 × 1 | 1696 | 1 | 8 |
Fully connected | 1 × 1 | 10 | 1 | - |
Table 2. CED-Net network configuration for the ImageNet dataset.
Layers | Output Size | Output Channels | Repeat | Stride |
3 × 3 Convolution | 112 × 112 | 64 | 1 | 2 |
Dense bottleneck block | 112 × 112 | 96 (g = 8) | 4 | 1 |
Avg pooling | 56 × 56 | - | 1 | 2 |
Dense bottleneck block | 56 × 56 | 192 (g = 16) | 6 | 1 |
Avg pooling | 28 × 28 | - | 1 | 2 |
Dense bottleneck block | 28 × 28 | 448 (g = 32) | 8 | 1 |
Avg pooling | 14 × 14 | - | 1 | 2 |
Dense bottleneck block | 14 × 14 | 1088 (g = 64) | 10 | 1 |
Avg pooling | 7 × 7 | - | 1 | 2 |
Dense bottleneck block | 7 × 7 | 2112 (g = 128) | 8 | 1 |
Global avg pooling | 1 × 1 | 2112 | 1 | 7 |
Fully connected | 1 × 1 | 1000 | 1 | - |
The training process of CED-Net is shown in Algorithm 1. In the input, ($ {x}_{i}, {y}_{i} $) represent the images and labels of the i-th batch, respectively. For each batch, we use softmax to obtain the output $ {y}_{i} $ of CED-Net. Finally, the image features $ {G}_{k} $ of the n categories are obtained.
In this section, we conduct experiments on the CIFAR-10, CIFAR-100, and ImageNet (ILSVRC 2012) datasets. First, we compare CED-Net with other advanced convolutional neural networks, such as VGG-16, ResNet-101, and DenseNet. Then, we conduct ablation experiments on CED-Net, mainly comparing three networks: the base network of CED-Net-128, the network with only the bottleneck layer, and the network with only the channel enhancement block. Through these experiments, we verify the effectiveness of our improvements. Next, we introduce the datasets and the evaluation metrics of the experiments.
The CIFAR-10 and CIFAR-100 datasets consist of colored natural images with 32 × 32 pixels. CIFAR-10 consists of images drawn from 10 classes and CIFAR-100 from 100 classes. The training and test sets contain 50,000 and 10,000 images, respectively, and we hold out 5000 training images as a validation set. We adopt a standard data augmentation scheme (mirroring/shifting): each image is zero-padded with 4 pixels per side and then randomly cropped to generate a 32 × 32 image. The image is flipped horizontally with a probability of 0.5 and normalized by subtracting the channel mean and dividing by the channel standard deviation.
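The augmentation steps described above can be sketched for a single channel stored as a nested list. This is an illustration only: a real pipeline would operate on three-channel tensors, and the mean and standard deviation passed in the usage example are placeholder values, not the CIFAR statistics. All function names are hypothetical.

```python
import random


def zero_pad(image, pad):
    """Surround a 2-D image (list of rows) with `pad` zeros on every side."""
    width = len(image[0]) + 2 * pad
    padded_rows = [[0.0] * pad + list(row) + [0.0] * pad for row in image]
    border = [[0.0] * width for _ in range(pad)]
    return border + padded_rows + [[0.0] * width for _ in range(pad)]


def random_crop(image, size, rng):
    """Cut a random size x size window out of the image."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def random_hflip(image, rng, p=0.5):
    """Mirror the image horizontally with probability p."""
    return [row[::-1] for row in image] if rng.random() < p else image


def normalize(image, mean, std):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return [[(v - mean) / std for v in row] for row in image]


def augment(image, rng, mean, std, pad=4, size=32):
    """Pad, crop, flip, and normalize one channel."""
    out = random_crop(zero_pad(image, pad), size, rng)
    out = random_hflip(out, rng)
    return normalize(out, mean, std)


if __name__ == "__main__":
    rng = random.Random(0)
    channel = [[1.0] * 32 for _ in range(32)]             # dummy 32 x 32 channel
    result = augment(channel, rng, mean=0.5, std=0.25)    # placeholder statistics
    print(len(result), len(result[0]))
```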
The ImageNet dataset consists of 224 × 224 pixel colored natural images from 1000 classes. The training and validation sets contain 1,280,000 and 50,000 images, respectively. We adopt the data augmentation scheme at training time and, at test time, rescale the input image to 256 × 256 and take a 224 × 224 center crop before feeding it into the networks.
We evaluate CED-Net on three criteria:
Accuracy is the most common metric. It is the number of correctly classified samples divided by the total number of samples. Generally speaking, the higher the accuracy, the better the classifier:
$ accuracy = \left(TP+TN\right)/\left(P+N\right) $ | (4) |
where P (positive) is the number of positive examples in the sample, and N (negative) is the number of negative examples. TP (true positives) is the number of samples that are positive examples that are correctly classified. TN (true negatives) is the number of samples that are actually negative that are correctly classified.
The number of parameters measures the size of the model. For a single convolutional layer, we have:
$ parameters = {k}^{2}\times {C}_{in}\times {C}_{out} $ | (5) |
where $ k $ is the convolution filter's size, $ {C}_{in} $ is the number of input channels, and $ {C}_{out} $ is the number of output channels.
To measure the amount of calculation of the model, we compute the number of FLOPs of each layer. For convolutional kernels, we have:
$ FLOPs = 2HW({k}^{2}{C}_{in}+1){C}_{out} $ | (6) |
where 𝐻, 𝑊 are height and width. For fully connected layers, we compute FLOPs as:
$ FLOPs = (2I- 1)O $ | (7) |
where 𝐼 is the input dimensionality and 𝑂 is the output dimensionality.
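Eqs (5) to (7) translate directly into code. The sketch below applies them to two layers that appear in the CIFAR configuration; the three input channels of the first convolution are an assumption (RGB images), and the function names are hypothetical.

```python
def conv_params(k, c_in, c_out):
    """Eq (5): number of parameters of a k x k convolution (bias not counted)."""
    return k * k * c_in * c_out


def conv_flops(h, w, k, c_in, c_out):
    """Eq (6): FLOPs of a k x k convolution on an h x w feature map."""
    return 2 * h * w * (k * k * c_in + 1) * c_out


def fc_flops(i, o):
    """Eq (7): FLOPs of a fully connected layer with i inputs and o outputs."""
    return (2 * i - 1) * o


if __name__ == "__main__":
    # First 3 x 3 convolution on a 32 x 32 image, 3 -> 16 channels (RGB input assumed).
    print(conv_params(3, 3, 16))          # 432
    print(conv_flops(32, 32, 3, 3, 16))   # 917504
    # Final fully connected layer of the CIFAR-10 model, 1696 -> 10.
    print(fc_flops(1696, 10))             # 33910
```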
To further prove the stability of CED-Net, we added the interpretation and comparison of precision, recall and F-measure in the ablation experiment:
$ precision = TP/\left(TP+FP\right) $ | (8) |
$ recall = TP/\left(TP+FN\right) $ | (9) |
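$ F\text{-}measure = 2\times precision\times recall/\left(precision+recall\right) $ | (10) |
where FP (false positives) is the number of negative samples that are incorrectly classified as positive, and FN (false negatives) is the number of positive samples that are incorrectly classified as negative.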
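The metrics in Eqs (4) and (8) to (10) can be computed from the four confusion-matrix counts. The counts in the usage example are made-up numbers for illustration, and the function names are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Eq (4): (TP + TN) / (P + N), with P = TP + FN and N = TN + FP."""
    return (tp + tn) / (tp + fn + tn + fp)


def precision(tp, fp):
    """Eq (8): TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Eq (9): TP / (TP + FN)."""
    return tp / (tp + fn)


def f_measure(prec, rec):
    """Eq (10): harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)


if __name__ == "__main__":
    tp, tn, fp, fn = 90, 70, 10, 30   # illustrative counts only
    p, r = precision(tp, fp), recall(tp, fn)
    print(accuracy(tp, tn, fp, fn), p, r, f_measure(p, r))
```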
We train all models with stochastic gradient descent (SGD) using similar optimization hyper-parameters [23,24,25,26,27,28,29,30]. And we set the Nesterov momentum weight to 0.9 without damping and use a weight decay of 0.0001. All models are trained with mini-batch size 128 for 200 epochs on the training datasets. We use the cosine annealing learning rate curve, starting from 0.1 and gradually reducing to 0.
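A cosine annealing schedule with these settings can be sketched as follows. The closed form used here is the common half-cosine formulation; whether the learning rate is updated per epoch or per iteration is not stated in the text, so the per-epoch update in this sketch is an assumption, and `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(epoch, total_epochs=200, base_lr=0.1, min_lr=0.0):
    """Half-cosine decay from base_lr at epoch 0 to min_lr at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


if __name__ == "__main__":
    for epoch in (0, 50, 100, 150, 200):
        print(epoch, round(cosine_lr(epoch), 5))
```

The rate falls slowly at first, fastest around the midpoint (where it equals half the initial value), and flattens out again as it approaches 0.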
In this part, we train CED-Net and other advanced convolutional neural networks on the CIFAR-10 and CIFAR-100 datasets. We compare these models under the three evaluation criteria above. See Table 3 for a detailed list.
Table 3. Comparison of CED-Net with other CNN architectures on CIFAR-10 and CIFAR-100.
Model | Params | FLOPs | CIFAR-10 accuracy (%) | CIFAR-100 accuracy (%) |
VGG-16 | 14.73 M | 314 M | 92.64 | 72.23 |
ResNet-101 | 42.51 M | 2515 M | 93.75 | 77.78 |
ResNeXt-29 | 9.13 M | 1413 M | 94.82 | 78.83 |
MobileNet V2 | 2.30 M | 92 M | 94.43 | 68.08 |
DenseNet-121 | 6.96 M | 893 M | 94.04 | 77.01 |
CondenseNet-86 | 0.52 M | 65 M | 94.48 | 76.36 |
CondenseNet-182 | 4.20 M | 513 M | 95.87 | 80.13 |
CED-Net-128 | 0.69 M | 75 M | 94.89 | 77.35 |
CED-Net-272 | 5.32 M | 649 M | 96.31 | 80.72 |
In Table 3, we compare the 128-layer and 272-layer CED-Net with other state-of-the-art CNN architectures. All models were trained for 200 epochs. The results show that, after introducing the bottleneck layer structure and channel enhancement blocks, CED-Net improves the CIFAR-10 accuracy of CondenseNet by 0.4–0.5% at a small cost in parameters and FLOPs for the same number of stacked blocks. Moreover, compared with the more advanced MobileNet V2, CED-Net is more accurate without using depth-wise separable convolutions, has less than one third of the parameters (0.69 M vs. 2.30 M), and requires fewer FLOPs.
In this part, we train CED-Net and other advanced convolutional neural networks on the ImageNet dataset and compare these models under the evaluation criteria above. See Table 4 for a detailed list.
Table 4. Comparison of CED-Net with other CNN architectures on ImageNet.
Model | Params | FLOPs | Top-1 accuracy (%) | Top-5 accuracy (%) |
VGG-16 | 138.36 M | 15.48 G | 71.93 | 90.67 |
ResNet-101 | 44.55 M | 7.83 G | 80.13 | 95.4 |
MobileNet V2 | 3.5 M | 0.3 G | 71.8 | 91 |
DenseNet-121 | 7.98 M | 2.87 G | 74.98 | 92.29 |
CondenseNet | 4.8 M | 0.53 G | 73.8 | 91.7 |
SE-Net | 115 M | 20.78 G | 81.32 | 95.53 |
CED-Net-115 | 9.3 M | 1.13 G | 78.65 | 93.7 |
In Table 4, we compare the 115-layer CED-Net with other CNN architectures. The results show that the Top-1 and Top-5 accuracies are improved by 4.85 and 2 percentage points, respectively, compared with CondenseNet of the same depth, although the dense bottleneck block used in CED-Net is more complex. Compared with DenseNet, CED-Net increases the number of parameters by 16.5% but reduces the amount of calculation to 39.4% of DenseNet's (a 60.6% reduction); the Top-1 and Top-5 accuracies are improved by 3.67 and 1.41 percentage points, respectively. Compared with SE-Net, CED-Net is 2.67 percentage points lower in Top-1 accuracy, but its number of parameters is only 8.1% of that of SE-Net.
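The relative figures in this comparison can be recomputed from the values in Table 4. The sketch below does so for the DenseNet-121 and SE-Net rows; note that 1.13 G is 39.4% of 2.87 G, which corresponds to a reduction of about 60.6%. The helper names are hypothetical.

```python
def relative_change_percent(new, old):
    """Signed change of `new` relative to `old`, in percent."""
    return (new - old) / old * 100.0


def share_percent(part, whole):
    """`part` expressed as a percentage of `whole`."""
    return part / whole * 100.0


if __name__ == "__main__":
    # Values copied from Table 4 (params in M, FLOPs in G).
    ced = {"params": 9.3, "flops": 1.13, "top1": 78.65, "top5": 93.7}
    dense = {"params": 7.98, "flops": 2.87, "top1": 74.98, "top5": 92.29}
    se_params = 115.0

    print(round(relative_change_percent(ced["params"], dense["params"]), 1))  # parameter increase
    print(round(share_percent(ced["flops"], dense["flops"]), 1))              # FLOPs as share of DenseNet
    print(round(relative_change_percent(ced["flops"], dense["flops"]), 1))    # FLOPs change
    print(round(ced["top1"] - dense["top1"], 2), round(ced["top5"] - dense["top5"], 2))
    print(round(share_percent(ced["params"], se_params), 1))                  # params as share of SE-Net
```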
Some misclassified images are shown in Figure 5. These pictures may contain unavoidable interfering information. It is also possible that the network model constructed in this paper does not learn a sufficient number of diverse features and therefore cannot correctly identify every picture with different features.
In the dense bottleneck block shown in Figure 2, we use the learned group convolution before and after the 3 × 3 group convolution, which means that there are two consecutive learned group convolutions between two 3 × 3 group convolutions. The two index layers used are redundant, but we think the redundancy is necessary: it can improve the generalization performance of the learned group convolution and help subsequent feature extraction. However, this design increases the amount of calculation and the number of parameters of the intermediate convolution.
In this part, we perform an ablation experiment on CED-Net. We trained four models on the CIFAR-10 dataset: CED-Net-128a, with no bottleneck layer and no channel enhancement block; CED-Net-128b, with the convolutional layer structure changed to the bottleneck layer; CED-Net-128c, with the channel enhancement block added to CondenseNet-86; and CED-Net-128, which we propose in this paper.
In Table 5, CondenseNet-86 is our basic model. When we change the structure of CondenseNet into a bottleneck layer, the parameters and FLOPs of the network increase only slightly, and the accuracy increases by about 0.3%. When we add the channel enhancement block to CondenseNet-86, the FLOPs barely increase, but the number of parameters rises, and the accuracy improves by about 0.3%. In our CED-Net-128, the accuracy improves by about 0.4% over the basic model; the increase in parameters is mainly caused by the channel enhancement block, and the increase in FLOPs is mainly caused by the bottleneck layer structure. In addition, the accuracy, precision, recall, and F-measure of each model are very close, which indicates that the four models have extracted stable features.
Table 5. Ablation results on CIFAR-10.
Model | Params | FLOPs | Accuracy (%) | Precision (%) | Recall (%) | F-measure (%) |
CondenseNet (a) | 0.52 M | 65.82 M | 94.48 | 94.50 | 94.48 | 94.49 |
CED-Net-128 (b) | 0.59 M | 75.04 M | 94.75 | 94.75 | 94.75 | 94.75 |
CED-Net-128 (c) | 0.66 M | 67.04 M | 94.74 | 94.76 | 94.74 | 94.75 |
CED-Net-128 | 0.69 M | 75.41 M | 94.89 | 94.89 | 94.89 | 94.88 |
Note: (a) The basic model of CED-Net, the same as CondenseNet-86, without bottleneck layer and channel enhancement block; (b) the basic model of CED-Net with only a bottleneck layer added; (c) the basic model of CED-Net with only a channel enhancement block added. |
This paper introduces CED-Net, a more efficient densely connected convolutional neural network based on feature enhancement blocks and a bottleneck layer structure, which increases accuracy through learned group convolution and feature reuse. To make inference efficient, the pruned network can be converted to a network with conventional group convolution, which is efficiently implemented in most deep learning libraries. In our experiments, CED-Net outperformed its underlying network CondenseNet and other advanced convolutional neural networks such as MobileNet V2 and ResNeXt in terms of computational efficiency at the same accuracy level. Moreover, CED-Net has a much simpler structure with higher accuracy. We anticipate further research that combines this framework with Neural Architecture Search (NAS) to design more lightweight convolutional neural network models. We hope our work will draw more attention toward a broader view of using lightweight architectures for deep learning.
This work was supported by the National Natural Science Foundation of China (No. 61976217), the Opening Foundation of Key Laboratory of Opto-technology and Intelligent Control, Ministry of Education (KFKT2020-3), the Fundamental Research Funds of Central Universities (No. 2019XKQ YMS87), Science and Technology Planning Project of Xuzhou (No. KC21193).
The authors declare there is no conflict of interest.
[1] | Bannerman RL, Milders M, de Gelder B, et al. (2009) Orienting to threat: faster localization of fearful facial expressions and body postures revealed by saccadic eye movements. Proc Biol Sci 276(1662): 1635-1641. |
[2] | Simon EW, Rosen M, Ponpipom A (1996) Age and IQ as predictors of emotion identification in adults with mental retardation. Res Dev Disabil 17(5): 383-389. |
[3] | Eimer M (2011) The face-sensitive N170 component of the event-related brain potential. Oxford handbook face percept: 329-344. |
[4] | Rossion B, Jacques C (2011) 5 The N170: Understanding the time course. Oxford handbook potent components 115. |
[5] | Maurer U, Rossion B, McCandliss BD (2008) Category specificity in early perception: face and word N170 responses differ in both lateralization and habituation properties. Front Hum Neurosci. |
[6] | Eimer M, Kiss M, Nicholas S (2010) Response profile of the face-sensitive N170 component: a rapid adaptation study. Cerebral Cortex 312. |
[7] | Jacques C, Rossion B (2010) Misaligning face halves increases and delays the N170 specifically for upright faces: Implications for the nature of early face representations. Brain Res 13(18): 96-109. |
[8] | Itier RJ, Alain C, Sedore K, et al. (2007) Early face processing specificity: It's in the eyes! J Cog Neurosci 19: 1815-1826. |
[9] | Itier RJ, Batty M (2009) Neural bases of eye and gaze processing: the core of social cognition. Neurosci Biobehav Rev 33(6): 843-863. |
[10] | Dering B, Martin CD, Moro S, et al. (2011) Face-sensitive processes one hundred milliseconds after picture onset. Front Hum Neurosci 5. |
[11] | Eimer M (2011) The face-sensitivity of the n170 component. Front Hum Neurosci 5. |
[12] | Ganis G, Smith D, Schendan HE (2012) The N170, not the P1, indexes the earliest time for categorical perception of faces, regardless of interstimulus variance. Neuroimage 62(3): 1563-1574. |
[13] | Rossion B, Caharel S (2011) ERP evidence for the speed of face categorization in the human brain: Disentangling the contribution of low-level visual cues from face perception. Vision Res 51(12): 1297-1311. |
[14] | Dering B, Hoshino N, Theirry G (2013) N170 modulation is expertisedriven: evidence from word-inversion effects in speakers of different languages. Neuropsycholo Trend 13. |
[15] |
Tanaka JW, Curran T (2001) A Neural Basis for Expert Object Recognition. Psychol Sci 12: 43-47. doi: 10.1111/1467-9280.00308
![]() |
[16] |
Gauthier I, Curran T, Curby KM, et al. (2003) Perceptual interference supports a non-modular account of face processing. Nat Neurosci 6: 428-432. doi: 10.1038/nn1029
![]() |
[17] |
Fan C, Chen S, Zhang L, et al. (2015) N170 changes reflect competition between faces and identifiable characters during early visual processing. NeuroImage 110: 32-38. doi: 10.1016/j.neuroimage.2015.01.047
![]() |
[18] |
Rugg M D, Milner AD, Lines CR, et al. (1987) Modulation of visual event-related potentials by spatial and non-spatial visual selective attention. Neuropsychologia 25: 85-96. doi: 10.1016/0028-3932(87)90045-5
![]() |
[19] |
Schinkel S, Ivanova G, Kurths J, et al. (2014) Modulation of the N170 adaptation profile by higher level factors. Bio Psychol 97: 27-34. doi: 10.1016/j.biopsycho.2014.01.003
![]() |
[20] | Gong J, Lv J, Liu X, et al. (2008) Different responses to same stimuli. Neuroreport 19. |
[21] | Thierry G, Martin CD, Downing P, et al. (2007) Controlling for interstimulus perceptual variance abolishes N170 face selectivity. Nat Neurosci 10: 505-511. |
[22] |
Vuilleumier P, Pourtois G (2007) Distributed and interactive brain mechanisms during emotion face perception: Evidence from functional neuroimaging. Neuropsychologia 45: 174-194. doi: 10.1016/j.neuropsychologia.2006.06.003
![]() |
[23] |
Munte TF, Brack M, Grootheer O, et al. (1998) Brain potentials reveal the timing of face identity and expression judgments. Neurosci Res 30: 25-34. doi: 10.1016/S0168-0102(97)00118-1
![]() |
[24] |
Eimer M, Holmes A (2007) Event-related brain potential correlates of emotional face processing. Neuropsychologia 45: 15-31. doi: 10.1016/j.neuropsychologia.2006.04.022
![]() |
[25] |
Batty M, Taylor MJ (2003) Early processing of the six basic facial emotional expressions. Cog Brain Res 17: 613-620. doi: 10.1016/S0926-6410(03)00174-5
![]() |
[26] |
Krombholz A, Schaefer F, Boucsein W (2007) Modification of N170 by different emotional expression of schematic faces. Biol Psychol 76: 156-162. doi: 10.1016/j.biopsycho.2007.07.004
![]() |
[27] |
Jiang Y, Shannon RW, Vizueta N, et al. (2009) Dynamics of processing invisible faces in the brain: Automatic neural encoding of facial expression information. Neuroimage 44: 1171-1177. doi: 10.1016/j.neuroimage.2008.09.038
![]() |
[28] |
Hung Y, Smith ML, Bayle DJ, et al. (2010) Unattended emotional faces elicit early lateralized amygdala-frontal and fusiform activations. Neuroimage 50: 727-733. doi: 10.1016/j.neuroimage.2009.12.093
![]() |
[29] |
Pegna AJ, Landis T, Khateb A (2008) Electrophysiological evidence for early non-conscious processing of fearful facial expressions. Int J Psychophysiol 70: 127-136. doi: 10.1016/j.ijpsycho.2008.08.007
![]() |
[30] | Hinojosa JA, Mercado F, Carretié L (2015) N170 sensitivity to facial expression: A meta-analysis. Neurosci Biobehav Rev. |
[31] | Ledoux JE (1996) The emotional brain: The mysterious underpinnings of emotional life. New York: Simon & Schuster. |
[32] |
Öhman A, Flykt A, Esteves F (2001) Emotion drives attention: Detecting the snake in the grass. J Exper Psychology-General 130: 466-478. doi: 10.1037/0096-3445.130.3.466
![]() |
[33] | Luo Q, Holroyd T, Jones M, et al. (2007) Neural dynamics for facial threat processing as revealed by gamma band synchronization using MEG. Neuroimage 34(2): 839-847. |
[34] | Maratos FA, Mogg K, Bradley BP, et al. (2009) Coarse threat images reveal theta oscillations in the amygdala: a magnetoencephalography study. Cog Affect Behav Neurosci 9(2): 133-143. |
[35] | Maratos FA, Senior C, Mogg K, et al. (2012) Early gamma-band activity as a function of threat processing in the extrastriate visual cortex. Cog Neurosci 3(1): 62-68. |
[36] | Fox E, Russo R, Dutton K (2002) Attentional bias for threat: Evidence for delayed disengagement from emotional faces. Cog Emotion 16(3): 355-379. |
[37] | Gray KLH, Adams WJ, Hedger N, et al. (2013) Faces and awareness: low-level, not emotional factors determine perceptual dominance. Emotion 13(3): 537-544. |
[38] | Stein T, Seymour K, Hebart MN, et al. (2014) Rapid fear detection relies on high spatial frequencies. Psychol Sci 25(2): 566-574. |
[39] | Öhman A, Soares SC, Juth P, et al. (2012) Evolutionary derived modulations of attention to two common fear stimuli: Serpents and hostile humans. J Cog Psychol 24(1): 17-32. |
[40] | Dickins DS, Lipp OV (2014) Visual search for schematic emotional faces: angry faces are more than crosses. Cog Emotion 28(1): 98-114. |
[41] | Öhman A, Lundqvist D, Esteves F (2001) The face in the crowd revisited: a threat advantage with schematic stimuli. J Personal Soc Psychol 80: 381-396. |
[42] | Maratos FA, Mogg K, Bradley BP (2008) Identification of angry faces in the attentional blink. Cog Emotion 22(7): 1340-1352. |
[43] | Maratos FA (2011) Temporal processing of emotional stimuli: the capture and release of attention by angry faces. Emotion 11(5): 1242. |
[44] | Simione L, Calabrese L, Marucci FS, et al. (2014) Emotion based attentional priority for storage in visual short-term memory. PloS one 9(5): e95261. |
[45] | Pinkham AE, Griffin M, Baron R, et al. (2010) The face in the crowd effect: anger superiority when using real faces and multiple identities. Emotion 10(1): 141. |
[46] | Stein T, Sterzer P (2012) Not just another face in the crowd: detecting emotional schematic faces during continuous flash suppression. Emotion 12(5): 988. |
[47] |
Gratton G, Coles MGH, Donchin E (1983) A new method for off-line removal of ocular artifact. Electroencephalogr Clin Neurophysiol 55:468-474 doi: 10.1016/0013-4694(83)90135-9
![]() |
[48] | Kolassa IT, Musial F, Kolassa S, et al. (2006) Event-related potentials when identifying or color-naming threatening schematic stimuli in spider phobic and non-phobic individuals. BMC Psychiatry 6(38). |
[49] |
Babiloni C, Vecchio F, Buffo P, et al. (2010). Cortical responses to consciousness of schematic emotional facial expressions: A high‐resolution EEG study. Hum Brain Map 31(10): 1556-1569. doi: 10.1002/hbm.20958
![]() |
[50] | Deffke I, Sander T, Heidenreich J, et al. (2007) MEG/EEG sources of the 170 ms response to faces are co-localized in the fusiform gyrus. Neuroimage 35(4): 1495-1501. |
[51] | Luo S, Luo W, He W, et al. (2013) P1 and N170 components distinguish human-like and animal-like makeup stimuli. Neuroreport 24(9): 482-486. |
[52] | Mercure E, Cohen Kadosh K, Johnson M (2011) The N170 shows differential repetition effects for faces, objects, and orthographic stimuli. Front Hum Neurosci 5(6). |
Layers | Output Size | Output Channels | Repeat | Stride |
3 × 3 Convolution | 32 × 32 | 16 | 1 | 1 |
Dense bottleneck block | 32 × 32 | 256 (g = 8) | 30 | 1 |
Avg pooling | 16 × 16 | - | 1 | 2 |
Dense bottleneck block | 16 × 16 | 736 (g = 16) | 30 | 1 |
Avg pooling | 8 × 8 | - | 1 | 2 |
Dense bottleneck block | 8 × 8 | 1696 (g = 32) | 30 | 1 |
Global avg pooling | 1 × 1 | 1696 | 1 | 8 |
Fully connected | 1 × 1 | 10 | 1 | - |
Layers | Output Size | Output Channels | Repeat | Stride |
3 × 3 Convolution | 112 × 112 | 64 | 1 | 2 |
Dense bottleneck block | 112 × 112 | 96 (g = 8) | 4 | 1 |
Avg pooling | 56 × 56 | - | 1 | 2 |
Dense bottleneck block | 56 × 56 | 192 (g = 16) | 6 | 1 |
Avg pooling | 28 × 28 | - | 1 | 2 |
Dense bottleneck block | 28 × 28 | 448 (g = 32) | 8 | 1 |
Avg pooling | 14 × 14 | - | 1 | 2 |
Dense bottleneck block | 14 × 14 | 1088 (g = 64) | 10 | 1 |
Avg pooling | 7 × 7 | - | 1 | 2 |
Dense bottleneck block | 7 × 7 | 2112 (g = 128) | 8 | 1 |
Global avg pooling | 1 × 1 | 2112 | 1 | 7 |
Fully connected | 1 × 1 | 1000 | 1 | - |
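The Output Channels values in the two architecture tables above are consistent with dense concatenation if g is read as a per-layer growth rate, so that each repeated layer in a dense bottleneck block appends g channels to its input. That reading is an assumption made here for illustration (the tables do not define g), and the helper name `dense_stage_channels` is hypothetical. A minimal sketch of the bookkeeping:

```python
from typing import List, Tuple


def dense_stage_channels(stem_channels: int,
                         stages: List[Tuple[int, int]]) -> List[int]:
    """Channel count after each dense stage.

    `stages` is a list of (repeat, growth) pairs; every repeated layer
    concatenates `growth` new channels onto the running feature map.
    Pooling layers between stages leave the channel count unchanged.
    """
    channels = stem_channels
    outputs = []
    for repeat, growth in stages:
        channels += repeat * growth
        outputs.append(channels)
    return outputs


# 32 x 32 input configuration: 16-channel stem, three stages of 30 layers.
cifar = dense_stage_channels(16, [(30, 8), (30, 16), (30, 32)])
# 112 x 112 stem-output configuration: 64-channel stem, five stages.
imagenet = dense_stage_channels(
    64, [(4, 8), (6, 16), (8, 32), (10, 64), (8, 128)]
)

print(cifar)     # [256, 736, 1696]
print(imagenet)  # [96, 192, 448, 1088, 2112]
```

Both lists reproduce the Output Channels column of the corresponding table, which supports the growth-rate interpretation of g.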
Model | Params | FLOPs | CIFAR-10 | CIFAR-100 |
VGG-16 | 14.73 M | 314 M | 92.64 | 72.23 |
ResNet-101 | 42.51 M | 2515 M | 93.75 | 77.78 |
ResNeXt-29 | 9.13 M | 1413 M | 94.82 | 78.83 |
MobileNet V2 | 2.30 M | 92 M | 94.43 | 68.08 |
DenseNet-121 | 6.96 M | 893 M | 94.04 | 77.01 |
CondenseNet-86 | 0.52 M | 65 M | 94.48 | 76.36 |
CondenseNet-182 | 4.20 M | 513 M | 95.87 | 80.13 |
CED-Net-128 | 0.69 M | 75 M | 94.89 | 77.35 |
CED-Net-272 | 5.32 M | 649 M | 96.31 | 80.72 |
Model | Params | FLOPs | Top-1 | Top-5 |
VGG-16 | 138.36 M | 15.48 G | 71.93 | 90.67 |
ResNet-101 | 44.55 M | 7.83 G | 80.13 | 95.4 |
MobileNet V2 | 3.5 M | 0.3 G | 71.8 | 91 |
DenseNet-121 | 7.98 M | 2.87 G | 74.98 | 92.29 |
CondenseNet | 4.8 M | 0.53 G | 73.8 | 91.7 |
SE-Net | 115 M | 20.78 G | 81.32 | 95.53 |
CED-Net-115 | 9.3 M | 1.13 G | 78.65 | 93.7 |
Model | Params | FLOPs | Accuracy | Precision | Recall | F-measure |
CondenseNet (a) | 0.52 M | 65.82 M | 94.48 | 94.50 | 94.48 | 94.49 |
CED-Net-128 (b) | 0.59 M | 75.04 M | 94.75 | 94.75 | 94.75 | 94.75 |
CED-Net-128 (c) | 0.66 M | 67.04 M | 94.74 | 94.76 | 94.74 | 94.75 |
CED-Net-128 | 0.69 M | 75.41 M | 94.89 | 94.89 | 94.89 | 94.88 |
Note: (a) the basic model of CED-Net, identical to CondenseNet-86, without the bottleneck layer or the channel enhancement block; (b) the basic model of CED-Net with only a bottleneck layer added; (c) the basic model of CED-Net with only a channel enhancement block added. |