
Accurate segmentation of colonoscopic polyps is considered a fundamental step in medical image analysis and surgical interventions. Many recent studies have made improvements based on the encoder-decoder framework, which can effectively segment diverse polyps. Such improvements mainly aim to enhance local features by using global features and applying attention methods. However, relying only on the global information of the final encoder block can result in losing local regional features in the intermediate layer. In addition, determining the edges between benign regions and polyps could be a challenging task. To address the aforementioned issues, we propose a novel separated edge-guidance transformer (SegT) network that aims to build an effective polyp segmentation model. A transformer encoder that learns a more robust representation than existing convolutional neural network-based approaches was specifically applied. To determine the precise segmentation of polyps, we utilize a separated edge-guidance module consisting of separator and edge-guidance blocks. The separator block is a two-stream operator to highlight edges between the background and foreground, whereas the edge-guidance block lies behind both streams to strengthen the understanding of the edge. Lastly, an innovative cascade fusion module was used and fused the refined multi-level features. To evaluate the effectiveness of SegT, we conducted experiments with five challenging public datasets, and the proposed model achieved state-of-the-art performance.
Citation: Feiyu Chen, Haiping Ma, Weijia Zhang. SegT: Separated edge-guidance transformer network for polyp segmentation[J]. Mathematical Biosciences and Engineering, 2023, 20(10): 17803-17821. doi: 10.3934/mbe.2023791
According to the Globocan 2020 report, colorectal cancer (CRC) is the second leading cause of cancer death worldwide and the third most commonly diagnosed cancer [1]. Colorectal polyps are abnormal tissue growths in the lining of the colon and a precursor to CRC. If left untreated, polyps can develop into cancer over 10 to 15 years. The best way to lower the prevalence of CRC is early detection and effective treatment. Colonoscopy is the gold standard for examining the gastrointestinal tract; it is used to find polyps and remove them before they turn into cancer. However, colonoscopy is a highly operator-dependent procedure, and one in four polyps may be missed during a single colonoscopy owing to human factors, such as clinician skill or subjectivity [2]. In addition, there is evidence that missed lesions and incomplete resection of the tumor are two key factors in the development of cancer after colonoscopy [3]. Therefore, an automatic and accurate polyp segmentation method is needed to help doctors locate polyps.
Machine learning has been widely used in many research fields, such as wind power time series analysis [4] and water quality prediction [5]. However, automatically segmenting polyps remains a formidable challenge. Polyps, which result from abnormal cell growth in the human colon, are strongly related to their surrounding environment. They vary in shape, size, texture and color, making their appearance highly diverse. A further difficulty is that the edges of polyps and the surrounding mucosa are not always clearly distinguishable during colonoscopy, particularly under varying lighting conditions and when dealing with flat lesions or inadequate bowel preparation. In short, the wide variety of polyp appearances and the ambiguity of polyp edges and mucosal boundaries in colonoscopy images introduce considerable uncertainty into the learning process, making automatic polyp segmentation particularly demanding.
In recent years, the rapid development of deep learning has led to an increasing number of deep convolutional neural networks (DCNNs) [6,7,8,9,10,11] being proposed for polyp image segmentation. Brandao et al. [6] introduced the fully convolutional network (FCN) into polyp region extraction by converting AlexNet, the visual geometry group (VGG) network and ResNets into FCNs. U-shaped architectures [8,10], containing an encoder and a decoder built from convolutional layers, are widely used for segmentation tasks with impressive performance. However, convolutional neural networks (CNNs) have a limited receptive field, so such models capture only local information and disregard spatial context and global information. In addition, CNNs behave similarly to a series of high-pass filters, favoring high-frequency information. The transformer [12,13,14,15,16] is a more recently proposed deep neural network architecture. Compared with CNNs, the self-attention layer in a transformer is similar to a low-pass filter and can effectively capture long-term dependencies. Therefore, combining the advantages of convolutional and self-attention layers can improve the representation ability of deep networks.
Although these methods have substantially improved accuracy and generalization ability compared with traditional methods, locating the edges of polyps remains a challenge for them, as shown in Figure 1. The color and texture of polyps are markedly similar to those of surrounding tissues; low contrast gives polyps powerful camouflage properties [17] and makes them difficult to identify. Previous studies [18,19,20,21,22] have explored fusing low-scale boundary and high-scale semantic information to better preserve boundary details. Takikawa et al. [23] and Zhen et al. [24] designed a boundary stream and coupled the tasks of boundary and semantic modeling. PraNet [7] generated a global map as the initial guidance region and thereafter used a reverse attention module to reveal more complete objects. However, these endeavors seldom consider simulating how humans detect polyps whose boundaries blend into the background.
To address the previously mentioned issues, we are motivated to imitate the way humans detect polyp areas. We observe that when people look for potential polyp targets in colonoscopy images, they first search for the possible polyp region. Thereafter, they outline the precise edge of the polyp area by comparing the differences between the foreground and background. Inspired by this observation, we propose an effective separated edge-guidance transformer network (SegT) for polyp segmentation.
Following this design motivation, SegT exploits three main modules to improve polyp segmentation: the edge extractor module (EEM) to capture edge context information, the separated edge-guidance (SEG) module to accurately highlight the boundary of a polyp object from the foreground and background, and the cascade fusion module (CFM) to achieve more effective fusion of the output features. Specifically, unlike models that use only reverse attention [7,25] or only edge information [26,27], SEG not only contains two attention streams (a normal attention stream focusing on the foreground and a reverse attention stream focusing on the background of the input image) but also places an edge-guidance (EG) block behind each stream to enhance edge detection. Our model generates high-quality segmentation maps in a human-like manner and has demonstrated remarkable performance in various challenging scenarios.
The key contributions of our work are as follows:
● We propose a novel framework called SegT for polyp segmentation, which adopts PVT as the encoder, rather than the CNN backbones used in existing methods, to extract features.
● We design a SEG module, which is composed of two parts: separator (SE) and EG blocks. Their purpose is to simulate how humans detect polyp targets. In particular, the SE block is utilized to highlight the object's edges between an image's background and foreground. The EG block aims to embed edge information into the feature map, which can significantly address the "ambiguous" problem of edges.
● We present a CFM, which collects polyps' semantic and location information from the features through progressive integration to obtain refined segmentation results.
Traditional Methods. Computer-aided detection [28,29,30] is an effective alternative to manual detection, and hand-engineered methods are widely used in polyp detection. Traditional polyp segmentation schemes are mainly based on low-level features, such as texture and geometric features. In the method proposed by Sánchez-González et al. [28], the shape, color, and curvature features of edges are utilized for polyp segmentation. Figueiredo et al. [29] proposed a unified bottom-up and top-down saliency method for polyp detection that considers shape, color, and texture information. However, these methods have a high risk of missed or false detection owing to the high similarity between polyps and the surrounding tissues.
Deep Learning-Based Methods. Owing to the powerful feature expression and analysis capabilities of deep learning models [6,7,8,31,32], many deep learning-based methods have been proposed for polyp segmentation. Brandao et al. [6] introduced the FCN into polyp region extraction by converting a classification neural network. However, this fully convolutional architecture lacks detailed semantic features, and the segmentation results are not ideal. Encoder-decoder models, such as U-Net [8] and UNet++ [10], have become important frameworks in this direction and deliver excellent performance. U-Net [8] progressively up-samples feature maps and combines them with low-level feature maps of the corresponding scales through "skip connections." UNet++ [10] adds additional layers and dense connections to reduce the gap between low- and high-level features. With the increasing importance of polyp segmentation, attention mechanisms [33] have been designed specifically for polyp datasets in recent years. PraNet [7] utilizes a reverse attention module to establish the relationship between region and boundary cues, recovering a clear boundary between a polyp and its surrounding mucosa. However, solely using reverse attention may lead to false detections and introduce unnecessary noise. Inspired by Chen et al. [34], we adopt a separate attention mechanism that combines reverse and normal attention to focus on the background and foreground, respectively.
The transformer [14] is a markedly influential deep neural network architecture originally proposed for sequence modeling problems such as natural language processing, and it was not initially well suited to image analysis. To apply transformers to computer vision tasks, Dosovitskiy et al. [13] proposed the vision transformer (ViT), the first pure transformer for image classification. ViT splits an image into patches and processes them as a sequence of tokens, which keeps the computational cost manageable and enables transformers to process large-scale images efficiently. However, ViT requires large-scale datasets to train effectively and is severely limited when trained on small datasets. This property hinders its use in problems such as medical segmentation, where data are scarce.
Recent studies have attempted to further enhance ViT in several ways. DeiT [35] introduces a data-efficient training strategy combined with a distillation method, which helps improve performance when training on small datasets. The hierarchical visual transformer (HVT) [36] uses a hierarchical progressive pooling method to compress the sequence length of tokens, reducing redundancy and computation. Transformer in transformer (TNT) [37] adopts a transformer suited to fine-grained image tasks: it divides the original image patches into smaller units and performs self-attention within them, while global and local features are extracted by outer and inner transformers, respectively. Subsequent research has demonstrated that the pyramid structure of convolutional networks is also applicable to transformers and various downstream tasks, as in Swin Transformer [38], PVT [39] and Segformer [40]. PVT is less computationally intensive than ViT and uses the classic semantic feature pyramid network (FPN) to deploy semantic segmentation tasks.
In medical image segmentation, the TransUNet [41] and TransFuse [42] models are built on transformers for polyp segmentation and have achieved good results. TransUNet uses a hybrid ViT encoder and an up-sampling CNN decoder; stacking the CNN and transformer results in high computational cost. TransFuse addresses this problem with a parallel architecture. Both models use the attention gate mechanism [43] and a so-called BiFusion module, which make the network architecture large and highly complex. To train models efficiently on medical images, Polyp-PVT [19] introduces a similarity aggregation module based on graph convolution [44]. The cascaded attention decoder (CASCADE) [45] focuses on leveraging the multi-scale features of hierarchical vision transformers. However, these methods ignore the influence of boundary constraints on the polyp segmentation task.
Locating pixels on the border is considerably difficult, as demonstrated by many previous methods. To address this issue, various edge-aware (or boundary-aware) models have been developed to highlight these hard pixels.
Learning edge information has shown excellent performance in many image segmentation tasks in recent years. In early studies on FCN-based semantic segmentation, Bertasius et al. [46] and Chen et al. [47] used boundaries for post-processing to refine the results at the end of the network. SFANet [48] applies region-boundary constraints to supervise polyp learning. To compensate for missing object parts, Chen et al. [49] and Fan et al. [7] utilized reverse attention blocks to learn missing parts and details. However, using edge information only as a shape constraint or relying solely on reverse attention may lead to incorrect detection or introduce unnecessary noise. Several recent approaches explicitly treat boundary detection as an independent subtask parallel to semantic segmentation to achieve cleaner results. Ma et al. [26] explicitly exploited boundary information for context aggregation, further enhancing the semantic representation of the model. Kim et al. [27] went a step further than BlendMask [50] and explored base mask representations and boundary information for instance-specific features.
Although the preceding methods can improve performance, they only use boundary information as supplementary clues to effectively refine the target region segmentation. These methods minimally exploit the complementary relationship between regional and boundary features. Compared with the methods above, our proposed method can mine the deep information of the foreground and background and combine the boundary information to enhance the features at the junction of the foreground and background, thereby improving the segmentation performance of the polyp targets.
The full architecture of the network is shown in Figure 2. For an input $I \in \mathbb{R}^{W \times H \times 3}$, where $W$ and $H$ denote the width and height of the image, we use PVT as our backbone to extract the multi-level features $En_i, i \in \{1,2,3,4\}$. First, we feed $En_1$, $En_2$, $En_3$ and $En_4$ into the channel-wise feature pyramid (CFP) [51] to obtain features with different receptive fields. Second, we utilize the SEG module to refine the feature maps. The SEG module consists of SE blocks and EG blocks; the SE blocks contain normal and reverse attention streams to focus on the foreground and background, and the foreground maps $F_i, i \in \{1,2,3,4\}$ are supervised by the ground truth. Furthermore, we utilize an effective EEM to obtain the edge map, which is exploited in the EG blocks. Thereafter, we obtain the edge-refined maps, marked as $EG_i, i \in \{1,2,3\}$, and feed them to the CFM to fuse the refined feature maps, leading to a final feature map $f_1$. We take the sum of $EG_1$ and $f_1$ as the final output in the inference stage.
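To make the data flow concrete, the following PyTorch-style sketch outlines the forward pass described above. The module classes (`backbone`, `cfp`, `eem`, `seg_blocks`, `cfm`) and their interfaces are assumptions based on our reading of Figure 2 rather than the authors' released code, and the supervision heads for the intermediate maps are omitted.

```python
import torch.nn as nn

class SegTSketch(nn.Module):
    """Illustrative wiring of SegT's forward pass (a sketch, not the official implementation)."""
    def __init__(self, backbone, cfp, eem, seg_blocks, cfm):
        super().__init__()
        self.backbone = backbone      # PVT encoder returning En1..En4
        self.cfp = cfp                # channel-wise feature pyramid (CFP)
        self.eem = eem                # edge extractor module
        self.seg_blocks = seg_blocks  # nn.ModuleList of SEG modules for levels 1..3
        self.cfm = cfm                # cascade fusion module

    def forward(self, x):
        en1, en2, en3, en4 = self.backbone(x)            # multi-level features
        c1, c2, c3, c4 = self.cfp([en1, en2, en3, en4])  # multi-receptive-field features
        f_e, edge_map = self.eem(en1, en4)               # edge feature f_e and edge map
        f4 = c4                                          # coarse high-level map F4 (head omitted)
        # SEG refines levels 3 -> 1, each guided by the coarser foreground map and f_e.
        eg3, f3 = self.seg_blocks[2](c3, f4, f_e)
        eg2, f2 = self.seg_blocks[1](c2, f3, f_e)
        eg1, _ = self.seg_blocks[0](c1, f2, f_e)         # F1 is supervised during training
        fused = self.cfm(eg1, eg2, eg3, f4)              # cascaded fusion -> f1
        return fused + eg1                               # inference output: f1 + EG1
```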
Some recent works [40,52] report that ViTs [39,53] have stronger performance and robustness to input disturbances, such as noise, than CNNs [54,55]. Inspired by this, we choose a ViT as our backbone network to extract features for polyp segmentation. Compared with [13,38], PVT [39] is a pyramid architecture whose spatial-reduction attention reduces the consumption of computing resources. For the segmentation task, we design a polyp segmentation head on top of four multi-level feature maps (i.e., $En_1$, $En_2$, $En_3$, $En_4$). Among these feature maps, $En_1$ gives detailed low-level appearance information, and $En_2$, $En_3$ and $En_4$ provide high-level features.
Object detection can benefit from a good edge prior for segmentation and localization [56,57]. Even though low-level features contain rich edge details, they also introduce non-object edge information; high-level semantic or location information is therefore required to extract the edge features associated with the polyp area. As illustrated in Figure 2, we combine the high-level feature ($En_4$) and the low-level feature ($En_1$) in this module to model object-related edge information. First, the channels of $En_1$ and $En_4$ are separately changed to 32 and 256, respectively, using a 1 × 1 convolution layer. Second, the feature $En'_1$ and the up-sampled feature $En'_4$ are combined by concatenation. Lastly, we generate the edge feature $f_e$ using two 3 × 3 convolution layers and one 1 × 1 convolution layer. The produced edge map and its edge ground-truth label are measured using the binary cross-entropy loss function, which is given as follows:
$L_{edge} = -\sum_i \left[ E^{gt}_i \log(EM_i) + (1 - E^{gt}_i)\log(1 - EM_i) \right], \quad (3.1)$
where $EM_i$ denotes the produced edge map of the $i$-th image, obtained by upsampling the edge feature $f_e$, and $E^{gt}_i$ denotes the edge ground-truth map. In our model, $E^{gt}_i$ is extracted from the ground-truth map by the Canny edge detector during the training phase. Moreover, our EEM provides the edge-enhanced representation $f_e$ to guide detection in the SEG module. In addition, $f_e$ is cascaded to multiple supervisions to enhance the capability of the feature representations.
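A minimal PyTorch sketch of the EEM as described above is given below. The 32- and 256-channel reductions follow the text; the width of the two 3 × 3 convolutions (`mid_channels`) and the use of ReLU are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EEM(nn.Module):
    """Edge extractor module (sketch): fuses low-level En1 and high-level En4 features
    into an edge feature f_e and a single-channel edge map supervised by Eq (3.1)."""
    def __init__(self, c_low, c_high, mid_channels=64):
        super().__init__()
        self.reduce_low = nn.Conv2d(c_low, 32, kernel_size=1)     # En1 -> 32 channels
        self.reduce_high = nn.Conv2d(c_high, 256, kernel_size=1)  # En4 -> 256 channels
        self.fuse = nn.Sequential(                                # two 3x3 convolutions
            nn.Conv2d(32 + 256, mid_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.edge_head = nn.Conv2d(mid_channels, 1, kernel_size=1)  # final 1x1 convolution

    def forward(self, en1, en4):
        low = self.reduce_low(en1)
        high = F.interpolate(self.reduce_high(en4), size=low.shape[2:],
                             mode='bilinear', align_corners=False)  # up-sample En4'
        f_e = self.fuse(torch.cat([low, high], dim=1))              # edge feature f_e
        edge_map = self.edge_head(f_e)                              # logit map for L_edge
        return f_e, edge_map
```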
The original input images have blurred boundaries that camouflage polyp areas and make them difficult to segment. To address this issue, the SEG module is proposed. The module integrates the SE and EG blocks, as shown in Figure 3. The SE block contains forward and reverse streams to focus on the foreground and background, respectively. The EG block integrates edge information from the EEM into the feature space to enhance the model's sensitivity to the edge.
Separator. When delineating the polyp area in colonoscopic images, the information at the boundary between the foreground and background is an important cue. The human visual system perceives border information effectively because it fuses information from the background and the interior of the object. Inspired by [34], we adopt the SE block, which contains two streams that focus on the foreground and background, respectively. In the first stream, we erase the internal details of objects to focus on the background. Meanwhile, the internal information of the object is recovered in the second stream to focus on the foreground. The separator thus highlights the boundary through the synergy between foreground and background information. The separator can be written as follows:
$F_i = Out_i = \mathrm{Conv}(C_i \otimes \mathrm{expand}(\sigma(F_{i+1}))), \quad (3.2)$
$B_i = \mathrm{Conv}(C_i \otimes \mathrm{expand}(1 - \sigma(F_{i+1}))), \quad (3.3)$
where $C_i$ denotes the $i$-th layer feature map produced by the CFP module [51]. The foreground attention for the $i$-th layer is obtained by upsampling the coarse map of the $(i+1)$-th layer and applying the sigmoid function, written as $\sigma(F_{i+1})$, where $\sigma$ denotes the sigmoid function, $\mathrm{Conv}$ is a 1 × 1 convolution, $\otimes$ indicates element-wise multiplication, and $\mathrm{expand}(\cdot)$ expands the channels of the map to match $C_i$. The background attention is obtained by subtracting the foreground attention from 1, which is defined as $1 - \sigma(F_{i+1})$. $Out_i$ is the coarse output map of the $i$-th layer, which is supervised by the ground-truth map.
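The separator of Eqs (3.2) and (3.3) can be sketched as follows. Here the expand(·) operation is realized by broadcasting the single-channel attention map over the channels of $C_i$, and the output channel count of the 1 × 1 convolutions is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Separator(nn.Module):
    """SE block (sketch): splits C_i into foreground and background streams using the
    sigmoid of the up-sampled coarse map F_{i+1}, following Eqs (3.2) and (3.3)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_fg = nn.Conv2d(channels, channels, kernel_size=1)  # Conv for F_i
        self.conv_bg = nn.Conv2d(channels, channels, kernel_size=1)  # Conv for B_i

    def forward(self, c_i, coarse_next):
        # Up-sample the coarse single-channel map F_{i+1} to the spatial size of C_i,
        # then turn it into a [0, 1] attention map; broadcasting plays the role of expand().
        att = torch.sigmoid(F.interpolate(coarse_next, size=c_i.shape[2:],
                                          mode='bilinear', align_corners=False))
        fg = self.conv_fg(c_i * att)          # foreground stream F_i (Eq 3.2)
        bg = self.conv_bg(c_i * (1.0 - att))  # background stream B_i (Eq 3.3)
        return fg, bg
```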
Edge-guidance. We use $F_i$ and $B_i$ as the inputs to a channel attention module (CAM) [57,58], which helps represent features of different scales in a more general way. The attention, which obtains the weights of the feature maps at global and local scales, can be written as $W(F_i + B_i)$. After the attention module, we add an edge-guidance block to enhance the model's understanding of the edges so that they become more prominent after the two streams are merged. In particular, we integrate the information of the two streams by simple addition. The mechanism of the edge-guidance block is similar to a conditional normalization module with prior knowledge of the edge map. We treat the edge prediction as the condition, and the block embeds this spatial information into the feature maps $F_i$ and $B_i$, which allows the feature maps to learn better edge features. The operation is defined as follows:
$EG^F_i = BN(W(F_i + B_i) \otimes F_i) \otimes \mathrm{Conv}_{3\times3}(f_e) \oplus \mathrm{Conv}_{3\times3}(f_e), \quad (3.4)$
$EG^B_i = BN((1 - W(F_i + B_i)) \otimes B_i) \otimes \mathrm{Conv}_{3\times3}(f_e) \oplus \mathrm{Conv}_{3\times3}(f_e), \quad (3.5)$
$EG_i = EG^F_i \oplus EG^B_i, \quad i = 1, 2, 3, \quad (3.6)$
where $EG^F_i$ and $EG^B_i$ are the output results of the foreground and background streams, respectively, $EG_i$ is the output of the SEG module, $f_e$ is the edge feature map, $BN$ denotes batch normalization, and $\mathrm{Conv}_{3\times3}$ denotes a 3 × 3 convolutional layer that encodes the edge map and enlarges its channels to match the coarse feature maps. Thereafter, the shuffle attention module (SAM) [59] is utilized to make the model focus on the informative channels.
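A possible PyTorch realization of Eqs (3.4)–(3.6) is sketched below. The channel attention W(·) is abstracted as a small squeeze-and-excitation-style module, since the paper points to CAM [57,58] without detailing it here, and the final shuffle attention module is omitted for brevity.

```python
import torch.nn as nn
import torch.nn.functional as F

class EdgeGuidance(nn.Module):
    """EG block (sketch): modulates both streams with channel attention and embeds
    the edge feature f_e through 3x3 convolutions, following Eqs (3.4)-(3.6)."""
    def __init__(self, channels, edge_channels):
        super().__init__()
        self.att = nn.Sequential(                  # stand-in for the channel attention W(.)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.bn_fg = nn.BatchNorm2d(channels)
        self.bn_bg = nn.BatchNorm2d(channels)
        self.edge_conv = nn.Conv2d(edge_channels, channels, kernel_size=3, padding=1)

    def forward(self, fg, bg, f_e):
        w = self.att(fg + bg)                                      # W(F_i + B_i)
        e = self.edge_conv(F.interpolate(f_e, size=fg.shape[2:],   # Conv3x3(f_e), resized
                                         mode='bilinear', align_corners=False))
        eg_f = self.bn_fg(w * fg) * e + e                          # Eq (3.4)
        eg_b = self.bn_bg((1.0 - w) * bg) * e + e                  # Eq (3.5)
        return eg_f + eg_b                                         # Eq (3.6): EG_i
```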
A multi-level feature fusion strategy has been shown to improve segmentation performance, and the quality of the fusion strongly influences the segmentation result, so we design a CFM to achieve more effective fusion of the output features. The CFM takes the three edge-guided maps of the first-round predictions, marked as $\{EG_i, i = 1, \ldots, 3\}$, and $F_4$. Each lower-level feature map is aggregated with the result of the previous fusion step. The fusion operation can be summarized as $\mathrm{Fusion}(f_i, EG_i) = \mathrm{Concat}(f_i \otimes EG_i, EG_i)$. The four levels of fused features are stacked as shown in Figure 2; the fused features $\{f_i, i = 1, \ldots, 4\}$ are computed using Eq (3.7), and the final output is computed as $\sum_i f_i$.
$\begin{cases} f_4 = F_4 \\ f_3 = \mathrm{Fusion}(f_4, EG_3) \\ f_2 = \mathrm{Fusion}(f_3, EG_2) \\ f_1 = \mathrm{Fusion}(f_2, EG_1) \end{cases} \quad (3.7)$
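One way to realize the fusion of Eq (3.7) is sketched below. Because the concatenation doubles the channel count, a 3 × 3 convolution is added after each fusion step to restore it; this reduction and the bilinear resizing are implementation assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeFusion(nn.Module):
    """CFM (sketch): progressively fuses EG_3..EG_1 with the coarse map F_4 (Eq 3.7)."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.ModuleList(  # one channel-reducing conv per fusion step
            [nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1) for _ in range(3)]
        )

    def _fuse(self, f_prev, eg_i, conv):
        # Fusion(f, EG) = Concat(f * EG, EG), then reduce the channels back.
        f_prev = F.interpolate(f_prev, size=eg_i.shape[2:], mode='bilinear', align_corners=False)
        return conv(torch.cat([f_prev * eg_i, eg_i], dim=1))

    def forward(self, eg1, eg2, eg3, f4):
        f3 = self._fuse(f4, eg3, self.reduce[0])
        f2 = self._fuse(f3, eg2, self.reduce[1])
        f1 = self._fuse(f2, eg1, self.reduce[2])
        return f1
```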
Binary cross-entropy loss is widely used in many polyp segmentation tasks. However, it has clear shortcomings that will lead to poor performance when the number of foreground pixels is considerably less than that of background pixels. Inspired by[60], we combine the two loss functions as the total loss for supervision with the following formula:
$L_t = L^w_{BCE} + L^w_{IoU}, \quad (3.8)$
where $L^w_{IoU}$ and $L^w_{BCE}$ denote the weighted IoU loss and the weighted BCE loss for global and local restrictions, respectively. Note that $L^w_{IoU}$ increases the weights of hard pixels to highlight their importance, and $L^w_{BCE}$ pays more attention to hard pixels rather than treating all pixels equally. Moreover, our model has six supervised outputs: four foreground maps ($F_1$, $F_2$, $F_3$, $F_4$), one feature fusion map $f_1$ and one edge map. Each map (i.e., $F_1$, $F_2$, $F_3$, $F_4$, $f_1$) is up-sampled to the same size as the ground-truth map $G$. Thus, the final total loss function can be represented as follows:
$L_{total} = \sum_{i=1}^{4} L_t(F_i, G) + L_t(f_1, G) + L_{edge}. \quad (3.9)$
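The weighted BCE + weighted IoU combination of Eqs (3.8) and (3.9) is commonly implemented as the PraNet-style "structure loss" [7,60]; the sketch below follows that formulation, with the boundary-emphasizing weights being an assumption since the exact weighting scheme is not spelled out here.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss (PraNet-style sketch).
    pred: logit map, mask: binary ground truth, both of shape (B, 1, H, W)."""
    # Pixels whose local neighborhood disagrees with the mask (boundary-like pixels)
    # receive larger weights.
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(side_outputs, fused, edge_map, mask, edge_gt):
    """Eq (3.9): deep supervision over F_1..F_4, the fused map f_1 and the edge map.
    edge_map is assumed to be already up-sampled to the size of edge_gt."""
    loss = sum(structure_loss(F.interpolate(o, size=mask.shape[2:], mode='bilinear',
                                            align_corners=False), mask) for o in side_outputs)
    fused_up = F.interpolate(fused, size=mask.shape[2:], mode='bilinear', align_corners=False)
    loss = loss + structure_loss(fused_up, mask)
    loss = loss + F.binary_cross_entropy_with_logits(edge_map, edge_gt)  # L_edge, Eq (3.1)
    return loss
```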
Following PraNet [7], we conduct experiments on five polyp segmentation datasets: ETIS [61], CVC-ClinicDB (ClinicDB) [62], CVC-ColonDB (ColonDB) [63], EndoScene-CVC300 (EndoScene) [64] and Kvasir-SEG (Kvasir) [65]. Our training set contains 900 randomly selected images from Kvasir and 550 images from CVC-ClinicDB, while the remaining 100 images from Kvasir and 62 images from CVC-ClinicDB are used as test sets. Testing on out-of-distribution (unseen) data uses ColonDB with 380 images, EndoScene with 60 images and ETIS with 196 images. Three widely used metrics, namely, mean Dice (mDice), mean IoU (mIoU) and mean absolute error (MAE), are used to evaluate model performance.
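For reference, the three metrics can be computed per image as in the short sketch below and then averaged over a dataset; thresholding the probability map at 0.5 for mDice/mIoU is a common convention and is assumed here, as the exact evaluation protocol is not restated in this section.

```python
import numpy as np

def evaluate_image(pred, gt, eps=1e-8):
    """Per-image Dice, IoU and MAE for a probability map `pred` and binary mask `gt`."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mae = np.abs(pred - gt).mean()               # MAE on the raw probability map
    pred_bin = (pred >= 0.5).astype(np.float64)  # assumed threshold of 0.5
    inter = (pred_bin * gt).sum()
    dice = (2 * inter + eps) / (pred_bin.sum() + gt.sum() + eps)
    iou = (inter + eps) / (pred_bin.sum() + gt.sum() - inter + eps)
    return dice, iou, mae  # dataset-level mDice/mIoU/MAE are the means over all images
```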
Our method is implemented in the PyTorch framework and runs on an NVIDIA GeForce RTX 3090 GPU. Considering the differences in the sizes of the polyp images, each input image is simply resized to 352 × 352, and we then adopt a multi-scale training strategy [7,66,67]. The network is trained end-to-end with the AdamW [68] optimizer. The learning rate and the weight decay are both set to 1 × 10^{-4}, and the batch size is set to 16.
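The setup above can be summarized in a short training skeleton. The multi-scale ratios {0.75, 1.0, 1.25} follow the common practice of PraNet [7] and are an assumption, as the exact scales are not listed; `criterion` stands for the total loss of Eq (3.9).

```python
import torch.nn.functional as F
from torch.optim import AdamW

def train_one_epoch(model, loader, criterion, optimizer, device,
                    base_size=352, scales=(0.75, 1.0, 1.25)):
    """One epoch of multi-scale training (sketch). `loader` yields (image, mask, edge_gt)."""
    model.train()
    for images, masks, edge_gts in loader:
        for s in scales:                                # multi-scale training strategy
            size = int(round(base_size * s / 32) * 32)  # keep sizes divisible by 32
            img = F.interpolate(images.to(device), size=(size, size),
                                mode='bilinear', align_corners=False)
            msk = F.interpolate(masks.to(device), size=(size, size),
                                mode='bilinear', align_corners=False)
            egt = F.interpolate(edge_gts.to(device), size=(size, size),
                                mode='bilinear', align_corners=False)
            loss = criterion(model(img), msk, egt)      # model outputs: assumed interface
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Settings from the paper: AdamW with lr = 1e-4, weight decay = 1e-4, batch size 16.
# optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
```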
We first evaluate the segmentation performance of the proposed SegT model on the seen datasets. As summarized in Table 1, our model is compared with four recently published CNN-based networks, U-Net [8], UNet++ [10], PraNet [7] and CaraNet [25], and outperforms all of them on the seen datasets. Against the two classic medical image segmentation networks in the first two rows (U-Net and UNet++), SegT achieves gains of over 10% in mDice and mIoU on the Kvasir and ClinicDB datasets. The 3rd and 4th rows list state-of-the-art models for the polyp segmentation task, and our proposed model outperforms both in mDice, mIoU and MAE on the seen datasets. Table 1 also reports the results of three transformer-based methods (TransUNet [41], TransFuse [42] and Polyp-PVT [19]) alongside our proposed framework. Although TransFuse comes close to our model on the ClinicDB dataset, our model is more stable in terms of overall performance. Furthermore, without considering model complexity, our model obtains a 1% improvement in mIoU on the challenging Kvasir dataset. Compared with the other two transformer-based models, the advantage in metric performance is more obvious.
Methods | Pub. | Type | Kvasir mDice↑ | Kvasir mIoU↑ | Kvasir MAE↓ | ClinicDB mDice↑ | ClinicDB mIoU↑ | ClinicDB MAE↓
--- | --- | --- | --- | --- | --- | --- | --- | ---
U-Net[8] | MICCAI'15 | CNN | 0.818 | 0.746 | 0.055 | 0.823 | 0.755 | 0.019 |
UNet++[10] | TMI'19 | CNN | 0.821 | 0.743 | 0.048 | 0.794 | 0.729 | 0.022 |
PraNet[7] | MICCAI'20 | CNN | 0.898 | 0.840 | 0.030 | 0.899 | 0.849 | 0.009 |
CaraNet[25] | JMI'23 | CNN | 0.918 | 0.865 | 0.023 | 0.936 | 0.887 | 0.007 |
TransUNet[41] | arXiv'21 | Transformer | 0.913 | 0.857 | 0.028 | 0.935 | 0.887 | 0.008 |
TransFuse[42] | MICCAI'21 | Transformer+CNN | 0.920 | 0.870 | 0.023 | 0.942 | 0.897 | 0.007 |
Polyp-PVT[19] | CAAI AIR'23 | Transformer | 0.917 | 0.864 | 0.023 | 0.937 | 0.889 | 0.006 |
SegT (Ours) | - | Transformer | 0.927 | 0.880 | 0.023 | 0.940 | 0.897 | 0.006 |
We further evaluate the generalization capability of our model on the unseen datasets (i.e., ETIS, ColonDB and EndoScene). Table 2 shows that our model outperforms the existing medical segmentation baselines on the unseen datasets. Concretely, the performance gains over the best contender built on a CNN-based backbone (i.e., CaraNet) are (4.1%, 4.3%, 0.016) for (mDice, mIoU, MAE) on the ColonDB dataset and (6.3%, 6%, 0.004) on the ETIS dataset. In addition, when compared with transformer-based backbones on the challenging ETIS dataset, our SegT surpasses the best competing method (i.e., Polyp-PVT) by 2.3 and 2.5% in mDice and mIoU, respectively. However, on the EndoScene dataset our method does not demonstrate a significant performance advantage over the other approaches. This outcome is mainly due to the fact that this test set consists of only 60 images, which makes it difficult to draw definitive conclusions about the superiority of the different methods.
Methods | Pub. | Type | ColonDB mDice↑ | ColonDB mIoU↑ | ColonDB MAE↓ | ETIS mDice↑ | ETIS mIoU↑ | ETIS MAE↓ | EndoScene mDice↑ | EndoScene mIoU↑ | EndoScene MAE↓
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
U-Net [8] | MICCAI'15 | CNN | 0.512 | 0.444 | 0.061 | 0.398 | 0.335 | 0.036 | 0.710 | 0.627 | 0.022 |
UNet++ [10] | TMI'19 | CNN | 0.483 | 0.410 | 0.064 | 0.401 | 0.344 | 0.035 | 0.707 | 0.624 | 0.018 |
PraNet [7] | MICCAI'20 | CNN | 0.712 | 0.640 | 0.043 | 0.628 | 0.567 | 0.031 | 0.851 | 0.797 | 0.010 |
CaraNet [25] | JMI'23 | CNN | 0.773 | 0.689 | 0.042 | 0.747 | 0.672 | 0.017 | 0.903 | 0.838 | 0.007 |
TransUNet [41] | arXiv'21 | Transformer | 0.781 | 0.699 | 0.036 | 0.731 | 0.824 | 0.021 | 0.893 | 0.660 | 0.009 |
TransFuse [42] | MICCAI'21 | Transformer+CNN | 0.781 | 0.706 | 0.035 | 0.737 | 0.826 | 0.020 | 0.894 | 0.654 | 0.009 |
Polyp-PVT [19] | CAAI AIR'23 | Transformer | 0.808 | 0.727 | 0.031 | 0.787 | 0.706 | 0.013 | 0.900 | 0.833 | 0.007 |
SegT (Ours) | - | Transformer | 0.814 | 0.732 | 0.026 | 0.810 | 0.732 | 0.013 | 0.895 | 0.828 | 0.008 |
To intuitively demonstrate the performance of SegT, several representative prediction maps of SegT and other state-of-the-art models are shown in Figure 4. The 1st row shows examples with low-contrast backgrounds: while most competing methods cannot identify the boundary area, our SegT correctly segments almost all the polyp regions. The arrows indicate the boundary areas where the segmentation results of other methods do not fit the ground truth well. The 2nd row is an example of an occluded polyp, for which SegT produces accurate results while the other methods tend to be much less accurate. The 3rd and 4th rows show relatively large and small targets, respectively; SegT correctly identifies all the targets, whereas the other methods tend to miss several boundary details, as indicated by the arrows. The 5th row is an example of brightness interference, where SegT not only accurately segments the target but also suppresses the salient distraction. In summary, SegT produces high-quality prediction maps under various challenging scenarios. It is worth noting that in these examples, the texture of the polyps is almost identical to that of the surrounding environment, which shows that SegT effectively locates the targets by leveraging edge cues.
We describe in detail the effectiveness of each component of the overall model. The training, testing and hyper-parameter settings are the same as those mentioned in Section B. We evaluate module effectiveness by removing components from the complete SegT on three datasets, using mDice, mIoU and MAE for evaluation. To better explain the relationships between models, we label the different experimental models a to g. Model a consists of the backbone network PVTv2 and the CFP module (the baseline); Model b adds the SE block to Model a; Model c adds the full SEG module (SE + EG) to Model a; Model d adds the CFM to Model a; Models e and f remove the SE block and the EG block, respectively, from the complete model; and Model g is our final model (SEG + CFM). We evaluate the seven models on three benchmark datasets. Quantitative experimental results are shown in Table 3.
Methods | Kvasir (seen) mDice↑ | Kvasir (seen) mIoU↑ | Kvasir (seen) MAE↓ | ETIS (unseen) mDice↑ | ETIS (unseen) mIoU↑ | ETIS (unseen) MAE↓ | ColonDB (unseen) mDice↑ | ColonDB (unseen) mIoU↑ | ColonDB (unseen) MAE↓
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
a. baseline | 0.910 | 0.859 | 0.030 | 0.759 | 0.688 | 0.017 | 0.796 | 0.707 | 0.031 |
b. baseline + SE | 0.914 | 0.856 | 0.033 | 0.767 | 0.707 | 0.018 | 0.799 | 0.721 | 0.031 |
c. baseline + SEG (SE + EG) | 0.919 | 0.869 | 0.028 | 0.795 | 0.714 | 0.016 | 0.810 | 0.727 | 0.030 |
d. baseline + CFM | 0.913 | 0.855 | 0.034 | 0.764 | 0.701 | 0.018 | 0.792 | 0.721 | 0.032 |
e. w/o SE | 0.916 | 0.865 | 0.030 | 0.779 | 0.701 | 0.019 | 0.794 | 0.710 | 0.032 |
f. w/o EG | 0.914 | 0.861 | 0.031 | 0.777 | 0.703 | 0.019 | 0.798 | 0.709 | 0.031 |
g. SEG+ CFM (Ours) | 0.927 | 0.880 | 0.023 | 0.810 | 0.732 | 0.013 | 0.814 | 0.732 | 0.026 |
Effectiveness of SEG. Comparing Model a with Model b, we observe that Model b outperforms Model a on most evaluation metrics, which means that adding the separator block improves the model. The clear improvement in the evaluation metrics shows that the separator can highlight the boundaries of objects by attending to the foreground and background information separately, thereby improving the accuracy of polyp segmentation. To validate the effectiveness of the edge-guidance, we compare Models b and c: after adding the EG block, the performance of Model c increases over that of Model b. Moreover, we further investigate the contribution of the SEG module by removing it from the overall model, which gives Model d; without the SEG module, the performance drops sharply on all three datasets. Compared with Model d, Models e and f show an improvement in most evaluation metrics, which demonstrates that the two blocks in the SEG module work effectively. Since the separator between the foreground and the background, that is, the boundary of the polyp area, contains relatively few pixels, we exploit the edge-guidance to embed additional edge information into the features and strengthen the model's understanding of the boundary. With the help of the edge-guidance, the predicted result maintains a clear edge structure of the object.
Effectiveness of CFM. Similarly, we test the effectiveness of the CFM by removing it from the overall model and replacing it with an element-wise addition operation, which corresponds to Model c. Compared with SegT, the performance of Model c drops on all three datasets by a large margin. This degradation demonstrates that the CFM helps to effectively integrate the refined feature information at every stage. Comparing Model a with Model d, the baseline model with the CFM also performs better on most of the evaluation metrics.
The visual results are given in Figure 5. Red and green indicate regions that are accurately detected and wrongly predicted, respectively. Evidently, our designed modules obtain significant improvements in the edge detection of both small and large target regions. We observe that the SEG module refines fine-grained ambiguous boundaries, and the CFM significantly improves the accuracy of object detection and target localization.
We proposed a new polyp image segmentation framework called SegT. SegT is inspired by the way humans observe objects with blurred boundaries: by comparing the foreground and background, the outline of the object can be delineated. Therefore, this research argues that boundary information enhances polyp segmentation. On the basis of these observations, we first utilize a PVT backbone as an encoder to extract more powerful and robust features. Thereafter, we propose a SEG module composed of two blocks (i.e., SE and EG blocks). The SE block separates two streams: one stream focuses on the foreground and disregards the background, while the other focuses on the background and erases the foreground. After each stream, edge information is embedded into the features using the EG block, and the two streams are fused to enhance the model's ability to detect object boundaries. Lastly, the CFM is used to obtain more accurate features. Extensive experiments show that SegT consistently outperforms competing methods on five challenging datasets without any pre- or post-processing.
Although the SegT model provides a powerful and effective solution for the polyp segmentation task, some limitations still deserve further exploration. First, in the current work, the boundary information is collected explicitly to guide the foreground and background so that the subtle differences between the two delineate the boundary of the polyp, whereas for humans this information extraction and integration is implicitly part of the knowledge-learning process. Moreover, this design brings additional inference cost. In future work, we will further simplify the inference structure to make it more consistent with the actual human decision-making process. In addition, the backbone of the SegT model was pre-trained on ImageNet, whose natural images differ considerably from medical images. In future work, we will use pre-training that is better suited to medical image segmentation and adapt the model structure for 3-D medical image segmentation.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
We would like to thank Professor Zhaowei Wang from the Shaoxing Hospital and our colleague Ms Xue Cheng for their support. This research was also funded by the university-level key scientific research platform program of Shaoxing University.
The authors declare there is no conflict of interest.
[1] | H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, et al., Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA: Cancer J. Clin., 71 (2021), 209–249. https://doi.org/10.3322/caac.21660 |
[2] | S. B. Ahn, D. S. Han, J. H. Bae, T. J. Byun, J. P. Kim, C. S. Eun, The miss rate for colorectal adenoma determined by quality-adjusted, back-to-back colonoscopies, Gut Liver, 6 (2012), 64. https://doi.org/10.5009/gnl.2012.6.1.64 |
[3] | C. M. C. Le Clercq, M. W. E. Bouwens, E. J. A. Rondagh, C. M. Bakker, E. T. P. Keulen, R. J. de Ridder, et al., Postcolonoscopy colorectal cancers are preventable: a population-based study, Gut, 63 (2014), 957–963. http://doi.org/10.1136/gutjnl-2013-304880 |
[4] | C. Hao, T. Jin, F. Tan, J. Gao, Z. Ma, J. Cao, The analysis of time-varying high-order moment of wind power time series, Energy Rep., 9 (2023), 3154–3159. https://doi.org/10.1016/j.egyr.2023.02.010 |
[5] | J. Cao, D. Zhao, C. Tian, T. Jin, F. Song, Adopting improved adam optimizer to train dendritic neuron model for water quality prediction, Math. Biosci. Eng., 20 (2023), 9489–9510. https://doi.org/10.3934/mbe.2023417 |
[6] | P. Brandao, O. Zisimopoulos, E. Mazomenos, G. Ciuti, J. Bernal, M. Visentini-Scarzanella, et al., Towards a computed-aided diagnosis system in colonoscopy: automatic polyp segmentation using convolution neural networks, J. Med. Rob. Res., 3 (2018). https://doi.org/10.1142/S2424905X18400020 |
[7] | D. Fan, G. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, et al., Pranet: Parallel reverse attention network for polyp segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, 12266 (2020), 263–273. https://doi.org/10.1007/978-3-030-59725-2_26 |
[8] | O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, 9351 (2015), 234–241. https://doi.org/10.1007/978-3-319-24574-4_28 |
[9] | R. Zhang, G. Li, Z. Li, S. Cui, D. Qian, Y. Yu, Adaptive context selection for polyp segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, 12266 (2020), 253–262. https://doi.org/10.1007/978-3-030-59725-2_25 |
[10] | Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, J. Liang, Unet++: A nested u-net architecture for medical image segmentation, in International Workshop on Deep Learning in Medical Image Analysis, 11045 (2018), 3–11. https://doi.org/10.1007/978-3-030-00889-5_1 |
[11] | F. Shen, X. Du, L. Zhang, X. Shu, J. Tang, Triplet contrastive learning for unsupervised vehicle re-identification, preprint, arXiv: 2301.09498. |
[12] | N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in European Conference on Computer Vision, 12346 (2020), 213–229. https://doi.org/10.1007/978-3-030-58452-8_13 |
[13] | A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image is worth 16 × 16 words: Transformers for image recognition at scale, preprint, arXiv: 2010.11929. |
[14] | A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, preprint, arXiv: 1706.03762. |
[15] | L. Pan, W. Luan, Y. Zheng, Q. Fu, J. Li, PSGformer: Enhancing 3D point cloud instance segmentation via precise semantic guidance, preprint, arXiv: 2307.07708. |
[16] | F. Shen, Y. Xie, J. Zhu, X. Zhu, H. Zeng, Git: Graph interactive transformer for vehicle re-identification, IEEE Trans. Image Process., 32 (2023), 1039–1051. https://doi.org/10.1109/TIP.2023.3238642 |
[17] | D. Fan, G. Ji, M. Cheng, L. Shao, Concealed object detection, IEEE Trans. Pattern Anal. Mach. Intell., 44 (2021), 6024–6042. https://doi.org/10.1109/TPAMI.2021.3085766 |
[18] | L. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in European Conference on Computer Vision, 11211 (2018), 833–851. https://doi.org/10.1007/978-3-030-01234-2_49 |
[19] | D. Bo, W. Wang, D. Fan, J. Li, H. Fu, L. Shao, Polyp-pvt: Polyp segmentation with pyramidvision transformers, preprint, arXiv: 2108.06932. |
[20] | X. Li, H. Zhao, L. Han, Y. Tong, S. Tan, K. Yang, Gated fully fusion for semantic segmentation, in Proceedings of the AAAI conference on artificial intelligence, 34 (2020), 11418–11425. https://doi.org/10.1609/aaai.v34i07.6805 |
[21] | F. Shen, J. Zhu, X. Zhu, Y. Xie, J. Huang, Exploring spatial significance via hybrid pyramidal graph network for vehicle re-identification, IEEE Trans. Intell. Transp. Syst., 23 (2022), 8793–8804. https://doi.org/10.1109/TITS.2021.3086142 |
[22] | F. Shen, J. Zhu, X. Zhu, J. Huang, H. Zeng, Z. Lei, et al., An efficient multiresolution network for vehicle reidentification, IEEE Internet Things J., 9 (2022), 9049–9059. https://doi.org/10.1109/JIOT.2021.3119525 |
[23] | T. Takikawa, D. Acuna, V. Jampani, S. Fidler, Gated-scnn: Gated shape cnns for semantic segmentation, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), (2019), 5228–5237. https://doi.org/10.1109/ICCV.2019.00533 |
[24] | M. Zhen, J. Wang, L. Zhou, S. Li, T. Shen, J. Shang, et al., Joint semantic segmentation and boundary detection using iterative pyramid contexts, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 13663–13672. https://doi.org/10.1109/CVPR42600.2020.01368 |
[25] | A. Lou, S. Guan, M. H. Loew, Caranet: context axial reverse attention network for segmentation of small medical objects, J. Med. Imaging, 10 (2023). https://doi.org/10.1117/1.JMI.10.1.014005 |
[26] | H. Ma, H. Yang, D. Huang, Boundary guided context aggregation for semantic segmentation, preprint, arXiv: 2110.14587. |
[27] | M. Kim, S. Woo, D. Kim, I. S. Kweon, The devil is in the boundary: Exploiting boundary representation for basis-based instance segmentation, in 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), (2021), 928–937. https://doi.org/10.1109/WACV48630.2021.00097 |
[28] | A. Sánchez-González, B. García-Zapirain, D. Sierra-Sosa, A. Elmaghraby, Automatized colon polyp segmentation via contour region analysis, Comput. Biol. Med., 100 (2018), 152–164. https://doi.org/10.1016/j.compbiomed.2018.07.002 |
[29] | P. N. Figueiredo, I. N. Figueiredo, L. Pinto, S. Kumar, Y. R. Tsai, A. V. Mamonov, Polyp detection with computer-aided diagnosis in white light colonoscopy: comparison of three different methods, Endosc. Int. Open, 7 (2019), 209–215. https://doi.org/10.1055/a-0808-4456 |
[30] | M. Li, M. Wei, X. He, F. Shen, Enhancing part features via contrastive attention module for vehicle re-identification, in 2022 IEEE International Conference on Image Processing (ICIP), (2022), 1816–1820. https://doi.org/10.1109/ICIP46576.2022.9897943 |
[31] | F. Shen, X. Peng, L. Wang, X. Hao, M. Shu, Y. Wang, Hsgm: A hierarchical similarity graph module for object re-identification, in 2022 IEEE International Conference on Multimedia and Expo (ICME), (2022), 1–6. https://doi.org/10.1109/ICME52920.2022.9859883 |
[32] | F. Shen, L. Lin, M. Wei, J. Liu, J. Zhu, H. Zeng, et al., A large benchmark for fabric image retrieval, in 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), (2019), 247–251. https://doi.org/10.1109/ICIVC47709.2019.8981065 |
[33] | M. Li, M. Wei, X. He, F. Shen, Enhancing part features via contrastive attention module for vehicle re-identification, in 2022 IEEE International Conference on Image Processing (ICIP), (2022), 1816–1820. https://doi.org/10.1109/ICIP46576.2022.9897943 |
[34] | S. Chen, X. Tan, B. Wang, X. Hu, Reverse attention for salient object detection, in European Conference on Computer Vision, 11213 (2018), 236–252. https://doi.org/10.1007/978-3-030-01240-3_15 |
[35] | H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data-efficient image transformers & distillation through attention, preprint, arXiv: 2012.12877. |
[36] | Z. Pan, B. Zhuang, J. Liu, H. He, J. Cai, Scalable vision transformers with hierarchical pooling, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 367–376. https://doi.org/10.1109/ICCV48922.2021.00043 |
[37] | K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, Y. Wang, Transformer in transformer, preprint, arXiv: 2103.00112. |
[38] | Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, et al., Swin transformer: Hierarchical vision transformer using shifted windows, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 9992–10002. https://doi.org/10.1109/ICCV48922.2021.00986 |
[39] | W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, et al., Pvt v2: Improved baselines with pyramid vision transformer, Comput. Visual Media, 8 (2022), 415–424. https://doi.org/10.1007/s41095-022-0274-8 |
[40] | E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, Segformer: Simple and efficient design for semantic segmentation with transformers, preprint, arXiv: 2105.15203. |
[41] | J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, et al., Transunet: Transformers make strong encoders for medical image segmentation, arXiv: 2102.04306. |
[42] | Y. Zhang, H. Liu, Q. Hu, Transfuse: Fusing transformers and cnns for medical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, 12901 (2021), 14–24. https://doi.org/10.1007/978-3-030-87193-2_2 |
[43] | J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, et al., Attention gated networks: Learning to leverage salient regions in medical images, Med. Image Anal., 53 (2019), 197–207. https://doi.org/10.1016/j.media.2019.01.012 |
[44] | Y. Lu, Y. Chen, D. Zhao, J. Chen, Graph-fcn for image semantic segmentation, in International Symposium on Neural Networks, 11554 (2019), 97–105. https://doi.org/10.1007/978-3-030-22796-8_11 |
[45] | M. M. Rahman, R. Marculescu, Medical image segmentation via cascaded attention decoding, in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), (2023), 6211–6220. https://doi.org/10.1109/WACV56688.2023.00616 |
[46] | G. Bertasius, J. Shi, L. Torresani, Semantic segmentation with boundary neural fields, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 3602–3610. https://doi.org/10.1109/CVPR.2016.392 |
[47] | L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE Trans. Pattern Anal. Mach. Intell., 40 (2018), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184 |
[48] | Y. Fang, C. Chen, Y. Yuan, K. Tong, Selective feature aggregation network with area-boundary constraints for polyp segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, 11764 (2019), 302–310. https://doi.org/10.1007/978-3-030-32239-7_34 |
[49] | S. Chen, X. Tan, B. Wang, H. Lu, X. Hu, Y. Fu, Reverse attention-based residual network for salient object detection, IEEE Trans. Image Process., 29 (2020), 3763–3776. https://doi.org/10.1109/TIP.2020.2965989 |
[50] | H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, Y. Yan, Blendmask: Top-down meets bottom-up for instance segmentation, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 8573–8581. https://doi.org/10.1109/CVPR42600.2020.00860 |
[51] | A. Lou, M. Loew, Cfpnet: channel-wise feature pyramid for real-time semantic segmentation, in 2021 IEEE International Conference on Image Processing (ICIP), (2021), 1894–1898. https://doi.org/10.1109/ICIP42928.2021.9506485 |
[52] | S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, A. Veit, Understanding robustness of transformers for image classification, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 10211–10221. https://doi.org/10.1109/ICCV48922.2021.01007 |
[53] | W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, et al., Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 548–558. https://doi.org/10.1109/ICCV48922.2021.00061 |
[54] | K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, preprint, arXiv: 1409.1556. |
[55] | K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 770–778. https://doi.org/10.1109/CVPR.2016.90 |
[56] | J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, M. Cheng, Egnet: Edge guidance network for salient object detection, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), (2019), 8778–8787. https://doi.org/10.1109/ICCV.2019.00887 |
[57] | Z. Zhang, H. Fu, H. Dai, J. Shen, Y. Pang, L. Shao, Et-net: A generic edge-attention guidance network for medical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, (2019), 442–450. https://doi.org/10.1007/978-3-030-32239-7_49 |
[58] | Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, K. Barnard, Attentional feature fusion, in 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), (2021), 3559–3568. https://doi.org/10.1109/WACV48630.2021.00360 |
[59] | Q. Zhang, Y. Yang, Sa-net: Shuffle attention for deep convolutional neural networks, in ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2021), 2235–2239. https://doi.org/10.1109/ICASSP39728.2021.9414568 |
[60] | B. Dong, M. Zhuge, Y. Wang, H. Bi, G. Chen, Accurate camouflaged object detection via mixture convolution and interactive fusion, preprint, arXiv: 2101.05687. |
[61] | D. Vázquez, J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, A. M. López, A. Romero, A benchmark for endoluminal scene segmentation of colonoscopy images, J. Healthcare Eng., 2017 (2017), 4037190. https://doi.org/10.1155/2017/4037190 |
[62] | J. Silva, A. Histace, O. Romain, X. Dray, B. Granado, Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer, Int. J. Comput. Assisted Radiol. Surg., 9 (2014), 283–293. https://doi.org/10.1007/s11548-013-0926-3 |
[63] | J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, F. Vilariño, Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians, Comput. Med. Imaging Graphics, 43 (2015), 99–111. https://doi.org/10.1016/j.compmedimag.2015.02.007 |
[64] | N. Tajbakhsh, S. R. Gurudu, J. Liang, Automated polyp detection in colonoscopy videos using shape and context information, IEEE Trans. Med. Imaging, 35 (2016), 630–644. https://doi.org/10.1109/TMI.2015.2487997 |
[65] | D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, et al., Kvasir-seg: A segmented polyp dataset, in International Conference on Multimedia Modeling, 11962 (2020), 451–462. https://doi.org/10.1007/978-3-030-37734-2_37 |
[66] | C. Huang, H. Wu, Y. Lin, Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps, preprint, arXiv: 2101.07172. |
[67] | F. Shen, X. He, M. Wei, Y. Xie, A competitive method to vipriors object detection challenge, preprint, arXiv: 2104.09059. |
[68] | I. Loshchilov, F. Hutter, Decoupled weight decay regularization, preprint, arXiv: 1711.05101. |