
Oral health is a pivotal aspect of overall well-being, with dental ailments such as periodontal disease, cavities, and misalignments not only affecting masticatory function and aesthetics but also potentially correlating with systemic maladies like cardiovascular diseases and diabetes [1]. In the field of dental diagnostics, panoramic imaging, also known as orthopantomography, has become increasingly significant [2]. This technology provides a comprehensive view of the mouth, capturing all the teeth and the surrounding bone structure in a single shot. Unlike traditional intraoral radiography, panoramic imaging offers a broad perspective that is essential for a holistic assessment of dental health. It is particularly valuable for identifying problems such as tooth malposition, impacted teeth, and tumor development [3,4,5,6]. Moreover, in orthodontic treatments, tooth extractions, and pre-surgical planning, panoramic images offer clinicians a clear and detailed view, crucial for designing precise orthodontic appliances, assessing surgical risks, and formulating effective treatment plans, thereby significantly enhancing patient care [7,8]. Tooth segmentation not only reduces diagnostic time and enhances diagnostic accuracy but also furnishes vital information for pathological analysis and personalized treatment planning [6]. For instance, accurate tooth segmentation can aid in evaluating the relationship between teeth and alveolar bone, determining the optimal position for dental implants, or assessing the outcomes of orthognathic surgery [9]. However, manual tooth segmentation in panoramic image interpretation, a task performed by radiologists and dental specialists, is time-consuming and costly, underscoring the urgent clinical need for automated segmentation technology to assist medical professionals in efficient and accurate diagnostics.
In recent years, the medical imaging field has witnessed a significant transformation with the rapid development of deep learning [10,11,12]. Unlike traditional methods that rely on manual feature extraction [13], deep learning can identify and categorize the complex and diverse features of both 1D physiological parameters and 2D medical images [15,16]. This capability for automatic feature extraction leads to robust, quantifiable models with strong adaptability and generalizability, significantly aiding doctors in formulating precise and effective treatment plans [17,18,19]. Automatic tooth segmentation technologies [20], leveraging deep learning and computer vision techniques, have the potential to autonomously identify and segment dental structures [21,22].
Current approaches predominantly utilize U-shaped convolutional neural network architectures, with methods like Faster R-CNN [23] and Mask R-CNN [24] being widely applied in tooth segmentation and caries detection [25,26]. However, these are typically only suitable for downsampled Cone Beam Computed Tomography (CBCT) images. MSLPNet [27] employs a multi-scale structure to mitigate boundary prediction issues, subsequently utilizing a location-aware approach to pinpoint each dental pixel in panoramic images. Finally, an aggregation module is incorporated to diminish the semantic discrepancies across multiple branches. Two-stage segmentation methods [28,29] generally locate the approximate position of the teeth in the first phase, followed by precise segmentation in the second. In a similar vein, the model in [30] introduces a coarse-to-fine tooth segmentation strategy, pre-trained on large-scale, weakly supervised datasets to initially locate teeth, and then fine-tuned on smaller, meticulously annotated datasets. Beyond weak supervision, researchers often resort to semi-supervised learning strategies with limited annotated data, such as self-training and pseudo-label generation. A novel semi-supervised 3D dental arch segmentation pipeline is proposed by [31], utilizing k-means for self-supervised learning [32,33] and supervised learning on annotated data. The pipeline in [34] refines nnU-Net [35] architecture, training a preliminary nnU-Net model and then allowing medical professionals to supervise its performance on unannotated datasets, selectively updating the model. Undoubtedly, this semi-supervised approach is cost-intensive. Overall, while these methods have achieved commendable performance, convolution-based approaches are limited by their receptive field for relatively larger input images and rely on prior localization of teeth.
The long-range dependency capabilities of Transformer architectures [36,37,38,39] have inspired new paradigms in image processing. The sequence attention mechanism of vision Transformers aggregates different patches of the same image, allowing each patch to interact with others, a significant advantage over CNNs with their inductive bias priors. Transformers have similarly revolutionized medical imaging [40]. TransUNet [41] introduces a U-Net combined with Transformer architecture for medical segmentation, merging CNN's local focus with the Transformer's global feature extraction capabilities, significantly inspiring the medical segmentation field. BoTNet [42], blending Transformers with convolutions, proposes a lightweight instance segmentation backbone, replacing some of the final convolutional layers of ResNet with Transformers. Building on this, GT U-Net [43] introduces a Fourier loss leveraging dental prior knowledge, effectively segmenting dental roots. However, while these Transformer-based methods excel in capturing global interactions in the encoder, they often do not optimally leverage these encoded features due to limitations in their decoding mechanisms. This leads to certain deficiencies in current deep learning approaches to tooth segmentation, resulting in suboptimal performance.
To this end, we introduce STS-TransUNet, a model that pairs a CNN-Transformer encoder, blending the CNN's shallow local feature extraction with the Transformer's deep global encoding [44], with a customized upsampling module as the decoder, aiming to prioritize key information and filter out redundancy. Specifically, we innovate on the decoder by incorporating channel and spatial attention mechanisms [45], enabling it to focus on pertinent information while disregarding redundant data. Deep supervision provides immediate feedback to each decoder layer, thereby accelerating convergence. By integrating the input and output images, our method enhances the model's ability to directly associate and learn from the initial and desired final states of the images. This strategy overcomes some limitations of traditional segmentation methods, where a disconnect between input and processed images can lead to inefficiencies and inaccuracies. Furthermore, we employ a straightforward self-training semi-supervised strategy, effectively segmenting the MICCAI 2023 public challenge dataset (STS-2D) [46] and achieving a distinguished position in the competition. The primary contributions of this paper are threefold:
1). We propose the STS-TransUNet, a novel single-stage model tailored for precise and automated segmentation in clinical dentistry. This model is specifically devised for panoramic dental imaging and leverages advanced deep learning techniques to accurately identify and outline dental structures.
2). A decoder with spatial and channel attention mechanisms, combined with deep supervision techniques, effectively captures the irregularities in dental information, mitigates gradient vanishing, and accelerates convergence.
3). Extensive experiments conducted on the MICCAI STS-2D dataset demonstrate the exemplary performance of our approach.
Deep learning, particularly CNN-based approaches, has demonstrated exceptional performance across a broad spectrum of practical applications [47,48,49,50,51,52,53], including medical image segmentation. Diverging from traditional approaches in medical image segmentation [21,54,55,56], the advent of the U-Net [10] architecture heralded a new era in this field, significantly enhancing the precision and efficiency of segmentation tasks. Its encoder-decoder structure extracts high-level features from input images and uses them to generate fine segmentation results [35]. In [57], deep learning methods were first introduced into panoramic X-ray tooth segmentation: the backbone was pre-trained with Mask R-CNN on the MSCOCO dataset and then fine-tuned on their own dataset. In [58], the influence of factors such as data augmentation, loss functions, and network ensembles on U-Net-based tooth segmentation was investigated, fully exploiting the performance of the U-Net. TSegNet [59] formulated 3D tooth point cloud segmentation as distance-aware localization of each tooth's center followed by confidence-aware segmentation, accomplished through accurate positioning in the first stage and precise segmentation in the second. All the aforementioned methods employ supervised deep learning. In the realm of semi-supervised learning, MLUA [60] adopted a teacher-student strategy, utilizing a single U-shaped network for both annotated and unannotated data; considering the irregular shape and significant variability of teeth, it introduced multi-level perturbations to train more robust models. Similarly, the model proposed in [61] employed a comparable strategy, focusing on data augmentation in areas of carious lesions and yielding a high-performing caries segmentation model. The proposal in [34] relied on the expertise of medical professionals to select data for semi-supervised segmentation. The success of these methods largely hinges on the profound impact of CNNs in image processing. However, CNNs inherently possess inductive bias limitations, particularly in their local feature extraction. In contrast to the aforementioned methods, our approach integrates the CNN's capability for shallow local feature extraction with global Transformer encoding, thereby achieving comprehensive capture of global dependencies.
CNN-based methods inherently possess inductive biases and struggle to effectively learn global semantic interactions due to the locality of the convolution operation [62]. TransUNet [41] pioneered a new paradigm in medical segmentation by integrating the global encoding capabilities of Transformers with the upsampling features of U-Net. Following this, a multitude of methods based on the TransUNet framework have been custom-tailored and applied to various other domains of medical image segmentation [63,64], demonstrating its versatility and effectiveness. UNETR [65] took this further by transforming volumetric medical images into a sequence prediction problem, marking a significant application of Transformers in 3D medical imaging. Swin-Unet [66] merged the entire topological structure of Unet with the attention mechanisms of Swin Transformer. Its decoder used patch expanding for upsampling and showed remarkable performance on multi-organ CT and ACDC datasets. Similarly, the model in [67] developed a multi-task architecture based on Swin Transformer for segmenting and identifying teeth and dental pulp calcification. The Mask-Transformer-based architecture [68] has demonstrated impressive capabilities in tooth segmentation. It employed a dual-path design combined with a panoramic quality loss function to simplify the training process. While these methods leveraged the global dependency capabilities of Transformer encoders, they often overly focused on global feature extraction by the encoder. Moreover, few studies have explored combining Transformer methods with actual unannotated dental panoramic image data segmentation. Unlike methods based on pure Transformer encoder-decoder architectures, our encoder employs a CNN-Transformer architecture, maximizing the use of U-Net's skip connections. This design choice is informed by the inherent limitation of Transformer architectures in not effectively capturing global dependencies at shallower layers [44,69]. Our decoder focuses on relevant information without the need for prior tooth localization, employing a straightforward self-training method to generate pseudo-labels and iteratively update the model. This approach has demonstrated excellent performance on the MICCAI STS-2D dataset [46].
We utilize a high-quality MICCAI STS-2D dataset [46], including panoramic dental CT images of children aged 3–12 years, obtained from Hangzhou Dental Group, Hangzhou Qiantang Dental Hospital, Electronic Science and Technology University, and Queen Mary University of London. The dataset, serving as the official training set, comprises a total of 5000 images, including 2900 labeled and 2100 unlabeled images. All our experiments utilize this training set as the primary dataset. Our model's results on the official test set are detailed in Section 4.3.
Data split: For fully supervised training, 2500 randomly selected labeled images are used for training and the remaining 400 labeled images are reserved for testing. For semi-supervised training, 2000 randomly selected labeled images plus all 2100 unlabeled images are used for training, and the remaining 900 labeled images are reserved for testing.
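As an illustration, this split can be reproduced with a fixed random seed; the file names below are hypothetical placeholders rather than the actual dataset identifiers.

```python
import random

# Hypothetical image identifiers; in practice these would be file paths in the STS-2D training set.
labeled_ids = [f"labeled_{i:04d}.png" for i in range(2900)]
unlabeled_ids = [f"unlabeled_{i:04d}.png" for i in range(2100)]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(labeled_ids)

# Fully supervised setting: 2500 labeled images for training, the remaining 400 for testing.
full_train, full_test = labeled_ids[:2500], labeled_ids[2500:]

# Semi-supervised setting: 2000 labeled + all 2100 unlabeled images for training, 900 labeled for testing.
semi_train_labeled, semi_test = labeled_ids[:2000], labeled_ids[2000:]
semi_train_unlabeled = unlabeled_ids
```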
Data preprocessing: The original images have a resolution of 640 × 320. To facilitate training, we resize them to 640 × 640, then further downsample them to 320 × 320 and 160 × 160. These smaller sizes are used for deep supervised training. During training, we apply data augmentation strategies, including random flips, rotations, and cropping, to enhance model robustness and performance.
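A minimal preprocessing sketch, assuming torchvision is used; the flip probability and rotation range are illustrative values, and random cropping is omitted for brevity.

```python
import random
import torch
from torchvision.transforms import InterpolationMode
from torchvision.transforms import functional as TF

def preprocess_pair(image: torch.Tensor, mask: torch.Tensor):
    """Resize a 640x320 image/mask pair to 640x640 and apply paired augmentations."""
    image = TF.resize(image, [640, 640])
    mask = TF.resize(mask, [640, 640], interpolation=InterpolationMode.NEAREST)

    # Random horizontal flip, applied identically to image and mask.
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)

    # Random rotation within +/-10 degrees (illustrative range).
    angle = random.uniform(-10.0, 10.0)
    image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)

    # Downsampled masks for deep supervision at the 320 and 160 scales.
    mask_320 = TF.resize(mask, [320, 320], interpolation=InterpolationMode.NEAREST)
    mask_160 = TF.resize(mask, [160, 160], interpolation=InterpolationMode.NEAREST)
    return image, (mask, mask_320, mask_160)
```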
In our approach, we adopt the well-established Unet architecture, which comprises two fundamental components: An encoder and a decoder, as shown in Figure 1. The encoder plays a crucial role in extracting high-level features from the input images, while the decoder is responsible for generating the final segmented results. We represent our model with the following formulas:
$ H = \mathrm{ViT}(\mathrm{LinearProjection}(\mathrm{ResNet50}(X))), $ (3.1)

$ O_1, O_2, O_3 = \mathrm{CNNDecoder}(H), $ (3.2)
where $ H $ denotes the hidden feature obtained from the CNN-Transformer hybrid encoder, and $ O_{1} $, $ O_{2} $, $ O_{3} $ represent the outputs from the last three layers of the CNN decoder, which are used for deep supervised training.
Recognizing the unique strengths of both convolutional neural networks (CNNs) and Transformers, we design a hybrid encoder structure. CNNs excel at capturing position-aware features, while Transformers are proficient at integrating long-range contextual information. By combining these two architectural elements, we harness their complementary advantages. This hybrid encoder structure enhances the model's ability to comprehend the underlying content within the images.
For the decoder, we employ a standalone CNN architecture. This choice aims to facilitate the model's effective learning of spatial and channel-related information. To further enhance performance, we introduce the Convolutional Block Attention Module (CBAM) [45].
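To make Eqs (3.1) and (3.2) concrete, the sketch below shows one possible instantiation of the hybrid encoder and the CNN decoder in PyTorch. The embedding size, number of Transformer layers, ResNet-50 truncation point, and decoder channel widths are assumptions made for illustration; CBAM modules and skip connections are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridEncoder(nn.Module):
    """Sketch of Eq. (3.1): H = ViT(LinearProjection(ResNet50(X)))."""
    def __init__(self, embed_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        backbone = resnet50(weights=None)
        # Convolutional stages up to layer3 (output stride 16, 1024 channels).
        self.cnn = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3,
        )
        self.proj = nn.Conv2d(1024, embed_dim, kernel_size=1)  # linear projection
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):
        f = self.proj(self.cnn(x))             # (B, C, H/16, W/16) local CNN features
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, N, C) patch sequence
        tokens = self.vit(tokens)              # global self-attention over patches
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # hidden feature H

class CNNDecoder(nn.Module):
    """Sketch of Eq. (3.2): upsampling stages whose last three outputs
    are supervised at the 160/320/640 scales."""
    def __init__(self, embed_dim=256, num_classes=1):
        super().__init__()
        def up(cin, cout):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
            )
        self.up1, self.up2 = up(embed_dim, 128), up(128, 64)   # 1/16 -> 1/8 -> 1/4
        self.up3, self.up4 = up(64, 32), up(32, 16)            # 1/4  -> 1/2 -> 1/1
        self.head_160 = nn.Conv2d(64, num_classes, 1)
        self.head_320 = nn.Conv2d(32, num_classes, 1)
        self.head_640 = nn.Conv2d(16, num_classes, 1)

    def forward(self, h):
        d_160 = self.up2(self.up1(h))   # 160x160 feature map
        d_320 = self.up3(d_160)         # 320x320 feature map
        d_640 = self.up4(d_320)         # 640x640 feature map
        return self.head_640(d_640), self.head_320(d_320), self.head_160(d_160)

# Usage: x = torch.randn(1, 3, 640, 640); o640, o320, o160 = CNNDecoder()(HybridEncoder()(x))
```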
CBAM is an attention mechanism for computer vision tasks whose primary objective is to enhance the performance of convolutional neural networks (CNNs). It enables the network to focus on important information across different channels and spatial locations when processing images. CBAM consists of two key components: channel attention and spatial attention. Channel attention helps the model learn which channels are most crucial for the task at hand, while spatial attention helps the model identify the essential regions or positions in an image. This adaptive weighting allows the model to adjust to various images and tasks, and CBAM has demonstrated significant performance improvements in image classification, object detection, and semantic segmentation. Its main advantage lies in automatically learning which features matter most for a given task, thereby enhancing performance and robustness.
$ F' = M_c(F) \otimes F, $ (3.3)

$ F'' = M_s(F') \otimes F', $ (3.4)
where $ \otimes $ denotes element-wise multiplication. During multiplication, the attention values are broadcast accordingly: channel attention values are broadcast along the spatial dimensions, and spatial attention values along the channel dimension. $ F'' $ is the final refined output. Figure 2 depicts the computation of each attention map, and the following formulas describe the details of each attention module:
$ M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = \sigma(W_1(W_0(F^c_{\mathrm{avg}})) + W_1(W_0(F^c_{\mathrm{max}}))), $ (3.5)

$ M_s(F) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])) = \sigma(f^{7 \times 7}([F^s_{\mathrm{avg}}; F^s_{\mathrm{max}}])), $ (3.6)
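For clarity, Eqs (3.3)–(3.6) can be expressed as a compact PyTorch module. This is a generic CBAM sketch following [45]; the reduction ratio and kernel size are the defaults from that work, not necessarily the settings used in our decoder.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (Eq. 3.5) followed by spatial attention (Eq. 3.6)."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Shared MLP W1(W0(.)) applied to both pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),  # W0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),  # W1
        )
        # 7x7 convolution f^{7x7} over the concatenated spatial descriptors.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Channel attention M_c: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))
        f = torch.sigmoid(avg + mx) * f                      # F' = M_c(F) * F   (Eq. 3.3)

        # Spatial attention M_s: sigmoid(f7x7([AvgPool(F'); MaxPool(F')]))
        avg_s = torch.mean(f, dim=1, keepdim=True)
        max_s = torch.amax(f, dim=1, keepdim=True)
        att = torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return att * f                                       # F'' = M_s(F') * F' (Eq. 3.4)
```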
The CBAM module enhances the proposed model's understanding of image content and helps it prioritize informative channels. To expedite convergence during training and improve generalization across image scales, we implement a deep supervision strategy: supervised signals are introduced at the last three decoder layers, each corresponding to a different image scale, enabling the model to adapt to various scales effectively.
We train both the fully supervised and semi-supervised models with a loss that combines Dice loss and IoU loss at each supervision scale. The per-scale loss is defined as follows:
$ \mathrm{loss}(\hat{Y}, Y) = 0.6 \cdot \mathrm{DiceLoss}(\hat{Y}, Y) + 0.4 \cdot \mathrm{IoULoss}(\hat{Y}, Y), $ (3.7)
$ \mathrm{DiceLoss} = 1 - \dfrac{2\,|\hat{Y} \cap Y|}{|\hat{Y}| + |Y|}, $ (3.8)

$ \mathrm{IoULoss} = 1 - \dfrac{|\hat{Y} \cap Y|}{|\hat{Y} \cup Y|}, $ (3.9)
where $ \hat{Y} $ and $ Y $ respectively denote the prediction and the ground truth. We apply deep supervision by computing this loss at three different scales; the total deep supervision loss is given by:
$ \mathrm{DeepLoss} = \mathrm{loss}(\hat{Y}_{640}, Y_{640}) + \mathrm{loss}(\hat{Y}_{320}, Y_{320}) + \mathrm{loss}(\hat{Y}_{160}, Y_{160}), $ (3.10)
where $ \hat{Y}_{640}, \hat{Y}_{320}, \hat{Y}_{160} $ denote the outputs of the last three decoder layers, at resolutions of 640, 320, and 160, respectively, and $ Y_{640}, Y_{320}, Y_{160} $ denote the ground-truth masks resized to the corresponding resolutions. A minimal sketch of this loss is given below; the following section then details the two-stage training process.
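The sketch below implements Eqs (3.7)–(3.10), assuming sigmoid-activated outputs and soft (differentiable) intersections; the smoothing constant is an assumption added for numerical stability.

```python
import torch

def dice_loss(pred, target, eps: float = 1e-6):
    """Soft Dice loss, Eq. (3.8); pred holds probabilities in [0, 1], shape (B, 1, H, W)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def iou_loss(pred, target, eps: float = 1e-6):
    """Soft IoU loss, Eq. (3.9)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def scale_loss(logits, target):
    """Per-scale loss of Eq. (3.7): 0.6 * Dice + 0.4 * IoU."""
    pred = torch.sigmoid(logits)
    return 0.6 * dice_loss(pred, target) + 0.4 * iou_loss(pred, target)

def deep_supervision_loss(outputs, masks):
    """Eq. (3.10): sum of the per-scale losses at the 640, 320, and 160 resolutions.

    `outputs` and `masks` are (o640, o320, o160) and (y640, y320, y160) tuples.
    """
    return sum(scale_loss(o, y) for o, y in zip(outputs, masks))
```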
In this stage, we implement a fully supervised training approach using samples with real labels, adopting a 5-fold cross-validation strategy to enhance model robustness. Instead of treating each fold's output as a separate model, we integrate the models from all five folds into a single ensemble. This ensemble approach capitalizes on the strengths of each fold's training, resulting in a more robust and generalized model that effectively captures the diversity of the training data. The details of the first-stage training are as follows:
Learning rate initialization: We set the initial learning rate to 3e-4 to ensure a stable start to the training process.
Total training epochs: The training process encompasses a total of 200 epochs, providing the model with sufficient time to progressively enhance its performance. However, it is common for the initial few epochs to exhibit some instability.
Warm-up strategy: To mitigate the model's instability at the beginning of training, we implement a warm-up strategy. This involves gradually increasing the learning rate within the first 3 epochs, guiding the model towards a more stable training state.
Cosine curve strategy: Subsequent adjustments to the learning rate follow a cosine curve strategy. This strategy gradually reduces the learning rate, allowing for a more refined adjustment of model parameters until the learning rate decays to 0. This aids the model in better convergence during the later stages of training.
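The warm-up and cosine schedule described above can be sketched with PyTorch's LambdaLR as follows; the choice of Adam as the optimizer is an assumption.

```python
import math
import torch

def make_optimizer_and_scheduler(model, total_epochs=200, warmup_epochs=3, base_lr=3e-4):
    """Linear warm-up over the first 3 epochs, then cosine decay of the learning rate towards 0."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)  # optimizer type assumed

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs                # linear warm-up
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Usage: call scheduler.step() once per epoch, after that epoch's training loop.
```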
Fully supervised training is conducted to establish the foundational performance of the model, enabling it to learn feature extraction from labeled data and perform tasks. This training phase equips the model with a certain degree of predictive capability, laying the groundwork for subsequent semi-supervised learning.
In the second stage, we employ a semi-supervised training approach. Specifically, we adopt the self-training strategy [70] to generate pseudo-labels and drive further model training, harnessing unlabeled data effectively and maximizing the utility of available resources. The self-training workflow is as follows:
Generating pseudo-labels: We initiate this phase by feeding the 2100 unlabeled images into the model obtained from the first, fully supervised training stage. The model produces predicted class (foreground and background) probabilities for each image, and these probabilities are then averaged across the entire set.
Pseudo-Label selection: To identify high-quality data points for training, we select the top 300 images based on the predicted probabilities. These images are paired with the corresponding high-quality pseudo-labels generated by the model.
Training with augmented data: The chosen images, along with their newly created pseudo-labels, are used to augment the training dataset. Training resumes with an initial learning rate of 1e-4 and spans three epochs, helping the model adapt to the augmented dataset.
Iterative refinement: In pursuit of further model improvement, this process is repeated five times. In each iteration, a new model is employed, and the same steps are repeated. This iterative refinement strategy allows the model to learn progressively from the unlabeled data. This semi-supervised training strategy, specifically the self-training method, is valuable for harnessing the potential of unlabeled data, effectively expanding the training dataset, and improving the model's performance. It is a powerful tool for leveraging available resources and enhancing the robustness of the final model.
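The loop can be summarized schematically as below. The helpers `predict_probs` and `finetune`, as well as the confidence score used for ranking, are hypothetical stand-ins for the actual implementation.

```python
import numpy as np

def self_training(model, labeled_set, unlabeled_images, rounds=5, top_k=300):
    """Iteratively pseudo-label the most confident unlabeled images and retrain.

    `model.predict_probs(img)` is a hypothetical helper returning the per-pixel
    foreground probability map; `model.finetune(...)` stands in for three epochs
    of training at learning rate 1e-4.
    """
    train_set = list(labeled_set)
    for _ in range(rounds):
        scored = []
        for img in unlabeled_images:
            probs = model.predict_probs(img)                   # (H, W) foreground probabilities
            confidence = float(np.mean(np.abs(probs - 0.5)))   # assumed ranking criterion
            scored.append((confidence, img, (probs > 0.5).astype(np.uint8)))

        # Keep the top-k most confident predictions as pseudo-labeled pairs.
        scored.sort(key=lambda t: t[0], reverse=True)
        pseudo_pairs = [(img, pseudo_mask) for _, img, pseudo_mask in scored[:top_k]]

        # Augment the training set and fine-tune (lr = 1e-4, three epochs in the paper).
        model.finetune(train_set + pseudo_pairs, lr=1e-4, epochs=3)
    return model
```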
All our experiments are conducted on two 32 GB V100 GPUs. STS-TransUNet trains in 12 hours, and we use PyTorch 1.12 as the experimental framework. Additionally, to ensure reproducibility, we fix the random seed in all experiments.
Quantitative analysis: After fully supervised and semi-supervised training, the results are presented in Table 1. We use Dice, IoU (Intersection over Union), and the Hausdorff distance as evaluation metrics; their formulas are given below:
$ \mathrm{Dice} = \dfrac{2\,|\hat{Y} \cap Y|}{|\hat{Y}| + |Y|}, $ (4.1)

$ \mathrm{IoU} = \dfrac{|\hat{Y} \cap Y|}{|\hat{Y} \cup Y|}, $ (4.2)

$ H(\hat{Y}, Y) = \max\left( \sup_{\hat{y} \in \hat{Y}} \inf_{y \in Y} d(\hat{y}, y),\ \sup_{y \in Y} \inf_{\hat{y} \in \hat{Y}} d(y, \hat{y}) \right), $ (4.3)
where $ \hat{Y} $ and $ Y $ represent the prediction and the ground truth, respectively.
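For reference, Eqs (4.1)–(4.3) can be computed on binary masks as in the sketch below; using SciPy's directed Hausdorff distance on foreground pixel coordinates is an assumption about the evaluation protocol.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (4.1) on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def iou_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (4.2) on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + 1e-8)

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (4.3): symmetric Hausdorff distance between foreground point sets."""
    p = np.argwhere(pred > 0)
    g = np.argwhere(gt > 0)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```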
Table 1. Segmentation performance under fully supervised and semi-supervised training.

| Method | Dice (fully sup.) | IoU (fully sup.) | Hausdorff dist. (fully sup.) | Dice (semi-sup.) | IoU (semi-sup.) | Hausdorff dist. (semi-sup.) |
|---|---|---|---|---|---|---|
| UNet++ [71] | 0.8978 | 0.9560 | 0.0368 | 0.8689 | 0.9427 | 0.0403 |
| UNet 3+ [72] | 0.9070 | 0.9589 | 0.0326 | 0.8739 | 0.9531 | 0.0365 |
| R2AU-Net [73] | 0.9081 | 0.9598 | 0.0309 | 0.8826 | 0.9556 | 0.0321 |
| SegFormer [76] | 0.9182 | 0.9626 | 0.0304 | 0.9087 | 0.9589 | 0.0303 |
| Swin-Unet [66] | 0.9171 | 0.9631 | 0.0303 | 0.9102 | 0.9588 | 0.0301 |
| DAE-Former [75] | 0.9251 | 0.9685 | 0.0306 | 0.9153 | 0.9601 | 0.0286 |
| STS-TransUNet (Ours) | 0.9318 | 0.9691 | 0.0298 | 0.9206 | 0.9723 | 0.0269 |
Under fully supervised training, models such as UNet++ [71], UNet 3+ [72], and R2AU-Net [73], which rely solely on CNNs, exhibit relatively weak perception of global information, resulting in less-than-ideal performance. Among the CNN-based models, R2AU-Net, which incorporates attention mechanisms, performs best. While Transformer blocks are capable of capturing long-range information, they have inherently weaker position awareness and require substantial training data to excel. As a result, the performance of Swin-Unet [66,74], DAE-Former [75], and SegFormer [76] is not on par with our model. In summary, the hybrid combination of CNN and Transformer in our model harnesses the strengths of both and delivers satisfying results. Even in the semi-supervised setting, our model outperforms the other models; while all models experience a decrease in Dice score on the semi-supervised test set due to its larger size, our model retains its superior performance.
Qualitative analysis: Results from different models on four randomly selected samples are presented in Figure 4. Comparing models based solely on CNNs with those incorporating attention mechanisms, the latter achieve clearer results. However, compared with Transformer-based models, segmenting teeth in their entirety remains a challenge for CNNs, supporting the notion that Transformers possess stronger global modeling capabilities. Nevertheless, purely Transformer-based models often struggle with local information compared with CNNs: as seen in Figure 4, DAE-Former, while superior overall to the CNN models, falls slightly short in fine details. Our model outperforms the others in both texture and completeness.
To further validate the effectiveness of the CBAM module and the deep supervision strategy, we conduct extensive ablation experiments, as detailed in Table 2. Models a, b, c, d, and e denote the classical TransUNet, TransUNet+CBAM, TransUNet+CBAM+Concat, TransUNet+Concat+DeepSupervision, and TransUNet+CBAM+Concat+DeepSupervision, respectively. Here "Concat" denotes concatenating the input image with the final output feature.
Table 2. Ablation study on the CBAM module, input concatenation, and deep supervision.

| Model | Dice (fully sup.) | IoU (fully sup.) | Hausdorff dist. (fully sup.) | Dice (semi-sup.) | IoU (semi-sup.) | Hausdorff dist. (semi-sup.) |
|---|---|---|---|---|---|---|
| a | 0.9107 | 0.9559 | 0.0315 | 0.9033 | 0.9589 | 0.0317 |
| b | 0.9189 | 0.9588 | 0.0308 | 0.9106 | 0.9601 | 0.0301 |
| c | 0.9206 | 0.9637 | 0.0306 | 0.9135 | 0.9634 | 0.0298 |
| d | 0.9306 | 0.9657 | 0.0300 | 0.9201 | 0.9698 | 0.0286 |
| e | 0.9318 | 0.9691 | 0.0298 | 0.9206 | 0.9723 | 0.0269 |
Table 3. Results of STS-TransUNet on the official MICCAI STS-2D test set (fully supervised and semi-supervised tracks).

| Model | Dice (fully sup.) | IoU (fully sup.) | Hausdorff dist. (fully sup.) | Dice (semi-sup.) | IoU (semi-sup.) | Hausdorff dist. (semi-sup.) |
|---|---|---|---|---|---|---|
| Ours | 0.9334 | 0.9686 | 0.0299 | 0.9113 | 0.9746 | 0.0265 |
Effectiveness of CBAM: The comparison between models a and b, as well as between models d and e, reveals that CBAM contributes a measurable improvement in the model's capabilities. This stems from CBAM's ability to dynamically adjust the importance of channels and spatial locations in the feature maps produced by the CNN: channel attention highlights informative channels while downplaying less relevant ones, and spatial attention allows the model to focus on significant regions within the image. This adaptive recalibration enhances feature representation, making CBAM effective across diverse computer vision tasks.
Effectiveness of deep supervision: The comparison between models c and e shows that the deep supervision strategy plays an important role in the proposed STS-TransUNet. On the Dice metric, adopting deep supervision yields clear improvements in both fully supervised and semi-supervised training. By introducing supervisory signals at multiple layers, deep supervision enables more effective learning of hierarchical features, which in turn contributes to improved convergence during training and enhances the model's ability to capture intricate patterns in the data.
We participated in the MICCAI 2023 STS-2D Challenge with STS-TransUNet and achieved top-3% rankings in both the fully supervised (first round) and semi-supervised (second round) tracks. The detailed results on the official test set are reported in Table 3.
We have outlined the methodology for both fully and semi-supervised learning with panoramic dental images, covering the dataset, its partitioning, preprocessing, network architecture, training, comparisons, and evaluation metrics.
We harness a high-quality dataset from various institutions and employ general data preprocessing techniques to ensure the performance and robustness of our model. Furthermore, we seamlessly merge fully supervised and semi-supervised learning, effectively harnessing both labeled and unlabeled data.
We employ a U-shape architecture and introduce a hybrid encoder merging CNN and Transformer strengths, enhancing positional awareness and long-range information fusion. Additionally, CBAM is incorporated to improve spatial and channel information management, contributing to exceptional performance. We train the model in two stages: First, with fully supervised training for a robust baseline, and then transition to semi-supervised training. The semi-supervised approach includes a 'self-training' strategy with pseudo-labels, data augmentation, and iterative model optimization, effectively improving performance with limited labeled data. For evaluation, we compare our model with others in the field. The results unequivocally show its superiority across various metrics, excelling in detail representation, tooth segmentation completeness, and global modeling capabilities. This reaffirms the soundness of our model's design.
Our research has limitations, such as the omission of prior clinical dental knowledge in the model construction. We have focused on the model's architectural priors, inadvertently overlooking the integration of valuable clinical insights. In our future work, we plan to adopt a more inclusive approach, incorporating a broader spectrum of clinical priors to infuse the model with greater real-world clinical relevance and accuracy.
In conclusion, our comprehensive methodology, diverse materials, and rigorous evaluation highlight the outstanding performance of our model in dental panoramic image segmentation. The innovative fusion of CNN and Transformer technologies, along with the implementation of semi-supervised training, establishes it as a front-runner in the field. This study not only provides valuable insights into deep learning applications in medical imaging but also underscores the potential of semi-supervised learning with unlabeled data. In the future, we aim to enhance the practical deployment of our model by integrating clinical information, ensuring that it not only excels in theoretical performance but also demonstrates greater real-world clinical efficacy and relevance.
The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported by the Students' Innovation and Entrepreneurship Foundation of USTC (No. XY2023S007).
The authors declare that there are no conflicts of interest.