In remote sensing images, ground objects such as buildings and roads exhibit pronounced spatial geometric characteristics, and targets within a scene vary enormously in scale. Both properties challenge semantic segmentation algorithms, which must model spatial structure and process information across scales. Traditional methods lack dedicated mechanisms for modeling spatial geometric features and lose information during multi-scale feature fusion. This paper proposes SC-Net, which addresses these issues through three key innovations. First, we design a feature attention layer in which a spatial attention module captures spatial geometric patterns through directional feature decomposition and a multi-scale attention module preserves feature information at different scales through adaptive pooling strategies. Second, we construct a three-branch fusion transformer that employs cross-window attention and nine-group feature key-value pair interactions to jointly model spatial, multi-scale, and global features. Finally, a multi-branch cascaded decoder improves segmentation boundary accuracy through hierarchical feature fusion. Experiments on three standard remote sensing datasets validate the method: SC-Net achieves 63.04% mean intersection over union (MIOU) on the Wuhan dense labeling dataset (WHDLD), 71.57% on the Potsdam dataset, and 81.57% on the Vaihingen dataset, outperforming state-of-the-art methods such as AerialFormer and SERNet by 0.67–2.12% MIOU. The method performs particularly well in scenes with complex spatial structures and dense multi-scale targets, providing an effective solution for precise remote sensing image interpretation.
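For readers unfamiliar with the MIOU figures reported above, the following is a minimal sketch of how mean intersection over union is commonly computed from a confusion matrix. It is not taken from the paper: the function names `confusion_matrix` and `mean_iou`, the toy labels, and the choice to skip classes that appear in neither the labels nor the predictions are all assumptions made for illustration, and the authors' exact evaluation protocol may differ.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows index the ground-truth class, columns the prediction."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN).

    Assumption: classes absent from both labels and predictions are skipped
    rather than counted as zero.
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                          # class c pixels predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp     # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical toy data: six pixels, three classes (flattened label maps).
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(labels, preds, num_classes=3)
print(f"MIOU = {mean_iou(cm):.4f}")  # prints "MIOU = 0.5000"
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so their mean is 0.5. In practice the confusion matrix is accumulated over every pixel of the test set before the per-class ratios are taken.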
Citation: Fangbin Huang, Yuxuan Guo. Spatial structure-aware and cross-scale feature modeling network for remote sensing image semantic segmentation[J]. Electronic Research Archive, 2025, 33(10): 6391-6417. doi: 10.3934/era.2025282