Traffic sign recognition is crucial not only for autonomous vehicles and traffic safety research but also for multimedia processing and computer vision tasks. However, traffic sign recognition faces several challenges, such as high intraclass variability, high interclass similarity in visual features, and complex backgrounds. We propose a novel invariant cue-aware feature concentration transformer (TTSNet) to address these challenges effectively. TTSNet aims to learn the invariant, core information of traffic signs. To this end, we introduce three new modules to learn traffic sign features: an attention-based internal scale feature interaction (DLFL) module, a cross-scale cross-space feature modulation (SSFM) module, and an eliminating spatial and information redundancy (ESIR) module. The DLFL module extracts invariant cues from traffic signs through feature selection based on discriminative values. The SSFM-Fusion module aggregates multi-scale information from traffic sign images by concatenating multi-layer features. The ESIR module improves feature representation by eliminating spatial and channel information redundancy. Extensive experiments show that TTSNet achieves state-of-the-art performance on the T100K (89.1%) and CTSDB (89.97%) datasets.
Citation: Yi Deng, Ziyi Wu, Junhai Liu, Hai Liu. TTSNet: Traffic sign recognition via a transformer by learning spectrogram structural features. Mathematical Biosciences and Engineering, 2026, 23(3): 722-752. doi: 10.3934/mbe.2026028
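To make the fusion idea in the abstract more concrete, the sketch below illustrates the general kind of processing described there: per-layer tokens are ranked by a discriminative score, the top tokens are kept, and features from several layers are concatenated and projected into a single embedding. This is a minimal PyTorch illustration under our own assumptions, not the authors' implementation; the class, parameter, and method names (TokenSelectFusion, top_k, fuse, select_tokens) are hypothetical.

```python
import torch
import torch.nn as nn


class TokenSelectFusion(nn.Module):
    """Illustrative sketch (not the paper's code): keep the most discriminative
    tokens from each transformer layer, then concatenate the per-layer features
    and project them into one fused embedding."""

    def __init__(self, dim: int = 768, num_layers: int = 3, top_k: int = 12):
        super().__init__()
        self.top_k = top_k
        # Hypothetical projection that fuses the concatenated multi-layer features.
        self.fuse = nn.Linear(dim * num_layers, dim)

    def select_tokens(self, tokens: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C); scores: (B, N) discriminative values, e.g. CLS-attention weights.
        idx = scores.topk(self.top_k, dim=1).indices              # (B, K) indices of top tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))   # (B, K, C)
        return tokens.gather(1, idx)                              # keep only the top-K tokens

    def forward(self, layer_tokens, layer_scores):
        # layer_tokens: list of (B, N, C) tensors; layer_scores: list of (B, N) tensors.
        selected = [self.select_tokens(t, s).mean(dim=1)          # (B, C) summary per layer
                    for t, s in zip(layer_tokens, layer_scores)]
        fused = torch.cat(selected, dim=-1)                       # (B, C * num_layers)
        return self.fuse(fused)                                   # (B, C) fused representation


if __name__ == "__main__":
    # Toy usage with random features from three layers of a ViT-like backbone.
    B, N, C = 2, 196, 768
    tokens = [torch.randn(B, N, C) for _ in range(3)]
    scores = [torch.rand(B, N) for _ in range(3)]
    out = TokenSelectFusion()(tokens, scores)
    print(out.shape)  # torch.Size([2, 768])
```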