Research article

Optimizing Tensor-Train Decomposition for efficient edge AI: Accelerated decoding via GEMM and reshape minimization

  • Published: 10 July 2025
  • MSC: 15A69, 65F30, 68T07

  • Tensor-Train Decomposition (TTD) has emerged as a powerful mathematical framework for compressing neural network models in edge-oriented deployments, significantly reducing communication overhead between cloud environments and resource-constrained edge devices. However, its widespread adoption is hindered by the substantial computational overhead of decoding compressed parameters on edge hardware. In this paper, we experimentally demonstrate and mathematically validate that TTD achieves superior compression efficiency and better accuracy retention compared to conventional pruning methods, particularly when fine-tuning is impractical. To overcome the critical decoding bottleneck, we propose a mathematically rigorous yet hardware-aware optimization framework specifically tailored for efficient TTD-based deployments. Our approach leverages existing General Matrix Multiplication (GEMM) accelerators, commonly available in modern edge processors, to substantially accelerate the computationally intensive Einsum operations inherent in TTD decoding. Furthermore, we analytically identify redundant reshape operations between the decoding and inference stages, introducing a novel merging strategy that significantly reduces memory-bound overhead. Evaluations on a Field-Programmable Gate Array (FPGA)-based edge inference processor show substantial improvements, including a $3\times$ speedup in reshape operations and a 69.3% decrease in decoding time. By seamlessly integrating rigorous mathematical formulation, analytical justification, and practical hardware optimization, this work paves the way for the efficient real-world deployment of TTD-compressed models on edge devices.
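    The two optimizations named in the abstract can be made concrete with a short sketch. The NumPy snippet below is a minimal illustration, not the authors' implementation: the core shapes, the toy TT-ranks (1, 3, 3, 1), and the helper name tt_decode_gemm are assumptions chosen for the example. It shows (1) how each Einsum contraction in TT decoding flattens into a single GEMM that an edge matrix accelerator can execute, and (2) how the decoder's final reshape and the inference-side reshape compose into one.

```python
# Minimal sketch of GEMM-based TT decoding with a merged final reshape.
# All shapes, ranks, and names are illustrative assumptions.
import numpy as np

def tt_decode_gemm(cores):
    """Contract TT-cores left to right to rebuild the flat weight vector.

    Core k has shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1. The einsum
    'ap,pnq->anq' at each step is expressed as one GEMM,
    (A, r_{k-1}) @ (r_{k-1}, n_k * r_k), so a GEMM unit can run it.
    """
    head = cores[0]
    result = head.reshape(head.shape[1], head.shape[2])    # (n_1, r_1), r_0 = 1
    for core in cores[1:]:
        r_prev, n_k, r_k = core.shape
        result = result @ core.reshape(r_prev, n_k * r_k)  # single GEMM
        result = result.reshape(-1, r_k)                   # fold n_k into the rows
    return result                                          # (n_1 * ... * n_d, 1)

# Hypothetical layer: 64 weights over modes (4, 4, 4), TT-ranks (1, 3, 3, 1).
rng = np.random.default_rng(0)
cores = [rng.standard_normal(s) for s in [(1, 4, 3), (3, 4, 3), (3, 4, 1)]]

# Reshape merging: rather than reshaping the decoder output to a flat vector
# and then reshaping again to the layer's (out, in) shape for inference, use
# reshape(reshape(x, a), b) == reshape(x, b) for row-major layouts and emit
# one reshape straight to the inference shape.
W = tt_decode_gemm(cores).reshape(8, 8)  # merged: decode output -> layer shape
```

    In NumPy the folded reshape is a zero-copy view, but on an edge processor with explicit buffers each reshape is a memory-bound data-movement pass; collapsing the back-to-back decode- and inference-side reshapes into one is the kind of overhead the paper's merging strategy targets.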

    Citation: Hyunseok Kwak, Sangmin Jeon, Kyeongwon Lee, Woojoo Lee. Optimizing Tensor-Train Decomposition for efficient edge AI: Accelerated decoding via GEMM and reshape minimization. AIMS Mathematics, 2025, 10(7): 15755-15784. doi: 10.3934/math.2025706


  • © 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
