Research article

Human-like arm swing strategies in ES–SAC humanoid gait: Stability and performance on flat vs rough terrain

  • Published: 13 February 2026
  • This study presents a hybrid evolution strategy–soft actor-critic (ES–SAC) controller for a reduced-degree-of-freedom spatial (3D) humanoid gait model with predominantly sagittal-plane motion, with a focus on arm-leg coordination. Policies are trained on flat terrain only and then evaluated on both flat and uneven ground under identical simulation settings. Arm-leg coordination is examined systematically in three modes (counter-phase with the legs [normal], in-phase [anti-normal], and fixed [passive]), and the results are compared with findings from human experiments. Whereas most prior studies evaluate policies primarily via reward curves, this work conducts an in-depth analysis using interpretable metrics aligned with human-like walking: speed-normalized power and torque, lateral and vertical deviations, and moment-balance terms. Simulation outcomes are reported quantitatively through these metrics rather than through reward alone. Across five random seeds, a clear terrain-dependent trade-off appears between the arm-swing strategies: anti-normal attains higher forward speed and lower torque-per-speed, whereas normal provides better lateral tracking and lower power-per-speed on rough ground. Directional trends agree with human experiments (e.g., immobilized or in-phase arms raise metabolic cost), while numerical gaps reflect that the simulator measures mechanical power rather than metabolic energy. Within this framework, the impact of coordinated arm swing on balance and efficiency is quantified with a breadth and clarity uncommon in the literature.
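The abstract above names two technical ingredients that a short sketch can make concrete: the hybrid ES–SAC training loop and the speed-normalized gait metrics. Both sketches below are reconstructions under stated assumptions, written in Python; they do not reproduce the authors' implementation.

First, a minimal sketch of an evolution-strategy outer loop of the kind commonly wrapped around gradient-based inner updates such as SAC. The population size, noise scale, learning rate, and the `evaluate` stub are hypothetical placeholders; a real setup would roll out the humanoid simulation and interleave SAC gradient steps where indicated.

    import numpy as np

    rng = np.random.default_rng(0)

    def evaluate(params):
        # Stand-in for an episode rollout returning total reward;
        # a real setup would run the humanoid gait simulation here.
        return -float(np.sum((params - 1.0) ** 2))

    dim, pop, sigma, lr = 16, 32, 0.1, 0.05
    theta = np.zeros(dim)  # mean policy parameters
    for gen in range(200):
        noise = rng.normal(size=(pop, dim))               # perturbation directions
        returns = np.array([evaluate(theta + sigma * n) for n in noise])
        fitness = (returns - returns.mean()) / (returns.std() + 1e-8)
        theta += lr / (pop * sigma) * noise.T @ fitness   # ES gradient estimate
        # ... interleave SAC gradient updates on the same policy here ...

Second, a sketch of the interpretable metrics named in the abstract (power-per-speed, torque-per-speed, lateral and vertical deviations), computed from a simulated rollout. The formulas here, mean absolute mechanical power and torque divided by mean forward speed plus RMS center-of-mass deviations, are plausible reconstructions rather than the paper's exact definitions, and all argument names are illustrative.

    import numpy as np

    def speed_normalized_metrics(tau, qdot, com_xyz, dt):
        """tau: (T, J) joint torques [N*m]; qdot: (T, J) joint velocities [rad/s];
        com_xyz: (T, 3) center-of-mass positions [m] (x forward, y lateral, z up)."""
        duration = tau.shape[0] * dt
        # Mean forward speed from net x displacement (mechanical, not metabolic).
        v_fwd = max((com_xyz[-1, 0] - com_xyz[0, 0]) / duration, 1e-8)
        mean_power = np.mean(np.sum(np.abs(tau * qdot), axis=1))  # W
        mean_torque = np.mean(np.sum(np.abs(tau), axis=1))        # N*m
        return {
            "power_per_speed": float(mean_power / v_fwd),
            "torque_per_speed": float(mean_torque / v_fwd),
            # RMS drift from the initial straight-ahead line y = y0.
            "lateral_dev_rms": float(np.sqrt(np.mean((com_xyz[:, 1] - com_xyz[0, 1]) ** 2))),
            # RMS deviation of body height about its mean.
            "vertical_dev_rms": float(np.std(com_xyz[:, 2])),
        }

    # Toy usage with synthetic rollout data (illustrative only).
    rng = np.random.default_rng(0)
    T, J = 2000, 10
    com = np.cumsum(rng.normal([0.0012, 0.0, 0.0], 0.002, (T, 3)), axis=0)
    print(speed_normalized_metrics(rng.normal(0, 20, (T, J)),
                                   rng.normal(0, 2, (T, J)), com, dt=0.01))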

    Citation: Mustafa Ayyıldız, Övünç Polat. Human-like arm swing strategies in ES–SAC humanoid gait: Stability and performance on flat vs rough terrain[J]. Electronic Research Archive, 2026, 34(3): 1524-1545. doi: 10.3934/era.2026069



  • © 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
