Research article

Human-like arm swing strategies in ES–SAC humanoid gait: Stability and performance on flat vs rough terrain

  • Published: 13 February 2026
  • This study presents a hybrid evolution strategy–soft actor-critic (ES–SAC) controller for a reduced-degree-of-freedom spatial (3D) humanoid gait model with predominantly sagittal-plane motion, with a focus on arm-leg coordination. Policies are trained on flat terrain only and then evaluated on both flat and uneven ground under identical simulation settings. Arm-leg coordination is examined systematically in three modes (counter-phase with the legs [normal], in-phase [anti-normal], and fixed [passive]), and the results are compared with findings from human experiments. Whereas most prior studies evaluate policies primarily via reward curves, this work conducts an in-depth analysis using interpretable metrics aligned with human-like walking: speed-normalized power and torque, lateral and vertical deviations, and moment-balance terms. Simulation outcomes are reported quantitatively through these metrics rather than through reward alone. Across five random seeds, a clear terrain-dependent trade-off appears between the arm-swing strategies: anti-normal attains higher forward speed and lower torque-per-speed, whereas normal provides better lateral tracking and lower power-per-speed on rough ground. Directional trends agree with human experiments (e.g., immobilized or in-phase arms raise metabolic cost), while numerical gaps reflect that the simulator measures mechanical power rather than metabolic energy. Within this framework, the impact of coordinated arm swing on balance and efficiency is quantified with a breadth and clarity uncommon in the literature.
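The abstract above names two technical ingredients that a short sketch can make concrete: the hybrid ES–SAC training loop and the speed-normalized gait metrics. Both sketches below are reconstructions under stated assumptions, written in Python; they do not reproduce the authors' implementation.

First, a minimal sketch of an evolution-strategy outer loop of the kind commonly wrapped around gradient-based inner updates such as SAC. The population size, noise scale, learning rate, and the `evaluate` stub are hypothetical placeholders; a real setup would roll out the humanoid simulation and interleave SAC gradient steps where indicated.

    import numpy as np

    rng = np.random.default_rng(0)

    def evaluate(params):
        # Stand-in for an episode rollout returning total reward;
        # a real setup would run the humanoid gait simulation here.
        return -float(np.sum((params - 1.0) ** 2))

    dim, pop, sigma, lr = 16, 32, 0.1, 0.05
    theta = np.zeros(dim)  # mean policy parameters
    for gen in range(200):
        noise = rng.normal(size=(pop, dim))               # perturbation directions
        returns = np.array([evaluate(theta + sigma * n) for n in noise])
        fitness = (returns - returns.mean()) / (returns.std() + 1e-8)
        theta += lr / (pop * sigma) * noise.T @ fitness   # ES gradient estimate
        # ... interleave SAC gradient updates on the same policy here ...

Second, a sketch of the interpretable metrics named in the abstract (power-per-speed, torque-per-speed, lateral and vertical deviations), computed from a simulated rollout. The formulas here, mean absolute mechanical power and torque divided by mean forward speed plus RMS center-of-mass deviations, are plausible reconstructions rather than the paper's exact definitions, and all argument names are illustrative.

    import numpy as np

    def speed_normalized_metrics(tau, qdot, com_xyz, dt):
        """tau: (T, J) joint torques [N*m]; qdot: (T, J) joint velocities [rad/s];
        com_xyz: (T, 3) center-of-mass positions [m] (x forward, y lateral, z up)."""
        duration = tau.shape[0] * dt
        # Mean forward speed from net x displacement (mechanical, not metabolic).
        v_fwd = max((com_xyz[-1, 0] - com_xyz[0, 0]) / duration, 1e-8)
        mean_power = np.mean(np.sum(np.abs(tau * qdot), axis=1))  # W
        mean_torque = np.mean(np.sum(np.abs(tau), axis=1))        # N*m
        return {
            "power_per_speed": float(mean_power / v_fwd),
            "torque_per_speed": float(mean_torque / v_fwd),
            # RMS drift from the initial straight-ahead line y = y0.
            "lateral_dev_rms": float(np.sqrt(np.mean((com_xyz[:, 1] - com_xyz[0, 1]) ** 2))),
            # RMS deviation of body height about its mean.
            "vertical_dev_rms": float(np.std(com_xyz[:, 2])),
        }

    # Toy usage with synthetic rollout data (illustrative only).
    rng = np.random.default_rng(0)
    T, J = 2000, 10
    com = np.cumsum(rng.normal([0.0012, 0.0, 0.0], 0.002, (T, 3)), axis=0)
    print(speed_normalized_metrics(rng.normal(0, 20, (T, J)),
                                   rng.normal(0, 2, (T, J)), com, dt=0.01))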

    Citation: Mustafa Ayyıldız, Övünç Polat. Human-like arm swing strategies in ES–SAC humanoid gait: Stability and performance on flat vs rough terrain[J]. Electronic Research Archive, 2026, 34(3): 1524-1545. doi: 10.3934/era.2026069



  • © 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
