Deep learning has become a cornerstone of applications such as robotics, augmented reality, and autonomous systems, where accurate 6D pose estimation (determining an object's position and orientation in 3D space) is critical. Despite its success, designing optimal neural architectures and tuning hyperparameters remain computationally expensive and challenging, especially in high-variance, data-intensive domains like 6D pose estimation. To address this, we investigate a Neural Architecture Search (NAS) technique guided by classical Machine Learning-based performance predictors. Because these predictors estimate final model performance from early-stage training data, we demonstrate that this strategy allows efficient exploration of the hyperparameter space, significantly reducing the computational burden of exhaustive search while still achieving improved performance. Building on an existing NAS method, we introduce a novel modification that makes the search process more efficient, accelerating convergence by nearly $71\%$ without sacrificing accuracy. Through extensive experiments on the LineMOD dataset, we demonstrate that our method consistently discovers high-performing configurations of an Augmented Autoencoder for 6D pose estimation, outperforming benchmark models by almost $15\%$ in pose accuracy and $42\%$ in reconstruction loss. These results underscore the potential of predictor-based NAS as a powerful and computationally efficient tool for neural architecture optimization in complex, real-world tasks.
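The predictor-guided search summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `train` function is a toy stand-in that returns a synthetic loss curve rather than training an Augmented Autoencoder, the k-nearest-neighbours regressor is only one possible classical predictor, and every name, hyperparameter range, and budget below (`log_lr`, `latent_dim`, `FULL_EPOCHS`, `EARLY_EPOCHS`, and so on) is an assumption made for illustration.

```python
import math
import random

FULL_EPOCHS = 50   # length of a complete training run (assumed)
EARLY_EPOCHS = 5   # epochs observed before querying the predictor (assumed)


def sample_config(rng):
    """Draw one hypothetical hyperparameter configuration."""
    return {"log_lr": rng.uniform(-5.0, -1.0),
            "latent_dim": rng.choice([32, 64, 128, 256])}


def train(config, epochs, rng):
    """Toy stand-in for training: returns a synthetic validation-loss curve."""
    floor = 0.05 + 0.04 * (config["log_lr"] + 3.0) ** 2 + 4.0 / config["latent_dim"]
    return [floor + math.exp(-0.15 * (e + 1)) + rng.gauss(0.0, 0.005)
            for e in range(epochs)]


def features(config, curve):
    """Predictor input: the configuration plus the early part of the curve."""
    return [config["log_lr"], math.log2(config["latent_dim"])] + curve[:EARLY_EPOCHS]


def knn_predict(train_x, train_y, x, k=3):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def predictor_guided_search(n_seed=20, n_candidates=80, top_k=5, seed=0):
    rng = random.Random(seed)
    epochs_spent = 0

    # Phase 1: fully train a small seed set to build the predictor's data.
    train_x, train_y = [], []
    for _ in range(n_seed):
        config = sample_config(rng)
        curve = train(config, FULL_EPOCHS, rng)
        epochs_spent += FULL_EPOCHS
        train_x.append(features(config, curve))
        train_y.append(curve[-1])  # target: final validation loss

    # Phase 2: train each candidate briefly and predict its final loss.
    scored = []
    for _ in range(n_candidates):
        config = sample_config(rng)
        early_curve = train(config, EARLY_EPOCHS, rng)
        epochs_spent += EARLY_EPOCHS
        predicted = knn_predict(train_x, train_y, features(config, early_curve))
        scored.append((predicted, config))

    # Phase 3: fully train only the candidates with the lowest predicted loss.
    scored.sort(key=lambda item: item[0])
    finalists = []
    for predicted, config in scored[:top_k]:
        curve = train(config, FULL_EPOCHS, rng)
        epochs_spent += FULL_EPOCHS
        finalists.append((curve[-1], config))
    finalists.sort(key=lambda item: item[0])

    best_loss, best_config = finalists[0]
    return {"best_config": best_config,
            "best_loss": best_loss,
            "epochs_spent": epochs_spent,
            "exhaustive_epochs": (n_seed + n_candidates) * FULL_EPOCHS}


if __name__ == "__main__":
    result = predictor_guided_search()
    print(result["best_config"], round(result["best_loss"], 4))
    print(result["epochs_spent"], "epochs vs", result["exhaustive_epochs"], "exhaustive")
```

The epoch counts compare the budget of this toy search with fully training every sampled configuration. They depend entirely on the assumed budgets above and are unrelated to the figures reported in the abstract.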
Citation: Matteo Lombardi, Davide Sapienza, Elena Govi, Giorgia Franchini. Hyperparameter optimization of an augmented autoencoder for 6D object pose estimation via neural architecture search. Mathematics in Engineering, 2026, 8(2): 150-180. doi: 10.3934/mine.2026006