Edit-discrepancy-guided feature transformation in encoder-based GAN inversion for real image attribute editing

Wenbo Yan; Xing Xu; Yinglong Zhang; Xuewen Xia; Yuanxiang Li

doi:10.3934/era.2026216

Electronic Research Archive

2026, Volume 34, Issue 7: 4889-4912. doi: 10.3934/era.2026216


Research article


1. School of Physics and Information Engineering, Minnan Normal University, Zhangzhou 363000, China
2. Center for China-ASEAN Regional Collaborative Development, Minnan Normal University, Zhangzhou 363000, China
3. School of Computer Science, Wuhan University, Wuhan 430072, China
4. Digital Strategy Development Research Institute of Hechi University, Hechi 546399, China

Received: 13 December 2025 Revised: 26 April 2026 Accepted: 20 May 2026 Published: 08 June 2026

Image attribute editing based on generative adversarial networks (GANs) typically begins by mapping real images to the latent space of a pretrained StyleGAN, followed by manipulating the corresponding latent codes. However, low-rate latent codes suffer from an information bottleneck, making it challenging to faithfully reconstruct complex real images. Recent encoder-based methods enhance reconstruction by injecting high-rate features into intermediate generator layers to better preserve fine details, but they often yield misaligned details in the edited images. The primary reason is that these methods still rely on global linear transformations of high-rate features, which overlook the nonlinear and spatially localized nature of real edits. To this end, we build on a high-fidelity encoder-based GAN inversion backbone and introduce an additional adaptive feature editor that is specifically trained to convert high-rate features during editing so that fine details are correctly aligned with the edited image. The backbone refines the feature through a cross-attention mechanism and residual enhancement. Building on this, the feature editor employs a window-based cross-attention mechanism to extract a discrepancy signal between the original and edited generator features, which specifies both where to modify and what content to change. This signal is then fused into the feature through spatially adaptive modulation techniques, enabling region-selective attribute changes while preserving irrelevant details. Experiments on face and car benchmarks demonstrate that our method improves both reconstruction fidelity and editing quality compared to existing GAN inversion methods.
Keywords: image attribute editing, GAN inversion, generative adversarial networks, cross-attention, spatially-adaptive modulation
Citation: Wenbo Yan, Xing Xu, Yinglong Zhang, Xuewen Xia, Yuanxiang Li. Edit-discrepancy-guided feature transformation in encoder-based GAN inversion for real image attribute editing[J]. Electronic Research Archive, 2026, 34(7): 4889-4912. doi: 10.3934/era.2026216
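
The editing mechanism summarized in the abstract (a window-based cross-attention that extracts a discrepancy signal between the original and edited generator features, followed by spatially adaptive modulation of the high-rate feature) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the use of identity query/key projections, the choice of the feature difference as the attention value, the residual form of the modulation, and the random weight matrices `w_gamma` and `w_beta` are all assumptions made for illustration only.

```python
import numpy as np


def window_partition(x, ws):
    """Split an (H, W, C) map into (num_windows, ws*ws, C) non-overlapping windows."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, c)


def window_merge(windows, ws, h, w):
    """Inverse of window_partition: back to an (H, W, C) map."""
    c = windows.shape[-1]
    x = windows.reshape(h // ws, w // ws, ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def discrepancy_signal(g_orig, g_edit, ws):
    """Window-based cross-attention (illustrative assumption, no learned projections).

    Queries come from the edited generator features, keys from the original
    ones, and values are the per-pixel difference, so a window with no edit
    yields an exactly zero signal.
    """
    h, w, c = g_orig.shape
    q = window_partition(g_edit, ws)
    k = window_partition(g_orig, ws)
    v = window_partition(g_edit - g_orig, ws)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))
    return window_merge(attn @ v, ws, h, w)


def modulate(feat, signal, w_gamma, w_beta, eps=1e-5):
    """Spatially adaptive modulation in residual form (an assumed variant).

    The per-pixel scale and shift are linear in the signal, so the feature is
    left untouched wherever the signal is zero.
    """
    mean = feat.mean(axis=(0, 1), keepdims=True)
    std = feat.std(axis=(0, 1), keepdims=True)
    normalized = (feat - mean) / (std + eps)
    gamma = signal @ w_gamma  # (H, W, C) per-pixel scale
    beta = signal @ w_beta    # (H, W, C) per-pixel shift
    return feat + gamma * normalized + beta


def edit_features(feat, g_orig, g_edit, w_gamma, w_beta, ws=4):
    """Extract the discrepancy signal, then fuse it into the high-rate feature."""
    return modulate(feat, discrepancy_signal(g_orig, g_edit, ws), w_gamma, w_beta)


# Toy usage: the edit touches only the top-left 4x4 window of an 8x8 map.
rng = np.random.default_rng(0)
g_orig = rng.normal(size=(8, 8, 4))
g_edit = g_orig.copy()
g_edit[:4, :4] += 1.0
feat = rng.normal(size=(8, 8, 4))
w_gamma = rng.normal(size=(4, 4))
w_beta = rng.normal(size=(4, 4))
out = edit_features(feat, g_orig, g_edit, w_gamma, w_beta, ws=4)
```

In this toy setting, the high-rate feature changes only inside the edited window and is returned unchanged elsewhere, which mirrors the region-selective behavior the abstract describes. A trained editor would replace the identity projections and random weights with learned layers.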



© 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)