Although the field of machine learning in health and well-being has experienced significant growth in recent years, the use of sensitive data in these applications is often restricted. While sports and well-being data may appear less sensitive, they are frequently subject to the same standards as health data. Synthetic data generation has emerged as a potential solution to privacy issues, aiming to replicate the properties of real data without disclosing identifiable information. In this study, we generated synthetic recreational runner data using a combined variational autoencoder and neural network model based on baseline measurements and training responses from a three-month training period. We then evaluated the synthetic data by training predictive models and comparing their performance to models trained on real data, with additional metrics measuring statistical similarity and privacy. While some challenges remain, particularly regarding the modeling of rare cases, handling of missing data, and ensuring privacy, our results demonstrate that synthetic data could be used to train predictive models with performance comparable to that of models trained on real data.
Citation: Joonas Tuomikoski, Ville Vesterinen, Rami Luisto, Ilkka Pölönen, Sami Äyrämö. A variational autoencoder and neural network approach to generating synthetic data in well-being research[J]. Applied Computing and Intelligence, 2025, 5(2): 191-212. doi: 10.3934/aci.2025012
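The evaluation idea described in the abstract, training a predictive model on synthetic data and comparing it against one trained on real data, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy dataset, the linear predictive model, the mean absolute error metric, and above all the generator, which is a multivariate Gaussian fitted to the real training rows rather than the combined variational autoencoder and neural network model used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "real" data (hypothetical): 3 baseline features, 1 response.
n, d = 400, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)
real = np.column_stack([X, y])
real_train, real_test = real[:300], real[300:]

# Stand-in generator (NOT the paper's VAE): a multivariate Gaussian
# fitted jointly to the features and response of the real training rows.
mean = real_train.mean(axis=0)
cov = np.cov(real_train, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=len(real_train))


def fit_linear(data):
    """Least-squares fit of the last column on the other columns plus an intercept."""
    A = np.column_stack([np.ones(len(data)), data[:, :-1]])
    coef, *_ = np.linalg.lstsq(A, data[:, -1], rcond=None)
    return coef


def mae(coef, data):
    """Mean absolute error of a fitted linear model on held-out rows."""
    A = np.column_stack([np.ones(len(data)), data[:, :-1]])
    return float(np.mean(np.abs(A @ coef - data[:, -1])))


# Both models are scored on the same held-out REAL rows.
mae_real = mae(fit_linear(real_train), real_test)
mae_synth = mae(fit_linear(synthetic), real_test)
print(f"trained on real:      MAE = {mae_real:.3f}")
print(f"trained on synthetic: MAE = {mae_synth:.3f}")
```

Scoring both models on the same held-out real rows keeps the comparison fair: a small gap between the two errors suggests the synthetic data preserved the feature-response relationship, although it says nothing about privacy, which needs separate metrics.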