A Review of AI-Based Synthetic Data Generation Approaches

© 2025 by IJACT

Volume 3 Issue 1

Year of Publication : 2025

Author : Anurag Bhagat

DOI : 10.56472/25838628/IJACT-V3I1P101

Citation :

Anurag Bhagat, 2025. "A Review of AI-Based Synthetic Data Generation Approaches", ESP International Journal of Advancements in Computational Technology (ESP-IJACT), Volume 3, Issue 1: 1-4.

Abstract :

Synthetic data that closely resembles real data, created with AI-based techniques, is becoming increasingly important across the machine learning lifecycle, from training to tuning and testing. It can address several limitations: data that is scarce or unavailable; privacy concerns, as in healthcare scenarios involving personally identifiable information (PII) and protected health information (PHI); and slow model development, since synthetic data gives fast access to data while the real data is still being prepared. This review surveys methodologies and key advancements in synthetic data creation, with examples, and places special focus on Generative AI-based techniques, which have made synthetic data generation accessible to a much wider audience.

Keywords :

Artificial Intelligence, Generative AI, Synthetic Data, AI/ML, GenAI.