IJAST

AI-Augmented Data Engineering: Paradigms, Patterns, and Future Directions

© 2026 by IJAST

Volume 4 Issue 1

Year of Publication : 2026

Author : Amol Bhatnagar

DOI : 10.56472/25839233/IJAST-V4I1P104

Citation :

Amol Bhatnagar, 2025. "AI-Augmented Data Engineering: Paradigms, Patterns, and Future Directions" ESP International Journal of Advancements in Science & Technology (ESP-IJAST) Volume 4, Issue 1: 27-42.

Abstract :

The rapid development of Artificial Intelligence (AI), especially Large Language Models (LLMs), has prompted a substantial change in the way data engineering is practised. This paper provides an in-depth overview of AI-powered data engineering and discusses how modern AI techniques are reshaping the conventional development, profiling, optimisation, and maintenance of data pipelines. We study recent paradigms in pipeline automation with LLMs and the growing use of machine learning to optimise database query execution, and we consider the risks and governance questions that arise when AI-driven data flows are allowed. Through a systematic analysis of state-of-the-art solutions, we distil common patterns: prompt-based pipeline construction, smart schema generation, automatic code generation, and autotuning database architectures. We find that this burgeoning area offers considerable opportunities alongside significant challenges, including hallucination, security attacks, the propagation of bias, and the need for a governance framework. We propose a layered governance model that combines technical and organisational controls with ongoing oversight. Finally, we discuss directions for future research, including dedicated data engineering LLMs, more powerful human-AI collaboration paradigms, and the evolution towards autonomous, self-healing data infrastructure. This paper contributes to the emerging and increasingly important field of AI-enhanced software engineering by applying that perspective to data engineering.

Keywords :

AI-Augmented Data Engineering, Large Language Models, Automated Pipeline Generation, Data Workflow Optimization, AI Governance, MLOps, DataOps.