Automated Document Pipelines: Deploying ML-Powered Workflows for End-to-End Archival and Retrieval

© 2025 by IJACT

Volume 3 Issue 2

Year of Publication : 2025

Author : Tejas Dhanorkar, Shemeer Sulaiman Kunju, Swaminathan Sethuraman

DOI : 10.56472/25838628/IJACT-V3I2P101

Citation :

Tejas Dhanorkar, Shemeer Sulaiman Kunju, Swaminathan Sethuraman, 2025. "Automated Document Pipelines: Deploying ML-Powered Workflows for End-to-End Archival and Retrieval", ESP International Journal of Advancements in Computational Technology (ESP-IJACT), Volume 3, Issue 2: 1-9.

Abstract :

Modern enterprises struggle with inefficient manual document processing, which leads to productivity losses and compliance risks. In this paper, we present an ML-driven pipeline for end-to-end document archival and retrieval. The pipeline combines two groups of components, (ingestion, preprocessing, classification) and (metadata generation, search), drawing on NLP and computer vision. It uses federated learning for privacy, vector embeddings for semantic retrieval, and adaptive learning (domain adaptation). Documents are stored in line with the security policies of their sources, and the pipeline reads from many kinds of sources, including automated emails, APIs, and scanners. These techniques yield substantial gains in processing speed and accuracy. Generative AI capabilities, such as AI summarization and real-time stream processing, are left for future work. This work should be useful to businesses seeking a scalable, intelligent solution for automating document workflows.
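
The stages named in the abstract (ingestion, preprocessing, classification, metadata generation, search) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Document` class, the keyword rules in `RULES`, and the term-count vectors are all hypothetical stand-ins, assumed here only to show how the stages might connect. A production system would presumably replace the keyword rules with a trained classifier and the term counts with learned embeddings.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record produced by the pipeline for one archived document."""
    doc_id: str
    text: str
    label: str = ""
    metadata: dict = field(default_factory=dict)
    vector: Counter = field(default_factory=Counter)


def preprocess(text):
    """Lowercase and tokenize; stands in for OCR cleanup and layout analysis."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical keyword rules; an assumption made for this sketch only.
RULES = {
    "invoice": {"invoice", "amount", "due"},
    "contract": {"agreement", "party", "term"},
}


def classify(tokens):
    """Pick the label whose keyword set overlaps the tokens the most."""
    token_set = set(tokens)
    scores = {label: len(words & token_set) for label, words in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


def embed(tokens):
    """Toy term-count vector; a stand-in for a learned embedding model."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(doc_id, text):
    """Run one document through preprocessing, classification, and indexing."""
    tokens = preprocess(text)
    return Document(
        doc_id=doc_id,
        text=text,
        label=classify(tokens),
        metadata={"n_tokens": len(tokens)},
        vector=embed(tokens),
    )


def search(index, query, top_k=1):
    """Rank archived documents by similarity to the query vector."""
    q = embed(preprocess(query))
    ranked = sorted(index, key=lambda d: cosine(q, d.vector), reverse=True)
    return [d.doc_id for d in ranked[:top_k]]


index = [
    ingest("a", "Invoice 17: amount due 400 USD"),
    ingest("b", "This agreement binds each party for a term of two years"),
]
print(search(index, "amount due on the invoice"))
```

In this sketch the query shares terms only with document `a`, so that document ranks first. Term overlap is not semantic similarity; it is used here only to keep the example self-contained.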

Keywords :

Machine Learning, NLP-Based Information Retrieval, Metadata Indexing, Document Automation, Compliance, Semantic Search.