IJCSIT

A Survey of Models for Grounded Vision-Language Learning with Multi-Modal Data

© 2025 by IJCSIT

Volume 1 Issue 3

Year of Publication : 2025

Author : Shabina Sayyad


Citation :

Shabina Sayyad, 2025. "A Survey of Models for Grounded Vision-Language Learning with Multi-Modal Data." International Journal of Computer Science & Information Technology, Volume 1, Issue 3: 12-20.

Abstract :

By leveraging large-scale datasets and neural architectures, foundation models have pushed AI to new heights, enabling robust generalization and in-context learning. Multi-modal foundation models (MMFMs) go a step further, integrating text with images and other sensory inputs such as audio and video. This paper offers a broader, more complete view of MMFMs, covering their design principles, training methodologies, and architectural advances. The survey examines several notable models, including GPT-4, Gemini (relational reasoning), Flamingo (multimodal task completion), and Kosmos (planner-executor), and demonstrates the efficacy of joint multimodal learning across different tasks (recursive reasoning, structured prediction), datasets (VQA [26, 27], COCO Captions [25]), and modalities.
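
Although the surveyed models differ in scale and training data, most of them fuse visual features into a language backbone through some form of cross-attention: Flamingo interleaves such layers with a frozen language model, and BLIP-2 bridges frozen encoders with a lightweight query module. The following PyTorch snippet is a minimal, illustrative sketch of that shared fusion pattern, not the implementation of any particular model; the class name, dimensions, and toy tensors are placeholder assumptions.

import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    # Illustrative sketch only: text tokens attend over projected image features
    # via cross-attention, the recurring fusion pattern in the models surveyed here.
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, text_tokens, image_tokens):
        # Queries come from the language stream; keys/values from the vision stream.
        attended, _ = self.cross_attn(text_tokens, image_tokens, image_tokens)
        x = self.norm1(text_tokens + attended)   # residual connection + layer norm
        return self.norm2(x + self.ffn(x))       # position-wise feed-forward

if __name__ == "__main__":
    # Toy tensors standing in for embedded text tokens and vision patch features.
    text_tokens = torch.randn(2, 16, 256)
    image_tokens = torch.randn(2, 49, 256)
    fused = CrossModalFusionBlock()(text_tokens, image_tokens)
    print(fused.shape)  # torch.Size([2, 16, 256])

In practice, blocks like this are stacked, interleaved with (often frozen) language-model layers, and trained on paired image-text data; the sketch shows only the single fusion step on which joint multimodal learning builds.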

References :

[1] Brown, T. et al. (2020). Language Models are Few-Shot Learners. NeurIPS.

[2] Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.

[3] Alayrac, J.B. et al. (2022). Flamingo: A Visual Language Model for Few-Shot Learning. NeurIPS / DeepMind.

[4] Zhai, X. et al. (2022). Scaling Vision Transformers. arXiv preprint arXiv:2106.04560.

[5] Yuan, H. et al. (2023). Kosmos-1: Multimodal Language Model. Microsoft Research.

[6] Chen, M. et al. (2023). Gemini: Google DeepMind’s Multimodal AI. DeepMind Technical Report.

[7] Li, X. et al. (2021). Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. NeurIPS.

[8] Tsimpoukelli, M. et al. (2021). Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS.

[9] OpenAI (2023). GPT-4 Technical Report. OpenAI.

[10] Bommasani, R. et al. (2021). On the Opportunities and Risks of Foundation Models. Stanford CRFM.

[11] Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.

[12] Ramesh, A. et al. (2021). Zero-Shot Text-to-Image Generation. ICML.

[13] Jia, C. et al. (2021). Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, ICML.

[14] Hendricks, L.A. et al. (2016). Deep Compositional Captioning. CVPR.

[15] Wang, A. et al. (2019). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ICLR.

[16] Li, J. et al. (2023). MURAL: Multimodal Representation Learning. arXiv:2302.00010.

[17] Wang, P. et al. (2023). Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv:2306.14824.

[18] Jiao, X. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.

[19] Akbari, H. et al. (2021). VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. NeurIPS.

[20] Yu, J. et al. (2022). Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. arXiv.

[21] Kim, J. et al. (2022). VQA-X: Explainable Visual Question Answering. ECCV.

[22] Huang, Y. et al. (2022). GIT: Generative Image-to-Text Transformer. ECCV.

[23] Li, X. et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv.

[24] Zellers, R. et al. (2021). PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World. ACL.

[25] Lin, T.-Y. et al. (2014). Microsoft COCO: Common Objects in Context. ECCV.

[26] Antol, S. et al. (2015). VQA: Visual Question Answering. ICCV.

[27] Marino, K. et al. (2019). OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. CVPR.

[28] Lu, J. et al. (2019). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS.

[29] Zhu, Y. et al. (2020). ActFormer: Video Transformers for Action Understanding. CVPR.

[30] Chen, Y. et al. (2021). Perceiver: General Perception with Iterative Attention. ICML.

[31] Jaegle, A. et al. (2021). Perceiver IO: A General Architecture for Structured Inputs and Outputs. ICML

[32] Dancette, C. et al. (2023). Real-Time Multimodal Transformers: Towards Real-Time Inference. arXiv:2303.02550.

[33] Ahuja, K. et al. (2023). Multimodal Speech Recognition: Splice Models for Interfacing Speech and Text. ICASSP.

[34] Adiwardana, D. et al. (2020). Towards a Human-like Open-Domain Chatbot (Meena). arXiv:2001.09977.

[35] Bubeck, S. et al. (2023). Sparks of Artificial General Intelligence: Early Experiments with GPT-4. arXiv:2303.12712.

[36] Rozen, S. et al. (2023). Vision-Language Models for Autonomous Agents. arXiv:2306.00989.

[37] Dou, Z.Y. et al. (2022). CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv.

[38] He, K. et al. (2016). Deep Residual Learning for Image Recognition. CVPR.

[39] Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR.

[40] Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.

[41] Xu, K. et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML.

[42] Vondrick, C. et al. (2016). Generating Videos with Scene Dynamics. NeurIPS.

[43] Ramesh, A. et al. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv.

[44] Schick, T., & Schütze, H. (2021). Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference. EACL.

[45] Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.

[46] Thoppilan, R. et al. (2022). LaMDA: Language Models for Dialog Applications. arXiv:2201.08239.

[47] Gafni, E. et al. (2022). Make-A-Video: Text-to-Video Generation. Meta AI.

[48] Sunkara, S. et al. (2023). AudioGPT: Learning to Generate Speech from Text. arXiv.

[49] Hendrycks, D. et al. (2021). Measuring Massive Multitask Language Understanding. ICLR.

[50] Chiang, P.E. et al. (2023). Instruction-Tuned Multimodal Language Models. arXiv.

Keywords :

Pre-Trained Models, Multimodal Learning, Model-Agnostic Tools, Vision-Language Applications, Generative AI Applications.