
Contextual Object Detection with Multimodal Large Language Models

Article: https://arxiv.org/pdf/2305.18273.pdf

Introduction

Object detection, a core task in computer vision, involves locating and identifying the objects present in a scene, enabling applications such as robotics, autonomous driving, and AR/VR systems. Recently, Multimodal Large Language Models (MLLMs) such as Flamingo, PaLM-E, and OpenAI's GPT-4 have demonstrated remarkable abilities in vision-language tasks like image captioning and question answering. These models support interactive human-AI scenarios, which require modeling contextual information and the relationships among visual objects, human words, phrases, and dialogues. There is therefore a need to enhance MLLMs with the ability to locate, identify, and associate visual objects with language inputs for effective human-AI interaction.

Concepts

Multimodal Large Language Models (MLLMs) combine language comprehension...