👁️ Om AI Lab
Open Multimodal AGI Research


*Pioneering the next generation of multimodal AI models for Spatial Intelligence and Embodied AI.*
---
🌌 About Us
At Om AI Lab, we believe the future of AI extends far beyond pure text. We are dedicated to building the "brains" for next-generation systems by focusing on the intersection of Spatial Intelligence, Visual Reasoning, and Embodied Agents.
Our research spans open-vocabulary perception, reinforced vision-language models, and real-time inference. We aim to bridge the critical gap between high-level logical reasoning and fine-grained visual action, building models that don't just "see" the world but understand and interact with it.
---
🚀 Core Research Tracks
🧠 Reinforced & Advanced VLMs
*Models that think, reason, and understand the visual world at a granular level.*
- 🌟 VLM-R1: Solving Visual Understanding with Reinforced VLMs. *(Highly active)*
- 🔍 VLM-FO1: Bridging the gap between high-level reasoning and fine-grained perception in Vision-Language Models.
- 🔎 ZoomEye: Enhancing Multimodal LLMs with human-like zooming capabilities through tree-based image exploration.
👁️ Real-Time Perception & Open-World Detection
*Foundational spatial understanding optimized for edge and on-premise speeds.*
- ⚡ OmDet: Real-time, highly accurate, open-vocabulary end-to-end object detection.
- 📐 GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training.
- 🌍 ImageRAG: Enhancing ultrahigh-resolution remote sensing imagery analysis.
🤖 Multimodal Agents & Embodied AI
*Action-oriented intelligence for physical and virtual environments.*
- 🛠️ OmAgent: A comprehensive framework to build multimodal language agents for fast prototyping and production.
- 🎯 OpenTrackVLA: Open and reproducible research for tracking Vision-Language-Action (VLA) models.
📊 Benchmarks & Evaluation
*Rigorous standards for the open-source multimodal community.*
- 📏 OVDEval: A comprehensive evaluation benchmark for Open-Vocabulary Detection.
- 📝 VL-CheckList: Evaluating Vision & Language Pretraining Models with Objects, Attributes, and Relations.
Building the foundational brains for the physical world.
Join us in exploring the spatial frontier.
- VLM-R1 ★ PINNED
  Solve Visual Understanding with Reinforced VLMs
  Python ★ 6.0k 3mo ago
- OmDet ★ PINNED
  Real-time and accurate open-vocabulary end-to-end object detection
  Python ★ 1.4k 3mo ago
- OmAgent ★ PINNED
  [EMNLP-2024] Build multimodal language agents for fast prototype and production
  Python ★ 2.7k 1y ago
- VLM-FO1 ★ PINNED
  VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
  Python ★ 321 6d ago
- OmTrackVLA ★ PINNED
  Open & Reproducible Research for Tracking VLAs
  Python ★ 220 6d ago
- ZoomEye ★ PINNED
  [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
  Python ★ 91 7mo ago
- RS5M
  RS5M: a large-scale vision language dataset for remote sensing [TGRS]
  Python ★ 313 1y ago
- awesome-RSVLM
  Collection of Remote Sensing Vision-Language Models
  ★ 143 2y ago
- VL-CheckList
  Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations. [EMNLP 2022]
  Python ★ 139 2mo ago
- GroundVLP
  GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024)
  Jupyter Notebook ★ 74 2mo ago
- OVDEval
  A Comprehensive Evaluation Benchmark for Open-Vocabulary Detection (AAAI 2024)
  Python ★ 64 2mo ago
- OmModel
  A collection of strong multimodal models for building multimodal AGI agents
  ★ 45 1y ago
- open-agent-leaderboard
  Reproducible Language Agent Research
  Python ★ 36 1y ago
- ImageRAG
  Enhancing Ultrahigh Resolution Remote Sensing Imagery Analysis With ImageRAG [GRSM]
  Jupyter Notebook ★ 32 1mo ago
- OmChat
  A suite of multimodal language models that are powerful and efficient
  Python ★ 19 1y ago
- Probing-VLM-VGM
  Probing VLM vs VGM for spatial understanding.
  Python ★ 11 23d ago
- vlm-r1seg
  No description.
  Python ★ 5 1y ago
- OmAgentDocs
  No description.
  HTML ★ 4 1y ago
- habitat-lab ⑂
  A modular high-level library to train embodied AI agents across a variety of tasks, environments, and simulators.
  ★ 2 4y ago
- bottom-up-attention.pytorch ⑂
  A PyTorch reimplementation of bottom-up-attention models
  ★ 1 5y ago
- VLM-R1.github.io
  Blog Site for VLM-R1
  HTML ★ 1 1y ago
- .github
  No description.
  ★ 0 1mo ago
- om-ai-lab.github.io
  Official website for the org
  HTML ★ 0 10mo ago