Om AI Lab ORG

Open Multimodal AGI Research

👁️ Om AI Lab

Open Multimodal AGI Research

[Website](https://om-ai-lab.github.io)
[Hugging Face](https://huggingface.co/omlab)
[X (Twitter)](https://twitter.com/OmAI_lab)

*Pioneering the next generation of multimodal AI models for Spatial Intelligence and Embodied AI.*

---

🌌 About Us

At Om AI Lab, we believe the future of AI extends far beyond pure text. We are dedicated to building the "brains" for next-generation systems by focusing on the intersection of Spatial Intelligence, Visual Reasoning, and Embodied Agents.

Our research spans open-vocabulary perception, reinforced vision-language models, and real-time inference. We aim to bridge the gap between high-level logical reasoning and fine-grained visual action, building models that don't just "see" the world but understand and interact with it.

---

🚀 Core Research Tracks

🧠 Reinforced & Advanced VLMs

*Models that think, reason, and understand the visual world at a granular level.*

🌟 VLM-R1: Solving Visual Understanding with Reinforced VLMs. *(Highly active)*
🔍 VLM-FO1: Bridging the gap between high-level reasoning and fine-grained perception in Vision-Language Models.
🔎 ZoomEye: Enhancing Multimodal LLMs with human-like zooming capabilities through tree-based image exploration.

👁️ Real-Time Perception & Open-World Detection

*Foundational spatial understanding optimized for fast inference on edge and on-premises hardware.*

⚡ OmDet: Real-time, highly accurate, open-vocabulary end-to-end object detection.
📐 GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training.
🌍 ImageRAG: Enhancing ultra-high-resolution remote sensing imagery analysis.

🤖 Multimodal Agents & Embodied AI

*Action-oriented intelligence for physical and virtual environments.*

🛠️ OmAgent: A comprehensive framework to build multimodal language agents for fast prototyping and production.
🎯 OpenTrackVLA: Open and reproducible research for tracking Vision-Language-Action (VLA) models.

📊 Benchmarks & Evaluation

*Rigorous standards for the open-source multimodal community.*

📏 OVDEval: A comprehensive evaluation benchmark for Open-Vocabulary Detection.
📝 VL-CheckList: Evaluating Vision & Language Pretraining Models with Objects, Attributes, and Relations.

Building the foundational brains for the physical world.
Join us in exploring the spatial frontier.
