InternVL ★ PINNED
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Python ★ 10k 9mo ago
InternVideo ★ PINNED
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Python ★ 2.3k 14d ago
Ask-Anything ★ PINNED
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
Python ★ 3.3k 1y ago
VideoMamba ★ PINNED
[ECCV2024] VideoMamba: State Space Model for Efficient Video Understanding
Python ★ 1.1k 1y ago
OmniQuant ★ PINNED
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Python ★ 899 7mo ago
LLaMA-Adapter ★ PINNED
[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
Python ★ 5.9k 2y ago
DragGAN
Unofficial Implementation of DragGAN - "Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold" (full-featured DragGAN implementation with an online demo and local deployment; code and models are fully open source; supports Windows, macOS, and Linux)
Python ★ 5.0k 2y ago
InternGPT
InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
Python ★ 3.2k 1y ago
InternImage
[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
Python ★ 2.8k 1y ago
VisionLLM
VisionLLM Series
Python ★ 1.1k 1y ago
SAM-Med2D
Official implementation of SAM-Med2D
Jupyter Notebook ★ 1.1k 2y ago
ScaleCUA
[ICLR 2026 Oral] ScaleCUA provides open-source computer-use agents that operate across platforms (Windows, macOS, Ubuntu, Android).
Python ★ 1.1k 5mo ago
VideoMAEv2
[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Python ★ 799 1y ago
DCNv4
[CVPR 2024] Deformable Convolution v4
Python ★ 740 2y ago
GITM
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
★ 640 3y ago
Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
Python ★ 566 2y ago
Vision-RWKV
[ICLR 2025 Spotlight] Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Python ★ 555 1y ago
VideoChat-Flash
[ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Python ★ 526 7mo ago
all-seeing
[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World
Python ★ 509 1y ago
OmniCorpus
[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Python ★ 424 1y ago
CaFo
[CVPR 2023] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
Python ★ 379 3y ago
Instruct2Act
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
Python ★ 375 2y ago
PonderV2
[T-PAMI 2025] PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
Python ★ 374 8mo ago
UniFormerV2
[ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Python ★ 350 2y ago
unmasked_teacher
[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Python ★ 349 2y ago
EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Python ★ 342 2mo ago
LAMM
[NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents
Python ★ 317 2y ago
video-mamba-suite
A suite for video modeling with Mamba
Python ★ 295 2y ago
InternVL-U
InternVL-U is a 4B-parameter unified multimodal model (UMM) that brings multimodal understanding, reasoning, image generation, image editing into a single framework.
Python ★ 291 3mo ago
VideoChat-R1
[NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
Python ★ 267 8mo ago
MM-Interleaved
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Python ★ 253 2y ago
HumanBench
Official implementation of HumanBench (CVPR 2023)
Python ★ 248 1y ago
Diffree
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
Python ★ 239 1y ago
ControlLLM
ControlLLM: Augment Language Models with Tools by Searching on Graphs
Python ★ 197 1y ago
gv-benchmark
General Vision Benchmark, GV-B, a project from OpenGVLab
Python ★ 188 4y ago
efficient-video-recognition
No description.
Python ★ 184 3y ago
DriveMLM
No description.
★ 184 2y ago
PhyGenBench
[ICML2025] Code and data for the paper "Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation"
Python ★ 161 1y ago
UniHCP
Official PyTorch implementation of UniHCP
Python ★ 160 3y ago
GUI-Odyssey
[ICCV 2025] GUIOdyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUIOdyssey consists of 8,834 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 212 apps, and 1.4K app combos.
Python ★ 159 5mo ago
Hulk
An official implementation of "Hulk: A Universal Knowledge Translator for Human-Centric Tasks"
Python ★ 146 1y ago
EgoVideo
[CVPR 2024 Champions][ICLR 2025] Solutions for EgoVis Challenges in CVPR 2024
Jupyter Notebook ★ 136 1y ago
ChartAst
[ACL 2024] ChartAssistant is a chart-based vision-language model for universal chart comprehension and reasoning.
Python ★ 135 1y ago
MM-NIAH
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Python ★ 126 1y ago
ZeroGUI
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Python ★ 120 11mo ago
MMT-Bench
[ICML 2024] | MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Python ★ 117 2mo ago
PIIP
[NeurIPS 2024 Spotlight ⭐️ & TPAMI 2025] Parameter-Inverted Image Pyramid Networks (PIIP)
Python ★ 113 10mo ago
Mono-InternVL
[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Python ★ 109 11mo ago
InternVL-MMDetSeg
Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed
Jupyter Notebook ★ 108 1y ago
DiffRate
[ICCV 23] Improves the efficiency of Vision Transformers (ViT) by applying token pruning and token merging together with a differentiable compression rate.
Jupyter Notebook ★ 103 2y ago
SDLM
Sequential Diffusion Language Model (SDLM) enhances pre-trained autoregressive language models by adaptively determining generation length and maintaining KV-cache compatibility, achieving high efficiency and throughput.
Python ★ 98 6mo ago
MMIU
[ICLR2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Python ★ 98 1y ago
NaViL
No description.
Python ★ 93 8mo ago
vinci
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
Python ★ 92 7mo ago
M3I-Pretraining
[CVPR 2023] Implementation of "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information".
★ 91 3y ago
VeBrain
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
★ 86 1y ago
MUTR
[AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
Python ★ 85 1y ago
EgoExoLearn
[CVPR 2024] Data and benchmark code for the EgoExoLearn dataset
Python ★ 85 10mo ago
Awesome-DragGAN
Awesome-DragGAN: A curated list of papers, tutorials, and repositories related to DragGAN
★ 82 2y ago
DDPS
Official Implementation of "Denoising Diffusion Semantic Segmentation with Mask Prior Modeling"
Python ★ 76 2y ago
TimeSuite
[ICLR 2025] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Python ★ 74 1y ago
LCL
[NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Python ★ 72 1y ago
STM-Evaluation
No description.
Python ★ 70 16d ago
GenExam
[ICML 2026] GenExam: A Multidisciplinary Text-to-Image Exam
Python ★ 68 29d ago
Awesome-LLM4Tool
A curated list of papers, repositories, tutorials, and anything else related to large language models for tools
★ 68 2y ago
TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Jupyter Notebook ★ 65 11mo ago
LORIS
[ICML2023] Long-Term Rhythmic Video Soundtracker
Python ★ 63 11mo ago
V2PE
[ICCV2025] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Python ★ 60 2mo ago
PVC
[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Python ★ 54 1y ago
MetaCaptioner
No description.
Python ★ 51 4mo ago
Vlaser
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Python ★ 48 3mo ago
Siamese-Image-Modeling
[CVPR 2023] Implementation of Siamese Image Modeling for Self-Supervised Vision Representation Learning
Python ★ 41 2y ago
FluxViT
Make Your Training Flexible: Towards Deployment-Efficient Video Models
Python ★ 40 1y ago
Docopilot
[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
Python ★ 37 11mo ago
Multitask-Model-Selector
[NIPS2023] Implementation of "Foundation Model is Efficient Multimodal Multitask Model Selector"
Python ★ 37 2y ago
De-focus-Attention-Networks
Learning 1D Causal Visual Representation with De-focus Attention Networks
Python ★ 35 2y ago
EmbodiedGPT
No description.
★ 34 3y ago
VRBench
[ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Python ★ 28 21d ago
DiffAgent
[CVPR 2024] DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
★ 19 2y ago
Official-ConvMAE-Det
No description.
Python ★ 18 3y ago
LLMPrune-BESA
BESA is a differentiable weight pruning technique for large language models.
Python ★ 17 2y ago
Future-L1
Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
Python ★ 16 13d ago
InternLMM
No description.
★ 16 3y ago
.github
No description.
★ 15 1y ago
opengvlab.github.io
No description.
★ 15 3y ago
SID-VLN
Official implementation of: Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale
Python ★ 14 6mo ago
perception_test_iccv2023
Champion solutions for the Perception Test challenges at the ICCV 2023 workshop.
Python ★ 14 2y ago
VKnowU
VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
Python ★ 13 4mo ago
MovieMind
No description.
★ 12 3y ago
RIVER
[ICLR 2026] RIVER: A Real-Time Interaction Benchmark for Video LLMs
Python ★ 11 2mo ago
ExpVid
No description.
★ 11 8mo ago
VLMEvalKit_InternVL2_5 ⑂
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
Python ★ 2 1y ago
OV-OAD ⑂
This repo takes the initial step towards leveraging text learning for online action detection without explicit human supervision.
★ 2 1y ago
GenEditEvalKit ⑂
The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.
★ 0 3mo ago