OpenGVLab ORG

General Vision Team of Shanghai AI Laboratory

94 repos
3.4k followers
0 following

Python 94%
Jupyter Notebook 6%

Members

shepnerd
orashi
czczup
JustinYuu
ZhenhangHuang
JerryFlymi
yinanhe
wzk1015
xh9998

All public repos (94)


InternVL ★ PINNED

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.

Python ★ 10k 9mo ago
InternVideo ★ PINNED

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

Python ★ 2.3k 14d ago
Ask-Anything ★ PINNED

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.

Python ★ 3.3k 1y ago
VideoMamba ★ PINNED

[ECCV2024] VideoMamba: State Space Model for Efficient Video Understanding

Python ★ 1.1k 1y ago
OmniQuant ★ PINNED

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Python ★ 899 7mo ago
LLaMA-Adapter ★ PINNED

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

Python ★ 5.9k 2y ago
DragGAN

Unofficial Implementation of DragGAN - "Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold" (full-featured DragGAN implementation with an online demo and local deployment; code and models are fully open-sourced; supports Windows, macOS, Linux)

Python ★ 5.0k 2y ago
InternGPT

InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

Python ★ 3.2k 1y ago
InternImage

[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Python ★ 2.8k 1y ago
VisionLLM

VisionLLM Series

Python ★ 1.1k 1y ago
SAM-Med2D

Official implementation of SAM-Med2D

Jupyter Notebook ★ 1.1k 2y ago
ScaleCUA

[ICLR 2026 Oral] ScaleCUA provides open-source computer use agents that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).

Python ★ 1.1k 5mo ago
VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Python ★ 799 1y ago
DCNv4

[CVPR 2024] Deformable Convolution v4

Python ★ 740 2y ago
GITM

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

★ 640 3y ago
Multi-Modality-Arena

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

Python ★ 566 2y ago
Vision-RWKV

[ICLR 2025 Spotlight] Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Python ★ 555 1y ago
VideoChat-Flash

[ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Python ★ 526 7mo ago
all-seeing

[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World

Python ★ 509 1y ago
OmniCorpus

[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Python ★ 424 1y ago
CaFo

[CVPR 2023] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

Python ★ 379 3y ago
Instruct2Act

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

Python ★ 375 2y ago
PonderV2

[T-PAMI 2025] PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm

Python ★ 374 8mo ago
UniFormerV2

[ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Python ★ 350 2y ago
unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Python ★ 349 2y ago
EfficientQAT

[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Python ★ 342 2mo ago
LAMM

[NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents

Python ★ 317 2y ago
video-mamba-suite

A suite for modeling video with Mamba

Python ★ 295 2y ago
InternVL-U

InternVL-U is a 4B-parameter unified multimodal model (UMM) that brings multimodal understanding, reasoning, image generation, image editing into a single framework.

Python ★ 291 3mo ago
VideoChat-R1

[NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning

Python ★ 267 8mo ago
MM-Interleaved

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

Python ★ 253 2y ago
HumanBench

Official implementation of HumanBench (CVPR 2023)

Python ★ 248 1y ago
Diffree

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Python ★ 239 1y ago
ControlLLM

ControlLLM: Augment Language Models with Tools by Searching on Graphs

Python ★ 197 1y ago
gv-benchmark

General Vision Benchmark, GV-B, a project from OpenGVLab

Python ★ 188 4y ago
efficient-video-recognition

No description.

Python ★ 184 3y ago
DriveMLM

No description.

★ 184 2y ago
PhyGenBench

[ICML2025] Code and data for the paper "Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation"

Python ★ 161 1y ago
UniHCP

Official PyTorch implementation of UniHCP

Python ★ 160 3y ago
GUI-Odyssey

[ICCV 2025] GUIOdyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUIOdyssey consists of 8,834 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 212 apps, and 1.4K app combos.

Python ★ 159 5mo ago
Hulk

An official implementation of "Hulk: A Universal Knowledge Translator for Human-Centric Tasks"

Python ★ 146 1y ago
EgoVideo

[CVPR 2024 Champions][ICLR 2025] Solutions for EgoVis Challenges in CVPR 2024

Jupyter Notebook ★ 136 1y ago
ChartAst

[ACL 2024] ChartAssistant is a chart-based vision-language model for universal chart comprehension and reasoning.

Python ★ 135 1y ago
MM-NIAH

[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.

Python ★ 126 1y ago
ZeroGUI

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Python ★ 120 11mo ago
MMT-Bench

[ICML 2024] | MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Python ★ 117 2mo ago
PIIP

[NeurIPS 2024 Spotlight ⭐️ & TPAMI 2025] Parameter-Inverted Image Pyramid Networks (PIIP)

Python ★ 113 10mo ago
Mono-InternVL

[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Python ★ 109 11mo ago
InternVL-MMDetSeg

Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed

Jupyter Notebook ★ 108 1y ago
DiffRate

[ICCV 23] An approach to improve the efficiency of Vision Transformers (ViT) by applying token pruning and token merging together with a differentiable compression rate.

Jupyter Notebook ★ 103 2y ago
SDLM

Sequential Diffusion Language Model (SDLM) enhances pre-trained autoregressive language models by adaptively determining generation length and maintaining KV-cache compatibility, achieving high efficiency and throughput.

Python ★ 98 6mo ago
MMIU

[ICLR2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Python ★ 98 1y ago
NaViL

No description.

Python ★ 93 8mo ago
vinci

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Python ★ 92 7mo ago
M3I-Pretraining

[CVPR 2023] Implementation of "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information".

★ 91 3y ago
VeBrain

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

★ 86 1y ago
MUTR

[AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Python ★ 85 1y ago
EgoExoLearn

[CVPR 2024] Data and benchmark code for the EgoExoLearn dataset

Python ★ 85 10mo ago
Awesome-DragGAN

Awesome-DragGAN: A curated list of papers, tutorials, repositories related to DragGAN

★ 82 2y ago
DDPS

Official Implementation of "Denoising Diffusion Semantic Segmentation with Mask Prior Modeling"

Python ★ 76 2y ago
TimeSuite

[ICLR 2025] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Python ★ 74 1y ago
LCL

[NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Python ★ 72 1y ago
STM-Evaluation

No description.

Python ★ 70 16d ago
GenExam

[ICML 2026] GenExam: A Multidisciplinary Text-to-Image Exam

Python ★ 68 29d ago
Awesome-LLM4Tool

A curated list of papers, repositories, tutorials, and anything else related to large language models for tools

★ 68 2y ago
TPO

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Jupyter Notebook ★ 65 11mo ago
LORIS

[ICML2023] Long-Term Rhythmic Video Soundtracker

Python ★ 63 11mo ago
V2PE

[ICCV2025] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Python ★ 60 2mo ago
PVC

[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Python ★ 54 1y ago
MetaCaptioner

No description.

Python ★ 51 4mo ago
Vlaser

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

Python ★ 48 3mo ago
Siamese-Image-Modeling

[CVPR 2023] Implementation of Siamese Image Modeling for Self-Supervised Vision Representation Learning

Python ★ 41 2y ago
FluxViT

Make Your Training Flexible: Towards Deployment-Efficient Video Models

Python ★ 40 1y ago
Docopilot

[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding

Python ★ 37 11mo ago
Multitask-Model-Selector

[NIPS2023] Implementation of "Foundation Model is Efficient Multimodal Multitask Model Selector"

Python ★ 37 2y ago
De-focus-Attention-Networks

Learning 1D Causal Visual Representation with De-focus Attention Networks

Python ★ 35 2y ago
EmbodiedGPT

No description.

★ 34 3y ago
VRBench

[ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Python ★ 28 21d ago
DiffAgent

[CVPR 2024] DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

★ 19 2y ago
Official-ConvMAE-Det

No description.

Python ★ 18 3y ago
LLMPrune-BESA

BESA is a differentiable weight pruning technique for large language models.

Python ★ 17 2y ago
Future-L1

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Python ★ 16 13d ago
InternLMM

No description.

★ 16 3y ago
.github

No description.

★ 15 1y ago
opengvlab.github.io

No description.

★ 15 3y ago
SID-VLN

Official implementation of: Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale

Python ★ 14 6mo ago
perception_test_iccv2023

Champion Solutions repository for Perception Test challenges in ICCV2023 workshop.

Python ★ 14 2y ago
VKnowU

VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

Python ★ 13 4mo ago
MovieMind

No description.

★ 12 3y ago
RIVER

[ICLR 2026] RIVER: A Real-Time Interaction Benchmark for Video LLMs

Python ★ 11 2mo ago
ExpVid

No description.

★ 11 8mo ago
VLMEvalKit_InternVL2_5 ⑂

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks

Python ★ 2 1y ago
OV-OAD ⑂

This repo takes the initial step towards leveraging text learning for online action detection without explicit human supervision.

★ 2 1y ago
GenEditEvalKit ⑂

The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.

★ 0 3mo ago

