opencompass ★ PINNED
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Python ★ 7.1k 1d ago
VLMEvalKit ★ PINNED
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Python ★ 4.2k 2d ago
MMBench ★ PINNED
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
★ 305 1y ago
CompassVerifier ★ PINNED
[EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Jupyter Notebook ★ 69 10mo ago
CompassJudger ★ PINNED
The all-in-one judge models introduced by OpenCompass
★ 119 11mo ago
MMBench-GUI ★ PINNED
Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, including Windows, Linux, macOS, iOS, Android, and Web.
Python ★ 113 9mo ago
MixtralKit
A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
Python ★ 773 2y ago
LawBench
Benchmarking Legal Knowledge of Large Language Models
Python ★ 436 2y ago
T-Eval
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
Python ★ 310 2y ago
BotChat
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
Jupyter Notebook ★ 163 1y ago
GTA
[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2
Python ★ 146 2mo ago
DevEval
A Comprehensive Benchmark for Software Development.
Python ★ 131 2y ago
GAOKAO-Eval
No description.
Jupyter Notebook ★ 122 8mo ago
MathBench
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
★ 115 1y ago
OpenFinData
No description.
★ 92 2y ago
ANAH
[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
Python ★ 65 1y ago
Ada-LEval
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
Python ★ 56 1y ago
CriticEval
[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
Python ★ 49 1y ago
GenEditEvalKit
The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.
Jupyter Notebook ★ 46 2mo ago
GPassK
[ACL 2025] Are Your LLMs Capable of Stable Reasoning?
Python ★ 33 10mo ago
ProSA
[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Python ★ 29 1y ago
code-evaluator
A multi-language code evaluation tool.
Python ★ 28 2y ago
Creation-MMBench
Assessing Context-Aware Creative Intelligence in MLLMs
JavaScript ★ 23 11mo ago
TextEdit
We provide TextEdit, a high-quality, multi-scenario text editing benchmark for generation models.
Python ★ 20 3mo ago
CNFinBench
CNFinBench is the first comprehensive benchmark for high-stakes financial scenarios. It spans 29 subtasks grounded in authoritative financial corpora and real business contexts, reconstructing end-to-end agent execution chains from requirement parsing through path planning and tool invocation to result verification.
Python ★ 16 18d ago
CIBench
Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
Python ★ 15 1y ago
RePro
[ICLR 2026] Rectifying LLM Thought From Lens of Optimization
Python ★ 15 6mo ago
SAGA
The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."
★ 11 7mo ago
CompassBench
Demo data of CompassBench
★ 10 1y ago
InteractScience
No description.
JavaScript ★ 8 7mo ago
RaML
[Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
Jupyter Notebook ★ 8 1y ago
ATLAS
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
★ 7 7mo ago
CodeBench
No description.
★ 3 2y ago
OASIS
No description.
Python ★ 3 7mo ago
ReasonZoo
No description.
Python ★ 3 10mo ago
human-eval ⑂
Code for the paper "Evaluating Large Language Models Trained on Code"
Python ★ 3 2y ago
lagent-cibench
No description.
Python ★ 2 1y ago
SWE-bench-server
No description.
Python ★ 1 2mo ago
pytorch_sphinx_theme ⑂
Sphinx Theme for OpenCompass - Modified from PyTorch
CSS ★ 1 2y ago
evalplus ⑂
EvalPlus for rigorous evaluation of LLM-synthesized code
Python ★ 1 2y ago
PowerBench
No description.
★ 0 18d ago
SearchAgentService
No description.
Python ★ 0 1mo ago
Terminal-Bench-server
No description.
Shell ★ 0 2mo ago
pinchbench_server
No description.
Python ★ 0 2mo ago
MiroFlow ⑂
MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowseComp and xBench.
Python ★ 0 5mo ago
CognitiveKernel-Pro ⑂
Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414
Python ★ 0 8mo ago
.github
No description.
★ 0 9mo ago
oc_doc_website
No description.
★ 0 1y ago
hinode ⑂
A clean documentation and blog theme for your Hugo site based on Bootstrap 5
★ 0 1y ago
storage
No description.
★ 0 1y ago