kvcache.ai ORG

@kvcache-ai · madsys.cs.tsinghua.edu.cn

KVCache.AI is a joint research project between MADSys and top industry collaborators, focusing on efficient LLM serving.

15 repos
1.2k followers
0 following

C++ 25%
Python 25%
Go 25%
JavaScript 25%

All public repos (15)

Mooncake ★ PINNED

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ ★ 5.6k 1h ago
ktransformers ★ PINNED

A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations

Research toolkit for running very large AI language models on modest hardware by offloading expert layers to CPU, letting you run models like DeepSeek-V3 without a fleet of expensive high-end GPUs.

Python ★ 17k 2d ago
TrEnv-X ★ PINNED

No description.

Go ★ 87 9mo ago
vllm ⑂

A high-throughput and memory-efficient inference and serving engine for LLMs

Python ★ 15 4d ago
kvcache-blog

No description.

JavaScript ★ 12 23h ago
sglang ⑂

SGLang is a fast serving framework for large language models and vision language models.

Python ★ 11 4d ago
custom_flashinfer ⑂

FlashInfer: Kernel Library for LLM Serving

Cuda ★ 7 11mo ago
DeepEP_fault_tolerance ⑂

DeepEP: an efficient expert-parallel communication library that supports fault tolerance

Cuda ★ 3 5mo ago
sglang_awq ⑂

SGLang is a fast serving framework for large language models and vision language models.

Python ★ 2 2mo ago
accelerate ⑂

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support

★ 1 1mo ago
Model-Optimizer ⑂

A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, vLLM, etc. to optimize inference speed.

★ 0 3d ago
evalscope ⑂

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

Python ★ 0 2mo ago
transformers ⑂

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

★ 0 1mo ago
gpustack ⑂

GPU cluster manager for optimized AI model deployment

★ 0 6mo ago
sglang-npu ⑂

SGLang is a fast serving framework for large language models and vision language models.

★ 0 10mo ago