LeetCUDA
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Cuda ★ 11k 2h ago
Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Python ★ 5.3k 2mo ago
lite.ai.toolkit
🛠A lite C++ AI toolkit: 100+ models with MNN, ORT and TRT, including Det, Seg, Stable-Diffusion, Face-Fusion, etc.🎉
C++ ★ 4.4k 3mo ago
Awesome-DiT-Inference
📚A curated list of Awesome Diffusion Inference Papers with Codes: Sampling, Cache, Quantization, Parallelism, etc.🎉
Python ★ 569 7d ago
lihang-notes
📚Notes on Li Hang's "Statistical Learning Methods": a 200-page PDF with detailed explanations of the formulas.🎉
Shell ★ 498 11mo ago
ffpa-attn
🤖FFPA: Extends FlashAttention-2 via Split-D for large headdims, 1.5x~3×↑🎉 vs SDPA, up to 430T🎉 on H200.
Python ★ 310 2d ago
torchlm
💎An easy-to-use PyTorch library for face landmarks detection: training, evaluation, inference, and 100+ data augmentations.🎉
Python ★ 271 11mo ago
HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
Cuda ★ 156 1y ago
RVM-Inference
🔥Robust Video Matting C++ inference toolkit with ONNXRuntime, MNN, NCNN and TNN, via lite.ai.toolkit.
C++ ★ 142 1y ago
yolov5face-toolkit
YOLO5Face 2021 with MNN/NCNN/TNN/ONNXRuntime
C++ ★ 61 3y ago
nanodet-toolkit
NanoDet and NanoDet-Plus with ONNXRuntime/MNN/TNN/NCNN C++.
C++ ★ 30 4y ago
flux-faster
A forked version of flux-fast that makes flux-fast even faster with cache-dit, 3.3x speedup on NVIDIA L20.
Python ★ 24 11mo ago
scrfd-toolkit
Super fast, accurate face detector! SCRFD (CVPR 2021) with MNN/TNN/NCNN/ONNXRuntime C++.
C++ ★ 20 4y ago
qwen-image-fast
⚡️Qwen-Image 4.8x🎉 speedup with Hybrid Acceleration for low VRAM GPUs
Python ★ 17 8mo ago
fsanet-toolkit
FSANet: 1 Mb!! Head Pose Estimation with MNN, TNN and ONNXRuntime C++.
C++ ★ 17 4y ago
longcat-video-fast
🔥LongCat-Video 1.7x🎉 speedup: cache acceleration and 4/8-bits weight only.
Python ★ 14 7mo ago
netron-vscode-extension
☕️ A VS Code extension for Netron, supporting *.pdmodel, *.nb, *.onnx, *.pb, *.h5, *.tflite, *.pth, *.pt, *.mnn, *.param, etc.
TypeScript ★ 14 3y ago
yolox-toolkit
YOLOX with NCNN/MNN/TNN/ONNXRuntime C++.
C++ ★ 13 4y ago
yolop-toolkit
YOLOP with ONNXRuntime C++/MNN/TNN/NCNN
C++ ★ 9 4y ago
mgmatting-toolkit
MGMatting with MNN/TNN/ONNXRuntime C++, GPU/CPU, support dynamic shape.
C++ ★ 8 4y ago
SpargeAttn ⑂
SpargeAttention: A training-free sparse attention that can accelerate any model inference.
Cuda ★ 6 10mo ago
flux ⑂
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
C++ ★ 5 10mo ago
cache-dit ⑂
A PyTorch-native inference engine with cache, parallelism, quantization for Diffusion Transformers.
Python ★ 4 8d ago
flux-fast ⑂
A forked version of flux-fast that makes flux-fast even faster with cache-dit.
Python ★ 4 5mo ago
nunchaku ⑂
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Python ★ 3 2mo ago
ssrnet-toolkit
SSRNet: 190 Kb!! Super fast Age Estimation with MNN/TNN/ONNXRuntime C++.
C++ ★ 3 4y ago
cutlass ⑂
CUDA Templates and Python DSLs for High-Performance Linear Algebra
C++ ★ 2 5d ago
quack ⑂
A Quirky Assortment of CuTe Kernels
Python ★ 2 5d ago
svdquant-kernels ⑂
Cross-architecture CUDA kernels for SVDQuant (W4A4 with low-rank correction)
Python ★ 2 24d ago
sglang ⑂
SGLang is a fast serving framework for large language models and vision language models.
Python ★ 1 2d ago
diffusers ⑂
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch and FLAX.
Python ★ 1 5d ago
cllms-for-copilot ⑂
Pick Qwen, GLM, MiniMax, Xiaomi MiMo, Moonshot Kimi & Tencent Hunyuan models from the Copilot Chat model picker. Vision, thinking, BYOK.
★ 1 8d ago
.github
No description.
★ 1 9d ago
deepcompressor ⑂
Model Compression Toolbox for Large Language Models and Diffusion Models
★ 1 10mo ago
TensorRT-LLM ⑂
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
Python ★ 1 2mo ago
ao ⑂
PyTorch native quantization and sparsity for training and inference
★ 1 3mo ago
vllm-omni ⑂
A framework for efficient model inference with omni-modality models
Python ★ 1 3mo ago
ComfyUI-CacheDiT ⑂
Cache-DiT node for ComfyUI
Python ★ 1 4mo ago
DistVAE ⑂
A parallel VAE that avoids OOM for high-resolution image generation
★ 1 10mo ago
Qwen-Image ⑂
Qwen-Image is a powerful image generation foundation model capable of complex text rendering and precise image editing.
Python ★ 1 5mo ago
Z-Image ⑂
No description.
Python ★ 1 5mo ago
HunyuanImage-3.0 ⑂
HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation
Python ★ 1 8mo ago
HunyuanImage-2.1 ⑂
HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation
Python ★ 1 9mo ago
draft-attention ⑂
Code for Draft Attention
★ 1 1y ago
xlite-cli
The cli version of lite.ai.toolkit
C++ ★ 1 1y ago
deepseek-v4-for-copilot ⑂
Pick DeepSeek V4 from the Copilot Chat model picker — and keep everything else Copilot already gives you.
TypeScript ★ 0 1d ago
GCMP ⑂
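Integrates China's mainstream native LLM providers to give developers a richer choice of AI coding assistants better suited to local needs. Built-in support currently covers native LLM providers including Zhipu AI, MiniMax, MoonshotAI, DeepSeek, Alibaba Cloud Bailian, Kuaishou Wanqing, Volcengine Ark, Tencent Cloud, and Xiaomi MiMo. The extension also supports models compatible with the OpenAI and Anthropic API interfaces, so any third-party cloud service model that exposes a compatible interface can be connected.
★ 0 5d ago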
flashinfer ⑂
FlashInfer: Kernel Library for LLM Serving
Python ★ 0 5d ago
flash-attention ⑂
Fast and memory-efficient exact attention
Python ★ 0 5d ago
vllm ⑂
A high-throughput and memory-efficient inference and serving engine for LLMs
★ 0 6d ago
Wan2.1 ⑂
Wan: Open and Advanced Large-Scale Video Generative Models
Python ★ 0 8mo ago
ptx-isa-markdown ⑂
PTX ISA 9.1 documentation converted to searchable markdown. Includes Claude Code skill for CUDA development.
★ 0 5mo ago
Triton-distributed ⑂
Distributed Compiler based on Triton for Parallel Systems
★ 0 3mo ago
cutile-learn ⑂
NVIDIA cuTile learn
★ 0 6mo ago
SageAttention ⑂
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Cuda ★ 0 5mo ago
ComfyUI ⑂
The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
★ 0 7mo ago
FlagGems ⑂
FlagGems is an operator library for large language models implemented in the Triton Language.
★ 0 11mo ago
Kandinsky-5 ⑂
Kandinsky 5.0: A family of diffusion models for Video & Image generation
★ 0 8mo ago
ImageReward ⑂
[NeurIPS 2023] ImageReward: Learning and Evaluating Human Preferences for Text-to-image Generation
Python ★ 0 7mo ago
LongCat-Video ⑂
No description.
★ 0 7mo ago
Wan2.2 ⑂
Wan: Open and Advanced Large-Scale Video Generative Models
Python ★ 0 8mo ago
DiffSynth-Studio ⑂
Enjoy the magic of Diffusion models!
★ 0 8mo ago
Phased-Consistency-Model ⑂
[NeurIPS 2024] Boosting the performance of consistency models with PCM!
★ 0 1y ago
Qwen-Image-Lightning ⑂
Qwen-Image-Lightning: Speed up Qwen-Image model with distillation
★ 0 9mo ago
pytorch ⑂
Tensors and Dynamic neural networks in Python with strong GPU acceleration
★ 0 10mo ago
tutorial-template ⑂
Template for the Read the Docs tutorial
★ 0 1y ago