Hikari_localLLM

@hikarioyama

8 repos
7 followers
1 following

Python 75%
C++ 25%

93 contributions in the last year

1-day current streak·10-day longest streak

Jun 2025

15161718192021222324252627282930

Jul 2025

12345678910111213141516171819202122232425262728293031

Aug 2025

12345678910111213141516171819202122232425262728293031

Sep 2025

123456789101112131415161718192021222324252627282930

Oct 2025

12345678910111213141516171819202122232425262728293031

Nov 2025

123456789101112131415161718192021222324252627282930

Dec 2025

12345678910111213141516171819202122232425262728293031

Jan 2026

12345678910111213141516171819202122232425262728293031

Feb 2026

12345678910111213141516171819202122232425262728

Mar 2026

12345678910111213141516171819202122232425262728293031

Apr 2026

123456789101112131415161718192021222324252627282930

May 2026

12345678910111213141516171819202122232425262728293031

Jun 2026

12345678910111213141516171819

All public repos (8)

Show forks Show archived

vllm-nvfp4-kv-sm120

NVFP4 KV cache for vLLM on SM120 (RTX PRO 6000) via FlashInfer FA2 explicit-SF-stride patch — ~1.5x fp8 pool at ~95-104% speed

A patch for vLLM that enables NVFP4 KV cache compression on Blackwell-architecture GPUs like the RTX PRO 6000, fitting roughly 78 percent more context into the same GPU memory.

Python ★ 15 14d ago
Explain →
swarm-agent

High-concurrency goal-driven swarm harness for Step-3.7-Flash: conversational front door, stigmergic board, AIMD admission, recall, skills, web UI.

Python ★ 7 8d ago
Explain →
glm52-speedup

GLM-5.2 UD-Q2 decode speedup: CPU∥GPU MoE expert-split (harness, specs, de-risk)

C++ ★ 2 2h ago
Explain →
nvfp4-vs-fp8-kv-cache-terminal-bench

FP8 vs NVFP4 (4-bit) KV cache on Terminal-Bench 2.0 — no measurable accuracy loss, 1.78x more KV capacity. Full results + verify.py.

Python ★ 2 11d ago
Explain →
sglang-nvfp4-kv-sm120

SGLang NVFP4 (fp4_e2m1) KV cache for Blackwell SM120 (RTX PRO 6000): FlashInfer FA2 kernel patches + native FP4 pool + hybrid-SWA wiring + per-layer global-scale auto-calibration. 1.778x KV capacity, ~4% decode cost. Validated end-to-end on Step-3.7-Flash 198B (cuda-graph, TP=2). Small models hit the 4-bit precision floor (use fp8 KV).

Python ★ 2 12d ago
Explain →
gemma4-repe

Gemma-4-31B representation engineering: dataset-free personality control + process-level sycophancy ablation

Python ★ 1 6d ago
Explain →
llama.cpp-glm52

llama.cpp fork: GLM-5.2 CPU∥GPU MoE expert-split decode speedup

C++ ★ 0 2h ago
Explain →
dsv4-180b-tb2-eval

DeepSeek-V4-Flash-180B (REAP K160) × Terminal-Bench 2.0 — reproducible evaluation deliverable

Python ★ 0 3d ago
Explain →