busybeaver

Python · ★ 13 · updated 28d ago

A 2MB sklearn classifier on consumer CPU beats Cohere's $10M+ 218B MoE on HumanEval

A Python tool that runs AI coding benchmarks locally on a consumer CPU using a 2.1 GB quantized model. It demonstrates that a small Qwen coding model can score 89% on HumanEval and beat larger commercial models on narrow coding tasks.

Python · Qwen2.5 · setup: moderate · complexity 2/5

busyBeaver is a Python benchmark runner that tests AI coding ability on your own computer using a small, locally downloaded model. The project's main claim is that a 2.1 GB model running on a consumer CPU scored 89% on HumanEval, a standard test where the model writes Python functions from descriptions, while Cohere's Command A+, a much larger commercial model, scored 75% on the same test.

The tool works by feeding each test problem to the model, generating up to three attempts at different temperature settings, extracting the code from the model's response, and running the benchmark's test suite against that code in an isolated subprocess with a 15-second timeout. Progress is saved after every problem, so if the run stops midway it can resume from where it left off. The tool supports three benchmarks: HumanEval (164 coding problems), MBPP (500 coding problems), and MMLU-Pro (general knowledge multiple-choice questions).
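The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. The three attempts, the separate subprocess, the 15-second timeout, and saving after every problem come from the description. Everything else is an assumption made for the example: `generate` is a hypothetical callable standing in for the local model, and the temperature values, the code-extraction regex, the problem dictionary layout, and the JSON progress file are all invented here.

```python
import json
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumed values: the description only says "different temperature settings".
TEMPERATURES = (0.0, 0.4, 0.8)
TIMEOUT_S = 15  # per-run timeout stated in the description

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable
CODE_RE = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the whole response if none is fenced."""
    match = CODE_RE.search(response)
    return match.group(1) if match else response


def passes_tests(code: str, test_code: str) -> bool:
    """Run the candidate code followed by its tests in a separate Python process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=TIMEOUT_S,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failed attempt
        return result.returncode == 0


def run_benchmark(problems, generate, progress_path):
    """Score problems with up to len(TEMPERATURES) attempts each, resuming if possible.

    problems: list of dicts with string 'id', 'prompt', and 'test' keys (hypothetical layout).
    generate: hypothetical callable (prompt, temperature) -> model response text.
    """
    progress_file = Path(progress_path)
    done = json.loads(progress_file.read_text()) if progress_file.exists() else {}
    for problem in problems:
        if problem["id"] in done:
            continue  # already scored in an earlier run
        solved = False
        for temperature in TEMPERATURES:
            response = generate(problem["prompt"], temperature)
            if passes_tests(extract_code(response), problem["test"]):
                solved = True
                break  # stop at the first passing attempt
        done[problem["id"]] = solved
        progress_file.write_text(json.dumps(done))  # save after every problem
    return sum(done.values()) / len(done) if done else 0.0
```

Note that a plain subprocess contains crashes and hangs but is not a security boundary: generated code still runs with the user's permissions.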

The model used in the published results is Qwen2.5-Coder-3B-Instruct, a 3-billion-parameter model from Alibaba quantized to 4-bit precision to fit in 2.1 GB. It runs on CPU with no GPU required. The README notes two caveats. First, the small model gets up to three attempts per problem, while the commercial model's score likely reflects a single attempt. Second, on MMLU-Pro, which tests general world knowledge rather than coding, the small model scores 27% versus 68% for the larger one.
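To see why the attempt count matters when comparing scores, here is a small back-of-the-envelope sketch. It rests on a simplifying assumption that is not from the project: every attempt succeeds independently with the same probability. Under that assumption, the chance of solving a problem within k attempts is 1 - (1 - p)^k. Real attempts from one model are correlated, so the actual gain from retries is usually smaller. The function name and the input values are illustrative, not measurements.

```python
def solve_rate_with_retries(p_single: float, attempts: int) -> float:
    """Chance of at least one success in `attempts` independent tries.

    Assumes each try succeeds with probability `p_single` independently,
    which is an idealization; correlated attempts give a smaller boost.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - p_single) ** attempts


# Illustrative inputs only, not benchmark results.
for p in (0.5, 0.6):
    print(f"single attempt: {p:.0%}, up to three attempts: {solve_rate_with_retries(p, 3):.1%}")
```

Even under this idealized model, a three-attempt score and a single-attempt score are not directly comparable, which is the caveat the README raises.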

The broader point the project makes is that a small, code-specialized model can outperform a much larger general model on narrow coding tasks, and that evaluation design choices such as prompt framing, retry logic, and test sandboxing influence results significantly.

Licensed under MIT.

Where it fits

Benchmark how well a local AI model writes Python code without paying for a cloud API.
Reproduce the claim that a 2.1 GB model can outperform large commercial models on HumanEval coding tasks.
Compare the effect of retry logic and temperature settings on benchmark scores.
Run coding benchmark evaluations on CPU-only machines with no GPU required.
