gpt-oss
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Two open-weight AI models from OpenAI, a 120B and a 20B parameter model, downloadable and runnable on your own GPU without API access, licensed Apache 2.0.
gpt-oss is a pair of open-weight AI language models released by OpenAI: gpt-oss-120b (a large model with 117 billion total parameters but only 5.1 billion active at once) and gpt-oss-20b (a smaller, faster model with 21 billion parameters). "Open-weight" means the model weights — the learned numerical values that define how the model thinks — are publicly downloadable and can be run on your own hardware, unlike OpenAI's proprietary models which require API access.
Both models are Mixture-of-Experts (MoE) models, a design where only a fraction of the network activates for any given input. This makes the 120b model surprisingly efficient: despite its large size, it fits on a single NVIDIA H100 or AMD MI300X GPU (80GB of memory) because of MXFP4 quantization, a technique that compresses the model's numbers to use less memory. The 20b model runs within 16GB of memory, making it accessible on high-end consumer hardware.
The models support reasoning with configurable effort levels (low, medium, or high), full access to the model's internal chain-of-thought, function calling, web browsing, Python code execution, and structured outputs. They use a specific "Harmony" message format that must be applied correctly for the models to work.
You can run these models locally using Ollama (two commands to download and start), LM Studio, the Hugging Face Transformers library, or vLLM for production serving. The models are licensed under Apache 2.0, making them free to use commercially without copyleft restrictions. The repository also includes educational reference implementations in PyTorch, Triton, and Metal.
Where it fits
- Run a powerful 120B AI model locally on a single H100 or MI300X GPU for inference without OpenAI API costs.
- Build a production AI API endpoint using vLLM to serve gpt-oss-20b on a 16GB GPU server.
- Use function calling and structured outputs from a locally hosted model to build AI-powered tools without sending data externally.
- Study the PyTorch reference implementation to understand Mixture-of-Experts architecture and MXFP4 quantization.