Explore-Execute-Chain

Python ★ 36 updated 16d ago

A research project showing that splitting AI reasoning into a short planning phase and a longer execution phase makes test-time computation about 8 times cheaper, with training scripts, pretrained models, and benchmarks across math and medical domains.

Python · PyTorch · Hugging Face · setup: hard · complexity 5/5

Explore-Execute Chain (E2C) is a research project about making AI language models reason more efficiently. The central idea is to split the reasoning process into two distinct phases within a single model: a short exploration phase where the model sketches a high-level plan and picks the best approach (around 1,000 tokens), followed by a longer execution phase where it carries out that plan step by step (around 10,000 tokens).
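The two-phase flow described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: `explore_then_execute`, the `toy_*` stand-ins, and the external `score` function are all hypothetical names introduced here. In E2C a single model produces both phases and picks its own approach; the explicit scorer below only makes the selection step visible.

```python
from typing import Callable, List, Tuple


def explore_then_execute(
    question: str,
    explore: Callable[[str, int], str],
    score: Callable[[str], float],
    execute: Callable[[str, str], str],
    num_plans: int = 4,
) -> Tuple[str, str]:
    """Sample several short plans, keep the best one, and execute only that plan."""
    # Phase 1 (short, cheap): sketch several candidate high-level plans.
    plans: List[str] = [explore(question, seed) for seed in range(num_plans)]
    best_plan = max(plans, key=score)
    # Phase 2 (long, expensive): carry out the chosen plan exactly once.
    answer = execute(question, best_plan)
    return best_plan, answer


# Toy stand-ins (assumptions for illustration only; a real setup would call a model).
def toy_explore(question: str, seed: int) -> str:
    strategies = [
        "guess",
        "factor the quadratic",
        "use the quadratic formula",
        "plot it",
    ]
    return strategies[seed % len(strategies)]


def toy_score(plan: str) -> float:
    # Placeholder heuristic: prefer the more detailed plan.
    return float(len(plan))


def toy_execute(question: str, plan: str) -> str:
    return f"[{plan}] -> worked solution for: {question}"


best_plan, answer = explore_then_execute(
    "Solve x^2 - 5x + 6 = 0", toy_explore, toy_score, toy_execute
)
print(best_plan)
print(answer)
```

The point of the structure is that the expensive `execute` call runs once, no matter how many plans are sampled in the first phase.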

The benefit of this split is efficiency. When looking for the best answer by trying multiple possibilities at test time, the search only needs to cover the short exploration phase rather than full reasoning chains, which the authors report makes test-time compute about 8 times cheaper. When adapting the model to a new subject area (such as medical question answering), only the exploration segments need fine-tuning, using roughly 3.5 percent of the tokens that a standard full fine-tuning approach would require.
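A back-of-the-envelope model shows where the saving comes from. It assumes the approximate phase lengths quoted above (1,000 exploration tokens, 10,000 execution tokens) and a simplified search that samples N candidates and executes only one. The function names are hypothetical, and this accounting is an illustration rather than the authors' measurement; the reported figure of about 8 times comes from their experiments.

```python
EXPLORE_TOKENS = 1_000   # approximate length of the exploration phase
EXECUTE_TOKENS = 10_000  # approximate length of the execution phase


def full_chain_cost(num_samples: int) -> int:
    """Tokens generated when every candidate is a full reasoning chain."""
    return num_samples * (EXPLORE_TOKENS + EXECUTE_TOKENS)


def explore_only_search_cost(num_samples: int) -> int:
    """Tokens generated when only explorations are sampled and one plan is executed."""
    return num_samples * EXPLORE_TOKENS + EXECUTE_TOKENS


def savings_ratio(num_samples: int) -> float:
    """How many times cheaper the exploration-only search is under this toy model."""
    return full_chain_cost(num_samples) / explore_only_search_cost(num_samples)


for n in (1, 8, 32):
    print(n, full_chain_cost(n), explore_only_search_cost(n), round(savings_ratio(n), 2))
```

Under these assumptions the ratio grows with the number of samples and approaches (1,000 + 10,000) / 1,000 = 11 as N becomes large, since the single execution is amortized over more and more cheap explorations.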

The repository includes pretrained model checkpoints and training datasets hosted on Hugging Face, as well as scripts for inference, training, and evaluation. Running inference requires a machine with at least 16 GB of GPU memory. Training from scratch requires significantly more: the supervised fine-tuning step is designed for 4 GPUs with 40 GB each, and the reinforcement learning step for 8 GPUs. An interactive demo lets you test the model on eight built-in problems spanning math, medical, and code domains, or on problems you supply yourself.

Training follows three steps: supervised fine-tuning on exploration-execution pairs, reinforcement learning using a two-stage GRPO process, and an optional lightweight adaptation step for new domains. Evaluation scripts cover 16 benchmarks across math and medical reasoning. The paper reports that the approach matches or beats standard methods on math competition problems while using around 7 times fewer tokens at test time.

This is a research codebase tied to a specific paper and pretrained models. It is aimed at machine learning practitioners who want to reproduce the paper results or experiment with the E2C training approach on their own data and compute.

Where it fits

Run inference with pretrained E2C checkpoints to reproduce paper results on math and medical reasoning benchmarks.
Fine-tune a language model on a new domain using only the short exploration segments, cutting token cost by roughly 97% versus standard fine-tuning.
Evaluate how E2C-style reasoning compares to standard chain-of-thought on custom benchmark tasks using the included evaluation scripts.
