grepseek
Code and data for the paper "GrepSeek: Training Search Agents for Direct Corpus Interaction"
A research project that fine-tunes a 9-billion-parameter AI model to answer factual questions by writing grep commands on a 14 GB plain-text Wikipedia corpus, beating vector search systems with far less setup.
GrepSeek is a research project that trains a compact AI model to answer factual questions by running shell search commands directly on a raw text corpus. Instead of building a vector database or a pre-computed search index, the model learns to write grep and ripgrep commands against a 14 GB Wikipedia corpus stored as plain text. The approach is called Direct Corpus Interaction. The project comes with code, training scripts, a trained model, and a dataset, all published alongside an academic paper.
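To make the idea concrete, here is a minimal sketch of the kind of tool call such an agent might issue: a fixed-string, case-insensitive `grep` run over a plain-text file, with the matching lines returned as the observation. The helper names `run_grep` and `demo`, the tiny three-line corpus, and the choice of flags are illustrative assumptions, not the repository's actual tool interface. It assumes a `grep` binary is available on `PATH`.

```python
import subprocess
import tempfile
from pathlib import Path


def run_grep(pattern: str, corpus_path: str, max_hits: int = 5) -> list[str]:
    """Run a fixed-string, case-insensitive grep and return the matching lines.

    Hypothetical helper: the real agent's command format may differ.
    """
    # -F: treat pattern as a literal string, -i: ignore case,
    # -m N: stop after N matching lines, --: end of options.
    cmd = ["grep", "-F", "-i", "-m", str(max_hits), "--", pattern, corpus_path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    # grep exits with 0 on a match, 1 on no match, and >1 on an error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()


def demo() -> list[str]:
    """Search a tiny stand-in corpus (made up for this example)."""
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp) / "corpus.txt"
        corpus.write_text(
            "Paris is the capital of France.\n"
            "Canberra is the capital of Australia.\n"
            "The Nile flows through Egypt.\n",
            encoding="utf-8",
        )
        return run_grep("capital of australia", str(corpus))
```

Passing the command as an argument list rather than a shell string avoids quoting problems when a model-generated pattern contains spaces or metacharacters.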
The model is a 9-billion-parameter language model from the Qwen3.5 family, fine-tuned in two stages. The first stage is supervised fine-tuning on a dataset of 10,000 example search trajectories generated by a teacher model, which teaches the agent how to break a question down into a sequence of shell commands. The second stage uses reinforcement learning with GRPO (Group Relative Policy Optimization), in which the model is rewarded for finding answers that match the correct text. Across a benchmark of seven question-answering datasets, the combined approach achieves the best average score, outperforming dense retrieval systems that require large vector indices.
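The reinforcement learning stage can be illustrated with a small sketch of an answer-matching reward and the group-relative advantage that GRPO computes over several sampled answers to the same question. This is an assumption-laden illustration: the normalization rules, the binary exact-match reward, the use of population standard deviation, and the function names are choices made for the example, and the paper's actual reward may differ.

```python
import re
import statistics
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the prediction matches any gold answer after normalization."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of samples for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]` for four sampled trajectories, the two correct ones receive a positive advantage and the two incorrect ones a negative advantage of equal size. A group in which every sample scores the same yields all-zero advantages.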
One practical advantage is cost and simplicity. Setting up a dense vector index for the same Wikipedia corpus requires 70 GB of RAM or many hours of GPU processing. GrepSeek needs only the raw text and about 14 GB of RAM, with roughly one minute of setup. The repository also includes a sharded parallel search engine that runs corpus searches up to 7.6 times faster than plain grep while producing byte-identical results.
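The sketch below shows one way a sharded search can return exactly the same bytes as a sequential scan: cut the file into byte ranges that start on line boundaries, search each range independently, and concatenate the hits in shard order. It is a pure-Python illustration under stated assumptions, not the repository's engine. The function names, the literal substring matching, and the thread pool are all choices made for the example, and Python threads will not reproduce the reported speedup.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def shard_offsets(path: str, num_shards: int) -> list[tuple[int, int]]:
    """Split a file into byte ranges whose boundaries fall on line starts."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, num_shards):
            f.seek(size * i // num_shards)
            f.readline()  # skip ahead to the start of the next full line
            cuts.append(min(f.tell(), size))
    cuts.append(size)
    cuts = sorted(set(cuts))  # drop duplicate cuts from very long lines
    return list(zip(cuts[:-1], cuts[1:]))


def search_shard(path: str, start: int, end: int, needle: bytes) -> bytes:
    """Return every line in [start, end) that contains the needle."""
    hits = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if needle in line:
                hits.append(line)
    return b"".join(hits)


def parallel_search(path: str, needle: bytes, num_shards: int = 4) -> bytes:
    """Search all shards concurrently and join results in file order."""
    shards = shard_offsets(path, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # map() yields results in input order, which preserves line order.
        parts = pool.map(lambda s: search_shard(path, s[0], s[1], needle), shards)
        return b"".join(parts)


def demo() -> tuple[bytes, bytes]:
    """Compare a sequential scan with the sharded search on a toy corpus."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "corpus.txt")
        with open(path, "wb") as f:
            for i in range(1000):
                word = "grep" if i % 7 == 0 else "index"
                f.write(f"doc {i}: {word} entry\n".encode())
        sequential = search_shard(path, 0, os.path.getsize(path), b"grep")
        sharded = parallel_search(path, b"grep", num_shards=4)
        return sequential, sharded
```

Because every shard begins at a line start and results are joined in shard order, no line is split, dropped, or reordered, which is what makes the output match a single pass over the file.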
The codebase is split into folders for data generation, supervised fine-tuning, reinforcement learning training, and inference. A Jupyter notebook lets anyone try the released model on Google Colab without writing training code. The project is licensed under Apache 2.0.
Where it fits
- Reproduce the GrepSeek paper results on seven question-answering benchmarks without building a vector index.
- Fine-tune a language model to search raw text corpora using shell commands instead of dense retrieval.
- Run the parallel sharded search engine for up to 7.6x faster Wikipedia corpus search with byte-identical results.