grepseek

Python · ★ 44 · updated 21d ago

Code and data for the paper "GrepSeek: Training Search Agents for Direct Corpus Interaction"

A research project that fine-tunes a 9-billion-parameter AI model to answer factual questions by writing grep commands on a 14 GB plain-text Wikipedia corpus, beating vector search systems with far less setup.

Python · Qwen3.5 · ripgrep · GRPO · Jupyter · Apache · setup: hard · complexity 4/5

GrepSeek is a research project that trains a compact AI model to answer factual questions by running shell search commands directly on a raw text corpus. Instead of building a vector database or a pre-computed search index, the model learns to write grep and ripgrep commands against a 14 GB Wikipedia corpus stored as plain text. The approach is called Direct Corpus Interaction. The project comes with code, training scripts, a trained model, and a dataset, all published alongside an academic paper.
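To make the idea of searching the corpus directly more concrete, here is a minimal sketch of how an agent's search step could be executed from Python. It is not taken from the repository: the function name `run_search`, the choice of plain `grep` rather than ripgrep, and the flag set are all assumptions made for illustration.

```python
import subprocess


def run_search(pattern, corpus_path, max_hits=5):
    """Run a case-insensitive, fixed-string grep and return the matching lines.

    Hypothetical helper for illustration; the project's own tooling may differ.
    """
    cmd = [
        "grep",
        "-F",                 # treat the pattern as a literal string, not a regex
        "-i",                 # ignore case
        "-m", str(max_hits),  # stop after max_hits matching lines
        "--",                 # end of options, so patterns starting with "-" are safe
        pattern,
        corpus_path,
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    # grep exits with 0 on a match, 1 on no match, and 2 or more on a real error.
    if proc.returncode > 1:
        raise RuntimeError(proc.stderr.strip())
    return proc.stdout.splitlines()
```

Capping the number of hits keeps the text handed back to the model short, which matters when one query can match thousands of lines in a corpus of this size.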

The model is a 9-billion-parameter language model from the Qwen3.5 family, fine-tuned in two stages. The first stage is supervised fine-tuning on 10,000 example search trajectories generated by a teacher model, which teaches the agent to break a question down into a sequence of shell commands. The second stage is reinforcement learning with a method called GRPO, in which the model is rewarded when the answer it finds matches the correct text. On a benchmark of seven question-answering datasets, the combined approach achieves the best average score, outperforming dense retrieval systems that require large vector indices.
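The reward signal in the second stage can be sketched as follows. This is an illustration under stated assumptions, not the project's training code: the answer normalization (lowercasing, dropping punctuation and English articles) is a common question-answering convention assumed here, and `exact_match_reward` and `group_advantages` are hypothetical names. The second function shows the group-relative step that gives GRPO its name: each sampled answer is scored against the mean and standard deviation of the rewards in its own group.

```python
import re
import string
from statistics import mean, pstdev


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction, gold_answers):
    """Return 1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(gold) for gold in gold_answers) else 0.0


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout in the group got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group earns the same reward, all advantages come out as zero, so that question contributes no learning signal for that step.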

One practical advantage is cost and simplicity. Setting up a dense vector index for the same Wikipedia corpus requires 70 GB of RAM or many hours of GPU processing. GrepSeek needs only the raw text and about 14 GB of RAM, with roughly one minute of setup. The repository also includes a sharded parallel search engine that runs corpus searches up to 7.6 times faster than plain grep while producing byte-identical results.
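The general idea behind a sharded search that still returns the same bytes as a sequential scan can be sketched in pure Python. This is not the repository's engine: the function names, the byte-range sharding scheme, and the use of threads are assumptions chosen to keep the example short. The key point is that shard boundaries are moved forward to the next line start, so no line is split or counted twice, and results are concatenated in shard order.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def shard_offsets(path, n_shards):
    """Split a file into up to n_shards byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_shards):
            f.seek(size * i // n_shards)
            f.readline()          # advance to the start of the next full line
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)  # keep only new, strictly increasing cut points
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def search_shard(path, start, end, needle):
    """Return every line in the byte range [start, end) that contains needle."""
    hits = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if needle in line:
                hits.append(line)
    return hits


def sharded_search(path, needle, n_shards=4):
    """Search all shards concurrently and return matches in original file order."""
    shards = shard_offsets(path, n_shards)
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # map() yields results in input order, which preserves the file's line order.
        parts = pool.map(lambda s: search_shard(path, s[0], s[1], needle), shards)
        return [line for part in parts for line in part]
```

Threads keep the sketch simple, but in CPython they are unlikely to deliver a large speedup for this workload; a practical engine would more plausibly run separate processes or native search tools per shard.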

The codebase is split into folders for data generation, supervised fine-tuning, reinforcement learning training, and inference. A Jupyter notebook lets anyone try the released model on Google Colab without writing training code. The project is licensed under Apache 2.0.

Where it fits