easy-dataset

JavaScript ★ 14k updated 1mo ago

A powerful tool for creating datasets for LLM fine-tuning, RAG, and Eval

Easy Dataset is a desktop and web app that automatically converts your documents into structured question-and-answer training data for fine-tuning or evaluating AI language models.

JavaScript · Node.js · Docker · setup: easy · complexity 2/5

Easy Dataset is a desktop and web application for turning documents into training data for AI language models. If you want to teach an AI model something specific, such as the contents of a product manual, a legal guide, or a technical knowledge base, you need a collection of question-and-answer pairs drawn from that material. Easy Dataset automates that process.

You start by uploading documents in formats like PDF, Word, Markdown, or plain text. The app splits the content into segments and then uses an AI model of your choice to generate questions and answers from each segment. The result is a structured dataset you can use to fine-tune an AI model or to power a retrieval-augmented generation setup, which is a technique for letting an AI pull from a custom knowledge base when answering questions.
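The split-then-generate step described above can be sketched roughly as follows. This is a minimal illustration, not Easy Dataset's actual implementation: the function names `splitIntoSegments` and `buildQuestionPrompt`, the character limit, and the prompt wording are all hypothetical.

```javascript
// Hypothetical helper: group paragraphs into segments of at most maxChars.
// Paragraphs longer than maxChars are hard-split at the character limit.
function splitIntoSegments(text, maxChars = 1000) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);
  const segments = [];
  let current = '';
  for (const para of paragraphs) {
    const candidate = current ? current + '\n\n' + para : para;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current) segments.push(current);
    let rest = para;
    while (rest.length > maxChars) {
      segments.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) segments.push(current);
  return segments;
}

// Hypothetical helper: build the instruction sent to the model for one segment.
function buildQuestionPrompt(segment, count = 3) {
  return (
    'Write ' + count + ' question-and-answer pairs based only on the text below.\n\n' +
    'TEXT:\n' + segment
  );
}
```

Each segment would then be sent to the model you chose, and the returned pairs collected into the dataset.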

Beyond basic question-and-answer pairs, the tool can generate multi-turn conversation data, image-based question pairs, and evaluation datasets for testing how well a model performs. The evaluation side includes multiple-choice and open-ended questions, an automated judge that scores model answers, and a side-by-side blind comparison mode where you can pit two models against each other without knowing in advance which is which.
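One way to picture the blind comparison idea is the sketch below: two answers are shown under neutral labels in random order, and the model names are revealed only after a vote. The function `makeBlindPair` and its data shapes are assumptions made for illustration, not the app's own code.

```javascript
// Hypothetical helper: hide which model wrote which answer.
// answers: an object with exactly two entries, { modelName: answerText }.
// rng is injectable so the shuffle can be made deterministic in tests.
function makeBlindPair(answers, rng = Math.random) {
  const entries = Object.entries(answers);
  if (entries.length !== 2) {
    throw new Error('expected exactly two models');
  }
  // Swap the display order half of the time.
  if (rng() < 0.5) entries.reverse();
  const shown = { A: entries[0][1], B: entries[1][1] };
  const key = { A: entries[0][0], B: entries[1][0] };
  // Only `shown` is displayed; `reveal` maps a vote back to a model name.
  return { shown, reveal: (vote) => key[vote] };
}
```

A reviewer would read `shown.A` and `shown.B`, pick one, and only then call `reveal` with the chosen label.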

The app connects to any AI provider whose API follows the standard OpenAI request format. This covers services like OpenAI, Ollama for running models locally, and various others. Once your dataset is ready, you can export it in several common formats used in AI training pipelines and upload it directly to the Hugging Face Hub.
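For readers unfamiliar with the OpenAI request format, the sketch below builds a chat completion request that should work against any compatible endpoint. The helper name `buildChatRequest` is hypothetical, and the base URL and model name in the usage note are examples only; how Easy Dataset itself issues requests is not shown here.

```javascript
// Hypothetical helper: build a request in the OpenAI chat completions format.
// It returns the URL and fetch options without sending anything.
function buildChatRequest({ baseUrl, apiKey, model, prompt }) {
  // Drop trailing slashes so the path is joined cleanly.
  const url = baseUrl.replace(/\/+$/, '') + '/chat/completions';
  const headers = { 'Content-Type': 'application/json' };
  // Local servers often need no key, so the header is optional.
  if (apiKey) headers['Authorization'] = 'Bearer ' + apiKey;
  return {
    url,
    options: {
      method: 'POST',
      headers,
      body: JSON.stringify({
        model,
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}
```

As a usage example, assuming a local Ollama server on its default port, `buildChatRequest({ baseUrl: 'http://localhost:11434/v1', model: 'llama3', prompt: '...' })` yields a `url` and `options` that can be passed to `fetch(url, options)`. In this format the generated text is normally found at `choices[0].message.content` in the JSON response.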

Desktop installers are available for Windows, macOS, and Linux. You can also run it locally via Node.js or Docker. The interface supports Chinese, English, Turkish, and Portuguese. The project is open source under the AGPL-3.0 license.

Where it fits

Convert a product manual or knowledge base PDF into question-answer pairs for fine-tuning an AI model.
Generate multi-turn conversation training data from existing documentation to customize a language model.
Run automated model evaluation with the built-in judge to score and compare two models side by side.
Export a finished dataset directly to Hugging Face in standard AI training formats.
