nextcloud-qdrant-pipeline

Python · ★ 17 · updated 15d ago

Automated RAG pipeline: Nextcloud inbox → PDF chunking → bge-m3 embedding → Qdrant vector store

An automated pipeline that watches a Nextcloud folder for PDFs, splits and embeds each document using an AI model, stores results in Qdrant, and lets you search your files by meaning rather than exact keywords.

Python · Nextcloud · Qdrant · bge-m3 · WebDAV · Telegram · setup: hard · complexity 4/5

This project creates an automated pipeline for making your PDF documents searchable by meaning rather than by exact keywords. You drop a PDF into a specific folder in your Nextcloud storage (Nextcloud is self-hosted cloud storage, similar to Dropbox but run on your own server), and the pipeline picks it up, processes it, and stores it in a way that lets you later ask questions in plain language and get back the relevant passages.

The processing happens in three steps. First, the PDF is split into smaller text chunks; the splitter tries to detect section headers and preserve the document structure rather than cutting blindly by character count. Second, each chunk is converted into a vector of numbers (an embedding) that represents its meaning, using an AI embedding model called bge-m3. Third, those vectors are stored in Qdrant, a database built specifically for storing and searching this kind of data. Once the pipeline has run, you can query the collection with a question and get back the passages whose meaning is closest to what you asked.
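To make the first step more concrete, here is a minimal sketch of header-aware chunking. It is not the project's actual splitter: the function name `split_by_headings`, the heading regex, and the `max_chars` limit are all assumptions chosen for illustration. The heuristic treats short numbered lines ("2.1 Methods") or all-caps lines as section headers.

```python
import re

# Hypothetical heuristic (not taken from the project): a heading is a short
# line that is either numbered ("2.1 Methods") or written in all caps.
HEADING_RE = re.compile(r"^(\d+(\.\d+)*\.?\s+\S.*|[A-Z][A-Z0-9 \-]{2,})$")


def split_by_headings(text: str, max_chars: int = 1000) -> list[dict]:
    """Split text into chunks, starting a new chunk at each detected heading.

    A section longer than max_chars is split further at line boundaries,
    and every resulting chunk keeps the heading it belongs to.
    """
    chunks: list[dict] = []
    heading = None          # heading of the section currently being read
    buffer: list[str] = []  # body lines collected for the current chunk

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped and len(stripped) <= 80 and HEADING_RE.match(stripped):
            flush()             # close the previous section
            heading = stripped  # later lines belong to this heading
            continue
        buffer.append(line)
        if sum(len(item) + 1 for item in buffer) >= max_chars:
            flush()             # section too long: cut at a line boundary
    flush()
    return chunks


sample = "INTRODUCTION\nHello world.\n\n2.1 Methods\nWe did things.\n"
for chunk in split_by_headings(sample):
    print(chunk["heading"], "->", chunk["text"])
```

Keeping the heading next to each chunk is one way to preserve document structure: it can be stored as payload alongside the vector so that search results show which section a passage came from.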

The pipeline also avoids redundant work when files reappear. Each PDF gets a fingerprint (a SHA256 hash) when it is first ingested. If you drop the same file again, the system detects the matching fingerprint and skips it. If you drop an updated version of the same file, its fingerprint differs, so the old entries are deleted and the new version is ingested. Successfully processed files are moved to a processed folder; failed ones go to a separate failed folder. Telegram notifications report the outcome of each file.
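The skip-or-replace decision described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names and the in-memory `known` dictionary are hypothetical, and the sketch assumes that "the same file" is recognised by its filename, which the description does not confirm.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA256 fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def decide_action(filename: str, data: bytes, known: dict[str, str]) -> str:
    """Decide what to do with an incoming file.

    `known` maps filename -> fingerprint of the last ingested version
    (assumption: files are matched by name). Returns one of:
      "skip"    - identical content was already ingested
      "replace" - same name, different content: delete old entries, re-ingest
      "ingest"  - file has not been seen before
    """
    digest = sha256_of(data)
    previous = known.get(filename)
    if previous == digest:
        return "skip"
    action = "replace" if previous is not None else "ingest"
    known[filename] = digest  # remember the version that is now current
    return action


known: dict[str, str] = {}
print(decide_action("report.pdf", b"version 1", known))  # ingest
print(decide_action("report.pdf", b"version 1", known))  # skip
print(decide_action("report.pdf", b"version 2", known))  # replace
```

In a real deployment the fingerprint would more likely live next to the stored vectors (for example as payload in the vector store) than in a process-local dictionary, so that it survives restarts.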

The whole system runs as a background service that checks your Nextcloud inbox folder every 30 seconds. It requires Nextcloud with WebDAV enabled, an Infinity server running the bge-m3 embedding model, and a Qdrant instance. All three can be self-hosted. The project requires Python 3.11 or newer and is licensed under MIT.
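The polling behaviour can be pictured with a small loop like the one below. It is a sketch only: `run_poller`, `list_inbox`, and `process` are hypothetical names, and the WebDAV listing, ingestion, file moves, and Telegram notifications are left to the injected callables rather than implemented here.

```python
import time
from typing import Callable, Iterable, Optional


def run_poller(
    list_inbox: Callable[[], Iterable[str]],
    process: Callable[[str], None],
    interval: float = 30.0,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, list[str]]:
    """Poll the inbox every `interval` seconds and process each file found.

    `list_inbox` and `process` are placeholders for the real WebDAV listing
    and ingestion steps. A file whose processing raises is recorded as
    failed instead of stopping the service. `max_cycles=None` runs forever.
    """
    outcomes: dict[str, list[str]] = {"processed": [], "failed": []}
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for name in list_inbox():
            try:
                process(name)
                outcomes["processed"].append(name)  # would move to processed/
            except Exception:
                outcomes["failed"].append(name)     # would move to failed/
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            sleep(interval)  # wait before checking the inbox again
    return outcomes


def fake_process(name: str) -> None:
    if name.startswith("bad"):
        raise ValueError("could not parse PDF")


result = run_poller(lambda: ["ok.pdf", "bad.pdf"], fake_process, max_cycles=1)
print(result)
```

Catching the exception per file is the design choice that matters here: one unreadable PDF ends up in the failed list while the loop keeps serving the rest of the inbox.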

Where it fits