framedex

Python ★ 326 updated 2d ago

Framedex — a queryable knowledge base for your video archive

CLI tool delivered as a Claude Code skill that indexes a video archive into searchable markdown sidecar files with transcripts, GPS, faces, and scene descriptions.

PythonWhisperXffmpeginsightfaceClaude CodeAnthropic SDKsetup: hardcomplexity 4/5

Framedex is a command-line tool for taking a messy archive of video clips spread across several external drives and turning it into something searchable. For each clip, it writes a small text file in markdown next to the original video. That sidecar file contains everything the tool was able to learn about the clip, including duration and resolution from the file itself, GPS coordinates, the place name those coordinates correspond to, a transcript with speaker labels, an English translation when the speech is in another language, detected faces, and a written description of the scene with a keep, review, or cull rating. The original videos are never changed.

The tool is delivered as a Claude Code skill that installs a vidx command. After cloning the repo into the skills folder, running setup.py installs the Python dependencies and downloads the Whisper speech-recognition models plus face-detection models. You also need a Hugging Face token and to accept the terms on two pyannote model pages so that speaker diarization can run.

The per-clip pipeline is a chain of well-known tools. ffprobe reads file metadata, exiftool reads GPS data, Nominatim turns the coordinates into a place name with polite rate limiting, ffmpeg extracts five evenly spaced JPEG frames and a mono 16-kHz WAV file, WhisperX runs transcription with word-level alignment and speaker labels, and insightface detects faces and computes 512-dimensional embeddings. Finally a vision model produces a structured scene description and a keep, review, or cull rating, and the sidecar file is written.

Vision work can run in three modes. The cli backend uses a Claude Max subscription through the claude -p command, which has no marginal cost. The api backend uses the Anthropic SDK with an API key, which is the fastest option for huge archives. The local backend talks to LM Studio or any OpenAI-compatible local server so that nothing leaves the machine.

The tool is built to be resumable: a sidecar that already exists means the clip is skipped on the next run. Useful flags include --dry-run, --max-files to test on a small subset, --force to re-index, --no-diarize to skip speaker labels, --no-faces to skip face detection, and --max-duration to cap clip length. A .video-context.md file at the root of a scan target gives the vision model a hint about the project and feeds proper nouns to Whisper for better transcription.

Where it fits

Index a multi-drive archive of raw video clips so each one is searchable by transcript, place, and scene description.
Triage hours of unsorted footage with automatic keep, review, or cull ratings before editing.
Add speaker-labeled transcripts and translations to family videos in mixed languages.
Detect and group faces across a large clip library without uploading anything to the cloud.

Open on GitHub → Full breakdown on explaingit →