ClinSeekAgent
A UC Santa Cruz research system that tests whether AI models make better clinical decisions when they actively search patient records, medical images, and external knowledge versus reading a curated summary.
ClinSeekAgent is a research system from UC Santa Cruz that tests whether AI models can reason about clinical cases better when they actively search for evidence on their own, compared to being handed pre-selected information. The core question it addresses: does giving an AI model access to raw patient records, medical image tools, and external knowledge sources help it make better clinical decisions than reading a curated summary?
The system works by placing a host AI model inside an agent loop. The paper evaluates models including Claude, Qwen, and others. The agent has access to three types of tools: one for querying patient-level electronic health record tables, one for searching external medical knowledge via a browser, and one for analyzing chest X-ray images. The model can call these tools in sequence to gather evidence before producing an answer, much like a clinician consulting different sources before making a diagnosis.
Results from the paper show that active evidence-seeking often outperforms the curated baseline. On text-only EHR tasks, most evaluated models improve when given raw access instead of pre-selected snippets. The gap is larger for multimodal tasks involving imaging: one model gained over 34 percentage points on a specific reasoning category when allowed to actively query images rather than receiving them pre-processed.
The repository also includes a recipe for training a smaller student model on the trajectories generated by the larger agent system. The resulting model (ClinSeek-35B-A3B) reaches performance close to its teacher on an external benchmark while being significantly smaller than the largest closed-source models it was compared against.
The codebase is split into four separate roles (agent driver, EHR server, image server, training) each with its own dependencies, because the image and training components require GPU hardware and specific library versions. Patient data is not included in the repository; it must be obtained separately from credentialed sources such as the MIMIC dataset.
Where it fits
- Evaluate whether an AI model performs better when it actively queries raw patient records versus reading a pre-selected summary for a clinical question.
- Analyze chest X-ray images as part of an AI-driven clinical reasoning pipeline that calls imaging tools on demand.
- Train a smaller student model on clinical reasoning trajectories generated by the larger agent system.
- Benchmark clinical AI models across text-only EHR tasks and multimodal medical imaging tasks.