ipex-llm

Python ★ 8.8k updated 4mo ago ▣ archived

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Archived Intel library for running and fine-tuning large AI language models locally on Intel GPUs and NPUs, now officially unsupported with known security issues, for historical reference only.

Python, PyTorch, ONNX, Ollama, LangChain, vLLM; setup: hard; complexity 4/5

Important note: this project has been officially archived by Intel. Intel states that it will no longer provide maintenance, bug fixes, or new releases, nor accept patches, and it has identified the project as having known security issues. Anyone considering it for active use should treat it as unsupported.

While it was active, IPEX-LLM was a library that made it faster to run and fine-tune large AI language models on Intel hardware: discrete GPUs (the Arc, Flex, and Max lines), integrated graphics, and the neural processing unit (NPU) found in newer Intel Core Ultra processors. The goal was to let people run capable AI models locally on consumer and workstation Intel hardware rather than relying on cloud services.

The library supported over 70 models, including well-known open-source families such as Llama, Mistral, DeepSeek, Qwen, and others. It also offered ways to compress models to smaller sizes (using techniques like 4-bit and 8-bit quantization) so they fit within the limited memory of consumer graphics cards. It was designed to plug into popular existing AI tools like Ollama, llama.cpp, HuggingFace's model library, LangChain, and vLLM, so developers could swap in Intel GPU acceleration without rewriting their code.
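To make the idea of 4-bit and 8-bit quantization concrete, here is a minimal sketch of symmetric quantization applied to one small block of weights. It is an illustration of the general technique only, not IPEX-LLM's actual implementation; the function names `quantize_block` and `dequantize_block` and the sample weights are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric quantization of one block of floats to signed integer codes.

    Hypothetical helper for illustration; real libraries use more elaborate
    block formats and packed storage.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0    # one float scale shared by the block
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from integer codes and the block scale."""
    return [c * scale for c in codes]


# Made-up sample weights, standing in for a slice of a model's weight matrix.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.18]

codes4, scale4 = quantize_block(weights, bits=4)
codes8, scale8 = quantize_block(weights, bits=8)

restored4 = dequantize_block(codes4, scale4)
restored8 = dequantize_block(codes8, scale8)

# Worst-case reconstruction error for each bit width.
err4 = max(abs(a - b) for a, b in zip(weights, restored4))
err8 = max(abs(a - b) for a, b in zip(weights, restored8))

print("4-bit codes:", codes4, "max error:", err4)
print("8-bit codes:", codes8, "max error:", err8)
```

The trade-off is visible in the two error figures: fewer bits per weight means less memory but a coarser approximation of the original values.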

One notable feature was the ability to run very large models, such as DeepSeek's 671-billion-parameter models, on one or two Intel Arc graphics cards by splitting the workload. Running models of that size would otherwise require expensive enterprise hardware.
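A rough back-of-the-envelope calculation shows why a model of this size is hard to host. The sketch below estimates the memory needed for the weights alone at different bit widths. It assumes every parameter is stored at the same precision and ignores quantization metadata, the KV cache, and activations, so real requirements would be higher. The function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_671B = 671e9  # parameter count quoted in the text above

for bits in (16, 8, 4):
    size = weight_footprint_gb(PARAMS_671B, bits)
    print(f"{bits:>2}-bit weights: about {size:,.1f} GB")
```

Even at 4 bits per weight the estimate is in the hundreds of gigabytes, far more than the memory on a consumer graphics card, which is why splitting the workload matters for a model this large.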

Because the project is archived and carries known security issues, the appropriate use is historical reference or research only, not production deployment.

Where it fits

Research how Intel GPU acceleration was applied to running open-source LLMs like Llama and DeepSeek locally.
Study 4-bit and 8-bit quantization techniques for fitting large language models into consumer GPU memory.
