gitmyhub

Qwen

Python ★ 21k updated 3mo ago

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

Alibaba's open-source family of large language models (1.8B, 72B parameters) trained on 3 trillion multilingual tokens, with chat versions for conversation, coding, math, and tool use.

PythonPyTorchvLLMFastChatLoRAsetup: moderatecomplexity 4/5

Qwen is the original open-source large language model series from Alibaba Cloud (Tongyi Qianwen in Chinese). The repository hosts the first-generation Qwen models and their chat-tuned counterparts. The README opens with an important note that Qwen2 is now available in a separate repository (QwenLM/Qwen2), and that this repo is no longer actively maintained because the codebase has diverged. So the project is mainly a reference point for the original Qwen 1 generation rather than something you would start with today for production work.

The series comes in four sizes: Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B. Each size is released as a base language model (the raw pretrained version) and as a chat model (Qwen-Chat), which has been aligned with human preferences through supervised fine-tuning and RLHF. The chat models can hold conversations, write and summarize text, extract information, translate, write code, solve math problems, call tools, act as agents, and even act as a code interpreter. Each chat model is also released in Int4 and Int8 quantized versions, which use less GPU memory at the cost of some precision. Downloads are hosted on Hugging Face and ModelScope.

The base models were pretrained on up to 3 trillion tokens of multilingual data, with a particular focus on Chinese and English alongside many other languages and domains. The README reports the release dates, max context length (8K for Qwen-14B, 32K for the others), pretraining token counts, minimum GPU memory required for Q-LoRA finetuning (from about 5.8GB for the 1.8B model up to 61.4GB for the 72B), and minimum GPU memory for generating 2048 tokens with the Int4 quantized version (from about 2.9GB up to 48.9GB). All four sizes support tool usage.

The repository documents how to get started with inference, how to use the quantized models including GPTQ and KV-cache quantization, performance statistics, finetuning tutorials (full-parameter, LoRA, and Q-LoRA), deployment instructions using vLLM and FastChat, how to build a WebUI or CLI demo, how to call the DashScope API service, how to build an OpenAI-style API in front of your local model, how to use Qwen for tool use and agents, long-context evaluation, FAQ, and the license. A technical report describing the series is published at arxiv.org/abs/2309.16609.

Where it fits