使用 Unsloth 微调 Gemma 4

Unsloth 是一个开源库，可让 LLM 微调速度提升 2 倍，同时节省 60% 的内存。它通过自定义 CUDA 内核和优化的训练循环实现这一点——与标准训练相比零精度损失。

Gemma 4 在 Unsloth 中得到完全支持，包括所有四个变体（E2B、E4B、26B MoE、31B）。本指南涵盖安装、数据集准备、训练配置和导出微调模型。

为什么用 Unsloth 微调？

2 倍速训练

自定义 Triton 内核优化了注意力、MLP 和嵌入层。标准方法需要 10 小时的微调，Unsloth 只需约 5 小时。

节省 60% 内存

智能梯度检查点和内存管理让你可以在更小的 GPU 上微调更大的模型。E4B 模型可以在单张 RTX 3090 上微调。

零精度损失

Unsloth 的优化与标准训练在数学上等价。用更少的计算获得相同的模型质量——没有近似或妥协。

便捷导出

将微调模型导出为 GGUF（用于 Ollama/llama.cpp）、SafeTensors（用于 vLLM），或直接推送到 Hugging Face——一条命令搞定。

安装

使用 pip 安装 Unsloth。需要 Python 3.10+ 和 PyTorch 2.0+：

pip install unsloth

快速上手：微调 E4B

使用 LoRA 在自有数据集上微调 Gemma 4 E4B 的最小示例：

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-e4b-it",
    max_seq_length=4096,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Train with your dataset
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=4096,
)
trainer.train()

准备数据集

Unsloth 支持多种数据集格式来微调 Gemma 4：

unslothPage.datasets.formats.0.title

包含 user/assistant 轮次的对话。最适合聊天机器人和助手微调。

unslothPage.datasets.formats.1.title

用于继续预训练或领域适配的原始文本。

unslothPage.datasets.formats.2.title

选择/拒绝对，用于基于偏好的训练。

微调硬件要求

unslothPage.hardware.desc

unslothPage.hardware.headers.model	unslothPage.hardware.headers.gpu	unslothPage.hardware.headers.time
E2B LoRA	RTX 3060 (12 GB)	~15 min / 1K steps
E4B LoRA	RTX 4060 Ti (16 GB)	~25 min / 1K steps
E4B QLoRA	RTX 3060 (12 GB)	~30 min / 1K steps
27B MoE LoRA	RTX 4090 (24 GB)	~60 min / 1K steps
27B MoE QLoRA	RTX 4070 Ti (16 GB)	~90 min / 1K steps

导出模型

微调完成后，导出为你需要的格式：

# Save to GGUF for Ollama
model.save_pretrained_gguf("gemma4-custom", tokenizer, quantization_method="q4_k_m")

# Save to SafeTensors for vLLM
model.save_pretrained_merged("gemma4-custom-merged", tokenizer)

# Push to Hugging Face
model.push_to_hub_merged("your-username/gemma4-custom", tokenizer)

Unsloth + Gemma 4 常见问题

Unsloth 是什么？

Unsloth 是一个开源微调库，通过自定义 CUDA 内核让 LLM 训练速度提升 2 倍、内存节省 60%。支持 Gemma 4、Llama、Mistral 等主流模型家族。

消费级 GPU 能微调 Gemma 4 E4B 吗？

可以。使用 Unsloth 的 QLoRA 4-bit，可以在 RTX 4060 (8GB) 上微调 E4B。LoRA 需要 RTX 3090 (24GB)。更大的模型需要专业 GPU (A100/H100) 或云实例。

LoRA 和 QLoRA 有什么区别？

LoRA（低秩适配）在模型中添加小型可训练矩阵，同时冻结基础权重。QLoRA 额外将基础模型量化为 4-bit，大幅减少内存。两者产出的模型质量相近。

微调需要多少数据？

领域适配通常 1K-10K 高质量样本就够了。指令微调需要 5K-50K 对话对。质量比数量更重要——1K 优质样本胜过 10 万条噪声数据。

能把 LoRA 权重合并到基础模型吗？

可以。Unsloth 支持将 LoRA 权重合并到基础模型中进行部署，无需适配器开销。以 GGUF 或 SafeTensors 格式导出为单一合并模型。

Unsloth 支持 MoE 模型吗？

支持。Unsloth 支持微调 Gemma 4 26B A4B MoE 模型。由于 MoE 架构，LoRA 通常应用于共享层和专家路由，比相同活跃参数量的稠密模型需要更多显存。

unslothPage.faq.items.6.q

unslothPage.faq.items.6.a

unslothPage.faq.items.7.q

unslothPage.faq.items.7.a

unslothPage.faq.items.8.q

unslothPage.faq.items.8.a

unslothPage.faq.items.9.q

unslothPage.faq.items.9.a

开始微调 Gemma 4

安装 Unsloth，准备数据集，几小时内创建你的定制 Gemma 4 模型。

下载基础模型选择模型变体硬件要求