Take a base model and train it on your writing style, your docs, your domain. LoRA fine-tuning runs on consumer hardware — no cloud GPUs required.
⏱ ~45 minutes · 💻 Mac (MLX) / Linux+GPU (Unsloth) · 💾 16GB+ RAM
What You'll Need
- Apple Silicon Mac with 16GB+ RAM (for the MLX path) or Linux/Windows with an NVIDIA GPU, 8GB+ VRAM (for the Unsloth path)
- Python 3.10+
- Ollama installed (to use your finished model — see our Ollama guide)
- Training data — your writing, conversations, documentation, or domain knowledge
💡 Two paths in this guide: MLX for Apple Silicon Macs (uses unified memory, no discrete GPU needed) and Unsloth for NVIDIA GPUs (faster, supports larger models). Both produce a model you can export to Ollama. Pick the one that matches your hardware.
1 Prepare Your Training Data
Fine-tuning needs structured examples in JSONL format — one JSON object per line. The format depends on what you're training the model to do.
Chat format (teach it to respond like you):
```jsonl
{"messages": [{"role": "user", "content": "How should I structure this project?"}, {"role": "assistant", "content": "Start with the data layer. Get your schema right first, then build the API on top. Don't touch the frontend until the API works."}]}
{"messages": [{"role": "user", "content": "What do you think about microservices?"}, {"role": "assistant", "content": "Overkill for most projects. Start with a monolith. Split it when you actually have scaling problems, not before."}]}
```
Instruction format (teach it a task):
```jsonl
{"messages": [{"role": "system", "content": "You are a technical writer."}, {"role": "user", "content": "Explain DNS in one paragraph."}, {"role": "assistant", "content": "DNS translates domain names to IP addresses..."}]}
```
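A malformed line can fail an entire training run, so it's worth validating the file before you train. A minimal checker sketch — the `train.jsonl` filename and the role set are assumptions matching the examples above:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl(path):
    """Return a list of error strings; an empty list means the file looks trainable."""
    errors = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            messages = obj.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {lineno}: missing 'messages' list")
                continue
            for msg in messages:
                if msg.get("role") not in VALID_ROLES:
                    errors.append(f"line {lineno}: bad role {msg.get('role')!r}")
                if not isinstance(msg.get("content"), str):
                    errors.append(f"line {lineno}: missing 'content' string")
    return errors

# Demonstrate on a tiny sample file (point this at your real train.jsonl)
with open("train.jsonl", "w") as f:
    f.write(json.dumps({"messages": [
        {"role": "user", "content": "What's LoRA?"},
        {"role": "assistant", "content": "A small trainable adapter."},
    ]}) + "\n")

problems = validate_jsonl("train.jsonl")
print("OK" if not problems else "\n".join(problems))  # OK
```

Run it once before every training run; fixing a bad line takes seconds, while a crashed run wastes the whole session.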
How much data do you need?
| Goal | Examples Needed | Effect |
| --- | --- | --- |
| Style/tone adaptation | 50–200 | Model picks up your voice and phrasing |
| Domain knowledge | 200–1,000 | Learns your field's terminology and patterns |
| Task specialization | 500–5,000 | Becomes reliable at a specific workflow |
| Deep expertise | 5,000+ | Approaches expert-level in a narrow domain |
Quick script to convert your chat exports to JSONL:
```python
import json

# Example: turn a list of Q&A pairs into training data
pairs = [
    ("What's LoRA?", "Low-Rank Adaptation — a way to fine-tune a model by training a small adapter instead of all the weights. Uses 10-100x less memory than full fine-tuning."),
    ("When should I fine-tune vs RAG?", "Fine-tune when you want to change how the model talks or thinks. RAG when you want to give it access to specific documents. They're complementary."),
]

with open("train.jsonl", "w") as f:
    for q, a in pairs:
        json.dump({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": a}
        ]}, f)
        f.write("\n")

print(f"Wrote {len(pairs)} examples to train.jsonl")
```
⚠️ Quality over quantity. 100 excellent examples beat 10,000 sloppy ones. Every training example teaches the model a pattern — bad examples teach bad patterns. Review your data. Remove duplicates, fix typos, and cut anything you wouldn't want the model to repeat.
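Part of that review pass can be automated. A sketch that drops exact duplicates from a list of examples — a hypothetical helper, not a standard tool:

```python
import json

def dedupe(examples):
    """Drop exact-duplicate examples, keeping first occurrence and order."""
    seen, kept = set(), []
    for example in examples:
        key = json.dumps(example, sort_keys=True)  # canonical form for comparison
        if key not in seen:
            seen.add(key)
            kept.append(example)
    return kept

examples = [
    {"messages": [{"role": "user", "content": "What's LoRA?"},
                  {"role": "assistant", "content": "A small adapter."}]},
    {"messages": [{"role": "user", "content": "What's LoRA?"},
                  {"role": "assistant", "content": "A small adapter."}]},
]
print(f"{len(examples)} examples -> {len(dedupe(examples))} after dedupe")  # 2 -> 1
```

Near-duplicates (same answer, slightly reworded question) still need a human eye — this only catches byte-identical repeats.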
2 Choose a Base Model
You're not training from scratch — you're adapting an existing model. Pick one that fits your hardware and use case.
| Base Model | Params | RAM Needed | Good For |
| --- | --- | --- | --- |
| Llama 3.2 1B | 1B | ~4GB | Quick experiments, edge devices |
| Llama 3.2 3B | 3B | ~8GB | Good balance for personal assistants |
| Llama 3.1 8B | 8B | ~16GB | Best quality for consumer hardware |
| Mistral 7B v0.3 | 7B | ~16GB | Strong reasoning, fast inference |
| Gemma 2 9B | 9B | ~20GB | Excellent instruction following |
| Qwen 2.5 7B | 7B | ~16GB | Strong coding and multilingual |
💡 Start with 3B. It trains fast, needs less RAM, and you'll see results in minutes. Once your data and workflow are dialled in, scale up to 7B/8B for the final model.
3 Fine-Tune with MLX (Mac)
MLX is Apple's machine learning framework. It runs fine-tuning directly on Apple Silicon using unified memory — no NVIDIA GPU needed.
```shell
# Install MLX and the LM tools
pip install mlx-lm
```
Split your data into training and validation sets:
```shell
# Create a data directory
mkdir -p data

# Use ~90% for training, ~10% for validation
# If you have 100 examples:
head -90 train.jsonl > data/train.jsonl
tail -10 train.jsonl > data/valid.jsonl
```
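The training step itself is launched with the `mlx_lm.lora` command. A minimal sketch, assuming the 3B model recommended earlier and the `data/` directory just created — treat the flag values as starting points, and note that some mlx-lm versions name the layer flag `--num-layers` instead of `--lora-layers`:

```shell
# Train a LoRA adapter; checkpoints are written to ./adapters by default
mlx_lm.lora \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --train \
  --data data \
  --iters 400 \
  --batch-size 4 \
  --lora-layers 16 \
  --learning-rate 1e-5
```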
Training takes 10–30 minutes on an M1/M2/M3 Mac, depending on dataset size. You'll see the training loss decrease over iterations:
```
# Expected output:
Iter 100: train loss 1.842, val loss 1.901
Iter 200: train loss 1.234, val loss 1.456
Iter 300: train loss 0.891, val loss 1.123
Iter 400: train loss 0.654, val loss 0.987
# Loss going down = model is learning your data
```
Test your fine-tuned model before exporting:
```shell
mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path adapters \
  --prompt "How should I structure a new project?"
```
💡 Key parameters:
- `--iters` — more iterations = more learning, but risk overfitting. Start with 200–600.
- `--lora-layers` — how many layers to adapt. 8–16 is the sweet spot.
- `--learning-rate` — how fast it learns. Too high = unstable, too low = slow. 1e-5 is safe.
- `--batch-size` — higher = faster training but more memory. 4 is safe for 16GB.
4 Fine-Tune with Unsloth (NVIDIA GPU)
If you have an NVIDIA GPU, Unsloth is the fastest option — 2x faster than standard training with 60% less memory.
```shell
# Install Unsloth
pip install unsloth
```
Create a training script:
```python
from unsloth import FastLanguageModel

# Load the base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank — 8-32, higher = more capacity
    lora_alpha=16,   # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Load your dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Train
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-5,
        output_dir="outputs",
        logging_steps=10,
    ),
)
trainer.train()

model.save_pretrained("my-finetuned-model")
print("Done! Model saved to my-finetuned-model/")
```
| Hardware | VRAM | Max Model (QLoRA) | Training Speed |
| --- | --- | --- | --- |
| RTX 3060 | 12GB | 7B–8B | ~20 min for 500 examples |
| RTX 4070 | 12GB | 7B–8B | ~12 min for 500 examples |
| RTX 3090 / 4090 | 24GB | 13B–14B | ~8 min for 500 examples |
| Apple M2 Pro 16GB | shared | 3B–7B (MLX) | ~25 min for 500 examples |
| Apple M3 Max 36GB | shared | 8B–13B (MLX) | ~15 min for 500 examples |
5 Export & Use in Ollama
Convert your fine-tuned model to GGUF format so Ollama can run it.
From MLX:
```shell
# Fuse the LoRA adapters into the base model
mlx_lm.fuse \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path adapters \
  --save-path fused-model \
  --de-quantize

# Convert to GGUF (needs llama.cpp)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py ../fused-model --outfile my-model-f16.gguf --outtype f16

# convert_hf_to_gguf.py only emits full/near-full precision types —
# use llama-quantize for the q4_K_M k-quant
cmake -B build && cmake --build build --target llama-quantize
./build/bin/llama-quantize my-model-f16.gguf my-model.gguf Q4_K_M
```
From Unsloth:
```python
# Unsloth can export to GGUF directly
model.save_pretrained_gguf(
    "my-model", tokenizer,
    quantization_method="q4_k_m"  # good balance of size vs quality
)
# Output: my-model/my-model-Q4_K_M.gguf
```
Create an Ollama Modelfile and import:
```shell
# Create a Modelfile
cat > Modelfile <<'EOF'
FROM ./my-model.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful assistant trained on custom data."
EOF

# Import into Ollama
ollama create my-custom-model -f Modelfile

# Test it!
ollama run my-custom-model "How should I structure a new project?"
```
💡 Quantization options:
- `q4_K_M` — best balance of quality and size (recommended)
- `q5_K_M` — slightly better quality, ~25% larger
- `q8_0` — near-original quality, 2x the size of q4
- `f16` — full precision, largest file, best quality
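The size trade-offs above follow from bits per weight. A rough estimator sketch — the bits-per-weight constants are approximations, not exact GGUF figures:

```python
# Approximate bits per weight for common GGUF quantization types (rough values)
BITS_PER_WEIGHT = {"q4_K_M": 4.8, "q5_K_M": 5.7, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(params_billions, quant):
    """Estimate GGUF file size in GB from parameter count and quant type."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"3B model @ {quant}: ~{gguf_size_gb(3, quant):.1f} GB")
```

Real files run slightly larger because of tokenizer data, metadata, and mixed-precision tensors, but the estimate is close enough to plan disk space.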
⚠️ Watch for overfitting. If your model starts repeating training examples verbatim instead of generalizing, you've overtrained. Signs: val loss goes up while train loss keeps dropping. Fix: fewer iterations, more diverse data, or lower learning rate.
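That val-loss signal can be checked mechanically. A sketch of a simple early-stopping check over the validation losses printed during training — a hypothetical helper, not part of MLX or Unsloth:

```python
def overfit_point(val_losses, patience=2):
    """Index of the best validation loss once it has failed to improve
    for `patience` consecutive checks, or None if still improving."""
    best, best_i, since = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, since = loss, i, 0
        else:
            since += 1
            if since >= patience:
                return best_i
    return None

# Val losses logged every 100 iterations: improving, then rising
val_losses = [1.901, 1.456, 1.123, 1.150, 1.240]
best = overfit_point(val_losses)
if best is not None:
    print(f"Val loss bottomed out around iter {(best + 1) * 100}")  # iter 300
```

If the check fires, rerun with `--iters` capped near that point, or use the adapter checkpoint saved closest to the best validation loss.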
✅ What You've Set Up
- Training data prepared in JSONL chat format
- LoRA fine-tuning via MLX (Mac) or Unsloth (NVIDIA GPU)
- A custom model that speaks in your voice and knows your domain
- GGUF export and import into Ollama for daily use
- Understanding of key parameters: iterations, rank, learning rate
Next Steps
- Iterate on your data — fine-tuning is a loop. Train, test, find gaps, add examples, retrain. Each round gets better.
- Store training data on your NAS — keep JSONL files centralized so you can train from any machine. See the NAS guide.
- Build an eval pipeline — write test prompts and expected answers, score your model automatically. This tells you if a new training run is actually better.
- Merge multiple LoRA adapters — train separate adapters for different skills (coding, writing, domain knowledge) and merge them into one model.
- Share your model — push GGUF files to Hugging Face or serve them from your AI server for your whole network.
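A first eval pipeline can be as simple as keyword coverage. A sketch — the scoring scheme is a stand-in, and in practice you would pipe each prompt through `ollama run` and score the real output:

```python
def keyword_score(answer, expected_keywords):
    """Fraction of expected keywords that appear in the answer."""
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# Each case: a prompt plus keywords a good answer should mention
cases = [
    ("What's LoRA?", ["adapter", "low-rank"]),
    ("When should I fine-tune vs RAG?", ["documents", "style"]),
]

# A canned answer stands in for real model output here
answer = "LoRA trains a small low-rank adapter instead of all the weights."
print(f"score: {keyword_score(answer, cases[0][1]):.2f}")  # score: 1.00
```

Track the average score across all cases per training run; a run that raises it is a real improvement, not just a different-sounding model.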
💡 This is the AI OS workflow. AI OS collects your conversations, preferences, and patterns over time. That data becomes training material. Fine-tune a model on it, and you get an AI that actually knows you — not a generic assistant, but your assistant. See the Ollama guide to get started.