
Fine-Tune a Model on Your Own Data

Take a base model and train it on your writing style, your docs, your domain. LoRA fine-tuning runs on consumer hardware — no cloud GPUs required.

⏱ ~45 minutes 💻 Mac (MLX) / Linux+GPU (Unsloth) 💾 16GB+ RAM

What You'll Need

  • Apple Silicon Mac with 16GB+ RAM (for MLX path) or Linux/Windows with an NVIDIA GPU, 8GB+ VRAM (for Unsloth path)
  • Python 3.10+
  • Ollama installed (to use your finished model — see our Ollama guide)
  • Training data — your writing, conversations, documentation, or domain knowledge
💡 Two paths in this guide: MLX for Apple Silicon Macs (uses unified memory, no GPU needed) and Unsloth for NVIDIA GPUs (faster, supports larger models). Both produce a model you can export to Ollama. Pick the one that matches your hardware.

1 Prepare Your Training Data

Fine-tuning needs structured examples in JSONL format — one JSON object per line. The format depends on what you're training the model to do.

Chat format (teach it to respond like you):

```jsonl
{"messages": [{"role": "user", "content": "How should I structure this project?"}, {"role": "assistant", "content": "Start with the data layer. Get your schema right first, then build the API on top. Don't touch the frontend until the API works."}]}
{"messages": [{"role": "user", "content": "What do you think about microservices?"}, {"role": "assistant", "content": "Overkill for most projects. Start with a monolith. Split it when you actually have scaling problems, not before."}]}
```

Instruction format (teach it a task):

```jsonl
{"messages": [{"role": "system", "content": "You are a technical writer."}, {"role": "user", "content": "Explain DNS in one paragraph."}, {"role": "assistant", "content": "DNS translates domain names to IP addresses..."}]}
```

How much data do you need?

| Goal | Examples Needed | Effect |
|---|---|---|
| Style/tone adaptation | 50–200 | Model picks up your voice and phrasing |
| Domain knowledge | 200–1,000 | Learns your field's terminology and patterns |
| Task specialization | 500–5,000 | Becomes reliable at a specific workflow |
| Deep expertise | 5,000+ | Approaches expert-level in a narrow domain |
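As a quick sanity check, you can map your example count to the tiers above. The helper below is hypothetical (not from any library), with thresholds taken from the table:

```python
def suggest_goal(n_examples: int) -> str:
    """Map a dataset size to the rough effect tiers from the table above."""
    if n_examples >= 5000:
        return "deep expertise"
    if n_examples >= 500:
        return "task specialization"
    if n_examples >= 200:
        return "domain knowledge"
    if n_examples >= 50:
        return "style/tone adaptation"
    return "too few -- collect more examples"

print(suggest_goal(120))  # style/tone adaptation
```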

Quick script to convert your chat exports to JSONL:

```python
import json

# Example: turn a list of Q&A pairs into training data
pairs = [
    ("What's LoRA?",
     "Low-Rank Adaptation — a way to fine-tune a model by training a small adapter instead of all the weights. Uses 10-100x less memory than full fine-tuning."),
    ("When should I fine-tune vs RAG?",
     "Fine-tune when you want to change how the model talks or thinks. RAG when you want to give it access to specific documents. They're complementary."),
]

with open("train.jsonl", "w") as f:
    for q, a in pairs:
        json.dump({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": a},
        ]}, f)
        f.write("\n")

print(f"Wrote {len(pairs)} examples to train.jsonl")
```
⚠️ Quality over quantity. 100 excellent examples beat 10,000 sloppy ones. Every training example teaches the model a pattern — bad examples teach bad patterns. Review your data. Remove duplicates, fix typos, and cut anything you wouldn't want the model to repeat.
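That review step can be partly automated. The sketch below (a minimal example, assuming the chat format shown above) drops malformed lines, exact duplicates, and examples that don't end with an assistant turn:

```python
import json

def clean_jsonl(in_path: str, out_path: str) -> int:
    """Drop malformed lines and exact duplicates from a JSONL training file.

    Returns the number of examples kept.
    """
    seen = set()
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # malformed line, skip it
            msgs = obj.get("messages")
            if not msgs or msgs[-1].get("role") != "assistant":
                continue  # every example should end with an assistant turn
            key = json.dumps(obj, sort_keys=True)
            if key in seen:
                continue  # exact duplicate
            seen.add(key)
            dst.write(key + "\n")
            kept += 1
    return kept
```

It only catches mechanical problems; you still need to read the examples yourself to judge quality.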

2 Choose a Base Model

You're not training from scratch — you're adapting an existing model. Pick one that fits your hardware and use case.

| Base Model | Params | RAM Needed | Good For |
|---|---|---|---|
| Llama 3.2 1B | 1B | ~4GB | Quick experiments, edge devices |
| Llama 3.2 3B | 3B | ~8GB | Good balance for personal assistants |
| Llama 3.1 8B | 8B | ~16GB | Best quality for consumer hardware |
| Mistral 7B v0.3 | 7B | ~16GB | Strong reasoning, fast inference |
| Gemma 2 9B | 9B | ~20GB | Excellent instruction following |
| Qwen 2.5 7B | 7B | ~16GB | Strong coding and multilingual |
💡 Start with 3B. It trains fast, needs less RAM, and you'll see results in minutes. Once your data and workflow are dialled in, scale up to 7B/8B for the final model.

3 Fine-Tune with MLX (Mac)

MLX is Apple's machine learning framework. It runs fine-tuning directly on Apple Silicon using unified memory — no NVIDIA GPU needed.

```shell
# Install MLX and the LM tools
pip install mlx-lm
```

Split your data into training and validation sets:

```shell
# Create a data directory
mkdir -p data

# Use ~90% for training, ~10% for validation
# If you have 100 examples:
head -90 train.jsonl > data/train.jsonl
tail -10 train.jsonl > data/valid.jsonl
```
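The head/tail split works, but it keeps the file's original order; if your examples are grouped by topic, the validation set won't be representative. A shuffled split avoids that. A small sketch, assuming the `train.jsonl` produced in step 1:

```python
import random

def split_jsonl(path: str, train_path: str, valid_path: str,
                valid_frac: float = 0.1, seed: int = 42) -> tuple:
    """Shuffle examples, then hold out ~10% for validation.

    Returns (n_train, n_valid).
    """
    with open(path) as f:
        lines = [line for line in f if line.strip()]
    random.Random(seed).shuffle(lines)  # fixed seed -> reproducible split
    n_valid = max(1, int(len(lines) * valid_frac))
    with open(valid_path, "w") as f:
        f.writelines(lines[:n_valid])
    with open(train_path, "w") as f:
        f.writelines(lines[n_valid:])
    return len(lines) - n_valid, n_valid
```

The fixed seed means you get the same split every run, so training runs stay comparable.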

Run the fine-tune:

```shell
mlx_lm.lora \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --data ./data \
  --train \
  --iters 600 \
  --batch-size 4 \
  --lora-layers 16 \
  --learning-rate 1e-5
```

This takes 10–30 minutes on an M1/M2/M3 Mac depending on dataset size. You'll see training loss decrease over iterations:

```
# Expected output:
Iter 100: train loss 1.842, val loss 1.901
Iter 200: train loss 1.234, val loss 1.456
Iter 300: train loss 0.891, val loss 1.123
Iter 400: train loss 0.654, val loss 0.987
# Loss going down = model is learning your data
```

Test your fine-tuned model before exporting:

```shell
mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path adapters \
  --prompt "How should I structure a new project?"
```
💡 Key parameters:
  • --iters — more iterations = more learning, but risk overfitting. Start with 200–600.
  • --lora-layers — how many layers to adapt. 8–16 is the sweet spot.
  • --learning-rate — how fast it learns. Too high = unstable, too low = slow. 1e-5 is safe.
  • --batch-size — higher = faster training but more memory. 4 is safe for 16GB.

4 Fine-Tune with Unsloth (NVIDIA GPU)

If you have an NVIDIA GPU, Unsloth is the fastest option — 2x faster than standard training with 60% less memory.

```shell
# Install Unsloth
pip install unsloth
```

Create a training script:

```python
from unsloth import FastLanguageModel

# Load the base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank — 8-32, higher = more capacity
    lora_alpha=16,   # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Load your dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Train
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-5,
        output_dir="outputs",
        logging_steps=10,
    ),
)
trainer.train()

model.save_pretrained("my-finetuned-model")
print("Done! Model saved to my-finetuned-model/")
```
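To see why LoRA is so cheap, count the parameters it actually trains: for each adapted weight matrix of shape (d_out, d_in), LoRA adds two small matrices of shapes (d_out, r) and (r, d_in), so only r × (d_in + d_out) parameters per matrix are trainable. A sketch with illustrative dimensions (not the exact shapes of any Llama model):

```python
def lora_param_count(shapes, r=16):
    """Trainable params LoRA adds: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

# Illustrative projection shapes for one transformer layer:
# four attention projections plus three MLP projections
layer_shapes = [(3072, 3072)] * 4 + [(8192, 3072)] * 2 + [(3072, 8192)]
per_layer = lora_param_count(layer_shapes, r=16)
print(per_layer)  # well under a million trainable params per layer
```

Compare that with the tens of millions of frozen parameters per layer, and the memory savings in the table below follow directly.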
| GPU | VRAM | Max Model (QLoRA) | Training Speed |
|---|---|---|---|
| RTX 3060 | 12GB | 7B–8B | ~20 min for 500 examples |
| RTX 4070 | 12GB | 7B–8B | ~12 min for 500 examples |
| RTX 3090 / 4090 | 24GB | 13B–14B | ~8 min for 500 examples |
| Apple M2 Pro | 16GB shared | 3B–7B (MLX) | ~25 min for 500 examples |
| Apple M3 Max | 36GB shared | 8B–13B (MLX) | ~15 min for 500 examples |

5 Export & Use in Ollama

Convert your fine-tuned model to GGUF format so Ollama can run it.

From MLX:

```shell
# Fuse the LoRA adapters into the base model
mlx_lm.fuse \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path adapters \
  --save-path fused-model \
  --de-quantize

# Convert to GGUF (needs llama.cpp)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py ../fused-model --outfile my-model-f16.gguf --outtype f16

# convert_hf_to_gguf.py can't write q4_K_M directly, so quantize in a second step
cmake -B build && cmake --build build --target llama-quantize
./build/bin/llama-quantize my-model-f16.gguf my-model.gguf q4_K_M
```

From Unsloth:

```python
# Unsloth can export to GGUF directly
model.save_pretrained_gguf(
    "my-model",
    tokenizer,
    quantization_method="q4_k_m",  # good balance of size vs quality
)
# Output: my-model/my-model-Q4_K_M.gguf
```

Create an Ollama Modelfile and import:

```shell
# Create a Modelfile
cat > Modelfile <<'EOF'
FROM ./my-model.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful assistant trained on custom data."
EOF

# Import into Ollama
ollama create my-custom-model -f Modelfile

# Test it!
ollama run my-custom-model "How should I structure a new project?"
```
💡 Quantization options:
  • q4_K_M — best balance of quality and size (recommended)
  • q5_K_M — slightly better quality, ~25% larger
  • q8_0 — near-original quality, 2x the size of q4
  • f16 — full precision, largest file, best quality
⚠️ Watch for overfitting. If your model starts repeating training examples verbatim instead of generalizing, you've overtrained. Signs: val loss goes up while train loss keeps dropping. Fix: fewer iterations, more diverse data, or lower learning rate.
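That overfitting signal is easy to check mechanically. A sketch that flags the checkpoint where validation loss has been climbing while training loss keeps dropping (the thresholds and patience value are arbitrary choices, not from any training framework):

```python
def overfit_point(train_losses, val_losses, patience=2):
    """Return the index where val loss has risen for `patience` consecutive
    checkpoints while train loss kept falling; None if the run looks healthy."""
    rising = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] > val_losses[i - 1] and train_losses[i] < train_losses[i - 1]:
            rising += 1
            if rising >= patience:
                return i
        else:
            rising = 0
    return None

# Matches the healthy run shown in step 3: both losses falling -> None
print(overfit_point([1.84, 1.23, 0.89, 0.65], [1.90, 1.46, 1.12, 0.99]))
```

Feed it the loss values from your training log; a non-None result tells you roughly where to cut `--iters` on the next run.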

✅ What You've Set Up

  • Training data prepared in JSONL chat format
  • LoRA fine-tuning via MLX (Mac) or Unsloth (NVIDIA GPU)
  • A custom model that speaks in your voice and knows your domain
  • GGUF export and import into Ollama for daily use
  • Understanding of key parameters: iterations, rank, learning rate

Next Steps

  • Iterate on your data — fine-tuning is a loop. Train, test, find gaps, add examples, retrain. Each round gets better.
  • Store training data on your NAS — keep JSONL files centralized so you can train from any machine. See the NAS guide.
  • Build an eval pipeline — write test prompts and expected answers, score your model automatically. This tells you if a new training run is actually better.
  • Merge multiple LoRA adapters — train separate adapters for different skills (coding, writing, domain knowledge) and merge them into one model.
  • Share your model — push GGUF files to Hugging Face or serve them from your AI server for your whole network.
💡 This is the AI OS workflow. AI OS collects your conversations, preferences, and patterns over time. That data becomes training material. Fine-tune a model on it, and you get an AI that actually knows you — not a generic assistant, but your assistant. See the Ollama guide to get started.
