Take a base model and train it on your writing style, your docs, your domain. LoRA fine-tuning runs on consumer hardware — no cloud GPUs required.
⏱ ~45 minutes · 💻 Mac (MLX) / Linux+GPU (Unsloth) · 💾 16GB+ RAM
What You'll Need
- Apple Silicon Mac with 16GB+ RAM (for the MLX path) or Linux/Windows with an NVIDIA GPU, 8GB+ VRAM (for the Unsloth path)
- Python 3.10+
- Ollama installed (to use your finished model — see our Ollama guide)
- Training data — your writing, conversations, documentation, or domain knowledge
💡 Two paths in this guide: MLX for Apple Silicon Macs (uses unified memory, no discrete GPU needed) and Unsloth for NVIDIA GPUs (faster, supports larger models). Both produce a model you can export to Ollama. Pick the one that matches your hardware.
1 Prepare Your Training Data
Fine-tuning needs structured examples in JSONL format — one JSON object per line. The format depends on what you're training the model to do.
Chat format (teach it to respond like you):
```jsonl
{"messages": [{"role": "user", "content": "How should I structure this project?"}, {"role": "assistant", "content": "Start with the data layer. Get your schema right first, then build the API on top. Don't touch the frontend until the API works."}]}
{"messages": [{"role": "user", "content": "What do you think about microservices?"}, {"role": "assistant", "content": "Overkill for most projects. Start with a monolith. Split it when you actually have scaling problems, not before."}]}
```
Instruction format (teach it a task):
```jsonl
{"messages": [{"role": "system", "content": "You are a technical writer."}, {"role": "user", "content": "Explain DNS in one paragraph."}, {"role": "assistant", "content": "DNS translates domain names to IP addresses..."}]}
```
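A malformed line can fail an entire training run, so it's worth validating the file before you train. A minimal checker sketch — the `train.jsonl` filename and the role set are assumptions matching the examples above:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl(path):
    """Return a list of error strings; an empty list means the file looks trainable."""
    errors = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            messages = obj.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {lineno}: missing 'messages' list")
                continue
            for msg in messages:
                if msg.get("role") not in VALID_ROLES:
                    errors.append(f"line {lineno}: bad role {msg.get('role')!r}")
                if not isinstance(msg.get("content"), str):
                    errors.append(f"line {lineno}: missing 'content' string")
    return errors

# Demonstrate on a tiny sample file (point this at your real train.jsonl)
with open("train.jsonl", "w") as f:
    f.write(json.dumps({"messages": [
        {"role": "user", "content": "What's LoRA?"},
        {"role": "assistant", "content": "A small trainable adapter."},
    ]}) + "\n")

problems = validate_jsonl("train.jsonl")
print("OK" if not problems else "\n".join(problems))  # OK
```

Run it once before every training run; fixing a bad line takes seconds, while a crashed run wastes the whole session.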
How much data do you need?
| Goal | Examples Needed | Effect |
| --- | --- | --- |
| Style/tone adaptation | 50–200 | Model picks up your voice and phrasing |
| Domain knowledge | 200–1,000 | Learns your field's terminology and patterns |
| Task specialization | 500–5,000 | Becomes reliable at a specific workflow |
| Deep expertise | 5,000+ | Approaches expert-level in a narrow domain |
Quick script to convert your chat exports to JSONL:
```python
import json

# Example: turn a list of Q&A pairs into training data
pairs = [
    ("What's LoRA?", "Low-Rank Adaptation — a way to fine-tune a model by training a small adapter instead of all the weights. Uses 10-100x less memory than full fine-tuning."),
    ("When should I fine-tune vs RAG?", "Fine-tune when you want to change how the model talks or thinks. RAG when you want to give it access to specific documents. They're complementary."),
]

with open("train.jsonl", "w") as f:
    for q, a in pairs:
        json.dump({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": a}
        ]}, f)
        f.write("\n")

print(f"Wrote {len(pairs)} examples to train.jsonl")
```
⚠️ Quality over quantity. 100 excellent examples beat 10,000 sloppy ones. Every training example teaches the model a pattern — bad examples teach bad patterns. Review your data. Remove duplicates, fix typos, and cut anything you wouldn't want the model to repeat.
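Part of that review pass can be automated. A sketch that drops exact duplicates from a list of examples — a hypothetical helper, not a standard tool:

```python
import json

def dedupe(examples):
    """Drop exact-duplicate examples, keeping first occurrence and order."""
    seen, kept = set(), []
    for example in examples:
        key = json.dumps(example, sort_keys=True)  # canonical form for comparison
        if key not in seen:
            seen.add(key)
            kept.append(example)
    return kept

examples = [
    {"messages": [{"role": "user", "content": "What's LoRA?"},
                  {"role": "assistant", "content": "A small adapter."}]},
    {"messages": [{"role": "user", "content": "What's LoRA?"},
                  {"role": "assistant", "content": "A small adapter."}]},
]
print(f"{len(examples)} examples -> {len(dedupe(examples))} after dedupe")  # 2 -> 1
```

Near-duplicates (same answer, slightly reworded question) still need a human eye — this only catches byte-identical repeats.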
2 Choose a Base Model
You're not training from scratch — you're adapting an existing model. Pick one that fits your hardware and use case.
| Base Model | Params | RAM Needed | Good For |
| --- | --- | --- | --- |
| Llama 3.2 1B | 1B | ~4GB | Quick experiments, edge devices |
| Llama 3.2 3B | 3B | ~8GB | Good balance for personal assistants |
| Llama 3.1 8B | 8B | ~16GB | Best quality for consumer hardware |
| Mistral 7B v0.3 | 7B | ~16GB | Strong reasoning, fast inference |
| Gemma 2 9B | 9B | ~20GB | Excellent instruction following |
| Qwen 2.5 7B | 7B | ~16GB | Strong coding and multilingual |
💡 Start with 3B. It trains fast, needs less RAM, and you'll see results in minutes. Once your data and workflow are dialled in, scale up to 7B/8B for the final model.
3 Fine-Tune with MLX (Mac)
MLX is Apple's machine learning framework. It runs fine-tuning directly on Apple Silicon using unified memory — no NVIDIA GPU needed.
```shell
# Install MLX and the LM tools
pip install mlx-lm
```
Split your data into training and validation sets:
```shell
# Create a data directory
mkdir -p data

# Use ~90% for training, ~10% for validation
# If you have 100 examples:
head -90 train.jsonl > data/train.jsonl
tail -10 train.jsonl > data/valid.jsonl
```
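The training step itself is launched with the `mlx_lm.lora` command. A minimal sketch, assuming the 3B model recommended earlier and the `data/` directory just created — treat the flag values as starting points, and note that some mlx-lm versions name the layer flag `--num-layers` instead of `--lora-layers`:

```shell
# Train a LoRA adapter; checkpoints are written to ./adapters by default
mlx_lm.lora \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --train \
  --data data \
  --iters 400 \
  --batch-size 4 \
  --lora-layers 16 \
  --learning-rate 1e-5
```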
Training takes 10–30 minutes on an M1/M2/M3 Mac, depending on dataset size. You'll see the training loss decrease over iterations:
```
# Expected output:
Iter 100: train loss 1.842, val loss 1.901
Iter 200: train loss 1.234, val loss 1.456
Iter 300: train loss 0.891, val loss 1.123
Iter 400: train loss 0.654, val loss 0.987
# Loss going down = model is learning your data
```
Test your fine-tuned model before exporting:
```shell
mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path adapters \
  --prompt "How should I structure a new project?"
```
💡 Key parameters:
- `--iters` — more iterations = more learning, but risk overfitting. Start with 200–600.
- `--lora-layers` — how many layers to adapt. 8–16 is the sweet spot.
- `--learning-rate` — how fast it learns. Too high = unstable, too low = slow. 1e-5 is safe.
- `--batch-size` — higher = faster training but more memory. 4 is safe for 16GB.
4 Fine-Tune with Unsloth (NVIDIA GPU)
If you have an NVIDIA GPU, Unsloth is the fastest option — 2x faster than standard training with 60% less memory.
```shell
# Install Unsloth
pip install unsloth
```
Create a training script:
```python
from unsloth import FastLanguageModel

# Load the base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank — 8-32, higher = more capacity
    lora_alpha=16,   # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Load your dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Train
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-5,
        output_dir="outputs",
        logging_steps=10,
    ),
)
trainer.train()

model.save_pretrained("my-finetuned-model")
print("Done! Model saved to my-finetuned-model/")
```
| Hardware | VRAM | Max Model (QLoRA) | Training Speed |
| --- | --- | --- | --- |
| RTX 3060 | 12GB | 7B–8B | ~20 min for 500 examples |
| RTX 4070 | 12GB | 7B–8B | ~12 min for 500 examples |
| RTX 3090 / 4090 | 24GB | 13B–14B | ~8 min for 500 examples |
| Apple M2 Pro 16GB | shared | 3B–7B (MLX) | ~25 min for 500 examples |
| Apple M3 Max 36GB | shared | 8B–13B (MLX) | ~15 min for 500 examples |
5 Export & Use in Ollama
Convert your fine-tuned model to GGUF format so Ollama can run it.
From MLX:
```shell
# Fuse the LoRA adapters into the base model
mlx_lm.fuse \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path adapters \
  --save-path fused-model \
  --de-quantize

# Convert to GGUF (needs llama.cpp)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py ../fused-model --outfile my-model-f16.gguf --outtype f16

# convert_hf_to_gguf.py only emits full/near-full precision types —
# use llama-quantize for the q4_K_M k-quant
cmake -B build && cmake --build build --target llama-quantize
./build/bin/llama-quantize my-model-f16.gguf my-model.gguf Q4_K_M
```
From Unsloth:
```python
# Unsloth can export to GGUF directly
model.save_pretrained_gguf(
    "my-model", tokenizer,
    quantization_method="q4_k_m"  # good balance of size vs quality
)
# Output: my-model/my-model-Q4_K_M.gguf
```
Create an Ollama Modelfile and import:
```shell
# Create a Modelfile
cat > Modelfile <<'EOF'
FROM ./my-model.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful assistant trained on custom data."
EOF

# Import into Ollama
ollama create my-custom-model -f Modelfile

# Test it!
ollama run my-custom-model "How should I structure a new project?"
```
💡 Quantization options:
- `q4_K_M` — best balance of quality and size (recommended)
- `q5_K_M` — slightly better quality, ~25% larger
- `q8_0` — near-original quality, 2x the size of q4
- `f16` — full precision, largest file, best quality
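The size trade-offs above follow from bits per weight. A rough estimator sketch — the bits-per-weight constants are approximations, not exact GGUF figures:

```python
# Approximate bits per weight for common GGUF quantization types (rough values)
BITS_PER_WEIGHT = {"q4_K_M": 4.8, "q5_K_M": 5.7, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(params_billions, quant):
    """Estimate GGUF file size in GB from parameter count and quant type."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"3B model @ {quant}: ~{gguf_size_gb(3, quant):.1f} GB")
```

Real files run slightly larger because of tokenizer data, metadata, and mixed-precision tensors, but the estimate is close enough to plan disk space.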
⚠️ Watch for overfitting. If your model starts repeating training examples verbatim instead of generalizing, you've overtrained. Signs: val loss goes up while train loss keeps dropping. Fix: fewer iterations, more diverse data, or lower learning rate.
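That val-loss signal can be checked mechanically. A sketch of a simple early-stopping check over the validation losses printed during training — a hypothetical helper, not part of MLX or Unsloth:

```python
def overfit_point(val_losses, patience=2):
    """Index of the best validation loss once it has failed to improve
    for `patience` consecutive checks, or None if still improving."""
    best, best_i, since = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, since = loss, i, 0
        else:
            since += 1
            if since >= patience:
                return best_i
    return None

# Val losses logged every 100 iterations: improving, then rising
val_losses = [1.901, 1.456, 1.123, 1.150, 1.240]
best = overfit_point(val_losses)
if best is not None:
    print(f"Val loss bottomed out around iter {(best + 1) * 100}")  # iter 300
```

If the check fires, rerun with `--iters` capped near that point, or use the adapter checkpoint saved closest to the best validation loss.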
✅ What You've Set Up
- Training data prepared in JSONL chat format
- LoRA fine-tuning via MLX (Mac) or Unsloth (NVIDIA GPU)
- A custom model that speaks in your voice and knows your domain
- GGUF export and import into Ollama for daily use
- Understanding of key parameters: iterations, rank, learning rate
Next Steps
- Iterate on your data — fine-tuning is a loop. Train, test, find gaps, add examples, retrain. Each round gets better.
- Store training data on your NAS — keep JSONL files centralized so you can train from any machine. See the NAS guide.
- Build an eval pipeline — write test prompts and expected answers, score your model automatically. This tells you if a new training run is actually better.
- Merge multiple LoRA adapters — train separate adapters for different skills (coding, writing, domain knowledge) and merge them into one model.
- Share your model — push GGUF files to Hugging Face or serve them from your AI server for your whole network.
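A first eval pipeline can be as simple as keyword coverage. A sketch — the scoring scheme is a stand-in, and in practice you would pipe each prompt through `ollama run` and score the real output:

```python
def keyword_score(answer, expected_keywords):
    """Fraction of expected keywords that appear in the answer."""
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# Each case: a prompt plus keywords a good answer should mention
cases = [
    ("What's LoRA?", ["adapter", "low-rank"]),
    ("When should I fine-tune vs RAG?", ["documents", "style"]),
]

# A canned answer stands in for real model output here
answer = "LoRA trains a small low-rank adapter instead of all the weights."
print(f"score: {keyword_score(answer, cases[0][1]):.2f}")  # score: 1.00
```

Track the average score across all cases per training run; a run that raises it is a real improvement, not just a different-sounding model.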
💡 This is the AI OS workflow. AI OS collects your conversations, preferences, and patterns over time. That data becomes training material. Fine-tune a model on it, and you get an AI that actually knows you — not a generic assistant, but your assistant. See the Ollama guide to get started.