Inside the Vision Training Pipeline: How we build and fine‑tune MTG card vision models

2025-10-11 | 6 min read

TL;DR

We convert local card images into concise vision descriptions and embeddings (via a local Ollama server), turn those descriptions into a structured fine‑tuning dataset (JSONL), and then fine‑tune a Gemma 3 model with LoRA adapters. The process supports incremental training: the embedding builders are parallelized, training uses checkpoints, and adapters are merged and exported for serving with Ollama or as GGUF.

Goals and high-level contract

  • Input: a set of local card images in public/cards/ plus the canonical card metadata from content/cards/*.
  • Output: a LoRA adapter (saved under mtg-card-model-*), fine‑tuning JSONL in training-data/vision-finetuning.jsonl, and an embeddings JSON at content/cards/card-vision-embeddings.json.
  • Success criteria: the model reliably identifies card names from images and the adapter can be iteratively improved by re-running short training cycles using different samples.

Pipeline overview (steps)

  1. Build vision embeddings from card images (several runner scripts available):

    • scripts/runBuildCardVisionEmbeddings.cjs — sequential builder
    • scripts/runBuildCardVisionEmbeddingsParallel.cjs — parallel worker-thread builder (preferred for speed)
    • scripts/runBuildCardVisionEmbeddingsSimple.cjs — lightweight variant for quick tests

    These scripts write content/cards/card-vision-embeddings.json (cacheable) containing a cards array of objects with cardId, name, description, embedding, and print metadata (set, rarity, releasedAt); a sample record is sketched after this list.

  2. Create random training subsets (optional):

    • scripts/createRandomTrainingSet.ps1 [N] picks N random cards from the vision embeddings and writes a temporary embeddings file content/cards/card-vision-embeddings-random-$N.json.
    • It then runs npm run cards:prepare-finetuning which executes scripts/runConvertToFineTuning.cjs to build the actual fine‑tuning JSONL.
  3. Convert embeddings -> fine‑tuning dataset:

    • scripts/runConvertToFineTuning.cjs reads the embeddings JSON, finds the corresponding local image file in public/cards/ (expects {cardId}-normal.jpg|jpeg|png|webp), and emits
      • training-data/vision-finetuning.jsonl — the Unsloth/Gemma message format JSONL with messages containing an image entry (no <image> token in prompt) and the assistant completion (card name + metadata + description),
      • training-data/classification.csv and training-data/embedding-triplets.json for other training tasks and diagnostics,
      • training-data/Modelfile.template — an Ollama Modelfile hint for serving the merged model.
  4. Fine‑tune the base vision model with LoRA adapters:

    • train_mtg_vision.py (the trainer) loads the base model (unsloth/gemma-3-4b-it) and either:
      • Attaches LoRA adapters (if starting fresh), or
      • Loads the model with existing adapters from the adapter output directory if found.
    • The script converts the JSONL messages to a Hugging Face Dataset, builds an SFTTrainer configured for vision data, and runs trainer.train().
    • The trainer periodically saves Hugging Face-style checkpoints (e.g., checkpoint-50) and finally saves the adapters in OUTPUT_DIR for later merging/export.
  5. Export/serve:

    • scripts/exportToOllama.py / scripts/testMergedModel.py let you merge the adapter and test the merged model; scripts/serveFinetunedModel.py can serve the model locally.
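
For reference, a single entry in the cards array written in step 1 looks roughly like the following. The field names come from the list above; the cardId, the description wording, and the embedding values are illustrative, and the vector is truncated for readability.

    # One record from content/cards/card-vision-embeddings.json (illustrative values).
    example_card_record = {
        "cardId": "black-lotus",          # hypothetical id; pairs with public/cards/black-lotus-normal.jpg
        "name": "Black Lotus",
        "description": "Black-framed card; the art shows a dark lotus flower on a muted background. "
                       "Set: Limited Edition Alpha. Rarity: Rare. Released: 1993-08-05.",
        "embedding": [0.0123, -0.0456, 0.0789],  # truncated; length depends on the embedding model
        "set": "Limited Edition Alpha",
        "rarity": "Rare",
        "releasedAt": "1993-08-05",
    }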

Key implementation details

Vision embedding builder (parallel)

  • buildCardVisionEmbeddingsParallel.ts launches worker threads and does the following per card:
    • Find a local image file for the card (supporting .jpg, .jpeg, .png, .webp).
    • Read the image as base64 and POST it to a local Ollama vision model (/api/generate) with a tightly-scoped prompt asking for a concise visual description (frame color, main subject, background, art style). The worker enforces simple sanitization and appends printing metadata (set, rarity, release date) to the description.
    • POST that description to the Ollama embeddings endpoint (/api/embeddings) to receive an embedding vector.
    • Each worker returns a result object which the main process accumulates into a JSON cache (content/cards/card-vision-embeddings.json) and saves incrementally while workers run.
  • The parallel builder is robust: it skips cards already cached, saves progress incrementally (so large runs survive interruptions), and reports time per card and a rough speedup by worker count.
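
The actual builder is a Node worker-thread script, but per card the flow boils down to two Ollama calls: one to /api/generate for the description and one to /api/embeddings for the vector. Below is a minimal Python sketch of that flow; the default model names and the prompt wording are assumptions, while the real script reads OLLAMA_HOST and OLLAMA_VISION_MODEL from the environment and uses its own prompt template.

    import base64
    import json
    import os
    import urllib.request

    OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    VISION_MODEL = os.environ.get("OLLAMA_VISION_MODEL", "llama3.2-vision")  # assumed default
    EMBED_MODEL = os.environ.get("OLLAMA_EMBED_MODEL", "nomic-embed-text")   # assumed default

    def _post(path: str, payload: dict) -> dict:
        # Small helper for Ollama's JSON-over-HTTP API.
        req = urllib.request.Request(
            f"{OLLAMA_HOST}{path}",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def describe_and_embed(image_path: str) -> tuple[str, list[float]]:
        # 1) Ask the vision model for a concise visual description of the card image.
        image_b64 = base64.b64encode(open(image_path, "rb").read()).decode("ascii")
        gen = _post("/api/generate", {
            "model": VISION_MODEL,
            "prompt": "Describe this Magic card image concisely: frame color, "
                      "main subject, background, art style.",
            "images": [image_b64],
            "stream": False,
        })
        description = gen["response"].strip()

        # 2) Embed the description text.
        emb = _post("/api/embeddings", {"model": EMBED_MODEL, "prompt": description})
        return description, emb["embedding"]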

Converting embeddings into a fine‑tuning JSONL

  • runConvertToFineTuning.cjs iterates over the vision embeddings' cards[], finds each card's local image file, and constructs a user message containing an explicit { type: 'image', image: '<path>' } entry followed by a short text prompt ("What Magic card is this? Identify the card name and describe its artwork."), plus an assistant message whose completion is a normalized description: This is <card name>. <Set info>. Rarity: <Rarity>. <Description>.
  • The conversion script performs validation (ensures prompts do NOT contain <image> tokens because the message structure already supplies the image), writes training-data/vision-finetuning.jsonl (Unsloth message format), classification.csv, embedding-triplets.json, and a Modelfile.template for easy Ollama serving.
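
Concretely, each line of training-data/vision-finetuning.jsonl is one JSON object in that message format. Here is a sketch of how a single example could be assembled; the exact content-key layout Unsloth expects, especially for the assistant turn, is an assumption on my part, not copied from the conversion script.

    import json

    def to_training_example(card: dict, image_path: str) -> dict:
        # Completion follows the "This is <card name>. <Set info>. Rarity: <Rarity>. <Description>." template.
        completion = (
            f"This is {card['name']}. Set: {card['set']}. "
            f"Rarity: {card['rarity']}. {card['description']}"
        )
        return {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image_path},  # the image entry; no <image> token in the text
                        {"type": "text", "text": "What Magic card is this? "
                                                 "Identify the card name and describe its artwork."},
                    ],
                },
                {
                    "role": "assistant",
                    "content": [{"type": "text", "text": completion}],
                },
            ]
        }

    # Each example is written as one JSON object per line:
    # fh.write(json.dumps(to_training_example(card, path)) + "\n")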

Fine‑tuning script details (LoRA, checkpoints, resume)

  • train_mtg_vision.py uses Unsloth's FastVisionModel, TRL's SFTTrainer, and an Unsloth-specific data collator (UnslothVisionDataCollator).

  • Behavior overview:

    • If an adapter already exists in the OUTPUT_DIR (detected via adapter_model.safetensors), the script loads the model from that directory (so adapter weights are reused).
    • If no adapter exists, it loads the base model (unsloth/gemma-3-4b-it) and configures LoRA adapters using FastVisionModel.get_peft_model(...) with tuned hyperparameters (r=16, alpha=16, rslora, dropout, etc.).
    • The script transforms the JSONL into a Hugging Face Dataset, configures SFTTrainer with parameters such as per-device batch size, gradient accumulation, learning rate, warmup, save steps (50), save total limit, and runs the training loop.
  • Important addition: the trainer code now detects Hugging Face-style checkpoints (checkpoint-*) inside the adapter output directory and will resume from the most recent checkpoint via trainer.train(resume_from_checkpoint=...) if found. This preserves optimizer/Adam moments, scheduler state, and trainer step counters across separate Python processes — essential for safe incremental training across multiple runs.
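
The resume behavior is a small amount of glue around trainer.train(). A simplified sketch of the detection logic follows; the output directory name is illustrative and the real train_mtg_vision.py may differ in details.

    import os
    import re

    OUTPUT_DIR = "mtg-card-model-adapter"  # illustrative adapter output directory

    def find_latest_checkpoint(output_dir: str) -> str | None:
        # Return the most recent Hugging Face-style checkpoint-* directory, if any.
        if not os.path.isdir(output_dir):
            return None
        checkpoints = [
            d for d in os.listdir(output_dir)
            if re.fullmatch(r"checkpoint-\d+", d) and os.path.isdir(os.path.join(output_dir, d))
        ]
        if not checkpoints:
            return None
        latest = max(checkpoints, key=lambda d: int(d.split("-")[1]))
        return os.path.join(output_dir, latest)

    # Inside the trainer script (conceptually):
    # resume_path = find_latest_checkpoint(OUTPUT_DIR)
    # if resume_path:
    #     print(f"Resuming from checkpoint: {resume_path}")
    #     trainer.train(resume_from_checkpoint=resume_path)
    # else:
    #     trainer.train()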

How to use the overnight random-cycle script

The overnight PowerShell loop does the following each cycle:

  • Calls scripts/createRandomTrainingSet.ps1 100 to produce a random sample and convert it into training-data/vision-finetuning.jsonl.
  • Archives a copy named training-data/vision-finetuning-cycle-XX.jsonl for traceability.
  • Calls python train_mtg_vision.py which runs a fresh Python process.

With the updated trainer resume logic, repeated cycles will:

  • Load the adapter weights and any available checkpoints, then resume training so optimizer/scheduler state is preserved.
  • Use the newly generated vision-finetuning.jsonl (the canonical path is used by the trainer), so each cycle can show the adapter a different random sample.

This is an effective way to bootstrap a LoRA adapter from many small random samples while preserving training continuity via checkpoints.
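
If you prefer to drive the cycles from Python rather than PowerShell, the same loop is a few subprocess calls. The following is a rough sketch, not a drop-in replacement: the cycle count is arbitrary and it assumes pwsh is on PATH.

    import shutil
    import subprocess

    CYCLES = 10          # illustrative
    SAMPLE_SIZE = "100"  # cards per cycle, as in the PowerShell loop

    for cycle in range(1, CYCLES + 1):
        # 1) Build a random sample and convert it to training-data/vision-finetuning.jsonl.
        subprocess.run(
            ["pwsh", "-File", "scripts/createRandomTrainingSet.ps1", SAMPLE_SIZE],
            check=True,
        )
        # 2) Archive the generated dataset for traceability.
        shutil.copy(
            "training-data/vision-finetuning.jsonl",
            f"training-data/vision-finetuning-cycle-{cycle:02d}.jsonl",
        )
        # 3) Run a fresh trainer process; it resumes from the latest checkpoint if one exists.
        subprocess.run(["python", "train_mtg_vision.py"], check=True)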

Best practices and tips

  • Checkpoint frequency and size: save_steps=50 is a reasonable default for fast iteration, but adjust it according to batch size and how much work you are willing to redo after an interruption. Keep save_total_limit tuned to avoid unbounded disk usage.
  • Resume policy: always prefer to resume from the latest checkpoint rather than re-loading the adapter-only weights. Resume preserves optimizer state and learning-rate schedules.
  • Data hygiene: ensure image files exist in public/cards/ (the conversion script will skip cards with missing images). The builder logs missing images so you can backfill with npm run cards:download-images.
  • Test locally first: run a tiny batch (cards:vision-50) and a short trainer run to verify the loop.
  • Deterministic seeds: the trainer sets seed 3407 and the LoRA initialization uses random_state=3407 — this helps reproducibility across runs.
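
For orientation, the knobs mentioned above live in the SFTConfig handed to SFTTrainer. The fragment below is illustrative: only save_steps=50 and the 3407 seed come from this post; the remaining numbers are placeholders rather than the script's actual hyperparameters.

    from trl import SFTConfig

    training_args = SFTConfig(
        output_dir="mtg-card-model-adapter",  # illustrative adapter output directory
        per_device_train_batch_size=2,        # placeholder
        gradient_accumulation_steps=4,        # placeholder
        learning_rate=2e-4,                   # placeholder
        warmup_steps=5,                       # placeholder
        save_steps=50,                        # checkpoint every 50 optimizer steps
        save_total_limit=3,                   # keep only the newest checkpoints on disk
        seed=3407,                            # matches the LoRA random_state for reproducibility
    )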

Verification & debugging checklist

  • After a run, check training-data/ for vision-finetuning.jsonl and training-data/Modelfile.template.
  • During training, watch logs/ for trainer logs and confirm that checkpoint directories checkpoint-* are produced.
  • On restart, confirm the trainer prints Resuming from checkpoint: <path> in stdout.
  • Use scripts/testMergedModel.py or scripts/testFinetunedModel.py to validate inference quality on held-out images.
  • If the Ollama vision endpoint fails during embedding builds, verify OLLAMA_HOST and OLLAMA_VISION_MODEL environment variables and that your local Ollama server is running and the required models are pulled.
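
When the embedding build fails, a quick sanity check is to hit the Ollama tags endpoint with the same OLLAMA_HOST the builder uses and confirm the expected models have been pulled. A small sketch:

    import json
    import os
    import urllib.request

    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    with urllib.request.urlopen(f"{host}/api/tags") as resp:  # lists locally pulled models
        models = [m["name"] for m in json.loads(resp.read())["models"]]

    print("Ollama is reachable at", host)
    print("Pulled models:", models)
    # The value of OLLAMA_VISION_MODEL (and whichever embedding model you configured)
    # should appear in this list.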

Exporting and serving

  • After you are happy with the adapter, merge it for serving via scripts/exportToOllama.py (or scripts/mergeLoRA.py if you prefer different workflows) and test with scripts/testMergedModel.py.
  • For Ollama, the Modelfile.template in training-data/ is a small convenience to serve the merged model with a Modelfile that points to the adapter.
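
scripts/exportToOllama.py owns the merge step in this repo; for readers curious what a merge generally looks like, here is a generic PEFT-based sketch. The model class, the directory names, and the assumption that a plain merge_and_unload suffices for this adapter are mine, not taken from that script.

    from peft import PeftModel
    from transformers import AutoModelForImageTextToText, AutoProcessor

    BASE_MODEL = "unsloth/gemma-3-4b-it"    # base model used by the trainer
    ADAPTER_DIR = "mtg-card-model-adapter"  # illustrative adapter output directory
    MERGED_DIR = "mtg-card-model-merged"    # illustrative destination

    # Load the base vision model, attach the LoRA adapter, and fold it into the weights.
    base = AutoModelForImageTextToText.from_pretrained(BASE_MODEL, torch_dtype="auto")
    merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

    # Save the merged weights plus processor so the model can be converted or served downstream.
    merged.save_pretrained(MERGED_DIR)
    AutoProcessor.from_pretrained(BASE_MODEL).save_pretrained(MERGED_DIR)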

Next steps and improvements

  • Add an argument to train_mtg_vision.py so the trainer can accept arbitrary --data paths. That makes it trivial to point the trainer at training-data/vision-finetuning-cycle-XX.jsonl instead of the canonical file (a minimal sketch follows this list).
  • Add a short harness that runs N cycles automatically and pushes intermediate adapters into a controlled directory, with a summary artifact listing validation images, per-cycle losses, and sample inference outputs to measure progress across cycles.
  • Consider filtering augmentations or balancing by rarity/set to avoid overfitting on common art styles.
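
For the first item, the change to train_mtg_vision.py is small. A minimal sketch of the proposed --data argument follows; the argument name and default path match the suggestion above, and load_jsonl_as_hf_dataset is a hypothetical stand-in for the script's existing loading code.

    import argparse

    parser = argparse.ArgumentParser(description="Fine-tune the MTG card vision model")
    parser.add_argument(
        "--data",
        default="training-data/vision-finetuning.jsonl",  # current canonical path
        help="Path to the fine-tuning JSONL (e.g. a per-cycle vision-finetuning-cycle-XX.jsonl)",
    )
    args = parser.parse_args()

    # The rest of the script would load args.data instead of the hard-coded path, e.g.:
    # dataset = load_jsonl_as_hf_dataset(args.data)  # hypothetical helper

The PowerShell loop could then pass each cycle's archived file explicitly instead of relying on the canonical path.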

Closing

This pipeline gives you a practical, incremental way to build a Gemma-based MTG card vision model using local images, Ollama for vision + embeddings, and LoRA adapters for cheap, iterative fine‑tuning. With checkpoint-aware resumes, the overnight random-sample workflow becomes a valid strategy to slowly improve adapter quality while preserving optimizer and scheduler state — the key change we added to make cycles truly cumulative.

