Inside the Vision Training Pipeline: How we build and fine‑tune MTG card vision models

2025-10-11 | 6 min read

TL;DR

We convert local card images into concise vision descriptions and embeddings (via a local Ollama server), turn those descriptions into a structured fine‑tuning dataset (JSONL), and then fine‑tune a Gemma 3 model with LoRA adapters. The process supports incremental training: the embedding builders are parallelized, training uses checkpoints, and adapters are merged and exported for serving with Ollama or as GGUF.

Goals and high-level contract

  • Input: a set of local card images in public/cards/ plus the canonical card metadata from content/cards/*.
  • Output: a LoRA adapter (saved under mtg-card-model-*), fine‑tuning JSONL in training-data/vision-finetuning.jsonl, and an embeddings JSON at content/cards/card-vision-embeddings.json.
  • Success criteria: the model reliably identifies card names from images and the adapter can be iteratively improved by re-running short training cycles using different samples.

Pipeline overview (steps)

  1. Build vision embeddings from card images (several runner scripts available):

    • scripts/runBuildCardVisionEmbeddings.cjs — sequential builder
    • scripts/runBuildCardVisionEmbeddingsParallel.cjs — parallel worker-thread builder (preferred for speed)
    • scripts/runBuildCardVisionEmbeddingsSimple.cjs — lightweight variant for quick tests

    These scripts write content/cards/card-vision-embeddings.json (cacheable) containing a cards array of objects with cardId, name, description, embedding, and print metadata (set, rarity, releasedAt); a sample record is sketched after this list.

  2. Create random training subsets (optional):

    • scripts/createRandomTrainingSet.ps1 [N] picks N random cards from the vision embeddings and writes a temporary embeddings file content/cards/card-vision-embeddings-random-$N.json.
    • It then runs npm run cards:prepare-finetuning which executes scripts/runConvertToFineTuning.cjs to build the actual fine‑tuning JSONL.
  3. Convert embeddings -> fine‑tuning dataset:

    • scripts/runConvertToFineTuning.cjs reads the embeddings JSON, finds the corresponding local image file in public/cards/ (expects {cardId}-normal.jpg|jpeg|png|webp), and emits
      • training-data/vision-finetuning.jsonl — the Unsloth/Gemma message format JSONL with messages containing an image entry (no <image> token in prompt) and the assistant completion (card name + metadata + description),
      • training-data/classification.csv and training-data/embedding-triplets.json for other training tasks and diagnostics,
      • training-data/Modelfile.template — an Ollama Modelfile hint for serving the merged model.
  4. Fine‑tune the base vision model with LoRA adapters:

    • train_mtg_vision.py (the trainer) loads the base model (unsloth/gemma-3-4b-it) and either:
      • Attaches LoRA adapters (if starting fresh), or
      • Loads the model with existing adapters from the adapter output directory if found.
    • The script converts the JSONL messages to a Hugging Face Dataset, builds an SFTTrainer configured for vision data, and runs trainer.train().
    • The trainer periodically saves Hugging Face-style checkpoints (e.g., checkpoint-50) and finally saves the adapters in OUTPUT_DIR for later merging/export.
  5. Export/serve:

    • scripts/exportToOllama.py / scripts/testMergedModel.py let you merge the adapter and test the merged model; scripts/serveFinetunedModel.py can serve the model locally.
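
For reference, a single entry in the cards array written in step 1 looks roughly like the following. The field names come from the list above; the cardId, the description wording, and the embedding values are illustrative, and the vector is truncated for readability.

    # One record from content/cards/card-vision-embeddings.json (illustrative values).
    example_card_record = {
        "cardId": "black-lotus",          # hypothetical id; pairs with public/cards/black-lotus-normal.jpg
        "name": "Black Lotus",
        "description": "Black-framed card; the art shows a dark lotus flower on a muted background. "
                       "Set: Limited Edition Alpha. Rarity: Rare. Released: 1993-08-05.",
        "embedding": [0.0123, -0.0456, 0.0789],  # truncated; length depends on the embedding model
        "set": "Limited Edition Alpha",
        "rarity": "Rare",
        "releasedAt": "1993-08-05",
    }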

Key implementation details

Vision embedding builder (parallel)

  • buildCardVisionEmbeddingsParallel.ts launches worker threads and does the following per card:
    • Find a local image file for the card (supporting .jpg, .jpeg, .png, .webp).
    • Read the image as base64 and POST it to a local Ollama vision model (/api/generate) with a tightly-scoped prompt asking for a concise visual description (frame color, main subject, background, art style). The worker enforces simple sanitization and appends printing metadata (set, rarity, release date) to the description.
    • POST that description to the Ollama embeddings endpoint (/api/embeddings) to receive an embedding vector.
    • Each worker returns a result object which the main process accumulates into a JSON cache (content/cards/card-vision-embeddings.json) and saves incrementally while workers run.
  • The parallel builder is robust: it skips cards already cached, saves progress incrementally (so large runs survive interruptions), and reports time per card and a rough speedup by worker count.
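
The actual builder is a Node worker-thread script, but per card the flow boils down to two Ollama calls: one to /api/generate for the description and one to /api/embeddings for the vector. Below is a minimal Python sketch of that flow; the default model names and the prompt wording are assumptions, while the real script reads OLLAMA_HOST and OLLAMA_VISION_MODEL from the environment and uses its own prompt template.

    import base64
    import json
    import os
    import urllib.request

    OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    VISION_MODEL = os.environ.get("OLLAMA_VISION_MODEL", "llama3.2-vision")  # assumed default
    EMBED_MODEL = os.environ.get("OLLAMA_EMBED_MODEL", "nomic-embed-text")   # assumed default

    def _post(path: str, payload: dict) -> dict:
        # Small helper for Ollama's JSON-over-HTTP API.
        req = urllib.request.Request(
            f"{OLLAMA_HOST}{path}",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def describe_and_embed(image_path: str) -> tuple[str, list[float]]:
        # 1) Ask the vision model for a concise visual description of the card image.
        image_b64 = base64.b64encode(open(image_path, "rb").read()).decode("ascii")
        gen = _post("/api/generate", {
            "model": VISION_MODEL,
            "prompt": "Describe this Magic card image concisely: frame color, "
                      "main subject, background, art style.",
            "images": [image_b64],
            "stream": False,
        })
        description = gen["response"].strip()

        # 2) Embed the description text.
        emb = _post("/api/embeddings", {"model": EMBED_MODEL, "prompt": description})
        return description, emb["embedding"]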

Converting embeddings into a fine‑tuning JSONL

  • runConvertToFineTuning.cjs iterates over the vision embeddings' cards[], finds each card's local image file, and constructs a user message containing an explicit { type: 'image', image: '<path>' } entry followed by a short text prompt ("What Magic card is this? Identify the card name and describe its artwork."), plus an assistant message whose completion is a normalized description: This is <card name>. <Set info>. Rarity: <Rarity>. <Description>.
  • The conversion script performs validation (ensures prompts do NOT contain <image> tokens because the message structure already supplies the image), writes training-data/vision-finetuning.jsonl (Unsloth message format), classification.csv, embedding-triplets.json, and a Modelfile.template for easy Ollama serving.
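
Concretely, each line of training-data/vision-finetuning.jsonl is one JSON object in that message format. Here is a sketch of how a single example could be assembled; the exact content-key layout Unsloth expects, especially for the assistant turn, is an assumption on my part, not copied from the conversion script.

    import json

    def to_training_example(card: dict, image_path: str) -> dict:
        # Completion follows the "This is <card name>. <Set info>. Rarity: <Rarity>. <Description>." template.
        completion = (
            f"This is {card['name']}. Set: {card['set']}. "
            f"Rarity: {card['rarity']}. {card['description']}"
        )
        return {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image_path},  # the image entry; no <image> token in the text
                        {"type": "text", "text": "What Magic card is this? "
                                                 "Identify the card name and describe its artwork."},
                    ],
                },
                {
                    "role": "assistant",
                    "content": [{"type": "text", "text": completion}],
                },
            ]
        }

    # Each example is written as one JSON object per line:
    # fh.write(json.dumps(to_training_example(card, path)) + "\n")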

Fine‑tuning script details (LoRA, checkpoints, resume)

  • train_mtg_vision.py uses Unsloth's FastVisionModel, TRL's SFTTrainer, and an Unsloth-specific data collator (UnslothVisionDataCollator).

  • Behavior overview:

    • If an adapter already exists in the OUTPUT_DIR (detected via adapter_model.safetensors), the script loads the model from that directory (so adapter weights are reused).
    • If no adapter exists, it loads the base model (unsloth/gemma-3-4b-it) and configures LoRA adapters using FastVisionModel.get_peft_model(...) with tuned hyperparameters (r=16, alpha=16, rslora, dropout, etc.).
    • The script transforms the JSONL into a Hugging Face Dataset, configures SFTTrainer with parameters such as per-device batch size, gradient accumulation, learning rate, warmup, save steps (50), save total limit, and runs the training loop.
  • Important addition: the trainer code now detects Hugging Face-style checkpoints (checkpoint-*) inside the adapter output directory and will resume from the most recent checkpoint via trainer.train(resume_from_checkpoint=...) if found. This preserves optimizer/Adam moments, scheduler state, and trainer step counters across separate Python processes — essential for safe incremental training across multiple runs.
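
The resume behavior is a small amount of glue around trainer.train(). A simplified sketch of the detection logic follows; the output directory name is illustrative and the real train_mtg_vision.py may differ in details.

    import os
    import re

    OUTPUT_DIR = "mtg-card-model-adapter"  # illustrative adapter output directory

    def find_latest_checkpoint(output_dir: str) -> str | None:
        # Return the most recent Hugging Face-style checkpoint-* directory, if any.
        if not os.path.isdir(output_dir):
            return None
        checkpoints = [
            d for d in os.listdir(output_dir)
            if re.fullmatch(r"checkpoint-\d+", d) and os.path.isdir(os.path.join(output_dir, d))
        ]
        if not checkpoints:
            return None
        latest = max(checkpoints, key=lambda d: int(d.split("-")[1]))
        return os.path.join(output_dir, latest)

    # Inside the trainer script (conceptually):
    # resume_path = find_latest_checkpoint(OUTPUT_DIR)
    # if resume_path:
    #     print(f"Resuming from checkpoint: {resume_path}")
    #     trainer.train(resume_from_checkpoint=resume_path)
    # else:
    #     trainer.train()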

How to use the overnight random-cycle script

The overnight PowerShell loop does the following each cycle:

  • Calls scripts/createRandomTrainingSet.ps1 100 to produce a random sample and convert it into training-data/vision-finetuning.jsonl.
  • Archives a copy named training-data/vision-finetuning-cycle-XX.jsonl for traceability.
  • Calls python train_mtg_vision.py which runs a fresh Python process.

With the updated trainer resume logic, repeated cycles will:

  • Load the adapter weights and any available checkpoints, then resume training so optimizer/scheduler state is preserved.
  • Use the newly generated vision-finetuning.jsonl (the canonical path is used by the trainer), so each cycle can show the adapter a different random sample.

This is an effective way to bootstrap a LoRA adapter from many small random samples while preserving training continuity via checkpoints.
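
If you prefer to drive the cycles from Python rather than PowerShell, the same loop is a few subprocess calls. The following is a rough sketch, not a drop-in replacement: the cycle count is arbitrary and it assumes pwsh is on PATH.

    import shutil
    import subprocess

    CYCLES = 10          # illustrative
    SAMPLE_SIZE = "100"  # cards per cycle, as in the PowerShell loop

    for cycle in range(1, CYCLES + 1):
        # 1) Build a random sample and convert it to training-data/vision-finetuning.jsonl.
        subprocess.run(
            ["pwsh", "-File", "scripts/createRandomTrainingSet.ps1", SAMPLE_SIZE],
            check=True,
        )
        # 2) Archive the generated dataset for traceability.
        shutil.copy(
            "training-data/vision-finetuning.jsonl",
            f"training-data/vision-finetuning-cycle-{cycle:02d}.jsonl",
        )
        # 3) Run a fresh trainer process; it resumes from the latest checkpoint if one exists.
        subprocess.run(["python", "train_mtg_vision.py"], check=True)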

Best practices and tips

  • Checkpoint frequency and size: save_steps=50 is a reasonable default for fast iteration, but adjust it according to batch size and how much work you are willing to redo after an interruption. Keep save_total_limit tuned to avoid unbounded disk usage.
  • Resume policy: always prefer to resume from the latest checkpoint rather than re-loading the adapter-only weights. Resume preserves optimizer state and learning-rate schedules.
  • Data hygiene: ensure image files exist in public/cards/ (the conversion script will skip cards with missing images). The builder logs missing images so you can backfill with npm run cards:download-images.
  • Test locally first: run a tiny batch (cards:vision-50) and a short trainer run to verify the loop.
  • Deterministic seeds: the trainer sets seed 3407 and the LoRA initialization uses random_state=3407 — this helps reproducibility across runs.
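
For orientation, the knobs mentioned above live in the SFTConfig handed to SFTTrainer. The fragment below is illustrative: only save_steps=50 and the 3407 seed come from this post; the remaining numbers are placeholders rather than the script's actual hyperparameters.

    from trl import SFTConfig

    training_args = SFTConfig(
        output_dir="mtg-card-model-adapter",  # illustrative adapter output directory
        per_device_train_batch_size=2,        # placeholder
        gradient_accumulation_steps=4,        # placeholder
        learning_rate=2e-4,                   # placeholder
        warmup_steps=5,                       # placeholder
        save_steps=50,                        # checkpoint every 50 optimizer steps
        save_total_limit=3,                   # keep only the newest checkpoints on disk
        seed=3407,                            # matches the LoRA random_state for reproducibility
    )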

Verification & debugging checklist

  • After a run, check training-data/ for vision-finetuning.jsonl and training-data/Modelfile.template.
  • During training, watch logs/ for trainer logs and confirm that checkpoint directories checkpoint-* are produced.
  • On restart, confirm the trainer prints Resuming from checkpoint: <path> in stdout.
  • Use scripts/testMergedModel.py or scripts/testFinetunedModel.py to validate inference quality on held-out images.
  • If the Ollama vision endpoint fails during embedding builds, verify OLLAMA_HOST and OLLAMA_VISION_MODEL environment variables and that your local Ollama server is running and the required models are pulled.
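
When the embedding build fails, a quick sanity check is to hit the Ollama tags endpoint with the same OLLAMA_HOST the builder uses and confirm the expected models have been pulled. A small sketch:

    import json
    import os
    import urllib.request

    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    with urllib.request.urlopen(f"{host}/api/tags") as resp:  # lists locally pulled models
        models = [m["name"] for m in json.loads(resp.read())["models"]]

    print("Ollama is reachable at", host)
    print("Pulled models:", models)
    # The value of OLLAMA_VISION_MODEL (and whichever embedding model you configured)
    # should appear in this list.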

Exporting and serving

  • After you are happy with the adapter, merge it for serving via scripts/exportToOllama.py (or scripts/mergeLoRA.py if you prefer different workflows) and test with scripts/testMergedModel.py.
  • For Ollama, the Modelfile.template in training-data/ is a small convenience to serve the merged model with a Modelfile that points to the adapter.
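
scripts/exportToOllama.py owns the merge step in this repo; for readers curious what a merge generally looks like, here is a generic PEFT-based sketch. The model class, the directory names, and the assumption that a plain merge_and_unload suffices for this adapter are mine, not taken from that script.

    from peft import PeftModel
    from transformers import AutoModelForImageTextToText, AutoProcessor

    BASE_MODEL = "unsloth/gemma-3-4b-it"    # base model used by the trainer
    ADAPTER_DIR = "mtg-card-model-adapter"  # illustrative adapter output directory
    MERGED_DIR = "mtg-card-model-merged"    # illustrative destination

    # Load the base vision model, attach the LoRA adapter, and fold it into the weights.
    base = AutoModelForImageTextToText.from_pretrained(BASE_MODEL, torch_dtype="auto")
    merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

    # Save the merged weights plus processor so the model can be converted or served downstream.
    merged.save_pretrained(MERGED_DIR)
    AutoProcessor.from_pretrained(BASE_MODEL).save_pretrained(MERGED_DIR)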

Next steps and improvements

  • Add an argument to train_mtg_vision.py so the trainer can accept arbitrary --data paths. That makes it trivial to point the trainer at training-data/vision-finetuning-cycle-XX.jsonl instead of the canonical file (a minimal sketch follows this list).
  • Add a short harness that runs N cycles automatically and pushes intermediate adapters into a controlled directory, with a summary artifact listing validation images, per-cycle losses, and sample inference outputs to measure progress across cycles.
  • Consider filtering augmentations or balancing by rarity/set to avoid overfitting on common art styles.
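
For the first item, the change to train_mtg_vision.py is small. A minimal sketch of the proposed --data argument follows; the argument name and default path match the suggestion above, and load_jsonl_as_hf_dataset is a hypothetical stand-in for the script's existing loading code.

    import argparse

    parser = argparse.ArgumentParser(description="Fine-tune the MTG card vision model")
    parser.add_argument(
        "--data",
        default="training-data/vision-finetuning.jsonl",  # current canonical path
        help="Path to the fine-tuning JSONL (e.g. a per-cycle vision-finetuning-cycle-XX.jsonl)",
    )
    args = parser.parse_args()

    # The rest of the script would load args.data instead of the hard-coded path, e.g.:
    # dataset = load_jsonl_as_hf_dataset(args.data)  # hypothetical helper

The PowerShell loop could then pass each cycle's archived file explicitly instead of relying on the canonical path.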

Closing

This pipeline gives you a practical, incremental way to build a Gemma-based MTG card vision model using local images, Ollama for vision + embeddings, and LoRA adapters for cheap, iterative fine‑tuning. With checkpoint-aware resumes, the overnight random-sample workflow becomes a valid strategy to slowly improve adapter quality while preserving optimizer and scheduler state — the key change we added to make cycles truly cumulative.

