Inside the Vision Training Pipeline: How we build and fine‑tune MTG card vision models
2025-10-11 | 6 min read
TL;DR
We convert local card images into concise vision descriptions and embeddings (via a local Ollama server), turn those descriptions into a structured fine‑tuning dataset (JSONL), then fine‑tune a Gemma 3 model using LoRA adapters. The process supports incremental training: the embedding builders are parallelized, training resumes from checkpoints, and adapters can be exported for serving via Ollama or as GGUF.
Goals and high-level contract
- Input: a set of local card images in `public/cards/` plus the canonical card metadata from `content/cards/*`.
- Output: a LoRA adapter (saved under `mtg-card-model-*`), fine‑tuning JSONL in `training-data/vision-finetuning.jsonl`, and an embeddings JSON at `content/cards/card-vision-embeddings.json`.
- Success criteria: the model reliably identifies card names from images, and the adapter can be iteratively improved by re-running short training cycles on different samples.
Pipeline overview (steps)
1. Build vision embeddings from card images (several runner scripts available):
   - `scripts/runBuildCardVisionEmbeddings.cjs` — sequential builder
   - `scripts/runBuildCardVisionEmbeddingsParallel.cjs` — parallel worker-thread builder (preferred for speed)
   - `scripts/runBuildCardVisionEmbeddingsSimple.cjs` — lightweight variant for quick tests

   These scripts write `content/cards/card-vision-embeddings.json` (cacheable), containing a `cards` array of objects with `cardId`, `name`, `description`, `embedding`, and print metadata (set, rarity, releasedAt). An example record is sketched after this list.

2. Create random training subsets (optional):
   - `scripts/createRandomTrainingSet.ps1 [N]` picks N random cards from the vision embeddings and writes a temporary embeddings file `content/cards/card-vision-embeddings-random-$N.json`.
   - It then runs `npm run cards:prepare-finetuning`, which executes `scripts/runConvertToFineTuning.cjs` to build the actual fine‑tuning JSONL.

3. Convert embeddings -> fine‑tuning dataset:
   - `scripts/runConvertToFineTuning.cjs` reads the embeddings JSON, finds the corresponding local image file in `public/cards/` (expects `{cardId}-normal.jpg|jpeg|png|webp`), and emits:
     - `training-data/vision-finetuning.jsonl` — the Unsloth/Gemma message-format JSONL, with messages containing an image entry (no `<image>` token in the prompt) and the assistant completion (card name + metadata + description),
     - `training-data/classification.csv` and `training-data/embedding-triplets.json` for other training tasks and diagnostics,
     - `training-data/Modelfile.template` — an Ollama Modelfile hint for serving the merged model.

4. Fine‑tune the base vision model with LoRA adapters:
   - `train_mtg_vision.py` (the trainer) loads the base model (`unsloth/gemma-3-4b-it`) and either attaches LoRA adapters (if starting fresh) or loads the model with existing adapters from the adapter output directory if found.
   - The script converts the JSONL messages to a Hugging Face Dataset, builds an SFTTrainer configured for vision data, and runs `trainer.train()`.
   - The trainer periodically saves Hugging Face-style checkpoints (e.g., `checkpoint-50`) and finally saves the adapters in `OUTPUT_DIR` for later merging/export.

5. Export/serve:
   - `scripts/exportToOllama.py` / `scripts/testMergedModel.py` let you merge the adapter and test the merged model; `scripts/serveFinetunedModel.py` can serve the model locally.
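For orientation, here is a hypothetical single entry from `content/cards/card-vision-embeddings.json`, shown as a Python dict. The field names follow the structure described in step 1; the card values and the truncated embedding are placeholders.

```python
import json

# Hypothetical record from content/cards/card-vision-embeddings.json.
# Field names match the structure described above; values are illustrative only.
example_card = {
    "cardId": "abc123",                      # placeholder card identifier
    "name": "Lightning Bolt",
    "description": (
        "Red-framed card. Main subject: a forked bolt of lightning striking a figure. "
        "Dark stormy background, painterly art style. "
        "Set: Magic 2011. Rarity: Common. Released: 2010-07-16."
    ),
    "embedding": [0.0123, -0.0456, 0.0789],  # truncated; real vectors are much longer
    "set": "Magic 2011",
    "rarity": "Common",
    "releasedAt": "2010-07-16",
}

# Quick sanity check against the real cache file:
with open("content/cards/card-vision-embeddings.json", encoding="utf-8") as f:
    cache = json.load(f)
print(f"{len(cache['cards'])} cards cached; first card: {cache['cards'][0]['name']}")
```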
Key implementation details
Vision embedding builder (parallel)
`buildCardVisionEmbeddingsParallel.ts` launches worker threads and does the following per card (the Ollama calls are sketched in Python after this list):
- Find a local image file for the card (supporting `.jpg`, `.jpeg`, `.png`, `.webp`).
- Read the image as base64 and POST it to a local Ollama vision model (`/api/generate`) with a tightly scoped prompt asking for a concise visual description (frame color, main subject, background, art style). The worker performs simple sanitization and appends printing metadata (set, rarity, release date) to the description.
- POST that description to the Ollama embeddings endpoint (`/api/embeddings`) to receive an embedding vector.
- Each worker returns a result object, which the main process accumulates into a JSON cache (`content/cards/card-vision-embeddings.json`) and saves incrementally while workers run.
- The parallel builder is robust: it skips cards already cached, saves progress incrementally (so large runs survive interruptions), and reports time per card and a rough speedup by worker count.
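In Python terms (the real builder is TypeScript), the two Ollama calls each worker makes look roughly like the sketch below. The model names, prompt wording, and default host are assumptions; only the endpoints and the `OLLAMA_HOST` / `OLLAMA_VISION_MODEL` environment variables come from the pipeline described here.

```python
import base64
import os

import requests

OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
VISION_MODEL = os.environ.get("OLLAMA_VISION_MODEL", "llama3.2-vision")  # placeholder default
EMBED_MODEL = "nomic-embed-text"  # placeholder embedding model


def describe_and_embed(image_path: str) -> tuple[str, list[float]]:
    """Ask the local Ollama server for a visual description of a card, then embed it."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    # 1) Vision description via /api/generate (non-streaming).
    gen = requests.post(
        f"{OLLAMA_HOST}/api/generate",
        json={
            "model": VISION_MODEL,
            "prompt": "Describe this Magic card's artwork concisely: "
                      "frame color, main subject, background, art style.",
            "images": [image_b64],
            "stream": False,
        },
        timeout=300,
    )
    gen.raise_for_status()
    description = gen.json()["response"].strip()

    # 2) Embedding of the description via /api/embeddings.
    emb = requests.post(
        f"{OLLAMA_HOST}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": description},
        timeout=300,
    )
    emb.raise_for_status()
    return description, emb.json()["embedding"]
```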
Converting embeddings into a fine‑tuning JSONL
- `runConvertToFineTuning.cjs` iterates the vision embeddings' `cards[]`, finds the local image file, constructs a user message containing an explicit `{ type: 'image', image: '<path>' }` entry followed by a short prompt ("What Magic card is this? Identify the card name and describe its artwork."), and an assistant message whose completion contains a normalized description: `This is <card name>. <Set info>. Rarity: <Rarity>. <Description>.`
- The conversion script performs validation (ensures prompts do NOT contain `<image>` tokens, because the message structure already supplies the image) and writes `training-data/vision-finetuning.jsonl` (Unsloth message format), `classification.csv`, `embedding-triplets.json`, and a `Modelfile.template` for easy Ollama serving. One emitted record is sketched below.
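A rough sketch, in Python for illustration (the real converter is Node), of what one line of `training-data/vision-finetuning.jsonl` might look like. The exact message schema is an assumption based on the Unsloth/Gemma vision format described above.

```python
import json


def make_record(card: dict, image_path: str) -> dict:
    """Build one Unsloth-style message record for a card. Schema is illustrative."""
    completion = (
        f"This is {card['name']}. {card.get('set', 'Unknown set')}. "
        f"Rarity: {card.get('rarity', 'Unknown')}. {card['description']}"
    )
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    # The image is supplied here; no <image> token appears in the text prompt.
                    {"type": "image", "image": image_path},
                    {"type": "text", "text": "What Magic card is this? "
                                             "Identify the card name and describe its artwork."},
                ],
            },
            {"role": "assistant", "content": [{"type": "text", "text": completion}]},
        ]
    }


# Append as one JSON object per line (JSONL).
with open("training-data/vision-finetuning.jsonl", "a", encoding="utf-8") as out:
    record = make_record(
        {"name": "Example Card", "set": "Example Set", "rarity": "Rare",
         "description": "A placeholder description."},
        "public/cards/example-normal.jpg",
    )
    out.write(json.dumps(record, ensure_ascii=False) + "\n")
```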
Fine‑tuning script details (LoRA, checkpoints, resume)
- `train_mtg_vision.py` uses Unsloth's FastVisionModel, TRL's SFTTrainer, and an Unsloth-specific data collator (`UnslothVisionDataCollator`).
- Behavior overview:
  - If an adapter already exists in the `OUTPUT_DIR` (detected via `adapter_model.safetensors`), the script loads the model from that directory, so adapter weights are reused.
  - If no adapter exists, it loads the base model (`unsloth/gemma-3-4b-it`) and configures LoRA adapters using `FastVisionModel.get_peft_model(...)` with tuned hyperparameters (r=16, alpha=16, rslora, dropout, etc.).
  - The script transforms the JSONL into a Hugging Face Dataset, configures the SFTTrainer with parameters such as per-device batch size, gradient accumulation, learning rate, warmup, save steps (50), and save total limit, and runs the training loop.
- Important addition: the trainer now detects Hugging Face-style checkpoints (`checkpoint-*`) inside the adapter output directory and resumes from the most recent one via `trainer.train(resume_from_checkpoint=...)` when found. This preserves optimizer/Adam moments, scheduler state, and trainer step counters across separate Python processes — essential for safe incremental training across multiple runs. A minimal sketch of the detection logic follows below.
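A minimal sketch of that detection, assuming checkpoints sit directly inside the adapter output directory. The helper names are hypothetical; `trainer.train(resume_from_checkpoint=...)` is the standard TRL/Transformers call.

```python
import os
import re


def find_latest_checkpoint(output_dir: str) -> str | None:
    """Return the highest-numbered checkpoint-* directory inside output_dir, if any."""
    if not os.path.isdir(output_dir):
        return None
    found = []
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            found.append((int(match.group(1)), path))
    return max(found)[1] if found else None


def train_with_resume(trainer, output_dir: str) -> None:
    """Resume from the newest checkpoint when present, otherwise train from scratch."""
    latest = find_latest_checkpoint(output_dir)
    if latest:
        print(f"Resuming from checkpoint: {latest}")
        trainer.train(resume_from_checkpoint=latest)  # restores optimizer, scheduler, and step count
    else:
        trainer.train()
```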
How to use the overnight/random-cycle script (what you already had)
Your PowerShell snippet does the following each cycle:
- Calls the random-sample script (`scripts/createRandomTrainingSet.ps1`) with N=100 to produce a random sample and convert it into `training-data/vision-finetuning.jsonl`.
- Archives a copy named `training-data/vision-finetuning-cycle-XX.jsonl` for traceability.
- Calls `python train_mtg_vision.py`, which runs a fresh Python process.
With the updated trainer resume logic, repeated cycles will:
- Load the adapter weights and any available checkpoints, then resume training so optimizer/scheduler state is preserved.
- Use the newly generated `vision-finetuning.jsonl` (the canonical path is what the trainer reads), so each cycle can show the adapter a different random sample.
This is an effective way to bootstrap a LoRA adapter from many small random samples while preserving training continuity via checkpoints.
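For comparison, a Python version of the same per-cycle loop might look like the sketch below. The cycle count, the `pwsh` invocation, and the zero-padded archive name are assumptions; the script paths come from the steps above.

```python
import shutil
import subprocess

N_CYCLES = 5         # placeholder: however many cycles you want to run overnight
SAMPLE_SIZE = "100"  # cards per cycle

for cycle in range(1, N_CYCLES + 1):
    # 1) Build a fresh random sample and convert it to training-data/vision-finetuning.jsonl.
    subprocess.run(
        ["pwsh", "-File", "scripts/createRandomTrainingSet.ps1", SAMPLE_SIZE],
        check=True,
    )
    # 2) Archive the cycle's dataset for traceability.
    shutil.copyfile(
        "training-data/vision-finetuning.jsonl",
        f"training-data/vision-finetuning-cycle-{cycle:02d}.jsonl",
    )
    # 3) Run a fresh training process; the trainer resumes from the latest checkpoint.
    subprocess.run(["python", "train_mtg_vision.py"], check=True)
```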
Best practices and tips
- Checkpoint frequency and size: `save_steps=50` is a reasonable default for fast iteration, but adjust it according to batch size and how far back you want to be able to resume. Keep `save_total_limit` tuned to avoid disk explosion (see the configuration sketch after this list).
- Resume policy: always prefer resuming from the latest checkpoint rather than re-loading adapter-only weights; resuming preserves optimizer state and learning-rate schedules.
- Data hygiene: ensure image files exist in `public/cards/` (the conversion script skips cards with missing images). The builder logs missing images so you can backfill with `npm run cards:download-images`.
- Test locally first: run a tiny batch (`cards:vision-50`) and a short trainer run to verify the loop.
- Deterministic seeds: the trainer sets seed 3407 and the LoRA initialization uses `random_state=3407` — this helps reproducibility across runs.
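For reference, a hedged sketch of what the trainer configuration might look like with those settings. Only `save_steps=50` and the 3407 seed are stated above; every other value here is a placeholder, not the exact hyperparameters used by `train_mtg_vision.py`.

```python
from trl import SFTConfig

# Illustrative configuration only; swap in your real hyperparameters.
training_args = SFTConfig(
    output_dir="mtg-card-model-lora",   # placeholder adapter output directory
    per_device_train_batch_size=2,      # placeholder
    gradient_accumulation_steps=4,      # placeholder
    learning_rate=2e-4,                 # placeholder
    warmup_steps=5,                     # placeholder
    save_steps=50,                      # checkpoint every 50 steps, as described above
    save_total_limit=3,                 # keep only the newest checkpoints to bound disk use
    seed=3407,                          # matches the LoRA random_state for reproducibility
)
```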
Verification & debugging checklist
- After a run, check `training-data/` for `vision-finetuning.jsonl` and `Modelfile.template`.
- During training, watch `logs/` for trainer logs and confirm that checkpoint directories (`checkpoint-*`) are produced.
- On restart, confirm the trainer prints `Resuming from checkpoint: <path>` in stdout.
- Use `scripts/testMergedModel.py` or `scripts/testFinetunedModel.py` to validate inference quality on held-out images.
- If the Ollama vision endpoint fails during embedding builds, verify the `OLLAMA_HOST` and `OLLAMA_VISION_MODEL` environment variables, and that your local Ollama server is running with the required models pulled. (A small self-check along these lines is sketched below.)
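The sketch below automates those checks. It only assumes Ollama's standard `/api/tags` endpoint and the paths named above; the default host and the adapter output directory name are assumptions.

```python
import os
from pathlib import Path

import requests

# 1) Training artifacts present?
for path in ["training-data/vision-finetuning.jsonl", "training-data/Modelfile.template"]:
    print(f"{path}: {'OK' if Path(path).exists() else 'MISSING'}")

# 2) Checkpoints produced? (adapter output directory name is a placeholder)
output_dir = Path("mtg-card-model-lora")
checkpoints = sorted(output_dir.glob("checkpoint-*")) if output_dir.exists() else []
print("checkpoints found:", [p.name for p in checkpoints])

# 3) Ollama reachable and models pulled?
host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
try:
    tags = requests.get(f"{host}/api/tags", timeout=10).json()
    print("ollama models:", [m["name"] for m in tags.get("models", [])])
except requests.RequestException as exc:
    print(f"Ollama not reachable at {host}: {exc}")
```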
Exporting and serving
- Once you are happy with the adapter, merge it for serving via `scripts/exportToOllama.py` (or `scripts/mergeLoRA.py` if you prefer a different workflow) and test with `scripts/testMergedModel.py`.
- For Ollama, the `Modelfile.template` in `training-data/` is a small convenience for serving the merged model with a `Modelfile` that points to the adapter.
Next steps and improvements
- Add an argument to `train_mtg_vision.py` so the trainer can accept arbitrary `--data` paths (a minimal sketch follows after this list). That makes it trivial to point the trainer at `training-data/vision-finetuning-cycle-XX.jsonl` instead of the canonical file.
- Add a short harness that runs N cycles automatically and pushes intermediate adapters into a controlled directory, with a summary artifact listing validation images, per-cycle losses, and sample inference outputs to measure progress across cycles.
- Consider filtering augmentations or balancing by rarity/set to avoid overfitting on common art styles.
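If you go that route, the `--data` flag could be as simple as this hypothetical snippet at the top of `train_mtg_vision.py` (the default path is the canonical file the trainer already uses):

```python
import argparse

# Hypothetical argument handling for train_mtg_vision.py.
parser = argparse.ArgumentParser(description="Fine-tune the MTG card vision model")
parser.add_argument(
    "--data",
    default="training-data/vision-finetuning.jsonl",
    help="Path to the fine-tuning JSONL (e.g., a per-cycle archive)",
)
args = parser.parse_args()
DATA_PATH = args.data  # use this wherever the trainer currently hard-codes the JSONL path
```

The PowerShell loop could then pass `--data training-data/vision-finetuning-cycle-XX.jsonl` for each cycle instead of overwriting the canonical file.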
Closing
This pipeline gives you a practical, incremental way to build a Gemma-based MTG card vision model using local images, Ollama for vision + embeddings, and LoRA adapters for cheap, iterative fine‑tuning. With checkpoint-aware resumes, the overnight random-sample workflow becomes a valid strategy to slowly improve adapter quality while preserving optimizer and scheduler state — the key change we added to make cycles truly cumulative.
If you'd like, I can:
- Make `train_mtg_vision.py` accept a `--data` arg and update your PowerShell loop to pass the per-cycle file.
- Add a small `scripts/verify-resume.ps1` helper that runs a short cycle, stops, and restarts to assert resume behavior automatically.
Which would you prefer me to implement next?