CLIP — Composite Models¶
CLIP (openai/clip-vit-base-patch32) is a dual-encoder vision-language model: one tower encodes images, the other encodes text, and both project into a shared embedding space. winml-cli treats it as a composite model — a model that is split into multiple ONNX sub-models that run together at inference time. For CLIP, the two sub-models are:
| Sub-model | Role | Input shape | Output (projected) |
|---|---|---|---|
| image-encoder | Encodes images into embeddings | pixel_values [1, 3, 224, 224] | image_embeds [1, 512] |
| text-encoder | Encodes text labels into embeddings | input_ids [1, 77] | text_embeds [1, 512] |
Zero-shot classification is achieved by embedding the image and the candidate text labels, then ranking the labels by the cosine similarity between their embeddings. Splitting the towers into two ONNX graphs lets each encoder have fully static shapes (required for efficient NPU compilation) and lets you build, cache, and benchmark them independently.
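The ranking step can be sketched in plain Python. The vectors below are made-up 3-dimensional stand-ins for the 512-dimensional encoder outputs, and `cosine_similarity` and `rank_labels` are illustrative helper names, not part of winml-cli.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_labels(image_embed, label_embeds):
    """Return (label, similarity) pairs, most similar first."""
    scores = {
        label: cosine_similarity(image_embed, embed)
        for label, embed in label_embeds.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings (assumed values, for illustration only).
image_embed = [0.9, 0.1, 0.0]
label_embeds = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

ranking = rank_labels(image_embed, label_embeds)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

With real CLIP outputs the same comparison runs over the image_embeds and text_embeds vectors, one text embedding per candidate label.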
Prerequisites¶
- winml-cli installed and winml on your PATH.
- A network connection to download CLIP weights from HuggingFace on first run.
Overall workflow¶
The composite model architecture for CLIP:
```mermaid
graph LR
    A[winml config] -->|"(clip, zero-shot-image-classification)"| B[Composite Registry]
    B --> C[image-encoder config]
    B --> D[text-encoder config]
    C --> E[winml build → image-encoder.onnx]
    D --> F[winml build → text-encoder.onnx]
    E --> G[WinMLAutoModel]
    F --> G
    G -->|logits_per_image| H[Classification scores]
```
Step 1: Generate build configs¶
Because (clip, zero-shot-image-classification) is registered as a composite model, running winml config for this model and task produces two config files, one per sub-model:
- clip_image-encoder.json: export config using the image-feature-extraction task
- clip_text-encoder.json: export config using the feature-extraction task
Each config includes CLIP-specific optimizations (GELU fusion, LayerNorm fusion, MatMul+Add fusion, and clamp constant values).
Step 2: Build each sub-model¶
Build both sub-models individually using their config files:
```shell
# Build the image encoder
winml build -c clip_image-encoder.json -m openai/clip-vit-base-patch32 -o output/image-encoder

# Build the text encoder
winml build -c clip_text-encoder.json -m openai/clip-vit-base-patch32 -o output/text-encoder
```
Each winml build runs the full pipeline: export → optimize → quantize → compile. The output directories contain the final ONNX files ready for inference.
To target a specific execution provider (e.g., QNN for NPU):
```shell
winml build -c clip_image-encoder.json -m openai/clip-vit-base-patch32 -o output/image-encoder --ep qnn
winml build -c clip_text-encoder.json -m openai/clip-vit-base-patch32 -o output/text-encoder --ep qnn
```
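If you script the builds, the two commands differ only in the sub-model name, so they can be generated from one template. This is a minimal sketch that reuses only the flags shown above; `build_command` is a hypothetical helper, and the script prints the commands rather than running them.

```python
import shlex

MODEL_ID = "openai/clip-vit-base-patch32"
SUB_MODELS = ("image-encoder", "text-encoder")


def build_command(sub_model, ep=None):
    """Assemble the `winml build` argument list for one sub-model."""
    cmd = [
        "winml", "build",
        "-c", f"clip_{sub_model}.json",
        "-m", MODEL_ID,
        "-o", f"output/{sub_model}",
    ]
    if ep is not None:
        # Only add --ep when a specific execution provider is requested.
        cmd += ["--ep", ep]
    return cmd


commands = [build_command(name, ep="qnn") for name in SUB_MODELS]
for cmd in commands:
    # Printed for inspection; pass `cmd` to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```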
Step 3: Benchmark each sub-model¶
Benchmarking each sub-model separately lets you identify whether the image encoder or the text encoder is the bottleneck on your target hardware.
Step 4: Run inference (Python API)¶
There are two ways to get a ready-to-run model. Both return the same WinMLModelForZeroShotImageClassification — a single object that orchestrates the two encoders and combines their projected embeddings into similarity scores — so the inference code afterward is identical.
Option 1 — Load the ONNX files built in Step 2 (skips re-export/optimization). Pass a dict mapping each component name to its built model.onnx, plus the HF config so the composite registry can resolve (clip, zero-shot-image-classification):
```python
from transformers import AutoConfig
from winml.modelkit.models import WinMLAutoModel

model = WinMLAutoModel.from_onnx(
    {
        "image-encoder": "output/image-encoder/model.onnx",
        "text-encoder": "output/text-encoder/model.onnx",
    },
    task="zero-shot-image-classification",
    hf_config=AutoConfig.from_pretrained("openai/clip-vit-base-patch32"),
    skip_build=True,
)
```
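Before loading, it can help to confirm that both built files are actually on disk. The sketch below assumes the output layout from Step 2 (`<root>/<component>/model.onnx`); `component_paths` is a hypothetical helper, not a winml-cli function, and it returns a dict in the shape that Option 1 expects.

```python
from pathlib import Path

COMPONENTS = ("image-encoder", "text-encoder")


def component_paths(output_root):
    """Map each component to <output_root>/<name>/model.onnx, failing early if one is missing."""
    paths = {name: Path(output_root) / name / "model.onnx" for name in COMPONENTS}
    missing = [name for name, path in paths.items() if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"sub-models not built yet: {', '.join(missing)}")
    return {name: str(path) for name, path in paths.items()}


try:
    print(component_paths("output"))
except FileNotFoundError as err:
    # Reached when Step 2 has not been run for one or both encoders.
    print(err)
```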
Option 2 — Build both encoders from the HuggingFace model in one call. WinMLAutoModel.from_pretrained detects the composite task and runs the full pipeline for each sub-model:
```python
from winml.modelkit.models import WinMLAutoModel

model = WinMLAutoModel.from_pretrained(
    "openai/clip-vit-base-patch32",
    task="zero-shot-image-classification",
)
```
Either way, run inference the same way — prepare an image plus candidate labels with the HF processor, then call the model:
```python
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("cat.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# Run both encoders and combine into per-label similarity scores
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.4f}")
```
The text encoder's fixed sequence length (77) is handled for you — the processor's tokens are padded or truncated to match the ONNX graph before each run.
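Conceptually, fitting a token list to the fixed length looks like the sketch below. It is a simplified illustration, not winml-cli's actual code: `pad_or_truncate` is a hypothetical name, `pad_id=0` is a placeholder (the real padding id comes from the tokenizer), and a real tokenizer may also preserve special tokens when truncating.

```python
def pad_or_truncate(token_ids, length=77, pad_id=0):
    """Return exactly `length` token ids: cut extras, or right-pad with `pad_id`."""
    clipped = list(token_ids[:length])
    return clipped + [pad_id] * (length - len(clipped))


short = pad_or_truncate([101, 7, 42, 102])  # 4 tokens, padded up to 77
long = pad_or_truncate(range(100))          # 100 tokens, cut down to 77
print(len(short), len(long))
```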
Customizing shape config per sub-model¶
Each encoder takes its own shape_config, passed through sub_model_kwargs. The image encoder accepts vision keys (height, width); the text encoder accepts text keys (sequence_length):
```python
model = WinMLAutoModel.from_pretrained(
    "openai/clip-vit-base-patch32",
    task="zero-shot-image-classification",
    sub_model_kwargs={
        "image-encoder": {"shape_config": {"height": 224, "width": 224}},
        "text-encoder": {"shape_config": {"sequence_length": 77}},
    },
)
```
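To see how these keys relate to the static input shapes listed in the sub-model table, the sketch below derives the shapes from the same dictionary. `input_shapes` is a hypothetical helper, and the batch size of 1 and the 3 colour channels are assumptions taken from that table. How winml-cli applies the values internally is not shown here.

```python
def input_shapes(sub_model_kwargs, batch=1, channels=3):
    """Derive each encoder's static input shape from its shape_config."""
    image_cfg = sub_model_kwargs["image-encoder"]["shape_config"]
    text_cfg = sub_model_kwargs["text-encoder"]["shape_config"]
    return {
        "pixel_values": [batch, channels, image_cfg["height"], image_cfg["width"]],
        "input_ids": [batch, text_cfg["sequence_length"]],
    }


sub_model_kwargs = {
    "image-encoder": {"shape_config": {"height": 224, "width": 224}},
    "text-encoder": {"shape_config": {"sequence_length": 77}},
}

shapes = input_shapes(sub_model_kwargs)
print(shapes)
```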
Other composite models¶
The same composite model pattern is used for:
- SigLIP (google/siglip-base-patch16-224): dual-encoder zero-shot image classification; shares the same composite wrapper as CLIP
- T5 (google-t5/t5-small): encoder + decoder for translation/summarization
- BART (facebook/bart-large-cnn): encoder + decoder for summarization and table-question-answering (TAPEX)
- Marian (Helsinki-NLP/opus-mt-en-de): encoder + decoder for translation
- Qwen3 (Qwen/Qwen3-0.6B): prefill + generation decoders for text generation
- BLIP (Salesforce/blip-image-captioning-base): vision encoder + text decoder for image-to-text captioning
- Vision-encoder-decoder (microsoft/trocr-base-handwritten): vision encoder + text decoder for image-to-text (TrOCR, Donut)
See also¶
- BERT — Config + Build + Perf — single-model workflow
- Hugging Face Model to NPU — step-by-step pipeline
- Supported Models — full list of validated architectures
- Config and build — concept overview