
Model Inferencing on Disconnected Windows Environments

Status

  • Draft
  • Proposed
  • Accepted
  • Deprecated

Context

Small Language Model (SLM) inferencing in disconnected environments presents unique challenges, especially on Windows-based systems where:

  • No internet connectivity is available (fully offline deployment).
  • Local inference must be optimized for low latency.
  • Hardware resources (CPU/GPU) need to be efficiently utilized.
  • Model execution must be Windows-compatible (no reliance on Linux-based tools).

Given these constraints, we evaluated multiple inference engines to determine the best solution for running SLM inference in an on-prem, disconnected Windows environment.

Decision

We selected ONNX Runtime GenAI as the inference engine for running SLM inference in Windows-based, offline environments.

Decision Drivers

  • Ability to run RAG without internet access.
  • Fast inference speed (< 5 s total response time).
  • Efficient GPU utilization on Windows.
  • Ease of integration with on-prem infrastructure.

Considered Options

Option 1: ONNX Runtime GenAI ✅ (Selected)

Pros:

  • Best performance (higher token throughput, lower latency).
  • Windows-native support without extra dependencies.
  • Works with ONNX-optimized models like Phi-3 Mini.
  • Provides graph optimizations.
  • Fully self-contained for offline execution.

Cons:

  • Requires manual model conversion using the onnxruntime-genai model builder when no ONNX version of the model is published.
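As a hedged sketch of that conversion step (flags and model name are assumptions; verify against the onnxruntime-genai version you install), converting a Hugging Face model such as Phi-3 Mini to an ONNX Runtime GenAI model folder looks roughly like:

```shell
# Sketch: one-time model conversion, run on a connected machine.
# Copy the resulting output folder to the offline Windows host afterwards.
# Flags shown (-m model, -o output dir, -p precision, -e execution provider)
# follow the onnxruntime-genai model builder; confirm with your installed version.
pip install onnxruntime-genai
python -m onnxruntime_genai.models.builder \
  -m microsoft/Phi-3-mini-4k-instruct \
  -o ./phi3-mini-onnx \
  -p int4 \
  -e cuda
```

Because the disconnected host never reaches the internet, this conversion must happen ahead of deployment, with the output folder transferred alongside the application.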

Option 2: LlamaCPP

Pros:

  • Good performance on CPU-only environments.
  • Fully self-contained for offline execution.

Cons:

  • GPU support is limited on Windows.
  • Requires custom Windows builds, adding complexity.

Option 3: Hugging Face Optimum

Pros:

  • Supports various ONNX models.
  • Provides graph optimizations.
  • Fully self-contained for offline execution.

Cons:

  • Lacks inference optimizations such as KV-cache management, input/output binding, and integrated pre-/post-processing.
  • Lower token throughput than ONNX Runtime GenAI.

Option 4: NVIDIA Triton Inference Server

Pros:

  • Multi-GPU scalability and efficient batching.

Cons:

  • Not feasible for offline Windows environments (requires Docker).
  • Higher resource overhead compared to ONNX Runtime GenAI.

Note: The NVIDIA Triton option was excluded from the benchmark because Docker cannot be used in the disconnected environment. It would need to be tested before further consideration.

Consequences

By selecting ONNX Runtime GenAI, we achieve:

  • Disconnected RAG capabilities on Windows environments.
  • Optimized inference speeds, keeping response times low.
  • Better resource utilization in on-prem settings.

Additional Considerations for Disconnected Environments

Logging & Telemetry

For disconnected environments, the chosen solution leverages OpenTelemetry to export logs to files. Once the logs are produced, log clients periodically import them into Azure Monitor, Grafana, or similar applications for visualization.

OpenTelemetry on Disconnected Environment
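As an illustrative stand-in for the export side of this pattern (using only the Python standard library rather than the OpenTelemetry SDK; the file name and record fields are assumptions), the idea amounts to writing structured log records to a local file that a log client can later ship:

```python
import json
import logging

# Sketch: structured JSON-lines log export on an offline host.
# A log client can later tail/import this file into Azure Monitor, Grafana, etc.
# (In production this role would be played by an OpenTelemetry file exporter.)

class JsonLineFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })

logger = logging.getLogger("slm-inference")
handler = logging.FileHandler("inference-logs.jsonl")  # assumed file name
handler.setFormatter(JsonLineFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("generation finished")
```

One record per line keeps the file trivially parseable by whatever import client the environment allows.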

Future Considerations

  • Further optimizing ONNX models for better efficiency.
  • Exploring additional fine-tuning techniques for model inferencing.
  • Monitoring new developments in ONNX Runtime GenAI.

Benchmark Results

For the benchmark, we used Microsoft's Phi-3 Mini ONNX model (Hugging Face link), deployed on an Azure ND96amsr_A100_v4 instance (1x A100 80GB GPU). We used a fixed generation length of 256 tokens, with 5 warmup runs and 10 repetitions.

We tested the following combinations:

  • Onnxruntime-genai Windows vs Linux
  • Onnxruntime-genai vs Huggingface Optimum
  • Onnxruntime-genai vs LlamaCPP
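The warmup/repetition protocol above can be sketched as a small timing harness (the callable and token count here are placeholders, not the actual benchmark code):

```python
import time
import statistics

def benchmark(generate, gen_tokens=256, warmup=5, repetitions=10):
    """Time a generation callable, excluding warmup runs, and report
    mean wall-clock latency (s) and token throughput (tokens/second)."""
    for _ in range(warmup):          # warm caches, JIT, GPU kernels
        generate()
    latencies = []
    for _ in range(repetitions):     # measured runs only
        start = time.perf_counter()
        generate()
        latencies.append(time.perf_counter() - start)
    mean_latency = statistics.mean(latencies)
    return mean_latency, gen_tokens / mean_latency

# Placeholder workload standing in for a real model.generate() call.
lat, tps = benchmark(lambda: sum(range(10_000)))
```

Discarding the warmup runs matters on GPU: the first few generations pay one-off costs (kernel compilation, memory allocation) that would otherwise skew the averages.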

Average Token Generation Throughput (tokens per second, tps)

| Batch Size | Prompt Length | Onnxruntime-genai | LlamaCPP | HF Optimum | Speed Up ORT/LlamaCPP Ratio | Speed Up ORT/Optimum Ratio |
|---|---|---|---|---|---|---|
| 1 | 16 | 137.59 | 109.47 | 108.345 | 1.26 | 1.27 |
| 1 | 64 | 136.82 | 110.26 | 107.135 | 1.24 | 1.28 |
| 1 | 256 | 134.45 | 109.42 | 105.755 | 1.23 | 1.36 |
| 1 | 1024 | 127.34 | 105.60 | 102.114 | 1.21 | 1.50 |
| 1 | 2048 | 122.62 | 102.00 | 99.345 | 1.20 | 1.59 |
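The speed-up columns are the ratio of ONNX Runtime GenAI throughput to each alternative; for the first table row, for example:

```python
# Speed-up = ORT GenAI throughput / competitor throughput (first table row).
ort_tps, llamacpp_tps, optimum_tps = 137.59, 109.47, 108.345

speedup_llamacpp = ort_tps / llamacpp_tps
speedup_optimum = ort_tps / optimum_tps

print(round(speedup_llamacpp, 2))  # 1.26
print(round(speedup_optimum, 2))   # 1.27
```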

Token Generation Throughput

Wall-clock Latency: ONNX Runtime GenAI vs Optimum vs LlamaCPP

| Prompt Length | ONNX Runtime GenAI (s) | Optimum (s) | LlamaCPP (s) |
|---|---|---|---|
| 16 | 2.491 | 3.526 | 2.51 |
| 64 | 2.502 | 3.545 | 2.52 |
| 256 | 2.571 | 3.772 | 2.71 |
| 1024 | 2.790 | 4.049 | 3.12 |
| 2048 | 3.073 | 4.279 | 3.46 |

Wall-Clock Comparison

Wall-Clock Throughput with Onnxruntime-genai Windows vs Linux (tps)

| Prompt Length | Windows (tps) | Linux (tps) |
|---|---|---|
| 16 | 142.18 | 129.52 |
| 64 | 166.20 | 154.18 |
| 256 | 259.48 | 235.68 |
| 1024 | 585.68 | 545.91 |
| 2048 | 932.03 | 892.12 |

Wall-Clock Throughput

Wall-Clock Latency with Onnxruntime-genai Windows vs Linux (s)

| Prompt Length | Windows (s) | Linux (s) |
|---|---|---|
| 16 | 1.913 | 2.1 |
| 64 | 1.925 | 2.08 |
| 256 | 1.973 | 2.17 |
| 1024 | 2.185 | 2.34 |
| 2048 | 2.472 | 2.58 |

Wall-Clock Latency
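The Windows throughput and latency figures are approximately consistent with wall-clock throughput being computed over all processed tokens, i.e. (prompt length + 256 generated tokens) / wall-clock latency. A quick check against a few Windows rows:

```python
# Wall-clock throughput ~= (prompt tokens + generated tokens) / latency.
# Cross-check a few Windows rows (generation length fixed at 256 tokens).
GEN_TOKENS = 256

rows = [  # (prompt_length, latency_s, reported_tps)
    (16, 1.913, 142.18),
    (1024, 2.185, 585.68),
    (2048, 2.472, 932.03),
]

for prompt, latency, reported in rows:
    derived = (prompt + GEN_TOKENS) / latency
    print(prompt, round(derived, 2), reported)
```

Small discrepancies (well under 0.1%) are explained by the rounding of the published latencies.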
