
Linux and Open Source Blog
8 MIN READ

Dissecting LLM Container Cold-Start: Where the Time Actually Goes

robcronin, Microsoft
Apr 15, 2026

TL;DR: Your inference engine determines what to optimize. Dropping gzip from an LLM container image gives a 1.35x cold-start speedup for llama.cpp (pull-bound) but only 1.02x for vLLM (startup-bound) – same GPU, same registry, same optimization. Profile your engine’s startup before investing in pull-pipeline optimizations.

Cold-start latency determines whether GPU clusters can scale to zero, how fast they can autoscale, and whether bursty or low-QPS workloads are economically viable. Most optimization effort targets the container pull path – faster registries, lazy-pull snapshotters, different compression formats. But “cold-start” is actually a composite of pull, runtime startup, and model initialization, and the dominant phase varies dramatically by inference engine. An optimization that cuts time-to-first-token for one engine can be irrelevant for another, even on identical infrastructure.

What we measured

We decomposed cold-start for two architecturally different engines – vLLM (Python/CUDA, heavy JIT compilation) and llama.cpp (C++, minimal runtime) – running Llama 3.1 8B on A100 GPUs. Every run starts from a completely clean slate: containerd stopped, all state wiped, kernel page caches dropped. No warm starts, no pre-pulling, no caching.

We break TTFT into three phases: pull (download + decompression + snapshot creation), startup (container start → server ready), and first inference (first API response, including model weight loading for engines that defer it). We tested across three snapshotters (overlayfs, EROFS, Nydus) with gzip and uncompressed images, pulling from same-region Azure Container Registry.

Setup

All experiments ran on an NVIDIA A100 80GB (Azure NC24ads_A100_v4), pulling from same-region Azure Container Registry. Images were built with AIKit, which produces ModelPack-compliant OCI artifacts with uncompressed model weight layers, Cosign signatures, SBOMs, and provenance attestations. These are supply chain properties you lose when model weights live on a shared drive.

vLLM: startup dominates, pull barely matters

vLLM loads model weights, runs torch.compile, captures CUDA graphs for multiple batch shapes, allocates KV cache, and warms up, all before serving the first request. This takes ~176 seconds regardless of how fast the image arrived.

The breakdown makes the bottleneck obvious: the green bar (startup) is nearly constant across all four variants, swamping any pull-time differences.

Figure 1: vLLM cold-start breakdown. Startup (green, ~176s) dominates regardless of snapshotter.

| Method | Pull | Startup | 1st Inference | TTFT |
| --- | --- | --- | --- | --- |
| overlayfs (gzip) | 140.8s ±5.5 | 176.0s ±3.2 | 0.16s | 317.2s ±2.2 |
| overlayfs (uncomp.) | 129.9s ±3.3 | 180.8s ±12.2 | 0.16s | 310.9s ±8.9 |
| EROFS (gzip) | 158.9s ±8.8 | 175.3s ±0.8 | 0.16s | 334.4s ±8.7 |
| EROFS (uncomp.) | 166.3s ±21.1 | 177.3s ±12.8 | 0.16s | 343.8s ±8.2 |

Llama 3.1 8B Q4_K_M, ~14 GB image, n=2–3 per variant. ± = sample standard deviation. Three of twelve runs hit intermittent NVIDIA container runtime crashes (exit code 120, unrelated to snapshotters) and were excluded. We excluded Nydus because FUSE-streaming the 14 GB Python/CUDA stack caused startup to exceed 900s. Steady-state inference: ~0.134s across all snapshotters.

44% pull, 56% startup. Dropping gzip saves 11 seconds on a 317-second cold start (1.02x). If your engine is vLLM, optimizing the pull pipeline is the wrong lever.

llama.cpp: pull dominates, compression is the bottleneck

llama.cpp has the opposite profile. Its C++ runtime starts in 2–5 seconds, so the pull becomes the majority of cold-start. This is where filesystem and compression choices actually matter.

Here the picture flips. Pull (blue) is the widest bar, and the gzip-to-uncompressed difference is visible at a glance:

Figure 2: llama.cpp cold-start breakdown. Pull time (blue) dominates for gzip variants.

| Method | Pull | Startup | 1st Inference | TTFT |
| --- | --- | --- | --- | --- |
| overlayfs (gzip) | 88.3s ±0.2 | 5.3s ±0.5 | 45.1s ±1.4 | 138.8s ±0.8 |
| overlayfs (uncomp.) | 56.3s ±3.1 | 2.0s ±0.0 | 44.2s ±0.1 | 102.4s ±3.1 |
| EROFS (gzip) | 92.0s ±2.3 | 6.1s ±0.5 | 44.0s ±0.2 | 142.3s ±1.9 |
| EROFS (uncomp.) | 58.8s ±0.6 | 2.0s ±0.0 | 44.0s ±0.1 | 104.8s ±0.5 |

Llama 3.1 8B Q4_K_M, ~8 GB image, n=3 per variant, 12/12 runs succeeded. First inference includes model weight loading into GPU VRAM (~43s) plus token generation (~1.5s). Steady-state inference: ~1.5s across all snapshotters.

64% pull, 4% startup, 33% model loading. Dropping gzip saves 32 seconds (1.35x) with zero infrastructure changes.

Engine comparison

Placed side by side, the two engines tell opposite stories about the same infrastructure:

Figure 3: Where cold-start time goes. vLLM is compute-bound; llama.cpp is pull-bound.

|  | vLLM | llama.cpp |
| --- | --- | --- |
| Time saved by dropping gzip | 11s (3% of TTFT) | 32s (23% of TTFT) |
| Startup time | 176–181s | 2–5s |
| Speedup from dropping gzip | 1.02x | 1.35x |

Same optimization, completely different impact. Before investing in pull optimization (compression changes, lazy-pull infrastructure, registry tuning), profile your engine’s startup. If startup dominates, the pull isn’t where the time goes.
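The “profile first” rule is just Amdahl’s law applied to cold-start. A minimal sketch, using phase timings from the tables above (the function and variable names are ours, for illustration):

```python
def max_speedup(phases, optimized, reduction=1.0):
    """Upper bound on TTFT speedup if the `optimized` phases
    shrink by `reduction` (1.0 = eliminated entirely)."""
    total = sum(phases.values())
    saved = sum(phases[p] for p in optimized) * reduction
    return total / (total - saved)

# Phase timings (seconds) from the overlayfs-gzip rows above.
vllm  = {"pull": 140.8, "startup": 176.0, "first_inference": 0.16}
llama = {"pull": 88.3,  "startup": 5.3,   "first_inference": 45.1}

# Even a *free* pull caps vLLM at ~1.8x, while llama.cpp could reach ~2.8x.
print(max_speedup(vllm,  ["pull"]))   # ~1.80
print(max_speedup(llama, ["pull"]))   # ~2.75
```

If the upper bound for your dominant optimization is close to 1.0x, you are holding the wrong lever.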

Why gzip hurts: model weights are incompressible

The AIKit image is 8.7 GB uncompressed, 6.6 GB with gzip (a modest 0.76x ratio). But this ratio hides what’s really happening:

| Layer type | Size | % of image | Gzip ratio |
| --- | --- | --- | --- |
| Model weights (GGUF) | 4.9 GB | 56% | ~1.00x (quantized binary, no redundancy) |
| CUDA + system layers | ~3.8 GB | 44% | ~0.46x (compresses well) |

The GGUF file is already quantized to 4-bit precision. Gzip reads every byte, burns CPU, and produces output the same size as the input. You’re paying full decompression cost on 56% of the image for zero size reduction.
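You can reproduce the asymmetry without any container tooling. This sketch uses random bytes as a stand-in for quantized weights and repetitive text as a stand-in for the Python/CUDA layers (the stand-ins are ours, not the actual image contents):

```python
import gzip
import os

# Quantized weights are high-entropy: gzip finds no redundancy to exploit.
weights_like = os.urandom(1_000_000)
# Code and libraries are highly repetitive: gzip shrinks them dramatically.
code_like = b"import torch; x = torch.zeros(1)\n" * 30_000

for name, blob in [("weights-like", weights_like), ("code-like", code_like)]:
    ratio = len(gzip.compress(blob)) / len(blob)
    print(f"{name}: {ratio:.2f}x")  # weights-like comes out ~1.00x
```

The real GGUF layer behaves like the random blob: full decompression CPU cost, zero size reduction.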

Bottom line: gzip is doing real work on less than half your image and producing zero savings on the rest. Dropping it costs nothing and removes a bottleneck from every cold start.

The Nydus prefetch finding

If decompression is the bottleneck, what about skipping the full pull entirely?

Nydus lazy-pull takes a fundamentally different approach: it fetches only manifest metadata during “pull” (~0.7s), then streams model data on-demand via FUSE as the container reads it. Nydus TTFT isn’t directly comparable to the full-pull methods above because the download cost shifts from the pull column to the inference column.

With prefetch enabled, Nydus achieved 77.8s TTFT for llama.cpp vs 139.1s for overlayfs gzip. The critical detail is the prefetch_all flag:

Figure 4: Nydus prefetch ON vs OFF. One config flag, 2.87x difference. Overlayfs gzip shown as baseline.

| Configuration | 1st Inference | TTFT |
| --- | --- | --- |
| Nydus, prefetch ON | 72.4s ±0.6 | 77.8s ±0.5 |
| Nydus, prefetch OFF | 218.6s ±2.9 | 223.4s ±2.9 |
| overlayfs gzip (baseline) | 44.0s ±0.4 | 139.1s ±1.9 |

n=3 per config, 9/9 runs succeeded. Data: 03-prefetch-config-20260401-030725.csv

One flag in nydusd-config.json, 2.87x difference. Without prefetch, every model weight page fault fires an individual HTTP range request to the registry. With prefetch_all=true, Nydus streams the full blob in the background while the container starts, so chunks arrive ahead of the GPU’s read pattern.
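For reference, a sketch of the relevant fragment of nydusd-config.json — field names follow recent Nydus releases, so verify against the documentation for your version:

```json
{
  "fs_prefetch": {
    "enable": true,
    "prefetch_all": true,
    "threads_count": 8
  }
}
```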

Even with prefetch, Nydus first inference is ~28s slower than overlayfs (72s vs 44s) due to FUSE kernel-user roundtrips during model mmap. Nydus wins on total TTFT because it eliminates the blocking pull, but this overhead means its advantage shrinks on faster networks.

Bottom line: Nydus lazy-pull can halve cold-start for pull-bound engines, but only if prefetch is on. Treat prefetch_all=true as a hard requirement, not a tuning knob.

How to apply these findings

Pick your optimization by engine type

The right optimization depends on where your engine spends its cold-start time. This table summarizes the tradeoffs:

| Engine type | Dominant phase | Speedup from dropping gzip | Nydus viable? | Best optimization | What NOT to optimize |
| --- | --- | --- | --- | --- | --- |
| vLLM / TensorRT-LLM | Startup (56%) | 1.02x — negligible | No — FUSE + Python/CUDA stack exceeded 900s in our tests | Cache torch.compile artifacts and CUDA graphs | Pull pipeline (it’s <44% of TTFT and already fast enough) |
| llama.cpp / ONNX Runtime | Pull (64%) | 1.35x — 32s saved | Yes, with prefetch_all=true (77.8s TTFT vs 139s gzip baseline) | Drop gzip on weight layers; consider lazy-pull on slow links | Startup (already 2–5s; no room to improve) |
| Large dense models (70B+) | Pull (projected) | >1.35x — scales with image size | Yes, strongest case for lazy-pull | Uncompressed or zstd; Nydus prefetch on bandwidth-constrained links |  |

Recommendations

  1. Profile your engine’s startup before touching the pull pipeline. If CUDA compilation dominates (vLLM, TensorRT-LLM), no amount of pull optimization will help. Cache torch.compile artifacts and CUDA graphs instead — production clusters that do this reduce vLLM restarts to ~45–60s.
  2. Drop gzip on model weight layers. For pull-bound engines (llama.cpp, ONNX Runtime), this is the single highest-ROI change: build with --output=type=image,compression=uncompressed, or use AIKit, which defaults to uncompressed weight layers. Quantized model weights (GGUF, safetensors) are already dense binary — gzip burns CPU for a ~1.00x compression ratio on 56% of the image.
  3. If using Nydus, set prefetch_all=true. Without it, every weight page fault triggers an individual HTTP range request and cold-start is 2.87x slower. This is a single flag in nydusd-config.json.
  4. Package models as signed OCI artifacts, not volume mounts. Three CNCF projects implement this pipeline end-to-end: ModelPack defines the OCI artifact spec (model metadata, architecture, quantization format). AIKit builds ModelPack-compliant images with Cosign signatures, SBOMs, and provenance attestations — supply chain guarantees you lose when weights live on a shared drive. KAITO handles the Kubernetes deployment: GPU node provisioning, inference engine setup, and API exposure. Together they cover packaging → build → deploy, and they produce the exact image layout these benchmarks measured.

Why this matters: the cost of cold-start

On an A100 node (~$3–4/hr on major clouds), a 5-minute vLLM cold start burns ~$0.30 in idle GPU time per pod. That sounds small until you multiply it: a cluster that scales 50 pods to zero overnight and restarts them each morning wastes ~$15/day — over $5,000/year — on GPUs sitting idle during pull and CUDA compilation. More critically, cold-start latency determines whether scale-to-zero is feasible at all. If cold-start exceeds your SLO (say, 30s for an interactive app), you’re forced to keep warm replicas running 24/7, which can 2–3x your GPU spend. Cutting llama.cpp cold-start from 139s to 103s by dropping gzip doesn’t just save 36 seconds — it moves the needle on whether autoscaling is viable for your workload.
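The arithmetic behind those numbers, as a sketch — the $3.60/hr rate is an assumed on-demand A100 price, so plug in your own:

```python
gpu_hourly_usd = 3.60   # assumed A100 on-demand rate
cold_start_s   = 300    # ~5 min vLLM cold start
pods           = 50     # scaled to zero overnight, restarted each morning

per_pod_usd = gpu_hourly_usd * cold_start_s / 3600   # idle GPU per cold start
daily_usd   = per_pod_usd * pods
yearly_usd  = daily_usd * 365

print(f"per pod: ${per_pod_usd:.2f}, daily: ${daily_usd:.2f}, yearly: ${yearly_usd:,.0f}")
# per pod: $0.30, daily: $15.00, yearly: $5,475
```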

What this doesn’t cover

  • zstd compression: decompresses 5–10x faster than gzip; containerd supports it natively. The most obvious gap in this analysis.
  • Pre-pulling and caching: production clusters pre-pull images and cache CUDA graphs, reducing vLLM restarts to ~45–60s. We measure the cold case: scale-from-zero events and first-time deployments.
  • Volume-mounted weights: skips the pull entirely, but loses supply chain properties (signing, scanning, provenance).
  • Larger models (70B+): pull would dominate more, increasing the gzip penalty.
  • Sample size: n=3 per AIKit variant, n=2–3 per vLLM variant. The gzip finding for llama.cpp is statistically significant (Welch’s t-test, p=0.0014, Cohen’s d=16.3; verification script). Other comparisons are directional.
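The significance claim can be sanity-checked from the summary statistics alone. This sketch recomputes Welch’s t and Cohen’s d from the reported llama.cpp means and standard deviations (n=3 each); it lands close to, but not exactly on, the published d because that figure was computed from the raw runs:

```python
import math

# llama.cpp TTFT: overlayfs gzip vs overlayfs uncompressed (mean, sd, n)
m1, s1, n1 = 138.8, 0.8, 3
m2, s2, n2 = 102.4, 3.1, 3

welch_t  = (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
cohens_d = (m1 - m2) / math.sqrt((s1**2 + s2**2) / 2)

print(f"t = {welch_t:.1f}, d = {cohens_d:.1f}")  # t ≈ 19.7, d ≈ 16.1
```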

Reproduce it

Scripts and raw data: erofs-repro-repo. Data for this post: 02-aikit-five-way-20260401-004716.csv and 01-vllm-four-way-20260331-113848.csv. Full analysis: technical report.

Updated Apr 14, 2026
Version 1.0