Distributed LLM Inference Pipelines

Architecture and Design for Production-Ready, Autoscaled Model Serving

Jan 03, 2026

Introduction

Model registries like HuggingFace give you the weights, the tokenizer, and the config. What they don't give you is a production deployment. You still need to host and serve the model in your own cloud infrastructure — and that's where things get complicated.

Deploying an LLM isn't like deploying a web service. These models consume entire GPUs worth of memory, inference latency is measured in seconds, and an idle GPU burns $2–4 per hour whether it's serving traffic or not. Getting from “I downloaded a model” to “it's running reliably in production” requires more than a Dockerfile.

This post describes the architecture of a distributed LLM inference pipeline — a system designed to automate the full journey from a model in a registry to a scalable, optimized, production API running in your cloud. The design is built on Kubernetes and open-source cloud-native tooling, and it's portable across any major cloud provider or on-premise cluster.

The core insight is this: LLM inference is not a deployment problem, it's a pipeline problem. You don't just deploy a model — you acquire it, optimize it (in multiple ways), evaluate the trade-offs, pick the best variant, and then deploy it with intelligent autoscaling. Each of those steps has different compute needs, different failure modes, and different scaling characteristics.

The Problem: Why LLM Serving is Hard

To appreciate why a distributed pipeline is necessary, consider how LLM inference differs from traditional ML serving:

Model Size

A typical classification model might be a few megabytes. A small LLM starts at 500 MB. Production LLMs range from 7 billion parameters (14 GB in FP16) to 70 billion parameters (140 GB) and beyond. You can't just load these into memory and go — they often need to be split across multiple GPUs, compressed through quantization, or both.

The Quality-Performance Trade-off

With traditional ML, you pick the most accurate model and ship it. With LLMs, there's a multi-dimensional trade-off:

  • Quality — Does the model produce coherent, accurate responses?
  • Latency — How long does the user wait for the first token?
  • Throughput — How many requests can a single GPU handle per second?
  • Memory — How much expensive GPU VRAM does the model consume?

A full-precision model gives you the best quality but the worst latency and cost. A heavily quantized model is fast and cheap but might produce lower-quality outputs. The right answer depends on your use case, and figuring it out requires evaluating multiple variants side by side.

GPU Economics

GPUs are the most expensive resource in most cloud bills. An idle A100 GPU costs $2–4 per hour. If your service handles variable traffic — quiet at night, busy during business hours — you need autoscaling that understands LLM workload patterns, not just CPU utilization.

Cold Start Problem

Loading a multi-gigabyte model into GPU memory takes 60–120 seconds. Traditional autoscaling (spin up a new pod when load increases) doesn't work well when the new pod takes two minutes to become ready. The architecture needs to account for warm pools, pre-loaded models, and intelligent scaling thresholds.

Architecture Overview

The system is organized as a four-stage pipeline where each stage is an independent, containerized step:

  1. Acquire — Download the pre-trained model from a model registry
  2. Optimize — Create multiple quantized variants (in parallel)
  3. Select — Evaluate variants and pick the best one
  4. Serve — Deploy the winner as an autoscaled inference API

A workflow orchestrator manages the pipeline as a Directed Acyclic Graph (DAG), ensuring stages run in the correct order, parallelizing where possible, and retrying on failure. All stages communicate through a shared persistent filesystem — no message queues, no RPCs, no complexity beyond “write files to a known path.”

This design treats the problem the same way CI/CD treats software deployment: as a series of automated, reproducible steps that turn an input (a model name) into an output (a production API).
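To make the DAG semantics concrete, here is a minimal sketch of the four-stage pipeline in Python. The stage functions are hypothetical stand-ins for containerized steps; a real orchestrator such as Argo Workflows would express the same sequencing and fan-out declaratively rather than in code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions standing in for containerized pipeline steps.
def acquire():
    return "base/"

def optimize(base_path, variant):
    # Each optimization reads the same base model, writes its own output path
    return f"optimized/{variant}/"

def select(variant_paths):
    return "production/"

def serve(path):
    return f"serving {path}"

def run_pipeline():
    base = acquire()                          # Stage 1 must finish first
    with ThreadPoolExecutor() as pool:        # Stage 2 fans out in parallel
        variants = list(pool.map(lambda v: optimize(base, v),
                                 ["fp16", "awq", "gptq"]))
    winner = select(variants)                 # Stage 3 joins on all variants
    return serve(winner)                      # Stage 4 deploys the winner
```

The shape is the important part: Stage 2 is a fan-out, Stage 3 is a join, and everything else is strictly sequential.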

Architecture

From top to bottom, the system looks like this:

  • Model Registry (HuggingFace) — the source of pre-trained models
  • Workflow Orchestration Layer
      1. Acquire Model — download weights, tokenizer, config
      2. Optimize (parallel) — FP16 · AWQ · GPTQ, three variants simultaneously
      3. Select Best Model — multi-metric scoring: quality + latency + throughput + memory
  • Model Serving Layer
      4. Deploy & Serve — autoscaled 1–N replicas · OpenAI-compatible REST API
  • Shared Storage Layer — persistent volume: base/ → optimized/ → production/
  • Clients — any OpenAI-compatible SDK

The architecture has three horizontal layers:

  • Orchestration Layer — Manages the pipeline stages, handles sequencing, parallelism, retries, and caching
  • Serving Layer — Runs the inference engine with autoscaling and load balancing
  • Storage Layer — A shared filesystem that acts as the communication backbone between all stages

The Four-Stage Pipeline

Model Acquisition

The pipeline begins by pulling a pre-trained model from a model registry (typically HuggingFace Hub). This downloads everything the model needs to run: the neural network weights, the tokenizer that converts text to tokens, and configuration files that describe the model's architecture.

Why this is its own stage: Model downloads can take minutes to hours depending on model size and network speed. By isolating this as a separate stage with built-in caching, the pipeline avoids re-downloading the same model on every run. If you're iterating on optimization strategies, the base model is already there from the first run.

Caching strategy: The orchestrator memoizes this step based on the model name and version. If the same model was downloaded in the last 24 hours, this step completes instantly by reusing the cached artifacts.
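A memoization key of this kind can be sketched in a few lines. The cache layout and the `.complete` marker file below are assumptions for illustration, not the orchestrator's actual on-disk format:

```python
import hashlib
import os
import time

CACHE_TTL = 24 * 3600  # treat cached downloads as fresh for 24 hours

def cache_key(model_name: str, revision: str = "main") -> str:
    # The key depends only on what identifies the artifact, not on when it runs
    return hashlib.sha256(f"{model_name}@{revision}".encode()).hexdigest()[:16]

def is_cached(cache_dir: str, model_name: str, revision: str = "main") -> bool:
    # A ".complete" marker distinguishes finished downloads from partial ones
    marker = os.path.join(cache_dir, cache_key(model_name, revision), ".complete")
    if not os.path.exists(marker):
        return False
    # A stale entry counts as a miss and triggers a re-download
    return (time.time() - os.path.getmtime(marker)) < CACHE_TTL
```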

Model Optimization

This is where the pipeline diverges from traditional ML workflows. Instead of a single optimization pass, the system creates three different quantized variants of the base model — each representing a different point on the quality-performance spectrum:

| Variant | What it Does | Trade-off |
| --- | --- | --- |
| FP16 | Converts model weights from 32-bit to 16-bit floating point | Minimal quality loss, ~50% size reduction, moderate speedup |
| AWQ | Activation-Aware Weight Quantization to 4-bit precision | Small quality loss, ~75% size reduction, 2–3x faster inference |
| GPTQ | Post-Training Quantization to 4-bit precision | Small quality loss, ~75% size reduction, 1.5–2x faster inference |

The key design choice here is parallelism. All three optimization jobs are independent — they each read from the same base model and write to different output paths. The orchestrator runs them simultaneously, which means this stage takes as long as the slowest optimization, not the sum of all three. For a pipeline that might take 20+ minutes per optimization, this saves 40+ minutes of wall-clock time.
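The wall-clock arithmetic is worth spelling out. With illustrative per-job durations (the numbers below are assumptions consistent with the "20+ minutes per optimization" figure), the parallel stage costs the maximum, not the sum:

```python
# Illustrative per-variant optimization times, in minutes
durations = {"fp16": 18, "awq": 25, "gptq": 22}

serial_minutes = sum(durations.values())    # one after another: 65 minutes
parallel_minutes = max(durations.values())  # fanned out: slowest job wins, 25 minutes
saved = serial_minutes - parallel_minutes   # 40 minutes of wall-clock time saved
```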

Why three variants instead of just picking one? Different quantization techniques have different strengths. AWQ tends to be faster but GPTQ sometimes preserves quality better on certain model families. FP16 is the safest bet but uses more memory. Rather than guessing, the pipeline tries all three and lets the data decide.

Model Selection

This stage evaluates all three optimized variants and automatically selects the best one for production. This is one of the most important architectural innovations — it replaces human guesswork with data-driven selection.

Multi-dimensional scoring: Unlike traditional ML where you'd pick the model with the highest accuracy, LLM inference requires balancing four competing concerns:

Scoring Weights

  • Quality — 50%
  • Latency — 20%
  • Throughput — 20%
  • GPU Memory — 10%

Weights are configurable — a chatbot prioritizes latency, a batch system prioritizes throughput, a medical application prioritizes quality.

How it works:

  1. Each variant is benchmarked on all four metrics
  2. Scores are normalized to a 0–1 scale (so latency in milliseconds and quality as a ratio are comparable)
  3. A weighted average produces a single combined score
  4. The variant with the highest combined score is promoted to the production path
  5. A metrics report is saved alongside the model, documenting exactly why this variant was chosen

Typical outcome: AWQ tends to win in general-purpose scenarios because it delivers the best balance — quality stays above 0.80 while latency drops to ~45ms and memory usage drops to ~4 GB.
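Steps 1–4 can be sketched in a few lines of Python. The benchmark numbers, metric names, and weight keys below are illustrative assumptions, not measured results, but the normalize-then-weight logic is the same shape as described above:

```python
def select_best(metrics, weights=None):
    """Normalize each metric to 0-1 across variants, then take a weighted average.
    Quality and throughput are higher-is-better; latency and memory are lower-is-better."""
    weights = weights or {"quality": 0.5, "latency_ms": 0.2,
                          "throughput": 0.2, "memory_gb": 0.1}
    higher_is_better = {"quality", "throughput"}

    def normalize(metric, value):
        column = [v[metric] for v in metrics.values()]
        lo, hi = min(column), max(column)
        if hi == lo:
            return 1.0  # all variants tie on this metric
        frac = (value - lo) / (hi - lo)
        return frac if metric in higher_is_better else 1.0 - frac

    scores = {
        variant: sum(weights[m] * normalize(m, x) for m, x in measured.items())
        for variant, measured in metrics.items()
    }
    return max(scores, key=scores.get), scores

# Illustrative benchmark numbers only, not real measurements:
metrics = {
    "fp16": {"quality": 0.85, "latency_ms": 95, "throughput": 8,  "memory_gb": 14},
    "awq":  {"quality": 0.82, "latency_ms": 45, "throughput": 20, "memory_gb": 4},
    "gptq": {"quality": 0.81, "latency_ms": 55, "throughput": 16, "memory_gb": 4},
}
best, scores = select_best(metrics)  # AWQ wins on balance for these numbers
```

Note how FP16 takes the entire quality weight here and still loses: its latency, throughput, and memory scores are all at the bottom of their ranges.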

Inference Deployment

The final stage takes the winning model and deploys it as a production inference service. This isn't just “start a container” — it creates a fully managed service with:

  • Multiple replicas for high availability and throughput
  • Autoscaling that adds or removes replicas based on demand
  • Health monitoring that automatically restarts unhealthy instances
  • An OpenAI-compatible API that any existing OpenAI client can use without modification

The last point is worth emphasizing: the deployed service speaks the exact same API format as the OpenAI API. Any application, SDK, or tool built for OpenAI's API works out of the box — just point it at a different URL. This makes adoption frictionless and avoids vendor lock-in.
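As a concrete sketch, this is the request shape an OpenAI-compatible endpoint accepts at POST /v1/chat/completions. The base URL is a hypothetical placeholder for wherever the service is deployed:

```python
# Hypothetical example URL for the self-hosted service
BASE_URL = "http://inference.internal.example/v1"

def chat_request(model: str, user_message: str, max_tokens: int = 256) -> dict:
    # Same JSON body the hosted OpenAI API uses, so existing SDKs work unchanged
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

body = chat_request("production", "Summarize this architecture in one sentence.")
```

With the official `openai` Python SDK, pointing a client at the service is just a matter of passing `base_url=BASE_URL` when constructing the client; no other code changes.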

Key Architectural Components

Workflow Orchestration

The pipeline needs a conductor — something that knows which stages to run, in what order, which can run in parallel, and what to do when something fails. This architecture uses a Kubernetes-native workflow engine.

What it provides:

  • Step sequencing — Stage 1 must complete before Stage 2 begins; all Stage 2 jobs must complete before Stage 3 starts
  • Parallelism — Stage 2's three optimization jobs run simultaneously
  • Memoization — Expensive steps (like model downloads) are cached, so re-running the pipeline doesn't repeat completed work
  • Retry policies — If a step fails due to a transient issue (network timeout, temporary resource exhaustion), it's automatically retried
  • Visibility — A dashboard shows pipeline progress, logs, and history

Think of it as CI/CD for models. Just as a CI pipeline compiles, tests, and deploys code, this pipeline acquires, optimizes, selects, and deploys models.

The Inference Engine

At the heart of the serving layer is a high-performance LLM inference engine. This is not a generic model server — it's purpose-built for the unique characteristics of LLM inference:

PagedAttention: The biggest memory bottleneck in LLM inference is the key-value (KV) cache — a data structure that grows with conversation length and must be kept in GPU memory. Traditional approaches pre-allocate a fixed block of memory for each request, wasting up to 90% on unused space. PagedAttention manages this memory like an operating system manages virtual memory: in small pages that are allocated on demand and reclaimed when freed. This alone can increase throughput by 2–4x.
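The page-allocation idea can be illustrated with a toy allocator. This is a sketch of the concept only; a real engine such as vLLM manages fixed-size GPU memory blocks, not Python lists:

```python
class PagePool:
    """Toy KV-cache page allocator: fixed-size pages handed out on demand,
    returned to a free list when a request completes."""

    def __init__(self, num_pages: int, tokens_per_page: int = 16):
        self.tokens_per_page = tokens_per_page
        self.free = list(range(num_pages))  # free list of page ids
        self.pages = {}                     # request id -> allocated page ids
        self.tokens = {}                    # request id -> token count

    def append_tokens(self, req_id: str, n: int) -> None:
        total = self.tokens.get(req_id, 0) + n
        needed = -(-total // self.tokens_per_page)  # ceiling division
        owned = self.pages.setdefault(req_id, [])
        while len(owned) < needed:
            owned.append(self.free.pop())  # allocate a page only when needed
        self.tokens[req_id] = total

    def release(self, req_id: str) -> None:
        # Freed pages become immediately available to other requests
        self.free.extend(self.pages.pop(req_id, []))
        self.tokens.pop(req_id, None)
```

The contrast with pre-allocation is the point: a request holding 10 tokens occupies one 16-token page, not a worst-case block sized for the maximum sequence length.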

Continuous Batching: Traditional model servers collect a batch of requests, process them together, and return results. If the batch size is 8 but only 3 requests have arrived, the server waits. Continuous batching removes this wait — as soon as one request in a batch finishes, a new request takes its slot. The GPU is never idle waiting for a batch to fill.

Native Quantization Support: The engine natively understands all three quantization formats (FP16, AWQ, GPTQ), loading and executing them at their optimal precision without wrapper code.

Model Serving Layer

On top of the inference engine sits a model serving layer that handles the operational concerns of running a model in production:

  • Declarative deployment — You describe what you want (model path, resources, replicas) and the platform makes it happen
  • Canary rollouts — When deploying a new model version, you can route a percentage of traffic to the new version while the old version still handles the majority, then gradually shift traffic once confidence builds
  • Traffic management — Load balancing, request routing, and connection draining during deployments
  • Health management — Continuous health checks that detect and replace failed instances

This layer is what turns a “container running an inference engine” into a “production service.”

Autoscaling

LLM autoscaling is different from web application autoscaling. CPU utilization — the standard autoscaling metric — is a poor signal for LLM workloads because a GPU can be saturated while the CPU sits idle.

This architecture scales on request concurrency: the number of requests being actively processed by each replica. When concurrency exceeds a threshold (e.g., 10 concurrent requests per replica), new replicas are added. When it drops, replicas are removed.
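The replica math is simple to sketch. The parameter names and defaults below are illustrative assumptions (the concurrency-target approach mirrors what Knative's autoscaler does, but this is not its implementation):

```python
import math

def desired_replicas(in_flight: int, target_per_replica: int = 10,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Enough replicas that each stays at or below the concurrency target,
    clamped to [min_replicas, max_replicas]."""
    wanted = math.ceil(in_flight / target_per_replica) if in_flight else min_replicas
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 10, the 11th concurrent request triggers a second replica, and the hard cap of 5 bounds cost no matter how large the burst.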

Scaling Behavior

  • Low traffic — a single replica handles everything
  • Growing traffic — new replicas spin up as concurrency exceeds the target
  • Peak traffic — a hard cap prevents runaway scaling and cost surprises
  • Declining traffic — replicas scale back down after a cooldown period

The cold start problem (60–120 seconds to load a model) is mitigated by keeping at least one replica always warm and by using shared storage so new replicas don't need to download the model — just load it from the already-mounted filesystem.

Shared Storage

The simplest and most elegant part of the architecture is how stages communicate: a shared persistent filesystem.

There are no message queues, no object store APIs, no artifact transfer steps. Stage 1 writes the base model to a known path. Stage 2 reads from that path and writes optimized variants to another known path. Stage 3 reads from there and writes the winner to the production path. Stage 4 mounts the production path and serves it.

Why this works well:

  • Simplicity — Every developer understands files and directories
  • Debuggability — You can inspect any intermediate artifact by reading the filesystem
  • No serialization overhead — Model files are already in the format the inference engine expects
  • Atomicity — File copy operations are straightforward and reliable
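The atomicity point deserves a sketch. One way to promote a winning variant without serving replicas ever seeing a half-copied model is to copy into a staging directory on the same volume, then rename into place (the rename itself is atomic on POSIX filesystems). The function and paths below are illustrative, not the pipeline's actual code:

```python
import os
import shutil
import tempfile

def promote(variant_dir: str, production_dir: str) -> None:
    """Copy the winning variant to a staging dir on the same volume,
    then rename it into place so readers never see a partial copy."""
    staging = tempfile.mkdtemp(dir=os.path.dirname(production_dir))
    staging_model = os.path.join(staging, "model")
    shutil.copytree(variant_dir, staging_model)
    if os.path.exists(production_dir):
        # Simple sketch; real code would keep the old version for rollback
        shutil.rmtree(production_dir)
    os.rename(staging_model, production_dir)  # atomic on the same filesystem
    os.rmdir(staging)
```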

When to move beyond it: For very large models (100+ GB), multi-cluster deployments, or environments where persistent volumes are expensive, object storage (S3, GCS) with a local cache layer becomes the better choice. But for most use cases, a shared filesystem is the right level of simplicity.

Design Decisions and Trade-offs

Why a Pipeline Instead of a Monolith?

A monolithic approach (one script that downloads, optimizes, selects, and serves) seems simpler, but it creates problems at scale:

| Concern | Monolith | Pipeline |
| --- | --- | --- |
| Failure recovery | Restart everything from scratch | Retry only the failed step |
| Iteration speed | Full re-run to change optimization | Only re-run from the changed step |
| Resource efficiency | One pod needs max resources for all stages | Each pod requests only what it needs |
| Parallelism | Serial execution | Optimization runs in parallel |
| Observability | One log stream for everything | Per-stage logs, metrics, and status |

Why Three Optimization Variants?

Running three quantization strategies and comparing them might seem excessive. Why not just use AWQ, which usually wins?

The answer is that “usually” isn't “always.” The best quantization technique depends on the specific model architecture, the inference hardware, and the use case priorities. A pipeline that automatically evaluates all options and picks the winner adapts to these variables without human intervention.

Why OpenAI API Compatibility?

The deployed inference service could expose a custom API. Instead, it implements the OpenAI chat completions format. This was a deliberate choice:

  • Ecosystem leverage — Thousands of tools, libraries, and applications already support the OpenAI API format
  • Migration simplicity — Teams can switch from OpenAI's hosted service to self-hosted inference by changing a single URL
  • No lock-in — If the team later wants to switch inference engines or move back to a hosted service, no client code changes

Shared Filesystem vs. Object Storage vs. Message Queue

Three common patterns for inter-stage communication:

| Pattern | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Shared filesystem | Simple, fast, debuggable | Single-cluster only, capacity limits | Most deployments |
| Object storage (S3/GCS) | Unlimited scale, multi-cluster, durable | Latency, IAM complexity, transfer costs | Large models, multi-region |
| Message queue | Event-driven, loose coupling | Overkill for large binary artifacts | Metadata, not model weights |

This architecture chose the shared filesystem because it offers the best simplicity-to-capability ratio for single-cluster deployments.

How Data Flows Through the System

The pipeline transforms a model through four distinct states, each stored at a known path:

Data Flow

Model Registry — a model identifier, e.g. “facebook/opt-125m”
  ↓
base/ — the original model exactly as published: full-precision weights, tokenizer, config
  ↓
optimized/ — three variants in parallel: FP16 (~50% size) · AWQ (~25% size) · GPTQ (~25% size)
  ↓ evaluate all three
production/ — the winning variant, selected by multi-metric scoring, plus an evaluation report
  ↓ served via
Live API — autoscaled inference endpoint, OpenAI-compatible format

Inter-stage contracts are implicit, defined by file paths:

  • Stage 1 promises to write a valid model to base/
  • Stage 2 promises to write optimized variants to optimized/{fp16,awq,gptq}/
  • Stage 3 promises to write the best model to production/ and metrics to production_metrics.json
  • Stage 4 expects a servable model at production/

This file-based contract is simple but effective. If any stage produces malformed output, the next stage fails fast with a clear error rather than silently propagating bad data.
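A fail-fast contract check can be as small as this. The required file names below are illustrative; the actual artifact list depends on the model format:

```python
import os

# Illustrative artifact list; a real check would derive this from the model format
REQUIRED = ["config.json", "tokenizer.json"]

def check_contract(stage_dir: str, required=REQUIRED) -> None:
    """Raise immediately if a stage's output directory is missing artifacts,
    instead of letting the next stage fail in a confusing way."""
    missing = [f for f in required
               if not os.path.exists(os.path.join(stage_dir, f))]
    if missing:
        raise FileNotFoundError(f"{stage_dir} is missing artifacts: {missing}")
```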

Scaling Strategy

The system scales at two different levels, each addressing a different bottleneck:

Pipeline-Level Scaling

During the optimization stage, the pipeline scales horizontally across compute nodes. Each quantization job runs as an independent container on whatever node has available resources. On a cluster with multiple GPU nodes, all three optimizations can run on different GPUs simultaneously.

This is “batch scaling” — it handles variable workloads during the pipeline execution phase, then releases resources when the pipeline completes.

Serving-Level Scaling

Once deployed, the inference service scales based on incoming request traffic:

Traffic → Replicas → Cost

  • Night (low traffic) → 1 replica → $ baseline
  • Morning ramp-up → 1–2 replicas → $$
  • Business hours (peak) → 2–3 replicas → $$$
  • Evening wind-down → 2–1 replicas → $$
  • Night again → 1 replica → $

The scaling signal is request concurrency (how many requests are being actively processed), not CPU or memory utilization. This is critical because LLM inference is GPU-bound — the CPU might show 10% utilization while the GPU is completely saturated. Scaling on CPU would miss the bottleneck entirely.

Handling Cold Starts

The Achilles' heel of LLM autoscaling is the cold start. Loading a 7B model into GPU memory takes 60–120 seconds. Two strategies mitigate this:

  1. Always-warm minimum — At least one replica is always running, even during low traffic. This guarantees instant response for the first request of a busy period.
  2. Shared model storage — New replicas mount the same persistent volume as existing replicas. The model files are already on disk — the replica just needs to load them into GPU memory, which is faster than downloading from a remote registry.

Multi-Cloud Portability

Because the entire architecture runs on Kubernetes, it's portable across any environment that supports a conformant Kubernetes cluster:

What Stays the Same Everywhere

  • The pipeline logic (all four stages)
  • The optimization and selection algorithms
  • The inference engine configuration
  • The autoscaling policies
  • The API format and client compatibility

What Changes Per Environment

| Setting | AWS | GCP | Azure | On-Premise |
| --- | --- | --- | --- | --- |
| Kubernetes | EKS | GKE | AKS | kubeadm / k3s |
| Storage | EBS (gp3) | Persistent Disk | Azure Disk | Local / NFS |
| GPU | p3 / g5 | n1 + T4/A100 | NC-series | Direct NVIDIA |
| Ingress | ALB | GCE Ingress | Azure LB | MetalLB / NodePort |

The configuration delta between clouds is typically 5–10 lines. The application code, pipeline logic, and infrastructure definitions remain identical.

This portability isn't theoretical — it's a direct consequence of using Kubernetes-native abstractions (PersistentVolumeClaim, Deployment, Service) rather than cloud-specific services (SageMaker, Vertex AI, Azure ML). The trade-off is that you manage the infrastructure yourself, but you own it completely and can move between clouds without re-architecting.

What It Takes to Go to Production

What the Architecture Already Handles

  • Automated pipeline — One command goes from model name to production API
  • Error recovery — Failed steps are retried automatically; successful steps are cached
  • Health monitoring — Inference pods report their health; unhealthy pods are replaced
  • Resource governance — CPU and memory boundaries prevent runaway resource consumption
  • Autoscaling — Traffic-based scaling prevents over-provisioning and under-provisioning
  • Access control — Role-based permissions limit what each component can do

What You Should Add

Observability is the most important gap. You need to see what's happening inside the system in real time:

  • Request latency percentiles (p50, p95, p99)
  • GPU utilization and memory usage
  • Request queue depth
  • Token generation rate
  • Error rates and types

Authentication is essential before exposing the API beyond your internal network. Options range from simple API key validation to a full OAuth2 proxy.

TLS/HTTPS should be added at the ingress layer. All client traffic should be encrypted in transit.

Model versioning beyond the filesystem becomes important as you manage multiple models and need to roll back to previous versions. A model registry (MLflow, Weights & Biases, or even a structured object storage bucket) adds this capability.

Cost controls matter because GPU compute is expensive. Spot/preemptible instances can reduce costs by 60–90% for workloads that can tolerate interruption. Scheduled scaling (scale down overnight if your users are in one timezone) adds further savings.

CI/CD integration turns the pipeline from “something you run manually” to “something that runs automatically when a new model is pushed to your registry.” GitOps tools can watch a Git repository for configuration changes and automatically re-run the pipeline.

Conclusion

Distributed LLM inference is a pipeline problem, not a deployment problem. The architecture described in this post treats it as such — breaking the journey from model registry to production API into four clear, automated stages:

  1. Acquire the model from a registry
  2. Optimize it with multiple strategies in parallel
  3. Select the best variant using data, not guesswork
  4. Serve it with intelligent autoscaling and a standard API

The key architectural principles are:

  • Separation of concerns — Each stage has one job and does it well
  • Parallelism where possible — Independent work runs simultaneously
  • Data-driven selection — Multi-metric scoring replaces intuition
  • Simplicity in communication — A shared filesystem beats complex middleware
  • Portability by default — Kubernetes-native design runs anywhere

The result is a system that takes a model name as input and produces a production-grade, autoscaled, OpenAI-compatible inference API as output — repeatably, reliably, and across any cloud.

Whether you're deploying a small model for internal tooling or a large model for customer-facing applications, the pipeline pattern scales with you. The stages stay the same; only the model name, resource allocation, and scaling thresholds change.