Why this distinction matters more than GPU specs

When I started building Ozmarx, I noticed something: almost every GPU pricing question people asked was framed purely around hourly rates. "What's the cheapest H100?" "Is Lambda Labs cheaper than AWS?" These are reasonable questions, but they're missing a variable that changes the entire math: what are you doing with the GPU?

Training a model and serving a model are fundamentally different economic problems. Training is like running a construction project — you have a defined scope, you run it to completion, and then it's done. Inference is like running a building — you operate it continuously, often at variable load, and cost is a function of how efficiently you serve each request. The right GPU, the right provider, and even the right pricing structure (spot vs on-demand vs reserved) are different for each.

The common mistake: Teams that just finished training on H100 clusters default to using the same infrastructure for inference. Often this is the wrong call on both cost and latency grounds. Inference has different optimization targets, and the most expensive GPU is not always the right answer.

Training economics: cost per run is the number that matters

For training, the relevant unit of cost is the total dollar spend to complete a training run — not the hourly rate. This is a subtle but important reframe. If GPU A costs $2.49/hr and GPU B costs $1.89/hr, but GPU A completes the training run 2.5× faster, GPU A may actually be cheaper per run despite costing 32% more per hour.

This is exactly the H100 vs A100 dynamic for transformer training at scale. I worked through this in detail in the H100 vs A100 article, but the short version: for models above ~7B parameters on multi-GPU setups, the H100's speed advantage typically makes it cheaper per training run than the A100, despite the higher hourly rate.
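The cost-per-run arithmetic from the GPU A/B example above, as a quick sketch (the 100-hour baseline is an illustrative assumption):

```python
# Cost per training run: what you pay is rate × hours, and hours depend on the GPU.
def cost_per_run(hourly_rate: float, run_hours: float) -> float:
    return hourly_rate * run_hours

baseline_hours = 100  # hypothetical run length on the slower GPU B
gpu_b = cost_per_run(1.89, baseline_hours)        # cheaper per hour
gpu_a = cost_per_run(2.49, baseline_hours / 2.5)  # 2.5× faster, pricier per hour

print(f"GPU B: ${gpu_b:,.2f}  GPU A: ${gpu_a:,.2f}")
# GPU A finishes the same run for roughly half the spend
```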

Training also lends itself to spot instances, which can dramatically reduce total cost. Most training jobs can checkpoint and restart — meaning if your instance gets preempted, you lose at most a few hours of work, not the entire run. Spot pricing on H100s is running 50–60% below on-demand right now, which translates to real savings on any serious training job.
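One way to model whether spot is worth it for a checkpointed job: the discount versus the expected rework from preemptions. The preemption count and rework hours below are illustrative assumptions, not measured numbers.

```python
def effective_spot_cost(on_demand_rate: float, spot_discount: float,
                        run_hours: float, expected_preemptions: float,
                        rework_hours_per_preemption: float) -> float:
    """Expected total spend for a checkpointed training run on spot capacity.

    Each preemption costs the work since the last checkpoint, which has
    to be re-run (billed at the spot rate)."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    total_hours = run_hours + expected_preemptions * rework_hours_per_preemption
    return spot_rate * total_hours

on_demand = 2.49 * 100                        # $2.49/hr for a 100-hour run
spot = effective_spot_cost(2.49, 0.55, 100,   # 55% discount (mid of 50-60%)
                           expected_preemptions=4,
                           rework_hours_per_preemption=2)
print(f"on-demand ${on_demand:.0f} vs spot ${spot:.0f}")
```

Even with four preemptions and two hours of lost work each, the spot run costs roughly half the on-demand price, which is why checkpointing is the feature that unlocks spot for training.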

What training buyers should optimize for

- Total cost to complete the run, not the hourly rate
- Wall-clock speed: a faster, pricier GPU can be cheaper per run
- Spot eligibility: if the job checkpoints, a preemption costs hours, not the whole run

Inference economics: cost per token is the number that matters

Inference flips the equation. You're not completing a finite run — you're serving ongoing requests, often with latency requirements. The relevant metrics shift to: throughput (tokens per second per GPU), latency (time to first token, tokens per second during generation), and cost per 1,000 tokens or per million tokens generated.

Because inference is continuous, the economics look more like running a data center than completing a project. You're optimizing GPU utilization across variable load — peak hours, off-peak hours, traffic spikes. A training cluster can sit at 90% GPU utilization for the entire run. An inference cluster might average 30–60% utilization depending on traffic patterns, and the cost of idle GPU time is real.

This is why inference economics often favor a different approach than training: reserved capacity at lower cost, plus autoscaling on top to handle peaks. And why the cheapest per-GPU option isn't always the right inference answer — if a more expensive provider with better infrastructure (lower latency networking, SLA guarantees) allows you to serve the same traffic with fewer GPUs, the effective cost per token can be lower despite higher sticker price.
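Cost per million tokens falls directly out of the hourly rate, per-GPU throughput, and average utilization. A minimal sketch, reusing the illustrative figures from this article:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Dollars per 1M generated tokens for one GPU.

    Idle time still gets billed, so effective throughput scales with
    utilization."""
    tokens_per_hour = tokens_per_second * utilization * 3600
    return hourly_rate / tokens_per_hour * 1_000_000

# H100 at $2.49/hr, ~2,500 tokens/s: fully utilized vs 50% utilized
print(cost_per_million_tokens(2.49, 2500))        # ≈ $0.28 per 1M tokens
print(cost_per_million_tokens(2.49, 2500, 0.5))   # ≈ $0.55 per 1M tokens
```

Note that halving utilization doubles the effective cost per token, which is the whole argument for sizing reserved capacity to your base load rather than your peak.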

What inference buyers should optimize for

- Cost per 1M tokens at your actual traffic, not the per-GPU hourly rate
- Latency: time to first token and tokens per second during generation
- Utilization across variable load; idle GPU time is billed time
- SLA guarantees if you're serving production traffic

The GPU choice changes too

The H100 vs A100 decision looks different for inference than for training.

For training, the H100's compute advantage (3.2× theoretical FP16 FLOPS) is the dominant factor — transformer training is largely compute-bound, and the H100 wins decisively on large models. But inference is different. Inference is often memory-bandwidth-bound: you're loading model weights repeatedly for each request, and the bottleneck is how fast you can move weights from GPU memory to compute, not the raw compute throughput itself.

On memory bandwidth, the H100 SXM5 leads the A100 SXM4 by 68% (3.35 TB/s vs 2.0 TB/s). This is meaningful for inference, but it's a smaller advantage than the compute gap. For many inference workloads — especially smaller models, or quantized models where effective memory requirements are reduced — the A100's lower hourly rate makes it competitive or preferable to the H100 on a cost-per-token basis.
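A rough roofline sketch of why bandwidth dominates decode: at batch size 1, every generated token has to stream the full set of weights through the memory system, so bandwidth sets a hard ceiling on single-stream tokens/s (batching amortizes the weight reads, which is why production throughput is far higher). Model size and bandwidth figures are the ones from this article; KV cache and activation traffic are ignored.

```python
def decode_ceiling_tps(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode speed: one full weight read
    per generated token, ignoring KV cache and activation traffic."""
    return bandwidth_bytes_per_s / weight_bytes

GB = 1e9
for gpu, bw in [("A100 SXM4", 2.0e12), ("H100 SXM5", 3.35e12)]:
    tps = decode_ceiling_tps(70 * GB, bw)   # 70B model at INT8 ≈ 70 GB of weights
    print(f"{gpu}: ~{tps:.0f} tokens/s ceiling at batch 1")
```

The ratio between the two ceilings is exactly the 1.68× bandwidth gap, not the 3.2× compute gap, which is why the H100's inference advantage is more modest than its training advantage.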

| Factor | Training | Inference |
|---|---|---|
| Primary bottleneck | Compute (FLOPS) | Memory bandwidth + latency |
| H100 advantage | Very high (3.2× compute) | Moderate (1.68× memory BW) |
| Cost unit that matters | Total cost per run | Cost per 1M tokens |
| Spot instances useful? | Yes — checkpointing handles preemption | No — interruptions break serving |
| Reserved pricing value | High if workload is predictable | High for base load; autoscale for peaks |
| SLA requirements | Usually not critical | Critical for production serving |
| Best provider options | Lambda Labs, Vast.ai, RunPod | CoreWeave, GCP, Azure |

A real cost comparison: serving a 70B model

To make this concrete, here's how I'd think about the infrastructure cost to serve a 70B parameter model at moderate scale (roughly 50 requests per second, average output length 200 tokens).

A 70B model in BF16 requires ~140GB of GPU memory just for weights — that nearly fills two 80GB H100 SXM5 GPUs, leaving little room for KV cache. In practice you need at least 4 GPUs for comfortable inference headroom. With INT8 quantization you can fit on 2× H100, at some cost in output quality.

At 50 req/s with 200 output tokens per request, you're generating ~10,000 tokens per second. A single H100 serving a 70B model in production delivers roughly 2,000–3,000 tokens/second, depending on batch size and quantization. That means you need at least 4–5 H100s to serve this traffic without queuing.
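The sizing arithmetic above, as a sketch (per-GPU throughput is the rough 2,000–3,000 tokens/s range from the text):

```python
import math

def gpus_needed(requests_per_s: float, avg_output_tokens: float,
                tokens_per_s_per_gpu: float) -> int:
    """Minimum GPU count to serve the aggregate token demand without queuing."""
    demand = requests_per_s * avg_output_tokens   # total tokens/s to generate
    return math.ceil(demand / tokens_per_s_per_gpu)

print(gpus_needed(50, 200, 2500))   # mid-range throughput estimate
print(gpus_needed(50, 200, 2000))   # conservative throughput estimate
```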

| Provider | H100 count | Monthly cost | Notes |
|---|---|---|---|
| Lambda Labs (on-demand) | 4× H100 | $7,171 | $2.49/hr × 4 × 720 hr. No SLA. |
| Lambda Labs (1-yr reserved) | 4× H100 | $5,731 | $1.99/hr × 4 × 720 hr. No SLA. |
| CoreWeave (on-demand) | 4× H100 | $8,611 | $2.99/hr × 4 × 720 hr. Formal SLA, InfiniBand. |
| GCP (a3-highgpu) | 4× H100 | $8,640 | $3.00/hr × 4 × 720 hr. Full ecosystem. |
| AWS (p5.48xlarge, per GPU) | 4× H100 | $11,232 | $3.90/hr × 4 × 720 hr. Enterprise SLA, full ecosystem. |

Monthly costs assume continuous utilization (720 hrs). Actual utilization may be lower for variable traffic. Last verified April 2026.

The utilization question matters here: If your inference cluster runs at 50% average utilization (realistic for variable traffic), you're paying for idle GPU time. On Lambda on-demand, that's $3,586/month in idle cost. This is where a reserved base + autoscaling overage on-demand becomes the right architecture — commit to the base you'll always use, and pay on-demand only for traffic spikes.
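One way to sketch the reserved-base-plus-autoscale math, using the Lambda rates from the table above. The 2-GPU base, 4-GPU peak, and ~8 hours of peak traffic per day are illustrative assumptions about the traffic pattern, not a recommendation.

```python
HOURS = 720  # one month

def all_on_demand(gpus: int, rate: float) -> float:
    """Run the full peak fleet on-demand around the clock."""
    return gpus * rate * HOURS

def reserved_plus_overage(base_gpus: int, reserved_rate: float,
                          peak_gpus: int, on_demand_rate: float,
                          peak_hours: float) -> float:
    """Reserve the always-on base; pay on-demand only for the extra
    GPUs during peak hours."""
    base = base_gpus * reserved_rate * HOURS
    overage = (peak_gpus - base_gpus) * on_demand_rate * peak_hours
    return base + overage

flat = all_on_demand(4, 2.49)
blended = reserved_plus_overage(2, 1.99, 4, 2.49, peak_hours=240)  # ~8 hr/day
print(f"all on-demand ${flat:,.0f} vs reserved base + overage ${blended:,.0f}")
```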

Quantization: the inference cost lever most teams underuse

One thing I didn't fully appreciate until I dug into inference economics: quantization is often the highest-leverage cost reduction available for inference workloads — higher than provider selection.

A 70B model in BF16 (full precision) requires ~140GB of GPU memory. The same model in INT8 requires ~70GB — fitting on 1 H100 instead of 2. INT4 quantization gets it to ~35GB. The quality tradeoff depends on the quantization method and the model, but for many use cases (especially retrieval-augmented generation, summarization, classification), INT8 quantized models are indistinguishable from BF16 for end users while cutting GPU requirements in half.

Halving your GPU requirements halves your inference compute cost. That's a 50% cost reduction before you even look at provider selection. Tools like bitsandbytes, GPTQ, and AWQ make this relatively accessible now. If you're running inference on large models and haven't explored quantization, that's the first place I'd look.
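The memory arithmetic behind those numbers is simple: weight footprint ≈ parameter count × bytes per parameter. This ignores KV cache and activation memory, which add on top.

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate GPU memory needed for model weights alone, in GB."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# 70B parameters at each precision
for name, bytes_per_param in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"70B @ {name}: ~{weight_gb(70, bytes_per_param):.0f} GB weights")
# BF16 → 140 GB, INT8 → 70 GB, INT4 → 35 GB
```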

The provider question for inference

For training, my default recommendation is Lambda Labs — the price advantage is significant and the SLA limitation rarely matters. For inference, the calculus shifts toward CoreWeave for most production use cases.

The reasons: CoreWeave offers formal SLAs, which you need for customer-facing serving. Its Kubernetes-native infrastructure is built for the kind of autoscaling and service management patterns inference deployments require. And while it's 20% more expensive per GPU than Lambda, the operational overhead savings for a production deployment often justify the cost difference.

If you're at early stages — serving internal users, building a prototype, or running low-traffic inference — Lambda on-demand is perfectly sufficient and meaningfully cheaper. If you're serving external customers at scale with latency and availability requirements, CoreWeave is where I'd start the conversation.

Hyperscalers (GCP, AWS) are worth considering if you're already deeply integrated into their ecosystems, have compliance requirements that mandate cloud-provider certifications, or need managed inference services (GCP's Vertex AI, AWS SageMaker) rather than raw GPU access. For raw GPU inference without those constraints, you're paying a significant premium for infrastructure you probably don't need.

Quick decision framework

- Training workloads → Optimize for cost per run. Use H100 for models above ~7B params. Use spot instances if you can checkpoint. Lambda Labs or Vast.ai for most teams; CoreWeave for very large distributed runs requiring InfiniBand.
- Inference workloads → Optimize for cost per token and latency SLA. Explore quantization first — it often beats provider optimization on cost. Use reserved pricing for base load. For production serving with SLA requirements, CoreWeave or GCP. For internal or low-traffic inference, Lambda is fine.
- Mixed workloads → Consider separating your training and inference infrastructure. Running inference on your training cluster is often wasteful — training clusters are optimized for throughput, not latency. The overhead of switching between use cases eats into both.

Use the tools to model your specific workload

The numbers in this piece are illustrative — your actual cost depends on model size, traffic volume, utilization patterns, and what quality tradeoffs you're willing to make on quantization. The TCO Calculator lets you input your specific parameters and get a real dollar comparison across providers. For provider selection, the GPU Finder factors in workload type when making recommendations. Use them before committing to infrastructure.