The spec comparison first

Before running the economics, it's worth being precise about what the H100 actually adds — because the raw specs are often cited without context, and context is what makes the decision. Two numbers matter most.

| Spec | H100 SXM5 80GB | A100 SXM4 80GB | Delta |
|---|---|---|---|
| FP16 / BF16 performance | 989 TFLOPS | 312 TFLOPS | +3.2× |
| FP8 performance | 1,979 TFLOPS | N/A | H100 only |
| Memory bandwidth | 3.35 TB/s | 2.0 TB/s | +68% |
| HBM capacity | 80 GB | 80 GB | Equal |
| NVLink bandwidth | 900 GB/s | 600 GB/s | +50% |
| Transformer Engine | Yes (FP8 mixed) | No | H100 only |

What the table tells you: the H100 is dramatically better for compute-bound workloads — which transformer training at scale is. But it has identical memory capacity to the A100. If your bottleneck is VRAM — fitting the model — upgrading to H100 gets you nothing on that dimension. If your bottleneck is compute throughput — getting through forward and backward passes — the H100 is significantly faster.

For most transformer training workloads at scale, the bottleneck is compute. That's where the H100 advantage concentrates, and that's why the hourly rate comparison alone is misleading.
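The hourly-rate comparison resolves to a single break-even point: the H100 wins on total cost whenever its effective speedup exceeds its price premium. A minimal sketch, using this article's April 2026 Lambda Labs rates:

```python
# Break-even speedup for the H100 over the A100: the H100 is cheaper
# per run whenever its effective speedup exceeds the hourly price ratio.
# Rates are the April 2026 Lambda Labs on-demand prices cited in this article.
h100_rate = 2.49  # $/GPU/hr, H100 SXM5 80GB
a100_rate = 1.89  # $/GPU/hr, A100 SXM4 80GB

break_even = h100_rate / a100_rate
print(f"H100 wins on total cost once effective speedup exceeds {break_even:.2f}x")
```

At these rates the break-even sits around 1.32×, well below even the weakest real-world speedups in the scenarios that follow.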

Real-world speedup: what "2–3×" actually means

The theoretical FP16 speedup is 3.2×, but you won't see that in practice. Memory transfers, communication overhead in multi-GPU settings, and I/O bottlenecks all reduce effective utilization. When I looked at benchmarks across model sizes, the realized speedup clusters between roughly 1.4× for small image-model fine-tunes and 2.5× for large multi-GPU transformer runs, and those are the figures the scenarios below use.

The cost math: three training scenarios

This is the part I actually wanted to figure out when I built Ozmarx — because "H100 is faster" is obvious, but "H100 is cheaper per training run" is the more interesting claim. I modeled three common scenarios using April 2026 on-demand pricing (Lambda Labs H100 at $2.49/hr, A100 at $1.89/hr), assuming 8 GPUs per setup.
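The model behind the three scenario tables is one line of arithmetic: rate × wall-clock hours × GPU count. A minimal sketch, with the Scenario 1 inputs plugged in:

```python
def run_cost(rate_per_gpu_hr: float, hours: float, gpus: int = 8) -> float:
    """Total cost of a training run: hourly rate x wall-clock hours x GPU count."""
    return rate_per_gpu_hr * hours * gpus

# Scenario 1 inputs (Llama-3 7B fine-tune, 8 GPUs):
a100_total = run_cost(1.89, 40)  # matches the ~$605 in the table
h100_total = run_cost(2.49, 24)  # matches the ~$478 in the table
savings = 1 - h100_total / a100_total  # ~21%, as quoted below
```

The other two scenarios are the same function with different hours; only the inputs change.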

Scenario 1: Llama-3 7B fine-tune (small workload)

| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost (8 GPUs) |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 40 hrs | $605 |
| H100 SXM5 | $2.49 | ~1.7× | 24 hrs | $478 |

At this scale, H100 actually wins on total cost ($478 vs $605) despite the higher hourly rate — because the run is short enough that speed dominates the math. Savings: 21%.

Scenario 2: Llama-3 70B full pre-training run

| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost (8 GPUs) |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 2,400 hrs | $36,288 |
| H100 SXM5 | $2.49 | ~2.5× | 960 hrs | $19,123 |

This is the clearest H100 case. At 70B scale, the 2.5× speedup so completely overwhelms the 32% hourly premium that the H100 ends up 47% cheaper per training run. This is the scenario I keep coming back to when people ask whether the H100 upgrade is worth it — at serious scale, the question almost answers itself.

Scenario 3: Stable Diffusion fine-tune (image model, small)

| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost (8 GPUs) |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 12 hrs | $181 |
| H100 SXM5 | $2.49 | ~1.4× | 8.5 hrs | $169 |

For image model fine-tunes, H100 saves only $12 — essentially a wash. The speedup advantage gets diluted by I/O overhead at small model sizes. For a team doing many iterations of this type of run, I'd actually weigh A100 availability and simplicity higher than the marginal cost difference, which is noise at this scale.

The rule of thumb: Transformer models above ~7B parameters on multi-GPU setups — H100 is almost certainly cheaper per training run despite the higher hourly rate. Below that threshold, it depends on your iteration frequency and how much the engineer's time factors into your cost calculation. Clock time has a price too, and it's often higher than the compute bill.
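The "clock time has a price too" point can be folded directly into the per-run comparison. A sketch, assuming a hypothetical $150/hr loaded engineer rate (an illustrative number, not from this article's data):

```python
def effective_cost(gpu_rate: float, hours: float, gpus: int = 8,
                   engineer_rate: float = 150.0) -> float:
    """Compute bill plus the cost of one engineer waiting on the run.

    engineer_rate is an assumed illustrative figure; substitute your own.
    """
    return gpu_rate * hours * gpus + engineer_rate * hours

# Scenario 1 inputs, with engineer time priced in:
a100_eff = effective_cost(1.89, 40)  # $604.80 compute + $6,000 engineer time
h100_eff = effective_cost(2.49, 24)  # $478.08 compute + $3,600 engineer time
```

Pricing the waiting time widens the H100's advantage: the gap per run grows from about $127 to over $2,500 under this assumption, which is why iteration-heavy teams often land on H100 even below the parameter-count threshold.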

The VRAM argument: when it doesn't matter which is faster

Both the H100 SXM5 and A100 SXM4 come in 80GB variants. If your model fits in 80GB per GPU (including optimizer states and activations in mixed precision), VRAM is not a differentiating factor. If it doesn't, you're sharding either way — and then the H100's NVLink bandwidth advantage (+50%) starts to matter for how efficiently that sharding performs.
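A quick way to check the "does it fit in 80 GB" question is the standard mixed-precision Adam rule of thumb: fp16 weights and gradients (2 + 2 bytes/param) plus fp32 master weights and two optimizer moments (4 + 4 + 4 bytes/param), roughly 16 bytes per parameter before activations. A sketch under that assumption:

```python
def training_footprint_gb(params_billions: float,
                          bytes_per_param: int = 16) -> float:
    """Approximate training-state memory in GB for mixed-precision Adam.

    16 bytes/param = fp16 weights (2) + fp16 grads (2) + fp32 master
    weights (4) + fp32 Adam moments (4 + 4). Activations are extra and
    depend on batch size and checkpointing, so treat this as a floor.
    """
    return params_billions * bytes_per_param  # 1e9 params x bytes -> GB

print(training_footprint_gb(7))   # a 7B model already exceeds 80 GB unsharded
print(training_footprint_gb(70))  # 70B shards across many GPUs either way
```

By this estimate even a 7B model needs ~112 GB of training state, so anything beyond small fine-tunes is sharded on either card, and the interconnect comparison below becomes the relevant spec.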

One thing worth knowing: the H100 PCIe variant also comes in 80GB, but with slower interconnect. The SXM5 is the one with full NVLink 4.0, and it's what most neoclouds are offering for training workloads. Make sure you're comparing the right version when you look at pricing.

When to choose A100 instead

The A100 isn't just the slower, cheaper option. There are scenarios where it's genuinely the right call: inference serving, small experiments where availability matters more than throughput, and budgets that cap monthly spend rather than run length.

Decision framework

- Choose H100 for transformer models above ~7B params on multi-GPU setups, or any workload where wall-clock time has a cost (engineer hours, experiment velocity). Despite the higher hourly rate, it will almost certainly be cheaper per training run.
- Consider A100 for inference, small model experiments where iteration speed matters more than throughput, or if your monthly budget constrains how many hours you can run rather than how fast each run completes.
- Run your own numbers using the TCO Calculator. Enter your model size, GPU count, and estimated run time — the math is simple, but the inputs have to match your actual workload to mean anything.

Current pricing reference (April 2026)

| Provider | GPU | On-Demand | Reserved (1yr) |
|---|---|---|---|
| Lambda Labs | H100 SXM5 80GB | $2.49/hr | $1.99/hr |
| Lambda Labs | A100 SXM4 80GB | $1.89/hr | ~$1.50/hr |
| CoreWeave | H100 SXM5 80GB | $2.99/hr | Negotiated |
| AWS (p5) | H100 SXM5 80GB | $3.90/hr | $2.21/hr (3yr) |
| GCP (a3-highgpu) | H100 SXM5 80GB | $3.00/hr | ~$2.10/hr (1yr) |

Per-GPU rates. Last verified April 2026. See full comparison table →