The spec comparison first

Before running the economics, it's worth being precise about what the H100 actually adds — because the raw specs are often cited without context, and context is what makes the decision. Two numbers matter most.

| Spec | H100 SXM5 80GB | A100 SXM4 80GB | Delta |
|---|---|---|---|
| FP16 / BF16 performance | 989 TFLOPS | 312 TFLOPS | +3.2× |
| FP8 performance | 1,979 TFLOPS | N/A | H100 only |
| Memory bandwidth | 3.35 TB/s | 2.0 TB/s | +68% |
| HBM capacity | 80 GB | 80 GB | Equal |
| NVLink bandwidth | 900 GB/s | 600 GB/s | +50% |
| Transformer Engine | Yes (FP8 mixed) | No | H100 only |

What the table tells you: the H100 is dramatically better for compute-bound workloads — which transformer training at scale is. But it has identical memory capacity to the A100. If your bottleneck is VRAM — fitting the model — upgrading to H100 gets you nothing on that dimension. If your bottleneck is compute throughput — getting through forward and backward passes — the H100 is significantly faster.

For most transformer training workloads at scale, the bottleneck is compute. That's where the H100 advantage concentrates, and that's why the hourly rate comparison alone is misleading.
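The hourly-rate comparison resolves to a single break-even point: the H100 wins on total cost whenever its effective speedup exceeds its price premium. A minimal sketch, using this article's April 2026 Lambda Labs rates:

```python
# Break-even speedup for the H100 over the A100: the H100 is cheaper
# per run whenever its effective speedup exceeds the hourly price ratio.
# Rates are the April 2026 Lambda Labs on-demand prices cited in this article.
h100_rate = 2.49  # $/GPU/hr, H100 SXM5 80GB
a100_rate = 1.89  # $/GPU/hr, A100 SXM4 80GB

break_even = h100_rate / a100_rate
print(f"H100 wins on total cost once effective speedup exceeds {break_even:.2f}x")
```

At these rates the break-even sits around 1.32×, well below even the weakest real-world speedups in the scenarios that follow.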

Real-world speedup: what "2–3×" actually means

The theoretical FP16 speedup is 3.2×, but you won't see that in practice. Memory transfers, communication overhead in multi-GPU settings, and I/O bottlenecks all reduce effective utilization. When I looked at benchmarks across model sizes, the realized speedup clusters between roughly 1.4× for small image-model fine-tunes and 2.5× for large multi-GPU transformer runs, and those are the figures the scenarios below use.

The cost math: three training scenarios

This is the part I actually wanted to figure out when I built Ozmarx — because "H100 is faster" is obvious, but "H100 is cheaper per training run" is the more interesting claim. I modeled three common scenarios using April 2026 on-demand pricing (Lambda Labs H100 at $2.49/hr, A100 at $1.89/hr), assuming 8 GPUs per setup.
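The model behind the three scenario tables is one line of arithmetic: rate × wall-clock hours × GPU count. A minimal sketch, with the Scenario 1 inputs plugged in:

```python
def run_cost(rate_per_gpu_hr: float, hours: float, gpus: int = 8) -> float:
    """Total cost of a training run: hourly rate x wall-clock hours x GPU count."""
    return rate_per_gpu_hr * hours * gpus

# Scenario 1 inputs (Llama-3 7B fine-tune, 8 GPUs):
a100_total = run_cost(1.89, 40)  # matches the ~$605 in the table
h100_total = run_cost(2.49, 24)  # matches the ~$478 in the table
savings = 1 - h100_total / a100_total  # ~21%, as quoted below
```

The other two scenarios are the same function with different hours; only the inputs change.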

Scenario 1: Llama-3 7B fine-tune (small workload)

| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost (8 GPUs) |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 40 hrs | $605 |
| H100 SXM5 | $2.49 | ~1.7× | 24 hrs | $478 |

At this scale, H100 actually wins on total cost ($478 vs $605) despite the higher hourly rate — because the run is short enough that speed dominates the math. Savings: 21%.

Scenario 2: Llama-3 70B full pre-training run

| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost (8 GPUs) |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 2,400 hrs | $36,288 |
| H100 SXM5 | $2.49 | ~2.5× | 960 hrs | $19,123 |

This is the clearest H100 case. At 70B scale, the 2.5× speedup so completely overwhelms the 32% hourly premium that the H100 ends up 47% cheaper per training run. This is the scenario I keep coming back to when people ask whether the H100 upgrade is worth it — at serious scale, the question almost answers itself.

Scenario 3: Stable Diffusion fine-tune (image model, small)

| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost (8 GPUs) |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 12 hrs | $181 |
| H100 SXM5 | $2.49 | ~1.4× | 8.5 hrs | $169 |

For image model fine-tunes, H100 saves only $12 — essentially a wash. The speedup advantage gets diluted by I/O overhead at small model sizes. For a team doing many iterations of this type of run, I'd actually weigh A100 availability and simplicity higher than the marginal cost difference, which is noise at this scale.

The rule of thumb: Transformer models above ~7B parameters on multi-GPU setups — H100 is almost certainly cheaper per training run despite the higher hourly rate. Below that threshold, it depends on your iteration frequency and how much the engineer's time factors into your cost calculation. Clock time has a price too, and it's often higher than the compute bill.
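The "clock time has a price too" point can be folded directly into the per-run comparison. A sketch, assuming a hypothetical $150/hr loaded engineer rate (an illustrative number, not from this article's data):

```python
def effective_cost(gpu_rate: float, hours: float, gpus: int = 8,
                   engineer_rate: float = 150.0) -> float:
    """Compute bill plus the cost of one engineer waiting on the run.

    engineer_rate is an assumed illustrative figure; substitute your own.
    """
    return gpu_rate * hours * gpus + engineer_rate * hours

# Scenario 1 inputs, with engineer time priced in:
a100_eff = effective_cost(1.89, 40)  # $604.80 compute + $6,000 engineer time
h100_eff = effective_cost(2.49, 24)  # $478.08 compute + $3,600 engineer time
```

Pricing the waiting time widens the H100's advantage: the gap per run grows from about $127 to over $2,500 under this assumption, which is why iteration-heavy teams often land on H100 even below the parameter-count threshold.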

The VRAM argument: when it doesn't matter which is faster

Both the H100 SXM5 and A100 SXM4 come in 80GB variants. If your model fits in 80GB per GPU (including optimizer states and activations in mixed precision), VRAM is not a differentiating factor. If it doesn't, you're sharding either way — and then the H100's NVLink bandwidth advantage (+50%) starts to matter for how efficiently that sharding performs.
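A quick way to check the "does it fit in 80 GB" question is the standard mixed-precision Adam rule of thumb: fp16 weights and gradients (2 + 2 bytes/param) plus fp32 master weights and two optimizer moments (4 + 4 + 4 bytes/param), roughly 16 bytes per parameter before activations. A sketch under that assumption:

```python
def training_footprint_gb(params_billions: float,
                          bytes_per_param: int = 16) -> float:
    """Approximate training-state memory in GB for mixed-precision Adam.

    16 bytes/param = fp16 weights (2) + fp16 grads (2) + fp32 master
    weights (4) + fp32 Adam moments (4 + 4). Activations are extra and
    depend on batch size and checkpointing, so treat this as a floor.
    """
    return params_billions * bytes_per_param  # 1e9 params x bytes -> GB

print(training_footprint_gb(7))   # a 7B model already exceeds 80 GB unsharded
print(training_footprint_gb(70))  # 70B shards across many GPUs either way
```

By this estimate even a 7B model needs ~112 GB of training state, so anything beyond small fine-tunes is sharded on either card, and the interconnect comparison below becomes the relevant spec.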

One thing worth knowing: the H100 PCIe variant also comes in 80GB, but with slower interconnect. The SXM5 is the one with full NVLink 4.0, and it's what most neoclouds are offering for training workloads. Make sure you're comparing the right version when you look at pricing.

When to choose A100 instead

The A100 isn't just the slower, cheaper option. There are scenarios where it's genuinely the right call: inference serving, small experiments where availability matters more than throughput, and budgets that cap monthly spend rather than run length.

Decision framework

- Choose H100 for transformer models above ~7B params on multi-GPU setups, or any workload where wall-clock time has a cost (engineer hours, experiment velocity). Despite the higher hourly rate, it will almost certainly be cheaper per training run.
- Consider A100 for inference, small model experiments where iteration speed matters more than throughput, or if your monthly budget constrains how many hours you can run rather than how fast each run completes.
- Run your own numbers using the TCO Calculator. Enter your model size, GPU count, and estimated run time — the math is simple, but the inputs have to match your actual workload to mean anything.

Current pricing reference (April 2026)

| Provider | GPU | On-Demand | Reserved (1yr) |
|---|---|---|---|
| Lambda Labs | H100 SXM5 80GB | $2.49/hr | $1.99/hr |
| Lambda Labs | A100 SXM4 80GB | $1.89/hr | ~$1.50/hr |
| CoreWeave | H100 SXM5 80GB | $2.99/hr | Negotiated |
| AWS (p5) | H100 SXM5 80GB | $3.90/hr | $2.21/hr (3yr) |
| GCP (a3-highgpu) | H100 SXM5 80GB | $3.00/hr | ~$2.10/hr (1yr) |

Per-GPU rates. Last verified April 2026. See full comparison table →