The spec comparison first
Before running the economics, it's worth being precise about what the H100 actually adds — because the raw specs are often cited without context, and context is what makes the decision. Two numbers matter most.
| Spec | H100 SXM5 80GB | A100 SXM4 80GB | Delta |
|---|---|---|---|
| FP16 / BF16 Tensor Core (dense) | 989 TFLOPS | 312 TFLOPS | +3.2× |
| FP8 Tensor Core (dense) | 1,979 TFLOPS | N/A | H100 only |
| Memory bandwidth | 3.35 TB/s | 2.0 TB/s | +68% |
| HBM capacity | 80 GB | 80 GB | Equal |
| NVLink bandwidth | 900 GB/s | 600 GB/s | +50% |
| Transformer Engine | Yes (FP8 mixed) | No | H100 only |
What the table tells you: the H100 is dramatically better for compute-bound workloads — which transformer training at scale is. But it has identical memory capacity to the A100. If your bottleneck is VRAM — fitting the model — upgrading to H100 gets you nothing on that dimension. If your bottleneck is compute throughput — getting through forward and backward passes — the H100 is significantly faster.
For most transformer training workloads at scale, the bottleneck is compute. That's where the H100 advantage concentrates, and that's why the hourly rate comparison alone is misleading.
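The compute-vs-bandwidth tension in the table can be made concrete with a roofline-style ratio: how many FLOPs of compute each GPU offers per byte of HBM traffic. A minimal sketch using only the table's numbers (the interpretation is standard roofline reasoning, not a benchmark):

```python
# Roofline-style check using the spec table's numbers. A kernel whose
# arithmetic intensity (FLOPs per byte moved) falls below this balance
# point is bandwidth-bound; above it, compute-bound.
specs = {
    "A100 SXM4": {"fp16_tflops": 312, "bw_tbs": 2.0},
    "H100 SXM5": {"fp16_tflops": 989, "bw_tbs": 3.35},
}

for name, s in specs.items():
    # TFLOPS / (TB/s): the tera factors cancel, leaving FLOPs per byte
    balance = s["fp16_tflops"] / s["bw_tbs"]
    print(f"{name}: compute-bound above ~{balance:.0f} FLOPs/byte")
```

Note what this implies: the H100's balance point is roughly twice the A100's, so a kernel that was bandwidth-bound on the A100 sees only the +68% bandwidth gain on the H100, not the 3.2× compute gain. Big dense matmuls clear both thresholds easily, which is why transformer training captures most of the advantage.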
Real-world speedup: what "2–3×" actually means
The theoretical FP16 speedup is 3.2×, but you won't see that in practice. Memory transfers, communication overhead in multi-GPU settings, and I/O bottlenecks all reduce effective utilization. Across the benchmarks I looked at, here's what the actual speedup looks like by model size:
- Large models (70B+): 2.5–3× real-world speedup. These are compute-bound; the H100 advantage is nearly fully realized.
- Medium models (7B–13B): 1.8–2.5× speedup. Closer to the theoretical maximum as model sizes increase.
- Small models (under 3B): 1.2–1.8× speedup. At this scale, data loading and I/O start to become bottlenecks, diluting the GPU advantage.
- Inference (any size): Speedup varies more widely, 1.5–2.5×, but memory bandwidth (which the H100 leads by 68%) matters more than raw FLOPS.
The cost math: three training scenarios
This is the part I actually wanted to figure out when I built Ozmarx — because "H100 is faster" is obvious, but "H100 is cheaper per training run" is the more interesting claim. I modeled three common scenarios using April 2026 on-demand pricing (Lambda Labs H100 at $2.49/hr, A100 at $1.89/hr), assuming 8 GPUs per setup.
Scenario 1: Llama-3 7B fine-tune (small workload)
| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 40 hrs | $605 |
| H100 SXM5 | $2.49 | ~1.7× | 24 hrs | $478 |
At this scale, H100 actually wins on total cost ($478 vs $605) despite the higher hourly rate — because the run is short enough that speed dominates the math. The ~1.7× sits just below the medium-model range above, which is consistent with fine-tunes spending proportionally more time in data loading than pre-training does. Savings: 21%.
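The arithmetic behind the table is worth making explicit, because it reveals the general rule: once you write total cost as rate × GPUs × hours, the GPU count and baseline hours cancel out of the comparison, and the H100 wins whenever its real speedup exceeds the hourly price ratio. A sketch using Scenario 1's figures:

```python
def run_cost(price_per_gpu_hr, n_gpus, hours):
    """On-demand cost of one training run."""
    return price_per_gpu_hr * n_gpus * hours

# Scenario 1, using the table's figures (8 GPUs)
a100 = run_cost(1.89, 8, 40)   # $604.80
h100 = run_cost(2.49, 8, 24)   # $478.08 at ~1.7x speedup

# H100 wins on total cost whenever real speedup > price ratio;
# the GPU count and baseline hours cancel out of the comparison.
breakeven_speedup = 2.49 / 1.89   # ~1.32x
print(f"A100 ${a100:.0f} vs H100 ${h100:.0f}; breakeven at {breakeven_speedup:.2f}x")
```

At these rates the breakeven speedup is ~1.32×, which every scenario in this piece clears. The question is only by how much.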
Scenario 2: Llama-3 70B full pre-training run
| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost (8 GPUs) |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 2,400 hrs | $36,288 |
| H100 SXM5 | $2.49 | ~2.5× | 960 hrs | $19,123 |
This is the clearest H100 case. At 70B scale, the 2.5× speedup so completely overwhelms the 32% hourly premium that the H100 ends up 47% cheaper per training run. This is the scenario I keep coming back to when people ask whether the H100 upgrade is worth it — at serious scale, the question almost answers itself.
Scenario 3: Stable Diffusion fine-tune (image model, small)
| GPU | Price/GPU/hr | Speedup | Est. hours | Total cost |
|---|---|---|---|---|
| A100 80GB | $1.89 | Baseline | 12 hrs | $181 |
| H100 SXM5 | $2.49 | ~1.4× | 8.5 hrs | $169 |
For image model fine-tunes, H100 saves only $12 — essentially a wash. The speedup advantage gets diluted by I/O overhead at small model sizes. For a team doing many iterations of this type of run, I'd actually weigh A100 availability and simplicity higher than the marginal cost difference, which is noise at this scale.
The rule of thumb: for transformer models above ~7B parameters on multi-GPU setups, the H100 is almost certainly cheaper per training run despite the higher hourly rate. Below that threshold, it depends on your iteration frequency and how much the engineer's time factors into your cost calculation. Clock time has a price too, and it's often higher than the compute bill.
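The "clock time has a price" point can be folded into the same cost math. A sketch — the $50/hr value of waiting is an assumed, illustrative figure, not a billing number — applied to the Scenario 3 wash:

```python
def effective_cost(gpu_hr_rate, n_gpus, run_hours, wait_hr_value=0.0):
    """Compute cost plus the (often ignored) value of wall-clock time.

    wait_hr_value is an assumed dollar value per hour of waiting on
    the run -- illustrative only; pick a number that fits your team.
    """
    return gpu_hr_rate * n_gpus * run_hours + wait_hr_value * run_hours

# Scenario 3 (image fine-tune), valuing wall-clock time at $50/hr
a100 = effective_cost(1.89, 8, 12,  wait_hr_value=50)   # 181.44 + 600
h100 = effective_cost(2.49, 8, 8.5, wait_hr_value=50)   # 169.32 + 425
print(f"A100 ${a100:.0f} vs H100 ${h100:.0f}")
```

Under that assumption the "$12 wash" becomes a ~$187 gap in the H100's favor: at small scale, the wall-clock term dominates the compute term, so the decision hinges on what you think an hour of waiting is worth.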
The VRAM argument: when it doesn't matter which is faster
Both the H100 SXM5 and A100 SXM4 come in 80GB variants. If your model fits in 80GB per GPU (including optimizer states and activations in mixed precision), VRAM is not a differentiating factor. If it doesn't, you're sharding either way — and then the H100's NVLink bandwidth advantage (+50%) starts to matter for how efficiently that sharding performs.
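Whether a model fits in 80GB is itself quick arithmetic. A rough sketch using the standard mixed-precision Adam accounting (16 bytes per parameter for weights, gradients, and optimizer states; activations come on top and depend on batch size, sequence length, and checkpointing):

```python
def training_state_gb(params_billions, bytes_per_param=16):
    """Rough VRAM floor for mixed-precision Adam training.

    16 bytes/param ~= bf16 weights (2) + bf16 grads (2) + fp32 master
    copy (4) + Adam first/second moments (4 + 4). Activations are
    extra and workload-dependent, so treat this as a lower bound.
    """
    return params_billions * bytes_per_param  # billions x bytes = GB

for size in (7, 13, 70):
    need = training_state_gb(size)
    verdict = "fits" if need <= 80 else "needs sharding"
    print(f"{size}B: ~{need} GB of states -> {verdict} on one 80 GB GPU")
```

By this accounting even a 7B full fine-tune overflows a single 80GB card once optimizer states are included — which is exactly why the sharding-efficiency point (and the NVLink delta) matters more than the identical capacity number.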
One thing worth knowing: the H100 PCIe variant also comes in 80GB, but with a lower power envelope and slower interconnect — NVLink over bridge connectors rather than the SXM5's full 900 GB/s NVLink 4.0. The SXM5 is what most neoclouds are offering for training workloads, so make sure you're comparing the right version when you look at pricing.
When to choose A100 instead
The A100 isn't just the slower, cheaper option — there are scenarios where it's genuinely the right call:
- Inference workloads where the compute bottleneck is less extreme and the 20–30% lower hourly rate compounds over long deployment periods — you're paying by the hour indefinitely, not by the run
- Small model iteration where you're running many short experiments and care more about minimizing per-run cost than wall-clock time
- Budget-constrained teams where the lower hourly rate enables more total compute hours within a fixed monthly budget — sometimes more experiments at slower speed beats fewer experiments at maximum speed
- Existing infrastructure — if you're already on A100 clusters with tooling, containers, and team familiarity baked in, the switching cost is real and worth calculating before you move
Decision framework
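The framework above compresses into a few conditionals. This is a toy encoding of this article's heuristics — the thresholds are the ones argued above, but the function itself is a sketch, not a benchmark-backed tool:

```python
def pick_gpu(model_params_billions, workload="training"):
    """Heuristic from this article's scenarios -- a sketch, not a rule.

    Assumes on-demand pricing in the ~$1.89 (A100) / ~$2.49 (H100)
    range, where the H100 breaks even at ~1.32x real speedup.
    """
    if workload == "inference":
        return "A100: lower hourly rate compounds over long deployments"
    if model_params_billions >= 7:
        return "H100: speedup outruns the ~32% hourly premium"
    return "near-wash: weigh A100 availability against iteration speed"

print(pick_gpu(70))               # large training run
print(pick_gpu(1, "inference"))   # small serving workload
```

Like any heuristic, it breaks at the edges — reserved pricing, spot availability, and existing-cluster switching costs can all move the boundary — but it captures where the per-run math lands at April 2026 rates.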
Current pricing reference (April 2026)
| Provider | GPU | On-Demand | Reserved (1yr) |
|---|---|---|---|
| Lambda Labs | H100 SXM5 80GB | $2.49/hr | $1.99/hr |
| Lambda Labs | A100 SXM4 80GB | $1.89/hr | ~$1.50/hr |
| CoreWeave | H100 SXM5 80GB | $2.99/hr | Negotiated |
| AWS (p5) | H100 SXM5 80GB | $3.90/hr | $2.21/hr (3yr) |
| GCP (a3-highgpu) | H100 SXM5 80GB | $3.00/hr | ~$2.10/hr (1yr) |
Per-GPU rates. Last verified April 2026. See full comparison table →