Essay

Your RAFT-Stereo Gives Different Numbers on H200 vs RTX 4090. It's Not the GPU.

A negative-result detective story: I observed a 33% accuracy gap between H200 and RTX 4090 with TF32 enabled, blamed Hopper's TF32 spec, and was wrong. Publishing the falsified hypothesis.

── A Detective Story Where I Eliminated Every Hypothesis I Had

TL;DR: I observed a 33% accuracy gap between H200 and RTX 4090 on stereo matching with TF32 enabled, and initially blamed Hopper’s TF32 spec. After ~30 experiments, that explanation is wrong: modern and legacy PyTorch / cuDNN stacks both produce TF32 error levels that agree to within a percent or two across Ada / Hopper / Blackwell. I then attempted to connect the observation to a statistical-noise mechanism from my prior science fair project, but a separate prior project of mine had already empirically shown that mechanism doesn’t apply to RAFT-Stereo’s architecture. I’m publishing this because the negative results are valuable: I have a confidently-falsified original explanation, and I do not have a confident replacement. The practical advice (disable TF32 in final evaluation) remains correct.


1. The story begins: a 33% EPE difference

PIDS (Physics-Informed Deep Stereo) is a system for detecting transparent obstacles using polarized light, with RAFT-Stereo as its backbone, evaluated on 857 transparent-object test images.

Initial observation (2026-01-30, Exp #42):

GPU                     Glass EPE median (TF32 ON)   Glass EPE median (TF32 OFF)   Δ
RTX 4090 (Ada)          5.264 px                     5.251 px                      0.25%
H200 (Hopper)           3.504 px                     5.250 px                      33%
RTX 5090 (Blackwell)    3.498 px                     5.250 px                      33%

FP32 (TF32 OFF) is fully consistent across architectures (5.250 ± 0.001), but TF32 ON drops Hopper / Blackwell numbers by 33% — while on Ada the TF32 toggle barely moves the needle.
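
The Δ column can be reproduced from the two EPE columns. The short sketch below assumes Δ is the absolute change of the TF32 ON median measured against the FP32 (TF32 OFF) median on the same GPU; the post does not spell out the formula, so that choice and the helper name `tf32_delta_percent` are assumptions.

```python
# Glass EPE medians in pixels, copied from the table above: (TF32 ON, TF32 OFF).
epe = {
    "RTX 4090 (Ada)": (5.264, 5.251),
    "H200 (Hopper)": (3.504, 5.250),
    "RTX 5090 (Blackwell)": (3.498, 5.250),
}


def tf32_delta_percent(tf32_on: float, tf32_off: float) -> float:
    """Relative change of the TF32 ON result against the FP32 baseline (assumed definition)."""
    return abs(tf32_on - tf32_off) / tf32_off * 100.0


deltas = {gpu: tf32_delta_percent(on, off) for gpu, (on, off) in epe.items()}
for gpu, delta in deltas.items():
    print(f"{gpu:22s} {delta:6.2f}%")
```

Under that definition the Ada row comes out at roughly a quarter of a percent and the Hopper / Blackwell rows at roughly a third, matching the table.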

First instinct: “Hopper / Blackwell’s TF32 spec must be different from Ada’s.”

The paper went out with this explanation, and we forced TF32 off in all evaluation code:

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

But is this explanation correct? I spent a week and ~$3 of cloud GPU time running ~30 experiments to verify. It isn’t. And the replacement story I was tempted to tell — also wrong.


2. Hypothesis 1: Hopper TF32 spec differs → falsified

I started with NVIDIA’s four architecture whitepapers (Ampere / Ada / Hopper / Blackwell). The Hopper whitepaper’s description of the 4th-generation Tensor Core does mention “30% lower operand delivery power”, which implies an internal datapath redesign. The Ampere whitepaper explicitly says TF32 produces “standard IEEE FP32 output”, but the Hopper / Blackwell whitepapers drop that wording.

In theory this omission could indicate that Hopper / Blackwell use a different accumulator strategy. But the experiments don’t bear that out.

Pure cuBLAS matmul scan on RTX 4090 (Ada) and RTX 5090 (Blackwell):

size       4090 rel_L2     5090 rel_L2     Δ
256        2.960e-04       2.960e-04       0%
512        2.935e-04       2.927e-04       0%
1024       2.941e-04       2.938e-04       0%
2048       2.936e-04       2.937e-04       0%
4096       2.940e-04       2.938e-04       0%
8192       2.948e-04       2.944e-04       0%

The two cards agree to within 0.3% at every size. The cuDNN conv2d scan also matches, and torch.einsum gives the same result as the equivalent torch.bmm. Modern PyTorch 2.11 + CUDA 12.8 + cuDNN 9 produces consistent TF32 numerics across Ada and Blackwell at the kernel level.
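
For intuition about where a rel_L2 near 3e-04 comes from, here is a CPU-only sketch that emulates TF32 operand rounding in pure Python and measures the same kind of relative L2 error on a small random matmul. It is not the scan used above. The helper names are hypothetical, and the rounding mode (round-to-nearest on the 13 dropped FP32 mantissa bits) is an assumption; real kernels may convert operands differently.

```python
import math
import random
import struct


def round_to_tf32(x: float) -> float:
    """Round to TF32 precision (10 explicit mantissa bits); rounding mode is assumed."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x1000) & 0xFFFFE000  # round on bit 12, then clear the low 13 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def matmul(a, b):
    """Plain matrix product accumulated in double precision."""
    cols = list(zip(*b))
    return [[math.fsum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rel_l2(out, ref):
    """||out - ref||_2 / ||ref||_2 over all matrix entries."""
    num = math.sqrt(sum((o - r) ** 2 for ro, rr in zip(out, ref) for o, r in zip(ro, rr)))
    den = math.sqrt(sum(r ** 2 for rr in ref for r in rr))
    return num / den


rng = random.Random(0)
n = 48
a = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

ref = matmul(a, b)  # full-precision reference
out = matmul([[round_to_tf32(v) for v in row] for row in a],
             [[round_to_tf32(v) for v in row] for row in b])

error = rel_l2(out, ref)
print(f"emulated TF32 rel_L2 at n={n}: {error:.3e}")
```

If operand rounding dominates the error, the result should sit in the same 1e-04 range regardless of which GPU generation does the arithmetic, which is what the scan shows.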

3. Hypothesis 2: cuDNN version changed → falsified

PIDS’s original observation was 2026-01-30, possibly using an older stack. I installed PyTorch 2.2.0 + CUDA 12.1 + cuDNN 8.9 (close to PIDS’s original environment) on the 4090 and re-ran the cuDNN conv2d scan:

HxW      cuDNN 9 (4090)   cuDNN 8 (4090)
32×32    1.465e-03        1.470e-03
64×64    1.403e-03        1.404e-03
128×128  1.502e-03        1.497e-03
256×256  1.492e-03        1.492e-03

Then I rented an H100 NVL and ran the same cuDNN 8 stack:

HxW      4090 cuDNN 8     H100 cuDNN 8
32×32    1.470e-03        1.453e-03
256×256  1.492e-03        1.473e-03

Hopper and Ada are also consistent in the cuDNN 8 era: the two columns agree to within about 1.3%. This one cell ($1 of H100 spot for 5 minutes) directly falsifies the hypothesis that PIDS’s 33% comes from a stack-level numerical inconsistency.

At this point I knew: the problem is not in the hardware. Not in the driver. Not in the cuDNN version.


4. Hypothesis 3: trained model + random input → falsified

I downloaded PIDS’s checkpoints_baseline_exp22/baseline_best.pth from HuggingFace, fed it a fixed-seed random stereo pair, ran 32 GRU iterations, and compared TF32 ON/OFF final disparity:

median_rel = 0.010 %    ← PIDS observed 33%
rel_L2 = 7.27e-04
abs_max = 3.30 px       ← some pixels do differ by 3 px, but very sparse

0.010% — three orders of magnitude smaller than 33%.
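
The three numbers above measure different things, which is how a 3 px outlier can coexist with a 0.010% median. The sketch below uses assumed definitions (the post does not give formulas): median_rel is the median per-pixel relative difference, rel_L2 is the ratio of L2 norms, and abs_max is the largest per-pixel difference. The function name and toy data are hypothetical.

```python
import math
import statistics


def compare_disparities(d_on, d_off, eps=1e-8):
    """Summarise the gap between two flattened disparity maps (pixels), TF32 ON vs OFF."""
    diffs = [abs(a - b) for a, b in zip(d_on, d_off)]
    median_rel = statistics.median(d / (abs(b) + eps) for d, b in zip(diffs, d_off))
    rel_l2 = math.sqrt(sum(d * d for d in diffs)) / math.sqrt(sum(b * b for b in d_off))
    return {
        "median_rel_percent": 100.0 * median_rel,
        "rel_l2": rel_l2,
        "abs_max": max(diffs),
    }


# Toy data: most pixels barely move, one pixel jumps by 3 px.
d_off = [10.0, 20.0, 30.0, 40.0, 50.0]
d_on = [10.001, 20.002, 30.003, 43.0, 50.005]

stats = compare_disparities(d_on, d_off)
print(stats)
```

The median ignores the single outlier, while abs_max reports only the outlier and rel_L2 is dominated by it, so sparse large flips barely register in a median-based metric.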

5. Hypothesis 4: real images → falsified

I downloaded Middlebury 2014 Piano, Playtable, and Pipes scenes (containing glass cups, plastic pipes) and ran the same PIDS checkpoint:

Scene                                                   Glass-region Δ   Selectivity (glass/bg)
Piano                                                   0.061%           4.30×
Playtable                                               0.078%           0.90×
Pipes                                                   0.013%           5.10×
Piano (with official RAFT-SceneFlow weights instead)    0.009%           5.46×

The maximum is 0.078% — still ~400× short of 33%.

One observation that held up across most clean-input tests: the selectivity ratio (glass-like / background) sits in the 4–6× range, with Playtable (0.90×) the exception in the table above. I’ll come back to this in Section 7 because it’s the only positive signal I have, and it’s also the place where I almost convinced myself of a wrong explanation.
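
As a concrete reading of the selectivity column, the sketch below divides the mean TF32 ON/OFF relative difference inside the glass mask by the mean in the background. Whether the original measurement used means or medians is not stated, so the use of means, the function name, and the toy values are all assumptions.

```python
def selectivity(rel_delta, glass_mask):
    """Mean relative delta inside the mask divided by the mean outside it (assumed definition)."""
    glass = [d for d, is_glass in zip(rel_delta, glass_mask) if is_glass]
    background = [d for d, is_glass in zip(rel_delta, glass_mask) if not is_glass]
    return (sum(glass) / len(glass)) / (sum(background) / len(background))


# Toy per-pixel relative deltas: the two masked pixels move five times as much.
rel_delta = [0.0005, 0.0005, 0.0001, 0.0001, 0.0001, 0.0001]
glass_mask = [True, True, False, False, False, False]

ratio = selectivity(rel_delta, glass_mask)
print(f"selectivity = {ratio:.2f}x")
```

A ratio near 1 means the TF32 perturbation is spread evenly; a ratio well above 1 means it concentrates in the masked region even when both deltas are tiny in absolute terms.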

6. Hypothesis 5: iid Gaussian noise injection → falsified

I designed a noise-injection experiment: add Gaussian noise inside a glass-mask region with variance 4×–16× the background, and see if it triggers a PIDS-scale delta:

bg_σ   glass_σ   R_input    glass Δ     selectivity
1.0    2.0       4×         0.0078%     0.89×
3.0    6.0       4×         0.0103%     1.67×
5.0    10.0      4×         0.0009%     0.09×
10.0   20.0      4×         0.0034%     0.83×
20.0   40.0      4×         0.0062%     1.46×
5.0    15.0      9×         0.0051%     1.14×
5.0    20.0      16×        0.0044%     0.68×
10.0   30.0      9×         0.0037%     2.90×

iid Gaussian noise has no effect — and notably removes the natural ~5× selectivity I see on clean inputs. This is consistent with conv encoders smoothing out high-frequency iid noise, but it’s also a clue that whatever causes the 5× selectivity isn’t a noise-magnitude phenomenon.
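
The injection step can be sketched as follows. This is a stand-in for the real experiment, with a hypothetical flat grey image and a mask covering the left half; it only shows how a (bg_σ, glass_σ) pair maps to the R_input variance ratio in the table, since variance scales with σ squared.

```python
import random
import statistics


def inject_noise(image, glass_mask, bg_sigma, glass_sigma, seed=0):
    """Add iid Gaussian noise, using the larger sigma inside the glass mask."""
    rng = random.Random(seed)
    return [
        [px + rng.gauss(0.0, glass_sigma if is_glass else bg_sigma)
         for px, is_glass in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, glass_mask)
    ]


# Hypothetical 64x64 flat image; the "glass" mask is the left half.
size = 64
image = [[128.0] * size for _ in range(size)]
glass_mask = [[col < size // 2 for col in range(size)] for _ in range(size)]

noisy = inject_noise(image, glass_mask, bg_sigma=5.0, glass_sigma=10.0)

# Recover the injected noise per region and measure the variance ratio.
glass_noise, bg_noise = [], []
for r in range(size):
    for c in range(size):
        target = glass_noise if glass_mask[r][c] else bg_noise
        target.append(noisy[r][c] - image[r][c])

r_input = statistics.pvariance(glass_noise) / statistics.pvariance(bg_noise)
print(f"measured R_input = {r_input:.2f}, target = {(10.0 / 5.0) ** 2:.0f}")
```

With σ doubled inside the mask, the measured variance ratio lands near the 4× target, up to sampling noise.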


7. The hypothesis I was about to publish, and why it’s wrong

At this point I had a tempting story.

Earlier this year I’d done a science fair project on Monte Carlo rendering, which independently quantified that Mitsuba renders glass regions with about 4× higher MC noise variance than non-glass (R_image = 4.18 ± 0.51, later refined to ~5.4× with more scenes). The same project showed this variance is amplified to about 8× at the matching cost level (R_cost = 7.85). And the selectivity ratio I kept seeing across all my TF32 experiments — 4–6×, identical scenes or not — looked like a numerical match to that R_image.

So the natural story was: PIDS was trained on Mitsuba data, the trained model learned to navigate the noisy cost landscape in glass regions, and TF32 mantissa truncation tips the GRU into different basins exactly where the cost surface is roughest.

But that story is inconsistent with another project of mine.

In February 2026 I ran a project called StaMask specifically to test whether RAFT-Stereo uses the statistical mask as a learned signal. The flattening experiment — equalizing R_image to ~1.0 in the input — should have increased Glass MAE if the model relied on the mask as a shortcut. The actual result, across four RAFT-Stereo checkpoints (sceneflow, middlebury, eth3d, realtime), was a +6.9% average increase in Glass MAE from flattening — but in the opposite direction from the shortcut hypothesis (MC noise was actually helping matching slightly, or the experiment was confounded; either way the predicted shortcut effect was absent).

The architectural reason was clear:

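# RAFT-Stereo's feature network uses BatchNorm (schematic layer order)
# Conv2d → BatchNorm2d → ReLU → Conv2d → BatchNorm2d → ...

# And the correlation block does L1 normalization before correlation
# fmap1, fmap2: (B, C, H, W) feature maps of the left / right image (shapes assumed)
fmap1 = fmap1 / (fmap1.abs().sum(dim=1, keepdim=True) + 1e-8)
fmap2 = fmap2 / (fmap2.abs().sum(dim=1, keepdim=True) + 1e-8)
# All-pairs correlation along each image row: (B, H, W_left, W_right)
corr = torch.einsum("bchw,bchv->bhwv", fmap1, fmap2)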

BatchNorm normalizes per-channel variance to 1. L1-normalized correlation discards absolute intensity. Together, these wash out the variance ratio before it reaches the matching layer. RAFT-Stereo isn’t choosing not to use the mask — its architecture eliminates the signal.
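
One piece of this argument is easy to check in isolation: L1 normalization across channels makes a feature vector invariant to its overall scale. The sketch below shows only that property on a hypothetical 4-channel feature. It does not model BatchNorm, and it is not evidence about what the full network does with region-level variance.

```python
def l1_normalize(feature, eps=1e-8):
    """Divide a per-pixel feature vector by the sum of its absolute values."""
    total = sum(abs(v) for v in feature) + eps
    return [v / total for v in feature]


feature = [0.5, -1.5, 2.0, 1.0]      # hypothetical per-pixel feature
scaled = [4.0 * v for v in feature]  # same direction, 4x the magnitude

normalized = l1_normalize(feature)
normalized_scaled = l1_normalize(scaled)

print(normalized)
print(normalized_scaled)
```

Both calls return the same vector up to the tiny eps term, so any information carried purely by feature magnitude is gone before the correlation is computed.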

So the “statistical mask drives TF32 sensitivity” story I was about to write is incompatible with my own prior empirical and architectural work. The numerical match between my 4–6× selectivity and the 4.18× R_image is plausibly coincidental — both numbers describe “glass regions are harder”, but through different mechanisms.

What likely actually generates the 4–6× selectivity I keep seeing is something more ordinary: low-texture / low-confidence regions have flatter cost surfaces with multiple shallow minima nearby; tiny numerical perturbations flip the argmin more often than in confidently-matched regions. This is a classical stereo property, not specific to synthetic data.
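
A toy version of that argument, using made-up cost curves rather than real RAFT-Stereo correlation volumes: perturb each cost by a tiny relative amount and count how often the argmin moves. The perturbation size of 2^-11 is an assumed stand-in for TF32-scale rounding, and all names are hypothetical.

```python
import random


def argmin(values):
    """Index of the smallest value (first one on ties)."""
    return min(range(len(values)), key=values.__getitem__)


def flip_rate(cost, rel_eps, trials=2000, seed=0):
    """Fraction of trials in which a tiny relative perturbation moves the argmin."""
    rng = random.Random(seed)
    baseline = argmin(cost)
    flips = 0
    for _ in range(trials):
        perturbed = [c * (1.0 + rng.uniform(-rel_eps, rel_eps)) for c in cost]
        flips += argmin(perturbed) != baseline
    return flips / trials


disparities = range(32)
# Confident match: a single deep minimum at d = 10.
sharp_cost = [1.0 + (d - 10) ** 2 for d in disparities]
# Ambiguous match: two shallow minima (d = 5 and d = 20) that differ by only 1e-5.
flat_cost = [1.0 + 1e-3 * min((d - 5) ** 2, (d - 20) ** 2 + 0.01) for d in disparities]

rel_eps = 2.0 ** -11  # assumed TF32-scale relative perturbation
sharp_rate = flip_rate(sharp_cost, rel_eps)
flat_rate = flip_rate(flat_cost, rel_eps)
print(f"sharp surface flip rate: {sharp_rate:.3f}")
print(f"flat surface flip rate:  {flat_rate:.3f}")
```

The deep minimum never flips at this perturbation size, while the two near-equal minima trade places often. That is the ordinary mechanism proposed here for region-dependent sensitivity, and it needs no assumption about the training data.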


8. So what is causing the 33%?

I don’t know.

My experiments solidly rule out:

  • A spec-level Hopper/Blackwell vs Ada TF32 difference (kernel scans identical)
  • A cuDNN 8 → 9 regression on Ada or Hopper (matched cells identical)
  • Trained-checkpoint-only sensitivity (random input gives 0.01%)
  • Real-image-only sensitivity (Middlebury caps at 0.08%)
  • Bulk-noise-magnitude effects (iid injection inert)
  • “Model learns Mitsuba’s statistical mask and that drives TF32 chaos” (StaMask flattening + BN/L1 norm analysis rule this out architecturally and empirically)

I cannot rule out (and don’t currently have a way to test):

  • A specific PyTorch / cuBLAS / cuDNN version active when the original observation was made, that has since been silently fixed for Hopper — the 4090 vs H100 cells I ran were on pytorch/pytorch:2.2.0-cuda12.1-cudnn8 but PIDS was originally run with whatever the cloud pod’s default image provided at the time, and I cannot now reconstruct that image
  • An interaction between specific images in the deleted PIDS test set and the model that triggers the divergence — I have no way to retrieve the dataset
  • Some configuration of the original eval (batch size, image preprocessing, iter count) different from my reproduction attempts

The honest answer is: the original 33% observation is real, but the mechanism I confidently described in the paper (“Hopper TF32 is different”) is wrong, and I don’t have a confident replacement.


9. What this does and doesn’t change

What stays true:

  • The empirical 33% delta on the original PIDS evaluation
  • The practical action: disabling TF32 in evaluation eliminated the cross-platform inconsistency. That’s an actionable workaround regardless of mechanism.
  • The recommendation in the PIDS paper to set allow_tf32 = False in eval is still defensible — it gives reproducible FP32 numbers across platforms.

What I’d change:

  • The paper attributes the 33% delta to “Hopper/Blackwell’s TF32 having a different accumulator strategy than Ada”. Based on this investigation, that attribution is unsupported by my own kernel-level scans. The mechanism should have been described as “cause currently unknown; TF32 disabled as precaution”.
  • If I were re-writing the paper, I would put the architectural Hopper speculation in an appendix as a prior hypothesis, not as the main explanation.

10. Implications and open questions

I want to be careful here: this is one researcher’s investigation, with the original test data deleted, against a backdrop where my prior project (StaMask) explicitly disproved the most natural alternative explanation. None of what follows is a community recommendation; it’s a list of questions that someone better-resourced could pick up.

  • Reproducing the original observation: anyone with PIDS-style Mitsuba rendered transparent test data and access to both Ada and Hopper / Blackwell hardware could check whether the 33% delta is a real, persistent cross-architecture phenomenon today, or a transient stack-version artifact that has been fixed.
  • TF32 effect on iterative refinement in general: the recurring 4–6× selectivity in low-texture regions across most of my clean synthetic and real-image tests is a real (if small in absolute terms) phenomenon, consistent with low-confidence matching being more perturbation-sensitive. Whether this becomes a 33%-scale effect under specific data distributions is an open question.
  • For practitioners: until the mechanism is understood, the conservative position is to disable TF32 in final evaluation of any iterative-refinement stereo / flow / NeRF-style model. The cost is a few percent throughput; the upside is FP32 numerics that match across hardware. This advice is worth following independently of any of the mechanism stories above.

11. Why I’m publishing this

I sat on this for a few days deciding whether to write it up. My first draft of this post had a confident concluding mechanism — “synthetic training data has a statistical mask, the trained model learns it, TF32 perturbs the chaotic GRU dynamics”. It was a clean story. It was also wrong; my own prior StaMask project had already empirically refuted exactly that story for RAFT-Stereo, and I had to delete the section.

I’m publishing the negative-result version because:

  1. The original PIDS paper attributes the 33% delta to a specific architectural mechanism that this investigation does not support. That should be on the record.
  2. The temptation to overclaim — to wrap an unexplained empirical observation in a plausible-sounding mechanism — is real. I felt it here. Writing the false hypothesis down explicitly (Section 7) is the only honest way to mark it as ruled out.
  3. The practical takeaway (disable TF32 in eval) is independent of mechanism and remains valid.

If you reproduce or refute any of this, I’d be glad to hear about it.


Appendix A: complete experiment list

The investigation involved ~30 experiments, ~$3 of cloud GPU compute (RTX 4090 + H100 NVL):

Experiment                                  Purpose                                             Conclusion
cuBLAS / einsum / cuDNN pure kernel scan    Cross-arch TF32 numerics on 4090 vs 5090 vs H100    Identical
PT 2.2 + cuDNN 8 vs PT 2.11 + cuDNN 9       Cross-stack version differences                     Identical
Trained checkpoint + random input           PIDS baseline + fixed-seed random tensors           0.010% delta
Synthetic + structured glass mask           Three ambiguity patterns                            0.017% delta, ~5× selectivity
Middlebury real scenes                      Three scenes × two checkpoints                      0.061% delta, ~5× selectivity
iid Gaussian noise injection scan           8 (bg σ, glass multiplier) combinations             No effect

Appendix B: references

  1. PIDS Development Log, Exp #42 — original 33% observation
  2. StaMask (my prior project, Feb 2026) — empirical and architectural demonstration that RAFT-Stereo’s BatchNorm + L1-normalized correlation wash out variance differences before they reach matching, and that flattening the statistical mask in input does not hurt RAFT-Stereo in the way the shortcut hypothesis predicts
  3. Statistical mask in Monte Carlo rendering (my prior science fair project) — measured R_image = 4.18× and R_cost = 7.85× between glass and non-glass regions in Mitsuba renders. Notes: this paper does not claim the mask drives any specific downstream model behavior; that downstream link is what StaMask later tested and found absent for RAFT-Stereo
  4. NVIDIA Ampere / Ada / Hopper / Blackwell architecture whitepapers
  5. Lipson, Teed, Deng. RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching. 3DV 2021
