1. Design Goals
When the polarization signal participates in stereo matching, both the “manner” and the “space” in which it intervenes directly affect stability:
- If polarization only intervenes in the spatial domain ([H,W]) via attention or a multiplicative gate, it cannot distinguish “which disparity” matters — the entire disparity dimension is scaled uniformly.
- A multiplicative gate scales the correlation values up or zeroes them out globally, breaking the inductive bias that RAFT has built around the correlation volume.
The design goal of this architecture is to let polarization intervene without breaking RAFT’s inductive bias: polarization does not interfere via a multiplicative gate, but injects a bias into the correlation volume as an additive residual.
pol_corr → PolCorrResidual → Δcorr
corr_enhanced = corr + α × Δcorr (Additive Bias in Disparity Space)
Since the correlation volume itself lives in disparity space, this residual is an “additive bias in disparity space”, and the UpdateBlock remains the original RAFT, so its inductive bias is preserved.
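The contrast between the two bullets above can be checked on a toy example. The sketch below uses made-up correlation scores for a single pixel over five disparity candidates; the values and the helper names `spatial_gate` and `additive_bias` are illustrative assumptions, not part of the actual model. A positive per-pixel gate rescales every candidate by the same factor and so cannot change which disparity scores highest, whereas a per-disparity additive residual can.

```python
# Toy correlation scores for ONE pixel over 5 disparity candidates (made-up values).
corr = [0.2, 0.9, 0.8, 0.1, 0.3]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def spatial_gate(corr, gate):
    """Per-pixel multiplicative gate: one factor shared by all disparities."""
    return [gate * c for c in corr]


def additive_bias(corr, delta, alpha=1.0):
    """Per-disparity additive residual: corr + alpha * delta."""
    return [c + alpha * d for c, d in zip(corr, delta)]


# Hypothetical polarization residual that favours disparity index 2.
delta = [0.0, -0.1, 0.2, 0.0, 0.0]

gated = spatial_gate(corr, gate=0.5)
biased = additive_bias(corr, delta)

print(argmax(corr))    # best disparity before any correction
print(argmax(gated))   # same index: the gate scales all candidates alike
print(argmax(biased))  # different index: the bias is disparity-specific
```

A gate of exactly 0 would also wipe out all five scores at once, which is the "zeroes them out globally" failure mode described above.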
2. Architecture
Data Flow
- The left and right images pass through `FeatureEncoder` to produce `fmap1`/`fmap2`, which form the standard `CorrBlock`.
- The left and right images are each downsampled to 1/4 and fed into `PolCorrBlock` to compute the polarization difference volume.
- The output of `PolCorrBlock` is fed into `PolCorrResidual` to produce the correlation residual `Δcorr`.
- `corr_enhanced = corr + α × Δcorr` additively injects the residual into the correlation volume.
- The enhanced `corr_enhanced` is fed into the original RAFT UpdateBlock (without any modification).
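The steps above can be traced at the shape level. This is only a sketch under several assumptions: the source does not say how the 1/4 downsampling is done (bilinear interpolation is used here), `downsample_quarter` is a hypothetical helper, the channel counts are illustrative, random tensors stand in for the `CorrBlock` and `PolCorrBlock` lookups, and a single 1×1 convolution stands in for `PolCorrResidual`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def downsample_quarter(img):
    """Hypothetical helper: bilinear downsampling to 1/4 resolution (assumed method)."""
    return F.interpolate(img, scale_factor=0.25, mode="bilinear", align_corners=False)


B, H, W = 2, 64, 96
corr_dim, pol_dim = 36, 36  # illustrative channel counts, not taken from the source

# Polarization image brought to the resolution of the correlation volume.
pol_left = torch.rand(B, 3, H, W)
pol_left_q = downsample_quarter(pol_left)  # (B, 3, H/4, W/4)

# Stand-ins for the lookups from CorrBlock and PolCorrBlock at 1/4 resolution.
corr = torch.randn(B, corr_dim, H // 4, W // 4)
pol_corr = torch.randn(B, pol_dim, H // 4, W // 4)

# Stand-in for PolCorrResidual: any map from pol_dim to corr_dim channels.
residual_net = nn.Conv2d(pol_dim, corr_dim, kernel_size=1)
alpha = 1.0  # assumed constant coefficient

delta_corr = residual_net(pol_corr)
corr_enhanced = corr + alpha * delta_corr  # same shape as corr, so UpdateBlock is untouched

print(pol_left_q.shape, corr_enhanced.shape)
```

Because `corr_enhanced` has exactly the shape of `corr`, the downstream UpdateBlock needs no interface change.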
3. Components and Modules
3.1 PolCorrResidual
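
    import torch
    import torch.nn as nn


    class PolCorrResidual(nn.Module):
        """Maps the polarization volume to a residual on the correlation volume."""

        def __init__(self, pol_dim, corr_dim, hidden_dim=64, init_scale=0.1):
            super().__init__()  # must run before registering submodules and parameters
            self.net = nn.Sequential(
                nn.Conv2d(pol_dim, hidden_dim, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_dim, corr_dim, 1),  # project to corr dimension
            )
            # Learnable scalar controlling the overall strength of the residual
            self.scale = nn.Parameter(torch.tensor(init_scale))
            # Initialize the last layer to 0 → initial Δcorr ≈ 0
            nn.init.zeros_(self.net[-1].weight)
            nn.init.zeros_(self.net[-1].bias)

        def forward(self, pol_corr):
            # pol_corr: (B, pol_dim, H, W) → Δcorr: (B, corr_dim, H, W)
            return self.scale * self.net(pol_corr)
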
Design points:
- `net`: three convolutional layers (3×3 → 3×3 → 1×1); the final 1×1 convolution projects features to `corr_dim` (aligned with the correlation volume channels).
- `scale`: a learnable scalar parameter, initialized as `init_scale=0.1`.
- The last layer’s weights and bias are initialized to 0: at the start of training `Δcorr ≈ 0`, i.e. `corr_enhanced ≈ corr`, so the initial behavior is identical to plain RAFT-Stereo without pol. As training proceeds, the residual is gradually learned, providing a stable training starting point.
- `forward` returns `scale * net(pol_corr)`, i.e. the scaled residual `Δcorr`.
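The zero-initialization claim can be verified directly. The following is a minimal check, assuming illustrative tensor sizes and treating `α` as a constant 1.0; the source does not state whether `α` is the learnable `scale` itself or a separate coefficient. The class is repeated so the snippet runs on its own.

```python
import torch
import torch.nn as nn


class PolCorrResidual(nn.Module):
    def __init__(self, pol_dim, corr_dim, hidden_dim=64, init_scale=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pol_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, corr_dim, 1),
        )
        self.scale = nn.Parameter(torch.tensor(init_scale))
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, pol_corr):
        return self.scale * self.net(pol_corr)


torch.manual_seed(0)
B, pol_dim, corr_dim, H, W = 2, 36, 36, 16, 24  # illustrative sizes (assumption)
module = PolCorrResidual(pol_dim, corr_dim)
corr = torch.randn(B, corr_dim, H, W)
pol_corr = torch.randn(B, pol_dim, H, W)
alpha = 1.0  # assumed constant; see the note above

delta_corr = module(pol_corr)
corr_enhanced = corr + alpha * delta_corr

# With a zero-initialized last layer the residual vanishes at initialization.
unchanged_at_init = torch.equal(corr_enhanced, corr)

# The zeroed last layer still receives gradient, so the residual can start learning.
corr_enhanced.sum().backward()
last_layer_has_grad = bool(module.net[-1].bias.grad.abs().sum() > 0)

print(unchanged_at_init, last_layer_has_grad)
```

One consequence worth noting: on the very first backward pass only the last layer receives a nonzero gradient, because gradients to `scale` and to the earlier layers are multiplied by the zero output or the zero weights of the last layer. They begin to update once the last layer has moved away from zero.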
3.2 Additive Bias in Disparity Space
The key idea of this architecture is the combination of “additive bias” and “disparity space”:
- Additive: `corr + α × Δcorr`, with no multiplicative interference. Addition does not scale the original correlation values up or zero them out; it only shifts them, thereby preserving RAFT’s existing understanding of correlation.
- Disparity space: the residual acts directly on the correlation volume, and each channel/index of the correlation volume corresponds to a disparity candidate. Therefore the pol correction naturally carries disparity semantics, rather than being purely spatial.
4. Tensor Dimensions
| Tensor | Shape / Parameter | Description |
|---|---|---|
| `PolCorrResidual` input `pol_corr` | (B, pol_dim, H, W) | From `PolCorrBlock` |
| Intermediate layer in `net` | hidden_dim=64 | Two 3×3 convolutions |
| `Δcorr` output | (B, corr_dim, H, W) | Projected to corr dimension |
| `scale` | scalar | Learnable parameter, init 0.1 |
| `corr_enhanced` | (B, corr_dim, H, W) | corr + α × Δcorr |
5. Hyperparameters
| Hyperparameter | Value | Description |
|---|---|---|
| `pol_levels` | 4 | Number of pyramid levels in the polarization volume |
| `pol_radius` | 4 | Lookup radius of the polarization volume |
| `iters` | 24 | Number of GRU iterations |
| `hidden_dim` | 64 | Number of channels in the intermediate layer of `PolCorrResidual` |
| `init_scale` | 0.1 | Initial value of the learnable scale parameter |
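The source does not spell out how `pol_dim` is derived from these values. If `PolCorrBlock` follows the RAFT-Stereo lookup layout, where each pyramid level contributes `2 * radius + 1` samples along the disparity axis (an assumption), the channel count of `pol_corr` would follow from `pol_levels` and `pol_radius` as sketched below. `lookup_channels` is a hypothetical helper, not a function from the codebase.

```python
def lookup_channels(levels: int, radius: int) -> int:
    """Channels of a 1D (disparity-axis) pyramid lookup: levels * (2 * radius + 1)."""
    if levels < 1 or radius < 0:
        raise ValueError("levels must be >= 1 and radius >= 0")
    return levels * (2 * radius + 1)


# Values from the hyperparameter table: pol_levels=4, pol_radius=4.
pol_dim = lookup_channels(levels=4, radius=4)
print(pol_dim)  # 36
```

Under that assumption `pol_dim` would be 36, the first argument to pass when constructing `PolCorrResidual`.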
6. Design Decisions and Rationale
| Decision | Rationale |
|---|---|
| Use an additive residual instead of a multiplicative gate | Addition does not break RAFT’s inductive bias, multiplication does |
| Apply the residual to the correlation volume | The correlation volume is disparity space; the correction carries disparity semantics |
| Initialize the last layer of `PolCorrResidual` to 0 | `Δcorr ≈ 0` at the start of training, behavior close to plain RAFT-Stereo, learning starts from a stable point |
| `scale` set as a learnable parameter (init 0.1) | Lets the model decide the overall strength of the pol residual |
| UpdateBlock fully reuses the original RAFT | Downstream remains unchanged, maximally preserving pretrained capability |
| Pol images downsampled to 1/4 | Aligned with the correlation volume resolution |
7. Highlights
- Polarization is injected into the correlation volume as an additive residual, only shifting and never globally scaling correlation values, fully preserving RAFT’s existing understanding of correlation.
- The residual acts directly on the correlation volume rather than the spatial domain, so the polarization correction naturally carries disparity semantics and can distinguish “which disparity”.
- The last layer of `PolCorrResidual` is initialized to 0, so `Δcorr ≈ 0` at the start of training and the model learns the polarization residual from a stable starting point equivalent to plain RAFT-Stereo.
- A single learnable scalar `scale` controls the overall strength of the polarization residual, letting the model decide how much to trust pol.
- UpdateBlock fully reuses the original RAFT with no modifications, maximally preserving pretrained capability.