Pol-Conditioned Correlation Residual Architecture

1. Design Goals

When the polarization signal participates in stereo matching, both the “manner” and the “space” in which it intervenes directly affect stability:

If polarization intervenes only in the spatial domain ([H,W]) via attention or a multiplicative gate, it cannot distinguish “which disparity” matters: every disparity candidate at a pixel is scaled by the same factor.
A multiplicative gate scales the correlation values up or zeroes them out globally, breaking the inductive bias that RAFT has built around the correlation volume.
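
The two points above can be illustrated on a toy volume. This is a minimal sketch, not the project's code: the tensor values, the `gate` and `bias` names, and the (B, D, H, W) layout are assumptions chosen for illustration. A positive per-pixel gate rescales all disparity candidates of a pixel together, so the best candidate never changes, whereas an additive bias indexed by disparity can promote one specific candidate.

```python
import torch

# Toy correlation volume, shape (B=1, D=3, H=1, W=2):
# 3 disparity candidates for each of 2 pixels (values are made up).
corr = torch.tensor([[[[1.0, 4.0]],
                      [[3.0, 2.0]],
                      [[2.0, 1.0]]]])

# Spatial-only multiplicative gate, shape (B, 1, H, W): one positive factor per pixel.
gate = torch.tensor([[[[0.5, 2.0]]]])
gated = corr * gate  # broadcasts the same factor over the whole disparity dimension

# Additive bias in disparity space, shape (B, D, H, W):
# raise only disparity candidate 2 at pixel 0.
bias = torch.zeros_like(corr)
bias[0, 2, 0, 0] = 2.5
biased = corr + bias

best_plain = corr.argmax(dim=1)     # best candidate per pixel
best_gated = gated.argmax(dim=1)    # unchanged by the positive gate
best_biased = biased.argmax(dim=1)  # pixel 0 now prefers candidate 2

print(best_plain.tolist(), best_gated.tolist(), best_biased.tolist())
```

The gate changes the magnitude of the correlation values at each pixel but carries no information about which disparity to prefer; the additive bias does.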

The design goal of this architecture is to let polarization intervene without breaking RAFT’s inductive bias: polarization does not interfere via a multiplicative gate, but injects a bias into the correlation volume as an additive residual.

pol_corr → PolCorrResidual → Δcorr
corr_enhanced = corr + α × Δcorr  (Additive Bias in Disparity Space)

Since the correlation volume itself lives in disparity space, this residual is an “additive bias in disparity space”, and the UpdateBlock remains the original RAFT, so its inductive bias is preserved.

2. Architecture

[Figure: Overall architecture of Pol-Conditioned Corr Residual]

Data Flow

The left and right images pass through FeatureEncoder to produce fmap1 / fmap2, from which the standard CorrBlock is built.
The left and right images are each downsampled to 1/4 resolution and fed into PolCorrBlock, which computes the polarization difference volume.
The output of PolCorrBlock is fed into PolCorrResidual to produce the correlation residual Δcorr.
The residual is injected additively into the correlation volume: corr_enhanced = corr + α × Δcorr.
corr_enhanced is fed into the original RAFT UpdateBlock, which is left unmodified.
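
The data flow above can be sketched at the level of tensor shapes. This is a hedged illustration only: FeatureEncoder, CorrBlock, PolCorrBlock, and the UpdateBlock are replaced by random stand-in tensors, and the input size, the 3-channel polarization inputs, the bilinear downsampling, and the channel count of 36 are all assumptions that do not come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, H, W = 2, 64, 96       # toy batch and image size (assumption)
pol_dim = corr_dim = 36   # assumed channel counts, not given in the text

# Stand-ins for the left/right inputs of the polarization branch.
pol_left = torch.randn(B, 3, H, W)
pol_right = torch.randn(B, 3, H, W)

# Downsample to 1/4 resolution (bilinear is an assumption).
pol_left_q = F.interpolate(pol_left, scale_factor=0.25, mode="bilinear", align_corners=False)
pol_right_q = F.interpolate(pol_right, scale_factor=0.25, mode="bilinear", align_corners=False)

# Stand-ins for what CorrBlock and PolCorrBlock would return at 1/4 resolution.
corr = torch.randn(B, corr_dim, H // 4, W // 4)
pol_corr = torch.randn(B, pol_dim, H // 4, W // 4)

# Compact residual head with the same layer layout as PolCorrResidual.
residual_head = nn.Sequential(
    nn.Conv2d(pol_dim, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, corr_dim, 1),
)
nn.init.zeros_(residual_head[-1].weight)
nn.init.zeros_(residual_head[-1].bias)
scale = torch.tensor(0.1)

# Additive injection; the result would go to the unmodified UpdateBlock.
delta_corr = scale * residual_head(pol_corr)
corr_enhanced = corr + delta_corr

print(pol_left_q.shape, corr_enhanced.shape)
```

Because the residual has the same shape as the correlation features, the downstream block sees an input of unchanged shape and needs no interface change.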

3. Components and Modules

3.1 PolCorrResidual

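import torch
import torch.nn as nn


class PolCorrResidual(nn.Module):
    """Maps the polarization volume to an additive correlation residual."""

    def __init__(self, pol_dim, corr_dim, hidden_dim=64, init_scale=0.1):
        super().__init__()  # must run before submodules/parameters are registered
        self.net = nn.Sequential(
            nn.Conv2d(pol_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, corr_dim, 1),  # project to corr dimension
        )
        # Learnable scalar controlling the overall strength of the residual
        self.scale = nn.Parameter(torch.tensor(init_scale))
        # Initialize the last layer to 0 → initial Δcorr = 0
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, pol_corr):
        # (B, pol_dim, H, W) → (B, corr_dim, H, W)
        return self.scale * self.net(pol_corr)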

Design points:

net: three convolutional layers (3×3 → 3×3 → 1×1); the final 1×1 convolution projects features to corr_dim (aligned with the correlation volume channels).
scale: a learnable scalar parameter, initialized as init_scale=0.1.
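The last layer’s weights and bias are initialized to 0: at the start of training net outputs zero, so Δcorr = 0 and corr_enhanced = corr, and the initial behavior matches plain RAFT-Stereo without pol. Training therefore starts from a stable point, and the residual is learned gradually from there.
forward returns scale * net(pol_corr), i.e. the scaled residual. Because scale is the only strength parameter in this design, it plays the role of α in corr + α × Δcorr: the value returned by forward is already the scaled term and is added to corr as is.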
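
The effect of the zero initialization can be checked directly. The sketch below repeats the module so that it runs on its own; the tensor sizes and the channel count of 36 are assumptions. It also shows a consequence worth keeping in mind: at the very first step only the last layer receives a non-zero gradient, because the gradient to the earlier layers passes through the zeroed weights and the gradient to scale is the (zero) sum of the net output.

```python
import torch
import torch.nn as nn


class PolCorrResidual(nn.Module):
    def __init__(self, pol_dim, corr_dim, hidden_dim=64, init_scale=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pol_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, corr_dim, 1),
        )
        self.scale = nn.Parameter(torch.tensor(init_scale))
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, pol_corr):
        return self.scale * self.net(pol_corr)


torch.manual_seed(0)
module = PolCorrResidual(pol_dim=36, corr_dim=36)  # 36 is an assumed channel count
pol_corr = torch.randn(2, 36, 16, 24)
corr = torch.randn(2, 36, 16, 24)

# At initialization the residual is exactly zero, so corr passes through unchanged.
delta_corr = module(pol_corr)
corr_enhanced = corr + delta_corr
identical_at_init = torch.equal(corr_enhanced, corr)

# Inspect which parameters receive gradient at step 0.
delta_corr.sum().backward()
last_bias_grad = module.net[-1].bias.grad     # non-zero: the last layer can start learning
first_weight_grad = module.net[0].weight.grad  # zero: blocked by the zeroed last layer
scale_grad = module.scale.grad                 # zero: equals the sum of the net output

print(identical_at_init, last_bias_grad.abs().min().item(),
      first_weight_grad.abs().max().item(), scale_grad.item())
```

Once the last layer has moved away from zero, gradients reach the earlier convolutions and scale as well, so the whole residual branch becomes trainable after the first update.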

3.2 Additive Bias in Disparity Space

The key idea of this architecture is the combination of “additive bias” and “disparity space”:

Additive: corr + α × Δcorr, with no multiplicative interference. Addition does not scale the original correlation values up or zero them out; it only shifts them, thereby preserving RAFT’s existing understanding of correlation.
Disparity space: the residual acts directly on the correlation volume, and each channel/index of the correlation volume corresponds to a disparity candidate. Therefore the pol correction naturally carries disparity semantics, rather than being purely spatial.
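
One way to see why addition leaves the original correlation signal intact is to look at gradients. The following sketch uses made-up tensors (`delta` and `gate` are illustrative names, not part of the architecture): with an additive residual the gradient with respect to corr is exactly 1 everywhere, while a multiplicative gate rescales that gradient by the gate value and can suppress it.

```python
import torch

torch.manual_seed(0)
corr = torch.randn(1, 4, 2, 3, requires_grad=True)  # (B, D, H, W), toy sizes
delta = torch.randn(1, 4, 2, 3)                      # additive residual in disparity space
gate = torch.rand(1, 1, 2, 3)                        # spatial-only multiplicative gate in [0, 1)

# Additive path: d(corr + delta) / d(corr) = 1 for every entry.
(corr + delta).sum().backward()
grad_additive = corr.grad.clone()
corr.grad = None

# Multiplicative path: d(corr * gate) / d(corr) = gate, broadcast over disparity.
(corr * gate).sum().backward()
grad_gated = corr.grad.clone()

print(grad_additive.unique().tolist())
print(torch.allclose(grad_gated, gate.expand_as(corr)))
```

The additive path therefore never attenuates what the rest of the network learns from corr, regardless of what the residual branch outputs.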

4. Tensor Dimensions

| Tensor | Shape / Parameter | Description |
| --- | --- | --- |
| `PolCorrResidual` input `pol_corr` | `(B, pol_dim, H, W)` | From PolCorrBlock |
| Intermediate layer in `net` | `hidden_dim=64` | Two 3×3 convolutions |
| `Δcorr` output | `(B, corr_dim, H, W)` | Projected to corr dimension |
| `scale` | scalar | Learnable parameter, init 0.1 |
| `corr_enhanced` | `(B, corr_dim, H, W)` | `corr + α × Δcorr` |

5. Hyperparameters

| Hyperparameter | Value | Description |
| --- | --- | --- |
| `pol_levels` | 4 | Number of pyramid levels in the polarization volume |
| `pol_radius` | 4 | Lookup radius of the polarization volume |
| `iters` | 24 | Number of GRU iterations |
| `hidden_dim` | 64 | Number of channels in the intermediate layer of `PolCorrResidual` |
| `init_scale` | 0.1 | Initial value of the learnable `scale` parameter |
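
These values could be collected in a small configuration object. This is a sketch under stated assumptions: the class name `PolResidualConfig` and the derived property are hypothetical, and the channel formula assumes a 1D lookup that returns 2 × radius + 1 samples per pyramid level, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolResidualConfig:  # hypothetical name, not from the original code
    pol_levels: int = 4      # pyramid levels in the polarization volume
    pol_radius: int = 4      # lookup radius of the polarization volume
    iters: int = 24          # GRU iterations
    hidden_dim: int = 64     # intermediate channels in PolCorrResidual
    init_scale: float = 0.1  # initial value of the learnable scale

    @property
    def pol_lookup_channels(self) -> int:
        # ASSUMPTION: each level contributes (2 * radius + 1) samples along the
        # epipolar line; this would be the pol_dim passed to PolCorrResidual.
        return self.pol_levels * (2 * self.pol_radius + 1)


cfg = PolResidualConfig()
print(cfg.pol_lookup_channels)
```

Keeping the derived channel count next to the hyperparameters makes it harder for pol_dim to drift out of sync with the lookup settings.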

6. Design Decisions and Rationale

| Decision | Rationale |
| --- | --- |
| Use an additive residual instead of a multiplicative gate | Addition does not break RAFT’s inductive bias; multiplication does |
| Apply the residual to the correlation volume | The correlation volume lives in disparity space, so the correction carries disparity semantics |
| Initialize the last layer of `PolCorrResidual` to 0 | `Δcorr ≈ 0` at the start of training, so behavior is close to plain RAFT-Stereo and learning starts from a stable point |
| Make `scale` a learnable parameter (init 0.1) | Lets the model decide the overall strength of the pol residual |
| UpdateBlock fully reuses the original RAFT | Downstream remains unchanged, maximally preserving pretrained capability |
| Pol images downsampled to 1/4 | Aligned with the correlation volume resolution |

7. Highlights

Polarization is injected into the correlation volume as an additive residual, only shifting and never globally scaling correlation values, fully preserving RAFT’s existing understanding of correlation.
The residual acts directly on the correlation volume rather than the spatial domain, so the polarization correction naturally carries disparity semantics and can distinguish “which disparity”.
The last layer of PolCorrResidual is initialized to 0, so Δcorr ≈ 0 at the start of training and the model learns the polarization residual from a stable starting point equivalent to plain RAFT-Stereo.
A single learnable scalar scale controls the overall strength of the polarization residual, letting the model decide how much to trust pol.
UpdateBlock fully reuses the original RAFT with no modifications, maximally preserving pretrained capability.