Side Information Injection Architecture — Po-Ting Lin (林柏廷)

1. Design Goals

This architecture adds polarization information to the standard RAFT-Stereo stereo matching framework with as little architectural change as possible.

Polarization information has a specific role:

The polarization signal (e.g. |I∥ - I⊥|) separates glass from non-glass regions, so it is context-level discriminative information.
The polarization signal has no disparity-discriminative peak, so it is unsuitable for the cost volume, where it would have to take part in the disparity search.

The design goal of this architecture: treat polarization as context-level side information and inject it into the network with minimum changes. Three principles are adopted:

The RGB stream is left completely untouched (fnet, cnet, corr_block remain as is).
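Polarization features are physics-based (no learning): feature extraction introduces no learnable parameters. The only new learnable weights are in the small polarization branch of the MotionEncoder (Section 3.2).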
Polarization features are not compressed by the CNet; they are injected directly into the GRU’s MotionEncoder.

Why not inject into the CNet

After several layers of convolutional compression, the hidden-layer output of the CNet retains mostly coarse, low-dimensional information, so polarization details would be diluted. Injecting directly at every GRU iteration instead yields three benefits:

Polarization information is injected at the GRU's full working resolution (H/8) at every iteration.
It bypasses the CNet compression and preserves spatial structure.
The GRU can directly see the spatial structure of polarization at every step.
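
The per-iteration injection described above can be sketched as a plain control-flow loop. This is a minimal sketch under stated assumptions, not the RAFT-Stereo update code: `run_updates`, `corr_lookup`, `motion_encoder`, and `gru_step` are hypothetical stand-ins, scalars replace tensors, the context input to the GRU is omitted, and computing the polarization features once before the loop is an assumption (they depend only on the input images).

```python
def run_updates(pol_features, corr_lookup, motion_encoder, gru_step,
                hidden, disp, iters):
    """Schematic GRU refinement loop with polarization as side information.

    pol_features is computed once (assumption) and handed unchanged to the
    motion encoder at every iteration; it never enters corr_lookup.
    """
    for _ in range(iters):
        corr = corr_lookup(disp)                           # RGB cost-volume lookup only
        motion = motion_encoder(corr, disp, pol_features)  # injection point
        hidden, delta = gru_step(hidden, motion)
        disp = disp + delta
    return hidden, disp


# Hypothetical scalar stand-ins so the control flow can be run as-is.
seen_pol = []

def corr_lookup(disp):
    return 1.0 / (1.0 + disp)

def motion_encoder(corr, disp, pol):
    seen_pol.append(pol)  # record what the encoder receives each step
    return corr + disp + pol

def gru_step(hidden, motion):
    # Returns (new hidden state, disparity update); the update is a fixed toy value.
    return 0.5 * hidden + 0.5 * motion, 0.25

hidden, disp = run_updates(pol_features=0.8, corr_lookup=corr_lookup,
                           motion_encoder=motion_encoder, gru_step=gru_step,
                           hidden=0.0, disp=0.0, iters=4)
```

The point of the sketch is only the data flow: the same polarization features reach the motion encoder at each of the four iterations, while the correlation lookup never sees them.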

2. Architecture

Overall architecture of Side Information Injection

The simplest architecture: on top of the original RAFT-Stereo, only a 12ch physics-based polarization feature is injected as “side information” into the input of the GRU’s MotionEncoder; the RGB stream is completely unchanged.

Pol Features (12 channels, physics-based, no learning):

|I∥ - I⊥| (3ch): polarization intensity — where the glass is.
I∥ / (I∥ + I⊥ + ε) (3ch): polarization ratio — related to Fresnel angle.
Sobel_x(|I∥ - I⊥|) (3ch): polarization edge x — glass boundary.
Sobel_y(|I∥ - I⊥|) (3ch): polarization edge y — glass boundary.
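
To make the four formulas concrete, the following worked example evaluates them on one channel of a tiny 4x4 image using plain Python lists. It is a sketch under assumptions: the pixel values are made up, `filter3x3` and `pol_features_1ch` are hypothetical helper names, and zero padding at the border is assumed (so border responses are padding artifacts, not glass edges).

```python
EPS = 1e-6

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def filter3x3(img, kernel):
    """Cross-correlate img with a 3x3 kernel, zero padding, same-size output."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[ky][kx] * img[yy][xx]
            out[y][x] = acc
    return out

def pol_features_1ch(i_par, i_perp):
    """Return (pol_diff, pol_ratio, edge_x, edge_y) for one image channel."""
    h, w = len(i_par), len(i_par[0])
    pol_diff = [[abs(i_par[y][x] - i_perp[y][x]) for x in range(w)]
                for y in range(h)]
    pol_ratio = [[i_par[y][x] / (i_par[y][x] + i_perp[y][x] + EPS)
                  for x in range(w)] for y in range(h)]
    return (pol_diff, pol_ratio,
            filter3x3(pol_diff, SOBEL_X), filter3x3(pol_diff, SOBEL_Y))

# Left two columns: unpolarized background (I_par == I_perp).
# Right two columns: a "glass" region with a strong polarization difference.
i_par = [[0.5, 0.5, 0.9, 0.9] for _ in range(4)]
i_perp = [[0.5, 0.5, 0.1, 0.1] for _ in range(4)]

diff, ratio, edge_x, edge_y = pol_features_1ch(i_par, i_perp)
```

In this toy image `diff` is zero on the background and about 0.8 on the glass, `ratio` moves from about 0.5 to about 0.9, and `edge_x` responds at the vertical background/glass boundary while the interior of `edge_y` stays at zero.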

3. Components and Modules

3.1 PolarizationFeatures (12ch, physics-based, no learning)

A polarization feature extraction module that requires no learning. Sobel kernels are registered as buffers and do not participate in training.

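import torch
import torch.nn as nn
import torch.nn.functional as F

class PolarizationFeatures(nn.Module):
    """Physics-based polarization features; no learning required"""
    def __init__(self, output_scale=8):
        super().__init__()
        self.output_scale = output_scale
        # Sobel kernels (registered as buffer, not trained).
        # Assumption: standard 3x3 Sobel, one copy per image channel so the
        # filter runs depthwise; the original elides the exact tensors.
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer('sobel_x', kx.repeat(3, 1, 1, 1))      # (3, 1, 3, 3)
        self.register_buffer('sobel_y', kx.t().repeat(3, 1, 1, 1))  # (3, 1, 3, 3)

    def forward(self, left, right):
        # Per the formulas below, left carries I∥ and right carries I⊥,
        # each of shape (B, 3, H, W).
        # 1. |I∥ - I⊥|
        pol_diff = torch.abs(left - right)

        # 2. I∥ / (I∥ + I⊥ + ε)
        pol_ratio = left / (left + right + 1e-6)

        # 3. Sobel edges (groups=3: each channel is filtered independently)
        pol_edge_x = F.conv2d(pol_diff, self.sobel_x, padding=1, groups=3)
        pol_edge_y = F.conv2d(pol_diff, self.sobel_y, padding=1, groups=3)

        # Concat to 12ch and downsample to H/8
        pol_features = torch.cat([pol_diff, pol_ratio, pol_edge_x, pol_edge_y], dim=1)
        return F.avg_pool2d(pol_features, kernel_size=self.output_scale)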

Output: 4 groups of 3ch each, totaling 12ch, downsampled to H/8 via avg_pool2d(kernel_size=8).

3.2 MotionEncoderV6

Extends the MotionEncoder of RAFT-Stereo with an additional polarization branch. The correlation and disparity branches are identical to the baseline; only the polarization branch is new.

class MotionEncoderV6(nn.Module):
    """Extended MotionEncoder that accepts pol_features"""
    def __init__(self, corr_dim=36, disp_dim=1, pol_dim=12):
        super().__init__()
        # Correlation branch (same as baseline)
        self.convc1 = nn.Conv2d(corr_dim, 64, 1)
        self.convc2 = nn.Conv2d(64, 64, 3, padding=1)

        # Disparity branch (same as baseline)
        self.convd1 = nn.Conv2d(disp_dim, 128, 7, padding=3)
        self.convd2 = nn.Conv2d(128, 64, 3, padding=1)

        # Polarization branch (NEW - this is the only new part)
        self.convp1 = nn.Conv2d(pol_dim, 32, 3, padding=1)
        self.convp2 = nn.Conv2d(32, 32, 3, padding=1)

        # Fusion: 64 + 64 + 32 = 160 -> 126
        self.conv = nn.Conv2d(160, 126, 3, padding=1)

    def forward(self, corr, disp, pol_features):
        cor = F.relu(self.convc2(F.relu(self.convc1(corr))))
        dis = F.relu(self.convd2(F.relu(self.convd1(disp))))
        pol = F.relu(self.convp2(F.relu(self.convp1(pol_features))))
        out = F.relu(self.conv(torch.cat([cor, dis, pol], dim=1)))
        return torch.cat([out, disp], dim=1)  # 127

The outputs of the three branches (cor 64ch + dis 64ch + pol 32ch = 160ch) are concatenated and projected by the fusion conv to 126ch; finally, disp (1ch) is concatenated to give 127ch.

4. Tensor Dimensions

Item	Shape	Description
pol_diff	(B, 3, H, W)	`|I∥ - I⊥|`
pol_ratio	(B, 3, H, W)	`I∥ / (I∥ + I⊥ + ε)`
pol_edge_x	(B, 3, H, W)	`Sobel_x(pol_diff)`
pol_edge_y	(B, 3, H, W)	`Sobel_y(pol_diff)`
pol_features (after concat)	(B, 12, H, W) → (B, 12, H/8, W/8)	avg_pool2d kernel=8 downsampling
Correlation branch output (cor)	(B, 64, H/8, W/8)	convc1/convc2
Disparity branch output (dis)	(B, 64, H/8, W/8)	convd1/convd2
Polarization branch output (pol)	(B, 32, H/8, W/8)	convp1/convp2
Fusion conv input	(B, 160, H/8, W/8)	concat(cor, dis, pol)
Fusion conv output (out)	(B, 126, H/8, W/8)	conv 160→126
MotionEncoderV6 final output	(B, 127, H/8, W/8)	concat(out, disp)
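
The shapes in the table can be traced for a concrete input size with a few lines of arithmetic. This is a sketch, not part of the model: `pooled_size` and `trace_shapes` are hypothetical helpers, and the pooling formula assumes the `avg_pool2d` defaults (stride equal to the kernel size, no padding, `ceil_mode=False`).

```python
def pooled_size(n, kernel=8):
    """Output length of avg_pool2d along one axis with stride == kernel."""
    return (n - kernel) // kernel + 1

def trace_shapes(B, H, W, scale=8):
    """Return the tensor shapes from the table for a (B, 3, H, W) input."""
    h, w = pooled_size(H, scale), pooled_size(W, scale)
    return {
        "pol_features (full res)": (B, 12, H, W),
        "pol_features (pooled)": (B, 12, h, w),
        "fusion conv input": (B, 64 + 64 + 32, h, w),
        "fusion conv output": (B, 126, h, w),
        "MotionEncoderV6 output": (B, 126 + 1, h, w),
    }

shapes = trace_shapes(B=2, H=384, W=768)
for name, shape in shapes.items():
    print(f"{name:26s} {shape}")
```

Note that with these defaults a height or width that is not a multiple of 8 is floored (for example, 100 pixels pool to 12), so the pooled polarization features only line up with the H/8 correlation features if the input is padded to a suitable multiple beforehand.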

5. Parameter Count

Component	Parameters
Baseline (RAFT-Stereo)	~5.3M
Added (pol branch)	~50K
Added overhead	~1%
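
The ~50K figure can be checked by hand from the layer definitions in Section 3.2. The sketch below assumes every convolution has a bias term (the `nn.Conv2d` default) and that the baseline fusion conv maps 64 + 64 = 128 channels to 126 with the same 3x3 kernel; `conv2d_params` is a hypothetical helper.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Number of parameters in a k x k Conv2d layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)

# New polarization branch
convp1 = conv2d_params(12, 32, 3)
convp2 = conv2d_params(32, 32, 3)

# Fusion conv grows from 128 -> 126 (assumed baseline) to 160 -> 126
fusion_extra = conv2d_params(160, 126, 3) - conv2d_params(128, 126, 3)

added = convp1 + convp2 + fusion_extra
overhead = added / 5.3e6  # relative to the ~5.3M baseline quoted in the table

print(convp1, convp2, fusion_extra, added, f"{overhead:.2%}")
```

Under these assumptions the added count is 3,488 + 9,248 + 36,288 = 49,024 parameters, about 0.93% of the quoted baseline, which is consistent with the rounded ~50K and ~1% in the table. Most of the increase comes from widening the fusion conv rather than from the two branch convolutions.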

6. Hyperparameters

Hyperparameter	Value	Description
pol_dim	12	Number of polarization feature channels
output_scale	8	Downsampling factor for polarization features (avg_pool2d kernel size)
corr_dim	36	Number of RGB correlation channels
ε	1e-6	Denominator stabilizer for pol_ratio
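
One way to keep these values together is a small frozen config object. This is a hypothetical sketch: `SideInfoConfig`, the `eps` field name, and the `image_channels` field are illustrative and do not appear in the original code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SideInfoConfig:
    """Hypothetical container for the hyperparameters in the table above."""
    pol_dim: int = 12         # number of polarization feature channels
    output_scale: int = 8     # avg_pool2d kernel size (features land at H/8)
    corr_dim: int = 36        # number of RGB correlation channels
    eps: float = 1e-6         # denominator stabilizer for pol_ratio
    image_channels: int = 3   # assumption: 3-channel polarization images

    def __post_init__(self):
        # Four feature groups (diff, ratio, edge_x, edge_y) per image channel.
        if self.pol_dim != 4 * self.image_channels:
            raise ValueError("pol_dim must equal 4 x image_channels")
        if self.output_scale < 1:
            raise ValueError("output_scale must be a positive integer")
        if self.eps <= 0:
            raise ValueError("eps must be positive")

cfg = SideInfoConfig()
```

The consistency check ties `pol_dim` to the four feature groups, so changing the number of input channels without updating `pol_dim` fails early instead of producing a shape mismatch inside the MotionEncoder.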

7. Design Decisions and Rationale

7.1 Physics-based, no learning

All polarization features are computed by deterministic physics formulas (pol_diff, pol_ratio, Sobel) — no learnable parameters are introduced. Physics-based features combined with very few new parameters make hypothesis verification easier than with a complex learnable branch.

7.2 Injection point: MotionEncoder input, bypassing CNet

The injection point matters. Bypassing the CNet and injecting directly at each GRU iteration preserves the spatial structure of the polarization signal. The GRU can directly see polarization spatial details at every step.

7.3 Do not pollute the cost volume

The polarization signal has no disparity-discriminative peak. Polarization is context-level information rather than cost-level information, so it is not placed in the cost volume — only used as side information.
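
A toy 1D example illustrates what "no disparity-discriminative peak" means. It is a sketch under strong assumptions: all values are made up, the cost is a plain absolute difference rather than RAFT-Stereo's feature correlation, and the polarization response is assumed to be uniform across the glass surface. `matching_cost` is a hypothetical helper.

```python
def matching_cost(left_row, right_row, x, max_disp):
    """Cost of left pixel x against right pixel x - d, for d = 0..max_disp."""
    return [abs(left_row[x] - right_row[x - d]) for d in range(max_disp + 1)]

TRUE_DISP = 3

# Textured RGB-like row: the left view is the right view shifted by TRUE_DISP.
texture_right = [0.1, 0.7, 0.3, 0.9, 0.2, 0.8, 0.4, 0.6, 0.5, 0.0, 1.0, 0.35]
texture_left = [0.0] * TRUE_DISP + texture_right[:-TRUE_DISP]

# Polarization-like row: uniform response over a glass pane in both views.
pol_left = [0.8] * len(texture_right)
pol_right = [0.8] * len(texture_right)

x = 8
tex_cost = matching_cost(texture_left, texture_right, x, max_disp=5)
pol_cost = matching_cost(pol_left, pol_right, x, max_disp=5)

best_tex = tex_cost.index(min(tex_cost))
print("texture cost:", tex_cost, "-> best disparity", best_tex)
print("polarization cost:", pol_cost)
```

The textured row has a single minimum at the true disparity, while the uniform polarization row gives the same cost for every candidate. A signal like that tells the network where the glass is but not how far to shift, which is why it is routed in as context rather than added to the cost volume.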

7.4 Very few new parameters

Only the MotionEncoder is extended: a new polarization branch plus a fusion conv widened from 128 to 160 input channels, about 50K parameters in total, roughly 1% overhead relative to the baseline. The RGB stream is completely unchanged, preserving the full capability of the original RAFT-Stereo.

8. Highlights

Zero-learning polarization features: all polarization features are computed by deterministic physics formulas (intensity difference, polarization ratio, Sobel edges) — no learnable parameters — which makes hypothesis verification more direct.
Side information rather than cost volume: polarization is explicitly positioned as context-level information; the cost volume is not polluted, so a signal without a disparity peak cannot interfere with correlation matching.
Direct injection bypassing CNet: polarization features are not compressed by the CNet and are injected into the MotionEncoder at the GRU's working resolution (H/8) at every iteration, preserving the spatial structure of polarization.
Tiny ~1% overhead: only the MotionEncoder input layer is extended with a new polarization branch (~50K parameters); the RGB stream is completely unchanged, preserving all baseline capabilities.