1. Design Goals
This architecture introduces polarization information on top of the standard RAFT-Stereo stereo matching framework while pursuing the minimum possible architectural change.
Polarization information has a specific role:
- The polarization signal (e.g. |I∥ - I⊥|) is separable between glass and non-glass regions, so it is context-level discriminative information.
- The polarization signal has no disparity-discriminative peak and is unsuitable for placing inside the cost volume to participate in disparity search.
The design goal of this architecture: treat polarization as context-level side information and inject it into the network with minimum changes. Three principles are adopted:
- The RGB stream is left completely untouched (fnet, cnet, corr_block remain as is).
- Polarization features are physics-based: the feature extraction itself involves no learning and introduces no learnable parameters.
- Polarization features are not compressed by the CNet; they are injected directly into the GRU’s MotionEncoder.
Why not inject into the CNet
The CNet compresses its input through several convolutional layers, so its hidden-state output favors coarse, low-dimensional context, and fine polarization detail would be diluted. Injecting directly at every GRU iteration instead yields three benefits:
- Polarization information is injected at the GRU's working resolution (H/8) at every iteration.
- It bypasses the CNet compression and preserves spatial structure.
- The GRU can directly see the spatial structure of polarization at every step.
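The per-iteration injection can be illustrated with a toy refinement loop. This is a minimal sketch, not the RAFT-Stereo update block: `ToyUpdateBlock`, `refine`, and `fake_lookup` are hypothetical stand-ins, and the update is a simplified GRU-style step without a reset gate. The point it shows is only that the same H/8 polarization tensor is passed in at every step, while the correlation lookup changes with the current disparity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyUpdateBlock(nn.Module):
    """Hypothetical stand-in for MotionEncoder + GRU (simplified, no reset gate)."""

    def __init__(self, corr_dim=36, pol_dim=12, hidden_dim=32):
        super().__init__()
        # Encodes correlation, current disparity and polarization features together.
        self.encode = nn.Conv2d(corr_dim + 1 + pol_dim, hidden_dim, 3, padding=1)
        self.gate = nn.Conv2d(2 * hidden_dim, hidden_dim, 3, padding=1)
        self.cand = nn.Conv2d(2 * hidden_dim, hidden_dim, 3, padding=1)
        self.head = nn.Conv2d(hidden_dim, 1, 3, padding=1)

    def forward(self, h, corr, disp, pol):
        x = F.relu(self.encode(torch.cat([corr, disp, pol], dim=1)))
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.gate(hx))   # update gate
        q = torch.tanh(self.cand(hx))      # candidate hidden state
        h = (1 - z) * h + z * q
        return h, self.head(h)             # new hidden state, disparity update


def refine(update, h, lookup_corr, pol, iters=4):
    disp = torch.zeros(h.shape[0], 1, *h.shape[-2:])
    for _ in range(iters):
        corr = lookup_corr(disp)                # depends on the current disparity
        h, delta = update(h, corr, disp, pol)   # same pol tensor at every iteration
        disp = disp + delta
    return disp


torch.manual_seed(0)
B, H8, W8 = 1, 6, 8                             # hypothetical H/8 x W/8 grid
update = ToyUpdateBlock()
h0 = torch.zeros(B, 32, H8, W8)
pol = torch.rand(B, 12, H8, W8)                 # computed once, reused in the loop
fake_lookup = lambda disp: torch.rand(B, 36, H8, W8)  # placeholder for the real lookup
disp = refine(update, h0, fake_lookup, pol)
print(disp.shape)  # torch.Size([1, 1, 6, 8])
```

Because the polarization tensor does not depend on the disparity estimate, it only needs to be computed once per image pair and can be reused unchanged across iterations.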
2. Architecture
This is the simplest possible design: on top of the original RAFT-Stereo, only a 12-channel physics-based polarization feature is injected as “side information” into the input of the GRU’s MotionEncoder, and the RGB stream is completely unchanged.
Pol Features (12 channels, physics-based, no learning):
- |I∥ - I⊥| (3ch): polarization intensity, indicating where the glass is.
- I∥ / (I∥ + I⊥ + ε) (3ch): polarization ratio, related to the Fresnel angle.
- Sobel_x(|I∥ - I⊥|) (3ch): polarization edge in x, marking the glass boundary.
- Sobel_y(|I∥ - I⊥|) (3ch): polarization edge in y, marking the glass boundary.
3. Components and Modules
3.1 PolarizationFeatures (12ch, physics-based, no learning)
A polarization feature extraction module that requires no learning. Sobel kernels are registered as buffers and do not participate in training.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PolarizationFeatures(nn.Module):
        """Physics-based polarization features; no learning required"""
        def __init__(self, output_scale=8):
            super().__init__()
            # Downsampling factor used as the avg_pool2d kernel size
            self.output_scale = output_scale
            # Sobel kernels (registered as buffer, not trained)
            self.register_buffer('sobel_x', ...)
            self.register_buffer('sobel_y', ...)

        def forward(self, left, right):
            # 1. |I∥ - I⊥|
            pol_diff = torch.abs(left - right)
            # 2. I∥ / (I∥ + I⊥ + ε)
            pol_ratio = left / (left + right + 1e-6)
            # 3. Sobel edges (sobel_x / sobel_y: filtering with the registered kernels)
            pol_edge_x = sobel_x(pol_diff)
            pol_edge_y = sobel_y(pol_diff)
            # Concat and downsample to H/8
            pol_features = torch.cat([pol_diff, pol_ratio, pol_edge_x, pol_edge_y], dim=1)
            return F.avg_pool2d(pol_features, kernel_size=self.output_scale)
Output: 4 groups of 3ch each, totaling 12ch, downsampled to H/8 via avg_pool2d(kernel_size=8).
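The listing above leaves the Sobel kernels and the edge filtering unspecified. One way to fill those in is sketched below, under stated assumptions: standard 3×3 Sobel kernels, applied per colour channel with a depthwise `F.conv2d` (`groups=3`) and `padding=1`. The class name `PolarizationFeaturesSketch` and the `eps` argument are illustrative and do not come from the original module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolarizationFeaturesSketch(nn.Module):
    """Runnable sketch of the 12-channel features; holds buffers only, no parameters."""

    def __init__(self, output_scale=8, eps=1e-6):
        super().__init__()
        self.output_scale = output_scale
        self.eps = eps
        # ASSUMPTION: standard 3x3 Sobel kernel; the source does not give the values.
        kx = torch.tensor([[-1.0, 0.0, 1.0],
                           [-2.0, 0.0, 2.0],
                           [-1.0, 0.0, 1.0]])
        # Shape (3, 1, 3, 3): one kernel copy per colour channel (depthwise conv).
        self.register_buffer("sobel_x", kx[None, None].repeat(3, 1, 1, 1))
        self.register_buffer("sobel_y", kx.t()[None, None].repeat(3, 1, 1, 1))

    def _edges(self, x, kernel):
        # groups=3 filters each channel independently; padding=1 keeps H x W.
        return F.conv2d(x, kernel, padding=1, groups=3)

    def forward(self, left, right):
        pol_diff = torch.abs(left - right)                # |I_par - I_perp|
        pol_ratio = left / (left + right + self.eps)      # I_par / (I_par + I_perp + eps)
        pol_edge_x = self._edges(pol_diff, self.sobel_x)
        pol_edge_y = self._edges(pol_diff, self.sobel_y)
        feats = torch.cat([pol_diff, pol_ratio, pol_edge_x, pol_edge_y], dim=1)
        return F.avg_pool2d(feats, kernel_size=self.output_scale)


module = PolarizationFeaturesSketch()
left = torch.rand(2, 3, 64, 96)    # hypothetical I_par image batch
right = torch.rand(2, 3, 64, 96)   # hypothetical I_perp image batch
out = module(left, right)
print(out.shape)                                     # torch.Size([2, 12, 8, 12])
print(sum(p.numel() for p in module.parameters()))   # 0
```

Registering the kernels with `register_buffer` keeps them out of `module.parameters()`, so an optimizer never updates them, while they still move with the module across devices.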
3.2 MotionEncoderV6
Extends the MotionEncoder of RAFT-Stereo with an additional polarization branch. The correlation and disparity branches are identical to the baseline; only the polarization branch is new.

    class MotionEncoderV6(nn.Module):
        """Extended MotionEncoder that accepts pol_features"""
        def __init__(self, corr_dim=36, disp_dim=1, pol_dim=12):
            super().__init__()
            # Correlation branch (same as baseline)
            self.convc1 = nn.Conv2d(corr_dim, 64, 1)
            self.convc2 = nn.Conv2d(64, 64, 3, padding=1)
            # Disparity branch (same as baseline)
            self.convd1 = nn.Conv2d(disp_dim, 128, 7, padding=3)
            self.convd2 = nn.Conv2d(128, 64, 3, padding=1)
            # Polarization branch (NEW - this is the only new part)
            self.convp1 = nn.Conv2d(pol_dim, 32, 3, padding=1)
            self.convp2 = nn.Conv2d(32, 32, 3, padding=1)
            # Fusion: 64 + 64 + 32 = 160 -> 126
            self.conv = nn.Conv2d(160, 126, 3, padding=1)

        def forward(self, corr, disp, pol_features):
            cor = F.relu(self.convc2(F.relu(self.convc1(corr))))
            dis = F.relu(self.convd2(F.relu(self.convd1(disp))))
            pol = F.relu(self.convp2(F.relu(self.convp1(pol_features))))
            out = F.relu(self.conv(torch.cat([cor, dis, pol], dim=1)))
            return torch.cat([out, disp], dim=1)  # 127
The outputs of the three branches (cor 64ch + dis 64ch + pol 32ch = 160ch) are concatenated and projected by the fusion conv to 126ch; finally, disp (1ch) is concatenated to give 127ch.
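The channel bookkeeping of the fusion step can be checked in isolation. The sketch below uses random tensors as stand-ins for the three branch outputs, with a hypothetical batch size and grid size, and applies a fresh 160→126 convolution followed by the disparity concat.

```python
import torch
import torch.nn as nn

B, H8, W8 = 2, 8, 12                    # hypothetical batch size and H/8 x W/8 grid
cor = torch.rand(B, 64, H8, W8)         # stand-in for the correlation branch output
dis = torch.rand(B, 64, H8, W8)         # stand-in for the disparity branch output
pol = torch.rand(B, 32, H8, W8)         # stand-in for the polarization branch output
disp = torch.rand(B, 1, H8, W8)         # current disparity estimate

fusion = nn.Conv2d(160, 126, 3, padding=1)
fused_in = torch.cat([cor, dis, pol], dim=1)    # 64 + 64 + 32 = 160 channels
out = torch.relu(fusion(fused_in))              # 126 channels, same spatial size
motion = torch.cat([out, disp], dim=1)          # 126 + 1 = 127 channels
print(fused_in.shape[1], out.shape[1], motion.shape[1])  # 160 126 127
```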
4. Tensor Dimensions
| Item | Shape | Description |
|---|---|---|
| pol_diff | (B, 3, H, W) | \|I∥ - I⊥\| |
| pol_ratio | (B, 3, H, W) | I∥ / (I∥ + I⊥ + ε) |
| pol_edge_x | (B, 3, H, W) | Sobel_x(pol_diff) |
| pol_edge_y | (B, 3, H, W) | Sobel_y(pol_diff) |
| pol_features (after concat) | (B, 12, H, W) → (B, 12, H/8, W/8) | avg_pool2d kernel=8 downsampling |
| Correlation branch output (cor) | (B, 64, H/8, W/8) | convc1/convc2 |
| Disparity branch output (dis) | (B, 64, H/8, W/8) | convd1/convd2 |
| Polarization branch output (pol) | (B, 32, H/8, W/8) | convp1/convp2 |
| Fusion conv input | (B, 160, H/8, W/8) | concat(cor, dis, pol) |
| Fusion conv output (out) | (B, 126, H/8, W/8) | conv 160→126 |
| MotionEncoderV6 final output | (B, 127, H/8, W/8) | concat(out, disp) |
5. Parameter Count
| Component | Parameters |
|---|---|
| Baseline (RAFT-Stereo) | ~5.3M |
| Added (pol branch) | ~50K |
| Added overhead | ~1% |
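The ~50K figure can be reproduced from the layer shapes in section 3.2. The arithmetic below is a sketch under one assumption: without the polarization branch, the fusion conv would take 64 + 64 = 128 input channels, so only the 32 extra input channels count as added. The ~5.3M baseline is taken from the table as given.

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus biases of a Conv2d layer with a square kernel."""
    return in_ch * out_ch * kernel * kernel + out_ch


convp1 = conv2d_params(12, 32, 3)   # 3488
convp2 = conv2d_params(32, 32, 3)   # 9248
# ASSUMPTION: the baseline fusion conv is 128 -> 126; only the widening is new.
fusion_extra = conv2d_params(160, 126, 3) - conv2d_params(128, 126, 3)  # 36288

added = convp1 + convp2 + fusion_extra
print(added)                          # 49024
print(round(100 * added / 5.3e6, 2))  # 0.92
```

Most of the added parameters (about 36K of 49K) come from widening the fusion conv rather than from the two polarization convolutions themselves.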
6. Hyperparameters
| Hyperparameter | Value | Description |
|---|---|---|
| pol_dim | 12 | Number of polarization feature channels |
| output_scale | 8 | Downsampling factor for polarization features (avg_pool2d kernel size) |
| corr_dim | 36 | Number of RGB correlation channels |
| ε | 1e-6 | Denominator stabilizer for pol_ratio |
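For reference, these values can be collected in one small configuration object. `PolConfig` is a hypothetical container, not part of the original code. The `corr_levels` and `corr_radius` values of 4 are an assumption based on the RAFT-Stereo defaults, under which `corr_dim` works out to 4 × (2·4 + 1) = 36. The last lines show what ε does where both polarization images are zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolConfig:
    """Hypothetical container for the hyperparameters in the table above."""
    pol_dim: int = 12        # 4 feature groups x 3 channels
    output_scale: int = 8    # avg_pool2d kernel size
    corr_levels: int = 4     # ASSUMPTION: RAFT-Stereo default
    corr_radius: int = 4     # ASSUMPTION: RAFT-Stereo default
    eps: float = 1e-6        # denominator stabilizer for pol_ratio

    @property
    def corr_dim(self) -> int:
        # One correlation value per offset in [-radius, radius] at each pyramid level.
        return self.corr_levels * (2 * self.corr_radius + 1)


cfg = PolConfig()
print(cfg.corr_dim)  # 36

# With eps, a pixel that is dark in both images gives 0.0 instead of a division by zero.
ratio_dark = 0.0 / (0.0 + 0.0 + cfg.eps)
print(ratio_dark)  # 0.0
```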
7. Design Decisions and Rationale
7.1 Physics-based, no learning
All polarization features are computed by deterministic physics formulas (pol_diff, pol_ratio, Sobel) — no learnable parameters are introduced. Physics-based features combined with very few new parameters make hypothesis verification easier than with a complex learnable branch.
7.2 Injection point: MotionEncoder input, bypassing CNet
The injection point matters. Bypassing the CNet and injecting directly at each GRU iteration preserves the spatial structure of the polarization signal. The GRU can directly see polarization spatial details at every step.
7.3 Do not pollute the cost volume
The polarization signal has no disparity-discriminative peak. Polarization is context-level information rather than cost-level information, so it is not placed in the cost volume — only used as side information.
7.4 Very few new parameters
Only the MotionEncoder input layer is extended (a new polarization branch), about 50K parameters, roughly 1% overhead relative to the baseline. The RGB stream is completely unchanged, preserving the full capability of the original RAFT-Stereo.
8. Highlights
- Zero-learning polarization features: all polarization features are computed by deterministic physics formulas (intensity difference, polarization ratio, Sobel edges) — no learnable parameters — which makes hypothesis verification more direct.
- Side information rather than cost volume: polarization is explicitly positioned as context-level information; the cost volume is not polluted, avoiding interference of correlation matching by a signal without a disparity peak.
- Direct injection bypassing CNet: polarization features are not compressed by the CNet and are injected into the MotionEncoder at the GRU's H/8 working resolution at every GRU iteration, preserving the spatial structure of polarization.
- Tiny ~1% overhead: only the MotionEncoder input layer is extended with a new polarization branch (~50K parameters); the RGB stream is completely unchanged, preserving all baseline capabilities.