1. Design Goals
The problem this architecture solves: the polarization image pair I∥ / I⊥ carries a usable brightness-difference signal in glass regions, and the value of that signal lies in the macroscopic difference in brightness magnitude between the two images. The first BatchNorm layer of the stereo-matching backbone fnet normalizes the input distribution, so feeding the polarization pair directly into fnet erases this difference at the very first layer.
An explicit polarization encoder is therefore needed:
- The polarization branch must bypass the BN of fnet and operate from the raw |I∥ - I⊥| polarization difference.
- Polarization features and stereo features are computed separately and then fused, rather than mixed inside a single encoder.
This is the core idea of the “Dual-Stream” architecture: one RGB / stereo stream, one polarization stream. The stereo stream handles ordinary feature matching, the polarization stream preserves and extracts the polarization signal in glass regions, and the two merge at the fusion stage.
2. Architecture
2.1 Overall Architecture
2.2 PolarizationEncoder Internals
3. Components and Modules
3.1 PolarizationEncoder
The polarization encoder converts the polarization difference pol_diff into a polarization feature map. Internal flow:
pol_diff → Soft Threshold → Stem Conv → ResidualBlock → SpatialAttention → pol_features
Component responsibilities:
- Soft Threshold: a soft, sigmoid-shaped selection over polarization intensity (see §3.2).
- Stem Conv: an initial convolution that projects the soft-thresholded signal into feature space.
- ResidualBlock2D: a residual connection that improves gradient flow.
- SpatialAttention: spatial attention that learns to focus on glass regions.
- Output dimension: 64.
Important design principle: the Pol stream must not use any form of Normalization (BatchNorm / InstanceNorm / GroupNorm). All forms of Normalization standardize the input distribution and wash out the macroscopic brightness difference. The value of the polarization signal is precisely this “magnitude difference.” The RGB stream can use BN (it relies on micro-structure), but the Pol stream absolutely cannot.
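The effect described above can be illustrated with a toy example. The sketch below is not taken from the project code: it applies a plain per-image standardization (the statistic InstanceNorm computes; BatchNorm and GroupNorm differ only in which axes share statistics) to two hypothetical 4-pixel patches, where the perpendicular patch is assumed to be the parallel patch dimmed to 40 % brightness.

```python
from statistics import fmean, pstdev


def standardize(values, eps=1e-5):
    """Zero-mean, unit-variance scaling of one image's pixels."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


# Hypothetical patches: I_perp is I_par dimmed to 40 % brightness.
i_par = [0.20, 0.40, 0.60, 0.80]
i_perp = [0.4 * v for v in i_par]

# Mean absolute difference before and after per-image standardization.
raw_gap = fmean(abs(a - b) for a, b in zip(i_par, i_perp))
norm_gap = fmean(
    abs(a - b) for a, b in zip(standardize(i_par), standardize(i_perp))
)

print(f"raw gap: {raw_gap:.3f}, gap after standardization: {norm_gap:.6f}")
```

The raw gap is 0.3, while the standardized patches are nearly identical: the global brightness ratio, which is the signal the Pol stream relies on, no longer appears in the normalized values.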
3.2 Soft Threshold
A soft threshold over polarization intensity P, with weight formula:
w = sigmoid(kappa * (P - tau))
- tau: threshold (pol_threshold); polarization intensity below this value is suppressed.
- kappa: sharpness, controlling the steepness of the sigmoid transition.
- The soft threshold (as opposed to a hard threshold) preserves differentiability, allowing the network to adjust sensitivity to polarization intensity during training.
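As a worked example, the weight formula can be evaluated for a few scalar intensities using the values from the hyperparameter tables (pol_threshold = 0.05, pol_sharpness = 20.0). The function name soft_threshold_weight is illustrative, not the project's.

```python
import math


def soft_threshold_weight(p, tau=0.05, kappa=20.0):
    """w = sigmoid(kappa * (P - tau)) for a scalar polarization intensity P."""
    return 1.0 / (1.0 + math.exp(-kappa * (p - tau)))


weights = {p: round(soft_threshold_weight(p), 3) for p in (0.0, 0.05, 0.10, 0.20)}
print(weights)  # {0.0: 0.269, 0.05: 0.5, 0.1: 0.731, 0.2: 0.953}
```

With kappa = 20, an intensity of zero still keeps a weight of about 0.27, so the suppression below tau is gradual rather than a cut-off; a larger kappa moves the curve closer to a hard threshold.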
3.3 SpatialAttention
Learns a spatial weight map that “focuses on glass regions,” so polarization features concentrate where polarization differences are meaningful.
3.4 ResidualBlock2D
A 2D residual block; its skip connection improves gradient flow, allowing a deeper polarization encoder to train stably.
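The components of §3.1 to §3.4 could be assembled roughly as follows. This is a minimal sketch under several assumptions that the text does not confirm: the soft threshold gates pol_diff by its weight and keeps tau and kappa as learnable parameters, the stem is a single 3×3 convolution, the attention is a CBAM-style map (channel mean and max, 7×7 convolution, sigmoid), and pol_diff has three channels (in_ch is a hypothetical argument). In line with the design principle above, no normalization layer is used anywhere.

```python
import torch
import torch.nn as nn


class SoftThreshold(nn.Module):
    """Gate the input by w = sigmoid(kappa * (P - tau)); tau, kappa learnable."""

    def __init__(self, tau=0.05, kappa=20.0):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(tau))
        self.kappa = nn.Parameter(torch.tensor(kappa))

    def forward(self, pol_diff):
        return pol_diff * torch.sigmoid(self.kappa * (pol_diff - self.tau))


class ResidualBlock2D(nn.Module):
    """Two 3x3 convolutions with a skip connection and no normalization."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.conv2(self.act(self.conv1(x))))


class SpatialAttention(nn.Module):
    """Assumed CBAM-style map: channel mean and max -> 7x7 conv -> sigmoid."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * torch.sigmoid(self.conv(pooled))


class PolarizationEncoder(nn.Module):
    """pol_diff -> Soft Threshold -> Stem Conv -> ResidualBlock -> SpatialAttention."""

    def __init__(self, in_ch=3, pol_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            SoftThreshold(),
            nn.Conv2d(in_ch, pol_dim, 3, padding=1),  # stem conv (assumed 3x3)
            nn.ReLU(),
            ResidualBlock2D(pol_dim),
            SpatialAttention(),
        )

    def forward(self, pol_diff):
        return self.net(pol_diff)
```

The sketch keeps the input resolution; a real encoder would also need to match the spatial resolution of the stereo features before fusion.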
3.5 Disparity Alignment (warp_with_disparity)
Problem
The raw pol_diff = |left(x,y) - right(x,y)| compares the left and right images at the same pixel coordinates. The two cameras are separated by a 65 mm baseline, with a disparity range of 55–94 px, so this comparison actually pairs different 3D points. The resulting pol_diff mixes disparity-induced error into the polarization difference instead of isolating it.
Solution
Use warp_with_disparity() to warp the right image into the left view before differencing:
    def warp_with_disparity(img, disparity):
        """Warp the right image into the left view using disparity."""
        # right(x - disparity, y) corresponds to left(x, y)
        ...

    # Inside PolarizationEncoder.compute_pol_diff()
    if disparity is not None:
        right_aligned = warp_with_disparity(right, disparity)
    else:
        right_aligned = right  # Fall back to the raw form without disparity
    pol_diff = torch.abs(left - right_aligned)  # After alignment, a pure polarization difference
Behavior:
- With GT disparity: use the GT disparity for alignment to compute a pure polarization difference.
- Without GT disparity: fall back to the raw, unaligned form.
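One possible way to realize the warp is bilinear resampling with torch.nn.functional.grid_sample. The sketch below is an illustration, not the project's implementation, and assumes that disparity is a positive left-view disparity in pixels with shape (B, 1, H, W), that out-of-range samples are filled with zeros, and that occlusions are not handled.

```python
import torch
import torch.nn.functional as F


def warp_with_disparity(img, disparity):
    """Warp the right image (B, C, H, W) into the left view.

    Samples right(x - disparity, y) for every left pixel (x, y).
    """
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img.dtype, device=img.device),
        torch.arange(w, dtype=img.dtype, device=img.device),
        indexing="ij",
    )
    x_src = xs.unsqueeze(0) - disparity[:, 0]   # (B, H, W) source columns
    y_src = ys.unsqueeze(0).expand(b, -1, -1)   # (B, H, W) source rows
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack(
        [2.0 * x_src / (w - 1) - 1.0, 2.0 * y_src / (h - 1) - 1.0], dim=-1
    )
    return F.grid_sample(
        img, grid, mode="bilinear", padding_mode="zeros", align_corners=True
    )


def compute_pol_diff(left, right, disparity=None):
    """|left - right_aligned|, falling back to the unaligned right image."""
    if disparity is not None:
        right_aligned = warp_with_disparity(right, disparity)
    else:
        right_aligned = right
    return (left - right_aligned).abs()
```

For a synthetic pair in which the right image is the left image shifted by a constant disparity, the aligned difference is close to zero wherever the source pixel lies inside the right image; only the strip at the left border, which has no counterpart, keeps a residual.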
3.6 Fusion
Fusion uses a simple concat + conv:
    fused = torch.cat([stereo_feat, pol_feat], dim=1)  # (B, 128 + pol_dim, H, W): 192 for pol_dim = 64, 256 for pol_dim = 128
    fused = self.fusion_conv(fused)                    # (B, 128, H, W)
The stereo and polarization features are concatenated along the channel dimension and a convolution fuses them back to the original channel count.
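A self-contained version of this fusion step might look like the sketch below. The 3×3 kernel size and the constructor arguments are assumptions; the text only states that a convolution maps the concatenated features back to the stereo channel count.

```python
import torch
import torch.nn as nn


class Fusion(nn.Module):
    """Concatenate stereo and polarization features, then project back."""

    def __init__(self, stereo_dim=128, pol_dim=64):
        super().__init__()
        # Kernel size 3 is an assumption, not taken from the source.
        self.fusion_conv = nn.Conv2d(
            stereo_dim + pol_dim, stereo_dim, kernel_size=3, padding=1
        )

    def forward(self, stereo_feat, pol_feat):
        fused = torch.cat([stereo_feat, pol_feat], dim=1)  # (B, 128 + pol_dim, H, W)
        return self.fusion_conv(fused)                     # (B, 128, H, W)


fusion = Fusion(pol_dim=64)
out = fusion(torch.rand(2, 128, 8, 16), torch.rand(2, 64, 8, 16))
print(out.shape)  # torch.Size([2, 128, 8, 16])
```

Because the output has the same channel count as stereo_feat, the downstream correlation and GRU stages can consume the fused features without changes to their input dimensions.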
3.7 Polarization-aware Loss
The polarization-aware loss function:
    loss = sum_{i=1..N} gamma^(N-i)
           * [glass_mask * glass_weight + (1-glass_mask)]
           * [pol_weight * pol_diff + (1-pol_diff)]
           * |D_pred - D_gt|
- gamma^(N-i): RAFT-style iterative weighting; later GRU iterations carry higher weight.
- [glass_mask * glass_weight + (1-glass_mask)]: glass regions are weighted by glass_weight; background weight is 1.
- [pol_weight * pol_diff + (1-pol_diff)]: regions with large polarization differences receive extra weight (pol_weight=2.0).
- |D_pred - D_gt|: L1 error between predicted and GT disparity.
- Rationale: a large polarization difference indicates a glass region, where matching needs to be more accurate, so errors there are penalized more heavily.
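The formula can be written out as a short function. The sketch below makes assumptions beyond the text: the per-pixel terms are averaged over the image, gamma defaults to 0.9, pol_diff is a single-channel map scaled to [0, 1], and the function and argument names are illustrative.

```python
import torch


def polarization_aware_loss(disp_preds, d_gt, glass_mask, pol_diff,
                            gamma=0.9, glass_weight=5.0, pol_weight=2.0):
    """disp_preds: list of N per-iteration predictions, each (B, 1, H, W)."""
    n = len(disp_preds)
    # Glass pixels get glass_weight, background pixels get 1.
    region_w = glass_mask * glass_weight + (1.0 - glass_mask)
    # Interpolates from 1 (pol_diff = 0) to pol_weight (pol_diff = 1).
    pol_w = pol_weight * pol_diff + (1.0 - pol_diff)
    loss = torch.zeros((), dtype=d_gt.dtype, device=d_gt.device)
    for i, d_pred in enumerate(disp_preds, start=1):
        l1 = (d_pred - d_gt).abs()
        # gamma^(N-i): the last iteration (i = N) has weight 1.
        loss = loss + gamma ** (n - i) * (region_w * pol_w * l1).mean()
    return loss
```

With glass_weight = 5.0 and pol_weight = 2.0, a glass pixel with pol_diff = 1 contributes ten times as much as a background pixel with pol_diff = 0 for the same disparity error.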
4. Data Flow
- The left and right images pass through the shared-weight FeatureEncoder to produce fmap1 and fmap2.
- Compute the polarization difference: with GT disparity, right_aligned = warp_with_disparity(right, GT_disparity), otherwise right_aligned = right; then pol_diff = |left - right_aligned|.
- pol_diff passes through the PolarizationEncoder (Soft Threshold → Stem Conv → ResidualBlock → SpatialAttention) to produce polarization features (64-dim).
- Fusion concatenates the stereo and polarization features and fuses them with a convolution.
- The fused features feed into the correlation pyramid + GRU for iterative refinement → disparity.
5. Tensor Dimensions
| Tensor | Dimensions | Description |
|---|---|---|
| stereo_feat | (B, 128, H, W) | Stereo-stream features |
| pol_feat / pol_features | (B, 64, H, W) | Polarization-stream features (pol_dim, output dim 64) |
| torch.cat([stereo_feat, pol_feat], dim=1) | (B, 192, H, W) | After concatenation (128 + pol_dim; 256 when pol_dim = 128) |
| fusion_conv output | (B, 128, H, W) | Fused back to 128 channels |
Note: pol_dim can be 64 or 128; this document uses 64-dim by default. The hyperparameter tables are listed separately in §6.
6. Hyperparameters
6.1 PolarizationEncoder Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| pol_dim | 64 | Polarization feature dimension (more capacity for complex features) |
| pol_threshold | 0.05 | Soft Threshold tau (sensitive to weak polarization) |
| glass_weight | 5.0 | Loss weight on glass regions |
| pol_lr_mult | 5.0 | Learning-rate multiplier for polarization layers (avoids new-layer instability) |
| pol_weight | 2.0 | Polarization weight in Polarization-aware Loss |
6.2 Training Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| dual_stream | enabled | Enable the Dual-Stream architecture |
| pretrained | raftstereo-sceneflow.pth | SceneFlow pre-trained weights |
| pol_dim | 64 / 128 | Polarization feature dimension |
| pol_threshold | 0.05 | Soft Threshold tau |
| pol_sharpness | 20.0 | Soft Threshold sharpness kappa |
| pol_lr_mult | 5.0 | Learning-rate multiplier for polarization layers |
| pol_weight | 2.0 | Polarization weight in Polarization-aware Loss |
| glass_weight | 5.0 | Loss weight on glass regions |
| strict_glass_weight | 0.5 | Extra weight on the strict glass core (intersection mask of left/right views) |
| batch_size | 8 | Training batch size |
| num_steps | 60000 | Training steps |
| lr | 0.0003 | Learning rate |
| iters | 24 | GRU iterations |
7. Design Decisions and Rationale
7.1 Why a “dual-stream” rather than a single encoder
Isolating polarization processing into its own stream lets it bypass the BN of the RGB encoder entirely and operate from raw |I∥ - I⊥|, preserving the macroscopic magnitude difference. If polarization features were mixed into a single encoder with stereo features, the macroscopic polarization difference would be normalized away at the first BN layer.
7.2 Why all Normalization is forbidden in the Pol stream
BatchNorm / InstanceNorm / GroupNorm all standardize the input distribution and wash out the macroscopic brightness difference. The value of the polarization signal is precisely this “magnitude difference.” The RGB stream depends on micro-structure and can use BN; the Pol stream absolutely cannot.
7.3 Why Soft Threshold rather than a hard threshold
A hard threshold is a step function: its gradient is zero almost everywhere, so it cannot be tuned by backpropagation. Soft Threshold uses the form sigmoid(kappa * (P - tau)) to preserve differentiability, allowing the network to adjust both its sensitivity to polarization intensity and the transition sharpness during training.
7.4 Why disparity alignment (warp) is needed
The 65 mm baseline means the same pixel coordinates in the left and right views correspond to different 3D points. Without alignment, pol_diff is contaminated with disparity errors. Warping with GT disparity makes pol_diff a “pure polarization difference.”
7.5 Why SpatialAttention
The polarization signal is meaningful only in glass regions. SpatialAttention lets the encoder learn to focus on glass regions and suppress background noise.
7.6 Why lower the pol_lr_mult
The polarization branch consists of newly added, randomly initialized layers. Too high a learning-rate multiplier causes the new layers to oscillate violently on top of a pre-trained backbone. Setting the multiplier to 5.0 (an effective learning rate of 0.0015 for the polarization layers at lr = 0.0003) avoids new-layer instability.
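A per-branch multiplier of this kind is commonly applied through optimizer parameter groups. The sketch below shows that pattern only; the "pol_encoder." name prefix, the build_param_groups helper, the ToyModel stand-in, and the choice of AdamW are all assumptions rather than details from the project.

```python
import torch
import torch.nn as nn


def build_param_groups(model, lr=3e-4, pol_lr_mult=5.0, pol_prefix="pol_encoder."):
    """Split parameters into backbone and polarization groups by name prefix."""
    base_params, pol_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith(pol_prefix):
            pol_params.append(param)
        else:
            base_params.append(param)
    return [
        {"params": base_params, "lr": lr},
        {"params": pol_params, "lr": lr * pol_lr_mult},
    ]


class ToyModel(nn.Module):
    """Stand-in with one pre-trained-style layer and one new polarization layer."""

    def __init__(self):
        super().__init__()
        self.fnet = nn.Conv2d(3, 8, 3)
        self.pol_encoder = nn.Conv2d(3, 8, 3)


optimizer = torch.optim.AdamW(build_param_groups(ToyModel()))
print([group["lr"] for group in optimizer.param_groups])
```

The first group keeps the base learning rate and the second receives the base rate times pol_lr_mult, so the multiplier can be tuned without touching the backbone's schedule.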
8. Highlights
- Two-stream separation: stereo and polarization streams are computed independently; the polarization stream completely bypasses the stereo backbone’s BatchNorm, ensuring the macroscopic magnitude difference of the polarization signal is not normalized away.
- A polarization branch free of all Normalization: the Pol stream uses no BatchNorm / InstanceNorm / GroupNorm, operating from raw |I∥ - I⊥| and preserving the magnitude difference of the polarization signal.
- Differentiable soft-threshold selection: sigmoid(kappa * (P - tau)) replaces a hard threshold, letting the network adjust sensitivity to polarization intensity and transition sharpness during training.
- Disparity-aligned pure polarization difference: warp_with_disparity aligns the right image into the left view before differencing, removing geometric error from the 65 mm baseline and yielding a pure polarization difference.
- Polarization-aware loss: dual weighting by glass_mask and pol_diff, combined with RAFT-style iterative weighting, guides the model to match more accurately in glass and high-polarization-difference regions.
- Spatial attention focusing on glass: SpatialAttention learns a spatial weight map so polarization features concentrate where the polarization signal is genuinely meaningful, namely in glass regions.