Dual-Stream Polarization Stereo Matching Architecture

1. Design Goals

The problem this architecture solves: the polarization image pair I∥ / I⊥ carries a usable brightness-difference signal in glass regions, but the value of the polarization signal lies in the macroscopic brightness-magnitude difference between the two, and the first BatchNorm layer of the stereo-matching backbone fnet normalizes the input distribution and washes this macroscopic difference away. Feeding the polarization pair directly into fnet erases the macroscopic magnitude difference at the very first layer.

An explicit polarization encoder is therefore needed:

The polarization branch must bypass the BN of fnet and operate from the raw |I∥ - I⊥| polarization difference.
Polarization features and stereo features are computed separately and then fused, rather than mixed inside a single encoder.

This is the core idea of the “Dual-Stream” architecture: one RGB / stereo stream, one polarization stream. The stereo stream handles ordinary feature matching, the polarization stream preserves and extracts the polarization signal in glass regions, and the two merge at the fusion stage.

2. Architecture

2.1 Overall Architecture

Dual-Stream overall architecture

2.2 PolarizationEncoder Internals

PolarizationEncoder internal structure

3. Components and Modules

3.1 PolarizationEncoder

The polarization encoder converts the polarization difference pol_diff into a polarization feature map. Internal flow:

pol_diff → Soft Threshold → Stem Conv → ResidualBlock → SpatialAttention → pol_features

Component responsibilities:

Soft Threshold: a soft, sigmoid-shaped selection over polarization intensity (see §3.2).
Stem Conv: an initial convolution that projects the soft-thresholded signal into feature space.
ResidualBlock2D: a residual connection that improves gradient flow.
SpatialAttention: spatial attention that learns to focus on glass regions.
Output dimension: 64.

Important design principle: the Pol stream must not use any form of Normalization (BatchNorm / InstanceNorm / GroupNorm). All forms of Normalization standardize the input distribution and wash out the macroscopic brightness difference. The value of the polarization signal is precisely this “magnitude difference.” The RGB stream can use BN (it relies on micro-structure), but the Pol stream absolutely cannot.

3.2 Soft Threshold

A soft threshold over polarization intensity P, with weight formula:

w = sigmoid(kappa * (P - tau))

tau: threshold (pol_threshold); polarization intensity below this value is suppressed.
kappa: sharpness, controlling the steepness of the sigmoid transition.
The soft threshold (as opposed to a hard threshold) preserves differentiability, allowing the network to adjust sensitivity to polarization intensity during training.

3.3 SpatialAttention

Learns a spatial weight map that “focuses on glass regions,” so polarization features concentrate where polarization differences are meaningful.

3.4 ResidualBlock2D

A 2D residual block; its skip connection improves gradient flow, allowing a deeper polarization encoder to train stably.

3.5 Disparity Alignment (warp_with_disparity)

Problem

The raw pol_diff = |left(x,y) - right(x,y)| compares the left and right images at the same pixel coordinates, but the two cameras have a 65 mm baseline with disparity range 55–94 px, so this actually compares different 3D points. This contaminates pol_diff with disparity errors instead of being a pure polarization difference.

Solution

Use warp_with_disparity() to warp the right image into the left view before differencing:

def warp_with_disparity(img, disparity):
    """Warp the right image into the left view using disparity."""
    # right(x - disparity, y) corresponds to left(x, y)

# PolarizationEncoder.compute_pol_diff()
if disparity is not None:
    right_aligned = warp_with_disparity(right, disparity)
else:
    right_aligned = right  # Fall back to the raw form without disparity

pol_diff = |left - right_aligned|  # After alignment, a pure polarization difference

Behavior:

With GT disparity: use the GT disparity for alignment to compute a pure polarization difference.
Without GT disparity: fall back to the raw, unaligned form.

3.6 Fusion

Fusion uses a simple concat + conv:

fused = torch.cat([stereo_feat, pol_feat], dim=1)  # (B, 256, H, W)
fused = self.fusion_conv(fused)                     # (B, 128, H, W)

The stereo and polarization features are concatenated along the channel dimension and a convolution fuses them back to the original channel count.

3.7 Polarization-aware Loss

The polarization-aware loss function:

loss = Sum gamma^(N-i) * [glass_mask * glass_weight + (1-glass_mask)]
                       * [pol_weight * pol_diff + (1-pol_diff)]
                       * |D_pred - D_gt|

gamma^(N-i): RAFT-style iterative weighting; later GRU iterations carry higher weight.
[glass_mask * glass_weight + (1-glass_mask)]: glass regions are weighted by glass_weight; background weight is 1.
[pol_weight * pol_diff + (1-pol_diff)]: regions with large polarization differences receive extra weight (pol_weight=2.0).
|D_pred - D_gt|: L1 error between predicted and GT disparity.
Rationale: large polarization difference = glass region → matching should be more accurate.

4. Data Flow

The left and right images pass through the shared-weight FeatureEncoder to produce fmap1 and fmap2.
Compute the polarization difference: with GT disparity, right_aligned = warp_with_disparity(right, GT_disparity), otherwise right_aligned = right; then pol_diff = |left - right_aligned|.
pol_diff passes through the PolarizationEncoder (Soft Threshold → Stem Conv → ResidualBlock → SpatialAttention) to produce polarization features (64-dim).
Fusion concatenates the stereo and polarization features and fuses them with a convolution.
The fused features feed into the correlation pyramid + GRU for iterative refinement → disparity.

5. Tensor Dimensions

Tensor	Dimensions	Description
`stereo_feat`	(B, 128, H, W)	Stereo-stream features
`pol_feat` / `pol_features`	(B, 64, H, W)	Polarization-stream features (`pol_dim`, output dim 64)
`torch.cat([stereo_feat, pol_feat])`	(B, 256, H, W)	After concatenation
`fusion_conv` output	(B, 128, H, W)	Fused back to 128 channels

Note: pol_dim can be 64 or 128; this document uses 64-dim by default, with the hyperparameter table listed separately.

6. Hyperparameters

6.1 PolarizationEncoder Hyperparameters

Parameter	Value	Description
`pol_dim`	64	Polarization feature dimension (more capacity for complex features)
`pol_threshold`	0.05	Soft Threshold `tau` (sensitive to weak polarization)
`glass_weight`	5.0	Loss weight on glass regions
`pol_lr_mult`	5.0	Learning-rate multiplier for polarization layers (avoids new-layer instability)
`pol_weight`	2.0	Polarization weight in Polarization-aware Loss

6.2 Training Hyperparameters

Parameter	Value	Description
`dual_stream`	enabled	Enable the Dual-Stream architecture
`pretrained`	`raftstereo-sceneflow.pth`	SceneFlow pre-trained weights
`pol_dim`	64 / 128	Polarization feature dimension
`pol_threshold`	0.05	Soft Threshold `tau`
`pol_sharpness`	20.0	Soft Threshold sharpness `kappa`
`pol_lr_mult`	5.0	Learning-rate multiplier for polarization layers
`pol_weight`	2.0	Polarization weight in Polarization-aware Loss
`glass_weight`	5.0	Loss weight on glass regions
`strict_glass_weight`	0.5	Extra weight on the strict glass core (intersection mask of left/right views)
`batch_size`	8	Training batch size
`num_steps`	60000	Training steps
`lr`	0.0003	Learning rate
`iters`	24	GRU iterations

7. Design Decisions and Rationale

7.1 Why a “dual-stream” rather than a single encoder

Isolating polarization processing into its own stream lets it bypass the BN of the RGB encoder entirely and operate from raw |I∥ - I⊥|, preserving the macroscopic magnitude difference. If polarization features were mixed into a single encoder with stereo features, the macroscopic polarization difference would be normalized away at the first BN layer.

7.2 Why all Normalization is forbidden in the Pol stream

BatchNorm / InstanceNorm / GroupNorm all standardize the input distribution and wash out the macroscopic brightness difference. The value of the polarization signal is precisely this “magnitude difference.” The RGB stream depends on micro-structure and can use BN; the Pol stream absolutely cannot.

7.3 Why Soft Threshold rather than a hard threshold

A hard threshold is non-differentiable. Soft Threshold uses the form sigmoid(kappa(P-tau)) to preserve differentiability, allowing the network to adjust sensitivity to polarization intensity and the transition sharpness during training.

7.4 Why disparity alignment (warp) is needed

The 65 mm baseline means the same pixel coordinates in the left and right views correspond to different 3D points. Without alignment, pol_diff is contaminated with disparity errors. Warping with GT disparity makes pol_diff a “pure polarization difference.”

7.5 Why SpatialAttention

The polarization signal is meaningful only in glass regions. SpatialAttention lets the encoder learn to focus on glass regions and suppress background noise.

7.6 Why lower the pol_lr_mult

The polarization branch consists of newly added, randomly initialized layers. Too high a learning-rate multiplier causes the new layers to oscillate violently on top of a pre-trained backbone. Setting the multiplier to 5.0 avoids new-layer instability.

8. Highlights

Two-stream separation: stereo and polarization streams are computed independently; the polarization stream completely bypasses the stereo backbone’s BatchNorm, ensuring the macroscopic magnitude difference of the polarization signal is not normalized away.
A polarization branch free of all Normalization: the Pol stream uses no BatchNorm / InstanceNorm / GroupNorm, operating from raw |I∥ - I⊥| and preserving the magnitude difference of the polarization signal.
Differentiable soft-threshold selection: sigmoid(kappa(P-tau)) replaces a hard threshold, letting the network adjust sensitivity to polarization intensity and transition sharpness during training.
Disparity-aligned pure polarization difference: warp_with_disparity aligns the right image into the left view before differencing, removing geometric error from the 65 mm baseline and yielding a pure polarization difference.
Polarization-aware loss: dual weighting by glass_mask and pol_diff, combined with RAFT-style iterative weighting, guides the model to match more accurately in glass and high-polarization-difference regions.
Spatial attention focusing on glass: SpatialAttention learns a spatial weight map so polarization features concentrate where the polarization signal is genuinely meaningful—in glass regions.