True Dual-Stream (Heavy) Implementation Architecture

1. Design Goals

True Dual-Stream (Heavy) is the full implementation of the True Dual-Stream architecture. It keeps the complete dual-stream design, including the full PolVolumeEncoder (3D CNN), PolFNet, and PolCNet, which makes it the highest-capacity version of the dual-stream series.

This architecture targets the difficulty of matching glass regions in active polarization stereo systems: photometric inconsistency on glass makes the Cost Volume produce garbage signals. The architecture runs a fully trainable polarization path (Pol stream) in parallel with a frozen RGB path and fuses the two Cost Volumes via per-pixel α weighting.

During implementation, two design issues that destroy the polarization signal were discovered; this document records their fixes as well:

Fix 1: Normalization in the Pol stream standardizes away the magnitude differences in the polarization signal.
Fix 2: Using AvgPool in the PolVolumeEncoder to compress the disparity dimension washes out the polarization peak.

2. Module Structure

Figure: True Dual-Stream (Heavy) full architecture

The core model is composed of the following classes and functions:

class PolCostVolume        # Step 1: polarization Cost Volume (B,192,H,W)
class PolVolumeEncoder     # Step 2: 3D CNN compression -> (B,8,H/4,W/4)
class PolFNet              # Step 3b: 11ch -> 32ch (keeps H/4)
class PolCNet              # Step 3b: 40ch -> context(64) + hidden(128) + α(1)
class RGBFNet              # Step 3a: frozen, 256ch
class RGBCNet              # Step 3a: frozen, 64+128
class UpdateBlockHeavy     # Step 5: hidden=256, context=128
class TrueDualStreamStereo # Main model
class TrueDualStreamLoss   # Loss function
def load_raft_stereo_weights()  # Checkpoint loading

3. Key Implementation Details

3.1 PolFNet Dimension Handling

PolFNet must keep its features at H/4 resolution throughout. The input is concatenated directly at H/4 and all layers use stride=1.

def forward(self, pol_feature, left):
    B, _, H4, W4 = pol_feature.shape
    left_down = F.interpolate(left, size=(H4, W4), ...)  # H/4
    x = torch.cat([pol_feature, left_down], dim=1)       # (B, 11, H/4, W/4)
    # All layers use stride=1 to keep H/4
    ...
    return x  # (B, 32, H/4, W/4)

3.2 α Fusion Mechanism

# Pol CNet outputs α (per-pixel)
context_pol, hidden_pol, alpha = self.pol_cnet(pol_feature, fmap_pol_left)
# alpha: (B, 1, H/4, W/4), passed through sigmoid, range [0, 1]

# Cost Volume fusion
cost_fused = alpha * cost_rgb + (1 - alpha) * cost_pol
# α ≈ 1: trust RGB (non-glass)
# α ≈ 0: trust Pol (glass)
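
The blend above can be checked on a few hand-picked pixels. The following is a minimal sketch in plain Python rather than PyTorch; the function name `fuse_costs` and all numeric values are hypothetical and only illustrate the per-pixel convex combination.

```python
def fuse_costs(alpha, cost_rgb, cost_pol):
    """Per-pixel convex blend: alpha * cost_rgb + (1 - alpha) * cost_pol."""
    if not (len(alpha) == len(cost_rgb) == len(cost_pol)):
        raise ValueError("alpha, cost_rgb and cost_pol must have the same length")
    return [a * r + (1.0 - a) * p for a, r, p in zip(alpha, cost_rgb, cost_pol)]


# Hypothetical values for three pixels: non-glass, mixed, glass.
alpha = [1.0, 0.5, 0.0]
cost_rgb = [0.9, 0.4, 0.1]
cost_pol = [0.2, 0.8, 0.7]

fused = fuse_costs(alpha, cost_rgb, cost_pol)
# Pixel 0 (alpha = 1) takes the RGB cost, pixel 2 (alpha = 0) takes the Pol cost,
# and pixel 1 (alpha = 0.5) lands halfway between the two.
```

Because α lies in [0, 1], the fused cost always stays between the two input costs, so neither stream can be amplified by the fusion step itself.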

3.3 RGB Stream Freezing

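def _freeze_rgb_stream(self):
    # Stop gradient updates for the RGB feature and context networks
    for param in self.rgb_fnet.parameters():
        param.requires_grad = False
    for param in self.rgb_cnet.parameters():
        param.requires_grad = False
    # Keep normalization layers in inference mode
    self.rgb_fnet.eval()
    self.rgb_cnet.eval()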

4. Tensor Dimensions

Module	Input	Output
PolCostVolume	(B,3,H,W) × 2	(B,192,H,W)
PolVolumeEncoder	(B,192,H,W)	(B,8,H/4,W/4)
Pol FNet	(B,11,H/4,W/4)	(B,32,H/4,W/4)
Pol CNet	(B,40,H/4,W/4)	64 + 128 + 1
RGB FNet	(B,3,H,W)	(B,256,H/4,W/4)
RGB CNet	(B,3,H,W)	64 + 128
cost_fused	-	(B,36,H/4,W/4)
context_fused	-	(B,128,H/4,W/4)
hidden_fused	-	(B,256,H/4,W/4)
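
The channel counts in the table can be cross-checked with simple arithmetic. The sketch below assumes that inputs are joined by channel concatenation, and in particular that context_fused and hidden_fused are the concatenation of the Pol and RGB outputs; the table does not state this, so treat it as one plausible reading. All names are hypothetical.

```python
# Channel counts taken from the dimension table.
POL_FEATURE_CH = 8     # PolVolumeEncoder output
LEFT_IMAGE_CH = 3      # left image
POL_FNET_OUT_CH = 32   # Pol FNet output
CONTEXT_CH = 64        # context channels per stream (Pol CNet, RGB CNet)
HIDDEN_CH = 128        # hidden channels per stream (Pol CNet, RGB CNet)


def concat_channels(*channels):
    """Channel count after concatenating tensors along the channel axis."""
    return sum(channels)


pol_fnet_in = concat_channels(POL_FEATURE_CH, LEFT_IMAGE_CH)    # Pol FNet input
pol_cnet_in = concat_channels(POL_FEATURE_CH, POL_FNET_OUT_CH)  # Pol CNet input
# Assumption: the fused tensors concatenate the Pol and RGB streams.
context_fused = concat_channels(CONTEXT_CH, CONTEXT_CH)
hidden_fused = concat_channels(HIDDEN_CH, HIDDEN_CH)
```

Under this reading the results match the table (11, 40, 128, 256) and the hidden=256, context=128 sizes listed for UpdateBlockHeavy.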

5. Diagnostic Metric Design

Four core monitoring metrics derived directly from the design intent.

5.1 α Divergence (Most Important)

Metric	Computation	Expected Value	Meaning
`alpha_glass`	Mean α over glass regions	→ 0	Glass should trust Pol
`alpha_non_glass`	Mean α over non-glass regions	→ 1	Non-glass should trust RGB
`alpha_divergence`	`|α_non_glass - α_glass|`	→ 1	Glass and non-glass are clearly separated

If α_divergence approaches 0, the model has not learned to distinguish glass from non-glass.
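
As an illustration, the three α metrics can be computed from a flattened α map and a binary glass mask. This is a sketch in plain Python with hypothetical names and values; an actual implementation would presumably operate on tensors.

```python
def alpha_metrics(alpha, glass_mask):
    """Mean alpha per region and the absolute gap between the two means.

    Returns None when either region is empty, since the gap is undefined then.
    """
    glass = [a for a, is_glass in zip(alpha, glass_mask) if is_glass]
    non_glass = [a for a, is_glass in zip(alpha, glass_mask) if not is_glass]
    if not glass or not non_glass:
        return None
    alpha_glass = sum(glass) / len(glass)
    alpha_non_glass = sum(non_glass) / len(non_glass)
    return {
        "alpha_glass": alpha_glass,
        "alpha_non_glass": alpha_non_glass,
        "alpha_divergence": abs(alpha_non_glass - alpha_glass),
    }


# Hypothetical alpha values for four pixels; the last two lie on glass.
alpha = [0.9, 0.8, 0.1, 0.2]
glass_mask = [False, False, True, True]
metrics = alpha_metrics(alpha, glass_mask)
```

For these values the glass mean is about 0.15, the non-glass mean about 0.85, and the divergence about 0.7. A model that outputs the same α everywhere would score 0.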

5.2 Region-Wise EPE (Safety Gate)

Metric	Expected Behavior	Alarm Condition
`glass_epe`	Steadily decreasing	Rising = Pol stream is not learning well
`non_glass_epe`	Stable	Rising = α is not approaching 1 on non-glass

Key: because the RGB path is frozen, non-glass EPE must not be worse than the original RAFT. If it is, α is not approaching 1 on non-glass regions and Pol costs are leaking in.
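
A minimal sketch of the region-wise EPE and the safety-gate check follows, using flattened disparity lists in plain Python. EPE is taken here as the mean absolute disparity error; the names `region_epe` and `safety_gate_ok`, the baseline value, and the tolerance parameter are hypothetical.

```python
def _epe(pairs):
    """Mean absolute disparity error, or None for an empty region."""
    if not pairs:
        return None
    return sum(abs(pred - gt) for pred, gt in pairs) / len(pairs)


def region_epe(pred, gt, glass_mask):
    """Overall, glass-only and non-glass-only EPE."""
    pairs = list(zip(pred, gt))
    glass = [pair for pair, is_glass in zip(pairs, glass_mask) if is_glass]
    non_glass = [pair for pair, is_glass in zip(pairs, glass_mask) if not is_glass]
    return {
        "epe": _epe(pairs),
        "glass_epe": _epe(glass),
        "non_glass_epe": _epe(non_glass),
    }


def safety_gate_ok(non_glass_epe, baseline_epe, tolerance=0.0):
    """True when non-glass EPE is no worse than the frozen-RGB baseline."""
    return non_glass_epe <= baseline_epe + tolerance


# Hypothetical disparities for four pixels; the last two lie on glass.
pred = [10.0, 20.5, 31.0, 38.0]
gt = [10.0, 20.0, 30.0, 40.0]
glass_mask = [False, False, True, True]
metrics = region_epe(pred, gt, glass_mask)
gate = safety_gate_ok(metrics["non_glass_epe"], baseline_epe=0.3)
```

The baseline would come from evaluating the original RGB-only model on the same non-glass pixels; 0.3 here is only a placeholder.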

5.3 Relative Convergence

relative_convergence = final_delta / first_delta

Value	Interpretation
< 0.1	Healthy, GRU converging normally
0.1 ~ 0.3	Normal
> 0.5	Possible problem, GRU not converged
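
The ratio and its interpretation bands can be sketched as below. The document does not define first_delta and final_delta; this sketch assumes they are the mean absolute disparity updates of the first and last GRU iterations. The band between 0.3 and 0.5 is not covered by the table, so the sketch reports it as unspecified instead of guessing.

```python
def relative_convergence(deltas):
    """final_delta / first_delta for a list of per-iteration update magnitudes."""
    if not deltas:
        raise ValueError("need at least one GRU iteration")
    first_delta, final_delta = deltas[0], deltas[-1]
    if first_delta <= 0:
        raise ValueError("first_delta must be positive")
    return final_delta / first_delta


def interpret(ratio):
    """Map a ratio to the interpretation bands of the table above."""
    if ratio < 0.1:
        return "healthy"
    if ratio <= 0.3:
        return "normal"
    if ratio > 0.5:
        return "possible problem"
    return "unspecified"  # 0.3 to 0.5 is not covered by the table


# Hypothetical update magnitudes over four GRU iterations.
deltas = [2.0, 0.8, 0.3, 0.1]
ratio = relative_convergence(deltas)
status = interpret(ratio)
```

Here the updates shrink from 2.0 to 0.1, giving a ratio of about 0.05, which falls in the healthy band.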

5.4 Validation Output Example (Metric Field Format)

[Val @ Step N]
  EPE: ...
  Glass EPE: ...
  Non-glass EPE: ... (safety gate)
  D1: ...
  --- Diagnostics ---
  α glass: ... (target: →0)
  α non-glass: ... (target: →1)
  α divergence: ... (target: →1)
  Relative convergence: ... (target: <0.1)

6. Fix 1: BatchNorm Removal — Polarization Signal Washed Out by Normalization

6.1 Nature of the Problem

If the Pol stream uses BatchNorm, the magnitude differences in the polarization signal are washed out.

Input:  I∥ = 0.8,  I⊥ = 0.2  (4x difference!)
           ↓
       Conv Layer
           ↓
       BatchNorm  <- normalizes mean/variance
           ↓
Output: feature_∥ ≈ feature_⊥  (difference washed out)

BatchNorm computes the per-channel mean and variance within a batch and normalizes the output to zero mean and unit variance, so the macroscopic brightness difference is already eliminated by the very first BN.

The polarization signal computed by PolCostVolume is exactly |I∥ - I⊥|: large on glass (|0.8 - 0.2| = 0.6) and tiny on non-glass (|0.5 - 0.48| = 0.02). BN pulls both to the same distribution, so the magnitude difference disappears and α cannot distinguish them.
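
The effect can be reproduced with a few numbers. The sketch below normalizes a strong and a weak signal, each by its own mean and standard deviation, as happens per sample in InstanceNorm or in BatchNorm when the statistics come from that signal alone. The learned affine parameters are omitted, a tiny eps stands in for the usual stabilizer, and the sample values are hypothetical responses around the 0.6 and 0.02 levels mentioned above.

```python
import statistics


def normalize(values, eps=1e-12):
    """Subtract the mean and divide by the standard deviation of `values`."""
    mean = statistics.fmean(values)
    std = (statistics.pvariance(values) + eps) ** 0.5
    return [(v - mean) / std for v in values]


# Hypothetical |I_par - I_perp| responses.
glass = [0.55, 0.60, 0.65]         # strong polarization signal, around 0.6
non_glass = [0.015, 0.020, 0.025]  # weak polarization signal, around 0.02

glass_norm = normalize(glass)
non_glass_norm = normalize(non_glass)

# Before normalization the means differ by a factor of about 30.
ratio_before = statistics.fmean(glass) / statistics.fmean(non_glass)
# After normalization the two groups are numerically indistinguishable.
max_gap_after = max(abs(g - n) for g, n in zip(glass_norm, non_glass_norm))
```

Both normalized lists come out as roughly [-1.22, 0, 1.22], so any layer downstream of the normalization no longer sees which signal was the strong one.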

6.2 Normalization in the Pol Stream (Before Fix)

Module	Normalization	Problem
PolVolumeEncoder	`BatchNorm3d` ×3	Washes out the magnitude differences in the pol volume
PolFNet	`InstanceNorm2d`	Normalizes each sample per channel, removing the magnitude as well
PolCNet	`BatchNorm2d` ×5	α’s input is already normalized

6.3 Fix: Remove All Normalization in the Pol Stream

# PolVolumeEncoder - remove BatchNorm3d
self.conv3d = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=(7, 3, 3), stride=(4, 2, 2), padding=(3, 1, 1)),
    # nn.BatchNorm3d(8),  # removed
    nn.ReLU(inplace=True),
    ...
)

# PolFNet - switch to norm_fn='none'
def __init__(self, input_dim=11, output_dim=32, norm_fn='none'):  # change default
    ...
    self.norm1 = nn.Identity()  # no normalization

# PolCNet - remove BatchNorm2d
self.encoder = nn.Sequential(
    nn.Conv2d(input_dim, 64, kernel_size=3, padding=1),
    # nn.BatchNorm2d(64),  # removed
    nn.ReLU(inplace=True),
    ResidualBlock(64, 64, 'none'),  # switch to 'none'
    ...
)

Changes made:

PolVolumeEncoder: remove 3 BatchNorm3d
PolFNet: change norm_fn default to 'none'
PolCNet: remove BatchNorm2d, ResidualBlock switched to 'none'
ResidualBlock.downsample: support norm_fn='none'

6.4 Design Rule

The Pol stream must not use any form of normalization: BatchNorm/InstanceNorm/GroupNorm all normalize the input distribution and wash out macroscopic brightness differences. The value of the polarization signal lies exactly in this “magnitude difference”. The RGB stream can use BN (because it relies on micro-structure); the Pol stream absolutely cannot.

7. Fix 2: AvgPool → MaxPool

7.1 Problem

PolVolumeEncoder uses AdaptiveAvgPool3d to compress the disparity dimension. The polarization signal forms a “peak” at the correct disparity; averaging along the disparity axis mixes this single peak with the large flat non-peak region, so the pooled value is barely higher than that of a profile with no peak at all.

7.2 Fix

Change the disparity-dimension compression from AvgPool to MaxPool, so that the “peak at the correct disparity” is preserved instead of averaged away. MaxPool takes the maximum along the disparity axis, which corresponds exactly to the physical meaning of “glass produces a peak at the correct d” in the PolCostVolume design.
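
A small numeric sketch makes the difference concrete. It pools a single pixel's disparity profile directly, which simplifies the real encoder (the 3D convolutions that precede the pooling are skipped). The depth of 192 comes from the dimension table, the 0.6 and 0.02 levels from the Fix 1 example, and the peak position is hypothetical.

```python
NUM_DISPARITIES = 192  # disparity depth of the polarization Cost Volume
PEAK_INDEX = 57        # hypothetical position of the correct disparity

# Glass pixel: weak background response with one strong peak.
glass_profile = [0.02] * NUM_DISPARITIES
glass_profile[PEAK_INDEX] = 0.6
# Non-glass pixel: weak response everywhere, no peak.
flat_profile = [0.02] * NUM_DISPARITIES


def avg_pool(profile):
    """Collapse the disparity axis by averaging."""
    return sum(profile) / len(profile)


def max_pool(profile):
    """Collapse the disparity axis by taking the maximum."""
    return max(profile)


# Gap between the glass and non-glass pixel after pooling.
avg_gap = avg_pool(glass_profile) - avg_pool(flat_profile)  # about 0.003
max_gap = max_pool(glass_profile) - max_pool(flat_profile)  # about 0.58
```

With averaging, the peak contributes only 1/192 of its height, so the two pixels differ by about 0.003. With max pooling the full gap of about 0.58 survives.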

8. Highlights

Complete three-module polarization path: keeps the full polarization-learning pipeline of PolVolumeEncoder (3D CNN) + PolFNet + PolCNet, the highest-capacity version of the dual-stream series.
Pol stream fully de-normalized: explicitly defines that the value of the polarization signal lies in “magnitude difference”; the Pol stream uses no BatchNorm/InstanceNorm/GroupNorm to avoid normalization washing out the |I∥ - I⊥| difference.
AvgPool → MaxPool preserves peaks: disparity-dimension compression switches to MaxPool, matching the physical meaning of “glass has a peak at the correct d” and avoiding peak dilution by averaging.
Four diagnostic metrics as gatekeepers: α divergence, region-wise EPE (glass and non-glass, the latter serving as the safety gate), and relative convergence are derived directly from the design intent, enabling real-time checks on whether the model has learned to distinguish glass, whether the GRU converges normally, and whether non-glass regions degrade.