1. Design Goals
True Dual-Stream (Heavy) is the full implementation of the True Dual-Stream architecture. It keeps the complete dual-stream design, with the full PolVolumeEncoder (3D CNN), PolFNet, and PolCNet, making it the highest-capacity version of the dual-stream series.
This architecture targets the matching difficulty over glass regions in active polarization stereo systems: photometric inconsistency on glass makes the Cost Volume produce unreliable matching signals. The architecture runs a fully trainable polarization path (Pol stream) in parallel with a frozen RGB path, and fuses the two Cost Volumes via per-pixel α weighting.
During implementation, two design issues that destroy the polarization signal were discovered; this document records their fixes as well:
- Fix 1: Normalization in the Pol stream standardizes away the magnitude differences in the polarization signal.
- Fix 2: Using AvgPool in the PolVolumeEncoder to compress the disparity dimension washes out the polarization peak.
2. Module Structure
The core model is composed of the following classes and functions:
class PolCostVolume # Step 1: polarization Cost Volume (B,192,H,W)
class PolVolumeEncoder # Step 2: 3D CNN compression -> (B,8,H/4,W/4)
class PolFNet # Step 3b: 11ch -> 32ch (keeps H/4)
class PolCNet # Step 3b: 40ch -> context(64) + hidden(128) + α(1)
class RGBFNet # Step 3a: frozen, 256ch
class RGBCNet # Step 3a: frozen, 64+128
class UpdateBlockHeavy # Step 5: hidden=256, context=128
class TrueDualStreamStereo # Main model
class TrueDualStreamLoss # Loss function
def load_raft_stereo_weights() # Checkpoint loading
3. Key Implementation Details
3.1 PolFNet Dimension Handling
PolFNet must keep its features at H/4 resolution throughout. The input is concatenated directly at H/4 and all layers use stride=1.
def forward(self, pol_feature, left):
    B, _, H4, W4 = pol_feature.shape
    left_down = F.interpolate(left, size=(H4, W4), ...)  # H/4
    x = torch.cat([pol_feature, left_down], dim=1)  # (B, 11, H/4, W/4)
    # All layers use stride=1 to keep H/4
    ...
    return x  # (B, 32, H/4, W/4)
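As a shape check for the H/4 handling above, here is a minimal sketch. The class name `TinyPolFNet`, its two-layer body, and the bilinear interpolation mode are illustrative assumptions, not the actual PolFNet layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyPolFNet(nn.Module):
    """Hypothetical stand-in for PolFNet: 11 input channels -> 32, stride 1 throughout."""

    def __init__(self, input_dim=11, output_dim=32):
        super().__init__()
        # 3x3 kernels with stride=1 and padding=1 leave the spatial size unchanged.
        self.body = nn.Sequential(
            nn.Conv2d(input_dim, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, output_dim, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, pol_feature, left):
        _, _, h4, w4 = pol_feature.shape
        # Resize the full-resolution left image to the pol feature grid (H/4, W/4).
        # The interpolation mode is an assumption made for this sketch.
        left_down = F.interpolate(left, size=(h4, w4), mode="bilinear", align_corners=False)
        x = torch.cat([pol_feature, left_down], dim=1)  # (B, 8 + 3, H/4, W/4)
        return self.body(x)


pol_feature = torch.randn(2, 8, 16, 24)  # encoder output with H/4 = 16, W/4 = 24
left = torch.randn(2, 3, 64, 96)         # full-resolution left image
out = TinyPolFNet()(pol_feature, left)
print(tuple(out.shape))
```

Because no layer strides or pools, the output grid matches the pol feature grid exactly, which is what lets it be concatenated with other H/4 tensors downstream.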
3.2 α Fusion Mechanism
# Pol CNet outputs α (per-pixel)
context_pol, hidden_pol, alpha = self.pol_cnet(pol_feature, fmap_pol_left)
# alpha: (B, 1, H/4, W/4), passed through sigmoid, range [0, 1]
# Cost Volume fusion
cost_fused = alpha * cost_rgb + (1 - alpha) * cost_pol
# α ≈ 1: trust RGB (non-glass)
# α ≈ 0: trust Pol (glass)
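The fusion relies on broadcasting the single-channel α over the 36 Cost Volume channels. The following sketch illustrates this with random stand-in tensors; the logits and the left/right split are made up for the example and do not come from the model.

```python
import torch

B, D, h, w = 2, 36, 8, 10
cost_rgb = torch.randn(B, D, h, w)
cost_pol = torch.randn(B, D, h, w)

# Hypothetical alpha logits: left half strongly positive, right half strongly negative.
alpha_logits = torch.zeros(B, 1, h, w)
alpha_logits[..., : w // 2] = 20.0   # sigmoid -> ~1, trust RGB
alpha_logits[..., w // 2 :] = -20.0  # sigmoid -> ~0, trust Pol
alpha = torch.sigmoid(alpha_logits)  # (B, 1, h, w), values in [0, 1]

# (B, 1, h, w) broadcasts against (B, 36, h, w) along the channel axis.
cost_fused = alpha * cost_rgb + (1 - alpha) * cost_pol
print(tuple(cost_fused.shape))
```

Where α saturates at 1 the fused volume reproduces `cost_rgb`, and where it saturates at 0 it reproduces `cost_pol`, so each pixel selects or blends the two streams independently.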
3.3 RGB Stream Freezing
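def _freeze_rgb_stream(self):
    # Use self (not a global `model`) so the method works on any instance
    for param in self.rgb_fnet.parameters():
        param.requires_grad = False
    for param in self.rgb_cnet.parameters():
        param.requires_grad = False
    # eval() makes BatchNorm layers in the frozen stream use their running statistics
    self.rgb_fnet.eval()
    self.rgb_cnet.eval()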
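One caveat with this pattern: `nn.Module.train()` recurses into all children, so a later `model.train()` call puts the frozen RGB modules back into training mode and their BatchNorm running statistics start updating again. A minimal sketch of one way to guard against this is shown below; `FrozenRGBWrapper` and its toy submodules are hypothetical stand-ins, not the actual TrueDualStreamStereo.

```python
import torch
import torch.nn as nn


class FrozenRGBWrapper(nn.Module):
    """Hypothetical container with one frozen and one trainable stream."""

    def __init__(self):
        super().__init__()
        # Toy stand-ins for the real RGB and Pol feature networks.
        self.rgb_fnet = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8))
        self.pol_fnet = nn.Conv2d(11, 32, 3, padding=1)
        self._freeze_rgb_stream()

    def _freeze_rgb_stream(self):
        for param in self.rgb_fnet.parameters():
            param.requires_grad = False
        self.rgb_fnet.eval()

    def train(self, mode=True):
        # super().train() switches every child to training mode,
        # so the frozen stream is put back into eval mode afterwards.
        super().train(mode)
        self.rgb_fnet.eval()
        return self


model = FrozenRGBWrapper()
model.train()
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(model.rgb_fnet.training, model.pol_fnet.training, trainable)
```

The `trainable` list is also a convenient filter for building the optimizer, so that only Pol-stream parameters are passed to it.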
4. Tensor Dimensions
| Module | Input | Output |
|---|---|---|
| PolCostVolume | (B,3,H,W) × 2 | (B,192,H,W) |
| PolVolumeEncoder | (B,192,H,W) | (B,8,H/4,W/4) |
| Pol FNet | (B,11,H/4,W/4) | (B,32,H/4,W/4) |
| Pol CNet | (B,40,H/4,W/4) | 64 + 128 + 1 |
| RGB FNet | (B,3,H,W) | (B,256,H/4,W/4) |
| RGB CNet | (B,3,H,W) | 64 + 128 |
| cost_fused | - | (B,36,H/4,W/4) |
| context_fused | - | (B,128,H/4,W/4) |
| hidden_fused | - | (B,256,H/4,W/4) |
5. Diagnostic Metric Design
Four core monitoring metrics are derived directly from the design intent.
5.1 α Divergence (Most Important)
| Metric | Computation | Expected Value | Meaning |
|---|---|---|---|
| alpha_glass | Mean α over glass regions | → 0 | Glass should trust Pol |
| alpha_non_glass | Mean α over non-glass regions | → 1 | Non-glass should trust RGB |
| alpha_divergence | abs(alpha_non_glass - alpha_glass) | → 1 | Separation between glass and non-glass |
If α_divergence approaches 0, the model has not learned to distinguish glass from non-glass.
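The three α statistics can be computed from the α map and a glass mask. The sketch below assumes a boolean `glass_mask` already resized to the α resolution and assumes both regions are non-empty; the function name and the toy values are illustrative only.

```python
import torch


def alpha_divergence(alpha, glass_mask):
    """alpha: (B, 1, h, w) in [0, 1]; glass_mask: bool tensor of the same shape."""
    alpha_glass = alpha[glass_mask].mean()
    alpha_non_glass = alpha[~glass_mask].mean()
    return {
        "alpha_glass": alpha_glass.item(),
        "alpha_non_glass": alpha_non_glass.item(),
        "alpha_divergence": (alpha_non_glass - alpha_glass).abs().item(),
    }


# Toy example: top half is glass with alpha = 0.1, bottom half non-glass with alpha = 0.9.
alpha = torch.full((1, 1, 4, 4), 0.9)
glass_mask = torch.zeros(1, 1, 4, 4, dtype=torch.bool)
glass_mask[..., :2, :] = True
alpha[glass_mask] = 0.1

stats = alpha_divergence(alpha, glass_mask)
print(stats)
```

If a validation batch contains no glass pixels, the masked mean is NaN, so a real implementation would need to skip or accumulate over such batches.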
5.2 Region-Wise EPE (Safety Gate)
| Metric | Expected Behavior | Alarm Condition |
|---|---|---|
| glass_epe | Steadily decreasing | Rising = Pol stream is not learning well |
| non_glass_epe | Stable | Rising = α is not approaching 1 on non-glass |
Key: Non-glass EPE must not be worse than the original RAFT, because the RGB path is frozen.
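A region-wise EPE can be sketched as a masked mean of the absolute disparity error. The function name, the `valid_mask` argument, and the toy numbers below are assumptions for illustration; the actual validation code may differ.

```python
import torch


def region_epe(disp_pred, disp_gt, glass_mask, valid_mask):
    """All inputs share shape (B, H, W); both masks are boolean."""
    err = (disp_pred - disp_gt).abs()  # per-pixel end-point error for disparity
    glass = glass_mask & valid_mask
    non_glass = ~glass_mask & valid_mask
    return {
        "epe": err[valid_mask].mean().item(),
        "glass_epe": err[glass].mean().item(),
        "non_glass_epe": err[non_glass].mean().item(),
    }


# Toy example: error of 2.0 px on glass (left half), 0.5 px elsewhere.
disp_gt = torch.zeros(1, 4, 4)
glass_mask = torch.zeros(1, 4, 4, dtype=torch.bool)
glass_mask[..., :2] = True
disp_pred = torch.where(glass_mask, torch.tensor(2.0), torch.tensor(0.5))
valid_mask = torch.ones(1, 4, 4, dtype=torch.bool)

metrics = region_epe(disp_pred, disp_gt, glass_mask, valid_mask)
print(metrics)
```

The safety gate then reduces to comparing `metrics["non_glass_epe"]` against the same quantity measured on the frozen baseline.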
5.3 Relative Convergence
relative_convergence = final_delta / first_delta
| Value | Interpretation |
|---|---|
| < 0.1 | Healthy, GRU converging normally |
| 0.1 ~ 0.3 | Normal |
| > 0.5 | Possible problem, GRU not converged |
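The document does not define `first_delta` and `final_delta`. The sketch below assumes they are the mean absolute disparity change produced by the first and the last GRU iteration, computed from the list of per-iteration predictions; the function name and the `eps` guard are also assumptions.

```python
import torch


def relative_convergence(disp_predictions, eps=1e-8):
    """disp_predictions: list of per-iteration disparity tensors (at least 3 entries)."""
    # Mean absolute change between consecutive iterations.
    deltas = [
        (curr - prev).abs().mean().item()
        for prev, curr in zip(disp_predictions[:-1], disp_predictions[1:])
    ]
    first_delta, final_delta = deltas[0], deltas[-1]
    return final_delta / (first_delta + eps)


# Toy sequence that halves its distance to the target (10 px) at every iteration.
preds = [torch.full((1, 4, 4), 10.0 * (1.0 - 0.5 ** k)) for k in range(7)]
rc = relative_convergence(preds)
print(rc)
```

In this toy sequence the updates shrink from 5 px to 0.15625 px, giving a ratio of about 0.031, which falls in the "healthy" row of the table.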
5.4 Validation Output Example (Metric Field Format)
[Val @ Step N]
EPE: ...
Glass EPE: ...
Non-glass EPE: ... (safety gate)
D1: ...
--- Diagnostics ---
α glass: ... (target: →0)
α non-glass: ... (target: →1)
α divergence: ... (target: →1)
Relative convergence: ... (target: <0.1)
6. Fix 1: BatchNorm Removal — Polarization Signal Washed Out by Normalization
6.1 Nature of the Problem
If the Pol stream uses BatchNorm, the magnitude differences in the polarization signal are washed out.
Input: I∥ = 0.8, I⊥ = 0.2 (4x difference!)
↓
Conv Layer
↓
BatchNorm <- normalizes mean/variance
↓
Output: feature_∥ ≈ feature_⊥ (difference washed out)
BatchNorm computes each channel's mean and variance within a batch and normalizes that channel's output to mean=0, variance=1, so the absolute scale of the activations is discarded; the macroscopic brightness difference is eliminated by the very first BN.
The polarization signal computed by PolCostVolume is exactly |I∥ - I⊥|: large on glass (|0.8 - 0.2| = 0.6) and tiny on non-glass (|0.5 - 0.48| = 0.02). BN pulls both to the same distribution, so the magnitude difference disappears and α cannot distinguish them.
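The effect can be reproduced in a few lines. This sketch uses InstanceNorm2d (one of the layers listed in 6.2) because its per-sample statistics make the loss of scale visible on a single tensor; the random spatial pattern is an assumption, and only the amplitudes 0.6 and 0.02 come from the example above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pattern = torch.randn(1, 1, 32, 32)  # shared spatial pattern (illustrative)
glass = 0.6 * pattern                # glass-like amplitude
non_glass = 0.02 * pattern           # non-glass-like amplitude

norm = nn.InstanceNorm2d(1)  # default affine=False: pure normalization

# Ratio of mean magnitudes before and after normalization.
ratio_before = (glass.abs().mean() / non_glass.abs().mean()).item()
ratio_after = (norm(glass).abs().mean() / norm(non_glass).abs().mean()).item()
print(ratio_before, ratio_after)
```

Before normalization the two signals differ by a factor of 30; afterwards the ratio is close to 1, so a downstream layer can no longer tell the two amplitudes apart.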
6.2 Normalization in the Pol Stream (Before Fix)
| Module | Normalization | Problem |
|---|---|---|
| PolVolumeEncoder | BatchNorm3d ×3 | Washes out the magnitude differences in the pol volume |
| PolFNet | InstanceNorm2d | Also normalizes |
| PolCNet | BatchNorm2d ×5 | α’s input is already normalized |
6.3 Fix: Remove All Normalization in the Pol Stream
# PolVolumeEncoder - remove BatchNorm3d
self.conv3d = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=(7, 3, 3), stride=(4, 2, 2), padding=(3, 1, 1)),
    # nn.BatchNorm3d(8),  # removed
    nn.ReLU(inplace=True),
    ...
)

# PolFNet - switch to norm_fn='none'
def __init__(self, input_dim=11, output_dim=32, norm_fn='none'):  # change default
    ...
    self.norm1 = nn.Identity()  # no normalization

# PolCNet - remove BatchNorm2d
self.encoder = nn.Sequential(
    nn.Conv2d(input_dim, 64, kernel_size=3, padding=1),
    # nn.BatchNorm2d(64),  # removed
    nn.ReLU(inplace=True),
    ResidualBlock(64, 64, 'none'),  # switch to 'none'
    ...
)
Changes made:
- PolVolumeEncoder: remove the 3 BatchNorm3d layers
- PolFNet: change the norm_fn default to 'none'
- PolCNet: remove BatchNorm2d, ResidualBlock switched to 'none'
- ResidualBlock.downsample: support norm_fn='none'
6.4 Design Rule
The Pol stream must not use any form of normalization: BatchNorm/InstanceNorm/GroupNorm all normalize the input distribution and wash out macroscopic brightness differences. The value of the polarization signal lies exactly in this “magnitude difference”. The RGB stream can use BN (because it relies on micro-structure); the Pol stream absolutely cannot.
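This rule is easy to break by accident when reusing blocks from the RGB stream, so it can help to check it mechanically. The helper below is a hypothetical sketch that scans a module tree for common PyTorch normalization layers; the two `nn.Sequential` models are toy stand-ins for Pol-stream modules.

```python
import torch.nn as nn

# Normalization layer types that the Pol stream should not contain.
NORM_TYPES = (
    nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d,
    nn.InstanceNorm1d, nn.InstanceNorm2d, nn.InstanceNorm3d,
    nn.GroupNorm, nn.LayerNorm,
)


def find_norm_layers(module):
    """Return the names of all normalization submodules inside `module`."""
    return [name for name, m in module.named_modules() if isinstance(m, NORM_TYPES)]


pol_ok = nn.Sequential(nn.Conv2d(11, 32, 3, padding=1), nn.ReLU(), nn.Identity())
pol_bad = nn.Sequential(nn.Conv2d(11, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU())

print(find_norm_layers(pol_ok))   # no normalization layers
print(find_norm_layers(pol_bad))  # reports the BatchNorm2d entry
```

Asserting that the returned list is empty for each Pol-stream module at model construction time would turn the design rule into a hard check.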
7. Fix 2: AvgPool → MaxPool
7.1 Problem
PolVolumeEncoder uses AdaptiveAvgPool3d to compress the disparity dimension. The polarization signal forms a “peak” at the correct disparity; averaging along the disparity axis mixes this peak with the large flat non-peak region and dilutes it.
7.2 Fix
Change the disparity-dimension compression from AvgPool to MaxPool, so that the “peak at the correct disparity” is preserved instead of averaged away. MaxPool takes the maximum along the disparity axis, which corresponds exactly to the physical meaning of “glass produces a peak at the correct d” in the PolCostVolume design.
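The difference between the two pooling modes can be seen on a toy volume with a single peak. The shapes, the number of disparity bins (48), and the peak position are arbitrary choices for this sketch and are not taken from the actual PolVolumeEncoder configuration.

```python
import torch
import torch.nn as nn

# Toy pol volume (B, C, D, H, W): flat everywhere except one disparity bin.
volume = torch.zeros(1, 1, 48, 4, 4)
volume[:, :, 17] = 1.0  # peak at disparity bin 17 (arbitrary)

# Collapse only the disparity axis; None keeps H and W at their input size.
avg_pool = nn.AdaptiveAvgPool3d((1, None, None))
max_pool = nn.AdaptiveMaxPool3d((1, None, None))

avg_out = avg_pool(volume)  # peak diluted to 1/48 of its height
max_out = max_pool(volume)  # peak height preserved

print(avg_out.max().item(), max_out.max().item())
```

With 48 bins the averaged response drops to roughly 0.02 while the max-pooled response stays at 1.0, which is the dilution described in 7.1.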
8. Highlights
- Complete three-module polarization path: keeps the full polarization-learning pipeline of PolVolumeEncoder (3D CNN) + PolFNet + PolCNet, the highest-capacity version of the dual-stream series.
- Pol stream fully de-normalized: the value of the polarization signal lies in the “magnitude difference”, so the Pol stream uses no BatchNorm/InstanceNorm/GroupNorm, which keeps normalization from washing out the |I∥ - I⊥| difference.
- AvgPool → MaxPool preserves peaks: disparity-dimension compression switches to MaxPool, matching the physical meaning of “glass has a peak at the correct d” and avoiding peak dilution by averaging.
- Four diagnostic metrics as gatekeepers: α divergence, region-wise EPE (safety gate), and relative convergence are derived directly from the design intent, enabling real-time checks on whether the model has learned to distinguish glass, whether the GRU converges normally, and whether non-glass regions degrade.