True Dual-Stream (Mid) Architecture — Po-Ting Lin (林柏廷)

1. Design Goals

This architecture targets the matching difficulty over glass regions in active polarization stereo systems and adopts a “medium-capacity polarization path (Pol stream)” design. Its core idea:

Process the polarization signal with the same pyramid + lookup strategy as RGB
  - Keep the full 36ch polarization signal
  - No encoder/pooling
  - At every GRU iteration the model sees the complete profile at the current
    disparity

The design avoids two extremes: compressing the polarization signal with a 3D CNN + pooling washes out the peak at the correct disparity, while using only two statistics (max/var) is too sparse. The middle ground is to not compress and not pool — keep the full 36-channel polarization cost profile so that the α predictor sees the complete information at every GRU iteration.

2. Architecture

2.1 Initial Version

[Figure: True Dual-Stream (Mid) initial architecture]

Key designs:

α predicted at every iteration: from the 36-channel cost_pol at the current disparity.
Sees the full profile: not compressed statistics.
No encoder needed: grayscale images go directly into correlation.

2.2 Revised Version (CorrBlockNoNorm + Gaussian blur)

[Figure: True Dual-Stream (Mid) revised architecture]

The revised version addresses two problems of the initial version: (1) the L2 normalization in the original CorrBlock kills the polarization intensity difference, so CorrBlockNoNorm is used instead; (2) once normalization is removed, high-frequency spurious signals appear in non-glass regions, so a Gaussian blur is added before correlation to suppress them.

3. Components and Modules

3.1 AlphaFromCost

Predicts per-pixel α from the 36-channel cost_pol at the current disparity of every GRU iteration.

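import torch.nn as nn

class AlphaFromCost(nn.Module):
    """Predict per-pixel α from cost_pol (36ch)"""

    def __init__(self, corr_dim=36, hidden_dim=32):
        super().__init__()  # must run before submodules are assigned
        self.net = nn.Sequential(
            nn.Conv2d(corr_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, 1, 1),
            nn.Sigmoid(),  # keeps α in (0, 1)
        )

    def forward(self, cost_pol):
        # (B, corr_dim, H/4, W/4) -> (B, 1, H/4, W/4)
        return self.net(cost_pol)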

3.2 AlphaFromCost — Large Kernel Improved Version

To break the “BN dilemma” (with batch normalization ON the polarization signal is washed out; with it OFF, Monte Carlo (MC) noise passes through), this version uses large kernels. Large kernels smooth high-frequency noise while preserving region-level polarization patterns, which removes the need for BN.

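import torch.nn as nn

class AlphaFromCost(nn.Module):
    """Large-kernel variant: predict per-pixel α from cost_pol (36ch)"""

    def __init__(self, corr_dim=36, hidden_dim=32):
        super().__init__()  # must run before submodules are assigned
        self.net = nn.Sequential(
            # 7x7 kernel: region-level judgment
            nn.Conv2d(corr_dim, hidden_dim, kernel_size=7, padding=3),
            nn.ReLU(),
            # 5x5 kernel
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            # 1x1 output
            nn.Conv2d(hidden_dim, 1, kernel_size=1),
            nn.Sigmoid(),  # keeps α in (0, 1)
        )

    def forward(self, cost_pol):
        # (B, corr_dim, H/4, W/4) -> (B, 1, H/4, W/4); padding keeps the spatial size
        return self.net(cost_pol)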

Design rationale:

MC noise is i.i.d. per pixel, so it can average out within the 7×7 receptive field of the first layer.
The region-level structure of the polarization signal is preserved.
No BN needed (avoids washing out the magnitude difference).
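
The averaging argument can be checked numerically. The sketch below is an illustration under stated assumptions, not the trained layer: it uses a fixed 7×7 box filter as a stand-in for what a learned 7×7 convolution is able to represent, a synthetic step image as the "region-level" signal, and unit-variance Gaussian noise as the MC noise. Averaging 49 independent samples should reduce the noise standard deviation by about a factor of 7, while the interior of each flat region is unchanged.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Region-level signal: left half 0 (non-glass), right half 1 (glass). Synthetic.
signal = torch.zeros(1, 1, 64, 64)
signal[..., :, 32:] = 1.0

# Pixel-level i.i.d. noise with standard deviation 1. Synthetic.
noise = torch.randn(1, 1, 64, 64)

# Fixed 7x7 box filter (assumption: stands in for a learned 7x7 kernel).
box = torch.full((1, 1, 7, 7), 1.0 / 49)

smooth_noise = F.conv2d(noise, box)    # valid convolution -> (1, 1, 58, 58)
smooth_signal = F.conv2d(signal, box)

# Noise std drops from about 1 to about 1/7.
print(noise.std().item(), smooth_noise.std().item())

# Far from the step, the region values survive untouched.
print(smooth_signal[0, 0, 0, 0].item(), smooth_signal[0, 0, 0, -1].item())
```

A learned kernel is free to converge to something other than an average, so this only shows that the large kernel has the capacity to suppress i.i.d. noise without normalization.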

3.3 CorrBlockNoNorm

The original CorrBlock applies L2 normalization before computing correlation. Normalizing each pixel's feature vector to unit length eliminates the “intensity difference” (after normalization, I∥ and I⊥ become almost identical), and the essence of the polarization signal is exactly this intensity difference. Hence a CorrBlock without normalization is designed for the Pol stream.

Original CorrBlock:

class CorrBlock:
    def _build_corr(self, fmap1, fmap2):
        # L2 Normalization
        fmap1 = fmap1 / (torch.norm(fmap1, dim=1, keepdim=True) + 1e-6)
        fmap2 = fmap2 / (torch.norm(fmap2, dim=1, keepdim=True) + 1e-6)

        corr = torch.einsum('bchw,bchx->bhwx', fmap1, fmap2)  # dot product
        return corr

CorrBlockNoNorm:

class CorrBlockNoNorm:
    """No L2 normalization; preserves polarization intensity difference"""

    def _build_corr(self, fmap1, fmap2):
        # No normalization! compute dot product directly
        corr = torch.einsum('bchw,bchx->bhwx', fmap1, fmap2)
        return corr
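
The effect is easiest to see in the single-channel case of the initial version, where per-pixel L2 normalization maps every positive intensity to roughly 1. The following toy comparison is a sketch: `build_corr` is a hypothetical helper that merges the two `_build_corr` variants above behind a flag, and the intensity values are made up for illustration.

```python
import torch

def build_corr(fmap1, fmap2, normalize):
    """1D all-pairs correlation along the width axis, with optional L2 norm."""
    if normalize:
        fmap1 = fmap1 / (torch.norm(fmap1, dim=1, keepdim=True) + 1e-6)
        fmap2 = fmap2 / (torch.norm(fmap2, dim=1, keepdim=True) + 1e-6)
    return torch.einsum('bchw,bchx->bhwx', fmap1, fmap2)

# One image row, one channel. The right row is bright (0.9) at x = 2 only.
left = torch.full((1, 1, 1, 4), 0.5)
right = torch.tensor([0.1, 0.1, 0.9, 0.1]).view(1, 1, 1, 4)

corr_norm = build_corr(left, right, normalize=True)
corr_raw = build_corr(left, right, normalize=False)

print(corr_norm[0, 0, 0])  # all entries ~1: the bright pixel is indistinguishable
print(corr_raw[0, 0, 0])   # 0.05, 0.05, 0.45, 0.05: the bright pixel stands out
```

With 3-channel features the normalized correlation still depends on the color direction, but the per-pixel magnitude is discarded in the same way.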

3.4 Gaussian Blur (Low-pass Filter)

Once normalization is removed, non-glass regions develop high-frequency spurious signals (MC noise + texture noise). Adding a Gaussian blur before correlation suppresses the high-frequency noise while preserving the low-frequency polarization structure.

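def _get_pol_features(self, left, right):
    # Downsample to the 1/4-resolution grid of the cost volume (see Section 4)
    h, w = left.shape[-2] // 4, left.shape[-1] // 4
    left_feat = F.interpolate(left, size=(h, w), ...)
    right_feat = F.interpolate(right, size=(h, w), ...)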

    # Low-pass filter (Gaussian blur)
    left_feat = gaussian_blur(left_feat, kernel_size=5)
    right_feat = gaussian_blur(right_feat, kernel_size=5)

    return left_feat, right_feat

Why blur before correlation: what needs to be eliminated is the high-frequency noise in the input images; smoothing first and then computing correlation is cleaner than computing correlation first and smoothing afterwards.
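
For reference, a self-contained version of this step might look like the sketch below. Several details are assumptions not stated in this note: bilinear interpolation, sigma = 1.0, reflect padding, and a separable depthwise implementation of `gaussian_blur` written with `F.conv2d` (the project may use a library blur instead). Only the 1/4 scale and the kernel size of 5 come from the text.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel1d(kernel_size=5, sigma=1.0):
    """Normalized 1D Gaussian taps (sigma is an assumed value)."""
    x = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    k = torch.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, kernel_size=5, sigma=1.0):
    """Separable depthwise Gaussian blur on a (B, C, H, W) tensor."""
    c = img.shape[1]
    k = gaussian_kernel1d(kernel_size, sigma).to(img)
    pad = kernel_size // 2
    img = F.pad(img, (pad, pad, pad, pad), mode='reflect')
    # Horizontal pass, then vertical pass; groups=c blurs each channel on its own.
    img = F.conv2d(img, k.view(1, 1, 1, -1).repeat(c, 1, 1, 1), groups=c)
    img = F.conv2d(img, k.view(1, 1, -1, 1).repeat(c, 1, 1, 1), groups=c)
    return img

def get_pol_features(left, right, scale=4):
    """Downsample to 1/scale resolution, then low-pass filter both views."""
    h, w = left.shape[-2] // scale, left.shape[-1] // scale
    left_feat = F.interpolate(left, size=(h, w), mode='bilinear', align_corners=False)
    right_feat = F.interpolate(right, size=(h, w), mode='bilinear', align_corners=False)
    return gaussian_blur(left_feat), gaussian_blur(right_feat)

left = torch.rand(2, 3, 64, 96)
right = torch.rand(2, 3, 64, 96)
left_feat, right_feat = get_pol_features(left, right)
print(left_feat.shape)  # torch.Size([2, 3, 16, 24])
```

Because the taps sum to 1, flat regions pass through unchanged and only pixel-scale variation is attenuated.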

4. Tensor Dimensions

Item	Dimensions	Description
Pol features (initial)	(B, 1, H/4, W/4)	Grayscale downsampling
Pol features (revised)	(B, 3, H/4, W/4)	RGB downsampling + Gaussian blur
cost_pol	(B, 36, H/4, W/4)	CorrBlockNoNorm pyramid + lookup
RGB features	(B, 256, H/4, W/4)	fnet encoder output
cost_rgb	(B, 36, H/4, W/4)	CorrBlock (with L2 norm) pyramid + lookup
AlphaFromCost input	(B, 36, H/4, W/4)	cost_pol
α	(B, 1, H/4, W/4)	Per-pixel fusion weight
cost_fused	(B, 36, H/4, W/4)	α-weighted fusion
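
The table lists cost_fused as an α-weighted fusion without giving the formula. One plausible reading, shown here as a sketch, is a per-pixel convex blend in which α weights the Pol stream and (1 − α) weights the RGB stream. That direction of weighting is an assumption, and `fuse_costs` is a hypothetical name.

```python
import torch

def fuse_costs(cost_rgb, cost_pol, alpha):
    """Per-pixel convex blend; alpha (B, 1, H, W) broadcasts over 36 channels.

    Assumption: alpha weights the polarization cost, (1 - alpha) the RGB cost.
    """
    return alpha * cost_pol + (1.0 - alpha) * cost_rgb

B, H4, W4 = 2, 16, 24                      # batch and 1/4-resolution size
cost_rgb = torch.randn(B, 36, H4, W4)
cost_pol = torch.randn(B, 36, H4, W4)
alpha = torch.rand(B, 1, H4, W4)           # stand-in for the AlphaFromCost output

cost_fused = fuse_costs(cost_rgb, cost_pol, alpha)
print(cost_fused.shape)  # torch.Size([2, 36, 16, 24])
```

Under this reading, α near 1 lets the polarization cost dominate at that pixel, and α near 0 falls back to the RGB cost.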

5. Hyperparameters

Parameter	Default	Description
corr_dim	36	Number of cost_pol channels (AlphaFromCost input)
hidden_dim	32	AlphaFromCost hidden channel count
Gaussian blur kernel	5	Low-pass filter size used before correlation
Large Kernel sizes	7×7 / 5×5	Kernel sizes of the improved AlphaFromCost
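Pol parameter count	~20K (3×3 version) / ~82K (large-kernel version)	Total parameters of AlphaFromCost, counted from the layer definitions in 3.1 and 3.2 at the defaults above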
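
The parameter count can be reproduced from the layer definitions in Sections 3.1 and 3.2. The sketch below rebuilds both stacks with the defaults corr_dim=36 and hidden_dim=32 and counts weights and biases. `count_params` is a hypothetical helper. A different total would indicate a different hidden_dim or layer layout than the one listed here.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable scalars (weights and biases)."""
    return sum(p.numel() for p in module.parameters())

# Layer stack of Section 3.1 (3x3 kernels).
small = nn.Sequential(
    nn.Conv2d(36, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 1), nn.Sigmoid(),
)

# Layer stack of Section 3.2 (7x7 and 5x5 kernels).
large = nn.Sequential(
    nn.Conv2d(36, 32, 7, padding=3), nn.ReLU(),
    nn.Conv2d(32, 32, 5, padding=2), nn.ReLU(),
    nn.Conv2d(32, 1, 1), nn.Sigmoid(),
)

print(count_params(small))  # 19681 = (36*32*9 + 32) + (32*32*9 + 32) + (32 + 1)
print(count_params(large))  # 82145 = (36*32*49 + 32) + (32*32*25 + 32) + (32 + 1)
```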

6. Design Decisions and Rationale

6.1 Same Pyramid + Lookup as RGB

Pyramid + lookup preserves the full 36-channel polarization signal so that every GRU iteration sees the complete cost profile at the current disparity. The α predictor needs to see the complete disparity profile to correctly distinguish glass from non-glass.
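
To make the 36 channels concrete, here is a minimal sketch of a 1D correlation pyramid with lookup. It assumes 4 pyramid levels and a lookup radius of 4, so that 4 × (2·4 + 1) = 36. This note does not state that split, so treat it as an assumption, along with the zero padding outside the image and the hypothetical function names `build_pyramid` and `lookup`.

```python
import torch
import torch.nn.functional as F

def build_pyramid(corr, num_levels=4):
    """corr: (B, H, W1, W2). Average-pool the W2 (candidate) axis per level."""
    b, h, w1, w2 = corr.shape
    level = corr.reshape(b * h * w1, 1, w2)
    pyramid = [level]
    for _ in range(num_levels - 1):
        level = F.avg_pool1d(level, kernel_size=2, stride=2)
        pyramid.append(level)
    return pyramid

def _sample_zero_pad(row, idx):
    """Gather row[n, idx[n, k]], returning 0 for out-of-range indices."""
    w2 = row.shape[-1]
    valid = (idx >= 0) & (idx <= w2 - 1)
    return row.gather(1, idx.clamp(0, w2 - 1).long()) * valid

def lookup(pyramid, x_right, radius=4):
    """x_right: (B, H, W1), current matching x-coordinate in the right image."""
    b, h, w1 = x_right.shape
    offsets = torch.arange(-radius, radius + 1, dtype=x_right.dtype,
                           device=x_right.device)
    out = []
    for i, level in enumerate(pyramid):
        x = x_right.reshape(-1, 1) / 2 ** i + offsets      # (N, 2r+1)
        x0 = x.floor()
        frac = x - x0
        row = level[:, 0]                                  # (N, W2 / 2**i)
        # Linear interpolation between the two nearest candidates.
        vals = (1 - frac) * _sample_zero_pad(row, x0) \
             + frac * _sample_zero_pad(row, x0 + 1)
        out.append(vals.view(b, h, w1, -1))
    return torch.cat(out, dim=-1).permute(0, 3, 1, 2)      # (B, 36, H, W1)

fmap1 = torch.rand(1, 3, 4, 24)
fmap2 = torch.rand(1, 3, 4, 24)
corr = torch.einsum('bchw,bchx->bhwx', fmap1, fmap2)       # (1, 4, 24, 24)
x_right = torch.arange(24, dtype=torch.float32).expand(1, 4, 24)  # zero disparity
cost_pol = lookup(build_pyramid(corr), x_right)
print(cost_pol.shape)  # torch.Size([1, 36, 4, 24])
```

Each GRU iteration would call `lookup` again with the updated coordinates, so the 36 values always describe the cost profile around the current disparity estimate rather than a pooled summary of it.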

6.2 L2 Normalization Removed

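L2 normalization is fine for learned features (256ch), where the direction of the feature vector carries the information. For raw pixels (3ch) it kills the polarization signal, because it discards the per-pixel magnitude and with it the intensity differences. On glass, polarization makes I∥ ≠ I⊥. Correlation measures similarity rather than difference, so it cannot expose that gap directly, but without normalization the dot product still scales with intensity and leaves at least some variation to work with.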

6.3 Gaussian Blur Separates Signal from Noise

Polarization is region-level structure; noise is pixel-level randomness. Gaussian blur smooths high-frequency noise while preserving low-frequency structure.

6.4 Large Kernels Break the BN Dilemma

BN is a double-edged sword:

BN State	Polarization Signal	MC Noise
ON	Washed out	Suppressed
OFF	Preserved	Let through

Characteristic differences between MC noise and the polarization signal:

Characteristic	MC Noise	Polarization Signal
Spatial frequency	High frequency (pixel-level random)	Low frequency (region-level structure)
Spatial correlation	None (i.i.d.)	Strong (continuous over glass regions)
Physical source	Monte Carlo sampling error	Polarized specular reflection

Large kernels distinguish signal from noise via spatial structure rather than via statistical normalization, so they preserve the polarization signal while suppressing MC noise without needing BN.

7. Highlights

Full 36-channel polarization profile: processes the polarization signal with the same pyramid + lookup strategy as RGB; no compression, no pooling — every GRU iteration sees the complete cost profile at the current disparity.
Iteration-level α prediction: AlphaFromCost predicts per-pixel α at each GRU iteration from the 36-channel cost_pol at the current disparity, rather than relying on one-shot static statistics.
CorrBlockNoNorm preserves polarization intensity: a custom CorrBlock without L2 normalization for the polarization path, avoiding the elimination of the “intensity difference” on which the polarization signal depends.
Gaussian blur separates signal from noise: a low-pass filter before correlation suppresses pixel-level high-frequency noise while preserving the region-level low-frequency polarization structure.
Large kernels replace BN: 7×7 / 5×5 large kernels distinguish signal from noise via “spatial structure”, resolving the dilemma of “BN ON washes out the signal, BN OFF lets the noise through” without any normalization.