Cost Concatenation Architecture — Po-Ting Lin (林柏廷)

1. Design Goals

This architecture addresses stereo matching when both an RGB stream and a polarization stream are available.

An intuitive fusion approach is α-weighted fusion: cost_fused = α × cost_rgb + (1 - α) × cost_pol, where an α predictor decides, at each location, how much to rely on RGB and how much on polarization. This design carries a strong implicit assumption: the α predictor must learn to distinguish glass from non-glass regions, so that it can favor polarization on glass and RGB on the background. In effect the network has to learn disparity and an implicit glass segmentation at the same time, which causes three problems:

Cost volume differences are insufficient to support explicit classification: when the cost_pol profiles of glass and non-glass regions are not distinctive enough, the α predictor lacks a discriminative signal and tends to collapse to the mean (α ≈ 0.5, equal use of both streams).
Interference from samples without polarization signal: if the data contains samples that lack a polarization signal, the optimal α policy degenerates into “always 0.5” to avoid making mistakes on those samples.
Task coupling: α fusion forces the network to do two things at once, namely predict disparity and implicitly predict “where the glass is”. These two tasks may interfere with each other.
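
For reference, the α-weighted baseline discussed above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name alpha_fusion and the tensor sizes are assumptions, and the α predictor itself is replaced by a constant tensor to show what the collapsed "always 0.5" policy computes.

```python
import torch

def alpha_fusion(cost_rgb, cost_pol, alpha):
    """Blend two cost volumes with a per-pixel weight alpha in [0, 1].

    alpha has shape (B, 1, H, W) and broadcasts over the cost channels.
    """
    return alpha * cost_rgb + (1.0 - alpha) * cost_pol

# Hypothetical sizes, chosen only for illustration.
B, C, H, W = 2, 36, 8, 10
cost_rgb = torch.randn(B, C, H, W)
cost_pol = torch.randn(B, C, H, W)

# Collapsed predictor: alpha = 0.5 everywhere.
alpha_collapsed = torch.full((B, 1, H, W), 0.5)
cost_fused = alpha_fusion(cost_rgb, cost_pol, alpha_collapsed)

# With alpha fixed at 0.5, fusion is just the average of the two streams.
mean_cost = 0.5 * (cost_rgb + cost_pol)
```

When α collapses this way, the fused cost still has only 36 channels, so the GRU sees the average and can no longer tell which stream contributed what.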

The design goal of this architecture: do not require the network to explicitly learn “where the glass is”. Instead, feed both streams’ cost information to the network and let the GRU learn how to use them, collapsing the learning objective back to a single disparity prediction.

2. Architecture

[Figure: overall architecture of Cost Concatenation]

Core idea: the 36ch cost volumes from the two streams are concatenated directly into a 72ch cost and handed to the GRU-based UpdateBlock. No α, no weighted fusion.

3. Components and Modules

3.1 RGB Stream

RGB → fnet encoder → CorrBlock (with L2 norm) → cost_rgb (36ch). L2 norm is a suitable normalization for the learned 256ch features, because matching should depend on the direction of a feature vector rather than on its magnitude.
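
A minimal sketch of an L2-normalized correlation, assuming the norm is applied to each feature vector along the channel axis before the dot product (the source does not spell out where the norm sits). The name corr_l2norm is hypothetical, and the pyramid and lookup steps of CorrBlock are omitted.

```python
import torch
import torch.nn.functional as F

def corr_l2norm(fmap_left, fmap_right):
    """Row-wise all-pairs correlation of L2-normalized features.

    fmap_*: (B, D, H, W). Returns (B, H, W, W), where entry [b, h, i, j]
    is the cosine similarity between left pixel (h, i) and right pixel (h, j).
    """
    f1 = F.normalize(fmap_left, p=2, dim=1)
    f2 = F.normalize(fmap_right, p=2, dim=1)
    return torch.einsum("bdhi,bdhj->bhij", f1, f2)

# Hypothetical 256ch feature maps at a small resolution.
fmap = torch.randn(1, 256, 4, 6)
corr = corr_l2norm(fmap, fmap)

# Scaling the features does not change the result: magnitude is discarded.
corr_scaled = corr_l2norm(3.0 * fmap, fmap)
```

The scale invariance shown in the last line is the property that suits learned features but discards intensity information.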

3.2 Pol Stream

RGB → downsample → Gaussian blur → CorrBlockNoNorm → cost_pol (36ch).

The Pol stream adopts two key design choices:

CorrBlockNoNorm: no L2 norm, preserving polarization intensity differences. L2 norm would erase intensity information, while the polarization signal is precisely carried by intensity differences.
Gaussian blur: suppresses high-frequency noise while preserving low-frequency polarization structure.
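
The two design choices above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the helper names (gaussian_blur, pol_features, corr_no_norm), the average-pooling downsample by a factor of 8, and the 5×5 kernel with sigma 1.0 are all hypothetical, and the pyramid and lookup steps are omitted.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(x, kernel_size=5, sigma=1.0):
    """Depthwise Gaussian blur for a (B, C, H, W) tensor; keeps the input shape."""
    coords = torch.arange(kernel_size, dtype=x.dtype) - (kernel_size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = torch.outer(g, g)  # separable 2D Gaussian, weights sum to 1
    channels = x.shape[1]
    weight = kernel.expand(channels, 1, kernel_size, kernel_size).contiguous()
    pad = kernel_size // 2
    x = F.pad(x, [pad, pad, pad, pad], mode="reflect")
    return F.conv2d(x, weight, groups=channels)

def pol_features(image, factor=8):
    """Downsample, then blur. Average pooling is an assumed downsampling method."""
    return gaussian_blur(F.avg_pool2d(image, kernel_size=factor))

def corr_no_norm(fmap_left, fmap_right):
    """Row-wise all-pairs dot product without any normalization: (B, H, W, W)."""
    return torch.einsum("bdhi,bdhj->bhij", fmap_left, fmap_right)

torch.manual_seed(0)
left = torch.rand(1, 3, 64, 96)
right = torch.rand(1, 3, 64, 96)
f_left, f_right = pol_features(left), pol_features(right)

corr = corr_no_norm(f_left, f_right)
# Doubling the left intensities doubles the correlation: magnitude is preserved.
corr_bright = corr_no_norm(2.0 * f_left, f_right)
```

Unlike an L2-normalized correlation, the output here scales with the input intensity, which is the behavior the Pol stream relies on.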

3.3 Cost Concatenation

cost_combined = concat(cost_rgb, cost_pol)   # 36 + 36 = 72ch

No weighting, no selection — both streams’ costs are passed in full to the GRU.

3.4 UpdateBlock

The cost is 72ch, so the input dimension of the motion encoder expands accordingly. The remaining GRU and flow_head structures retain their standard design.

import torch.nn as nn

class UpdateBlockV5(nn.Module):
    def __init__(self, hidden_dim=128, context_dim=64, corr_dim=72):
        super().__init__()  # must run before submodules are registered
        # Motion encoder: 72ch cost + 1ch disp = 73ch in, 64ch out
        self.encoder = nn.Sequential(
            nn.Conv2d(corr_dim + 1, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # ... GRU + flow_head unchanged

The motion encoder input is cost_combined (72ch) + disp (1ch) = 73ch.
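
As a quick shape check, the concatenation and the motion encoder can be run on random tensors. The encoder layers are copied from the class above; the batch size and the H/8 × W/8 resolution are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

corr_dim = 72
encoder = nn.Sequential(
    nn.Conv2d(corr_dim + 1, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

B, H8, W8 = 2, 12, 16  # arbitrary batch size and 1/8-resolution grid
cost_rgb = torch.randn(B, 36, H8, W8)
cost_pol = torch.randn(B, 36, H8, W8)
disp = torch.zeros(B, 1, H8, W8)

# Channels 0..35 come from the RGB stream, channels 36..71 from the Pol stream.
cost_combined = torch.cat([cost_rgb, cost_pol], dim=1)  # (B, 72, H8, W8)
motion_input = torch.cat([cost_combined, disp], dim=1)  # (B, 73, H8, W8)
motion_features = encoder(motion_input)                 # (B, 64, H8, W8)
```

Because padding=1 is used with 3×3 kernels, the spatial size is unchanged and only the channel count changes.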

4. Tensor Dimensions

Item	Shape	Description
cost_rgb	(B, 36, H/8, W/8)	RGB CorrBlock (L2 norm) pyramid + lookup
cost_pol	(B, 36, H/8, W/8)	Pol CorrBlockNoNorm + blur
cost_combined	(B, 72, H/8, W/8)	concat(cost_rgb, cost_pol)
UpdateBlock motion encoder input	(B, 73, H/8, W/8)	cost_combined(72) + disp(1)
context	(B, 64, H/8, W/8)	From RGB CNet
hidden	(B, 128, H/8, W/8)	From RGB CNet
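
For a concrete input size, the shapes in the table can be computed with a small helper. The function tensor_shapes is hypothetical, and it assumes H and W are divisible by 8.

```python
def tensor_shapes(batch, height, width, cost_ch=36, context_dim=64, hidden_dim=128):
    """Return the tensor shapes from the table for a given input size."""
    if height % 8 or width % 8:
        raise ValueError("height and width are assumed to be divisible by 8")
    h8, w8 = height // 8, width // 8
    return {
        "cost_rgb": (batch, cost_ch, h8, w8),
        "cost_pol": (batch, cost_ch, h8, w8),
        "cost_combined": (batch, 2 * cost_ch, h8, w8),
        "motion_encoder_input": (batch, 2 * cost_ch + 1, h8, w8),
        "context": (batch, context_dim, h8, w8),
        "hidden": (batch, hidden_dim, h8, w8),
    }

# Example: a single 480 x 640 image pair gives a 60 x 80 grid.
shapes = tensor_shapes(batch=1, height=480, width=640)
```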

5. Hyperparameters

Hyperparameter	Value	Description
hidden_dim	128	GRU hidden state dimension
context_dim	64	Context feature dimension
corr_dim	72	Cost volume channels after concatenation
Correlation pyramid levels	4	Number of CorrBlock pyramid levels
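
The channel counts are consistent with one common reading, which the source does not confirm: each of the 4 pyramid levels contributes 2r + 1 samples along the epipolar line, with an assumed lookup radius r = 4. The function corr_channels is hypothetical.

```python
def corr_channels(num_levels, radius):
    """Channels produced by a 1D lookup: (2 * radius + 1) samples per pyramid level."""
    return num_levels * (2 * radius + 1)

# radius=4 is an assumption; only the 4 levels and the 36ch total are in the document.
per_stream = corr_channels(num_levels=4, radius=4)
corr_dim = 2 * per_stream  # RGB + Pol after concatenation
```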

6. Design Decisions and Rationale

6.1 Drop α, use concatenation instead

α fusion requires the network to learn glass segmentation in order to weight correctly. When the cost volume differences are insufficient to support this distinction, α cannot learn an effective separation. Concatenation does not require the network to perform explicit segmentation; both kinds of information are handed to the network, and the GRU makes its own trade-offs during the update process.

6.2 A simplified learning objective

The network only needs to learn disparity — no glass segmentation and no α supervision signal are required. A single objective lowers optimization difficulty and reduces inter-task interference.

6.3 More flexible fusion

The GRU can decide at each iteration, based on its current state, how to use the two streams, instead of being restricted to the fixed weighted-sum form of α fusion. The fusion behavior is learned from data rather than imposed by a hand-crafted prior.
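
One way to see that concatenation is at least as expressive as a spatially constant α: a single 1×1 convolution over the 72ch input reproduces the weighted sum exactly when its weights are set by hand. This is a sketch of that special case only. The function alpha_as_conv is hypothetical, and the actual motion encoder uses 3×3 convolutions with learned weights and also receives the disparity.

```python
import torch
import torch.nn as nn

def alpha_as_conv(alpha, channels=36):
    """Build a 1x1 conv on concat(cost_rgb, cost_pol) that equals alpha fusion."""
    conv = nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False)
    eye = torch.eye(channels)
    # Output channel k = alpha * rgb[k] + (1 - alpha) * pol[k].
    weight = torch.cat([alpha * eye, (1.0 - alpha) * eye], dim=1)  # (36, 72)
    with torch.no_grad():
        conv.weight.copy_(weight.view(channels, 2 * channels, 1, 1))
    return conv

cost_rgb = torch.randn(1, 36, 6, 8)
cost_pol = torch.randn(1, 36, 6, 8)

conv = alpha_as_conv(alpha=0.3)
with torch.no_grad():
    fused_by_conv = conv(torch.cat([cost_rgb, cost_pol], dim=1))

fused_by_alpha = 0.3 * cost_rgb + 0.7 * cost_pol
```

The two results agree up to floating-point error, so the weighted sum is one point in the space of functions the network can learn from the concatenated input.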

6.4 Differentiated processing for the Pol stream

CorrBlockNoNorm and Gaussian blur are tailored to the characteristics of the polarization signal: the former preserves polarization information carried by intensity differences, while the latter filters out high-frequency noise and preserves low-frequency polarization structure. This contrasts with the L2-normalized correlation used in the RGB stream.

7. Highlights

Concatenation instead of weighted fusion: no α predictor is introduced; the decision of “how to fuse the two streams” is fully delegated to the GRU, avoiding the hard-to-learn intermediate task of explicit glass segmentation.
A single learning objective: the network only needs to predict disparity. The α supervision and segmentation sub-task are removed, reducing task coupling and optimization difficulty.
Differentiated cost processing for RGB and Pol streams: the RGB stream uses L2 norm (suitable for learned features), while the Pol stream uses CorrBlockNoNorm + Gaussian blur (preserving polarization intensity differences and filtering high-frequency noise); each stream is tuned to the characteristics of its signal.
Per-iteration dynamic fusion: the 72ch cost is fully visible at every GRU step, so the fusion strategy adapts throughout the refinement process rather than being fixed once and for all.
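
The per-iteration behavior can be sketched with a simplified update loop. Everything here is an illustrative assumption rather than the original implementation: ConvGRU, TinyUpdateBlock, and lookup_stub are hypothetical names, the flow head is reduced to a single convolution, and the cost lookup is replaced by random tensors.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    def __init__(self, hidden_dim=128, input_dim=128):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))  # update gate
        r = torch.sigmoid(self.convr(hx))  # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

class TinyUpdateBlock(nn.Module):
    def __init__(self, hidden_dim=128, context_dim=64, corr_dim=72):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(corr_dim + 1, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.gru = ConvGRU(hidden_dim, input_dim=64 + context_dim)
        self.flow_head = nn.Conv2d(hidden_dim, 1, 3, padding=1)  # simplified head

    def forward(self, hidden, context, cost_combined, disp):
        motion = self.encoder(torch.cat([cost_combined, disp], dim=1))
        hidden = self.gru(hidden, torch.cat([motion, context], dim=1))
        return hidden, self.flow_head(hidden)

def lookup_stub(disp, channels=36):
    """Hypothetical stand-in for a cost lookup around the current disparity."""
    b, _, h, w = disp.shape
    return torch.randn(b, channels, h, w)

B, H8, W8 = 1, 6, 8
block = TinyUpdateBlock()
hidden = torch.tanh(torch.randn(B, 128, H8, W8))
context = torch.relu(torch.randn(B, 64, H8, W8))
disp = torch.zeros(B, 1, H8, W8)

with torch.no_grad():
    for _ in range(3):
        # Both streams are concatenated and fed in full at every step.
        cost_combined = torch.cat([lookup_stub(disp), lookup_stub(disp)], dim=1)
        hidden, delta = block(hidden, context, cost_combined, disp)
        disp = disp + delta
```

The point of the loop is structural: no fusion weight is computed outside the block, so any preference between the two streams has to come from the hidden state and the learned convolutions.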