Dual Volume Architecture — Po-Ting Lin (林柏廷)

1. Design Goals

Core Problem

RAFT-Stereo assumes photometric consistency: corresponding pixels in the left and right images have similar values. In an active polarization stereo system, however, this assumption does not hold over glass regions. The left camera (I∥) captures strong specular reflections while the right camera (I⊥) suppresses them, so on glass regions I∥(left) >> I⊥(right).

The resulting problems:

The Cost Volume produces garbage signals over glass regions.
A correct match yields a high cost, and the system misinterprets it as a “matching error”.
When the GRU iterates on garbage signals, the result degrades with every iteration.

Design Goal: Dual Volume Complementarity

This architecture introduces a second volume (the Pol Volume) that is complementary to the Cost Volume. Their reliability over glass vs. non-glass regions is exactly opposite:

Volume	Non-glass region	Glass region
Cost Volume	Strong, reliable signal	Garbage, unreliable
Pol Volume	≈ 0, no signal	Strong signal (I∥ >> I⊥)

The two volumes are defined as:

Cost Volume: corr[x,d] = dot(fmap_left[x], fmap_right[x-d])
Pol Volume:  pol[x,d]  = left[x] - right[x-d]

Neither volume depends on the ground truth, so both are available at inference time.
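
The two definitions above can be sketched in NumPy as follows. This is a minimal illustration, not the project's implementation: it assumes a single sample without a batch dimension, single-channel intensity images at the same resolution as the feature maps, a hypothetical `max_disp` parameter (at most the image width), and zero-filling where `x - d` falls outside the right image.

```python
import numpy as np

def build_volumes(fmap_left, fmap_right, left, right, max_disp):
    """Build corr[y, x, d] and pol[y, x, d] for d in [0, max_disp).

    fmap_*: (C, H, W) feature maps; left/right: (H, W) intensity images.
    Entries with x - d < 0 have no counterpart and stay 0 (assumption).
    """
    _, H, W = fmap_left.shape
    corr = np.zeros((H, W, max_disp), dtype=np.float32)
    pol = np.zeros((H, W, max_disp), dtype=np.float32)
    for d in range(max_disp):
        # Left columns x >= d pair with right columns x - d.
        corr[:, d:, d] = np.sum(
            fmap_left[:, :, d:] * fmap_right[:, :, :W - d], axis=0
        )
        pol[:, d:, d] = left[:, d:] - right[:, :W - d]
    return corr, pol
```

Both outputs are computed from the input pair alone. On a non-glass pixel with similar intensities in both views, `pol` at the true disparity is close to zero, while on glass it carries the I∥ minus I⊥ gap.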

2. Architecture (with Data Flow)

Figure: Dual Volume architecture data flow.

Four Phases of the Data Flow

Phase	Name	Input	Output
Phase 1	Feature Extraction	left, right	fmap_left, fmap_right
Phase 2	Build volumes (GT-free)	fmap_left/right, left, right	Cost Volume, Pol Volume
Phase 3	Context	left, pol_input	context = concat(rgb_ctx, pol_ctx), hidden
Phase 4	GRU iteration	Cost/Pol Volume, context, hidden	delta_disp (iteratively updates disp)

3. Components and Modules

3.1 fnet (Feature Network)

fmap_left, fmap_right = fnet(left), fnet(right).

3.2 Cost Volume

corr[x,d] = dot(fmap_left[x], fmap_right[x-d]), a dot-product correlation.

3.3 Pol Volume

pol[x,d] = left[x] - right[x-d], the raw polarization difference.
GT-free, available at inference time.

3.4 rgb_cnet (RGB Context Encoder)

rgb_ctx, hidden = rgb_cnet(left).
Produces both the context and the GRU initial hidden state.

3.5 pol_cnet (Pol Context Encoder)

Input: pol_input, 2 channels.
Output: pol_ctx.
The input differs by training stage:
- Stage 1 (Pretrain): pol_input = [GT_mask, zeros].
- Stage 2 (Finetune): pol_input = pol_stats(pol_vol) = [pol_max, pol_var].

3.6 Context Fusion

context = concat(rgb_ctx, pol_ctx), simple concatenation.

3.7 GRU Update Unit

Each iteration:

corr_feat = lookup(Cost_Volume, disp).
pol_feat = lookup(Pol_Volume, disp).
motion = encoder(concat(corr_feat, pol_feat, disp)) — concat-based fusion.
hidden = gru(hidden, concat(motion, context)).
delta_disp = disp_head(hidden).
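
The per-iteration steps above can be sketched as a loop. This is a rough illustration under stated assumptions: volumes are `(H, W, D)` NumPy arrays, `lookup` samples a small window around the current disparity with linear interpolation and clamps out-of-range samples, disparity starts at zero, and the hypothetical `update_fn` callable stands in for the learned `encoder`, `gru`, and `disp_head`.

```python
import numpy as np

def lookup(volume, disp, radius=2):
    """Sample volume[y, x, disp[y, x] + k] for k in [-radius, radius].

    volume: (H, W, D); disp: (H, W) float disparities.
    Linear interpolation along d; out-of-range positions are clamped
    to the nearest valid index (an assumption of this sketch).
    Returns an array of shape (H, W, 2 * radius + 1).
    """
    D = volume.shape[2]
    offsets = np.arange(-radius, radius + 1, dtype=np.float32)
    pos = np.clip(disp[..., None] + offsets, 0, D - 1)
    lo = np.floor(pos).astype(np.int64)
    hi = np.minimum(lo + 1, D - 1)
    w = pos - lo
    v_lo = np.take_along_axis(volume, lo, axis=2)
    v_hi = np.take_along_axis(volume, hi, axis=2)
    return (1 - w) * v_lo + w * v_hi

def refine(cost_volume, pol_volume, context, hidden, update_fn, iters=8):
    """Iteratively update disp; update_fn is a placeholder for the learned modules."""
    disp = np.zeros(cost_volume.shape[:2], dtype=np.float32)
    for _ in range(iters):
        corr_feat = lookup(cost_volume, disp)
        pol_feat = lookup(pol_volume, disp)
        # Concat-based fusion of both lookups with the current disparity.
        fused = np.concatenate([corr_feat, pol_feat, disp[..., None]], axis=-1)
        hidden, delta_disp = update_fn(fused, context, hidden)
        disp = disp + delta_disp
    return disp
```

Because both lookups are indexed by the same `disp`, the motion encoder sees the correlation and the polarization difference for the same candidate matches at every iteration.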

4. Two-Stage Training Strategy

Stage 1 (Pretrain): Cheating with the GT Mask

pol_input = [GT_mask, zeros] (2 channels, second channel filled with zeros).
Goal: let the Pol CNet learn what a “correct glass context” looks like.

Stage 2 (Finetune): Pol Volume Statistics

pol_input = [pol_max, pol_var] (2 channels).
pol_max: maximum of the Pol Volume along the disparity dimension.
pol_var: variance of the Pol Volume along the disparity dimension.
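
As a minimal sketch of the two input variants, assuming a single sample, a Pol Volume laid out as `(H, W, D)` with disparity on the last axis, and population variance (NumPy's default); the function names are hypothetical:

```python
import numpy as np

def pol_input_pretrain(gt_mask):
    """Stage 1: stack [GT_mask, zeros] into a (2, H, W) array."""
    gt_mask = gt_mask.astype(np.float32)
    return np.stack([gt_mask, np.zeros_like(gt_mask)], axis=0)

def pol_input_finetune(pol_volume):
    """Stage 2: stack [pol_max, pol_var], both reduced over the disparity axis."""
    pol_max = pol_volume.max(axis=-1)
    pol_var = pol_volume.var(axis=-1)  # population variance (ddof=0)
    return np.stack([pol_max, pol_var], axis=0).astype(np.float32)
```

Both variants produce the same 2-channel shape, so the Pol CNet can be reused unchanged when switching stages.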

Transition Strategy

Lower the Pol CNet learning rate during Stage 2 (×0.1).
Alternatively, use a warmup blending period at the start of Stage 2.
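
The two options can be sketched as follows. The source does not specify how the blending works, so this assumes a linear ramp that mixes the Stage 1 input with the Stage 2 input over a hypothetical `warmup_steps`; all function names are illustrative.

```python
def pol_cnet_lr(base_lr, stage, scale=0.1):
    """Return the Pol CNet learning rate: scaled down in Stage 2, unchanged otherwise."""
    return base_lr * scale if stage == 2 else base_lr

def blend_weight(step, warmup_steps):
    """Linear ramp from 0 (pure GT-mask input) to 1 (pure Pol Volume statistics)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))

def blended_pol_input(gt_input, stats_input, step, warmup_steps):
    """Mix the two pol_input variants; works on floats or same-shape arrays."""
    a = blend_weight(step, warmup_steps)
    return (1.0 - a) * gt_input + a * stats_input
```

In a framework such as PyTorch, the reduced rate would typically be set through a separate optimizer parameter group for the Pol CNet. Note that blending still needs the GT mask during warmup, so it applies to training only.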

5. Tensor Dimensions

Tensor	Dimensions / Setting	Description
left / right	(B, 3, H, W)	Input polarization image pair
fmap_left / fmap_right	(B, C, ·, ·)	Output feature maps of fnet
Cost Volume `corr[x,d]`	Indexed by disparity	dot product
Pol Volume `pol[x,d]`	Indexed by disparity	`left[x] - right[x-d]`
pol_input	(B, 2, ·, ·)	2 channels (Pretrain: [GT_mask, zeros]; Finetune: [pol_max, pol_var])
rgb_ctx / pol_ctx	(B, ·, ·, ·)	The two context branches
context	(B, ·, ·, ·)	`concat(rgb_ctx, pol_ctx)`
hidden	(B, ·, ·, ·)	GRU initial hidden state (from rgb_cnet)
corr_feat / pol_feat	(B, ·, ·, ·)	Lookup results of the two volumes
disp / delta_disp	(B, 1, ·, ·)	Disparity and its iterative residual

6. Design Decisions and Rationale

Decision	Choice	Rationale
Cost + Pol fusion	Concat	Simple and stable; let the encoder learn the weighting
Pol CNet input channels	2 channels	Zero-padded in Pretrain, max+var in Finetune
Pol Volume statistics	max + var	max = peak polarization difference, var = signal stability
Transition strategy	Lower LR	Avoid overwriting what was learned in Stage 1

Design Principles

No GT at inference: training may cheat, but inference must stand alone.
No extra inputs: only left (I∥) and right (I⊥) are used.
No degradation on non-glass regions: the new design must only add value.
The Pol Volume is the savior: it is the only reliable signal over glass.

Monitoring

Non-glass degradation monitoring:

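metrics = {
    'epe_total': ...,
    'epe_glass': ...,
    'epe_non_glass': ...,  # key metric to monitor
}

# baseline: reference non-glass EPE (assumed here to come from the model
# without the Pol Volume, evaluated on the same data).
# Warn when non-glass EPE is more than 5% worse than the baseline.
if metrics['epe_non_glass'] > baseline * 1.05:
    print("WARNING: Non-glass degradation!")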
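
One way the three metrics could be computed is sketched below. It assumes EPE is the mean absolute disparity error, that a boolean glass mask is available at evaluation time, and that an optional validity mask marks pixels with ground truth; `region_epe` is a hypothetical helper name.

```python
import numpy as np

def region_epe(disp_pred, disp_gt, glass_mask, valid_mask=None):
    """Return mean absolute disparity error overall, on glass, and off glass."""
    err = np.abs(disp_pred - disp_gt)
    glass = glass_mask.astype(bool)
    valid = np.ones_like(glass) if valid_mask is None else valid_mask.astype(bool)

    def mean_or_nan(sel):
        # An empty region has no defined error; report NaN instead of failing.
        return float(err[sel].mean()) if sel.any() else float('nan')

    return {
        'epe_total': mean_or_nan(valid),
        'epe_glass': mean_or_nan(valid & glass),
        'epe_non_glass': mean_or_nan(valid & ~glass),
    }
```

Reporting the two regions separately keeps a gain on glass from hiding a regression elsewhere in the total.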

7. Polarization Injection Points

Injection Point	Phase	Form	Description
Pol Volume → motion encoder	Phase 4 (GRU iteration)	`motion = encoder(concat(corr_feat, pol_feat, disp))`	Concat-fused with the Cost Volume every iteration
pol_input → pol_cnet → context	Phase 3 (Context)	`context = concat(rgb_ctx, pol_ctx)`	Polarization enters the context via the Pol CNet

Polarization has two entry points: (1) the Pol Volume goes directly into the GRU loop to complement the Cost Volume; (2) pol_input enters the context via the Pol CNet. The Pol Volume is always GT-free; pol_input is GT-free from Stage 2 onward, since Stage 1 feeds it the GT mask.

8. Highlights

Complementary Dual Volume design: the Pol Volume fills the blind spot where the Cost Volume fails on glass; each volume is reliable exactly where the other is not, so together they cover both glass and non-glass regions.
Fully self-sufficient at inference: both volumes are GT-free, so although training may use the GT mask, inference runs from the left (I∥) and right (I⊥) images alone.
Minimalist concat fusion: no cross-attention or gated fusion. The lookup features of the two volumes are simply concatenated and the motion encoder learns the trade-off, keeping the design stable and easy to debug.
Two-stage training bridges oracle and real signals: Stage 1 pretrains the Pol CNet with the GT mask to establish the “correct glass context” representation; Stage 2 switches to Pol Volume statistics with a lowered learning rate for a smooth transition.
Built-in non-glass degradation monitoring: region-wise EPE acts as a gate that flags any regression beyond the 5% tolerance, so the new polarization path can be checked against the goal of not harming matching on non-glass regions.