1. Design Goals
Core Problem
RAFT-Stereo assumes photometric consistency: corresponding pixels in the left and right images have similar values. In an active polarization stereo system, however, this assumption does not hold over glass regions. The left camera (I∥) captures strong specular reflections while the right camera (I⊥) suppresses them, so on glass regions I∥(left) >> I⊥(right).
The resulting problems:
- The Cost Volume produces garbage signals over glass regions.
- A correct match yields a high cost, and the system misinterprets it as a “matching error”.
- When the GRU iterates on garbage signals, the result degrades with every iteration.
Design Goal: Dual Volume Complementarity
This architecture introduces a second volume (the Pol Volume) that is complementary to the Cost Volume. Their reliability over glass vs. non-glass regions is exactly opposite:
| Volume | Non-glass region | Glass region |
|---|---|---|
| Cost Volume | Strong, reliable signal | Garbage, unreliable |
| Pol Volume | ≈ 0, no signal | Strong signal (I∥ >> I⊥) |
The two volumes are defined as:
Cost Volume: corr[x,d] = dot(fmap_left[x], fmap_right[x-d])
Pol Volume: pol[x,d] = left[x] - right[x-d]
Neither volume depends on the ground truth, so both are available at inference time.
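The two definitions above can be sketched in NumPy as follows. This is a minimal illustration, not the project's implementation: it assumes the feature maps and the images share the same resolution, that out-of-range columns (`x - d < 0`) are zero-filled, and that disparities are integers in `[0, max_disp)`. The helper names `shift_right` and `build_volumes` and the parameter `max_disp` are hypothetical.

```python
import numpy as np

def shift_right(x, d):
    """Return y with y[..., w] = x[..., w - d]; columns with w < d are zero-filled (assumption)."""
    if d == 0:
        return x.copy()
    out = np.zeros_like(x)
    out[..., d:] = x[..., :-d]
    return out

def build_volumes(fmap_left, fmap_right, left, right, max_disp):
    """fmap_*: (C, H, W); left/right: (3, H, W).

    Returns corr with shape (D, H, W) and pol with shape (D, 3, H, W),
    where D = max_disp. Neither uses ground truth.
    """
    corr, pol = [], []
    for d in range(max_disp):
        # corr[x, d] = dot(fmap_left[x], fmap_right[x - d])
        corr.append((fmap_left * shift_right(fmap_right, d)).sum(axis=0))
        # pol[x, d] = left[x] - right[x - d]
        pol.append(left - shift_right(right, d))
    return np.stack(corr), np.stack(pol)
```

Both volumes are built from the same shifted view of the right image, so they stay index-aligned along the disparity axis.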
2. Architecture (with Data Flow)
Four Phases of the Data Flow
| Phase | Name | Input | Output |
|---|---|---|---|
| Phase 1 | Feature Extraction | left, right | fmap_left, fmap_right |
| Phase 2 | Build volumes (GT-free) | fmap_left/right, left, right | Cost Volume, Pol Volume |
| Phase 3 | Context | left, pol_input | context = concat(rgb_ctx, pol_ctx), hidden |
| Phase 4 | GRU iteration | Cost/Pol Volume, context, hidden | delta_disp (iteratively updates disp) |
3. Components and Modules
3.1 fnet (Feature Network)
fmap_left, fmap_right = fnet(left), fnet(right).
3.2 Cost Volume
corr[x,d] = dot(fmap_left[x], fmap_right[x-d]), a dot-product correlation.
3.3 Pol Volume
pol[x,d] = left[x] - right[x-d], the raw polarization difference.
- GT-free, available at inference time.
3.4 rgb_cnet (RGB Context Encoder)
rgb_ctx, hidden = rgb_cnet(left).
- Produces both the context and the GRU initial hidden state.
3.5 pol_cnet (Pol Context Encoder)
- Input: pol_input, 2 channels.
- Output: pol_ctx.
- Inputs differ by training stage:
  - Stage 1 (Pretrain): pol_input = GT_mask.
  - Stage 2 (Finetune): pol_input = pol_stats(pol_vol).
3.6 Context Fusion
context = concat(rgb_ctx, pol_ctx), simple concatenation.
3.7 GRU Update Unit
Each iteration:
- corr_feat = lookup(Cost_Volume, disp).
- pol_feat = lookup(Pol_Volume, disp).
- motion = encoder(concat(corr_feat, pol_feat, disp)): concat-based fusion.
- hidden = gru(hidden, concat(motion, context)).
- delta_disp = disp_head(hidden).
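The per-iteration steps can be sketched as below. This is an illustrative NumPy version under several assumptions: both volumes are single-channel with shape `(D, H, W)`, `lookup` samples a small window around the current disparity with linear interpolation and clamps at the volume borders, and `encoder`, `gru`, and `disp_head` are stand-in callables supplied by the caller. The `radius` parameter and the function name `gru_step` are hypothetical.

```python
import numpy as np

def lookup(volume, disp, radius=1):
    """Sample volume (D, H, W) at disp + k for k in [-radius, radius].

    Uses linear interpolation along the disparity axis and clamps sample
    positions to [0, D - 1]. Returns an array of shape (2 * radius + 1, H, W).
    """
    D, H, W = volume.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    feats = []
    for k in range(-radius, radius + 1):
        pos = np.clip(disp + k, 0, D - 1)
        lo = np.floor(pos).astype(int)
        hi = np.minimum(lo + 1, D - 1)
        w = pos - lo
        feats.append((1 - w) * volume[lo, rows, cols] + w * volume[hi, rows, cols])
    return np.stack(feats)

def gru_step(cost_vol, pol_vol, disp, hidden, context, encoder, gru, disp_head):
    """One update iteration; encoder, gru, disp_head are placeholder callables."""
    corr_feat = lookup(cost_vol, disp)
    pol_feat = lookup(pol_vol, disp)
    # Concat-based fusion of both lookups with the current disparity.
    motion = encoder(np.concatenate([corr_feat, pol_feat, disp[None]], axis=0))
    hidden = gru(hidden, np.concatenate([motion, context], axis=0))
    delta_disp = disp_head(hidden)
    return disp + delta_disp, hidden
```

Because the two lookups share the same sampling positions, the motion encoder sees the Cost and Pol responses for identical disparity candidates.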
4. Two-Stage Training Strategy
Stage 1 (Pretrain): Cheating with the GT Mask
pol_input = [GT_mask, zeros] (2 channels, second channel filled with zeros).
- Goal: let the Pol CNet learn what a “correct glass context” looks like.
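Constructing the Stage 1 input might look like the sketch below, which assumes the ground-truth glass mask is a single `(H, W)` array. The function name `stage1_pol_input` is hypothetical.

```python
import numpy as np

def stage1_pol_input(gt_mask):
    """Stack [GT_mask, zeros] into a (2, H, W) float32 array."""
    mask = gt_mask.astype(np.float32)
    zeros = np.zeros_like(mask)  # second channel is padding only
    return np.stack([mask, zeros])
```

Keeping the channel count at 2 means the Pol CNet input layer does not change shape when Stage 2 swaps in the Pol Volume statistics.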
Stage 2 (Finetune): Pol Volume Statistics
pol_input = [pol_max, pol_var] (2 channels).
- pol_max: maximum of the Pol Volume along the disparity dimension.
- pol_var: variance of the Pol Volume along the disparity dimension.
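A minimal sketch of these statistics, assuming a Pol Volume of shape `(D, C, H, W)`. The source does not say how the colour channels are reduced to a single value per disparity, so averaging over `C` here is an assumption, as is the use of the population variance.

```python
import numpy as np

def pol_stats(pol_vol):
    """pol_vol: (D, C, H, W) -> (2, H, W) holding [pol_max, pol_var]."""
    # Assumption: collapse colour channels by averaging -> (D, H, W).
    per_disp = pol_vol.mean(axis=1)
    pol_max = per_disp.max(axis=0)   # maximum along the disparity dimension
    pol_var = per_disp.var(axis=0)   # population variance along the disparity dimension
    return np.stack([pol_max, pol_var])
```

The output has the same 2-channel layout as the Stage 1 input, so it can be fed to the Pol CNet unchanged.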
Transition Strategy
- Lower the Pol CNet learning rate during Stage 2 (×0.1).
- Or use a warmup blending period.
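Both transition options can be sketched as follows. The source does not specify what is blended during warmup or how it is scheduled. This sketch assumes a linear ramp that mixes the Stage 1 input with the Stage 2 statistics. All names (`blend_weight`, `blend_pol_input`, `stage2_lr`, `warmup_steps`) are hypothetical.

```python
def stage2_lr(base_lr, is_pol_cnet, scale=0.1):
    """Option 1: scale the Pol CNet learning rate by 0.1 in Stage 2."""
    return base_lr * scale if is_pol_cnet else base_lr

def blend_weight(step, warmup_steps):
    """Option 2: linear ramp from 0 (GT mask only) to 1 (Pol statistics only)."""
    if warmup_steps <= 0:
        return 1.0
    return min(max(step / warmup_steps, 0.0), 1.0)

def blend_pol_input(gt_input, stats_input, alpha):
    """Convex mix of the two pol_input variants; works on floats or NumPy arrays."""
    return (1.0 - alpha) * gt_input + alpha * stats_input
```

With the blend, the Pol CNet sees a gradually shifting input distribution instead of an abrupt switch at the Stage 1 to Stage 2 boundary.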
5. Tensor Dimensions
| Tensor | Dimensions / Setting | Description |
|---|---|---|
| left / right | (B, 3, H, W) | Input polarization image pair |
| fmap_left / fmap_right | (B, C, ·, ·) | Output feature maps of fnet |
| Cost Volume corr[x,d] | Indexed by disparity | dot product |
| Pol Volume pol[x,d] | Indexed by disparity | left[x] - right[x-d] |
| pol_input | (B, 2, ·, ·) | 2 channels (Pretrain: [GT_mask, zeros]; Finetune: [pol_max, pol_var]) |
| rgb_ctx / pol_ctx | (B, ·, ·, ·) | The two context branches |
| context | (B, ·, ·, ·) | concat(rgb_ctx, pol_ctx) |
| hidden | (B, ·, ·, ·) | GRU initial hidden state (from rgb_cnet) |
| corr_feat / pol_feat | (B, ·, ·, ·) | Lookup results of the two volumes |
| disp / delta_disp | (B, 1, ·, ·) | Disparity and its iterative residual |
6. Design Decisions and Rationale
| Decision | Choice | Rationale |
|---|---|---|
| Cost + Pol fusion | Concat | Simple and stable; let the encoder learn the weighting |
| Pol CNet input channels | 2 channels | Zero-padded in Pretrain, max+var in Finetune |
| Pol Volume statistics | max + var | max = peak polarization difference, var = signal stability |
| Transition strategy | Lower LR | Avoid overwriting what was learned in Stage 1 |
Design Principles
- No GT at inference: training may cheat, but inference must stand alone.
- No extra inputs: only left (I∥) and right (I⊥) are used.
- No degradation on non-glass regions: the new design must only add value.
- The Pol Volume is the savior: it is the only reliable signal over glass.
Monitoring
Non-glass degradation monitoring:
metrics = {
    'epe_total': ...,
    'epe_glass': ...,
    'epe_non_glass': ...,  # key metric to monitor
}
# baseline: assumed to be the non-glass EPE of the reference model
if metrics['epe_non_glass'] > baseline * 1.05:
    print("WARNING: Non-glass degradation!")
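One way to compute the region-wise metrics is sketched below. It assumes EPE is the mean absolute disparity error, that a boolean glass mask is available at evaluation time, and that the baseline non-glass EPE comes from a reference run. The function names and the `tolerance` parameter are hypothetical.

```python
import numpy as np

def region_epe(disp_pred, disp_gt, glass_mask):
    """Mean absolute disparity error over all, glass, and non-glass pixels."""
    err = np.abs(disp_pred - disp_gt)
    glass = glass_mask.astype(bool)
    return {
        "epe_total": float(err.mean()),
        "epe_glass": float(err[glass].mean()) if glass.any() else float("nan"),
        "epe_non_glass": float(err[~glass].mean()) if (~glass).any() else float("nan"),
    }

def non_glass_degraded(metrics, baseline_epe_non_glass, tolerance=1.05):
    """True when non-glass EPE exceeds the baseline by more than 5%."""
    return metrics["epe_non_glass"] > baseline_epe_non_glass * tolerance
```

Regions with no pixels report `nan`, which never triggers the warning because comparisons with `nan` evaluate to false.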
7. Polarization Injection Points
| Injection Point | Phase | Form | Description |
|---|---|---|---|
| Pol Volume → motion encoder | Phase 4 (GRU iteration) | motion = encoder(concat(corr_feat, pol_feat, disp)) | Concat-fused with the Cost Volume every iteration |
| pol_input → pol_cnet → context | Phase 3 (Context) | context = concat(rgb_ctx, pol_ctx) | Polarization enters the context via the Pol CNet |
Polarization has two entry points: (1) the Pol Volume goes directly into the GRU loop to complement the Cost Volume; (2) pol_input enters the context via the Pol CNet. The Pol Volume path is always GT-free; the pol_input path uses the GT mask in Stage 1 and becomes GT-free from Stage 2 onward.
8. Highlights
- Complementary Dual Volume design: the Pol Volume fills the blind spot where the Cost Volume fails on glass; their reliability is mutually exclusive, covering both glass and non-glass regions.
- Fully self-sufficient at inference: both volumes are GT-free, so training can cheat while inference still runs independently from only left (I∥) and right (I⊥) images.
- Minimalist concat fusion: no complex cross-attention or gated fusion — the two volumes are simply concatenated and the motion encoder learns the trade-off, keeping the design stable and debug-friendly.
- Two-stage training bridges oracle and real signals: Stage 1 pretrains the Pol CNet with the GT mask to establish the “correct glass context” representation; Stage 2 switches to Pol Volume statistics with a lowered learning rate for a smooth transition.
- Built-in non-glass degradation monitoring: region-wise EPE acts as a gate that flags any run where non-glass EPE rises more than 5% above the baseline, so regressions introduced by the polarization path on non-glass regions are caught early.