1. Design Goals
Background
The glass textures produced by synthetic rendering are often too rich, so a baseline RAFT-Stereo can already match glass regions without any polarization input. Such synthetic data is therefore not a fair testbed for comparing polarization against non-polarization models.
Core problem
How can we exploit the physically correct polarization signal in synthetic data without baking its structural errors (texture / correlation) into the model?
Solution idea
- The polarization contrast (I∥ vs I⊥) in synthetic data is physically correct — Fresnel equations are not affected by rendering artifacts.
- But the textural structure of synthetic data is misleading for stereo matching: real glass has almost no texture (≈ 0), whereas rendered glass does.
- Therefore: learn only the polarization semantics from synthetic data (where the glass is), not the stereo matching (how to match glass).
Entry point: Context Encoder (cnet)
- cnet is the only module in RAFT-Stereo that looks only at the left image — it produces the context + hidden state for the GRU.
- cnet learns “where to pay attention” (attention guidance), not “how to match” (correlation / GRU).
- Training glass segmentation on cnet does not touch the stereo matching pipeline — it is fully isolated.
Core idea: on top of an RGB-only Context Encoder pretraining, add a polarization-contrast side branch and pretrain cnet into a glass-aware module, with no contact at all with disparity loss.
2. Architecture (with Data Flow)
Data flow overview
| Stage | Operation | Output dimensions |
|---|---|---|
| RGB main stem | conv1(3→64,7,s2) + BN + ReLU | feat_rgb (B,64,H/2,W/2) |
| Pol side-branch stem | pol_conv(1→64,7,s2) + BN + ReLU | feat_pol (B,64,H/2,W/2) |
| Gated fusion | cat → gate_conv(128→64,1×1) → sigmoid, then feat_rgb + gate × feat_pol | feat (B,64,H/2,W/2) |
| Backbone | layer1 (64→64) → layer2 (64→96, s2) → layer3 (96→128) | (B,128,H/4,W/4) |
| Dual-head output | head_union / head_strict | Independent segmentation predictions |
3. Components and Modules
3.1 RGB main stem (conv1)
- conv1: 3→64, 7×7, stride 2.
- BN(64) + ReLU.
- Stays 3-channel input so that SceneFlow pretrained weights load directly.
3.2 Pol Side-Branch (pol_conv)
- pol_conv: 1→64, 7×7, stride 2.
- BN(64) + ReLU (the BN layer is norm_pol).
- Input is the single-channel pol_contrast.
- Designed as a side branch; does not modify the original backbone structure.
3.3 Gated Fusion
- Concatenate feat_rgb and feat_pol into (B,128,H/2,W/2).
- gate_conv: 128→64, 1×1 conv, followed by sigmoid, producing gate (B,64,H/2,W/2).
- Fusion formula: feat = feat_rgb + gate × feat_pol.
- The learned sigmoid gate lets the model decide when to inject the pol signal.
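Because gate_conv is a 1×1 convolution, the fusion acts independently at every pixel, so its arithmetic can be illustrated on a single pixel's channel vectors. The sketch below is a toy, standard-library illustration, not the project's implementation: the function name `gated_fusion_pixel`, the toy channel count C = 2, and the hand-picked weights are all assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion_pixel(feat_rgb, feat_pol, weight, bias):
    """Fuse the channel vectors of one pixel (hypothetical helper).

    feat_rgb, feat_pol: lists of C floats.
    weight: C rows of 2*C floats (a 1x1 gate_conv kernel); bias: C floats.
    """
    cat = feat_rgb + feat_pol  # channel concatenation, length 2*C
    gate = [
        sigmoid(sum(w * v for w, v in zip(row, cat)) + b)
        for row, b in zip(weight, bias)
    ]
    # feat = feat_rgb + gate * feat_pol, channel by channel
    return [r + g * p for r, g, p in zip(feat_rgb, gate, feat_pol)]


C = 2  # toy channel count; the document uses 64
feat_rgb = [0.5, -1.0]
feat_pol = [2.0, 4.0]
zero_weight = [[0.0] * (2 * C) for _ in range(C)]

# A strongly negative bias closes the gate, a strongly positive one opens it.
closed = gated_fusion_pixel(feat_rgb, feat_pol, zero_weight, [-20.0] * C)
opened = gated_fusion_pixel(feat_rgb, feat_pol, zero_weight, [20.0] * C)
print(closed)  # very close to feat_rgb
print(opened)  # very close to feat_rgb + feat_pol
```

With the gate near 0 the output falls back to the RGB features, which is why the residual form keeps the RGB path intact when the polarization signal is not useful.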
3.4 Backbone
- layer1: 64→64.
- layer2: 64→96, stride 2.
- layer3: 96→128.
- Output (B,128,H/4,W/4).
3.5 Dual heads: head_union / head_strict
- head_union: supervised against GT mask (union version).
- head_strict: supervised against GT mask_strict (strict version).
- Both head losses are BCE + Dice and do not touch the disparity loss at all.
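As a rough illustration of the BCE + Dice supervision, the following standard-library sketch computes the combined loss on a flattened mask. The equal weighting of the two terms, the smoothing constant, and the name `bce_dice_loss` are assumptions for the example; the document does not specify them.

```python
import math


def bce_dice_loss(logits, targets, smooth=1.0):
    """BCE + Dice over a flattened mask (hypothetical helper).

    logits: raw head outputs; targets: 0/1 ground-truth labels.
    Equal weighting and smooth=1.0 are illustrative assumptions.
    """
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    eps = 1e-7  # keeps log() finite when a probability saturates
    bce = -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    ) / len(targets)
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return bce + dice


# Toy 4-pixel mask: one confident correct prediction, one confident wrong one.
mask = [1, 0, 1, 0]
good = bce_dice_loss([8.0, -8.0, 8.0, -8.0], mask)
bad = bce_dice_loss([-8.0, 8.0, -8.0, 8.0], mask)
print(good, bad)  # good is near zero, bad is much larger
```

Under this sketch, each head would apply the same function to its own target (head_union against mask, head_strict against mask_strict), and no disparity term appears anywhere in the loss.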
4. Pol Contrast Computation
Synthetic-data stage: warp the right image (I⊥) to the left viewpoint using the GT disparity, then compute:
pol_contrast = |I∥_gray - warp(I⊥_gray, disp_GT)| / (I∥_gray + warp(I⊥_gray, disp_GT) + ε)
Real-world stage: switch to a two-pass design (Pass 1 coarse disparity → warp → pol contrast → Pass 2 refinement).
Notes:
- padding_mode='zeros': for out-of-bound warps, warped_right = 0 → pol_contrast ≈ 1.0 (looks like high polarization contrast). This is acceptable in synthetic pretraining (GT glass masks do not cover large occlusion regions).
- No clamping: preserves physical intuition; BN layers handle extreme values.
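The warp-then-contrast computation and the zero-padding side effect can be sketched on a single scanline. This is a minimal standard-library illustration rather than the actual implementation: it assumes rectified stereo where left pixel x corresponds to right pixel x - disparity, uses linear interpolation with zeros outside the image, and the names `warp_row` and `pol_contrast_row` are hypothetical.

```python
import math


def warp_row(right_row, disp_row):
    """Warp one right-image scanline to the left view (hypothetical helper).

    Left pixel x samples the right row at x - disp, with linear
    interpolation and zeros outside the image (zero padding).
    """
    width = len(right_row)

    def sample(i):
        return right_row[i] if 0 <= i < width else 0.0

    warped = []
    for x, d in enumerate(disp_row):
        src = x - d
        x0 = math.floor(src)
        frac = src - x0
        warped.append((1.0 - frac) * sample(x0) + frac * sample(x0 + 1))
    return warped


def pol_contrast_row(left_row, right_row, disp_row, eps=1e-6):
    """|I_par - warp(I_perp)| / (I_par + warp(I_perp) + eps), per pixel."""
    warped = warp_row(right_row, disp_row)
    return [abs(l - r) / (l + r + eps) for l, r in zip(left_row, warped)]


left = [0.4, 0.4, 0.8, 0.8]    # I_par (gray), toy values
right = [0.8, 0.2, 0.5, 0.5]   # I_perp (gray), toy values
disp = [2.0, 2.0, 2.0, 2.0]    # GT disparity in pixels

contrast = pol_contrast_row(left, right, disp)
print(contrast)
# x = 0, 1 sample outside the right image, so the contrast is close to 1.0;
# x = 2 matches exactly (contrast 0.0); x = 3 gives roughly 0.6.
```

The first two pixels show the behaviour described in the notes: zero padding makes occluded or out-of-view pixels look like high polarization contrast.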
5. Tensor Dimensions
| Tensor | Dimensions | Description |
|---|---|---|
| x_rgb | (B,3,H,W) | RGB input image |
| pol_contrast | (B,1,H,W) | Polarization contrast (single channel) |
| feat_rgb | (B,64,H/2,W/2) | conv1 output |
| feat_pol | (B,64,H/2,W/2) | pol_conv output |
| cat(feat_rgb, feat_pol) | (B,128,H/2,W/2) | After concatenation |
| gate | (B,64,H/2,W/2) | gate_conv + sigmoid output |
| feat (after fusion) | (B,64,H/2,W/2) | feat_rgb + gate × feat_pol |
| layer3 output | (B,128,H/4,W/4) | Backbone tail features |
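The spatial sizes in the table follow from the standard convolution output-size formula. The sketch below traces them for a concrete input size; the padding values (3 for the 7×7 stems, 1 for a 3×3 stride-2 convolution in layer2) are assumptions, chosen because they halve even sizes exactly, and `cnet_shapes` is a hypothetical helper name.

```python
def conv_out(size, kernel, stride, padding):
    """Standard convolution output size: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def cnet_shapes(batch, height, width):
    """Trace the tensor shapes from the table for a given input size.

    Padding values are assumptions, not taken from the document.
    """
    h2, w2 = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)  # stems, stride 2
    h4, w4 = conv_out(h2, 3, 2, 1), conv_out(w2, 3, 2, 1)         # layer2, stride 2
    return {
        "x_rgb": (batch, 3, height, width),
        "pol_contrast": (batch, 1, height, width),
        "feat_rgb": (batch, 64, h2, w2),
        "feat_pol": (batch, 64, h2, w2),
        "cat": (batch, 128, h2, w2),
        "gate": (batch, 64, h2, w2),
        "feat": (batch, 64, h2, w2),
        "layer3": (batch, 128, h4, w4),
    }


shapes = cnet_shapes(batch=2, height=384, width=512)
print(shapes["feat_rgb"])  # (2, 64, 192, 256)
print(shapes["layer3"])    # (2, 128, 96, 128)
```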
6. Design Decisions and Rationale
| Decision | Choice | Rationale |
|---|---|---|
| conv1 channel count | Keep 3ch | SceneFlow pretrained weights load directly |
| Pol input method | Side branch (1ch→64ch) | Does not modify the original backbone structure |
| Fusion mechanism | Gated fusion (learned sigmoid) | Lets the model learn when to inject the pol signal |
| Supervision signal | BCE + Dice on glass mask | Does not touch disparity loss — fully isolated |
| Pol contrast computation | GT disparity warp + normalized difference | Removes disparity noise; pure polarization signal |
Core principle: polarization is used only to train cnet’s glass-awareness (segmentation); the entire stereo matching pipeline (correlation / GRU / fnet) is left untouched, so the structural texture errors in synthetic data will not be baked into the model.
7. Polarization Injection Points
| Injection point | Location | Form | Description |
|---|---|---|---|
| pol_contrast → pol_conv | Side branch next to the cnet stem | Independent conv stem (1→64) | Does not modify the conv1 of the RGB main branch |
| feat_pol → gated fusion | Between the RGB stem and the backbone | feat = feat_rgb + gate × feat_pol, where gate is a learned sigmoid | The model learns the injection strength on its own |
Polarization only enters the Context Encoder (cnet) and is used solely for segmentation supervision; it does not enter fnet, does not enter correlation, and does not touch disparity loss.
8. Added Parameter Count
| Component | Parameters |
|---|---|
| pol_conv (1→64, 7×7) | 3,200 |
| norm_pol (BN 64) | 128 |
| gate_conv (128→64, 1×1) | 8,256 |
| Total added | 11,584 (<2% of backbone) |
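The counts in the table can be reproduced with the usual parameter formulas. The figures match only if both convolutions carry a bias term, so the sketch below assumes that; the helper names are illustrative.

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Weights (c_out * c_in * k * k) plus one optional bias per output channel."""
    return c_out * c_in * kernel * kernel + (c_out if bias else 0)


def batchnorm_params(channels):
    """Learnable scale and shift per channel (running stats are not parameters)."""
    return 2 * channels


added = {
    "pol_conv": conv2d_params(1, 64, 7),     # 64*1*49 + 64
    "norm_pol": batchnorm_params(64),        # 2*64
    "gate_conv": conv2d_params(128, 64, 1),  # 64*128 + 64
}
total = sum(added.values())
print(added)  # {'pol_conv': 3200, 'norm_pol': 128, 'gate_conv': 8256}
print(total)  # 11584
```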
9. Highlights
- Decoupling semantics from structure: only the physically correct polarization semantics of “where is the glass” is learned from synthetic data; the structural cue of “how to match glass” is deliberately not learned, avoiding baking rendering artifacts into the model.
- Fully isolated stereo pipeline: polarization enters only cnet and is supervised with BCE+Dice glass segmentation; correlation / GRU / fnet and the disparity loss are not touched at all.
- Non-destructive side branch: the RGB main branch stays 3-channel so SceneFlow pretrained weights load directly; polarization joins via an independent 1→64 conv stem without modifying the original backbone structure.
- Learned gated fusion: feat_rgb + gate × feat_pol with a learned sigmoid gate lets the model decide when to inject the polarization signal.
- Extremely lightweight: about 11.6K added parameters, less than 2% of the backbone, so virtually no extra cost.