1. Design Goals
If a polarization stereo matching system adopts a Dual-Stream design, it must use GT disparity to align the left/right images during training before extracting polarization information. This creates the “Oracle-Real Gap” problem:
- During training, GT disparity is used for alignment (Oracle); at inference there is no GT, and only the model’s own estimated disparity is available.
- The two are inconsistent → the polarization signal learned during training degrades at inference.
- Switching to the model’s estimated disparity for alignment runs into the other side of the Oracle-Real Gap problem.
The design goal of this architecture is to eliminate the Oracle-Real Gap: use the model’s own Pass 1 result for alignment so that training and inference are fully consistent.
Core idea: Pass 1 performs geometry search and produces disp₁; after alignment, polarization information is extracted; Pass 2 performs pol-aware refinement.
2. Architecture (with Data Flow)
Three Stages of the Data Flow
| Stage | Task | Iterations | Output | Loss |
|---|---|---|---|---|
| Pass 1 (geometry search) | Stereo matching / geometry search | N = 12 | disp₁ | L₁ = 0.3 × L(disp₁, GT) |
| Between (pol extraction) | warp + normalized contrast + encoding | — | pol_feat | — |
| Pass 2 (pol-aware refinement) | Local refinement | M = 4~6 | disp₂ | L₂ = 1.0 × L(disp₂, GT) |
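The three stages can be sketched as a framework-free skeleton. This is only an illustration of call order, not the actual implementation: `TwoPassModules`, `two_pass_forward`, and every field name are hypothetical, each callable stands in for a real network block, the V2-C pol_corr term inside Pass 1 is omitted, and reusing the Pass 1 context/hidden as the Pass 2 bases is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class TwoPassModules:
    """Hypothetical bundle of callables, one per block named in the table."""
    fnet: Callable[[Any], Any]
    cnet: Callable[[Any], Tuple[Any, Any]]
    corr_block: Callable[[Any, Any], Any]
    gru1: Callable[..., Any]
    detach: Callable[[Any], Any]
    warp: Callable[[Any, Any], Any]
    pol_diff: Callable[[Any, Any], Any]
    pol_encoder: Callable[[Any], Any]
    inject: Callable[[Any, Any, Any], Tuple[Any, Any]]
    gru2: Callable[..., Any]


def two_pass_forward(m, left, right, n_iters=12, m_iters=4):
    # Shared work: computed once and reused by both passes.
    fmap1, fmap2 = m.fnet(left), m.fnet(right)
    corr_pyramid = m.corr_block(fmap1, fmap2)
    context1, hidden1 = m.cnet(left)

    # Pass 1: geometry search produces disp1.
    disp1 = m.gru1(corr_pyramid, context1, hidden1, iters=n_iters)

    # Between: align with the model's own estimate (no GT), then extract pol.
    disp1_sg = m.detach(disp1)  # stop-gradient copy of disp1
    right_warped = m.warp(right, disp1_sg)
    pol_feat = m.pol_encoder(m.pol_diff(left, right_warped))

    # Pass 2: refinement starts from disp1, with pol injected additively.
    context2, hidden2 = m.inject(context1, hidden1, pol_feat)
    disp2 = m.gru2(corr_pyramid, context2, hidden2,
                   init_disp=disp1_sg, iters=m_iters)
    return disp1, disp2


# Trace-only demo: every stub records its name and returns a placeholder.
calls: List[str] = []


def _stub(name, result):
    def fn(*args, **kwargs):
        calls.append(name)
        return result
    return fn


modules = TwoPassModules(
    fnet=_stub("fnet", "fmap"),
    cnet=_stub("cnet", ("context", "hidden")),
    corr_block=_stub("corr_block", "corr_pyramid"),
    gru1=_stub("gru1", "disp1"),
    detach=_stub("detach", "disp1_sg"),
    warp=_stub("warp", "right_warped"),
    pol_diff=_stub("pol_diff", "pol_diff"),
    pol_encoder=_stub("pol_encoder", "pol_feat"),
    inject=_stub("inject", ("context2", "hidden2")),
    gru2=_stub("gru2", "disp2"),
)
disp1, disp2 = two_pass_forward(modules, "left", "right")
print(calls)
```

The same function body runs in training and at inference, which is the point of the design: no branch ever consumes GT disparity for alignment.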
3. Components and Modules
3.1 fnet (shared Feature Encoder)
- Runs once on each of the left and right images, producing `fmap1` and `fmap2`.
- Computed only once, shared between Pass 1 and Pass 2.
- fnet is not modified.
3.2 cnet (Context Encoder)
- Pass 1: `cnet(left)` → context₁, hidden₁, the original RGB context.
- Pass 2: the bases produced by `cnet(left)` and `cnet_h(left)`, with pol_feat additively injected.
3.3 CorrBlock (Correlation Pyramid)
- `corr_pyramid = CorrBlock(fmap1, fmap2)`, computed only once.
- Used by Pass 1: vanilla corr + V2-C pol_corr.
- Used by Pass 2: reuses the same `corr_pyramid`, but only vanilla corr, without pol_corr.
3.4 V2-C pol_corr (polarization guidance inside Pass 1)
- Injects polarization into the correlation space of Pass 1 via gradient gating + scheduled residual.
- Exists only in Pass 1; not retained in Pass 2.
3.5 PolEncoder
- Input: `pol_diff` (normalized contrast).
- Output: `pol_feat` with shape `(B, Cp, H, W)`, `Cp = 16~32`.
- Structure: 2–3 layers of 3×3 conv, no downsampling.
- The task only requires spatial smoothing + channel projection (a tiny network); it does not need to learn normalization — the input is already a clean degree-of-polarization signal. Essentially the output is a glass indicator map.
3.6 Wc / Wh (projection layers)
- `Wc`: projects `pol_feat` (Cp channels) to the context dimension (128), additively injecting into `context₂`.
- `Wh`: projects `pol_feat` (Cp channels) to the hidden dimension (128), additively injecting into `hidden₂`.
- Both are 1×1 conv.
3.7 GRU₁ / GRU₂
- Same structure, independent weights.
- GRU₁: update unit for Pass 1’s geometry search.
- GRU₂: update unit for Pass 2’s pol-aware refinement.
4. Tensor Dimensions
| Tensor | Dimensions | Description |
|---|---|---|
| left / right | (B, 3, H, W) | Input polarization image pair |
| fmap1 / fmap2 | (B, C, H/?, W/?) | fnet output feature maps (shared) |
| context₁ / hidden₁ | (B, 128, ·, ·) | Pass 1 original context / hidden |
| disp₁ | (B, 1, H, W) | Pass 1 disparity output |
| right_warped | (B, 3, H, W) | Right warped with disp₁.detach() |
| pol_diff | (B, 3, H, W) | Normalized contrast (left − right_warped)/(left + right_warped + ε) |
| pol_feat | (B, Cp, H, W) | PolEncoder output, Cp = 16~32 |
| Wc(pol_feat) | (B, 128, ·, ·) | Projected to context dimension |
| Wh(pol_feat) | (B, 128, ·, ·) | Projected to hidden dimension |
| context₂ / hidden₂ | (B, 128, ·, ·) | Pass 2 context / hidden after pol injection |
| disp₂ | (B, 1, H, W) | Pass 2 disparity output |
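To make the `right_warped` and `pol_diff` rows concrete, here is a minimal single-scanline sketch in plain Python. The function names, the sampling convention `right[x - disp[x]]`, the border clamping, and the value of ε are all assumptions made for illustration; a real implementation would operate on `(B, 3, H, W)` tensors.

```python
from typing import List


def warp_right_scanline(right: List[float], disp: List[float]) -> List[float]:
    """Sample right[x - disp[x]] with linear interpolation, clamped at borders."""
    w = len(right)
    out = []
    for x in range(w):
        src = min(max(x - disp[x], 0.0), w - 1.0)
        x0 = int(src)              # floor, because src >= 0
        x1 = min(x0 + 1, w - 1)
        t = src - x0
        out.append((1.0 - t) * right[x0] + t * right[x1])
    return out


def normalized_contrast(left: List[float], right_warped: List[float],
                        eps: float = 1e-6) -> List[float]:
    """(left - right_warped) / (left + right_warped + eps), per pixel."""
    return [(l - r) / (l + r + eps) for l, r in zip(left, right_warped)]


right = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
disp = [2.0] * 6                       # stand-in for disp1 (already detached)
warped = warp_right_scanline(right, disp)

# The left view matches the warped right view everywhere except x = 4,
# where it is three times brighter (a stand-in for a polarizing surface).
left = [1.0, 1.0, 1.0, 2.0, 9.0, 4.0]
pol = normalized_contrast(left, warped)

# Scaling both views by the same gain leaves the contrast almost unchanged.
pol_scaled = normalized_contrast([4.0 * v for v in left],
                                 [4.0 * v for v in warped])
print(pol[4], pol_scaled[4])
```

With a correct disparity the contrast is zero on matching pixels and close to 0.5 at the brighter pixel, and a common gain on both views changes the result only through ε.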
5. Hyperparameters
| Hyperparameter | Value | Description |
|---|---|---|
| Pass 1 iterations N | 12 | Number of geometry-search GRU iterations |
| Pass 2 iterations M | 4~6 | Number of refinement GRU iterations, roughly 1/3~1/2 of N |
| L₁ weight | 0.3 | Pass 1 loss weight |
| L₂ weight | 1.0 | Pass 2 loss weight |
| PolEncoder channels Cp | 16~32 | Number of polarization feature channels |
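The hyperparameters above could be collected in one place, together with the weighted sum of the two supervised losses. This is a sketch under stated assumptions: `TwoPassConfig`, `mean_abs_error`, and `total_loss` are hypothetical names, the defaults pick the low end of each quoted range, and mean absolute error merely stands in for the unspecified base loss L.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TwoPassConfig:
    n_iters: int = 12        # Pass 1 geometry-search iterations (N)
    m_iters: int = 4         # Pass 2 refinement iterations (M), 4 to 6
    l1_weight: float = 0.3   # weight on L(disp1, GT)
    l2_weight: float = 1.0   # weight on L(disp2, GT)
    pol_channels: int = 16   # Cp, 16 to 32

    def __post_init__(self):
        # Keep values inside the ranges quoted in the hyperparameter table.
        if not 4 <= self.m_iters <= 6:
            raise ValueError("m_iters is expected to be in 4..6")
        if not 16 <= self.pol_channels <= 32:
            raise ValueError("pol_channels is expected to be in 16..32")


def mean_abs_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Placeholder base loss; the document does not specify L."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def total_loss(disp1, disp2, gt, cfg: TwoPassConfig) -> float:
    """Weighted sum: l1_weight * L(disp1, GT) + l2_weight * L(disp2, GT)."""
    return (cfg.l1_weight * mean_abs_error(disp1, gt)
            + cfg.l2_weight * mean_abs_error(disp2, gt))


cfg = TwoPassConfig()
gt = [10.0, 20.0, 30.0, 40.0]
disp1 = [12.0, 22.0, 32.0, 42.0]   # coarse Pass 1 estimate, error 2.0
disp2 = [10.5, 20.5, 30.5, 40.5]   # refined Pass 2 estimate, error 0.5
loss = total_loss(disp1, disp2, gt, cfg)
print(loss)
```

Here the total is 0.3 × 2.0 + 1.0 × 0.5, so the Pass 1 error still contributes even though the refined output dominates.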
6. Design Decisions and Rationale (8 items)
1. pol_diff = normalized contrast + PolEncoder
Uses (I∥ − I⊥) / (I∥ + I⊥ + ε) instead of the raw difference, with the left image as I∥ and the disp₁-warped right image as I⊥. The ratio is physically close to DoLP and scale-invariant (unaffected by exposure / white balance). It is high on glass, low on diffuse surfaces, and effectively random where disp₁ is wrong. PolEncoder therefore only needs to do spatial smoothing + channel projection (a tiny network); it does not need to learn normalization, because the input is already a clean degree-of-polarization signal. Its output is essentially a glass indicator map.
2. Inject into both context and hidden (additively)
hidden is the GRU’s initial memory. If only context is modified, the GRU update rule changes, but the state it starts from is still a pure RGB prior → the GRU can see pol but pol cannot shift its starting point. Nopol-safe: pol_diff = 0 → PolEncoder(0) ≈ 0 → hidden₂ ≈ base_hidden (this holds as long as the PolEncoder and Wh biases stay near zero).
3. Pass 2 starts from disp₁.detach() (not zero)
All of Pass 2’s assumptions rest on “the left and right are aligned to the same physical point”. Starting from zero = re-doing stereo matching → the pol context is forced to explain geometry → waste. Pass 2’s task is refinement, not re-search.
4. GRU₁ / GRU₂ have independent weights (same structure, different weights)
Pass 1 learns photometric consistency (geometry search) and Pass 2 learns material cues (local refinement). Sharing forces Pass 2 to use the search strategy, flattening the influence of polarization.
5. Both disp₁ and disp₂ are supervised (L₁=0.3, L₂=1.0)
disp₁.detach() cuts the gradient → Pass 1 cannot receive feedback from Pass 2 → it needs its own loss. Otherwise Pass 1 = blind guess → warp quality is unstable → pol_diff becomes garbage.
6. Pass 2 does not retain V2-C pol_corr (vanilla corr)
The disparity of Pass 2 is already near disp₁ and no longer performs a wide-range search. pol_corr is built for search; since Pass 2 no longer searches, it merely re-encodes information and adds noise.
7. Pass 2 iterations M = 4~6 (< Pass 1’s N=12)
The refinement pass does not need many iterations: (1) risk of over-correction (pol is a local cue; too many iterations over-correct); (2) the model starts to “trust pol more than geometry”; (3) errors are already in the low-frequency band, so late iterations only jitter. Rule of thumb: refinement ≈ 1/3~1/2 of search.
8. Training strategy: end-to-end + soft-frozen Pass 2 in the early stage
Neither two-stage training nor releasing everything from the start is used. The pol injection coefficient for L₂ ramps from 0 to 1.0 in the early stage, letting Pass 1 stabilize geometry search first. This can be paired with a slight LR reduction on Pass 2 (not hard suppression). Directly freezing Pass 2 parameters is not recommended.
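The soft-freeze in item 8 could look like the following schedule. It is a sketch only: the linear ramp shape, the `ramp_steps` length, the `early_scale` factor of 0.5, and both function names are assumptions, since the document specifies only that the coefficient goes from 0 to 1.0 and that the LR reduction is slight.

```python
def pol_injection_coeff(step: int, ramp_steps: int) -> float:
    """Linear ramp from 0.0 at step 0 to 1.0 at ramp_steps, then held at 1.0."""
    if ramp_steps <= 0:
        return 1.0  # no ramp requested: inject at full strength
    return min(max(step / ramp_steps, 0.0), 1.0)


def pass2_lr(base_lr: float, step: int, ramp_steps: int,
             early_scale: float = 0.5) -> float:
    """Mildly reduced learning rate for Pass 2 while the ramp is active."""
    return base_lr * (early_scale if step < ramp_steps else 1.0)


for step in (0, 500, 1000, 2000):
    print(step, pol_injection_coeff(step, 1000), pass2_lr(1e-4, step, 1000))
```

Because the coefficient is never used to zero out gradients on Pass 2 parameters, Pass 2 keeps training throughout; only the strength of the injected polarization term changes.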
7. Polarization Injection Points
Polarization injection in the Two-Pass architecture occurs in both passes, at three injection points:
| Injection point | Location | Form | Description |
|---|---|---|---|
| Pass 1 — pol_corr | correlation space | gradient gating + scheduled residual | Polarization guidance inside Pass 1, assisting geometry search |
| Pass 2 — context₂ | Context Encoder output | additive: cnet(left) + Wc(pol_feat) | Affects the GRU update rule |
| Pass 2 — hidden₂ | GRU initial hidden | additive: cnet_h(left) + Wh(pol_feat) | Affects the GRU’s initial memory state |
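The two additive Pass 2 injections have the same form, which can be sketched at a single pixel where a 1×1 conv reduces to a matrix-vector product over channels. The helper names, the toy sizes (2 → 3 channels instead of Cp → 128), the zero bias, and the optional `alpha` coefficient are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def conv1x1(weight: List[Vector], bias: Vector, x: Vector) -> Vector:
    """A 1x1 conv at one pixel: one dot product per output channel, plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def inject(base: Vector, pol_feat: Vector, weight: List[Vector],
           bias: Vector, alpha: float = 1.0) -> Vector:
    """base + alpha * W(pol_feat), the additive form used for context and hidden."""
    proj = conv1x1(weight, bias, pol_feat)
    return [b + alpha * p for b, p in zip(base, proj)]


# Toy projection: 2 pol channels -> 3 context/hidden channels, zero bias.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
base_hidden = [0.5, -1.0, 2.0]

with_pol = inject(base_hidden, [0.2, 0.1], weight, bias)
no_pol = inject(base_hidden, [0.0, 0.0], weight, bias)   # nopol input
ramp_start = inject(base_hidden, [0.2, 0.1], weight, bias, alpha=0.0)
print(with_pol, no_pol, ramp_start)
```

With a zero polarization feature and zero bias the output equals the base exactly, which is the nopol-safe behavior; a nonzero bias would shift the base even without any polarization signal.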
Relation to design principles
This architecture follows three polarization-injection principles:
- Polarization does not enter fnet: satisfied (fnet untouched).
- Polarization is not limited to spatial gating: satisfied (context injects into the GRU loop, and the GRU operates in disparity-aware space).
- Polarization only acts in disparity-aware space: satisfied (context is injected into the GRU every iteration, where the GRU is coupled with correlation lookup).
8. New Parameters and Computational Cost
| Module | Parameters | Description |
|---|---|---|
| PolEncoder | ~1K–5K | 3→Cp (16~32), 2–3 layers of 3×3 conv, no downsampling |
| Wc | ~2K–4K | Cp→128, 1×1 conv |
| Wh | ~2K–4K | Cp→128, 1×1 conv |
| GRU₂ | ~same as GRU₁ | Same structure, independent weights |
Computational cost: ~1.5× baseline rather than 2.2×, because Pass 2 runs only 4–6 iterations.
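The figures above can be sanity-checked with a little arithmetic. The sketch assumes plain convolutions with bias, a constant width of Cp in every PolEncoder layer, and a cost model that counts only GRU iterations (it ignores the shared encoders as well as the warp and PolEncoder overhead); all function names are hypothetical.

```python
def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights plus optional bias of a standard 2-D convolution."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def pol_encoder_params(cp: int, layers: int, c_in: int = 3) -> int:
    """First layer c_in -> cp, then (layers - 1) layers cp -> cp, all 3x3."""
    return conv2d_params(c_in, cp, 3) + (layers - 1) * conv2d_params(cp, cp, 3)


def relative_cost(n_iters: int, m_iters: int) -> float:
    """Iteration-only estimate: (N + M) / N relative to a single-pass baseline."""
    return (n_iters + m_iters) / n_iters


for cp in (16, 32):
    for layers in (2, 3):
        print("PolEncoder", cp, layers, pol_encoder_params(cp, layers))
    print("Wc or Wh", cp, conv2d_params(cp, 128, 1))

for m_iters in (4, 6):
    print("cost", m_iters, relative_cost(12, m_iters))
```

Under these assumptions the 1×1 projections come out at 2,176 and 4,224 parameters, in line with the table, and the iteration-only cost is about 1.33× to 1.5×. The PolEncoder count ranges from 2,768 (Cp = 16, 2 layers) to 19,392 (Cp = 32, 3 layers), so the ~1K–5K estimate appears to correspond to the smaller configurations or to narrower hidden layers than assumed here.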
9. Highlights
- Self-alignment eliminates the Oracle-Real Gap: the model uses its own Pass 1 disp₁ to align the left/right images, so training and inference rely on alignments from the same source, fully removing the bias caused by relying on GT disparity.
- Division of labor between geometry and material: Pass 1 focuses on geometry search and Pass 2 focuses on polarization material cues. The independent weights of GRU₁ / GRU₂ let each task be optimized separately.
- Scale-invariant polarization signal: normalized contrast is physically close to DoLP and is unaffected by exposure or white balance, so PolEncoder can be extremely lightweight.
- Refine rather than re-search: Pass 2 starts from disp₁, reuses corr_pyramid, and runs only 1/3~1/2 the iterations of Pass 1. This avoids over-correction; the total computational cost is only ~1.5× baseline.
- Dual injection into context and hidden: polarization simultaneously affects the GRU’s update rule and its initial memory state, so polarization can be “both seen and acted upon”, while gracefully degrading to a safe baseline under nopol input.