Two-Pass Polarization Stereo Matching Architecture

1. Design Goals

If a polarization stereo matching system adopts a Dual-Stream design, it must use GT disparity to align the left/right images during training before extracting polarization information. This creates the “Oracle-Real Gap” problem:

During training, GT disparity is used for alignment (Oracle); at inference there is no GT, and only the model’s own estimated disparity is available.
The two are inconsistent → the polarization signal learned during training degrades at inference.
Switching to the model’s estimated disparity for alignment only at inference does not help: it is the other side of the same Oracle-Real Gap, because the polarization branch was trained on GT-aligned inputs.

The design goal of this architecture is to eliminate the Oracle-Real Gap: use the model’s own Pass 1 result for alignment so that training and inference are fully consistent.

Core idea: Pass 1 performs geometry search and produces disp₁; after alignment, polarization information is extracted; Pass 2 performs pol-aware refinement.

2. Architecture (with Data Flow)

[Figure: Two-Pass polarization stereo matching data flow]

Three Stages of the Data Flow

Stage	Task	Iterations	Output	Loss
Pass 1 (geometry search)	Stereo matching / geometry search	N = 12	disp₁	L₁ = 0.3 × L(disp₁, GT)
Between (pol extraction)	warp + normalized contrast + encoding	—	pol_feat	—
Pass 2 (pol-aware refinement)	Local refinement	M = 4~6	disp₂	L₂ = 1.0 × L(disp₂, GT)
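
The three stages in the table can be sketched as a single forward function. This is a minimal data-flow sketch, not the actual implementation: the dictionary `m` and all of its keys are hypothetical stand-ins for the real modules, and reusing context₁ / hidden₁ as the Pass 2 bases is an assumption.

```python
def two_pass_forward(left, right, m, n_iters=12, m_iters=4):
    """Data-flow sketch. `m` is a hypothetical dict of callables, one per module."""
    # Shared work: computed once, reused by both passes.
    fmap1, fmap2 = m["fnet"](left), m["fnet"](right)
    corr_pyramid = m["corr_block"](fmap1, fmap2)
    context1, hidden1 = m["cnet"](left)

    # Pass 1: geometry search for N iterations, producing disp1.
    disp1 = m["pass1"](corr_pyramid, context1, hidden1, n_iters)

    # Between: align right to left with a gradient-stopped disp1,
    # then turn the normalized contrast into pol_feat.
    disp1_fixed = m["detach"](disp1)
    right_warped = m["warp"](right, disp1_fixed)
    pol_diff = m["normalized_contrast"](left, right_warped)
    pol_feat = m["pol_encoder"](pol_diff)

    # Pass 2: additive injection into context and hidden, then M refinement
    # iterations that start from disp1 and reuse the same corr_pyramid.
    context2 = context1 + m["wc"](pol_feat)
    hidden2 = hidden1 + m["wh"](pol_feat)
    disp2 = m["pass2"](corr_pyramid, context2, hidden2, disp1_fixed, m_iters)

    # Both outputs are returned because both are supervised.
    return disp1, disp2
```

The same function serves training and inference, which is the point of the design: no GT disparity appears anywhere in the forward path.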

3. Components and Modules

3.1 fnet (shared Feature Encoder)

Runs once for each of left and right, producing fmap1 and fmap2.
Computed only once, shared between Pass 1 and Pass 2.
fnet is not modified.

3.2 cnet (Context Encoder)

Pass 1: cnet(left) → context₁, hidden₁ (the original RGB context and hidden state).
Pass 2: context₂ = cnet(left) + Wc(pol_feat) and hidden₂ = cnet_h(left) + Wh(pol_feat), i.e. the RGB bases with pol_feat additively injected.

3.3 CorrBlock (Correlation Pyramid)

corr_pyramid = CorrBlock(fmap1, fmap2), computed only once.
Used by Pass 1: vanilla corr + V2-C pol_corr.
Used by Pass 2: reuses the same corr_pyramid, but only vanilla corr, without pol_corr.

3.4 V2-C pol_corr (polarization guidance inside Pass 1)

Injects polarization into the correlation space of Pass 1 via gradient gating + scheduled residual.
Exists only in Pass 1; not retained in Pass 2.

3.5 PolEncoder

Input: pol_diff (normalized contrast).
Output: pol_feat with shape (B, Cp, H, W), Cp = 16~32.
Structure: 2–3 layers of 3×3 conv, no downsampling.
The task only requires spatial smoothing + channel projection (a tiny network); it does not need to learn normalization — the input is already a clean degree-of-polarization signal. Essentially the output is a glass indicator map.

3.6 Wc / Wh (projection layers)

Wc: projects pol_feat (Cp channels) to the context dimension (128), additively injecting into context₂.
Wh: projects pol_feat (Cp channels) to the hidden dimension (128), additively injecting into hidden₂.
Both are 1×1 conv.

3.7 GRU₁ / GRU₂

Same structure, independent weights.
GRU₁: update unit for Pass 1’s geometry search.
GRU₂: update unit for Pass 2’s pol-aware refinement.

4. Tensor Dimensions

Tensor	Dimensions	Description
left / right	(B, 3, H, W)	Input polarization image pair
fmap1 / fmap2	(B, C, H/?, W/?)	fnet output feature maps (shared)
context₁ / hidden₁	(B, 128, ·, ·)	Pass 1 original context / hidden
disp₁	(B, 1, H, W)	Pass 1 disparity output
right_warped	(B, 3, H, W)	Right warped with disp₁.detach()
pol_diff	(B, 3, H, W)	Normalized contrast `(left − right_warped)/(left + right_warped + ε)`
pol_feat	(B, Cp, H, W)	PolEncoder output, Cp = 16~32
Wc(pol_feat)	(B, 128, ·, ·)	Projected to context dimension
Wh(pol_feat)	(B, 128, ·, ·)	Projected to hidden dimension
context₂ / hidden₂	(B, 128, ·, ·)	Pass 2 context / hidden after pol injection
disp₂	(B, 1, H, W)	Pass 2 disparity output
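
The right_warped row can be sketched as a horizontal resampling of the right image. This is a minimal NumPy sketch under assumptions that are not stated above: rectified images, a left-referenced disparity (a left pixel at column x matches the right pixel at column x − disp), linear interpolation, and border clamping. The function name is hypothetical. In an autograd framework, disp₁ would be passed in detached, as the table notes.

```python
import numpy as np

def warp_right_to_left(right, disp):
    """right: (B, C, H, W); disp: (B, 1, H, W) in pixels. Returns (B, C, H, W)."""
    B, C, H, W = right.shape
    xs = np.arange(W, dtype=np.float64)[None, None, None, :]
    # Source column in the right image for every left pixel, clamped to the image.
    src = np.clip(xs - disp, 0.0, W - 1.0)
    x0 = np.floor(src).astype(np.int64)
    x1 = np.minimum(x0 + 1, W - 1)
    w1 = src - x0  # linear interpolation weight of the right-hand tap
    tap0 = np.take_along_axis(right, np.broadcast_to(x0, (B, C, H, W)), axis=3)
    tap1 = np.take_along_axis(right, np.broadcast_to(x1, (B, C, H, W)), axis=3)
    return (1.0 - w1) * tap0 + w1 * tap1
```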

5. Hyperparameters

Hyperparameter	Value	Description
Pass 1 iterations N	12	Number of geometry-search GRU iterations
Pass 2 iterations M	4~6	Number of refinement GRU iterations, roughly 1/3~1/2 of N
L₁ weight	0.3	Pass 1 loss weight
L₂ weight	1.0	Pass 2 loss weight
PolEncoder channels Cp	16~32	Number of polarization feature channels

6. Design Decisions and Rationale (8 items)

1. pol_diff = normalized contrast + PolEncoder

Uses (I∥ − I⊥) / (I∥ + I⊥ + ε) instead of the raw difference; in the tensor table of Section 4 this is computed as (left − right_warped) / (left + right_warped + ε). It is physically close to DoLP and scale-invariant (unaffected by exposure / white balance). It is high on glass, low on diffuse surfaces, and random where disp₁ is wrong. PolEncoder only needs to do spatial smoothing + channel projection (a tiny network); it does not need to learn normalization, because the input is already a clean degree-of-polarization signal. Essentially a glass indicator map.
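
The scale invariance claimed above can be checked numerically. This is a minimal NumPy sketch; the function name and the value of ε are assumptions, and the inputs are taken to be non-negative intensities.

```python
import numpy as np

def normalized_contrast(left, right_warped, eps=1e-6):
    """(left - right_warped) / (left + right_warped + eps), elementwise."""
    return (left - right_warped) / (left + right_warped + eps)

rng = np.random.default_rng(0)
left = rng.uniform(0.1, 1.0, size=(1, 3, 4, 4))
right_warped = rng.uniform(0.1, 1.0, size=(1, 3, 4, 4))

pol_diff = normalized_contrast(left, right_warped)
# A global gain (e.g. a different exposure) applied to both views cancels out,
# up to the small effect of eps.
pol_diff_bright = normalized_contrast(3.0 * left, 3.0 * right_warped)
gain_invariant = np.allclose(pol_diff, pol_diff_bright, atol=1e-4)
```

A raw difference `left - right_warped` would instead scale with the gain, which is why the normalization is done analytically rather than left for PolEncoder to learn.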

2. Inject into both context and hidden (additively)

hidden is the GRU’s initial memory. If only context is modified, the GRU update rule changes, but the “state it starts from” is still a pure RGB prior → pol can see but cannot change anything. Nopol-safe: pol_diff=0 → PolEncoder(0)≈0 → hidden₂ ≈ base_hidden.
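
The nopol-safe property can be illustrated with a small NumPy sketch. It assumes that Wc and Wh are bias-free 1×1 convolutions and that pol_feat has already been brought to the resolution of the context tensors; neither detail is specified above, and the helper names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """Bias-free 1x1 convolution. x: (B, Cin, H, W); weight: (Cout, Cin)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def inject(base_context, base_hidden, pol_feat, wc, wh):
    """Additive injection of pol_feat into both the context and the hidden state."""
    context2 = base_context + conv1x1(pol_feat, wc)
    hidden2 = base_hidden + conv1x1(pol_feat, wh)
    return context2, hidden2
```

With `pol_feat` equal to zero, both outputs equal their bases exactly, so Pass 2 starts from the pure RGB prior. With a bias term in Wc or Wh, the equality would only hold approximately.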

3. Pass 2 starts from disp₁.detach() (not zero)

All of Pass 2’s assumptions rest on “the left and right are aligned to the same physical point”. Starting from zero = re-doing stereo matching → the pol context is forced to explain geometry → waste. Pass 2’s task is refinement, not re-search.

4. GRU₁ / GRU₂ have independent weights (same structure, different weights)

Pass 1 learns photometric consistency (geometry search) and Pass 2 learns material cues (local refinement). Sharing forces Pass 2 to use the search strategy, flattening the influence of polarization.

5. Both disp₁ and disp₂ are supervised (L₁=0.3, L₂=1.0)

disp₁.detach() cuts the gradient → Pass 1 cannot receive feedback from Pass 2 → it needs its own loss. Otherwise Pass 1 = blind guess → warp quality is unstable → pol_diff becomes garbage.
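
The combined objective can be sketched as follows. The per-pass loss L is not specified in this document; mean absolute error is used here purely as a placeholder assumption, and the function names are hypothetical.

```python
import numpy as np

def disparity_loss(pred, gt):
    """Placeholder for L(., GT): mean absolute disparity error (an assumption)."""
    return float(np.mean(np.abs(pred - gt)))

def two_pass_loss(disp1, disp2, gt, w1=0.3, w2=1.0):
    """Total loss = 0.3 * L(disp1, GT) + 1.0 * L(disp2, GT)."""
    return w1 * disparity_loss(disp1, gt) + w2 * disparity_loss(disp2, gt)
```

Because Pass 2 consumes disp₁.detach(), the second term sends no gradient into Pass 1; the first term is the only training signal Pass 1 receives.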

6. Pass 2 does not retain V2-C pol_corr (vanilla corr)

The disparity of Pass 2 is already near disp₁ and no longer performs a wide-range search. pol_corr is built for search; since Pass 2 no longer searches, it merely re-encodes information and adds noise.

7. Pass 2 iterations M = 4~6 (< Pass 1’s N=12)

The refinement pass does not need many iterations: (1) risk of over-correction (pol is a local cue; too many iterations over-correct); (2) the model starts to “trust pol more than geometry”; (3) errors are already in the low-frequency band, so late iterations only jitter. Rule of thumb: refinement ≈ 1/3~1/2 of search.

8. Training strategy: end-to-end + soft-frozen Pass 2 in the early stage

Training is end-to-end: it is neither split into two stages nor fully unconstrained from the first step. The pol injection coefficient for L₂ ramps from 0 to 1.0, letting Pass 1 stabilize geometry search first. This can be paired with a slightly lower learning rate for Pass 2 (a soft reduction, not hard suppression). Directly freezing Pass 2 parameters is not recommended.
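
The ramp can be sketched as a scalar schedule. The linear shape and the `ramp_steps` parameter are assumptions, since only the endpoints 0 and 1.0 are given above; how the coefficient is applied (to the injected terms or to the L₂ term) is also left open here.

```python
def pol_injection_coeff(step, ramp_steps):
    """Linear ramp from 0.0 at step 0 to 1.0 at ramp_steps, then held at 1.0."""
    if ramp_steps <= 0:
        return 1.0  # no warm-up requested
    return min(1.0, max(0.0, step / ramp_steps))
```

For example, with a hypothetical `ramp_steps` of 1000, the coefficient is 0.5 at step 500 and stays at 1.0 from step 1000 onward.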

7. Polarization Injection Points

Polarization injection in the Two-Pass architecture occurs at two places:

Injection point	Location	Form	Description
Pass 1 — pol_corr	correlation space	gradient gating + scheduled residual	Polarization guidance inside Pass 1, assisting geometry search
Pass 2 — context₂	Context Encoder output	additive: `cnet(left) + Wc(pol_feat)`	Affects the GRU update rule
Pass 2 — hidden₂	GRU initial hidden	additive: `cnet_h(left) + Wh(pol_feat)`	Affects the GRU’s initial memory state

Relation to design principles

This architecture follows three polarization-injection principles:

Polarization does not enter fnet: satisfied (fnet is untouched).
Polarization is not limited to spatial gating: satisfied (context is injected into the GRU loop, and the GRU operates in disparity-aware space).
Polarization only acts in disparity-aware space: satisfied (context is injected into the GRU at every iteration, where the GRU is coupled with correlation lookup).

8. New Parameters and Computational Cost

Module	Parameters	Description
PolEncoder	~1K–5K	3→Cp (16~32), 2–3 layers of 3×3 conv, no downsampling
Wc	~2K–4K	Cp→128, 1×1 conv
Wh	~2K–4K	Cp→128, 1×1 conv
GRU₂	~same as GRU₁	Same structure, independent weights

Computational cost: ~1.5× baseline rather than 2.2×, because Pass 2 runs only 4–6 iterations.
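
The figures above can be sanity-checked with a little arithmetic. This sketch assumes standard convolutions with bias, a PolEncoder whose layers are all Cp channels wide, and a cost model that counts GRU iterations only (ignoring the shared encoders and the pol extraction step). The helper names are hypothetical.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameter count of a k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def pol_encoder_params(cp, layers):
    """3 -> cp, followed by (layers - 1) layers of cp -> cp, all 3x3."""
    return conv_params(3, cp, 3) + (layers - 1) * conv_params(cp, cp, 3)

def relative_gru_iterations(n=12, m=4):
    """GRU iterations of the two-pass model relative to an n-iteration baseline."""
    return (n + m) / n

pol_small = pol_encoder_params(16, 2)   # 2768
pol_mid = pol_encoder_params(16, 3)     # 5088
proj_small = conv_params(16, 128, 1)    # 2176
proj_large = conv_params(32, 128, 1)    # 4224
cost_low = relative_gru_iterations(12, 4)
cost_high = relative_gru_iterations(12, 6)
```

Under these assumptions, Wc and Wh land in the stated ~2K–4K range, and Cp = 16 puts PolEncoder at roughly 2.8K to 5.1K parameters. With Cp = 32 and full-width layers the same formula gives a larger count, so the table's PolEncoder range presumably reflects the smaller configurations. The iteration ratio comes out between about 1.33 and 1.5 for M = 4 to 6.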

9. Highlights

Self-alignment eliminates the Oracle-Real Gap: the model uses its own Pass 1 disp₁ to align the left/right images, so training and inference rely on alignments from the same source, fully removing the bias caused by relying on GT disparity.
Division of labor between geometry and material: Pass 1 focuses on geometry search and Pass 2 focuses on polarization material cues. The independent weights of GRU₁ / GRU₂ let each task be optimized separately.
Scale-invariant polarization signal: normalized contrast is physically close to DoLP and is unaffected by exposure or white balance, so PolEncoder can be extremely lightweight.
Refine rather than re-search: Pass 2 starts from disp₁, reuses corr_pyramid, and runs only 1/3~1/2 the iterations of Pass 1. This avoids over-correction; the total computational cost is only ~1.5× baseline.
Dual injection into context and hidden: polarization simultaneously affects the GRU’s update rule and its initial memory state, so polarization can be “both seen and acted upon”, while gracefully degrading to a safe baseline under nopol input.