Implicit Polarization Input Architecture

1. Design Goals

This architecture targets the following physical scenario. On transparent surfaces such as glass, the left camera (I∥, behind a 0° polarizer) captures strong specular reflections while the right camera (I⊥, behind a 90° polarizer) suppresses them, so I∥ >> I⊥; on diffuse backgrounds, I∥ ≈ I⊥. Transparent surfaces are nearly invisible to standard depth sensing, but the polarized image pair carries an exploitable brightness difference in glass regions.
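
The brightness-difference signal can be illustrated with a minimal sketch on synthetic data. The function name, the contrast measure, and the 0.5 threshold are illustrative assumptions rather than part of the method, and the sketch treats the two views as pixel-aligned, which ignores the stereo disparity present in a real pair.

```python
import numpy as np

def polarization_difference(i_par, i_perp, eps=1e-6):
    """Return |I_par - I_perp| and a normalized contrast in [0, 1]."""
    i_par = np.asarray(i_par, dtype=np.float64)
    i_perp = np.asarray(i_perp, dtype=np.float64)
    diff = np.abs(i_par - i_perp)
    # Dividing by the total intensity makes the measure independent of exposure.
    contrast = diff / (i_par + i_perp + eps)
    return diff, contrast

# Synthetic 4x6 scene: diffuse background with equal brightness in both views.
i_par = np.full((4, 6), 0.40)
i_perp = np.full((4, 6), 0.40)

# Hypothetical glass patch: strong reflection in I_par, suppressed in I_perp.
i_par[1:3, 2:5] = 0.90
i_perp[1:3, 2:5] = 0.10

diff, contrast = polarization_difference(i_par, i_perp)
glass_mask = contrast > 0.5  # illustrative threshold, not taken from the source
print(glass_mask.astype(int))
```

On the diffuse background the contrast is zero, while inside the patch it is large, which is the cue the network is expected to pick up on its own.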

Implicit polarization input is the most direct, lowest-cost way to exploit this physical signal:

No architectural changes: still uses standard RAFT-Stereo.
No polarization encoder added: no polarization-specific module, no dual-stream design.
The only change is the input data: the polarized pair I∥ (left) and I⊥ (right) is fed in directly as ordinary left/right images.

The question this design seeks to answer is: “Without any architectural modifications, can the network learn glass-region cues from the polarized image pair on its own?” The term “implicit” refers to polarization information not being processed “explicitly” by any dedicated module, but rather being expected to be exploited “implicitly” by the network during training.

2. Architecture

In terms of architecture, implicit polarization input is equivalent to standard RAFT-Stereo; only the input differs:

[Figure: implicit polarization input data flow]

The polarized pair I∥ / I⊥ replaces the usual left/right images fed into the network, while the architecture itself is unchanged.

3. Components and Modules

This method adds no new modules; it uses only the three standard RAFT-Stereo components:

Feature Encoder (fnet): weights shared between left and right; processes I∥ and I⊥ separately.
Context Encoder (cnet): looks only at I∥ (left).
Correlation Pyramid + GRU: iteratively refines disparity.

No component handles polarization explicitly; any use of the polarization cue has to be learned implicitly during training.

4. Data Flow

The polarized pair I∥, I⊥ is fed directly as the left/right input.
I∥ and I⊥ pass through the shared-weight fnet separately → fmap1, fmap2.
I∥ passes through cnet → context + hidden state.
The Correlation Pyramid + GRU iteratively produce the disparity.
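
The matching step in this flow can be sketched in NumPy as a row-wise correlation between the two feature maps. This is a toy illustration under stated assumptions: random unit-norm features stand in for the fnet outputs, the right features are a shifted copy of the left ones, and a plain argmax stands in for the iterative GRU refinement. The names `correlation_volume` and `true_disp` are hypothetical.

```python
import numpy as np

def correlation_volume(fmap1, fmap2):
    """All-pairs correlation along each image row.

    fmap1, fmap2: arrays of shape (C, H, W).
    Returns an array of shape (H, W, W) where entry [y, x1, x2] is the
    scaled dot product of the left feature at (y, x1) and the right
    feature at (y, x2).
    """
    channels = fmap1.shape[0]
    return np.einsum("chx,chz->hxz", fmap1, fmap2) / np.sqrt(channels)

rng = np.random.default_rng(0)
C, H, W, true_disp = 32, 6, 16, 3

# Stand-in for fnet(I_par): random features, unit-normalized per pixel.
fmap1 = rng.standard_normal((C, H, W))
fmap1 /= np.linalg.norm(fmap1, axis=0, keepdims=True)

# Stand-in for fnet(I_perp): the same features shifted left by true_disp,
# so left pixel x matches right pixel x - true_disp (with wrap-around).
fmap2 = np.roll(fmap1, -true_disp, axis=2)

corr = correlation_volume(fmap1, fmap2)

# Stand-in for the GRU: pick the best-matching right column directly.
best_x2 = corr.argmax(axis=2)
disparity = np.arange(W)[None, :] - best_x2
print(disparity[0])
```

Columns to the left of `true_disp` match through the wrap-around and so report a negative value; the remaining columns recover the shift. The sketch only works because the features are distinctive, which is the property texture-less glass lacks.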

5. Tensor Dimensions

Identical to standard RAFT-Stereo:

Inputs I∥ / I⊥: 640 × 480.
fnet outputs fmap1, fmap2: downsampled feature maps.
Output disparity: matches the input resolution.

Since no module is added, the model parameter count is identical to standard RAFT-Stereo.
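
The dimensions above can be made concrete with a small shape calculation. The source does not state the downsampling factor, so the values below are assumptions taken from the public RAFT-Stereo defaults (a factor of 4, a 256-channel fnet output, and a 4-level correlation pyramid pooled along the right-image width); 640 × 480 is read as width × height with 3 input channels. The helper `raft_stereo_shapes` is hypothetical.

```python
def raft_stereo_shapes(width=640, height=480, n_downsample=2,
                       corr_levels=4, feat_dim=256):
    """Return expected tensor shapes (batch dimension omitted)."""
    factor = 2 ** n_downsample           # assumed default: 1/4 resolution
    fh, fw = height // factor, width // factor
    return {
        "input": (3, height, width),     # I_par and I_perp, assumed 3-channel
        "fmap": (feat_dim, fh, fw),      # fmap1 and fmap2
        # Each pyramid level halves only the right-image width axis.
        "corr_pyramid": [(fh, fw, fw // 2 ** i) for i in range(corr_levels)],
        "disparity": (1, height, width), # matches the input resolution
    }

shapes = raft_stereo_shapes()
for name, shape in shapes.items():
    print(name, shape)
```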

6. Hyperparameters

Parameter	Value	Description
`pretrained`	`raftstereo-sceneflow.pth`	SceneFlow pretrained weights
`glass_weight`	3.0	Loss weighting for glass regions
`lr`	0.00005	Learning rate
`batch_size`	8	Training batch size
`num_steps`	50000	Training steps
`iters`	16	GRU iterations
`scheduler`	cosine	Cosine learning rate schedule
`d1_weight`	0.2	D1 metric weight
Precision	FP32	More stable than BF16
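
Two of these settings, the cosine schedule and `glass_weight`, can be sketched as follows. The exact formulas are not given in the source, so both functions are assumptions: the schedule decays from `lr` to zero with no warmup, and the loss is an L1 error in which glass pixels count `glass_weight` times as much as other pixels. The names `cosine_lr` and `glass_weighted_l1` are hypothetical.

```python
import math
import numpy as np

CONFIG = {"lr": 5e-5, "num_steps": 50000, "glass_weight": 3.0}

def cosine_lr(step, base_lr=CONFIG["lr"], total_steps=CONFIG["num_steps"]):
    """Cosine decay from base_lr at step 0 to zero at total_steps (assumed form)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def glass_weighted_l1(pred, target, glass_mask, glass_weight=CONFIG["glass_weight"]):
    """Weighted mean absolute disparity error (assumed form of the glass weighting)."""
    weights = np.where(glass_mask, glass_weight, 1.0)
    return float((weights * np.abs(pred - target)).sum() / weights.sum())

# Toy example: one background pixel (error 1) and one glass pixel (error 3).
pred = np.array([0.0, 0.0])
target = np.array([1.0, 3.0])
glass_mask = np.array([False, True])

loss = glass_weighted_l1(pred, target, glass_mask)
print(cosine_lr(0), cosine_lr(25000), cosine_lr(50000), loss)
```

With these toy values the weighted loss is (1·1 + 3·3) / (1 + 3) = 2.5, compared with an unweighted mean of 2.0, so errors on glass pull the loss upward.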

7. Design Decisions and Rationale

7.1 Why an Implicit, Architecture-Free Approach

Implicit polarization input is the lowest-cost way to exploit the signal: if the network can leverage polarization on its own, there is no need to design a complex polarization encoder. The approach also cleanly separates the contribution of the polarization information itself from that of architectural complexity, because the architecture is identical to ordinary stereo matching and the sole variable is whether the input contains a polarization difference.

7.2 The Fundamental Limit of the Implicit Approach: BatchNorm Washes Out the Polarization Signal

The core problem with implicit polarization input is that the first layer of fnet contains BatchNorm (BN). The value of the polarization signal lies in the macroscopic brightness-magnitude difference between I∥ and I⊥, and BN normalizes the input distribution and washes this difference away:

[Figure: BatchNorm washes out the polarization signal]

In other words, the macroscopic difference in the polarization signal is effectively normalized away by the BN in fnet, and the network can only fall back on micro-structure for matching. For texture-less transparent glass, that fallback fails because there is little micro-structure to match.
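
The washing-out effect can be illustrated with a minimal NumPy sketch. It normalizes each image with its own mean and variance, which is a simplification: the statistics an actual BN layer uses depend on batch composition and on train versus eval mode, and the layer acts on convolution outputs rather than raw pixels. The gain of 6.0 and offset of 0.1 are made-up values standing in for a uniform brightness difference.

```python
import numpy as np

def normalize(x, eps=1e-8):
    """Zero-mean, unit-variance normalization over the whole array."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)

# I_perp: dim view with the reflection suppressed.
i_perp = rng.uniform(0.05, 0.15, size=(8, 8))
# I_par: same micro-structure, but much brighter (hypothetical gain and offset).
i_par = 6.0 * i_perp + 0.1

raw_gap = np.abs(i_par - i_perp).mean()
norm_gap = np.abs(normalize(i_par) - normalize(i_perp)).mean()

print(f"mean |I_par - I_perp| before normalization: {raw_gap:.4f}")
print(f"mean difference after normalization:        {norm_gap:.2e}")
```

Any positive gain and any offset cancel under this normalization, so the two views become nearly identical and only the shared micro-structure remains.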

7.3 Applicability Boundary of the Implicit Approach

Because BN washes out the macroscopic magnitude difference of the polarization signal, implicit input cannot guarantee that the network “actively exploits” the polarization cue. For the polarization signal to be used effectively, polarization processing must completely bypass BN and operate from the raw |I∥ - I⊥|, which calls for an explicit polarization encoder. Implicit input is therefore suited to “low-cost exploration of whether polarization information has value”, but not as a final solution that depends on the polarization signal.

8. Highlights

Zero-modification path to using polarization: no new modules are added; only the input data is replaced, and the model parameter count is identical to a standard stereo matching network.
Clean variable isolation: the architecture is identical to ordinary stereo matching, with the only variable being “whether the input contains a polarization difference”, so the contribution of the polarization information itself can be evaluated in isolation.
Explicit physical assumption: grounded in the polarization physics of “I∥ >> I⊥ in glass regions and I∥ ≈ I⊥ on diffuse backgrounds”, turning glass visibility into a brightness difference between the image pair.
Identifies the fundamental BatchNorm limit: the first BN layer of fnet normalizes away the macroscopic magnitude difference of the polarization signal, which marks the applicability boundary of the implicit approach. For texture-less glass, implicit input is not enough to guarantee that the polarization signal is exploited.