Polarization Volume Architecture — Po-Ting Lin (林柏廷)

1. Design Goals

Computing the polarization difference (pol_diff) via warp suffers from a fundamental contradiction:

Ideal case:  pol_diff = warp(I_left, GT_disp) - I_right     → perfectly aligned ✓
Actual case: pol_diff = warp(I_left, pred_disp) - I_right   → contains error ✗

Core contradiction: computing pol_diff needs disparity, but disparity is precisely what we want to predict.

Warping with GT disparity yields a perfectly aligned pol_diff, but at inference time only the predicted disparity is available, so the warped pol_diff inevitably contains alignment error. The polarization input therefore behaves differently at training and at inference.
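
The gap between the ideal and the actual case can be reproduced on a single scanline. The sketch below is only an illustration under simplifying assumptions that are not in the original design: integer disparity, one constant disparity for the whole row, and zero padding outside the image. The names warp_left_to_right and pol_diff are hypothetical helpers.

```python
def warp_left_to_right(left, disp):
    """Integer warp of one scanline: warped[x] = left[x + disp], 0.0 outside the row."""
    width = len(left)
    return [left[x + disp] if 0 <= x + disp < width else 0.0 for x in range(width)]


def pol_diff(left, right, disp):
    """pol_diff = warp(I_left, disp) - I_right for a single scanline."""
    warped = warp_left_to_right(left, disp)
    return [w - r for w, r in zip(warped, right)]


# Toy scanline: the left view shows the same signal displaced by the true disparity.
gt_disp = 2
right = [0.1, 0.4, 0.9, 0.3, 0.7, 0.2, 0.5, 0.8]
left = [0.0] * gt_disp + right[:-gt_disp]          # left[x] = right[x - gt_disp]

pred_disp = gt_disp + 1                            # a prediction that is off by one pixel

# Keep only the columns whose match lies inside the image.
ideal = pol_diff(left, right, gt_disp)[: len(right) - gt_disp]
actual = pol_diff(left, right, pred_disp)[: len(right) - pred_disp]

print("ideal :", ideal)    # residual when warping with GT disparity
print("actual:", actual)   # residual when warping with the predicted disparity
```

With GT disparity the residual vanishes on every valid column; a one-pixel prediction error is enough to make every column non-zero.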

Key Insight

In a rectified stereo system, corresponding points in the left and right images always lie on the same horizontal line (epipolar line).

We can therefore borrow the Correlation Volume idea from RAFT-Stereo: without knowing disparity in advance, pre-compute pol_diff for “all disparity candidates” into a volume, and later query it inside the GRU loop using the current disparity. The design goal of this architecture is to eliminate pol_diff’s dependence on a disparity estimate.

2. Solution: Polarization Volume

Analogously to RAFT-Stereo’s Correlation Volume, pre-compute pol_diff for all disparities:

# Approach that requires knowing disparity
pol_diff = warp(left, d_pred) - right

# Polarization Volume approach (no disparity needed)
pol_volume[d] = shift(left, d) - right   # for all d in [0, max_disp]

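Because pol_volume is pre-computed over “all disparity candidates,” building it requires no disparity estimate: the volume depends only on the two input images and is identical whichever disparity is later used to query it. The current disparity is needed only to index into the volume inside the GRU loop.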
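
A minimal sketch of the shift-based construction, written with plain Python lists so that it runs without any framework. It assumes integer candidates and zero padding at the image border; shift and build_pol_volume are hypothetical names, not the project's actual functions.

```python
def shift(row, d):
    """shift(row, d)[x] = row[x + d]; reads past the end of the row give 0.0."""
    width = len(row)
    return [row[x + d] if x + d < width else 0.0 for x in range(width)]


def build_pol_volume(left, right, max_disp):
    """volume[d][y][x] = shift(left, d)[y][x] - right[y][x] for d in [0, max_disp]."""
    return [
        [
            [a - b for a, b in zip(shift(l_row, d), r_row)]
            for l_row, r_row in zip(left, right)
        ]
        for d in range(max_disp + 1)
    ]


# Two scanlines with different true disparities (1 and 3).
right = [
    [0.2, 0.6, 0.1, 0.9, 0.4, 0.7],
    [0.5, 0.3, 0.8, 0.1, 0.6, 0.2],
]
left = [
    [0.0] * 1 + right[0][:-1],   # row 0: left[x] = right[x - 1]
    [0.0] * 3 + right[1][:-3],   # row 1: left[x] = right[x - 3]
]

# The volume is built once, from the images alone.
volume = build_pol_volume(left, right, max_disp=3)

# Only the query uses a disparity: each row is read at its own candidate.
row0_at_d1 = volume[1][0][:5]
row1_at_d3 = volume[3][1][:3]
print(row0_at_d1, row1_at_d3)
```

Both rows are served by the same pre-computed volume even though their disparities differ, which is the property the warp-based formulation lacks.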

3. Architecture

[Figure: StereoPolVolume overall architecture]

Data Flow

The left and right images are encoded by FeatureEncoder into fmap1 / fmap2, which form a CorrBlock (the standard RAFT-Stereo correlation volume).
The left and right images are also downsampled 4x with AvgPool to left_ds / right_ds, which form a PolCorrBlock (the polarization-difference volume).
The left image is additionally processed by ContextEncoder to produce context and the initial hidden state.
Enter a 24-iteration GRU loop: each iteration queries both CorrBlock (yielding corr) and PolCorrBlock (yielding pol) using the current disparity.
UpdateBlockWithPol concatenates corr, pol, and disp as input and produces the disparity increment Δdisp.
The disparity is updated as disp = disp + Δdisp, and the next iteration queries both volumes again with the new value.
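
The control flow of the loop can be sketched as follows. This is a schematic only: refine_disparity is a hypothetical function, the three callables stand in for CorrBlock, PolCorrBlock and UpdateBlockWithPol, and the scalar toy stand-ins at the bottom exist purely to make the sketch executable. They do not model the real network.

```python
def refine_disparity(disp, hidden, context, corr_lookup, pol_lookup, update_block, iters=24):
    """Run the iterative refinement and return the disparity after every iteration."""
    history = []
    for _ in range(iters):
        corr = corr_lookup(disp)   # sample CorrBlock around the current disparity
        pol = pol_lookup(disp)     # sample PolCorrBlock around the same disparity
        hidden, delta = update_block(hidden, context, corr, pol, disp)
        disp = disp + delta        # disp = disp + delta_disp
        history.append(disp)
    return history


# Toy stand-ins (assumptions): both lookups report the signed distance to a
# target disparity, and the update block moves a fixed fraction towards it.
target = 5.0
history = refine_disparity(
    disp=0.0,
    hidden=None,
    context=None,
    corr_lookup=lambda d: target - d,
    pol_lookup=lambda d: target - d,
    update_block=lambda h, ctx, corr, pol, d: (h, 0.25 * (corr + pol)),
)
print(len(history), history[-1])
```

Neither volume is rebuilt inside the loop; only the query position changes from one iteration to the next.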

4. Components and Modules

4.1 CorrBlock vs PolCorrBlock

Item	CorrBlock	PolCorrBlock
Computation	`dot(fmap1[x], fmap2[x-d])`	`left[x] - right[x-d]`
Meaning	Feature similarity	Polarization difference
Dimensions	(B*H, 1, W, W)	(B*H, 1, W, W)
Query	Sample by disp	Sample by disp

CorrBlock computes the inner product (dot product) of left and right features, measuring “feature similarity.” This is the core of standard RAFT-Stereo.
PolCorrBlock computes per-pixel differences between the (downsampled) left and right images along the same epipolar line, measuring “polarization difference.” The two are structurally symmetric: both form all-pairs volumes of shape (B*H, 1, W, W), where H and W are the 1/4-resolution sizes, and both are sampled by the current disparity. Entry (x1, x2) of a row's W x W slice pairs left pixel x1 with right pixel x2, so it corresponds to the disparity candidate d = x1 - x2.
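
The parallel between the two volumes can be seen on one scanline. The sketch below uses plain Python lists, integer disparities and zero padding, and replaces FeatureEncoder outputs with hand-made unit vectors; all function names are hypothetical. In the full model one such W x W slice would exist for each of the B*H rows.

```python
import math


def all_pairs_corr(fmap1_row, fmap2_row):
    """corr[x1][x2] = dot(fmap1[x1], fmap2[x2]) along one scanline."""
    return [[sum(a * b for a, b in zip(f1, f2)) for f2 in fmap2_row] for f1 in fmap1_row]


def all_pairs_pol(left_row, right_row):
    """pol[x1][x2] = left[x1] - right[x2] along one scanline."""
    return [[l - r for r in right_row] for l in left_row]


def query(volume, x, disp, radius):
    """Read volume[x][x - disp + k] for k in [-radius, radius]; 0.0 outside the row."""
    width = len(volume[x])
    taps = []
    for k in range(-radius, radius + 1):
        x2 = x - disp + k
        taps.append(volume[x][x2] if 0 <= x2 < width else 0.0)
    return taps


# Toy scanline with true disparity 2: left[x] matches right[x - 2].
disp = 2
right = [0.2, 0.6, 0.1, 0.9, 0.4, 0.7]
left = [0.0] * disp + right[:-disp]
angles = [0.0, 0.7, 1.4, 2.1, 2.8, 3.5]
fmap2 = [(math.cos(a), math.sin(a)) for a in angles]   # unit-length stand-in features
fmap1 = [(0.0, 0.0)] * disp + fmap2[:-disp]

corr = all_pairs_corr(fmap1, fmap2)
pol = all_pairs_pol(left, right)

# Same query mechanism for both volumes.
corr_taps = query(corr, x=4, disp=disp, radius=1)
pol_taps = query(pol, x=4, disp=disp, radius=1)
print(corr_taps, pol_taps)
```

At the correct disparity the centre tap of corr is the largest value and the centre tap of pol is zero, so the two signals agree on the match while carrying different information.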

4.2 UpdateBlockWithPol

Compared with the standard RAFT-Stereo UpdateBlock, UpdateBlockWithPol accepts one additional input, pol_corr. It concatenates the sampled features as concat(corr, pol, disp), and the GRU uses this combined input to produce the disparity increment.
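
To make the concatenated input concrete, the sketch below assembles it for a single pixel. It assumes that PolCorrBlock follows the RAFT-Stereo lookup convention (a pyramid that halves the candidate axis at each level, 2 * radius + 1 linearly interpolated taps per level, query centre divided by 2 at each coarser level) and that CorrBlock uses the same levels and radius. Both are assumptions, and every function name here is hypothetical.

```python
import math


def build_pyramid(row, levels):
    """Level 0 is the row itself; each further level averages neighbouring pairs."""
    pyramid = [list(row)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        pyramid.append([(prev[2 * i] + prev[2 * i + 1]) / 2.0 for i in range(len(prev) // 2)])
    return pyramid


def sample_linear(values, pos):
    """Linearly interpolate values at a fractional position; 0.0 outside the row."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return values[i] if 0 <= i < len(values) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def lookup(pyramid, centre, radius):
    """Collect 2 * radius + 1 taps per level around the rescaled query centre."""
    feats = []
    for level, values in enumerate(pyramid):
        c = centre / 2 ** level
        feats.extend(sample_linear(values, c + k) for k in range(-radius, radius + 1))
    return feats


def update_input(corr_row, pol_row, x, disp, levels=4, radius=4):
    """concat(corr, pol, disp) for the pixel at column x."""
    centre = x - disp
    corr = lookup(build_pyramid(corr_row, levels), centre, radius)
    pol = lookup(build_pyramid(pol_row, levels), centre, radius)
    return corr + pol + [disp]


# Volume slices for one pixel (toy values): 32 candidates each.
corr_row = [float(i) for i in range(32)]
pol_row = [0.5] * 32
feats = update_input(corr_row, pol_row, x=20, disp=6.0)
print(len(feats))
```

Under these assumptions each volume contributes 4 * 9 = 36 values per pixel, and the disparity itself adds one more.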

5. Tensor Dimensions

Tensor	Dimensions	Description
`fmap1` / `fmap2`	1/4 resolution	FeatureEncoder outputs; used to form CorrBlock
`corr` volume	(B*H, 1, W, W)	All-pairs feature similarity
`pol` volume	(B*H, 1, W, W)	All-pairs polarization difference
`left_ds` / `right_ds`	1/4 resolution	After AvgPool 4x downsampling
Query output	Controlled by `pol_levels=4` and `pol_radius=4`	corr / pol features sampled per GRU iteration
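
The sizes in the table follow from simple arithmetic. The helper below is a sketch: the 320 x 720 input size is a made-up example, the pyramid halving and the levels * (2 * radius + 1) channel count assume the RAFT-Stereo convention, and the function names are hypothetical.

```python
def volume_shape(batch, height, width, downsample=4):
    """Shape (B*H, 1, W, W) of an all-pairs volume built at 1/downsample resolution."""
    h, w = height // downsample, width // downsample
    return (batch * h, 1, w, w)


def pyramid_widths(width, levels):
    """Candidate-axis width per level, assuming every level halves the previous one."""
    return [width // (2 ** i) for i in range(levels)]


def query_channels(levels, radius):
    """Values returned per pixel by one lookup: (2 * radius + 1) taps on each level."""
    return levels * (2 * radius + 1)


shape = volume_shape(batch=8, height=320, width=720)
num_elements = shape[0] * shape[1] * shape[2] * shape[3]
float32_bytes = num_elements * 4

print("volume shape   :", shape)
print("level widths   :", pyramid_widths(shape[3], levels=4))
print("query channels :", query_channels(levels=4, radius=4))
print("level-0 memory : %.1f MiB" % (float32_bytes / 2 ** 20))
```

Because the volume grows with W squared, building it at 1/4 resolution rather than full resolution reduces the level-0 size by a factor of 4 x 4 x 4 = 64 (one factor of 4 from H and two from W).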

6. Hyperparameters

Parameter	Value	Description
`pol_volume`	enabled	Enable the Polarization Volume architecture
`pretrained`	`raftstereo-sceneflow.pth`	SceneFlow pre-trained weights
`pol_levels`	4	Number of pyramid levels for the Polarization Volume
`pol_radius`	4	Query sampling radius
`iters`	24	GRU iterations
`batch_size`	8	Training batch size
`num_steps`	60000	Training steps
`lr`	0.0003	Learning rate
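
For reference, the same settings can be collected into one configuration object. The field names mirror the table, but the PolVolumeConfig class and its validation rules are hypothetical; the actual training script may expose these values differently (for example as command-line flags).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PolVolumeConfig:
    pol_volume: bool = True                      # enable the Polarization Volume
    pretrained: str = "raftstereo-sceneflow.pth"  # SceneFlow pre-trained weights
    pol_levels: int = 4                          # pyramid levels of the pol volume
    pol_radius: int = 4                          # query sampling radius
    iters: int = 24                              # GRU iterations
    batch_size: int = 8
    num_steps: int = 60000
    lr: float = 0.0003

    def __post_init__(self):
        # Basic sanity checks (illustrative, not taken from the original code).
        if self.pol_levels < 1 or self.pol_radius < 0:
            raise ValueError("pol_levels must be >= 1 and pol_radius >= 0")
        if self.iters < 1 or self.batch_size < 1 or self.num_steps < 1:
            raise ValueError("iters, batch_size and num_steps must be positive")
        if self.lr <= 0:
            raise ValueError("lr must be positive")


config = PolVolumeConfig()
baseline = replace(config, pol_volume=False)   # e.g. an ablation without the pol volume
print(config)
```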

7. Design Decisions and Rationale

Decision	Rationale
Pre-compute `pol_diff` like a Correlation Volume	Leverages epipolar geometry; disparity need not be known in advance
AvgPool 4x the polarization images before building the volume	Matches the corr-volume resolution and reduces compute
Make pol and corr structurally symmetric	Allows sampling via the same disparity-query mechanism
`UpdateBlockWithPol` uses concat fusion	The most direct multi-source fusion
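
The 4x downsampling step is a non-overlapping average pool. The sketch below implements it on nested Python lists for illustration; avg_pool is a hypothetical helper, and a real implementation would use the framework's pooling operator. Rows and columns that do not fill a complete window are dropped here, which is an assumption about border handling.

```python
def avg_pool(image, k=4):
    """Non-overlapping k x k average pooling of a 2-D list of numbers."""
    out_h, out_w = len(image) // k, len(image[0]) // k
    return [
        [
            sum(image[y * k + dy][x * k + dx] for dy in range(k) for dx in range(k)) / (k * k)
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# 8 x 8 ramp image: pixel value = y * 8 + x.
image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
left_ds = avg_pool(image, k=4)
print(left_ds)
```

Downsampling by 4 also divides every disparity by 4, so a volume built from left_ds / right_ds is indexed in the same 1/4-resolution disparity units as the correlation volume.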

Core Advantage

Warp-based pol_diff requires a disparity input (GT or predicted), so a gap exists between “ideal alignment” and “actual alignment,” and training/inference behaviors are inconsistent. The Polarization Volume instead uses shift(left, d) - right to pre-compute the polarization difference for all disparity candidates. Building the volume never needs disparity; the current estimate is used only to look up entries during the GRU iterations. As a result:

Disparity is not required to compute polarization differences.
There is no “ideal vs actual alignment” gap.
The polarization-feature computation behaves identically at training and inference.

8. Highlights

Volume replaces warp for polarization computation: by analogy with the Correlation Volume, pol_diff for all disparity candidates is pre-computed into a Polarization Volume, completely removing the circular dependency of “needing disparity to compute pol_diff while disparity is exactly what we want to predict.”
Training and inference consistent: building the polarization volume does not depend on any disparity estimate, so for a given image pair the pol_volume is constructed in exactly the same way at training and inference, eliminating the “ideal vs actual alignment” gap.
Exploits the epipolar geometric constraint: leverages the property that corresponding points in a rectified stereo system lie on the same horizontal line, using shift in place of warp and encoding this geometric prior directly into volume construction.
Symmetric design with the Correlation Volume: PolCorrBlock is structurally symmetric with CorrBlock. Both are all-pairs volumes of shape (B*H, 1, W, W) and can be sampled with the same disparity-query mechanism.
Downsampling reduces compute: polarization images are downsampled 4x with AvgPool before volume construction, matching the correlation-volume resolution and keeping compute and memory manageable while aiming to retain the coarse-scale polarization signal.
Multi-source fusion within iterations: UpdateBlockWithPol consumes both similarity and polarization difference per GRU iteration via concat(corr, pol, disp), letting the two signals jointly guide disparity convergence.