1. Design Goals
Computing the polarization difference (pol_diff) via warp suffers from a fundamental contradiction:
Ideal case: pol_diff = warp(I_left, GT_disp) - I_right → perfectly aligned ✓
Actual case: pol_diff = warp(I_left, pred_disp) - I_right → contains error ✗
Core contradiction: computing pol_diff needs disparity, but disparity is precisely what we want to predict.
Warping with GT disparity yields a perfectly aligned pol_diff, but at inference time only the predicted disparity is available, so the warped pol_diff inevitably contains alignment error. The network therefore sees differently distributed pol_diff inputs at training and at inference, which is a training/inference inconsistency.
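The gap between the ideal and actual cases can be reproduced on synthetic data. The NumPy sketch below is illustrative only: `warp_left_to_right` is a hypothetical helper that assumes `warp(I_left, disp)` samples `I_left` at `x + disp` with linear interpolation, and the constant disparity of 5 and the 1.5-pixel prediction error are made-up values.

```python
import numpy as np

def warp_left_to_right(left, disp):
    """Sample left[y, x + disp[y, x]] with linear interpolation along x (assumed convention)."""
    h, w = left.shape
    xs = np.arange(w, dtype=np.float64)[None, :] + disp  # source column per pixel
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1 - frac) * left[rows, x0] + frac * left[rows, x1]

rng = np.random.default_rng(0)
h, w, d_true = 4, 32, 5
right = rng.random((h, w))
left = np.zeros_like(right)
left[:, d_true:] = right[:, : w - d_true]  # left[x] = right[x - d_true]

gt_disp = np.full((h, w), float(d_true))
pred_disp = gt_disp + 1.5                  # hypothetical prediction error

valid = slice(0, w - d_true - 2)           # columns where x + disp stays inside the image
ideal = warp_left_to_right(left, gt_disp) - right
actual = warp_left_to_right(left, pred_disp) - right

print("max |pol_diff| with GT disparity:       ", np.abs(ideal[:, valid]).max())
print("max |pol_diff| with predicted disparity:", np.abs(actual[:, valid]).max())
```

With GT disparity the residual vanishes on the valid columns; with the perturbed disparity it does not, even though the underlying images are identical.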
Key Insight
In a rectified stereo system, corresponding points in the left and right images always lie on the same horizontal line (epipolar line).
We can therefore borrow the Correlation Volume idea from RAFT-Stereo: without knowing disparity in advance, pre-compute pol_diff for “all disparity candidates” into a volume, and later query it inside the GRU loop using the current disparity. The design goal of this architecture is to eliminate pol_diff’s dependence on a disparity estimate.
2. Solution: Polarization Volume
Analogously to RAFT-Stereo’s Correlation Volume, pre-compute pol_diff for all disparities:
# Approach that requires knowing disparity
pol_diff = warp(left, d_pred) - right
# Polarization Volume approach (no disparity needed)
pol_volume[d] = shift(left, d) - right # for all d in [0, max_disp]
Because pol_volume is pre-computed over “all disparity candidates,” its construction does not depend on any disparity estimate: the same pol_volume is obtained whatever the current prediction is. The disparity estimate is used only afterwards, to query the volume inside the GRU loop.
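A minimal NumPy sketch of this construction follows. It assumes that `shift(left, d)` moves column `x + d` of the left image to column `x`, and that columns shifted in from outside the image are zero-filled; the source does not specify border handling, and `build_pol_volume` is a hypothetical name.

```python
import numpy as np

def build_pol_volume(left, right, max_disp):
    """Return volume[d, y, x] = left[y, x + d] - right[y, x] for d in [0, max_disp]."""
    h, w = left.shape
    volume = np.zeros((max_disp + 1, h, w), dtype=np.float64)
    for d in range(max_disp + 1):
        # shift(left, d) - right, restricted to columns where x + d is inside the image
        volume[d, :, : w - d] = left[:, d:] - right[:, : w - d]
    return volume

rng = np.random.default_rng(0)
h, w, d_true, max_disp = 4, 32, 5, 8
right = rng.random((h, w))
left = np.zeros_like(right)
left[:, d_true:] = right[:, : w - d_true]   # synthetic pair with constant disparity

# No disparity estimate is passed in: only the two images and the search range.
pol_volume = build_pol_volume(left, right, max_disp)

# On columns where every candidate is in range, the zero-difference slice is the true disparity.
interior = slice(0, w - max_disp)
best_d = np.abs(pol_volume[:, :, interior]).argmin(axis=0)
print(pol_volume.shape, np.unique(best_d))
```

The volume is a function of the image pair alone, so it can be built once before the GRU loop starts.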
3. Architecture
Data Flow
- The left and right images are encoded by `FeatureEncoder` into `fmap1` / `fmap2`, which form a `CorrBlock` (the standard RAFT-Stereo correlation volume).
- The left and right images are also `AvgPool 4x` downsampled to `left_ds` / `right_ds`, which form a `PolCorrBlock` (the polarization-difference volume).
- The left image is additionally processed by `ContextEncoder` to produce context and the initial hidden state.
- Enter a 24-iteration GRU loop: each iteration queries both `CorrBlock` (yielding `corr`) and `PolCorrBlock` (yielding `pol`) using the current disparity.
- `UpdateBlockWithPol` concatenates `corr`, `pol`, and `disp` as input and produces the disparity increment `Δdisp`.
- `disp = disp + Δdisp`, iteratively updated.
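The control flow of the GRU loop can be sketched as below. This is a skeleton under stated assumptions, not the actual implementation: `refine_disparity` is a hypothetical name, the two lookups and the update block are passed in as plain callables, disparity is assumed to start at zero, and the toy stand-ins at the bottom exist only to make the loop runnable.

```python
import numpy as np

def refine_disparity(corr_lookup, pol_lookup, update_block, hidden, context,
                     disp_shape, iters=24):
    """Iteratively refine disparity by querying both volumes at the current estimate."""
    disp = np.zeros(disp_shape)  # assumed zero initialisation
    for _ in range(iters):
        corr = corr_lookup(disp)  # sample the correlation volume at the current disparity
        pol = pol_lookup(disp)    # sample the polarization volume at the current disparity
        hidden, delta_disp = update_block(hidden, context, corr, pol, disp)
        disp = disp + delta_disp
    return disp

# Toy stand-ins (not real network modules): the "lookup" returns the remaining error
# and the "update block" steps halfway towards the target on each iteration.
target = np.full((2, 3), 7.0)

def toy_lookup(disp):
    return target - disp

def toy_update(hidden, context, corr, pol, disp):
    return hidden, 0.5 * corr

result = refine_disparity(toy_lookup, toy_lookup, toy_update, None, None, target.shape)
print(np.abs(result - target).max())
```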
4. Components and Modules
4.1 CorrBlock vs PolCorrBlock
| Item | CorrBlock | PolCorrBlock |
|---|---|---|
| Computation | dot(fmap1[x], fmap2[x-d]) | left[x] - right[x-d] |
| Meaning | Feature similarity | Polarization difference |
| Dimensions | (B*H, 1, W, W) | (B*H, 1, W, W) |
| Query | Sample by disp | Sample by disp |
`CorrBlock` computes the inner product (dot product) of left and right features, measuring “feature similarity”; this is the core of standard RAFT-Stereo. `PolCorrBlock` computes per-pixel differences between the (downsampled) left and right images along the same epipolar line, measuring “polarization difference.” The two are structurally symmetric, both forming all-pairs volumes of shape `(B*H, 1, W, W)`, and both are sampled by the current disparity.
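The symmetry between the two volumes can be made concrete with a small NumPy sketch. The function names, the single-level lookup, the linear interpolation, and the zero padding for out-of-range samples are all assumptions made for illustration; only the formulas and the `(B*H, 1, W, W)` shape come from the table above.

```python
import numpy as np

def all_pairs_corr(fmap1, fmap2):
    """Feature similarity: dot(fmap1[:, y, x1], fmap2[:, y, x2]) for every pair on a row."""
    b, c, h, w = fmap1.shape
    corr = np.einsum("bchi,bchj->bhij", fmap1, fmap2)
    return corr.reshape(b * h, 1, w, w)

def all_pairs_pol(left_ds, right_ds):
    """Polarization difference: left[y, x1] - right[y, x2] for every pair on a row."""
    b, h, w = left_ds.shape
    diff = left_ds[:, :, :, None] - right_ds[:, :, None, :]
    return diff.reshape(b * h, 1, w, w)

def lookup(volume, disp, radius):
    """Sample volume[n, 0, x1, x1 - disp + k] for k in [-radius, radius].

    disp has shape (B*H, W). Returns (B*H, W, 2*radius + 1); out-of-range samples are 0.
    """
    n, _, w, _ = volume.shape
    offsets = np.arange(-radius, radius + 1, dtype=np.float64)
    x1 = np.arange(w, dtype=np.float64)
    x2 = x1[None, :, None] - disp[:, :, None] + offsets[None, None, :]
    x0 = np.floor(x2).astype(int)
    frac = x2 - x0
    rows = np.arange(n)[:, None, None]
    cols = np.arange(w)[None, :, None]

    def gather(idx):
        inside = (idx >= 0) & (idx < w)
        values = volume[rows, 0, cols, np.clip(idx, 0, w - 1)]
        return np.where(inside, values, 0.0)

    return (1 - frac) * gather(x0) + frac * gather(x0 + 1)

rng = np.random.default_rng(0)
b, c, h, w = 2, 8, 3, 16
corr_volume = all_pairs_corr(rng.random((b, c, h, w)), rng.random((b, c, h, w)))
left_ds, right_ds = rng.random((b, h, w)), rng.random((b, h, w))
pol_volume = all_pairs_pol(left_ds, right_ds)

disp = np.full((b * h, w), 3.0)
corr = lookup(corr_volume, disp, radius=4)  # same query mechanism for both volumes
pol = lookup(pol_volume, disp, radius=4)
print(corr_volume.shape, pol_volume.shape, corr.shape, pol.shape)
```

The centre sample of `pol` (offset `k = 0`) equals `left[x] - right[x - d]`, matching the table entry for `PolCorrBlock`.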
4.2 UpdateBlockWithPol
Compared with the standard RAFT-Stereo UpdateBlock, UpdateBlockWithPol accepts one additional input, pol_corr (the pol features sampled from PolCorrBlock). It feeds concat(corr, pol, disp) to the GRU, which then produces the disparity increment.
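The concat fusion can be illustrated on array shapes alone. In the sketch below, `fuse_inputs` is a hypothetical name, the channel-first layout is an assumption, and the 36 channels per volume assume `levels * (2 * radius + 1)` sampled features with `pol_levels=4` and `pol_radius=4`; any convolutions applied before or after the concatenation are omitted.

```python
import numpy as np

def fuse_inputs(corr, pol, disp):
    """Concatenate (B, Cc, H, W), (B, Cp, H, W) and (B, 1, H, W) along the channel axis."""
    return np.concatenate([corr, pol, disp], axis=1)

b, h, w = 2, 8, 16
corr = np.zeros((b, 36, h, w))     # sampled correlation features (assumed 36 channels)
pol = np.ones((b, 36, h, w))       # sampled polarization features (assumed 36 channels)
disp = np.full((b, 1, h, w), 2.0)  # current disparity estimate

fused = fuse_inputs(corr, pol, disp)
print(fused.shape)
```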
5. Tensor Dimensions
| Tensor | Dimensions | Description |
|---|---|---|
| fmap1 / fmap2 | FeatureEncoder outputs | Used to form CorrBlock |
| corr volume | (B*H, 1, W, W) | All-pairs feature similarity |
| pol volume | (B*H, 1, W, W) | All-pairs polarization difference |
| left_ds / right_ds | 1/4 resolution | After AvgPool 4x downsampling |
| Query output | Controlled by pol_levels=4 and pol_radius=4 | corr / pol features sampled per GRU iteration |
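
How `pol_levels` and `pol_radius` determine the query output size can be worked out with a short sketch. It assumes the polarization volume follows the convention of RAFT-Stereo's correlation pyramid, where each level halves the candidate axis by average pooling and each level contributes `2 * radius + 1` samples; `build_pyramid` and `query_channels` are hypothetical names.

```python
import numpy as np

def build_pyramid(volume, levels):
    """Average-pool the last axis (candidate positions) by 2 at each successive level."""
    pyramid = [volume]
    for _ in range(levels - 1):
        v = pyramid[-1]
        w2 = v.shape[-1] // 2
        pooled = v[..., : 2 * w2].reshape(*v.shape[:-1], w2, 2).mean(axis=-1)
        pyramid.append(pooled)
    return pyramid

def query_channels(levels, radius):
    """Number of sampled values per pixel: one window of 2*radius + 1 per level."""
    return levels * (2 * radius + 1)

volume = np.zeros((6, 1, 64, 64))  # (B*H, 1, W, W) with B*H = 6 and W = 64
pyramid = build_pyramid(volume, levels=4)
shapes = [p.shape for p in pyramid]
print(shapes)
print(query_channels(levels=4, radius=4))
```

Under these assumptions, `pol_levels=4` and `pol_radius=4` give 4 × 9 = 36 sampled values per pixel per volume.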
6. Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| pol_volume | enabled | Enable the Polarization Volume architecture |
| pretrained | raftstereo-sceneflow.pth | SceneFlow pre-trained weights |
| pol_levels | 4 | Number of pyramid levels for the Polarization Volume |
| pol_radius | 4 | Query sampling radius |
| iters | 24 | GRU iterations |
| batch_size | 8 | Training batch size |
| num_steps | 60000 | Training steps |
| lr | 0.0003 | Learning rate |
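
For reference, the table can be captured in a single configuration object. The sketch below is one possible representation, not the project's actual configuration code: the class name `PolVolumeConfig` is hypothetical, and the boolean encoding of "enabled" is an assumption. The field names and values are taken from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolVolumeConfig:  # hypothetical container for the hyperparameters above
    pol_volume: bool = True                       # enable the Polarization Volume
    pretrained: str = "raftstereo-sceneflow.pth"  # SceneFlow pre-trained weights
    pol_levels: int = 4                           # pyramid levels
    pol_radius: int = 4                           # query sampling radius
    iters: int = 24                               # GRU iterations
    batch_size: int = 8
    num_steps: int = 60000
    lr: float = 0.0003

config = PolVolumeConfig()
samples_seen = config.batch_size * config.num_steps  # training samples processed in total
print(config)
print(samples_seen)
```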
7. Design Decisions and Rationale
| Decision | Rationale |
|---|---|
| Pre-compute pol_diff like a Correlation Volume | Leverages epipolar geometry; disparity need not be known in advance |
| AvgPool 4x the polarization images before building the volume | Matches the corr-volume resolution and reduces compute |
| Make pol and corr structurally symmetric | Allows sampling via the same disparity-query mechanism |
| UpdateBlockWithPol uses concat fusion | The most direct multi-source fusion |
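
The effect of the 4x downsampling decision on volume size can be checked with a small sketch. `avg_pool_4x` is a hypothetical NumPy stand-in for the `AvgPool 4x` step, assuming non-overlapping 4×4 windows and cropping of any remainder rows or columns.

```python
import numpy as np

def avg_pool_4x(image):
    """Average over non-overlapping 4x4 windows of an (H, W) image, cropping any remainder."""
    h, w = image.shape
    h4, w4 = h // 4, w // 4
    return image[: h4 * 4, : w4 * 4].reshape(h4, 4, w4, 4).mean(axis=(1, 3))

left = np.arange(8 * 16, dtype=np.float64).reshape(8, 16)
left_ds = avg_pool_4x(left)

# An all-pairs volume stores W candidates for each of the H*W pixels.
full_entries = left.shape[0] * left.shape[1] ** 2
ds_entries = left_ds.shape[0] * left_ds.shape[1] ** 2
print(left_ds.shape, full_entries // ds_entries)
```

Since both the row count and the two width axes shrink by 4, the all-pairs volume at 1/4 resolution holds 64 times fewer entries than one built at full resolution.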
Core Advantage
Warp-based pol_diff requires a disparity input (GT or predicted), so a gap exists between “ideal alignment” and “actual alignment,” and training/inference behaviors are inconsistent. The Polarization Volume instead uses shift(left, d) - right to pre-compute the polarization difference for all disparity candidates; building the volume never needs disparity, and the current estimate is used only to look up values inside the GRU loop. As a result:
- Disparity is not required to compute polarization differences.
- There is no “ideal vs actual alignment” gap.
- The polarization-feature computation behaves identically at training and inference.
8. Highlights
- Volume replaces warp for polarization computation: by analogy with the Correlation Volume, `pol_diff` for all disparity candidates is pre-computed into a Polarization Volume, completely removing the circular dependency of “needing disparity to compute `pol_diff` while disparity is exactly what we want to predict.”
- Training and inference fully consistent: the polarization volume does not depend on any disparity estimate; the `pol_volume` is identical at training and inference, eliminating the “ideal vs actual alignment” gap.
- Exploits the epipolar geometric constraint: leverages the property that corresponding points in a rectified stereo system lie on the same horizontal line, using `shift` in place of `warp` and encoding this geometric prior directly into volume construction.
- Symmetric design with the Correlation Volume: `PolCorrBlock` is structurally symmetric with `CorrBlock`; both are all-pairs volumes of shape `(B*H, 1, W, W)` and can be sampled with the same disparity-query mechanism.
- Downsampling reduces compute: polarization images are AvgPool 4x downsampled before volume construction, matching the correlation-volume resolution and controlling compute without losing the macroscopic polarization signal.
- Multi-source fusion within iterations: `UpdateBlockWithPol` consumes both similarity and polarization difference per GRU iteration via `concat(corr, pol, disp)`, letting the two signals jointly guide disparity convergence.