1. Design Goals
Injecting a polarization residual into the correlation volume with a static strength, fixed across all GRU iterations, takes the form:

    corr_enhanced = corr + pol_residual(pol_corr)

This injects polarization at the wrong time and with the wrong strength:
| Iteration | Problem |
|---|---|
| Early | Stereo is still aligning coarse geometry, so adding the pol residual at full strength amplifies early noise |
| Late | RAFT has entered refinement, but the pol residual gets no extra weight there; it is only as strong as it was in the early iterations |
In other words, a fixed-strength residual makes pol merely a distractor throughout the entire process, rather than a refinement tool.
The design goal of this architecture is to let the residual strength grow with the GRU iterations, so that pol takes effect at the right moment: it barely intervenes early on, which protects stereo geometry, and is fully injected only later, as a refinement prior.
2. Architecture Mechanism: Scheduled Residual
The core mechanism is to replace the fixed residual with an “iteration-scheduled” residual:

    # Scheduled residual
    alpha = i / max(iters - 1, 1)  # 0 → 1
    corr_enhanced = corr + alpha * self.pol_residual(pol_corr)

where i is the zero-based GRU iteration index and iters is the total number of iterations (e.g. 24). alpha grows linearly from 0 at the first iteration (i = 0) to 1 at the last (i = iters - 1); the max(..., 1) guard avoids a division by zero when iters = 1.
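As a quick check of the formula, the following minimal sketch lists the schedule values for a short run. The helper name `linear_alpha` is an illustrative assumption, not part of the original code.

```python
def linear_alpha(i: int, iters: int) -> float:
    """Linear schedule: 0 at the first iteration, 1 at the last."""
    # max(..., 1) avoids a division by zero when iters == 1
    return i / max(iters - 1, 1)


# Schedule for a short 5-iteration run
schedule = [linear_alpha(i, 5) for i in range(5)]
print(schedule)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Degenerate case: with a single iteration, alpha stays at 0
print(linear_alpha(0, 1))  # 0.0
```

One consequence worth noting: with iters = 1 the pol residual is never injected, because the only iteration has alpha = 0.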
The PolCorrResidual module consists of three convolutional layers plus a learnable scale, with the last layer initialized to 0, outputting the scaled residual Δcorr.
3. Three-Phase α Schedule Philosophy
The α schedule does three things with a single formula:
| Phase | α Value | Role | Description |
|---|---|---|---|
| Early | α ≈ 0 | Protect stereo geometry | Equivalent to plain RAFT-Stereo; pretrained stereo is not disrupted by polarization |
| Mid | α gradually grows | Pol becomes auxiliary evidence | Stereo already has a reasonable disparity; pol only nudges (boundaries, specular) |
| Late | α → 1 | Pol = refinement prior | RAFT itself is doing small corrections; pol’s strength, semantics, and timing all match the refinement step |
Summary of design philosophy: the α schedule turns pol from a “distractor throughout the process” into a “refinement tool at the right moment”.
- Early phase: the stereo backbone is still aligning coarse geometry. With α≈0 the model behaves equivalently to plain RAFT-Stereo, avoiding amplification of early noise by the pol residual.
- Mid phase: stereo has obtained a reasonable disparity. α gradually increases, and pol serves as auxiliary evidence that makes small corrections at boundaries and in specular regions.
- Late phase: RAFT itself is only doing small corrections, so α → 1, and pol’s strength, semantics, and timing all match what refinement needs.
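To put rough numbers on the three phases, the sketch below averages α over each third of a 24-iteration run. The design does not define where one phase ends and the next begins, so the split into equal thirds and the helper name `mean_alpha_by_phase` are assumptions made purely for illustration.

```python
def mean_alpha_by_phase(iters: int = 24, phases: int = 3) -> list:
    """Average alpha over equal-sized blocks of consecutive iterations."""
    if iters % phases != 0:
        raise ValueError("iters must be divisible by phases in this sketch")
    alphas = [i / max(iters - 1, 1) for i in range(iters)]
    size = iters // phases
    return [sum(alphas[k * size:(k + 1) * size]) / size for k in range(phases)]


early, mid, late = mean_alpha_by_phase()
print(round(early, 3), round(mid, 3), round(late, 3))  # 0.152 0.5 0.848
```

Under this split, the first third of the iterations sees about 15% of the full residual strength on average, and the last third about 85%.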
4. Data Flow
The key point is that alpha is a function of iteration, recomputed at every iteration, so early iterations inject almost no pol while late iterations inject it fully.
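The per-iteration recomputation can be sketched as a plain loop. This is a sketch under stated assumptions, not the actual forward pass: `lookup_corr`, `lookup_pol`, and `update_block` are hypothetical stand-ins for the correlation lookup, the PolCorrBlock lookup, and the RAFT UpdateBlock, and toy floats take the place of tensors.

```python
from typing import Callable, List, Tuple


def scheduled_refinement(
    lookup_corr: Callable[[float], float],
    lookup_pol: Callable[[float], float],
    pol_residual: Callable[[float], float],
    update_block: Callable[[float], float],
    disp: float,
    iters: int = 24,
) -> Tuple[float, List[float]]:
    """Run the refinement loop and record corr_enhanced at each iteration."""
    trace = []
    for i in range(iters):
        corr = lookup_corr(disp)          # stereo correlation at current disparity
        pol_corr = lookup_pol(disp)       # polarization correlation features
        alpha = i / max(iters - 1, 1)     # recomputed from the index every iteration
        corr_enhanced = corr + alpha * pol_residual(pol_corr)
        trace.append(corr_enhanced)
        disp = disp + update_block(corr_enhanced)  # disparity update
    return disp, trace


# Toy stand-ins: constant lookups and an update block that changes nothing
final_disp, trace = scheduled_refinement(
    lookup_corr=lambda d: 1.0,
    lookup_pol=lambda d: 1.0,
    pol_residual=lambda p: 0.5 * p,
    update_block=lambda c: 0.0,
    disp=0.0,
    iters=3,
)
print(trace)  # [1.0, 1.25, 1.5]
```

With constant inputs, the only thing that changes between iterations is alpha, so the trace makes the growing pol contribution visible: none at the first iteration, half at the second, and full at the third.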
5. Components and Modules
5.1 PolCorrResidual

    import torch
    import torch.nn as nn

    class PolCorrResidual(nn.Module):
        """Maps polarization correlation features to a scaled residual Δcorr."""

        def __init__(self, pol_dim, corr_dim, hidden_dim=64, init_scale=0.1):
            super().__init__()  # required before registering submodules and parameters
            self.net = nn.Sequential(
                nn.Conv2d(pol_dim, hidden_dim, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden_dim, corr_dim, 1),  # project to corr dimension
            )
            # Learnable scalar gain applied to the residual
            self.scale = nn.Parameter(torch.tensor(init_scale))
            # Initialize the last layer to 0 → initial Δcorr = 0
            nn.init.zeros_(self.net[-1].weight)
            nn.init.zeros_(self.net[-1].bias)

        def forward(self, pol_corr):
            return self.scale * self.net(pol_corr)

- `net`: three convolutional layers (3×3 → 3×3 → 1×1); the final 1×1 convolution projects features to `corr_dim`.
- `scale`: learnable scalar, `init_scale=0.1`.
- The last layer is initialized to 0, so `Δcorr ≈ 0` at the start of training and learning begins from a stable starting point.
5.2 Schedule coefficient alpha
alpha = i / max(iters - 1, 1) is a fixed function of iteration, not a learnable parameter. It is recomputed from the current index at every GRU iteration.
6. Tensor Dimensions
| Tensor | Shape / Type | Description |
|---|---|---|
| `pol_corr` | (B, pol_dim, H, W) | Output of PolCorrBlock |
| `Δcorr` | (B, corr_dim, H, W) | Residual output of pol_residual |
| `alpha` | scalar (plain value, not a parameter) | i / max(iters-1, 1), range [0, 1] |
| `corr_enhanced` | (B, corr_dim, H, W) | corr + alpha * Δcorr |
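The shapes in the table can be checked with a small PyTorch run. The module is restated so the snippet runs on its own, and the sizes (B=2, pol_dim=36, corr_dim=36, H=8, W=16) are illustrative assumptions, not values taken from the design.

```python
import torch
import torch.nn as nn


class PolCorrResidual(nn.Module):
    # Restated from section 5.1 so the snippet runs on its own
    def __init__(self, pol_dim, corr_dim, hidden_dim=64, init_scale=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pol_dim, hidden_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden_dim, corr_dim, 1),
        )
        self.scale = nn.Parameter(torch.tensor(init_scale))
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, pol_corr):
        return self.scale * self.net(pol_corr)


# Illustrative sizes only (assumptions, not values from the design)
B, pol_dim, corr_dim, H, W = 2, 36, 36, 8, 16
iters = 24

pol_residual = PolCorrResidual(pol_dim, corr_dim)
corr = torch.randn(B, corr_dim, H, W)
pol_corr = torch.randn(B, pol_dim, H, W)

i = 12                          # an arbitrary mid-schedule iteration
alpha = i / max(iters - 1, 1)   # plain Python float, broadcasts over the tensor
delta_corr = pol_residual(pol_corr)
corr_enhanced = corr + alpha * delta_corr

print(delta_corr.shape)     # torch.Size([2, 36, 8, 16])
print(corr_enhanced.shape)  # torch.Size([2, 36, 8, 16])
# With the zero-initialized last layer, the untrained residual is exactly zero
print(torch.equal(corr_enhanced, corr))  # True
```

Because the last layer starts at zero, an untrained module leaves `corr` unchanged at every iteration, whatever the value of alpha.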
7. Hyperparameters
| Hyperparameter | Value | Description |
|---|---|---|
| `pol_levels` | 4 | Number of pyramid levels in the polarization volume |
| `pol_radius` | 4 | Lookup radius of the polarization volume |
| `iters` | 24 | Number of GRU iterations (also determines the length of the α schedule) |
| `hidden_dim` | 64 | Number of channels in the intermediate layer of PolCorrResidual |
| `init_scale` | 0.1 | Initial value of the learnable scale parameter |
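One way to keep these values together is a small configuration object. The sketch below is illustrative: the class name `ScheduledResidualConfig` and its `alpha` method are assumptions, and only the default values come from the table. It also shows why `iters` matters for the schedule: the same iteration index maps to a different α when the iteration count changes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScheduledResidualConfig:  # hypothetical name, for illustration only
    pol_levels: int = 4
    pol_radius: int = 4
    iters: int = 24
    hidden_dim: int = 64
    init_scale: float = 0.1

    def alpha(self, i: int) -> float:
        """Schedule coefficient at zero-based iteration i."""
        return i / max(self.iters - 1, 1)


cfg = ScheduledResidualConfig()
short_cfg = replace(cfg, iters=12)  # same settings, half as many iterations

print(cfg.alpha(23))        # 1.0
print(short_cfg.alpha(11))  # 1.0
# The same index sits mid-schedule in the 24-iteration configuration
print(cfg.alpha(11) < 0.5)  # True
```

Running with a different `iters` than the one used in training therefore stretches or compresses the schedule rather than truncating it.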
8. Design Decisions and Rationale
| Decision | Rationale |
|---|---|
| Introduce a linear schedule `alpha = i/(iters-1)` | Lets the pol residual strength grow with GRU iterations, aligned with the refinement timing |
| Early phase α≈0 | Protects pretrained stereo geometry from being disturbed by the pol residual |
| Late phase α→1 | RAFT does only small corrections in late iterations, the right moment for pol to act as a refinement prior |
| `alpha` is a fixed function, not learnable | The schedule is a prior structure determined directly by the iteration; no learning needed |
| Last layer of `PolCorrResidual` initialized to 0 | Δcorr≈0 at the start of training, learning starts from a stable point |
| UpdateBlock keeps the original RAFT | Fully preserves pretrained capability |
9. Highlights
- A linear iteration schedule `alpha = i / (iters-1)` grows the polarization residual strength from 0 to 1, so pol only intervenes at the “right moment”.
- Three phases in one shot: early phase α≈0 protects stereo geometry, mid phase α grows for auxiliary evidence, late phase α→1 becomes a refinement prior.
- `alpha` is a fixed function of iteration rather than a learnable parameter: the schedule serves as a prior structure determined directly by the iteration index, with zero extra parameters.
- The polarization residual turns pol from a “distractor throughout the process” into a “refinement tool at the right moment”, exactly aligned with RAFT’s small-correction behavior in late iterations.
- The last layer of `PolCorrResidual` is initialized to 0, giving `Δcorr ≈ 0` at the start of training, and the original RAFT UpdateBlock is reused, maximally preserving pretrained capability.