Gradient Gating Architecture — Po-Ting Lin (林柏廷)

1. Design Goals

An iteration schedule controls “when” the polarization (pol) residual is injected into the correlation volume, but it leaves open “where” to inject it: the residual strength depends only on the iteration index and is identical in every region.

The locations where polarization should intervene, however, vary spatially. Regions with large gradients in the disparity map are often object boundaries, occlusions, or hard-to-match regions, which is exactly where stereo is less reliable and pol should step in.

The design goal of this architecture is to use the disparity gradient as an uncertainty proxy, giving pol more say in “uncertain regions”. To do so, the iteration schedule is multiplied by a spatial gate driven by the disparity gradient.

2. Architecture Mechanism: Gradient Gating

disp_grad = compute_gradient(disp).detach()  # stop-grad to avoid feedback loop!
gate = GatingNetwork(pol_corr, disp_grad)    # tiny network, sigmoid → [0, 1]
alpha = i / max(iters - 1, 1)                # iteration schedule
corr_enhanced = corr + alpha * gate * pol_residual(pol_corr)

The overall formula is a three-way multiplicative modulation: corr + α · gate · residual.

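α (iteration schedule): i / max(iters-1, 1), controls “when” to inject pol. Assuming i runs from 0 to iters-1, α rises linearly from 0 at the first iteration to 1 at the last.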
gate (gradient gating): produced by GatingNetwork, controls “where” to inject pol.
residual: polarization residual produced by PolCorrResidual, controls the injected content.
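
The formula above can be sketched as a single function. This is a minimal illustration, not the author's implementation: the helper name inject_pol, the tensor sizes, and the random stand-ins for gate and residual are assumptions made for the example. It mainly shows that a (B, 1, H, W) gate broadcasts over the corr_dim channels of the residual.

```python
import torch

def inject_pol(corr, pol_res, gate, i, iters):
    """Return corr + alpha * gate * pol_res for iteration index i (assumed 0-based)."""
    alpha = i / max(iters - 1, 1)   # "when": 0 at the first iteration, 1 at the last
    # gate has shape (B, 1, H, W) and broadcasts across the channels of pol_res
    return corr + alpha * gate * pol_res

# Illustrative sizes only (hypothetical, not taken from the document).
B, C, H, W = 2, 8, 4, 5
corr = torch.randn(B, C, H, W)
pol_res = torch.randn(B, C, H, W)   # stand-in for pol_residual(pol_corr)
gate = torch.rand(B, 1, H, W)       # stand-in for the GatingNetwork output

first = inject_pol(corr, pol_res, gate, i=0, iters=24)
last = inject_pol(corr, pol_res, gate, i=23, iters=24)
```

At i = 0 the result equals corr unchanged, and at the final iteration the residual is weighted by the gate alone.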

3. Design Points

3.1 disp_grad.detach() — stop-gradient

disp_grad = compute_gradient(disp).detach()

disp_grad is the spatial gradient of the current disparity estimate, serving as a structural cue for “uncertainty”.
.detach() blocks gradient backpropagation through disp_grad, so no feedback loop forms between the gate and the disparity estimate.
Without detach, the network could minimize the loss by manipulating the disparity to make the gate larger or smaller. The gate would then degrade into a learnable shortcut and lose its physical meaning.
After detach, disp_grad is purely a side signal that “reads the shape of disparity” — it is a structural cue rather than a learnable shortcut.
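
The effect of the stop-gradient can be checked on a toy tensor. The sketch below assumes a forward-difference compute_gradient that returns two channels (horizontal and vertical); the document does not specify the operator or the channel count, so both are illustrative choices.

```python
import torch
import torch.nn.functional as F

def compute_gradient(disp):
    """Forward differences of a (B, 1, H, W) map, returned as (B, 2, H, W).

    Assumed operator: the source does not say how the gradient is computed.
    """
    dx = disp[..., :, 1:] - disp[..., :, :-1]
    dy = disp[..., 1:, :] - disp[..., :-1, :]
    dx = F.pad(dx, (0, 1, 0, 0))   # zero-pad the last column back to width W
    dy = F.pad(dy, (0, 0, 0, 1))   # zero-pad the last row back to height H
    return torch.cat([dx, dy], dim=1)

disp = torch.rand(1, 1, 6, 6, requires_grad=True)

disp_grad = compute_gradient(disp).detach()          # side signal, cut from autograd
weight = disp_grad.abs().sum(dim=1, keepdim=True)    # toy gate-like weight

# The loss depends on disp directly and, numerically, via weight.
loss = (weight * disp).sum()
loss.backward()
```

Because weight is detached, disp.grad equals weight itself: autograd treats the gradient-derived signal as a constant, and no gradient reaches disp through it.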

3.2 GatingNetwork is tiny

GatingNetwork is deliberately designed to be very small:

Only 2 conv layers.
No BatchNorm.
No attention.
Inputs are pol_corr and disp_grad; sigmoid at the end outputs a gate in [0, 1].

Design rationale: this is a proof of mechanism, not a black box. The smaller the network, the more confidently any gain can be attributed to “gradient gating as a mechanism” itself rather than to brute-force fitting by a high-capacity network.

3.3 α × gate

Using gate alone has a problem: in early iterations the disparity is still coarse, disp_grad is itself noisy, and the gate is unreliable.
Therefore the iteration schedule alpha is kept: α × gate ensures that even if the gate is noisy in early iterations, the overall injection strength is still suppressed by α≈0.
Only in mid/late iterations, after disparity stabilizes, does disp_grad become meaningful and the gate truly take effect.
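
A few values of the schedule make the suppression concrete. The snippet uses iters = 24 from the hyperparameter table and assumes a 0-based iteration index; the worst-case gate value of 1.0 is a hypothetical used to bound the injection weight.

```python
def alpha_schedule(i, iters):
    """Linear ramp: 0 at i = 0, 1 at i = iters - 1 (i assumed 0-based)."""
    return i / max(iters - 1, 1)

iters = 24
worst_case_gate = 1.0   # hypothetical: gate fully open because early disp_grad is noisy

for i in (0, 1, 2, 12, 23):
    a = alpha_schedule(i, iters)
    # alpha * gate can never exceed alpha, since gate lies in [0, 1]
    print(f"iter {i:2d}: alpha = {a:.3f}, max injection weight = {a * worst_case_gate:.3f}")
```

Even with the gate saturated at 1, the injection weight at i = 1 is only 1/23, so a noisy early gate has little influence on corr_enhanced.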

4. Architecture (Data Flow)

[Figure: Gradient Gating data flow]

5. Components and Modules

5.1 GatingNetwork

Takes pol_corr and the detached disp_grad as input, passes through 2 conv layers (no BN, no attention), and ends with sigmoid to output a spatial gate of shape (B, 1, H, W) with values in [0, 1].
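
A module matching this description might look as follows. Only the two conv layers, the absence of BN and attention, the two inputs, and the final sigmoid come from the text. The channel concatenation, 3×3 kernels, ReLU, hidden width of 16, two gradient channels, and pol_dim = 36 are assumptions chosen for the sketch.

```python
import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    """Two conv layers, no BatchNorm, no attention, sigmoid output."""

    def __init__(self, pol_dim, grad_ch=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pol_dim + grad_ch, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),   # assumed activation between the two convs
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, pol_corr, disp_grad):
        # Assumed fusion: concatenate both inputs along the channel axis.
        x = torch.cat([pol_corr, disp_grad], dim=1)
        return torch.sigmoid(self.net(x))   # (B, 1, H, W), values in [0, 1]

gate_net = GatingNetwork(pol_dim=36)        # pol_dim is illustrative
pol_corr = torch.randn(2, 36, 8, 10)
disp_grad = torch.randn(2, 2, 8, 10)        # would be detached in the real pipeline
gate = gate_net(pol_corr, disp_grad)
```

With these illustrative widths the network has only a few thousand parameters, which fits the stated intent of validating the mechanism rather than adding capacity.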

5.2 PolCorrResidual

Three convolutional layers (3×3 → 3×3 → 1×1) plus a learnable scalar scale. The last layer is initialized to 0, so Δcorr starts at zero. The output is a polarization residual Δcorr projected to corr_dim.
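
The layer layout described here can be sketched as below. The kernel sizes, the zero-initialized last layer, and the learnable scalar come from the text. The hidden width, ReLU activations, initial scale of 1.0, zeroing of the bias as well as the weight, and the example channel counts are assumptions.

```python
import torch
import torch.nn as nn

class PolCorrResidual(nn.Module):
    """3x3 -> 3x3 -> 1x1 convs with a learnable scalar scale and zero-init last layer."""

    def __init__(self, pol_dim, corr_dim, hidden=64):
        super().__init__()
        self.conv1 = nn.Conv2d(pol_dim, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(hidden, corr_dim, kernel_size=1)
        self.scale = nn.Parameter(torch.ones(1))   # assumed initial value
        # Zero-init the last layer (weight and, by assumption, bias).
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, pol_corr):
        x = torch.relu(self.conv1(pol_corr))       # assumed activation
        x = torch.relu(self.conv2(x))
        return self.scale * self.proj(x)           # (B, corr_dim, H, W)

pol_residual = PolCorrResidual(pol_dim=36, corr_dim=32)   # illustrative sizes
delta = pol_residual(torch.randn(2, 36, 8, 10))
```

At initialization delta is exactly zero, so corr_enhanced equals corr and the polarization branch has no effect until the last layer moves away from zero.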

5.3 Schedule Coefficient alpha

alpha = i / max(iters - 1, 1), a fixed function of the iteration index, not a learnable parameter. The max(..., 1) guards against division by zero when iters = 1.

6. Tensor Dimensions

Tensor	Shape / Type	Description
`disp`	`(B, 1, H, W)`	Current disparity estimate
`disp_grad`	`(B, *, H, W)`	Spatial gradient of disparity, detached
`pol_corr`	`(B, pol_dim, H, W)`	Output of PolCorrBlock
`gate`	`(B, 1, H, W)`	Output of GatingNetwork, in [0, 1]
`alpha`	scalar	Iteration schedule
`Δcorr`	`(B, corr_dim, H, W)`	Output of pol_residual
`corr_enhanced`	`(B, corr_dim, H, W)`	`corr + α·gate·Δcorr`

7. Hyperparameters

Hyperparameter	Value	Description
`pol_levels`	4	Number of pyramid levels in the polarization volume
`pol_radius`	4	Lookup radius of the polarization volume
`iters`	24	Number of GRU iterations (also determines the length of the α schedule)

8. Design Decisions and Rationale

Decision	Rationale
Use disparity gradient as uncertainty proxy	High-gradient regions (boundaries/occlusions) are mostly stereo-unreliable regions, exactly where pol should step in
`disp_grad.detach()`	Blocks the feedback loop, prevents the gate from degrading into a learnable shortcut
`GatingNetwork` is only 2 conv layers, no BN, no attention	Mechanism proof — validates the mechanism itself rather than relying on network capacity
Keep the iteration schedule `alpha` and multiply with the gate	Early-iteration `disp_grad` is noisy; its influence is suppressed because α≈0 in early iterations
Residual and UpdateBlock follow the standard design	Adds a “where” dimension on top of “when”; everything else unchanged

9. Highlights

Uses the disparity gradient as an uncertainty proxy, letting the polarization residual automatically focus on object boundaries, occlusions, and other stereo-unreliable regions.
The injection formula corr + α · gate · residual is a three-way multiplicative modulation — α controls “when”, gate controls “where”, and residual controls the “content”.
disp_grad.detach() blocks gradient backpropagation, preventing the network from manipulating disparity to turn the gate into a learnable shortcut, so gradient gating retains its physical meaning.
GatingNetwork is deliberately tiny (2 conv layers, no BN, no attention), positioned as mechanism validation to ensure that the effect comes from the gating mechanism itself rather than from network capacity.
The α × gate multiplicative design handles early-iteration disparity noise: even if the gate is unreliable, the overall injection is still suppressed because α≈0 in early iterations.