Pol-in-Feature Early Fusion Architecture

1. Design Goals

The use of polarization information in stereo matching can occur at two levels: as a post-correlation intervention that corrects the correlation volume after it has formed, or as an early fusion that lets polarization participate before feature extraction.

The fundamental limitation of post-correlation intervention is that, no matter how many residuals, schedules, or gates are added after the correlation step, polarization acts only after matching has already happened and cannot change the matching features themselves.

The design goal of this architecture is to let polarization enter feature extraction and influence matching directly: because polarization is fed in before features are extracted, the matching features learned by the feature encoder are shaped by it.

The core change is as follows:

# Conventional approach
fmap1 = fnet(left)   # 3 channels
fmap2 = fnet(right)  # 3 channels
# Pol can only intervene after corr is formed

# Pol-in-Feature (this architecture)
pol_diff = left - right
fmap1 = fnet(concat(left, pol_diff))   # 6 channels
fmap2 = fnet(concat(right, pol_diff))  # 6 channels
# Pol directly influences feature extraction

Concrete steps:

Compute pol_diff = left - right (3-channel polarization difference).
Concatenate pol_diff with the original image along the channel dimension: concat(left, pol_diff) for the left path and concat(right, pol_diff) for the right path, each with 6 channels.
Feed the 6-channel tensor into the feature encoder fnet to obtain fmap1 / fmap2.

In this way, polarization is already present at the encoder input, so the matching features the encoder learns can depend on it.
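The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the project's code: the function name build_fnet_inputs, the batch size, and the image size are assumptions made for the example, and random arrays stand in for real images.

```python
import numpy as np


def build_fnet_inputs(left, right):
    """Return the 6-channel left and right encoder inputs.

    left, right: arrays of shape (B, 3, H, W).
    """
    if left.shape != right.shape:
        raise ValueError("left and right must have the same shape")
    # Step 1: one shared difference tensor, shape (B, 3, H, W).
    pol_diff = left - right
    # Step 2: concatenate along the channel axis (axis=1) for each path.
    left_in = np.concatenate([left, pol_diff], axis=1)    # (B, 6, H, W)
    right_in = np.concatenate([right, pol_diff], axis=1)  # (B, 6, H, W)
    # Step 3 (not shown): each tensor is passed to fnet.
    return left_in, right_in


rng = np.random.default_rng(0)
left = rng.random((2, 3, 8, 16), dtype=np.float32)   # illustrative sizes
right = rng.random((2, 3, 8, 16), dtype=np.float32)
left_in, right_in = build_fnet_inputs(left, right)
print(left_in.shape, right_in.shape)
```

The last three channels of both inputs are the same pol_diff values, which is what makes the difference tensor shared between the two paths.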

2. Design Principles

This architecture deliberately adopts a minimal design in order to isolate a single variable and validate the effectiveness of early fusion:

No new branch.
No attention.
No residual.
No gating.
One question only: can polarization influence feature matching?

No sophisticated module is introduced; the only thing under test is whether feeding polarization into the feature extraction stage actually helps.

The design principle this architecture follows: polarization does not perform spatial gating on its own, but instead enters the feature extraction stage where it can influence matching.

3. Architecture (Data Flow)

[Figure: Pol-in-Feature Early Fusion data flow]

Note that a single pol_diff tensor is computed once and fed to both the left and right paths.
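The order of operations in this data flow can be sketched end to end as below. Only the ordering is taken from this document; toy_fnet, toy_corr, and toy_update_loop are hypothetical stand-ins invented for the example (a strided channel mix, a row-wise dot product, and a single argmax readout) and do not reproduce RAFT's encoder, CorrBlock, or GRU update.

```python
import numpy as np

_rng = np.random.default_rng(0)
# Toy weights mixing 6 input channels into 8 feature channels (assumption).
_W = _rng.standard_normal((8, 6)).astype(np.float32)


def toy_fnet(x, stride=4):
    """Stand-in encoder: subsample spatially, then mix the 6 input channels."""
    return np.einsum("oc,bchw->bohw", _W, x[:, :, ::stride, ::stride])


def toy_corr(fmap1, fmap2):
    """Stand-in correlation: dot products between pixels on the same row."""
    return np.einsum("bchi,bchj->bhij", fmap1, fmap2)  # (B, H, W, W)


def toy_update_loop(corr, iters):
    """Stand-in for the GRU loop: one argmax readout; iters is unused here."""
    best = corr.argmax(axis=-1)                       # (B, H, W)
    cols = np.arange(corr.shape[2])[None, None, :]    # (1, 1, W)
    return (cols - best).astype(np.float32)[:, None]  # (B, 1, H, W)


def pol_in_feature_forward(left, right, fnet, build_corr, update_loop, iters=24):
    """Data flow only: shared pol_diff, two encoder passes, then reused stages."""
    pol_diff = left - right                                  # computed once
    fmap1 = fnet(np.concatenate([left, pol_diff], axis=1))   # left path
    fmap2 = fnet(np.concatenate([right, pol_diff], axis=1))  # right path
    corr = build_corr(fmap1, fmap2)                          # downstream unchanged
    return update_loop(corr, iters)


left = _rng.random((1, 3, 16, 32), dtype=np.float32)
right = _rng.random((1, 3, 16, 32), dtype=np.float32)
disp = pol_in_feature_forward(left, right, toy_fnet, toy_corr, toy_update_loop)
print(disp.shape)
```

Passing the stages in as callables mirrors the point of the design: only the tensors handed to fnet differ from the conventional pipeline, while the correlation and update stages are called exactly as before.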

4. Components and Modules

pol_diff computation: pol_diff = left - right, a 3-channel polarization difference tensor, shared between the left and right paths.
fnet (Feature Encoder): input changed from 3 channels to 6 channels; runs once for each side, producing fmap1 / fmap2.
CorrBlock: builds the correlation volume from fmap1 / fmap2, following the original RAFT design.
UpdateBlock / GRU Loop: reuses the original RAFT update unit and iteration loop to produce disparity.

5. Tensor Dimensions

Tensor	Dimensions	Description
`left` / `right`	`(B, 3, H, W)`	Original RGB images
`pol_diff`	`(B, 3, H, W)`	`left - right`, 3-channel polarization difference
`concat(left, pol_diff)`	`(B, 6, H, W)`	Left-path fnet input
`concat(right, pol_diff)`	`(B, 6, H, W)`	Right-path fnet input
`fnet.conv1` input channels	3 → 6	Changed from 3 to 6 channels
`fmap1` / `fmap2`	fnet output features	Used to build CorrBlock

6. Hyperparameters

Hyperparameter	Value	Description
`pol_levels`	4	Pol pyramid levels
`pol_radius`	4	Pol lookup radius
`iters`	24	GRU iterations
curriculum	enabled	Curriculum training schedule
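One way to keep these settings together is a small configuration object. The sketch below is an assumption about packaging, not the project's actual configuration code: the class name PolInFeatureConfig and the field fnet_in_channels are hypothetical, while the values come from the table above and from the 6-channel input described earlier.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PolInFeatureConfig:
    """Hypothetical container for the hyperparameters listed above."""
    pol_levels: int = 4         # Pol pyramid levels
    pol_radius: int = 4         # Pol lookup radius
    iters: int = 24             # GRU iterations
    curriculum: bool = True     # curriculum training schedule enabled
    fnet_in_channels: int = 6   # assumed field: 3 image + 3 pol_diff channels


cfg = PolInFeatureConfig()
print(asdict(cfg))
```

Marking the dataclass as frozen prevents a value from being changed by accident partway through an experiment.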

7. Design Decisions and Rationale

Decision	Rationale
Polarization enters feature extraction (early fusion)	The only injection point that can directly influence matching itself
Inject polarization via 6-channel concat	The most direct form of early fusion, no extra modules needed
Left and right paths share the same `pol_diff`	`pol_diff` is a single tensor formed by left minus right; there is only one copy
No branch / attention / residual / gating	Isolates a single variable to purely test “can polarization influence feature matching”
Reuses RAFT’s original CorrBlock and UpdateBlock	Only the input to feature extraction is changed; downstream is untouched

8. Implementation Notes

Changing fnet’s input from 3 to 6 channels changes the weight shape of fnet.conv1 along its input-channel dimension, so the pretrained fnet.conv1 weights, which were trained for a 3-channel input, are incompatible with the new 6-channel input. When loading pretrained weights, this layer is skipped and must be trained from scratch. This is the cost of the design: the first convolutional layer loses its pretrained initialization.
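Skipping the incompatible layer can be done by filtering the checkpoint by shape before loading. The sketch below is framework-agnostic and illustrative: NumPy arrays stand in for checkpoint tensors, the helper name filter_compatible is hypothetical, and the parameter names and the 64-filter, 7x7 kernel shape are assumptions made for the example rather than values taken from this document.

```python
import numpy as np


def filter_compatible(pretrained, target_shapes):
    """Split a checkpoint into entries that fit the new model and entries to skip."""
    kept, skipped = {}, []
    for name, value in pretrained.items():
        if name in target_shapes and value.shape == target_shapes[name]:
            kept[name] = value
        else:
            skipped.append(name)  # left at its fresh initialization
    return kept, skipped


# Illustrative checkpoint trained with a 3-channel first convolution.
pretrained = {
    "fnet.conv1.weight": np.zeros((64, 3, 7, 7), dtype=np.float32),
    "fnet.layer1.0.conv1.weight": np.zeros((64, 64, 3, 3), dtype=np.float32),
}
# Shapes expected by the new model, whose first convolution takes 6 channels.
target_shapes = {
    "fnet.conv1.weight": (64, 6, 7, 7),
    "fnet.layer1.0.conv1.weight": (64, 64, 3, 3),
}

kept, skipped = filter_compatible(pretrained, target_shapes)
print("loaded:", sorted(kept))
print("skipped:", skipped)
```

In a PyTorch training script, the kept dictionary would typically be passed to load_state_dict with strict=False. The filtering step matters because a shape mismatch raises an error even in non-strict mode.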

9. Highlights

Earliest possible polarization injection point: polarization participates before features are extracted, which is the only way to change the matching features themselves directly and is intended to lift the ceiling of post-correlation corrections.
Zero extra modules: no new branch, attention, residual, or gating; early fusion is achieved purely through channel concatenation, the cleanest control design for validating the early fusion hypothesis.
Single-variable isolation: deliberately minimal design so that “can polarization influence feature matching” becomes the sole testable variable.
Downstream fully reused: CorrBlock and UpdateBlock are untouched; all changes are concentrated at the input side of feature extraction, minimizing the architectural change surface.