1. Design Goals
If a polarization stereo matching system aims to exploit polarization information without introducing explicit warp alignment or an additional refinement pass, it must face a core problem: when the left and right images are not aligned, per-pixel polarization differences are contaminated by disparity misalignment.
The design goal of this architecture is to obtain a region-level polarization cue with the lowest implementation complexity and computational cost: compute polarization features at low resolution (where precise alignment is not required), inject them into the Context Encoder, and finish in a single pass.
The underlying premise: when an image is downsampled by a factor S, the left-right disparity (measured in pixels) shrinks by the same factor, so the misalignment becomes smaller relative to a fixed-size convolutional receptive field, allowing a coarse, region-level polarization cue to be extracted without explicit warping.
2. Architecture (with Data Flow)
An honest assessment of the alignment problem
Downsampling does not really eliminate the alignment problem. The estimated disparity magnitudes at different resolutions are:
Original resolution 640×480, typical disparity 0~160px:
- 1/4 (160×120): disparity 0~40px, still large
- 1/8 (80×60): disparity 0~20px, still significant
- 1/16 (40×30): disparity 0~10px, 25% of image width, not negligible
Low resolution therefore reduces the misalignment in pixels but does not remove it. What Coarse-Scale provides is not a pixel-level pol signal but a coarser, region-level signal of “how large is the left/right statistical difference in this region”. CoarsePolEncoder, using convolutions with a large receptive field, is expected to learn to compensate for the misalignment.
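
The scaling behind these numbers is plain division, which a few lines of Python make explicit. This is a minimal sketch; the function name `scaled_disparity` is illustrative and not part of the design. It also shows that the disparity stays at the same fraction of the image width at every scale, so only the absolute pixel count shrinks.

```python
def scaled_disparity(max_disp_px, width_px, scale):
    """Return (max disparity in coarse pixels, disparity as a fraction of coarse width)."""
    coarse_disp = max_disp_px / scale
    coarse_width = width_px / scale
    return coarse_disp, coarse_disp / coarse_width


# Full-resolution values taken from the text: width 640px, disparity up to 160px.
for s in (4, 8, 16):
    disp, frac = scaled_disparity(160, 640, s)
    print(f"1/{s}: up to {disp:.0f}px, {frac:.0%} of coarse width")
```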
3. Components and Modules
3.1 Downsample
- Downsample `left` and `right` to 1/S resolution, producing `coarse_left` and `coarse_right`. S = 8 or S = 16.
3.2 CoarsePolEncoder
- Lightweight CNN (2–3 layers).
- Input: 6 channels (`coarse_left` + `coarse_right` concatenated, 3ch each).
- Output: K-channel coarse pol features `coarse_pol`.
- Designed to use convolutions with a large receptive field, expected to learn to compensate for left/right misalignment.
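
Whether a 2–3 layer CNN has a "large enough" receptive field can be checked with simple arithmetic. The sketch below uses the standard receptive-field recurrence for stacked convolutions. The kernel sizes and dilations are assumptions chosen for illustration (the design does not fix them), and `covers_disparity` is a rough heuristic: it only asks whether the half-width of the receptive field reaches the maximum coarse disparity.

```python
def receptive_field(layers):
    """Receptive field width (in input pixels) of stacked convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, first layer first.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


def covers_disparity(rf, max_disp):
    """Heuristic: the half-width of the receptive field must reach max_disp pixels."""
    return (rf - 1) // 2 >= max_disp


# Hypothetical 3-layer configurations (not specified by the design).
configs = {
    "3x3 x3": [(3, 1, 1)] * 3,               # receptive field 7
    "7x7 x3": [(7, 1, 1)] * 3,               # receptive field 19
    "5x5 x3, dilation 2": [(5, 1, 2)] * 3,   # receptive field 25
}
for name, layers in configs.items():
    rf = receptive_field(layers)
    print(name, rf, "S=16:", covers_disparity(rf, 10), "S=8:", covers_disparity(rf, 20))
```

Under this heuristic, none of the three example stacks spans the 0~20px range at S = 8, and only the dilated one spans 0~10px at S = 16, so kernel size, dilation, and S would need to be chosen together.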
3.3 Upsample
- Upsample `coarse_pol` back to H×W, producing `coarse_pol_up`.
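
Sections 3.1 to 3.3 can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the reference implementation: average pooling for downsampling, bilinear interpolation for upsampling, 5×5 kernels, the channel widths (32, 64), and K = 32 are all choices made here for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarsePolEncoder(nn.Module):
    """Hypothetical 3-layer CNN: 6 input channels -> K output channels."""

    def __init__(self, k_out=32, widths=(32, 64), kernel_size=5):
        super().__init__()
        pad = kernel_size // 2  # "same" padding keeps the spatial size unchanged
        self.net = nn.Sequential(
            nn.Conv2d(6, widths[0], kernel_size, padding=pad),
            nn.ReLU(inplace=True),
            nn.Conv2d(widths[0], widths[1], kernel_size, padding=pad),
            nn.ReLU(inplace=True),
            nn.Conv2d(widths[1], k_out, kernel_size, padding=pad),
        )

    def forward(self, coarse_left, coarse_right):
        # Concatenate along channels: (B, 3, h, w) + (B, 3, h, w) -> (B, 6, h, w)
        return self.net(torch.cat([coarse_left, coarse_right], dim=1))


def coarse_pol_features(left, right, encoder, scale=8):
    """Downsample -> encode -> upsample; returns coarse_pol_up of shape (B, K, H, W)."""
    h, w = left.shape[-2:]
    coarse_left = F.avg_pool2d(left, kernel_size=scale)    # (B, 3, H/S, W/S)
    coarse_right = F.avg_pool2d(right, kernel_size=scale)  # (B, 3, H/S, W/S)
    coarse_pol = encoder(coarse_left, coarse_right)        # (B, K, H/S, W/S)
    return F.interpolate(coarse_pol, size=(h, w), mode="bilinear", align_corners=False)
```

A call such as `coarse_pol_features(left, right, CoarsePolEncoder(), scale=8)` on a 480×640 pair would run the encoder on 60×80 inputs and return a tensor at the original 480×640 resolution.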
3.4 fnet (Feature Encoder)
- Runs once for each of left and right, producing fmaps.
- fnet is not modified.
3.5 cnet (pol-aware Context Encoder)
- Inputs: `left` and `coarse_pol_up`.
- Outputs: pol-aware `context` and `hidden`.
3.6 CorrBlock + GRU
- `CorrBlock(fmaps)` → `corr_pyramid`.
- `GRU × N` with `context` → `disp`, in a single pass.
4. Tensor Dimensions
| Tensor | Dimensions | Description |
|---|---|---|
| left / right | (B, 3, H, W) | Input polarization image pair, H×W = 480×640 |
| coarse_left / coarse_right | (B, 3, H/S, W/S) | Downsampled, S = 8 or 16 |
| CoarsePolEncoder input | (B, 6, H/S, W/S) | coarse_left + coarse_right concatenated |
| coarse_pol | (B, K, H/S, W/S) | Coarse pol features, K channels |
| coarse_pol_up | (B, K, H, W) | Upsampled back to original resolution |
| fmaps | (B, C, ·, ·) | fnet output feature maps |
| context / hidden | (B, ·, ·, ·) | pol-aware context (cnet input includes pol) |
| disp | (B, 1, H, W) | Single-pass disparity output |
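
For concrete numbers, the shapes in the table can be instantiated with a small helper. This is a bookkeeping sketch only; `tensor_shapes` is a hypothetical name, K = 32 is an assumed value, and the fnet and cnet output shapes are left out because the table does not specify them.

```python
def tensor_shapes(B, H, W, S, K):
    """Shapes of the polarization-branch tensors for a given batch, size, scale, and K."""
    assert H % S == 0 and W % S == 0, "H and W are assumed to be divisible by S"
    return {
        "left": (B, 3, H, W),
        "coarse_left": (B, 3, H // S, W // S),
        "encoder_input": (B, 6, H // S, W // S),
        "coarse_pol": (B, K, H // S, W // S),
        "coarse_pol_up": (B, K, H, W),
        "disp": (B, 1, H, W),
    }


for S in (8, 16):
    print(S, tensor_shapes(B=1, H=480, W=640, S=S, K=32)["coarse_pol"])
```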
Disparity magnitudes at different downsampling scales (relative to 640×480)
| Scale | Resolution | Typical disparity range |
|---|---|---|
| 1/4 | 160×120 | 0~40px |
| 1/8 | 80×60 | 0~20px |
| 1/16 | 40×30 | 0~10px (25% of image width) |
5. Design Decisions and Rationale
| Item | Description |
|---|---|
| Downsample scale | 1/8 or 1/16 |
| CoarsePolEncoder | Lightweight CNN (2–3 layers), input 6ch (left+right coarse), output Kch |
| Injection method | concat into cnet’s first layer (cnet input goes from 3ch to 3+K ch) |
| Alternative injection | inject into a middle layer of cnet, by concat or by FiLM-style conditioning |
Core design idea: rather than pursuing pixel-level polarization alignment accuracy, accept the fact that misalignment persists even at low resolution, and use a region-level “left/right statistical difference” as the polarization cue, relying on CoarsePolEncoder’s large receptive field to absorb the alignment error.
Computational cost: ~1.15× baseline RAFT-Stereo (about 15% overhead).
6. Polarization Injection Points
There is only one polarization injection point, located at the Context Encoder:
| Injection point | Location | Form | Description |
|---|---|---|---|
| coarse_pol_up → cnet | Context Encoder input | concat (first or middle layer) | cnet’s first-layer input goes from 3ch to 3+K ch; or FiLM-style conditioning in a middle layer |
After injection, the resulting pol-aware context affects the entire iteration loop via GRU × N with context.
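
The two injection forms can be sketched in PyTorch as below. Both classes are hypothetical stand-ins rather than RAFT-Stereo's actual cnet layers: the 7×7 first convolution, the channel counts, and K = 32 are assumptions. The FiLM variant shown is spatially varying (a per-pixel scale and shift predicted from the pol features), which is one possible reading of "FiLM-style"; classic FiLM uses a single scale and shift per channel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConcatInjection(nn.Module):
    """First cnet layer widened from 3 to 3+K input channels (hypothetical stand-in)."""

    def __init__(self, k_pol=32, out_ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(3 + k_pol, out_ch, kernel_size=7, padding=3)

    def forward(self, left, coarse_pol_up):
        # (B, 3, H, W) + (B, K, H, W) -> (B, 3+K, H, W) -> (B, out_ch, H, W)
        return self.conv1(torch.cat([left, coarse_pol_up], dim=1))


class FiLMInjection(nn.Module):
    """Scale-and-shift conditioning of a mid-level cnet feature map by pol features."""

    def __init__(self, k_pol=32, feat_ch=64):
        super().__init__()
        # A 1x1 conv predicts a scale (gamma) and shift (beta) per feature channel.
        self.to_gamma_beta = nn.Conv2d(k_pol, 2 * feat_ch, kernel_size=1)

    def forward(self, feat, pol):
        # Resize pol to the resolution of the feature map being conditioned.
        pol = F.interpolate(pol, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        gamma, beta = self.to_gamma_beta(pol).chunk(2, dim=1)
        # (1 + gamma) leaves feat unchanged wherever gamma and beta are zero.
        return (1 + gamma) * feat + beta
```

Concat at the first layer is the simpler change, since only one convolution's input width differs from the baseline. The FiLM form leaves the first layer untouched but adds a small module inside cnet.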
Relation to design principles
This architecture follows three polarization-injection principles:
- Polarization does not enter fnet — satisfied (fnet untouched).
- Polarization does not only do spatial gating — satisfied (context injects into the GRU loop, and the GRU operates in disparity-aware space).
- Polarization only acts in disparity-aware space — satisfied (context is injected into the GRU every iteration, where the GRU is coupled with correlation lookup).
7. Highlights
- Single pass, no explicit warp: no second refinement pass or explicit alignment is needed; the polarization cue is extracted at low resolution and injected directly into the context, keeping the implementation complexity extremely low.
- Honest about alignment error: does not pretend that downsampling eliminates misalignment; instead it redefines the problem as a region-level “left/right statistical difference” and actively absorbs alignment error with CoarsePolEncoder’s large receptive field.
- Low extra cost: polarization computation happens at 1/8–1/16 resolution; CoarsePolEncoder has only about 0.1M parameters, giving an overall computational cost of ~1.15× baseline (about 15% overhead).
- No inter-pass dependency: polarization quality does not depend on the output quality of any previous pass; the architectural stability risk is concentrated on a single axis — “is the information detailed enough?”.
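
As a sanity check on the "about 0.1M parameters" figure, the parameter count of a small convolutional stack can be computed by hand. The channel widths and kernel sizes below are assumptions picked for illustration; the design only states 2–3 layers, 6 input channels, and K output channels.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of one k x k 2D convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def encoder_params(channels, k):
    """Total parameters of a conv stack whose channel widths are given in order."""
    return sum(conv2d_params(a, b, k) for a, b in zip(channels, channels[1:]))


# Hypothetical 3-layer encoder: 6 -> 32 -> 64 -> K=32
print(encoder_params([6, 32, 64, 32], k=5))  # 107328, roughly 0.1M
print(encoder_params([6, 32, 64, 32], k=3))  # 38720
```

This counts parameters only. The ~1.15× compute figure also depends on the widened cnet input and on the resolution at which each layer runs, which this sketch does not model.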