Coarse-Scale Polarization Context Architecture

1. Design Goals

If a polarization stereo matching system aims to exploit polarization information without introducing explicit warp alignment or an additional refinement pass, it must face a core problem: when the left and right images are not aligned, per-pixel polarization differences are contaminated by disparity misalignment.

The design goal of this architecture is to obtain a region-level polarization cue with the lowest implementation complexity and computational cost: compute polarization features at low resolution (where precise alignment is not required), inject them into the Context Encoder, and finish in a single pass.

The underlying premise: downsampling by a factor S shrinks the left-right disparity, measured in pixels, by the same factor S. The disparity as a fraction of image width is unchanged, but the misalignment becomes smaller relative to a fixed convolutional receptive field, allowing a coarse, region-level polarization cue to be extracted without explicit warping.

2. Architecture (with Data Flow)

Coarse-Scale Pol Context data flow

An honest assessment of the alignment problem

Downsampling does not really eliminate the alignment problem. The estimated disparity magnitudes at different resolutions are:

Original resolution 640×480, typical disparity 0~160px (25% of image width at every scale):
  1/4  (160×120):  disp 0~40px   ← still large
  1/8  (80×60):    disp 0~20px   ← still significant
  1/16 (40×30):    disp 0~10px   ← not negligible
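
The figures above follow from dividing both the disparity and the image width by the downsampling factor. A minimal sketch of that arithmetic is shown here; the function name disparity_at_scale is hypothetical and used only for illustration.

```python
def disparity_at_scale(max_disp_px, image_width_px, scale):
    """Return (max disparity, image width, disparity / width) at 1/scale resolution."""
    coarse_disp = max_disp_px / scale
    coarse_width = image_width_px / scale
    return coarse_disp, coarse_width, coarse_disp / coarse_width


if __name__ == "__main__":
    # 640 px wide input with a typical maximum disparity of 160 px.
    for s in (1, 4, 8, 16):
        disp, width, fraction = disparity_at_scale(160, 640, s)
        print(f"1/{s}: width {width:.0f}px, disparity 0~{disp:.0f}px, "
              f"{fraction:.0%} of width")
```

The absolute disparity shrinks with the scale, while its share of the image width stays constant, which is why the misalignment has to be absorbed rather than ignored.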

What Coarse-Scale provides is therefore not a pixel-level pol signal but a coarser, region-level signal of “how large is the left/right statistical difference in this region”. CoarsePolEncoder, using convolutions with a large receptive field, is expected to learn to compensate for the remaining misalignment.

3. Components and Modules

3.1 Downsample

Downsample left and right to 1/S resolution, producing coarse_left and coarse_right.
S = 8 or S = 16.

3.2 CoarsePolEncoder

Lightweight CNN (2–3 layers).
Input: 6 channels (coarse_left + coarse_right concatenated, 3ch each).
Output: K-channel coarse pol features coarse_pol.
Designed to use convolutions with a large receptive field, expected to learn to compensate for left/right misalignment.
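
Whether a receptive field is "large" enough depends on kernel sizes and dilations, which this document does not fix. The sketch below computes the receptive field of a stack of convolutions with the standard recurrence and compares its half-width with the coarse disparity range. The three layer configurations are hypothetical and chosen only for illustration; they are not the CoarsePolEncoder design.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of stacked convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, first layer first.
    """
    rf = 1    # a single output pixel sees one input pixel before any conv
    jump = 1  # distance in input pixels between adjacent outputs of the current layer
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


def covers_disparity(layers, max_disp_px):
    """True if a centred window reaches a pixel shifted by max_disp_px."""
    half_width = (receptive_field(layers) - 1) // 2
    return half_width >= max_disp_px


if __name__ == "__main__":
    # Hypothetical 3-layer configurations: (kernel_size, stride, dilation).
    configs = {
        "3x3, no dilation": [(3, 1, 1)] * 3,
        "7x7, no dilation": [(7, 1, 1)] * 3,
        "7x7, dilation 1/2/4": [(7, 1, 1), (7, 1, 2), (7, 1, 4)],
    }
    for name, layers in configs.items():
        rf = receptive_field(layers)
        print(f"{name}: RF {rf}px, "
              f"covers 20px (S=8): {covers_disparity(layers, 20)}, "
              f"covers 10px (S=16): {covers_disparity(layers, 10)}")
```

Under these assumptions, three plain 3×3 layers see only 7 pixels, so a 2–3 layer encoder would likely need large kernels, dilation, or strides for its window to span the 0~20px or 0~10px coarse disparity range.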

3.3 Upsample

Upsample coarse_pol back to H×W, producing coarse_pol_up.

3.4 fnet (Feature Encoder)

Runs once for each of left and right, producing fmaps.
fnet is not modified.

3.5 cnet (pol-aware Context Encoder)

Inputs: left and coarse_pol_up.
Outputs: pol-aware context and hidden.

3.6 CorrBlock + GRU

CorrBlock(fmaps) → corr_pyramid.
GRU × N with context → disp, a single pass.

4. Tensor Dimensions

Tensor	Dimensions	Description
left / right	(B, 3, H, W)	Input polarization image pair, H×W = 480×640
coarse_left / coarse_right	(B, 3, H/S, W/S)	Downsampled, S = 8 or 16
CoarsePolEncoder input	(B, 6, H/S, W/S)	coarse_left + coarse_right concatenated
coarse_pol	(B, K, H/S, W/S)	Coarse pol features, K channels
coarse_pol_up	(B, K, H, W)	Upsampled back to original resolution
fmaps	(B, C, ·, ·)	fnet output feature maps
context / hidden	(B, ·, ·, ·)	pol-aware context (cnet input includes pol)
disp	(B, 1, H, W)	Single-pass disparity output
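
The shapes in the table can be checked mechanically. The following sketch traces them for given B, H, W, S and K; the helper trace_shapes is hypothetical, K = 16 is an arbitrary example value, and the fmaps and context shapes are omitted because the table leaves them unspecified.

```python
def trace_shapes(B, H, W, S, K):
    """Return the tensor shapes of the coarse pol path as (B, C, H, W) tuples."""
    if H % S != 0 or W % S != 0:
        raise ValueError("H and W must be divisible by the downsampling factor S")
    h, w = H // S, W // S
    return {
        "left / right": (B, 3, H, W),
        "coarse_left / coarse_right": (B, 3, h, w),
        "CoarsePolEncoder input": (B, 6, h, w),  # left and right concatenated
        "coarse_pol": (B, K, h, w),
        "coarse_pol_up": (B, K, H, W),
        "cnet input (first-layer concat)": (B, 3 + K, H, W),
        "disp": (B, 1, H, W),
    }


if __name__ == "__main__":
    for name, shape in trace_shapes(B=2, H=480, W=640, S=8, K=16).items():
        print(f"{name:32s} {shape}")
```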

Disparity magnitudes at different downsampling scales (relative to 640×480; the range is 25% of image width at every scale)

Scale	Resolution	Typical disparity range
1/4	160×120	0~40px
1/8	80×60	0~20px
1/16	40×30	0~10px

5. Design Decisions and Rationale

Item	Description
Downsample scale	1/8 or 1/16
CoarsePolEncoder	Lightweight CNN (2–3 layers), input 6ch (left+right coarse), output Kch
Injection method	concat into cnet’s first layer (cnet input goes from 3ch to 3+K ch)
Alternative injection	inject into a middle layer of cnet, by concat or by FiLM-style conditioning (per-channel scale and shift)

Core design idea: rather than pursuing pixel-level polarization alignment accuracy, accept the fact that misalignment persists even at low resolution, and use a region-level “left/right statistical difference” as the polarization cue, relying on CoarsePolEncoder’s large receptive field to absorb the alignment error.

Computational cost: ~1.15× baseline RAFT-Stereo (about 15% overhead).
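
For a rough sense of how many parameters the added modules introduce, the sketch below counts convolution weights. All layer widths and kernel sizes are hypothetical (channels 6 → 64 → 64 → K with K = 16), and the 7×7, 64-channel first layer assumed for cnet is likewise an assumption rather than something stated in this document.

```python
def conv_params(c_in, c_out, kernel_size, bias=True):
    """Number of parameters in a 2D convolution with a square kernel."""
    return c_in * c_out * kernel_size * kernel_size + (c_out if bias else 0)


def encoder_params(channels, kernel_size):
    """Total parameters of a plain conv stack with the given channel widths."""
    return sum(conv_params(a, b, kernel_size)
               for a, b in zip(channels, channels[1:]))


def extra_first_layer_weights(K, c_out, kernel_size):
    """Extra weights when a first conv layer grows from 3 to 3 + K input channels."""
    return conv_params(3 + K, c_out, kernel_size) - conv_params(3, c_out, kernel_size)


if __name__ == "__main__":
    K = 16  # hypothetical number of coarse pol channels
    print("encoder, 3x3 kernels:", encoder_params([6, 64, 64, K], 3))
    print("encoder, 5x5 kernels:", encoder_params([6, 64, 64, K], 5))
    print("extra cnet first-layer weights:", extra_first_layer_weights(K, 64, 7))
```

Parameter count is only a proxy: the encoder runs at 1/S resolution, whereas a widened first cnet layer runs at the resolution of the upsampled input, so the two contribute differently to compute.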

6. Polarization Injection Points

There is only one polarization injection point, located at the Context Encoder:

Injection point	Location	Form	Description
coarse_pol_up → cnet	Context Encoder input	concat (first or middle layer) or FiLM (middle layer)	With first-layer concat, cnet’s input goes from 3ch to 3+K ch; alternatively, concat or FiLM-style conditioning in a middle layer

After injection, the resulting pol-aware context affects the entire iteration loop via GRU × N with context.
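
The two injection forms differ in what they do to the features. As a minimal illustration on a single pixel, concat appends the pol channels to the input channels, while FiLM-style conditioning applies a per-channel scale and shift that would be predicted from the pol features. The function names and the numeric values below are hypothetical.

```python
def inject_concat(image_px, pol_px):
    """Concat injection: stack pol channels after the image channels."""
    return list(image_px) + list(pol_px)


def inject_film(feature_px, gamma_px, beta_px):
    """FiLM-style injection: per-channel scale (gamma) and shift (beta)."""
    if not (len(feature_px) == len(gamma_px) == len(beta_px)):
        raise ValueError("feature, gamma and beta need the same channel count")
    return [g * f + b for f, g, b in zip(feature_px, gamma_px, beta_px)]


if __name__ == "__main__":
    # First-layer concat: 3 image channels + K = 2 pol channels -> 5 channels.
    print(inject_concat([0.1, 0.2, 0.3], [1.0, 2.0]))
    # Middle-layer FiLM: channel count is unchanged, values are modulated.
    print(inject_film([1.0, 2.0], gamma_px=[2.0, 0.5], beta_px=[0.0, 1.0]))
```

Concat changes the channel count of the layer it feeds, so it requires widening that layer; FiLM leaves the layer shapes untouched but needs a small extra head to produce gamma and beta.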

Relation to design principles

This architecture follows three polarization-injection principles:

Polarization does not enter fnet: satisfied (fnet is untouched).
Polarization is not limited to spatial gating: satisfied (the context feeds the GRU loop, and the GRU operates in disparity-aware space).
Polarization only acts in disparity-aware space: satisfied (the context is injected into the GRU at every iteration, where the GRU is coupled with the correlation lookup).

7. Highlights

Single pass, no explicit warp: no second refinement pass or explicit alignment is needed; the polarization cue is extracted at low resolution and injected directly into the context, keeping the implementation complexity extremely low.
Honest about alignment error: does not pretend that downsampling eliminates misalignment; instead it redefines the problem as a region-level “left/right statistical difference” and relies on CoarsePolEncoder’s large receptive field to absorb the alignment error.
Low extra cost: polarization computation happens at 1/8–1/16 resolution; CoarsePolEncoder has only about 0.1M parameters, giving an overall computational cost of ~1.15× baseline (about 15% overhead).
No inter-pass dependency: polarization quality does not depend on the output quality of any previous pass; the architectural stability risk is concentrated on a single axis — “is the information detailed enough?”.