Blueprint · 2026

Coarse-Scale Polarization Context Architecture

A single-pass polarization stereo matching architecture that computes a coarse polarization feature at low resolution and injects it into the Context Encoder, without requiring explicit warp-based alignment.

  • stereo matching
  • polarization
  • RAFT-Stereo

Using these blueprints

Everything here is an architecture proposal I designed and chose to publish openly. Free to use, adapt, or build on — no permission needed.

If one turns out useful and crediting is convenient, a link back to this site is appreciated. It's never required.

1. Design Goals

If a polarization stereo matching system aims to exploit polarization information without introducing explicit warp alignment or an additional refinement pass, it must face a core problem: when the left and right images are not aligned, per-pixel polarization differences are contaminated by disparity misalignment.

The design goal of this architecture is to obtain a region-level polarization cue with the lowest implementation complexity and computational cost: compute polarization features at low resolution (where precise alignment is not required), inject them into the Context Encoder, and finish in a single pass.

The underlying premise: downsampling by a factor S shrinks the left-right disparity, measured in pixels, by the same factor. The misalignment therefore becomes smaller relative to a fixed convolution receptive field (though not relative to image width), which allows a coarse, region-level polarization cue to be extracted without explicit warping.


2. Architecture (with Data Flow)

[Figure: Coarse-Scale Pol Context data flow]

An honest assessment of the alignment problem

Downsampling does not eliminate the alignment problem. The estimated disparity ranges at each resolution are:

Original resolution 640×480, typical disparity 0~160px:
  1/4  (160×120):  disp 0~40px   ← still large
  1/8  (80×60):    disp 0~20px   ← still significant
  1/16 (40×30):    disp 0~10px   ← still 25% of image width (as at every scale), not negligible

What Coarse-Scale provides is therefore not a pixel-level pol signal but a coarser, region-level one: “how large is the left/right statistical difference in this region”. CoarsePolEncoder, built from convolutions with a large receptive field, is expected to learn to compensate for the residual misalignment.
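The disparity figures above follow from dividing both the image size and the disparity by the downsampling factor. A minimal sketch of that arithmetic, where the function name `coarse_disparity_table` is a hypothetical helper introduced here for illustration:

```python
def coarse_disparity_table(width, height, max_disp, scales=(4, 8, 16)):
    """Return (scale, coarse_w, coarse_h, coarse_max_disp, fraction_of_width) rows."""
    rows = []
    for s in scales:
        coarse_w, coarse_h = width // s, height // s
        coarse_disp = max_disp / s  # disparity shrinks by the same factor as the image
        rows.append((s, coarse_w, coarse_h, coarse_disp, coarse_disp / coarse_w))
    return rows


# Values from the text: 640x480 input, typical disparity up to 160 px.
for s, w, h, d, frac in coarse_disparity_table(640, 480, 160):
    print(f"1/{s:<2} ({w}x{h}): disp 0~{d:g}px, {frac:.0%} of width")
```

The absolute disparity shrinks with the scale, but as a fraction of the image width it stays at 25% at every scale, which is why the misalignment has to be absorbed by the encoder rather than assumed away.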


3. Components and Modules

3.1 Downsample

  • Downsample left and right to 1/S resolution, producing coarse_left and coarse_right.
  • S = 8 or S = 16.

3.2 CoarsePolEncoder

  • Lightweight CNN (2–3 layers).
  • Input: 6 channels (coarse_left + coarse_right concatenated, 3ch each).
  • Output: K-channel coarse pol features coarse_pol.
  • Designed to use convolutions with a large receptive field, expected to learn to compensate for left/right misalignment.
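Whether a 2–3 layer CNN can actually span the coarse disparity depends on its kernel sizes and dilations. The sketch below uses the standard receptive-field recurrence for stacked convolutions; the layer configuration (7×7 kernels with dilations 1, 2, 4) is an assumption made for illustration, not a setting specified by this blueprint.

```python
def receptive_field(layers):
    """Receptive field (pixels per side) of stacked convolutions.

    layers: iterable of (kernel_size, stride, dilation), ordered input to output.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical 3-layer encoder: 7x7 convs, stride 1, dilations 1, 2, 4.
layers = [(7, 1, 1), (7, 1, 2), (7, 1, 4)]
rf = receptive_field(layers)
reach = (rf - 1) // 2  # pixels visible on each side of the centre pixel

for scale in (8, 16):
    max_disp = 160 / scale  # coarse disparity for a 160 px full-resolution maximum
    print(f"S={scale}: max coarse disparity {max_disp:g}px, "
          f"reach {reach}px, covered={reach >= max_disp}")
```

For comparison, three plain 3×3 convolutions have a receptive field of only 7 pixels (a reach of 3), well short of the 10–20 px coarse disparity, so larger kernels, dilation, or strides would be needed for the encoder to see both sides of a misaligned region.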

3.3 Upsample

  • Upsample coarse_pol back to H×W, producing coarse_pol_up.
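Steps 3.1 to 3.3 can be sketched together in PyTorch. This is one possible reading under stated assumptions: average pooling for the downsample, bilinear interpolation for the upsample, and a three-layer dilated 7×7 encoder with 32 hidden channels and K = 32 are all illustrative choices, since the blueprint fixes only the channel counts and the 2–3 layer depth.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarsePolEncoder(nn.Module):
    """Hypothetical 3-layer encoder: 6 input channels -> K output channels."""

    def __init__(self, k_channels=32, hidden=32):
        super().__init__()
        # Dilated 7x7 convs widen the receptive field; padding preserves H/S x W/S.
        self.net = nn.Sequential(
            nn.Conv2d(6, hidden, 7, padding=3, dilation=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 7, padding=6, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, k_channels, 7, padding=12, dilation=4),
        )

    def forward(self, coarse_left, coarse_right):
        return self.net(torch.cat([coarse_left, coarse_right], dim=1))


def coarse_pol_branch(left, right, encoder, scale=8):
    """Downsample -> CoarsePolEncoder -> upsample back to H x W."""
    h, w = left.shape[-2:]
    coarse_left = F.avg_pool2d(left, kernel_size=scale)    # (B, 3, H/S, W/S)
    coarse_right = F.avg_pool2d(right, kernel_size=scale)  # (B, 3, H/S, W/S)
    coarse_pol = encoder(coarse_left, coarse_right)        # (B, K, H/S, W/S)
    return F.interpolate(coarse_pol, size=(h, w), mode="bilinear",
                         align_corners=False)              # (B, K, H, W)


left = torch.rand(1, 3, 480, 640)
right = torch.rand(1, 3, 480, 640)
coarse_pol_up = coarse_pol_branch(left, right, CoarsePolEncoder(k_channels=32))
print(coarse_pol_up.shape)
```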

3.4 fnet (Feature Encoder)

  • Runs once for each of left and right, producing fmaps.
  • fnet is not modified.

3.5 cnet (pol-aware Context Encoder)

  • Inputs: left and coarse_pol_up.
  • Outputs: pol-aware context and hidden.
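One way to feed coarse_pol_up into cnet is to widen its first convolution from 3 to 3+K input channels. The sketch below assumes a plain `nn.Conv2d` first layer with `groups=1`; the helper `widen_first_conv` and the zero initialization of the new channels (so that a pretrained cnet initially behaves as before) are assumptions added here, not part of the blueprint.

```python
import torch
import torch.nn as nn


def widen_first_conv(conv, extra_channels):
    """Return a copy of `conv` that accepts `extra_channels` more input channels.

    Existing weights are copied; weights for the new channels start at zero,
    so the widened layer initially reproduces the original layer's output.
    Assumes groups=1.
    """
    widened = nn.Conv2d(conv.in_channels + extra_channels, conv.out_channels,
                        kernel_size=conv.kernel_size, stride=conv.stride,
                        padding=conv.padding, dilation=conv.dilation,
                        bias=conv.bias is not None)
    with torch.no_grad():
        widened.weight.zero_()
        widened.weight[:, :conv.in_channels].copy_(conv.weight)
        if conv.bias is not None:
            widened.bias.copy_(conv.bias)
    return widened


K = 32
# Stand-in for cnet's first layer; the real layer's hyperparameters may differ.
rgb_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
pol_conv = widen_first_conv(rgb_conv, K)

left = torch.rand(1, 3, 96, 128)
coarse_pol_up = torch.rand(1, K, 96, 128)
out = pol_conv(torch.cat([left, coarse_pol_up], dim=1))  # cnet input: 3 + K channels
print(out.shape)
```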

3.6 CorrBlock + GRU

  • CorrBlock(fmaps) → corr_pyramid.
  • GRU × N with context → disp, a single pass.
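The modules in 3.1 to 3.6 compose into a single forward pass. The sketch below only fixes the order of calls; every callable argument is a caller-supplied stand-in with a hypothetical signature, and `gru_update` is assumed to wrap both the correlation lookup and the GRU step.

```python
def coarse_scale_pol_forward(left, right, *, downsample, coarse_pol_encoder,
                             upsample, fnet, cnet, corr_block, gru_update,
                             init_disp, n_iters):
    """Single-pass data flow; all callables are stand-ins supplied by the caller."""
    # Coarse polarization branch: no warping between left and right.
    coarse_pol = coarse_pol_encoder(downsample(left), downsample(right))
    coarse_pol_up = upsample(coarse_pol)

    # fnet is unmodified and never sees polarization features.
    fmap_left, fmap_right = fnet(left), fnet(right)

    # cnet sees the left image plus the upsampled coarse pol features.
    context, hidden = cnet(left, coarse_pol_up)

    corr_pyramid = corr_block(fmap_left, fmap_right)
    disp = init_disp
    for _ in range(n_iters):  # GRU x N, with the same pol-aware context each step
        hidden, disp = gru_update(hidden, context, corr_pyramid, disp)
    return disp
```

Written this way, the single-pass property is visible in the structure: fnet runs twice (once per view), cnet runs once, and nothing computed inside the GRU loop feeds back into the polarization branch.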

4. Tensor Dimensions

Tensor                     | Dimensions       | Description
left / right               | (B, 3, H, W)     | Input polarization image pair, H×W = 480×640
coarse_left / coarse_right | (B, 3, H/S, W/S) | Downsampled, S = 8 or 16
CoarsePolEncoder input     | (B, 6, H/S, W/S) | coarse_left + coarse_right concatenated
coarse_pol                 | (B, K, H/S, W/S) | Coarse pol features, K channels
coarse_pol_up              | (B, K, H, W)     | Upsampled back to original resolution
fmaps                      | (B, C, ·, ·)     | fnet output feature maps
context / hidden           | (B, ·, ·, ·)     | pol-aware context (cnet input includes pol)
disp                       | (B, 1, H, W)     | Single-pass disparity output

Disparity magnitudes at different downsampling scales (relative to 640×480)

Scale | Resolution | Typical disparity range
1/4   | 160×120    | 0~40px
1/8   | 80×60      | 0~20px
1/16  | 40×30      | 0~10px (25% of image width)

5. Design Decisions and Rationale

Item                  | Description
Downsample scale      | 1/8 or 1/16
CoarsePolEncoder      | Lightweight CNN (2–3 layers), input 6ch (left+right coarse), output Kch
Injection method      | concat into cnet’s first layer (cnet input goes from 3ch to 3+K ch)
Alternative injection | inject at a middle layer of cnet (concat or FiLM-style conditioning)
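For the alternative injection, FiLM-style conditioning modulates a middle-layer feature map with a scale and shift predicted from the pol features. The sketch below is a spatially varying variant (per-pixel scale and shift from a 1×1 conv); the module name `PolFiLM`, the channel counts, and the zero initialization are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolFiLM(nn.Module):
    """FiLM-style conditioning: feat * (1 + gamma) + beta, predicted from coarse_pol."""

    def __init__(self, pol_channels, feat_channels):
        super().__init__()
        self.to_gamma_beta = nn.Conv2d(pol_channels, 2 * feat_channels, kernel_size=1)
        # Zero init makes the module an identity on `feat` at the start of training.
        nn.init.zeros_(self.to_gamma_beta.weight)
        nn.init.zeros_(self.to_gamma_beta.bias)

    def forward(self, feat, coarse_pol):
        # Resize the pol features to the middle layer's spatial size.
        pol = F.interpolate(coarse_pol, size=feat.shape[-2:], mode="bilinear",
                            align_corners=False)
        gamma, beta = self.to_gamma_beta(pol).chunk(2, dim=1)
        return feat * (1 + gamma) + beta


film = PolFiLM(pol_channels=32, feat_channels=64)
feat = torch.rand(1, 64, 120, 160)      # hypothetical middle-layer cnet features
coarse_pol = torch.rand(1, 32, 60, 80)  # coarse pol features at 1/8 resolution
out = film(feat, coarse_pol)
print(out.shape)
```

With this form of injection the pol features only need to be resized to the middle layer's resolution rather than all the way back to H×W.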

Core design idea: rather than pursuing pixel-level polarization alignment accuracy, accept the fact that misalignment persists even at low resolution, and use a region-level “left/right statistical difference” as the polarization cue, relying on CoarsePolEncoder’s large receptive field to absorb the alignment error.

Computational cost: ~1.15× baseline RAFT-Stereo, i.e. roughly 15% overhead.
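The encoder's own share of that cost can be sanity-checked with simple arithmetic. The sketch below counts parameters and multiply-accumulates for a hypothetical CoarsePolEncoder (three 7×7 convs, 6 → 32 → 32 → K = 32) at 1/8 of a 480×640 input; the layer configuration is an assumption, and the count ignores the upsample and the widened cnet layer.

```python
def conv2d_params(in_ch, out_ch, kernel, bias=True):
    """Parameter count of a square-kernel 2D convolution."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


def conv2d_macs(in_ch, out_ch, kernel, out_h, out_w):
    """Multiply-accumulates for one forward pass (stride 1, 'same' padding)."""
    return in_ch * out_ch * kernel * kernel * out_h * out_w


# Hypothetical encoder configuration: (in_channels, out_channels, kernel_size).
layers = [(6, 32, 7), (32, 32, 7), (32, 32, 7)]
coarse_h, coarse_w = 480 // 8, 640 // 8  # S = 8

total_params = sum(conv2d_params(*layer) for layer in layers)
total_macs = sum(conv2d_macs(*layer, coarse_h, coarse_w) for layer in layers)

print(f"{total_params:,} parameters (~{total_params / 1e6:.2f}M)")
print(f"{total_macs / 1e9:.2f} GMACs at {coarse_w}x{coarse_h}")
```

Under these assumptions the encoder comes to roughly 0.11M parameters. Because the cost scales with the number of coarse pixels, moving from S = 8 to S = 16 cuts the encoder's MACs by a factor of four.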


6. Polarization Injection Points

There is only one polarization injection point, located at the Context Encoder:

Injection point      | Location              | Form                           | Description
coarse_pol_up → cnet | Context Encoder input | concat (first or middle layer) | cnet’s first-layer input goes from 3ch to 3+K ch; or FiLM-style conditioning in a middle layer

After injection, the resulting pol-aware context affects the entire iteration loop via GRU × N with context.

Relation to design principles

This architecture follows three polarization-injection principles:

  1. Polarization does not enter fnet: satisfied (fnet untouched).
  2. Polarization is not limited to spatial gating: satisfied (the context feeds into the GRU loop, and the GRU operates in disparity-aware space).
  3. Polarization only acts in disparity-aware space: satisfied (the context is injected into the GRU at every iteration, where the GRU is coupled with correlation lookup).

7. Highlights

  • Single pass, no explicit warp: no second refinement pass or explicit alignment is needed; the polarization cue is extracted at low resolution and injected directly into the context, keeping the implementation complexity extremely low.
  • Honest about alignment error: does not pretend that downsampling eliminates misalignment; instead it redefines the problem as a region-level “left/right statistical difference” and actively absorbs alignment error with CoarsePolEncoder’s large receptive field.
  • Low extra cost: polarization computation happens at 1/8–1/16 resolution; CoarsePolEncoder has only about 0.1M parameters, for an overall computational cost of ~1.15× baseline.
  • No inter-pass dependency: polarization quality does not depend on the output quality of any previous pass; the architectural stability risk is concentrated on a single axis — “is the information detailed enough?”.
