Blueprint · 2026

Dual Volume Architecture

This document is the standalone specification of the "Dual Volume" stereo matching architecture. It describes the architecture itself (design goals, data flow, components, tensor dimensions, design decisions, and polarization injection points).

  • stereo matching
  • polarization
  • RAFT-Stereo

Using these blueprints

Everything here is an architecture proposal I designed and chose to publish openly. Free to use, adapt, or build on — no permission needed.

If one turns out useful and crediting is convenient, a link back to this site is appreciated. It's never required.

1. Design Goals

Core Problem

RAFT-Stereo assumes photometric consistency: corresponding pixels in the left and right images have similar values. In an active polarization stereo system, however, this assumption does not hold over glass regions. The left camera (I∥) captures strong specular reflections while the right camera (I⊥) suppresses them, so on glass regions I∥(left) >> I⊥(right).

The resulting problems:

  • The Cost Volume produces garbage signals over glass regions.
  • A correct match yields a low correlation (that is, a high matching cost), and the system misinterprets it as a “matching error”.
  • When the GRU iterates on garbage signals, the result degrades with every iteration.

Design Goal: Dual Volume Complementarity

This architecture introduces a second volume (the Pol Volume) that is complementary to the Cost Volume. Their reliability over glass vs. non-glass regions is exactly opposite:

| Volume | Non-glass region | Glass region |
| --- | --- | --- |
| Cost Volume | Strong, reliable signal | Garbage, unreliable |
| Pol Volume | ≈ 0, no signal | Strong signal (I∥ >> I⊥) |

The two volumes are defined as:

Cost Volume: corr[x,d] = dot(fmap_left[x], fmap_right[x-d])
Pol Volume:  pol[x,d]  = left[x] - right[x-d]

Neither volume depends on the ground truth, so both are available at inference time.
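
A minimal NumPy sketch of the two definitions above. The text does not state these details, so they are assumptions: the feature maps and images share the same (H, W) resolution, left and right are single-channel intensity images, and positions where x - d falls outside the image are left at zero. The names build_volumes and max_disp are illustrative, not part of the specification.

```python
import numpy as np

def build_volumes(fmap_left, fmap_right, left, right, max_disp):
    """Build the Cost Volume and Pol Volume for one image pair.

    fmap_left, fmap_right: (C, H, W) feature maps.
    left, right:           (H, W) intensity images (I_par, I_perp).
    Returns corr and pol, both shaped (H, W, max_disp).
    """
    C, H, W = fmap_left.shape
    corr = np.zeros((H, W, max_disp), dtype=np.float32)
    pol = np.zeros((H, W, max_disp), dtype=np.float32)
    for d in range(max_disp):
        # Pixels with x - d < 0 have no counterpart; they stay zero (assumption).
        # corr[x, d] = dot(fmap_left[x], fmap_right[x - d])
        corr[:, d:, d] = np.sum(fmap_left[:, :, d:] * fmap_right[:, :, :W - d], axis=0)
        # pol[x, d] = left[x] - right[x - d]
        pol[:, d:, d] = left[:, d:] - right[:, :W - d]
    return corr, pol
```

Both volumes are computed from the inputs alone, which is what makes them usable at inference time.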


2. Architecture (with Data Flow)

[Figure: Dual Volume architecture data flow]

Four Phases of the Data Flow

| Phase | Name | Input | Output |
| --- | --- | --- | --- |
| Phase 1 | Feature Extraction | left, right | fmap_left, fmap_right |
| Phase 2 | Build volumes (GT-free) | fmap_left/right, left, right | Cost Volume, Pol Volume |
| Phase 3 | Context | left, pol_input | context = concat(rgb_ctx, pol_ctx), hidden |
| Phase 4 | GRU iteration | Cost/Pol Volume, context, hidden | delta_disp (iteratively updates disp) |

3. Components and Modules

3.1 fnet (Feature Network)

  • fmap_left, fmap_right = fnet(left), fnet(right).

3.2 Cost Volume

  • corr[x,d] = dot(fmap_left[x], fmap_right[x-d]), a dot-product correlation.

3.3 Pol Volume

  • pol[x,d] = left[x] - right[x-d], the raw polarization difference.
  • GT-free, available at inference time.

3.4 rgb_cnet (RGB Context Encoder)

  • rgb_ctx, hidden = rgb_cnet(left).
  • Produces both the context and the GRU initial hidden state.

3.5 pol_cnet (Pol Context Encoder)

  • Input: pol_input, 2 channels.
  • Output: pol_ctx.
  • The input differs by training stage (see Section 4):
    • Stage 1 (Pretrain): pol_input = [GT_mask, zeros].
    • Stage 2 (Finetune): pol_input = pol_stats(pol_vol) = [pol_max, pol_var].

3.6 Context Fusion

  • context = concat(rgb_ctx, pol_ctx), simple concatenation.

3.7 GRU Update Unit

Each iteration:

  • corr_feat = lookup(Cost_Volume, disp).
  • pol_feat = lookup(Pol_Volume, disp).
  • motion = encoder(concat(corr_feat, pol_feat, disp)) — concat-based fusion.
  • hidden = gru(hidden, concat(motion, context)).
  • delta_disp = disp_head(hidden).
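
The lookup step and the concat that feeds the motion encoder can be sketched as follows. The specification does not say how lookup samples the volumes, so this sketch assumes a RAFT-style window of 2 * radius + 1 samples around the current disparity, linearly interpolated along the disparity axis and clamped at the volume borders. The learned modules (encoder, gru, disp_head) are omitted, and motion_input and radius are illustrative names.

```python
import numpy as np

def lookup(volume, disp, radius=2):
    """Sample volume[y, x, disp[y, x] + k] for k in [-radius, radius].

    volume: (H, W, D); disp: (H, W) float disparity.
    Returns (2 * radius + 1, H, W), linearly interpolated along d.
    """
    H, W, D = volume.shape
    ys, xs = np.mgrid[0:H, 0:W]
    feats = []
    for k in range(-radius, radius + 1):
        d = np.clip(disp + k, 0, D - 1)   # clamp to the valid range (assumption)
        d0 = np.floor(d).astype(int)      # lower integer neighbour
        d1 = np.minimum(d0 + 1, D - 1)    # upper integer neighbour
        w = d - d0                        # weight of the upper neighbour
        feats.append((1 - w) * volume[ys, xs, d0] + w * volume[ys, xs, d1])
    return np.stack(feats, axis=0)

def motion_input(cost_volume, pol_volume, disp, radius=2):
    """Assemble concat(corr_feat, pol_feat, disp) for the motion encoder."""
    corr_feat = lookup(cost_volume, disp, radius)
    pol_feat = lookup(pol_volume, disp, radius)
    return np.concatenate([corr_feat, pol_feat, disp[None]], axis=0)
```

With radius r, the motion encoder would see 2 * (2r + 1) + 1 channels per pixel under these assumptions.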

4. Two-Stage Training Strategy

Stage 1 (Pretrain): Cheating with the GT Mask

  • pol_input = [GT_mask, zeros] (2 channels, second channel filled with zeros).
  • Goal: let the Pol CNet learn what a “correct glass context” looks like.

Stage 2 (Finetune): Pol Volume Statistics

  • pol_input = [pol_max, pol_var] (2 channels).
  • pol_max: maximum of the Pol Volume along the disparity dimension.
  • pol_var: variance of the Pol Volume along the disparity dimension.
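
A short NumPy sketch of the two statistics, assuming a single-channel Pol Volume laid out as (H, W, D) and the population variance (ddof = 0). The specification does not fix either choice.

```python
import numpy as np

def pol_stats(pol_vol):
    """Reduce a Pol Volume (H, W, D) to the 2-channel pol_input (2, H, W)."""
    pol_max = pol_vol.max(axis=-1)   # peak polarization difference per pixel
    pol_var = pol_vol.var(axis=-1)   # spread of the difference across disparities
    return np.stack([pol_max, pol_var], axis=0)
```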

Transition Strategy

  • Lower the Pol CNet learning rate during Stage 2 (×0.1).
  • Or use a warmup blending period.
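
Both options can be sketched briefly. The blending rule here is an assumption, since the text names a warmup blending period without defining it: a linear fade from the Stage 1 input [GT_mask, zeros] to the Stage 2 input [pol_max, pol_var]. The parameter groups use the list-of-dicts form that torch.optim optimizers accept for per-parameter options. All function and argument names are illustrative.

```python
import numpy as np

def blend_pol_input(gt_mask, stats, step, warmup_steps):
    """Linearly fade from [GT_mask, zeros] to [pol_max, pol_var].

    gt_mask: (H, W); stats: (2, H, W); step counts from the start of Stage 2.
    """
    oracle = np.stack([gt_mask, np.zeros_like(gt_mask)], axis=0)
    alpha = min(1.0, step / warmup_steps)   # 0 at the start, 1 after warmup
    return (1.0 - alpha) * oracle + alpha * stats

def stage2_param_groups(base_lr, pol_cnet_params, other_params, pol_lr_scale=0.1):
    """Per-module learning rates: the Pol CNet trains at base_lr * 0.1."""
    return [
        {"params": other_params, "lr": base_lr},
        {"params": pol_cnet_params, "lr": base_lr * pol_lr_scale},
    ]
```

Either mechanism serves the same purpose: the Pol CNet keeps what it learned from the oracle mask while it adapts to the noisier statistics.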

5. Tensor Dimensions

| Tensor | Dimensions / Setting | Description |
| --- | --- | --- |
| left / right | (B, 3, H, W) | Input polarization image pair |
| fmap_left / fmap_right | (B, C, ·, ·) | Output feature maps of fnet |
| Cost Volume corr[x,d] | Indexed by disparity | Dot product |
| Pol Volume pol[x,d] | Indexed by disparity | left[x] - right[x-d] |
| pol_input | (B, 2, ·, ·) | 2 channels (Pretrain: [GT_mask, zeros]; Finetune: [pol_max, pol_var]) |
| rgb_ctx / pol_ctx | (B, ·, ·, ·) | The two context branches |
| context | (B, ·, ·, ·) | concat(rgb_ctx, pol_ctx) |
| hidden | (B, ·, ·, ·) | GRU initial hidden state (from rgb_cnet) |
| corr_feat / pol_feat | (B, ·, ·, ·) | Lookup results of the two volumes |
| disp / delta_disp | (B, 1, ·, ·) | Disparity and its iterative residual |

6. Design Decisions and Rationale

| Decision | Choice | Rationale |
| --- | --- | --- |
| Cost + Pol fusion | Concat | Simple and stable; let the encoder learn the weighting |
| Pol CNet input channels | 2 channels | Zero-padded in Pretrain, max+var in Finetune |
| Pol Volume statistics | max + var | max = peak polarization difference, var = signal stability |
| Transition strategy | Lower LR | Avoid overwriting what was learned in Stage 1 |

Design Principles

  1. No GT at inference: training may cheat, but inference must stand alone.
  2. No extra inputs: only left (I∥) and right (I⊥) are used.
  3. No degradation on non-glass regions: the new design must only add value.
  4. The Pol Volume is the savior: it is the only reliable signal over glass.

Monitoring

Non-glass degradation monitoring:

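metrics = {
    'epe_total': ...,      # EPE over all pixels
    'epe_glass': ...,      # EPE over glass pixels
    'epe_non_glass': ...,  # key metric to monitor
}

# baseline: non-glass EPE of the reference model without the polarization
# path (assumed meaning). The factor 1.05 tolerates a 5% regression.
if metrics['epe_non_glass'] > baseline * 1.05:
    print("WARNING: Non-glass degradation!")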
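
One way the three metrics and the check above might be computed, as a sketch. It assumes EPE is the mean absolute disparity error, that a per-pixel glass mask is available at evaluation time, and that an optional validity mask marks pixels with ground truth. The names region_epe and non_glass_degraded are illustrative.

```python
import numpy as np

def region_epe(disp_pred, disp_gt, glass_mask, valid_mask=None):
    """Mean absolute disparity error over all, glass, and non-glass pixels."""
    err = np.abs(disp_pred - disp_gt)
    glass = glass_mask.astype(bool)
    valid = np.ones_like(glass) if valid_mask is None else valid_mask.astype(bool)

    def mean_over(mask):
        # An empty region has no defined EPE; report NaN instead of failing.
        return float(err[mask].mean()) if mask.any() else float("nan")

    return {
        "epe_total": mean_over(valid),
        "epe_glass": mean_over(valid & glass),
        "epe_non_glass": mean_over(valid & ~glass),
    }

def non_glass_degraded(metrics, baseline_epe_non_glass, tolerance=1.05):
    """True when non-glass EPE exceeds the baseline by more than the tolerance."""
    return metrics["epe_non_glass"] > baseline_epe_non_glass * tolerance
```

Returning a boolean rather than printing lets the same check act as a gate in a training loop or a test suite.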

7. Polarization Injection Points

| Injection Point | Phase | Form | Description |
| --- | --- | --- | --- |
| Pol Volume → motion encoder | Phase 4 (GRU iteration) | motion = encoder(concat(corr_feat, pol_feat, disp)) | Concat-fused with the Cost Volume every iteration |
| pol_input → pol_cnet → context | Phase 3 (Context) | context = concat(rgb_ctx, pol_ctx) | Polarization enters the context via the Pol CNet |

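Polarization has two entry points: (1) the Pol Volume goes directly into the GRU loop to complement the Cost Volume; (2) pol_input enters the context via the Pol CNet. The Pol Volume is always GT-free. pol_input uses the GT mask only during Stage 1 and is GT-free from Stage 2 onward, so inference never needs ground truth.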


8. Highlights

  • Complementary Dual Volume design: the Pol Volume fills the blind spot where the Cost Volume fails on glass. Each volume is reliable where the other is not, so together they cover both glass and non-glass regions.
  • Fully self-sufficient at inference: both volumes are GT-free, so training can cheat while inference still runs independently from only left (I∥) and right (I⊥) images.
  • Minimalist concat fusion: no complex cross-attention or gated fusion — the two volumes are simply concatenated and the motion encoder learns the trade-off, keeping the design stable and debug-friendly.
  • Two-stage training bridges oracle and real signals: Stage 1 pretrains the Pol CNet with the GT mask to establish the “correct glass context” representation; Stage 2 switches to Pol Volume statistics with a lowered learning rate for a smooth transition.
  • Built-in non-glass degradation monitoring: region-wise EPE acts as a gate that flags any run where the new polarization path degrades matching on non-glass regions by more than 5% relative to the baseline.
