Blueprint · 2026

Glass-Aware Context Encoder Pretraining Architecture

An architecture that adds a polarization-contrast side branch and gated fusion to the Context Encoder, and pretrains a glass-aware cnet under glass-segmentation supervision — without touching the stereo matching pipeline at all.

  • stereo matching
  • polarization
  • RAFT-Stereo

Using these blueprints

Everything here is an architecture proposal I designed and chose to publish openly. Free to use, adapt, or build on — no permission needed.

If one turns out useful and crediting is convenient, a link back to this site is appreciated. It's never required.

1. Design Goals

Background

Synthetic rendering often produces glass with unrealistically rich texture, which lets baseline RAFT-Stereo match glass regions without any polarization input. Such synthetic data is therefore not a fair testbed for comparing polarization-aware and RGB-only models.

Core problem

How can we exploit the physically correct polarization signal in synthetic data without baking its structural errors (texture / correlation) into the model?

Solution idea

  • The polarization contrast (I∥ vs I⊥) in synthetic data is physically correct — Fresnel equations are not affected by rendering artifacts.
  • But the textural structure of synthetic data is misleading for stereo matching: real glass is nearly textureless (texture ≈ 0), whereas rendered glass is not.
  • Therefore: learn only the polarization semantics from synthetic data (where the glass is), not the stereo matching (how to match glass).

Entry point: Context Encoder (cnet)

  • cnet is the only module in RAFT-Stereo that looks only at the left image — it produces the context + hidden state for the GRU.
  • cnet learns “where to pay attention” (attention guidance), not “how to match” (correlation / GRU).
  • Training glass segmentation on cnet does not touch the stereo matching pipeline — it is fully isolated.

Core idea: extend RGB-only Context Encoder pretraining with a polarization-contrast side branch, and pretrain cnet into a glass-aware module that never comes into contact with the disparity loss.


2. Architecture (with Data Flow)

[Figure: Glass-Aware Context Encoder structure and data flow]

Data flow overview

| Stage | Operation | Output dimensions |
|---|---|---|
| RGB main stem | conv1(3→64,7,s2) + BN + ReLU | feat_rgb (B,64,H/2,W/2) |
| Pol side-branch stem | pol_conv(1→64,7,s2) + BN + ReLU | feat_pol (B,64,H/2,W/2) |
| Gated fusion | cat → gate_conv(128→64,1×1) → sigmoid, then feat_rgb + gate × feat_pol | feat (B,64,H/2,W/2) |
| Backbone | layer1 (64→64) → layer2 (64→96, s2) → layer3 (96→128) | (B,128,H/4,W/4) |
| Dual-head output | head_union / head_strict | Independent segmentation predictions |

3. Components and Modules

3.1 RGB main stem (conv1)

  • conv1: 3→64, 7×7, stride 2.
  • BN(64) + ReLU.
  • Stays 3-channel input so that SceneFlow pretrained weights load directly.

3.2 Pol Side-Branch (pol_conv)

  • pol_conv: 1→64, 7×7, stride 2.
  • BN(64) + ReLU (i.e. norm_pol).
  • Input is the single-channel pol_contrast.
  • Designed as a side branch; does not modify the original backbone structure.
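The claim that the side branch leaves the RGB weights loadable can be illustrated with a small PyTorch sketch. The class names `RgbStem` and `RgbPolStem` are hypothetical stand-ins, not the actual RAFT-Stereo classes, and `padding=3` is an assumption. The point is only that adding new submodules keeps the existing keys intact, so a non-strict load fills the RGB stem and reports the pol modules as missing.

```python
import torch
import torch.nn as nn


class RgbStem(nn.Module):
    """Hypothetical stand-in for the original RGB-only stem."""

    def __init__(self):
        super().__init__()
        # 3 -> 64, 7x7, stride 2 (padding=3 is an assumption).
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.norm1 = nn.BatchNorm2d(64)


class RgbPolStem(RgbStem):
    """Same stem plus the pol side branch; RGB parameter names are unchanged."""

    def __init__(self):
        super().__init__()
        self.pol_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3)
        self.norm_pol = nn.BatchNorm2d(64)


pretrained = RgbStem()   # pretend these hold the SceneFlow-pretrained weights
extended = RgbPolStem()

# strict=False tolerates keys that exist only in the extended model.
result = extended.load_state_dict(pretrained.state_dict(), strict=False)
print("missing:", result.missing_keys)        # only pol_conv.* / norm_pol.* entries
print("unexpected:", result.unexpected_keys)  # empty list
```

Had conv1 been widened to 4 input channels instead, the pretrained conv1 weight would no longer match in shape and would need manual surgery.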

3.3 Gated Fusion

  • Concatenate feat_rgb and feat_pol into (B,128,H/2).
  • gate_conv: 128→64, 1×1 conv, followed by sigmoid, producing gate (B,64,H/2).
  • Fusion formula: feat = feat_rgb + gate × feat_pol.
  • The learned sigmoid gate lets the model decide when to inject the pol signal.
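The two stems and the gated fusion described above can be sketched as a single PyTorch module. This is a minimal sketch, not the reference implementation: the class name `GatedPolStem`, the attribute name `norm1`, and `padding=3` are assumptions; the layer sizes and the fusion formula come from the text.

```python
import torch
import torch.nn as nn


class GatedPolStem(nn.Module):
    """Sketch of the RGB stem, pol side branch, and gated fusion."""

    def __init__(self):
        super().__init__()
        # RGB main stem: 3 -> 64, 7x7, stride 2 (padding=3 is an assumption).
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.norm1 = nn.BatchNorm2d(64)
        # Pol side branch: 1 -> 64, 7x7, stride 2.
        self.pol_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3)
        self.norm_pol = nn.BatchNorm2d(64)
        # Gate: 1x1 conv over the 128 concatenated channels, then sigmoid.
        self.gate_conv = nn.Conv2d(128, 64, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x_rgb, pol_contrast):
        feat_rgb = self.relu(self.norm1(self.conv1(x_rgb)))               # (B,64,H/2,W/2)
        feat_pol = self.relu(self.norm_pol(self.pol_conv(pol_contrast)))  # (B,64,H/2,W/2)
        gate = torch.sigmoid(
            self.gate_conv(torch.cat([feat_rgb, feat_pol], dim=1))
        )                                                                 # (B,64,H/2,W/2)
        # Residual-style injection: the RGB path is always kept, and the
        # gate scales how much of the pol feature is added on top.
        return feat_rgb + gate * feat_pol


stem = GatedPolStem()
feat = stem(torch.randn(2, 3, 64, 96), torch.rand(2, 1, 64, 96))
print(feat.shape)
```

Because the gate multiplies only feat_pol, a gate near zero reduces the stem to the plain RGB path, which is what makes the side branch non-destructive.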

3.4 Backbone

  • layer1: 64→64.
  • layer2: 64→96, stride 2.
  • layer3: 96→128.
  • Output (B,128,H/4).

3.5 Dual heads: head_union / head_strict

  • head_union: supervised against GT mask (union version).
  • head_strict: supervised against GT mask_strict (strict version).
  • Both head losses are BCE + Dice and do not touch the disparity loss at all.
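A BCE + Dice loss over the two heads might look like the sketch below. The text does not specify the weighting between BCE and Dice, the weighting between the heads, the Dice smoothing term, or how the heads reach mask resolution, so equal weights, `smooth=1.0`, and logits already at mask resolution are all assumptions here. The function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def bce_dice_loss(logits, target, smooth=1.0):
    """BCE-with-logits plus soft Dice; both terms weighted equally (assumption)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    dims = (1, 2, 3)  # reduce over channel and spatial dims, keep the batch dim
    inter = (prob * target).sum(dims)
    denom = prob.sum(dims) + target.sum(dims)
    dice = 1.0 - (2.0 * inter + smooth) / (denom + smooth)
    return bce + dice.mean()


def dual_head_loss(logits_union, logits_strict, mask, mask_strict):
    """Sum of the two head losses; no disparity term appears anywhere."""
    return bce_dice_loss(logits_union, mask) + bce_dice_loss(logits_strict, mask_strict)


# Toy check: a confident correct prediction scores far lower than a confident wrong one.
mask = torch.zeros(2, 1, 8, 8)
mask[..., :4] = 1.0
good = bce_dice_loss(20.0 * (2.0 * mask - 1.0), mask)
bad = bce_dice_loss(-20.0 * (2.0 * mask - 1.0), mask)
print(good.item(), bad.item())
```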

4. Pol Contrast Computation

Synthetic-data stage: warp the right image (I⊥) to the left viewpoint using the GT disparity, then compute:

pol_contrast = |I∥_gray - warp(I⊥_gray, disp_GT)| / (I∥_gray + warp(I⊥_gray, disp_GT) + ε)

Real-world stage: switch to a two-pass design (Pass 1 coarse disparity → warp → pol contrast → Pass 2 refinement).

Notes:

  1. padding_mode='zeros' — for out-of-bound warp, warped_right = 0 → pol_contrast ≈ 1.0 (looks like high polarization contrast). This is acceptable in synthetic pretraining (GT glass masks do not cover large occlusion regions).
  2. No clamping — preserves physical intuition; BN layers handle extreme values.
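The synthetic-stage computation can be sketched with `torch.nn.functional.grid_sample`. This is a sketch under stated assumptions: rectified stereo with left-view disparity (a left pixel at x maps to x - disp in the right image), bilinear sampling, `align_corners=True`, and `eps=1e-6`. The function names are hypothetical; `padding_mode='zeros'` and the absence of clamping follow the notes above.

```python
import torch
import torch.nn.functional as F


def warp_right_to_left(right, disp):
    """Sample the right image at (x - disp, y) so it aligns with the left view.

    right: (B,1,H,W) grayscale image, disp: (B,1,H,W) left-view disparity in pixels.
    """
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=right.dtype, device=right.device),
        torch.arange(w, dtype=right.dtype, device=right.device),
        indexing="ij",
    )
    x_src = xs.unsqueeze(0) - disp[:, 0]       # (B,H,W)
    y_src = ys.unsqueeze(0).expand(b, -1, -1)  # (B,H,W)
    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y).
    grid = torch.stack(
        [2.0 * x_src / max(w - 1, 1) - 1.0, 2.0 * y_src / max(h - 1, 1) - 1.0],
        dim=-1,
    )
    # Out-of-bound samples become 0, as described in note 1.
    return F.grid_sample(right, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)


def pol_contrast(left_par_gray, right_perp_gray, disp_gt, eps=1e-6):
    """Normalized difference |a - b| / (a + b + eps), with no clamping (note 2)."""
    warped = warp_right_to_left(right_perp_gray, disp_gt)
    return (left_par_gray - warped).abs() / (left_par_gray + warped + eps)


# Toy check: identical images with a constant disparity of 3 px.
left = torch.ones(1, 1, 4, 8)
right = torch.ones(1, 1, 4, 8)
contrast = pol_contrast(left, right, torch.full((1, 1, 4, 8), 3.0))
print(contrast[0, 0, 0])
```

In the toy check, the three leftmost columns sample outside the right image and come out near 1.0, which is the out-of-bound behavior note 1 describes; the remaining columns are near 0. For non-negative intensities the ratio already lies in [0, 1], so skipping the clamp loses nothing.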

5. Tensor Dimensions

| Tensor | Dimensions | Description |
|---|---|---|
| x_rgb | (B,3,H,W) | RGB input image |
| pol_contrast | (B,1,H,W) | Polarization contrast (single channel) |
| feat_rgb | (B,64,H/2,W/2) | conv1 output |
| feat_pol | (B,64,H/2,W/2) | pol_conv output |
| cat(feat_rgb, feat_pol) | (B,128,H/2,W/2) | After concatenation |
| gate | (B,64,H/2,W/2) | gate_conv + sigmoid output |
| feat (after fusion) | (B,64,H/2,W/2) | feat_rgb + gate × feat_pol |
| layer3 output | (B,128,H/4,W/4) | Backbone tail features |
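The dimensions above can be checked with the standard convolution output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The sketch below uses plain Python and assumes `padding=3` for the 7×7 stems and a 3×3, padding-1 convolution for the stride-2 step in layer2; neither padding value is stated in the text, and `trace_shapes` is a hypothetical helper.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(b, h, w):
    """Follow the tensor shapes from the inputs to the layer3 output."""
    h2, w2 = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)    # both stems, stride 2
    h4, w4 = conv_out(h2, 3, 2, 1), conv_out(w2, 3, 2, 1)  # layer2, stride 2
    return {
        "x_rgb": (b, 3, h, w),
        "pol_contrast": (b, 1, h, w),
        "feat_rgb": (b, 64, h2, w2),
        "feat_pol": (b, 64, h2, w2),
        "cat": (b, 128, h2, w2),
        "gate": (b, 64, h2, w2),
        "feat": (b, 64, h2, w2),
        "layer3": (b, 128, h4, w4),  # layer1 and layer3 keep the resolution
    }


shapes = trace_shapes(2, 320, 736)
for name, shape in shapes.items():
    print(f"{name:>12}: {shape}")
```

With these paddings each stride-2 step yields ceil(size / 2), so H/2 and H/4 are exact only when H and W are divisible by 4.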

6. Design Decisions and Rationale

| Decision | Choice | Rationale |
|---|---|---|
| conv1 channel count | Keep 3ch | SceneFlow pretrained weights load directly |
| Pol input method | Side branch (1ch→64ch) | Does not modify the original backbone structure |
| Fusion mechanism | Gated fusion (learned sigmoid) | Lets the model learn when to inject the pol signal |
| Supervision signal | BCE + Dice on glass mask | Does not touch the disparity loss; fully isolated |
| Pol contrast computation | GT disparity warp + normalized difference | Removes disparity noise; pure polarization signal |

Core principle: polarization is used only to train cnet’s glass-awareness (segmentation); the entire stereo matching pipeline (correlation / GRU / fnet) is left untouched, so the structural texture errors in synthetic data will not be baked into the model.


7. Polarization Injection Points

| Injection point | Location | Form | Description |
|---|---|---|---|
| pol_contrast → pol_conv | Side branch next to the cnet stem | Independent conv stem (1→64) | Does not modify the conv1 of the RGB main branch |
| feat_pol → gated fusion | Between the RGB stem and the backbone | feat = feat_rgb + gate × feat_pol, where gate is a learned sigmoid | The model learns the injection strength on its own |

Polarization only enters the Context Encoder (cnet) and is used solely for segmentation supervision; it does not enter fnet, does not enter correlation, and does not touch disparity loss.


8. Added Parameter Count

| Component | Parameters |
|---|---|
| pol_conv (1→64, 7×7) | 3,200 |
| norm_pol (BN 64) | 128 |
| gate_conv (128→64, 1×1) | 8,256 |
| Total added | 11,584 (<2% of backbone) |
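The counts above can be reproduced with a few lines of arithmetic. The sketch assumes both convolutions carry a bias term and that BN contributes a learnable scale and shift per channel. Those assumptions are what make the numbers match the table. The helper names are hypothetical.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Weights (c_out * c_in * k * k) plus one optional bias per output channel."""
    return c_out * c_in * k * k + (c_out if bias else 0)


def bn_params(channels):
    """Learnable scale and shift per channel; running statistics are not parameters."""
    return 2 * channels


added = {
    "pol_conv": conv_params(1, 64, 7),     # 64*1*49 + 64 = 3200
    "norm_pol": bn_params(64),             # 128
    "gate_conv": conv_params(128, 64, 1),  # 128*64 + 64 = 8256
}
total_added = sum(added.values())
print(added, total_added)  # total_added is 11584
```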

9. Highlights

  • Decoupling semantics from structure: only the physically correct polarization semantics of “where is the glass” is learned from synthetic data; the structural cue of “how to match glass” is deliberately not learned, avoiding baking rendering artifacts into the model.
  • Fully isolated stereo pipeline: polarization enters only cnet and is supervised with BCE+Dice glass segmentation; correlation / GRU / fnet and the disparity loss are not touched at all.
  • Non-destructive side branch: the RGB main branch stays 3-channel so SceneFlow pretrained weights load directly; polarization joins via an independent 1→64 conv stem without modifying the original backbone structure.
  • Learned gated fusion: feat_rgb + gate × feat_pol with a learned sigmoid gate lets the model decide when to inject the polarization signal.
  • Extremely lightweight: about 11.6K added parameters, less than 2% of the backbone — virtually no extra cost.
