Glass-Aware Context Encoder Pretraining Architecture

1. Design Goals

Background

Synthetically rendered glass often carries far more texture than real glass, so baseline RAFT-Stereo can match glass regions without any polarization input. Such synthetic data is therefore not a fair test bed for comparing polarization against non-polarization models.

Core problem

How can we exploit the physically correct polarization signal in synthetic data without baking its structural errors (texture / correlation) into the model?

Solution idea

The polarization contrast (I∥ vs I⊥) in synthetic data is physically correct: it follows from the Fresnel equations, which rendering artifacts do not affect.
The textural structure of synthetic data, however, is misleading for stereo matching, because real glass has almost no texture (≈ 0).
Therefore, learn only the polarization semantics from synthetic data (where the glass is), not the stereo matching (how to match glass).

Entry point: Context Encoder (cnet)

cnet is the one RAFT-Stereo module that sees only the left image; it produces the context features and the initial hidden state for the GRU.
cnet learns “where to pay attention” (attention guidance), not “how to match” (correlation / GRU).
Training glass segmentation on cnet does not touch the stereo matching pipeline — it is fully isolated.

Core idea: on top of an RGB-only Context Encoder pretraining, add a polarization-contrast side branch and pretrain cnet into a glass-aware module, with no contact at all with disparity loss.

2. Architecture (with Data Flow)

Glass-Aware Context Encoder structure and data flow

Data flow overview

Stage	Operation	Output dimensions
RGB main stem	`conv1(3→64,7,s2)` + BN + ReLU	feat_rgb (B,64,H/2,W/2)
Pol side-branch stem	`pol_conv(1→64,7,s2)` + BN + ReLU	feat_pol (B,64,H/2,W/2)
Gated fusion	`cat → gate_conv(128→64,1×1) → sigmoid`, then `feat_rgb + gate × feat_pol`	feat (B,64,H/2,W/2)
Backbone	layer1 (64→64) → layer2 (64→96, s2) → layer3 (96→128)	(B,128,H/4,W/4)
Dual-head output	head_union / head_strict	Independent segmentation predictions

3. Components and Modules

3.1 RGB main stem (conv1)

conv1: 3→64, 7×7, stride 2.
BN(64) + ReLU.
The input stays 3-channel so that SceneFlow pretrained weights load directly.

3.2 Pol Side-Branch (pol_conv)

pol_conv: 1→64, 7×7, stride 2.
BN(64) + ReLU (i.e. norm_pol).
Input is the single-channel pol_contrast.
Designed as a side branch; does not modify the original backbone structure.

3.3 Gated Fusion

Concatenate feat_rgb and feat_pol into (B,128,H/2,W/2).
gate_conv: 128→64, 1×1 conv, followed by sigmoid, producing gate (B,64,H/2,W/2).
Fusion formula: feat = feat_rgb + gate × feat_pol.
The learned sigmoid gate lets the model decide when to inject the pol signal.
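
The stem and fusion described in 3.1 to 3.3 can be sketched in PyTorch as below. This is a minimal illustration, not the project's actual module: the class name `GatedPolStem`, the attribute name `norm1`, and `padding=3` on the 7×7 convolutions are assumptions, while `conv1`, `pol_conv`, `norm_pol` and `gate_conv` follow the names used in this document.

```python
import torch
import torch.nn as nn


class GatedPolStem(nn.Module):
    """RGB stem plus polarization side branch, fused by a learned sigmoid gate."""

    def __init__(self, width: int = 64):
        super().__init__()
        # RGB main stem: 3-channel input, so pretrained conv1 weights still fit.
        self.conv1 = nn.Conv2d(3, width, kernel_size=7, stride=2, padding=3)
        self.norm1 = nn.BatchNorm2d(width)  # hypothetical name for the stem BN
        # Polarization side branch: single-channel pol_contrast input.
        self.pol_conv = nn.Conv2d(1, width, kernel_size=7, stride=2, padding=3)
        self.norm_pol = nn.BatchNorm2d(width)
        # 1x1 gate over the concatenated (2 * width)-channel features.
        self.gate_conv = nn.Conv2d(2 * width, width, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x_rgb, pol_contrast):
        feat_rgb = self.relu(self.norm1(self.conv1(x_rgb)))
        feat_pol = self.relu(self.norm_pol(self.pol_conv(pol_contrast)))
        gate = torch.sigmoid(self.gate_conv(torch.cat([feat_rgb, feat_pol], dim=1)))
        # Residual-style injection: the RGB path always passes through unchanged.
        return feat_rgb + gate * feat_pol
```

Because the fusion is additive, a gate close to 0 reduces the stem to the RGB-only path, which is what allows the side branch to be bolted on without changing the backbone.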

3.4 Backbone

layer1: 64→64.
layer2: 64→96, stride 2.
layer3: 96→128.
Output (B,128,H/4,W/4).

3.5 Dual heads: head_union / head_strict

head_union: supervised against GT mask (union version).
head_strict: supervised against GT mask_strict (strict version).
Both head losses are BCE + Dice and do not touch the disparity loss at all.
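
A minimal sketch of the BCE + Dice supervision for the two heads follows. Several details are assumptions rather than facts from this document: the heads are taken to output logits at the mask resolution, the BCE and Dice terms and the two heads are weighted equally, and the Dice smoothing constant `eps` is set to 1.0. The function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def bce_dice_loss(logits, target, eps=1.0):
    """BCE-with-logits plus soft Dice, with Dice averaged over the batch."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    dims = (1, 2, 3)  # sum over channel and spatial dims, keep the batch dim
    inter = (prob * target).sum(dim=dims)
    denom = prob.sum(dim=dims) + target.sum(dim=dims)
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    return bce + dice.mean()


def dual_head_loss(logits_union, logits_strict, mask_union, mask_strict):
    """Sum of the two head losses; no disparity term appears anywhere."""
    return (bce_dice_loss(logits_union, mask_union)
            + bce_dice_loss(logits_strict, mask_strict))
```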

4. Pol Contrast Computation

Synthetic-data stage: warp right (I⊥) to the left viewpoint using GT disparity, then compute:

pol_contrast = |I∥_gray - warp(I⊥_gray, disp_GT)| / (I∥_gray + warp(I⊥_gray, disp_GT) + ε)

Real-world stage: switch to a two-pass design (Pass 1 coarse disparity → warp → pol contrast → Pass 2 refinement).

Notes:

padding_mode='zeros': for out-of-bound warp, warped_right = 0, so pol_contrast ≈ 1.0 wherever the left intensity is well above ε (this looks like high polarization contrast). This is acceptable in synthetic pretraining (GT glass masks do not cover large occlusion regions).
No clamping: preserves physical intuition; for non-negative intensities the ratio already lies in [0, 1), and the BN layers handle any remaining extreme values.
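
The synthetic-data computation above can be sketched with `torch.nn.functional.grid_sample`. This is an illustrative version under stated assumptions: disparity is positive and defined in the left view (so the right image is sampled at x − disp), inputs are single-channel grayscale tensors with H, W > 1, and `eps=1e-6` and `align_corners=True` are choices made here, not values taken from this document. The function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def warp_right_to_left(right, disp):
    """Sample the right image at x - disp so it aligns with the left view.

    right: (B,1,H,W); disp: (B,1,H,W) left-view disparity in pixels.
    """
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=right.dtype, device=right.device),
        torch.arange(w, dtype=right.dtype, device=right.device),
        indexing="ij",
    )
    x_src = xs.unsqueeze(0) - disp[:, 0]        # (B,H,W)
    y_src = ys.unsqueeze(0).expand(b, -1, -1)   # (B,H,W)
    # grid_sample expects coordinates normalised to [-1, 1], ordered (x, y).
    grid = torch.stack(
        [2.0 * x_src / (w - 1) - 1.0, 2.0 * y_src / (h - 1) - 1.0], dim=-1
    )
    return F.grid_sample(right, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)


def pol_contrast(left_gray, right_gray, disp_gt, eps=1e-6):
    """|I_par - warp(I_perp)| / (I_par + warp(I_perp) + eps), with no clamping."""
    warped = warp_right_to_left(right_gray, disp_gt)
    return (left_gray - warped).abs() / (left_gray + warped + eps)
```

With a constant disparity of 2 pixels, the two leftmost columns sample outside the right image, receive zeros, and come out at a contrast close to 1.0, which is the out-of-bound behaviour described in the notes.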

5. Tensor Dimensions

Tensor	Dimensions	Description
x_rgb	(B,3,H,W)	RGB input image
pol_contrast	(B,1,H,W)	Polarization contrast (single channel)
feat_rgb	(B,64,H/2,W/2)	conv1 output
feat_pol	(B,64,H/2,W/2)	pol_conv output
cat(feat_rgb, feat_pol)	(B,128,H/2,W/2)	After concatenation
gate	(B,64,H/2,W/2)	gate_conv + sigmoid output
feat (after fusion)	(B,64,H/2,W/2)	`feat_rgb + gate × feat_pol`
layer3 output	(B,128,H/4,W/4)	Backbone tail features
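
The dimensions in the table can be checked with the standard convolution output-size formula, floor((n + 2p − k) / s) + 1. The sketch below is plain Python and makes two assumptions not stated in this document: the 7×7 stems use padding 3, and the stride-2 convolution in layer2 is 3×3 with padding 1. The helper names are hypothetical.

```python
def conv_out(n, kernel, stride, padding):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def cnet_shapes(b, h, w):
    """Shapes of the tensors listed above for an input of size (b, 3, h, w)."""
    # Stems: 7x7, stride 2, padding 3 (padding is an assumption).
    h2, w2 = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)
    # layer2 halves the resolution again; 3x3 with padding 1 is assumed.
    h4, w4 = conv_out(h2, 3, 2, 1), conv_out(w2, 3, 2, 1)
    return {
        "feat_rgb": (b, 64, h2, w2),
        "feat_pol": (b, 64, h2, w2),
        "cat": (b, 128, h2, w2),
        "gate": (b, 64, h2, w2),
        "feat": (b, 64, h2, w2),
        "layer3": (b, 128, h4, w4),
    }


shapes = cnet_shapes(2, 256, 320)
```

Under these assumptions an odd input size rounds up (for example H = 5 gives 3 after the stem), so H/2 and H/4 in the table are exact only when H and W are divisible by 4.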

6. Design Decisions and Rationale

Decision	Choice	Rationale
conv1 channel count	Keep 3ch	SceneFlow pretrained weights load directly
Pol input method	Side branch (1ch→64ch)	Does not modify the original backbone structure
Fusion mechanism	Gated fusion (learned sigmoid)	Lets the model learn when to inject the pol signal
Supervision signal	BCE + Dice on glass mask	Does not touch disparity loss — fully isolated
Pol contrast computation	GT disparity warp + normalized difference	GT disparity removes misalignment caused by disparity error, leaving a pure polarization signal

Core principle: polarization is used only to train cnet’s glass-awareness (segmentation); the entire stereo matching pipeline (correlation / GRU / fnet) is left untouched, so the structural texture errors in synthetic data will not be baked into the model.

7. Polarization Injection Points

Injection point	Location	Form	Description
pol_contrast → pol_conv	Side branch next to the cnet stem	Independent conv stem (1→64)	Does not modify the conv1 of the RGB main branch
feat_pol → gated fusion	Between the RGB stem and the backbone	`feat = feat_rgb + gate × feat_pol`, where gate is a learned sigmoid	The model learns the injection strength on its own

Polarization only enters the Context Encoder (cnet) and is used solely for segmentation supervision; it does not enter fnet, does not enter correlation, and does not touch disparity loss.
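
The isolation claim can be illustrated with a toy check: when the only loss is a segmentation loss computed from cnet features, no gradient reaches any other module. The model below is a hypothetical stand-in, not RAFT-Stereo; `ToyStereo`, `seg_head` and the layer sizes are invented for illustration, and only the attribute names `cnet` and `fnet` mirror the modules discussed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyStereo(nn.Module):
    """Stand-in with a context encoder, a segmentation head and a feature encoder."""

    def __init__(self):
        super().__init__()
        self.cnet = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.seg_head = nn.Conv2d(8, 1, kernel_size=1)
        self.fnet = nn.Conv2d(3, 8, kernel_size=3, padding=1)


model = ToyStereo()
left = torch.randn(1, 3, 16, 16)
mask = torch.zeros(1, 1, 16, 16)

# Segmentation-only objective: the graph contains cnet and seg_head, nothing else.
logits = model.seg_head(model.cnet(left))
loss = F.binary_cross_entropy_with_logits(logits, mask)
loss.backward()

fnet_untouched = all(p.grad is None for p in model.fnet.parameters())
cnet_updated = all(p.grad is not None for p in model.cnet.parameters())
```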

8. Added Parameter Count

Component	Parameters
pol_conv (1→64, 7×7)	3,200
norm_pol (BN 64)	128
gate_conv (128→64, 1×1)	8,256
Total added	11,584 (<2% of backbone)
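
The counts in the table can be reproduced with simple arithmetic. The figures match only if both convolutions carry a bias term (3,136 + 64 and 8,192 + 64) and BatchNorm contributes its learnable scale and shift; the helper names below are hypothetical.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Weights (c_out * c_in * k * k) plus one bias per output channel."""
    return c_out * c_in * k * k + (c_out if bias else 0)


def batchnorm_params(channels):
    """Learnable scale and shift; running statistics are buffers, not parameters."""
    return 2 * channels


added = {
    "pol_conv": conv2d_params(1, 64, 7),     # 64 * 1 * 49 + 64
    "norm_pol": batchnorm_params(64),        # 2 * 64
    "gate_conv": conv2d_params(128, 64, 1),  # 64 * 128 + 64
}
total_added = sum(added.values())
```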

9. Highlights

Decoupling semantics from structure: only the physically correct polarization semantics of “where is the glass” is learned from synthetic data; the structural cue of “how to match glass” is deliberately not learned, avoiding baking rendering artifacts into the model.
Fully isolated stereo pipeline: polarization enters only cnet and is supervised with BCE+Dice glass segmentation; correlation / GRU / fnet and the disparity loss are not touched at all.
Non-destructive side branch: the RGB main branch stays 3-channel so SceneFlow pretrained weights load directly; polarization joins via an independent 1→64 conv stem without modifying the original backbone structure.
Learned gated fusion: feat_rgb + gate × feat_pol with a learned sigmoid gate lets the model decide when to inject the polarization signal.
Extremely lightweight: about 11.6K added parameters, less than 2% of the backbone — virtually no extra cost.