S2M2 Polarization Injection Points (A/B/C)

1. Design Goals

For a stereo matching architecture to exploit polarization when detecting transparent objects, the polarization signal must take part in the core matching computation instead of being applied only as a post-hoc correction. Four architectural characteristics of S2M2 make this possible:

The matching core can be modulated: in S2M2 the correlation cv = einsum(...) is computed before the Sinkhorn Optimal Transport step. This leaves cv as an open interface, so a polarization weight can modulate it element-wise and thereby influence the entire optimal transport.
OT inherently contains an assignment mechanism: the all-pairs correlation compares features in a high-dimensional space, and Optimal Transport has a built-in confidence / assignment mechanism that is well suited to absorbing the uncertainty information that polarization carries.
Cross-attention perceives appearance differences: the MRT cross-attention can directly perceive polarization-induced appearance differences (I∥ vs I⊥) when the left and right features interact.
LayerNorm preserves polarization magnitudes: S2M2 uses LayerNorm in DispInit. LayerNorm normalizes each sample with its own statistics, so unlike BatchNorm it does not mix magnitude statistics across a batch. Because the polarization signal is essentially the magnitude difference between I∥ and I⊥, the choice of normalization determines whether that signal survives.

# S2M2 DispInit
self.layer_norm = nn.LayerNorm(dim, elementwise_affine=True)
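
The batch-independence point can be checked with a small experiment. The following is a minimal sketch, not S2M2 code: the tensor sizes, the random inputs, and the use of nn.BatchNorm1d as the comparison layer are all assumptions made for illustration. It normalizes the same sample alongside two different sets of batch-mates and compares the results.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 8
sample = torch.randn(1, 4, dim)           # one sample: 4 tokens, `dim` features
others_a = torch.randn(3, 4, dim)         # batch-mates with ordinary magnitudes
others_b = 10.0 * torch.randn(3, 4, dim)  # batch-mates with much larger magnitudes

layer_norm = nn.LayerNorm(dim, elementwise_affine=True)
batch_norm = nn.BatchNorm1d(dim)
batch_norm.train()  # train mode: normalizes with the current batch statistics

def first_sample_ln(batch):
    return layer_norm(batch)[0]

def first_sample_bn(batch):
    # BatchNorm1d expects (N, C, L), so move features to dim 1 and back
    return batch_norm(batch.transpose(1, 2)).transpose(1, 2)[0]

ln_a = first_sample_ln(torch.cat([sample, others_a]))
ln_b = first_sample_ln(torch.cat([sample, others_b]))
bn_a = first_sample_bn(torch.cat([sample, others_a]))
bn_b = first_sample_bn(torch.cat([sample, others_b]))

# LayerNorm output for `sample` ignores its batch-mates; BatchNorm output does not.
ln_is_batch_independent = torch.allclose(ln_a, ln_b, atol=1e-6)
bn_is_batch_independent = torch.allclose(bn_a, bn_b, atol=1e-6)
```

This only demonstrates independence from the rest of the batch. Whether the I∥ / I⊥ magnitude gap survives inside a single sample still depends on which dimensions LayerNorm normalizes over.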

This document defines three polarization injection points A / B / C, corresponding respectively to the input layer, the cross-scale fusion layer, and the matching core.

2. Architecture: Positions of the Three Injection Points

[Figure: positions of the three S2M2 polarization injection points A/B/C]

3. Design and Code for the Three Injection Points

3.1 Injection Point C: Correlation Volume (best)

# Original
cv = torch.einsum('...hic,...hjc -> ...hij', feature0, feature1)

# Modification: Pol-weighted Correlation
pol_weight = compute_pol_weight(left, right)  # [B, H, W, W]
cv = torch.einsum('...hic,...hjc -> ...hij', feature0, feature1) * pol_weight
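
The snippet above leaves compute_pol_weight undefined. One possible form is sketched below, purely as an assumption: it takes a per-pixel polarization cue for each view (treated here as a degree of linear polarization in [0, 1]) and down-weights left/right column pairs whose cues disagree. The input format, the exponential kernel, and the tau parameter are hypothetical and not taken from S2M2.

```python
import torch

def compute_pol_weight(left_pol, right_pol, tau=0.1):
    """Hypothetical polarization weight for the correlation volume.

    left_pol, right_pol: [B, H, W] per-pixel polarization cue (assumed to be
    a degree of linear polarization in [0, 1]).
    Returns [B, H, W, W]; entry [b, h, i, j] compares left column i with
    right column j on the same row h.
    """
    diff = left_pol.unsqueeze(-1) - right_pol.unsqueeze(-2)  # [B, H, W, W]
    return torch.exp(-diff.abs() / tau)  # in (0, 1]; equals 1 when the cues agree

B, H, W, C = 2, 4, 6, 8
feature0 = torch.randn(B, H, W, C)
feature1 = torch.randn(B, H, W, C)
left_pol = torch.rand(B, H, W)
right_pol = torch.rand(B, H, W)

pol_weight = compute_pol_weight(left_pol, right_pol)
cv = torch.einsum('...hic,...hjc -> ...hij', feature0, feature1) * pol_weight
```

Because correlation values can be negative, a multiplicative weight shrinks them toward zero rather than strictly lowering them. Adding a log-weight to cv would be an alternative worth comparing.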

Advantages:

Polarization directly participates in core matching rather than serving as side info.
Modulating before Sinkhorn influences the entire optimal transport.
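
The second advantage can be illustrated with a toy example. The sketch below uses a generic log-domain Sinkhorn normalization, which is an assumption and not S2M2's actual implementation; the 4-column volume and the single zeroed weight are also made up. It shows that suppressing one candidate pair in cv changes the transport mass of pairs that were never touched.

```python
import torch

def sinkhorn(cv, n_iters=10):
    """Generic log-domain Sinkhorn over the last two dims (not S2M2's exact code)."""
    log_p = cv.clone()
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # columns sum to 1
    log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)      # finish on rows
    return log_p.exp()

W = 4
cv = 5.0 * torch.eye(W).reshape(1, 1, W, W)  # toy volume: strong diagonal matches
pol_weight = torch.ones(1, 1, W, W)
pol_weight[..., 0, 0] = 0.0                  # polarization vetoes a single pair

plan_before = sinkhorn(cv)
plan_after = sinkhorn(cv * pol_weight)

# Entry (1, 1) of cv was not modified, yet its transport mass changes.
shift = (plan_after[0, 0, 1, 1] - plan_before[0, 0, 1, 1]).abs().item()
```

A post-hoc correction applied after Sinkhorn would only rescale the entries it touches, whereas here the row and column normalizations redistribute mass across the whole plan.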

3.2 Injection Point B: FeatureFusion (AGFL)

# Original: two-stream fusion
z_out = fusion(cat(z0,z1)) + w*z0 + (1-w)*z1

# Modification: three-stream fusion
z_out = fusion(cat(z0,z1,pol)) + w0*z0 + w1*z1 + w2*pol
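
The three-stream pseudocode can be turned into a small module, sketched below under several assumptions: all three streams share the shape [B, dim, H, W], fusion is a 1x1 convolution, and the weights w0, w1, w2 come from a softmax gate so that they sum to 1 (mirroring w and 1-w in the two-stream form). The class name ThreeStreamFusion and these layer choices are hypothetical and may differ from the actual AGFL block.

```python
import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    """Hypothetical three-stream variant of the two-stream fusion above."""

    def __init__(self, dim):
        super().__init__()
        self.fusion = nn.Conv2d(3 * dim, dim, kernel_size=1)  # mixes cat(z0, z1, pol)
        self.gate = nn.Conv2d(3 * dim, 3, kernel_size=1)      # one logit per stream

    def forward(self, z0, z1, pol):
        stacked = torch.cat([z0, z1, pol], dim=1)             # [B, 3*dim, H, W]
        w = torch.softmax(self.gate(stacked), dim=1)          # [B, 3, H, W], sums to 1
        w0, w1, w2 = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        z_out = self.fusion(stacked) + w0 * z0 + w1 * z1 + w2 * pol
        return z_out, w

dim = 16
block = ThreeStreamFusion(dim)
z0, z1, pol = (torch.randn(2, dim, 8, 8) for _ in range(3))
z_out, w = block(z0, z1, pol)
```

Returning w alongside z_out makes it easy to inspect how much the gate relies on the polarization stream at each location.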

3.3 Injection Point A: CNNEncoder Input

# Original
self.conv0 = nn.Conv2d(3, 16, kernel_size=1)

# Modification: extend the number of input channels
self.conv0 = nn.Conv2d(3 + pol_channels, 16, kernel_size=1)
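
If pretrained RGB weights are available, the extended layer can be initialized so that it initially reproduces the original output. The sketch below shows this zero-initialization approach, which is a common practice and an assumption here rather than something the S2M2 code is known to do. The value pol_channels = 2 and the variable names conv0_rgb and conv0_pol are also hypothetical.

```python
import torch
import torch.nn as nn

pol_channels = 2  # assumption: e.g. two polarization-derived channels

conv0_rgb = nn.Conv2d(3, 16, kernel_size=1)                 # original layer
conv0_pol = nn.Conv2d(3 + pol_channels, 16, kernel_size=1)  # extended layer

with torch.no_grad():
    conv0_pol.weight.zero_()                    # polarization filters start at zero
    conv0_pol.weight[:, :3] = conv0_rgb.weight  # reuse the RGB filters
    conv0_pol.bias.copy_(conv0_rgb.bias)

rgb = torch.randn(1, 3, 32, 32)
pol = torch.randn(1, pol_channels, 32, 32)
out_rgb = conv0_rgb(rgb)
out_pol = conv0_pol(torch.cat([rgb, pol], dim=1))  # matches out_rgb at initialization
```

With this initialization the polarization channels contribute nothing until training moves their weights away from zero, so the network starts from its RGB-only behavior.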

4. Tensor Dimensions

Injection Point	Injected Tensor	Shape / Description
A	pol_channels	Concatenated with RGB; input channels extended to `3 + pol_channels`
A	conv0 input	(B, 3 + pol_channels, H, W)
B	Pol feature stream	Same scale as z0, z1; participates in three-stream fusion
C	feature0 / feature1	MRT output features; einsum indices `...hic` / `...hjc`
C	cv	`[B, ..., H, W, W]` (along epipolar i, j dimensions)
C	pol_weight	`[B, H, W, W]`, element-wise multiplied with cv
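
The shapes for injection point C can be checked with a short script. This is a sketch under assumptions: the features are laid out as [B, ..., H, W, C], and the "..." is taken to be one extra dimension G (hypothetical, for example a group or head axis). In that case pol_weight needs an explicit singleton axis to broadcast against cv.

```python
import torch

B, G, H, W, C = 2, 3, 4, 6, 8  # G is a hypothetical extra dim covered by "..."
feature0 = torch.randn(B, G, H, W, C)
feature1 = torch.randn(B, G, H, W, C)
pol_weight = torch.rand(B, H, W, W)

cv = torch.einsum('...hic,...hjc -> ...hij', feature0, feature1)  # [B, G, H, W, W]
cv_mod = cv * pol_weight.unsqueeze(1)  # [B, 1, H, W, W] broadcasts over G
```

Without the unsqueeze, broadcasting would try to align B with G, which fails when the two sizes differ and silently mixes axes when they happen to be equal.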

5. Comparison of Injection Points and Design Decisions

Injection Point	Position	Pol Role	Participates in Core Matching?
A	CNNEncoder input	Input channel extension; integrated from the first layer	Indirect (via feature extraction)
B	FeatureFusion (AGFL)	Third stream in three-stream fusion	Indirect (via cross-scale fusion)
C	Correlation Volume	Modulates cv, before Sinkhorn	Directly modulates core matching

5.1 Design Decisions and Rationale

Decision	Rationale
Prioritize injection point C	Polarization takes part in core matching directly instead of acting as side information
Injection point C is before Sinkhorn	Sinkhorn normalizes cv as a whole, so modulating cv beforehand influences the entire optimal transport
Injection point B as the second choice	Three-stream fusion is a smaller change; cross-scale gating can selectively use Pol
Injection point A as the simplest option	Minimal change (only the conv0 input channels are modified); polarization is integrated from the first layer
Preserve a “direct” path	The S2M2 serial architecture is tightly coupled, so a direct path that bypasses the polarization branch is kept as a safety net against a single point of failure

5.2 Additional Advantages of S2M2 (auxiliary for transparent object detection)

Occlusion output: can be used to detect boundaries of transparent objects (such as glass).
Confidence output: can be used to identify uncertain regions (transparent regions typically fall into this category).
Transformer architecture: cross-attention is naturally suited to handling appearance differences (I∥ vs I⊥).
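
One way to combine the occlusion and confidence outputs into a candidate mask is sketched below. Everything here is an assumption for illustration: the function name transparent_candidates, the [B, H, W] maps scaled to [0, 1], the convention that a high occlusion value means occluded, and the 0.5 thresholds are not taken from S2M2.

```python
import torch

def transparent_candidates(occlusion, confidence, occ_thresh=0.5, conf_thresh=0.5):
    """Flag pixels that look occluded or are matched with low confidence.

    occlusion, confidence: [B, H, W] maps assumed to lie in [0, 1].
    Thresholds are placeholders, not values from S2M2.
    """
    boundary = occlusion > occ_thresh     # candidate boundaries of transparent objects
    uncertain = confidence < conf_thresh  # candidate uncertain regions
    return boundary | uncertain

occlusion = torch.tensor([[[0.9, 0.1], [0.1, 0.1]]])
confidence = torch.tensor([[[0.9, 0.2], [0.9, 0.9]]])
mask = transparent_candidates(occlusion, confidence)
```

Such a mask is only a cue for where to look. It would need to be combined with the polarization signal to tell glass apart from ordinary occlusions or textureless regions.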

6. Highlights

Three injection points cover the whole pipeline: A (input layer), B (cross-scale fusion), and C (matching core), in increasing order of change magnitude, so one can be chosen as needed.
Injection point C brings polarization into core matching: by element-wise modulating cv before Sinkhorn, the polarization signal influences the entire optimal transport rather than serving merely as a post-hoc correction.
OT’s built-in assignment mechanism accommodates polarization information: the all-pairs correlation compares features in a high-dimensional space, and OT inherently provides confidence/assignment, which makes it well suited to polarization injection.
Multi-output auxiliary detection: S2M2 outputs disparity, occlusion, and confidence at once; the latter two can assist in detecting the boundaries and uncertain regions of transparent objects.
A direct safety net is preserved: the serial architecture is tightly coupled, so when polarization is injected, a direct channel that bypasses the polarization path is kept so that a failure in that path does not drag down the whole pipeline.
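
For injection point C, the direct channel could take the form of a blend between the modulated and unmodulated correlation volume. The sketch below is one possible realization and an assumption, not the documented design: the function modulate_with_bypass and the scalar alpha are hypothetical names.

```python
import torch

def modulate_with_bypass(cv, pol_weight, alpha):
    """Blend the polarization-modulated volume with the untouched one.

    alpha = 0 returns cv unchanged (direct path); alpha = 1 gives cv * pol_weight.
    """
    return cv * (1.0 + alpha * (pol_weight - 1.0))

torch.manual_seed(0)
cv = torch.randn(1, 4, 6, 6)
pol_weight = torch.rand(1, 4, 6, 6)

cv_direct = modulate_with_bypass(cv, pol_weight, alpha=0.0)  # bypasses polarization
cv_full = modulate_with_bypass(cv, pol_weight, alpha=1.0)    # full modulation
```

In practice alpha could be a learnable parameter initialized at zero, so that training starts from the unmodified S2M2 behavior and only relies on polarization to the extent that it helps.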