1. Design Goals
For a stereo matching architecture to use the polarization signal to detect transparent objects, the key is to let the polarization signal participate in the core matching computation rather than only being applied as a post-hoc correction. The architectural characteristics of S2M2 make this possible:
- The matching core can be modulated: in S2M2 the matching computation cv = einsum(...) sits before the Sinkhorn Optimal Transport step. It is an open interface, so a polarization weight can modulate cv element-wise and thereby influence the entire optimal transport.
- OT inherently contains an assignment mechanism: all-pairs correlation matches in a higher-dimensional space, and Optimal Transport has a built-in confidence / assignment mechanism that is well suited to accommodating the uncertainty information that polarization brings.
- Cross-attention perceives appearance differences: the MRT cross-attention can directly perceive polarization-induced appearance differences (I∥ vs I⊥) when the left and right features interact.
- LayerNorm preserves polarization magnitudes: S2M2 uses LayerNorm in DispInit. LayerNorm normalizes each sample with its own statistics, so unlike BatchNorm its output does not depend on the other samples in the batch and magnitude differences are not washed out by batch statistics. The essence of the polarization signal is the magnitude difference between I∥ and I⊥, so the choice of normalization directly determines whether the polarization signal can be preserved.
# S2M2 DispInit
self.layer_norm = nn.LayerNorm(dim, elementwise_affine=True)
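The difference between the two normalizations can be illustrated with a small experiment. This is a minimal sketch, not S2M2 code: the feature size `dim`, the random `probe` vector, and the two batches are all made up for illustration. It only shows that LayerNorm treats a sample independently of its batch, while BatchNorm in training mode does not; LayerNorm still rescales each feature vector, so what it keeps is the relative pattern within a sample.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim = 8  # hypothetical feature size
layer_norm = nn.LayerNorm(dim, elementwise_affine=True)
batch_norm = nn.BatchNorm1d(dim)  # training mode by default: uses batch statistics

# One probe feature vector placed in two batches with different companions.
probe = torch.randn(1, dim)
batch_a = torch.cat([probe, torch.randn(3, dim)], dim=0)
batch_b = torch.cat([probe, 10.0 * torch.randn(3, dim)], dim=0)

# LayerNorm normalizes each row on its own, so the probe's output is identical.
ln_same = torch.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0], atol=1e-6)

# BatchNorm normalizes each channel across the batch, so the probe's output
# depends on which other samples share the batch.
bn_same = torch.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0], atol=1e-6)

print("LayerNorm output independent of batch:", ln_same)
print("BatchNorm output independent of batch:", bn_same)
```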
This document defines three polarization injection points A / B / C, corresponding respectively to the input layer, the cross-scale fusion layer, and the matching core.
2. Architecture: Positions of the Three Injection Points
3. Design and Code for the Three Injection Points
3.1 Injection Point C: Correlation Volume (best)
# Original
cv = torch.einsum('...hic,...hjc -> ...hij', feature0, feature1)
# Modification: Pol-weighted Correlation
pol_weight = compute_pol_weight(left, right) # [B, H, W, W]
cv = torch.einsum('...hic,...hjc -> ...hij', feature0, feature1) * pol_weight
Advantages:
- Polarization directly participates in core matching rather than serving as side info.
- Modulating before Sinkhorn influences the entire optimal transport.
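The source does not define `compute_pol_weight`, so the following is only one possible sketch. It assumes that `left` and `right` are polarization inputs of shape [B, 2, H, W] holding (I∥, I⊥), that features are laid out as [B, H, W, C], and that a normalized contrast |I∥ − I⊥| / (I∥ + I⊥) is a reasonable per-pixel descriptor. The temperature `tau` and all tensor sizes are hypothetical.

```python
import torch

def compute_pol_weight(left, right, tau=0.1, eps=1e-6):
    """Hypothetical polarization weight (an assumption, not taken from S2M2).

    left, right: [B, 2, H, W] tensors holding (I_parallel, I_perpendicular).
    Returns a weight of shape [B, H, W, W] with values in (0, 1].
    """
    def contrast(x):
        i_par, i_perp = x[:, 0], x[:, 1]                    # each [B, H, W]
        return (i_par - i_perp).abs() / (i_par + i_perp + eps)

    c_left = contrast(left).unsqueeze(-1)    # [B, H, W, 1], left column i
    c_right = contrast(right).unsqueeze(-2)  # [B, H, 1, W], right column j
    # Pairs (i, j) on the same row with similar contrast get a weight near 1.
    return torch.exp(-(c_left - c_right).abs() / tau)

torch.manual_seed(0)
B, H, W, C = 2, 4, 6, 16
feature0 = torch.randn(B, H, W, C)
feature1 = torch.randn(B, H, W, C)
left_pol = torch.rand(B, 2, H, W)
right_pol = torch.rand(B, 2, H, W)

pol_weight = compute_pol_weight(left_pol, right_pol)             # [B, H, W, W]
cv = torch.einsum('...hic,...hjc->...hij', feature0, feature1)   # [B, H, W, W]
cv_pol = cv * pol_weight                                         # modulated before Sinkhorn
print(cv_pol.shape)
```

One design point to keep in mind: a multiplicative weight in (0, 1] shrinks scores toward zero, so it only suppresses a candidate whose raw score is positive. Whether cv should be shifted first, or the weight applied additively in log space, is left open here.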
3.2 Injection Point B: FeatureFusion (AGFL)
# Original: two-stream fusion
z_out = fusion(cat(z0,z1)) + w*z0 + (1-w)*z1
# Modification: three-stream fusion
z_out = fusion(cat(z0,z1,pol)) + w0*z0 + w1*z1 + w2*pol
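A minimal sketch of how the three-stream form could look as a module. The class name `ThreeStreamFusion`, the 1×1 convolutions, the channel-first layout, and the softmax gate are all assumptions for illustration; the source only gives the formula. The softmax generalizes the original pair (w, 1 − w) so that w0 + w1 + w2 = 1 at every pixel.

```python
import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    """Hypothetical three-stream fusion; layer choices are assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.fusion = nn.Conv2d(3 * dim, dim, kernel_size=1)
        self.gate = nn.Conv2d(3 * dim, 3, kernel_size=1)

    def forward(self, z0, z1, pol):
        stacked = torch.cat([z0, z1, pol], dim=1)      # [B, 3*dim, H, W]
        w = torch.softmax(self.gate(stacked), dim=1)   # [B, 3, H, W], sums to 1 per pixel
        w0, w1, w2 = w[:, 0:1], w[:, 1:2], w[:, 2:3]   # each [B, 1, H, W]
        z_out = self.fusion(stacked) + w0 * z0 + w1 * z1 + w2 * pol
        return z_out, w

torch.manual_seed(0)
z0 = torch.randn(2, 8, 5, 7)
z1 = torch.randn(2, 8, 5, 7)
pol = torch.randn(2, 8, 5, 7)   # pol stream at the same scale as z0, z1
z_out, w = ThreeStreamFusion(dim=8)(z0, z1, pol)
print(z_out.shape, w.shape)
```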
3.3 Injection Point A: CNNEncoder Input
# Original
self.conv0 = nn.Conv2d(3, 16, kernel_size=1)
# Modification: extend the number of input channels
self.conv0 = nn.Conv2d(3 + pol_channels, 16, kernel_size=1)
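When conv0 is extended, existing RGB weights can be carried over so that the network initially behaves as before. The sketch below assumes `pol_channels = 2` and zero-initializes the new polarization filters; both choices are illustrative assumptions, and `conv0_rgb` merely stands in for a pretrained layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pol_channels = 2  # assumed: e.g. I_parallel and I_perpendicular

conv0_rgb = nn.Conv2d(3, 16, kernel_size=1)                 # stands in for the original conv0
conv0_pol = nn.Conv2d(3 + pol_channels, 16, kernel_size=1)  # extended conv0

with torch.no_grad():
    conv0_pol.weight.zero_()                          # pol filters start at zero
    conv0_pol.weight[:, :3].copy_(conv0_rgb.weight)   # reuse the RGB filters
    conv0_pol.bias.copy_(conv0_rgb.bias)

rgb = torch.rand(1, 3, 8, 8)
pol = torch.rand(1, pol_channels, 8, 8)
x = torch.cat([rgb, pol], dim=1)                      # (B, 3 + pol_channels, H, W)

out_rgb = conv0_rgb(rgb)
out_pol = conv0_pol(x)
print(torch.allclose(out_rgb, out_pol, atol=1e-6))
```

With this initialization the extended layer reproduces the RGB-only output at the start, and the polarization filters are learned during fine-tuning.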
4. Tensor Dimensions
| Injection Point | Injected Tensor | Shape / Description |
|---|---|---|
| A | pol_channels | Concatenated with RGB; input channels extended to 3 + pol_channels |
| A | conv0 input | (B, 3 + pol_channels, H, W) |
| B | Pol feature stream | Same scale as z0, z1; participates in three-stream fusion |
| C | feature0 / feature1 | MRT output features; einsum indices ...hic / ...hjc |
| C | cv | [B, ..., H, W, W] (along epipolar i, j dimensions) |
| C | pol_weight | [B, H, W, W], element-wise multiplied with cv |
5. Comparison of Injection Points and Design Decisions
| Injection Point | Position | Pol Role | Participates in Core Matching? |
|---|---|---|---|
| A | CNNEncoder input | Input channel extension; integrated from the first layer | Indirect (via feature extraction) |
| B | FeatureFusion (AGFL) | Third stream in three-stream fusion | Indirect (via cross-scale fusion) |
| C | Correlation Volume | Modulates cv, before Sinkhorn | Directly modulates core matching |
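Why modulation before Sinkhorn counts as "direct" can be seen on a toy example. The sketch below uses a plain log-domain Sinkhorn normalization, which is a generic stand-in and not the S2M2 implementation (no dustbin, no learned temperature). The flat score matrix and the single down-weighted entry are made up so that the effect is easy to read.

```python
import torch

def sinkhorn(scores, n_iters=20):
    """Generic log-domain Sinkhorn over the last two dims (a sketch, not S2M2's)."""
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # normalize rows
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # normalize columns
    return log_p.exp()

W = 4
cv = torch.ones(1, 1, W, W)          # [B, H, W, W]: one row, flat positive scores

# Hypothetical weight that down-weights a single candidate match (i=0, j=1).
pol_weight = torch.ones_like(cv)
pol_weight[..., 0, 1] = 0.1

plan = sinkhorn(cv)                  # uniform plan: every entry is 1/W
plan_pol = sinkhorn(cv * pol_weight)

print(plan[0, 0])
print(plan_pol[0, 0])
```

Only one entry of cv was changed, yet entries in other rows and columns of the transport plan also move, because the row and column normalizations redistribute the mass. A correction applied after Sinkhorn would not have this coupling.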
5.1 Design Decisions and Rationale
| Decision | Rationale |
|---|---|
| Prioritize injection point C | Polarization participates directly in core matching instead of entering indirectly through the features |
| Injection point C is before Sinkhorn | Modulating cv before normalization lets the change propagate through the entire optimal transport, which gives the deepest effect |
| Injection point B as the second choice | Three-stream fusion is a smaller change than modifying the matching core; cross-scale gating can selectively use Pol |
| Injection point A as the simplest option | Minimal change (only the conv0 input channels are modified); polarization is integrated from the first layer |
| Preserve a “direct” path | The S2M2 serial architecture is tightly coupled, so polarization injection must keep a safety net to avoid single-point failure |
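One way to keep such a direct path at injection point C is a gate that starts closed. The sketch below is an assumption, not part of S2M2: the class `GatedPolModulation` and its scalar `alpha` are hypothetical. With `alpha = 0` the output equals the unmodified cv, with `alpha = 1` it equals `cv * pol_weight`, and when no polarization weight is available the module simply returns cv.

```python
import torch
import torch.nn as nn

class GatedPolModulation(nn.Module):
    """Hypothetical wrapper around injection point C with a bypass path."""

    def __init__(self):
        super().__init__()
        # alpha = 0 reproduces the unmodified cv; training can open the gate.
        self.alpha = nn.Parameter(torch.zeros(()))

    def forward(self, cv, pol_weight=None):
        if pol_weight is None:       # polarization unavailable: direct path
            return cv
        # Linear blend between cv (alpha = 0) and cv * pol_weight (alpha = 1).
        return cv * (1.0 + self.alpha * (pol_weight - 1.0))

torch.manual_seed(0)
cv = torch.randn(1, 2, 3, 3)
pol_weight = torch.rand(1, 2, 3, 3)

mod = GatedPolModulation()
out_closed = mod(cv, pol_weight)     # gate closed: same as cv
out_bypass = mod(cv)                 # no polarization input: same as cv
with torch.no_grad():
    mod.alpha.fill_(1.0)
out_open = mod(cv, pol_weight)       # gate open: cv * pol_weight
```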
5.2 Additional Advantages of S2M2 (auxiliary for transparent object detection)
- Occlusion output: can be used to detect boundaries of transparent objects (such as glass).
- Confidence output: can be used to identify uncertain regions (transparent regions typically fall into this category).
- Transformer architecture: cross-attention is naturally suited to handling appearance differences (I∥ vs I⊥).
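The occlusion and confidence outputs could be combined into a rough candidate mask for transparent regions. This sketch assumes both maps have shape [B, H, W] with values in [0, 1]; the function name, the thresholds, and the toy values are placeholders rather than anything specified by S2M2.

```python
import torch

def transparent_candidates(confidence, occlusion, conf_thresh=0.5, occ_thresh=0.5):
    """Flag pixels that are low-confidence or marked as occluded.

    Assumes [B, H, W] maps in [0, 1]; thresholds are placeholders, not S2M2 values.
    """
    return (confidence < conf_thresh) | (occlusion > occ_thresh)

confidence = torch.tensor([[[0.9, 0.2], [0.8, 0.7]]])
occlusion = torch.tensor([[[0.1, 0.1], [0.9, 0.2]]])
mask = transparent_candidates(confidence, occlusion)
print(mask)
```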
6. Highlights
- Three injection points cover the whole pipeline: A (input layer), B (cross-scale fusion), and C (matching core), ordered from the smallest to the largest change and selectable as needed.
- Injection point C brings polarization into core matching: by element-wise modulating cv before Sinkhorn, the polarization signal influences the entire optimal transport rather than serving merely as a post-hoc correction.
- OT’s built-in assignment mechanism accommodates polarization information: all-pairs correlation matches in a high-dimensional space, and OT inherently includes a confidence / assignment mechanism, which makes it well suited to polarization injection.
- Multi-output auxiliary detection: S2M2 outputs disparity, occlusion, and confidence at once; the latter two can assist in detecting the boundaries and uncertain regions of transparent objects.
- A direct safety net is preserved: the serial architecture is tightly coupled, so when polarization is injected, a direct channel that bypasses the polarization path is kept to avoid a single point of failure dragging down the whole pipeline.