Four Architectural Approaches for Region-Level Polarization Understanding

1. Design Goal: Overcoming the Scope Problem of CNNs

The four approaches proposed in this document address the same problem: the polarization branch needs region-level structural understanding, which the local receptive field of a standard CNN cannot provide.

1.1 Nature of the Problem: the Scope Problem of CNNs

CNN (pixel scan):
  - 3x3, 5x5 kernel
  - Local receptive field
  - Each output location sees only its small neighborhood
  - A shallow stack cannot capture region-level "structure"

Polarization difference characteristics:
  - Glass is a whole region
  - Need to "know this is a piece of glass" to interpret polarization correctly
  - Requires structural / regional understanding

Why polarization signals are hard for a CNN to capture:

Within a 3x3 patch:
  -> May see only a tiny brightness difference
  -> Cannot tell "this is a glass boundary"
  -> Cannot understand "the whole region has polarization characteristics"
  -> CNN tends to treat polarization differences as local noise

Per-pixel PolCostVolume has the same problem:
  |I∥(x) - I⊥(x-d)|
  -> No structure awareness
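
The per-pixel cost above can be sketched on a single scanline in pure Python. This is a minimal illustration, assuming x is a pixel position and d a candidate disparity; the function name `pol_cost_1d` and the sample intensities are hypothetical, and the actual PolCostVolume implementation is not shown in this document.

```python
def pol_cost_1d(i_par, i_perp, max_disp):
    """Per-pixel cost |I_par(x) - I_perp(x - d)| on one scanline.

    Returns cost[d][x]; positions where x - d falls outside the image are None.
    """
    width = len(i_par)
    cost = []
    for d in range(max_disp + 1):
        row = []
        for x in range(width):
            if x - d < 0:
                row.append(None)  # no valid sample to the left of the image
            else:
                row.append(abs(i_par[x] - i_perp[x - d]))
        cost.append(row)
    return cost


# Hypothetical intensities: i_perp is i_par shifted left by one pixel.
i_par = [10, 10, 50, 52, 10]
i_perp = [10, 50, 52, 10, 10]
cost = pol_cost_1d(i_par, i_perp, max_disp=2)
```

Every entry of `cost` depends on exactly two pixel values, which is why this formulation carries no region-level context by itself.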

The core of the problem: glass is “a whole region”, and correctly interpreting a polarization signal at a given location requires knowing the region-level fact “this is a piece of glass”. A CNN’s local receptive field sees only a small neighborhood of pixels at a time and cannot capture the structure of a whole piece of glass, so it tends to discard polarization differences as local noise.

1.2 The Common Theme of the Four Approaches

The four architectural approaches below all revolve around the same theme: how to introduce structural / regional understanding to the polarization branch. They differ in their “introduction mechanism” and “scope of change”.

2. Approach A: Introduce ViT into the Pol Stream

Architecture of Approach A: Introducing ViT into the Pol Stream

Advantages:
  - Self-attention provides a global receptive field
  - Each patch can see all other patches
  - Can understand "this entire region is glass"

Issues:
  - Heterogeneous architecture coupling (CNN RGB + ViT Pol)
  - Different feature spaces; fusion may be problematic

Design points: the Pol stream is replaced by a Vision Transformer (ViT). Self-attention gives it a global receptive field: every patch can attend to all other patches, so the model can represent “the whole piece of glass”.

Risks: the RGB stream is a CNN and the Pol stream is a ViT, so the two architectures are heterogeneous; their feature spaces differ, which may complicate fusion.
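
The global receptive field described above can be illustrated with a minimal ViT-style block. This is a sketch under stated assumptions, not the architecture of this project: the class name `PatchSelfAttention`, the patch size, the embedding width, and the single-channel input are all hypothetical, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn


class PatchSelfAttention(nn.Module):
    """Hypothetical minimal ViT-style block: patch embedding + one self-attention layer."""

    def __init__(self, in_ch=1, dim=64, patch=8, heads=4):
        super().__init__()
        # A strided convolution splits the image into non-overlapping patches.
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        tokens = self.embed(x)                      # [B, dim, H/patch, W/patch]
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)  # [B, N, dim], N = h * w
        normed = self.norm(tokens)
        out, weights = self.attn(normed, normed, normed)  # weights: [B, N, N]
        tokens = tokens + out                       # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w), weights


pol = torch.randn(2, 1, 64, 96)  # hypothetical polarization-difference input
feat, weights = PatchSelfAttention()(pol)
```

The attention weights have one row per patch and one column for every other patch, so each patch can draw on the whole image in a single layer.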

3. Approach B: Hybrid CNN + Attention

Architecture of Approach B: Hybrid CNN + Attention

# PolFNet: CNN backbone + Self-Attention layer
import torch.nn as nn

class PolFNetHybrid(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.cnn_backbone = ...  # extract local features
        self.self_attention = nn.MultiheadAttention(...)  # global reasoning

Design points: PolFNet keeps the CNN backbone (for local feature extraction) and adds a self-attention layer on top (for global reasoning). Compared with Approach A, which switches fully to ViT, Approach B is a compromise: adding one attention layer to a CNN architecture is less likely to conflict with the CNN feature space of the RGB stream.
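
One way the hybrid could look end to end is sketched below. The layer sizes, the two-layer backbone, and the residual connection are assumptions made for illustration; the class is named `PolFNetHybridSketch` to keep it distinct from the project's own PolFNet.

```python
import torch
import torch.nn as nn


class PolFNetHybridSketch(nn.Module):
    """Hypothetical CNN backbone followed by one global self-attention layer."""

    def __init__(self, in_ch=1, dim=64, heads=4):
        super().__init__()
        # Two stride-2 convolutions: local features at 1/4 resolution.
        self.cnn_backbone = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.self_attention = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        feat = self.cnn_backbone(x)               # [B, dim, H/4, W/4]
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # [B, H/4 * W/4, dim]
        attended, _ = self.self_attention(tokens, tokens, tokens)
        tokens = tokens + attended                # residual keeps the CNN features
        return tokens.transpose(1, 2).reshape(b, c, h, w)


out = PolFNetHybridSketch()(torch.randn(1, 1, 32, 48))
```

The output is still a CNN-style feature map of shape [B, dim, H/4, W/4], which is the property that makes this variant easier to fuse with a CNN RGB stream than a full ViT.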

4. Approach C: Large Kernel Convolution (recommended to try)

Architecture of Approach C: Large Kernel Convolution

Core idea: the Pol stream focuses on structure rather than pixel-level details.

# Typical design (local)
nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1)

# Improved approach (regional)
nn.Conv2d(in_ch, out_ch, kernel_size=15, stride=4)  # large kernel, 15x15 footprint
nn.Conv2d(in_ch, out_ch, kernel_size=7, dilation=4)  # dilated convolution, 25x25 footprint

Design rationale:

Polarization signals do not need pixel-level precision.
What is needed is a region-level “where is the glass” judgment.
The output of the polarization branch is already at a lower resolution (e.g. H/4, W/4).
Trade spatial resolution for a larger receptive field.

Two implementation paths:

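Large kernel: e.g. kernel_size=15, stride=4. Each output covers a 15×15 window, and the stride downsamples by a factor of 4.
Dilated convolution: e.g. kernel_size=7, dilation=4. The kernel spans dilation × (kernel_size - 1) + 1 = 25 pixels per side while keeping the 49 weights per channel pair of an undilated 7×7 kernel, so the receptive field grows without adding parameters.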
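
The footprint and parameter trade-off of the two paths can be checked with a few lines of arithmetic. The helper names and the channel width of 32 are hypothetical; the output-size formula is the one given in the PyTorch `Conv2d` documentation.

```python
def effective_kernel(kernel_size, dilation=1):
    """Spatial extent covered by one (possibly dilated) kernel, per side."""
    return dilation * (kernel_size - 1) + 1


def conv_weights(in_ch, out_ch, kernel_size):
    """Number of weights in a 2D convolution (bias excluded); dilation does not change it."""
    return in_ch * out_ch * kernel_size * kernel_size


def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one axis for a given input length."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


in_ch = out_ch = 32  # hypothetical channel widths
configs = [("3x3", 3, 1, 1), ("15x15 stride 4", 15, 4, 1), ("7x7 dilation 4", 7, 1, 4)]
for name, k, stride, dilation in configs:
    print(name, effective_kernel(k, dilation), conv_weights(in_ch, out_ch, k))
```

With these settings the dilated 7×7 kernel covers 25 pixels per side using 49 weights per channel pair, while the 15×15 kernel covers 15 pixels per side using 225. The large-kernel path reduces resolution through its stride; the dilated path does not downsample unless a stride is added.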

5. Approach D: Spatial Attention for α

Architecture of Approach D: Spatial Attention for α

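# Add attention only before α prediction
# (constructor arguments are illustrative; the original omits __init__)
class PolCNetWithAttention(nn.Module):
    def __init__(self, backbone, spatial_attention, alpha_head):
        super().__init__()
        self.backbone = backbone                    # unchanged Pol backbone
        self.spatial_attention = spatial_attention  # the only new layer
        self.alpha_head = alpha_head                # existing α head

    def forward(self, x):
        features = self.backbone(x)
        # Add spatial attention so α has structure awareness
        attended = self.spatial_attention(features)
        alpha = self.alpha_head(attended)
        return alpha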

Design points: the Pol stream as a whole is left unmodified; a single spatial attention layer is added before α prediction so that α prediction gains structure awareness. This is the smallest change among the four approaches: the backbone is unchanged, and attention is inserted only before the α head.
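
The document does not specify which spatial attention is used. The sketch below assumes a CBAM-style gate (channel-wise mean and max maps, a 7×7 convolution, and a sigmoid) and a hypothetical α head made of a 1×1 convolution plus sigmoid; both are illustrative choices rather than the project's actual layers.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (an assumed choice, not confirmed by this document)."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat):
        avg_map = feat.mean(dim=1, keepdim=True)  # [B, 1, H, W]
        max_map = feat.amax(dim=1, keepdim=True)  # [B, 1, H, W]
        gate = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return feat * gate                        # reweight every spatial location


class AlphaHead(nn.Module):
    """Hypothetical α head: 1x1 convolution + sigmoid so that α stays between 0 and 1."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feat):
        return torch.sigmoid(self.proj(feat))


features = torch.randn(2, 64, 16, 24)  # stand-in for the backbone output
alpha = AlphaHead(64)(SpatialAttention()(features))
```

Note that a gate of this kind is itself computed by a 7×7 convolution, so its context is still bounded; this is consistent with Approach D being the lightest option rather than the most capable one.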

6. Comparison of the Four Approaches

Approach	Mechanism	Scope of Change	Main Risk
A: ViT	Self-attention (full ViT)	Pol stream fully replaced	Heterogeneous architecture coupling; different feature spaces
B: Hybrid	CNN backbone + Self-Attention layer	Inside PolFNet	Milder than Approach A
C: Large Kernel	Large kernel / dilated conv	Conv layers of the Pol stream	Trades spatial resolution (recommended to try)
D: Spatial Attention	Spatial attention before α prediction	Only before the α head	Smallest change; may be insufficient to solve the backbone’s scope problem

The four approaches lie along a spectrum: from the largest change with strongest global capability in Approach A (full ViT), to the smallest change with only local enhancement in Approach D (attention before the α head). The scope of change roughly correlates with structural understanding capability.

7. Layered Protection: α Supervision and Large Kernel are Complementary

When the polarization branch is fused with RGB via α weighting (cost_fused = α × cost_rgb + (1-α) × cost_pol), a second, independent issue appears alongside “region-level structural understanding”: the learning signal for α is insufficient. The two issues call for separate layers of protection, so α supervision and Large Kernel are complementary rather than substitutes.
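
A worked instance of the fusion formula, for one pixel at one disparity, shows how α shifts the weight between the two costs. The function name `fuse_cost` and the cost values are hypothetical.

```python
def fuse_cost(alpha, cost_rgb, cost_pol):
    """cost_fused = alpha * cost_rgb + (1 - alpha) * cost_pol for one pixel and disparity."""
    return alpha * cost_rgb + (1.0 - alpha) * cost_pol


# Hypothetical cost values for one pixel at one disparity.
cost_rgb, cost_pol = 0.9, 0.1
fused_low_alpha = fuse_cost(0.2, cost_rgb, cost_pol)   # polarization carries 80% of the weight
fused_high_alpha = fuse_cost(0.8, cost_rgb, cost_pol)  # RGB carries 80% of the weight
```

With α = 0.2 the fused cost is 0.26, close to the polarization cost; with α = 0.8 it is 0.74, close to the RGB cost. If α stays near 1 everywhere, the polarization branch has almost no influence on the result.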

7.1 Problem Decomposition

1. Learning-signal problem:
   - α has no clear learning direction
   - It tends to follow the "better-performing" RGB (α drifts toward 1)
   - Solution: α supervision

2. Architectural-capability problem:
   - A CNN's local receptive field cannot capture structure
   - The Pol stream may learn shortcuts in the data
   - Solution: Large Kernel (structure-level features)

7.2 Why Both Are Needed

With α supervision only:
  ✓ α will differentiate
  ✗ Pol stream may still learn pixel-level features that do not transfer across domains
  ✗ Cross-domain deployment may fail

With Large Kernel only:
  ✓ Pol stream learns structure-level features with a smaller domain gap
  ✗ If the RGB FNet is frozen, α may still follow RGB
  ✗ α may not differentiate

Combining both:
  ✓ α supervision -> α is directly pushed to differentiate
  ✓ Large Kernel -> Pol features are more robust
  ✓ Layered protection, reducing the sim-to-real gap

7.3 Design of the Two-Layer Protection

Architecture of the layered protection strategy: Layer 1 + Layer 2

Layer 1 (α soft supervision): apply soft supervision on α (target 0.2 in glass regions, 0.8 in non-glass regions, so that the polarization cost carries most of the weight on glass). This ensures α differentiates in the correct direction and solves the “learning signal” problem.
Layer 2 (Large Kernel PolFNet): use a large kernel (15×15) or dilated convolution to make the Pol stream focus on structure-level features, reducing the risk of pixel-level shortcuts and solving the “architectural capability / domain gap” problem.
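
Layer 1 can be sketched as a loss between the predicted α map and a soft target built from a glass mask. This assumes a per-pixel glass mask is available for the training data; the function names and the choice of an L1 loss are assumptions, and only the target values 0.2 and 0.8 come from this document.

```python
import torch
import torch.nn.functional as F


def alpha_soft_target(glass_mask, glass_value=0.2, other_value=0.8):
    """Soft α target: glass_value inside glass regions, other_value elsewhere."""
    mask = glass_mask.float()
    return mask * glass_value + (1.0 - mask) * other_value


def alpha_supervision_loss(alpha, glass_mask):
    """L1 distance between predicted α and the soft target (L1 is an assumed choice)."""
    return F.l1_loss(alpha, alpha_soft_target(glass_mask))


glass_mask = torch.zeros(1, 1, 4, 4, dtype=torch.bool)
glass_mask[..., 1:3, 1:3] = True       # hypothetical 2x2 glass region
alpha = torch.full((1, 1, 4, 4), 0.8)  # an α map that ignores the glass region
loss = alpha_supervision_loss(alpha, glass_mask)
```

In this example α is correct on the 12 non-glass pixels and off by 0.6 on the 4 glass pixels, so the mean L1 loss is 0.15. The gradient of this term acts only where α disagrees with the target, which gives α an explicit learning direction on glass.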

8. Design Decisions and Rationale

8.1 Polarization understanding requires region-level scope

Glass is a whole region, and correctly interpreting a polarization signal is premised on “knowing this is a piece of glass”. Any architectural improvement to the polarization branch is essentially about enlarging its receptive field or introducing structural reasoning.

8.2 Trade-off between scope of change and structural capability

The four approaches offer options ranging from “full replacement” to “local enhancement”. Approaches A/B introduce global reasoning via attention but bring the risk of heterogeneous architecture coupling; Approach C enlarges the receptive field within a pure CNN framework using larger conv kernels, avoiding architectural heterogeneity; Approach D involves the smallest change but offers the most limited structural understanding.

8.3 Large Kernel: trading resolution for receptive field

The value of polarization signals lies in region-level “where is the glass”, not in pixel-level precision. It is therefore acceptable to proactively trade spatial resolution for a larger receptive field. Dilated convolution additionally enlarges the receptive field without adding parameters beyond those of the undilated kernel, making it the recommended option to try first.

8.4 Learning signal and architectural capability are two independent issues

Failure of α to differentiate may stem from either “insufficient learning signal” or “insufficient architectural capability”. The layered protection strategy provides one protection for each (α supervision for the former, Large Kernel for the latter), ensuring that both issues are addressed.

9. Highlights

Clearly identifies the core difficulty of polarization understanding: glass is a region-level structure, and a CNN’s local receptive field tends to misjudge the polarization signal of a whole piece of glass as local noise. All four approaches start from this common premise.
The four approaches form a complete spectrum: from full ViT (strongest global capability, largest change) to spatial attention before the α head (smallest change), covering different trade-offs between capability and cost.
Large Kernel trades resolution for scope: it leverages the property that “polarization only needs region-level precision” and proactively trades spatial resolution for a larger receptive field; dilated convolution can further enlarge the scope with no more parameters than the undilated kernel.
The dual-track design of layered protection: separates “insufficient α learning signal” and “insufficient architectural structural capability” into two independent issues, addressed respectively by α soft supervision and Large Kernel PolFNet. The two layers are complementary rather than substitutes.