Shallow Learnable Polarization Encoder Architecture

1. Design Goals

This architecture introduces polarization information into the stereo matching framework and addresses the representational limitations of “physics-based polarization features”.

Physics-based polarization features (pol_diff, pol_ratio, etc.) are produced by fixed formulas and suffer from three limitations in representational power:

Fixed channel weights: the R / G / B channels are equally weighted and cannot be reweighted according to their information content.
No learnable non-linear cross-channel combinations: pure physics formulas contain no learnable non-linear transforms.
Channel-wise differences in information cannot be exploited: the separability analysis shows that the discriminative power of the polarization channels follows B > G > R (see Section 2), but fixed features cannot take advantage of this imbalance.

The design goal of this architecture is to add a shallow learnable encoder after the physics-based features so that the model can learn channel weights and non-linear combinations from data. In addition, following the separability analysis, only the discriminative channels are retained: the Sobel channels are removed, reducing the polarization features from 12ch to 6ch.

2. Design Basis: Channel Analysis

This architecture is grounded in a separability analysis of polarization channels.

2.1 Analysis Method

For scenes with clear polarization signals, the glass / non-glass separability of each channel is analyzed:

Separability = |glass_mean - nonglass_mean| / pooled_std

A higher value means the glass and non-glass distributions of that channel overlap less, so the channel discriminates glass better.
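The metric above can be sketched in a few lines of NumPy. This is a minimal illustration, not the analysis code used for the tables: the function name, the mask convention, and in particular the definition of pooled_std (here the sample-size-weighted pooled standard deviation) are assumptions, since the text does not define them.

```python
import numpy as np

def separability(channel, glass_mask):
    """|glass_mean - nonglass_mean| / pooled_std for one feature channel.

    channel:    (H, W) float array, one polarization feature channel.
    glass_mask: (H, W) bool array, True where the pixel is glass.
    ASSUMPTION: pooled_std is the sample-size-weighted pooled standard
    deviation; the source does not say which definition was used.
    """
    glass = channel[glass_mask]
    nonglass = channel[~glass_mask]
    n_g, n_n = glass.size, nonglass.size
    pooled_var = ((n_g - 1) * glass.var(ddof=1)
                  + (n_n - 1) * nonglass.var(ddof=1)) / (n_g + n_n - 2)
    return abs(glass.mean() - nonglass.mean()) / np.sqrt(pooled_var)

# Synthetic check: the left half of the image is labelled "glass".
rng = np.random.default_rng(0)
mask = np.zeros((64, 64), dtype=bool)
mask[:, :32] = True
shifted = rng.normal(0.0, 1.0, (64, 64)) + 1.0 * mask  # glass mean shifted by one std
noise = rng.normal(0.0, 1.0, (64, 64))                 # no glass / non-glass difference

sep_shifted = separability(shifted, mask)
sep_noise = separability(noise, mask)
print(f"shifted channel: {sep_shifted:.2f}, pure noise: {sep_noise:.2f}")
```

A channel whose glass mean is shifted by about one standard deviation scores near 1, while pure noise scores near 0.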

2.2 Per-Channel Separability Ranking

Rank	Channel	Separability	Rating
1	pol_diff_B	1.14	Best
2	pol_diff_G	1.03	Good
3	pol_diff_R	0.77	Decent
4	pol_ratio_B	0.39	Average
5	pol_ratio_G	0.35	Average
6	pol_ratio_R	0.33	Average
7–12	sobel_*	< 0.01	Useless

2.3 Per-Group Summary

Group	Avg Separability	Conclusion
pol_diff	0.98	Core feature; must be retained
pol_ratio	0.36	Has some value; retain
sobel_x	0.007	Useless; remove
sobel_y	0.007	Useless; remove
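The pol_diff and pol_ratio group averages can be reproduced from the per-channel values in Section 2.2. The snippet below is only an arithmetic check; the dictionary layout is illustrative. The Sobel rows are omitted because the ranking table gives them only as an upper bound (< 0.01).

```python
# Per-channel separability values copied from the ranking table (Section 2.2).
per_channel = {
    "pol_diff": {"B": 1.14, "G": 1.03, "R": 0.77},
    "pol_ratio": {"B": 0.39, "G": 0.35, "R": 0.33},
}

# Group average = mean over the three colour channels, rounded to 2 decimals.
group_avg = {
    group: round(sum(values.values()) / len(values), 2)
    for group, values in per_channel.items()
}
print(group_avg)
```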

2.4 Conclusions of the Analysis

Sobel carries no usable signal: separability is below 0.01, indistinguishable from noise. Gradient features at glass boundaries are insufficient to distinguish glass from non-glass.
pol_diff is the core: sep ~1.0, excellent glass / non-glass discriminative power.
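The blue channel is the strongest: B > G > R. This ordering is in the direction expected from Fresnel physics: normal dispersion gives glass a slightly higher refractive index, and hence slightly higher reflectance, at shorter wavelengths.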
pol_ratio has auxiliary value: sep ~0.35, not as strong as pol_diff but still contributes.

Physical interpretation:

pol_diff = |I∥ - I⊥|: directly measures the polarization intensity difference; Fresnel reflection on glass surfaces produces a clear difference.
pol_ratio = I∥ / (I∥ + I⊥ + ε): polarization ratio, related to the incidence angle; ε stabilizes the denominator (see Section 7).
Sobel: edge information is intended to find glass boundaries, but glass / non-glass edge gradients differ little.
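To illustrate why Fresnel reflection produces a polarization difference on glass, the standard Fresnel equations can be evaluated for an air-to-glass interface. This is a textbook sketch under stated assumptions: the refractive index n = 1.5 is a typical value for glass rather than a measured one, and how the s and p components map onto I∥ and I⊥ depends on the polarizer orientation, which the text does not specify.

```python
import math

def fresnel_reflectance(theta_i_deg, n=1.5):
    """Power reflectance (R_s, R_p) for light going from air into a medium of index n."""
    ti = math.radians(theta_i_deg)
    tt = math.asin(math.sin(ti) / n)  # Snell's law, air index taken as 1.0
    rs = (math.cos(ti) - n * math.cos(tt)) / (math.cos(ti) + n * math.cos(tt))
    rp = (n * math.cos(ti) - math.cos(tt)) / (n * math.cos(ti) + math.cos(tt))
    return rs ** 2, rp ** 2

# Brewster's angle: the p-polarized reflectance vanishes here.
brewster_deg = math.degrees(math.atan(1.5))

for angle in (0.0, 30.0, brewster_deg, 75.0):
    r_s, r_p = fresnel_reflectance(angle)
    print(f"{angle:5.1f} deg  R_s={r_s:.3f}  R_p={r_p:.3f}  |R_s - R_p|={abs(r_s - r_p):.3f}")
```

At normal incidence the two components reflect equally, so the difference is zero; it grows with the incidence angle, which is also why pol_ratio is related to the incidence angle.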

Based on this analysis, the architecture reduces pol_features from 12ch to 6ch (removing the Sobel channels) and adds a learnable encoder to exploit differences in channel informativeness such as B > G > R.

3. Architecture

Architecture of the Shallow Learnable Pol Encoder

A 2-layer shallow conv encoder is inserted between the physics-based features (6ch: pol_diff 3ch + pol_ratio 3ch) and the MotionEncoder, transforming 6ch into 32ch.

4. Components and Modules

4.1 PolEncoder (shallow learnable encoder)

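import torch.nn as nn
import torch.nn.functional as F

class PolEncoder(nn.Module):
    """Shallow learnable encoder for pol features (6ch -> 16ch -> 32ch)."""
    def __init__(self, in_ch=6, hidden_ch=16, out_ch=32):
        super().__init__()
        # 3x3 convs with padding=1 preserve the spatial size (H/8, W/8).
        self.conv1 = nn.Conv2d(in_ch, hidden_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_ch, out_ch, 3, padding=1)

    def forward(self, x):
        # x: (B, in_ch, H/8, W/8) -> (B, out_ch, H/8, W/8)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return x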

Two 3×3 conv layers: 6 → 16 → 32, each followed by ReLU. About 5.5K parameters (880 + 4,640, including biases).

4.2 Polarization Features (6ch)

Polarization features are 6ch, retaining only the two discriminative groups (removing Sobel, whose separability is < 0.01):

pol_diff (3ch): |I∥ - I⊥|
pol_ratio (3ch): I∥ / (I∥ + I⊥ + ε)
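The two formulas can be turned into a 6ch feature stack as follows. This is a minimal NumPy sketch: the function name, the argument names i_par / i_perp, and the channel-first (3, H, W) layout are assumptions made for illustration, not the project's actual preprocessing code.

```python
import numpy as np

EPS = 1e-6  # denominator stabilizer, value taken from the hyperparameter table

def pol_features(i_par, i_perp, eps=EPS):
    """Stack pol_diff (3ch) and pol_ratio (3ch) into a (6, H, W) array.

    i_par, i_perp: (3, H, W) float arrays (RGB) for the two polarization
    states. Names and layout are hypothetical.
    """
    pol_diff = np.abs(i_par - i_perp)            # |I_par - I_perp|
    pol_ratio = i_par / (i_par + i_perp + eps)   # eps avoids 0/0 on dark pixels
    return np.concatenate([pol_diff, pol_ratio], axis=0)

# Toy input: top half is lit (0.8 vs 0.2), bottom half is completely dark.
i_par = np.zeros((3, 4, 4), dtype=np.float32)
i_perp = np.zeros((3, 4, 4), dtype=np.float32)
i_par[:, :2] = 0.8
i_perp[:, :2] = 0.2

feats = pol_features(i_par, i_perp)
print(feats.shape, feats[0, 0, 0], feats[3, 0, 0], feats[3, 3, 3])
```

Without ε, the dark pixels would evaluate 0/0 and produce NaN in the pol_ratio channels; with it, they evaluate to 0.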

4.3 MotionEncoder (pol branch input dimension)

Because PolEncoder expands polarization features to 32ch, the input dimension of the MotionEncoder polarization branch is 32ch.

5. Tensor Dimensions

Item	Shape	Description
pol_raw	(B, 6, H/8, W/8)	pol_diff(3) + pol_ratio(3)
PolEncoder conv1 output	(B, 16, H/8, W/8)	Conv(6→16) + ReLU
PolEncoder conv2 output	(B, 32, H/8, W/8)	Conv(16→32) + ReLU
MotionEncoder pol branch input	(B, 32, H/8, W/8)	PolEncoder output
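The shapes in the table can be checked by running a random tensor through the encoder. The sketch below repeats the PolEncoder definition from Section 4.1 so that it is self-contained; the batch size and image size are arbitrary example values, and PyTorch is assumed to be installed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolEncoder(nn.Module):
    """Same definition as in Section 4.1, repeated to keep the example standalone."""
    def __init__(self, in_ch=6, hidden_ch=16, out_ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, hidden_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_ch, out_ch, 3, padding=1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return x

B, H, W = 2, 256, 512  # hypothetical batch size and full-resolution image size
pol_raw = torch.rand(B, 6, H // 8, W // 8)

enc = PolEncoder()
with torch.no_grad():
    hidden = F.relu(enc.conv1(pol_raw))  # intermediate map after conv1 + ReLU
    out = enc(pol_raw)                   # final map fed to the MotionEncoder pol branch

shapes = {
    "pol_raw": tuple(pol_raw.shape),
    "conv1": tuple(hidden.shape),
    "conv2": tuple(out.shape),
}
print(shapes)
```

Because both convolutions use a 3×3 kernel with padding=1, only the channel dimension changes; the spatial size stays at H/8 × W/8.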

6. Parameter Count

Component	Parameters
PolEncoder	~5.5K
MotionEncoder pol branch (input 32ch)	~17K
Total added	~15K
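As a cross-check on the PolEncoder entry, its parameter count follows directly from the layer shapes in Section 4.1: a Conv2d layer with a k×k kernel and a bias has in_ch · out_ch · k² + out_ch parameters. The helper name below is illustrative. The MotionEncoder figure cannot be checked the same way because its layers are not listed here.

```python
def conv2d_params(in_ch, out_ch, kernel=3, bias=True):
    """Number of learnable parameters in a Conv2d with a square kernel."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

conv1 = conv2d_params(6, 16)    # 6 * 16 * 9 + 16
conv2 = conv2d_params(16, 32)   # 16 * 32 * 9 + 32
total = conv1 + conv2
print(conv1, conv2, total)
```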

7. Hyperparameters

Hyperparameter	Value	Description
PolEncoder in_ch	6	pol_diff(3) + pol_ratio(3)
PolEncoder hidden_ch	16	Number of channels in the intermediate layer
PolEncoder out_ch	32	Number of output channels, injected into MotionEncoder
ε	1e-6	Denominator stabilizer for pol_ratio

8. Design Decisions and Rationale

8.1 Adding a learnable encoder

Physics-based features weight all channels equally and contain no learnable non-linear combinations. A shallow encoder lets the model:

Learn the B > G > R weighting.
Learn an optimal combination of pol_diff and pol_ratio.
Potentially discover more useful features through non-linear transforms.

8.2 Removing Sobel (12ch → 6ch)

The channel analysis shows that Sobel separability is below 0.01, indistinguishable from noise. Removing these channels, half of the original 12, simplifies the model. This is data-driven feature selection: not every physics-inspired feature is useful, and each must be validated with data.

8.3 Keep the encoder shallow

Only 2 conv layers (~5.5K parameters) are used, to keep the side-information injection “lightweight”. The encoder’s role is to learn channel combinations, not to extract deep features.

9. Highlights

Data-driven feature selection: a quantitative separability metric filters out Sobel channels whose discriminative power is close to noise (< 0.01), reducing polarization features from 12ch to 6ch and validating that “physics-inspired features are not necessarily useful”.
A shallow learnable encoder complements the expressiveness of physics-based features: only 2 conv layers (~5.5K parameters) let the model learn the unequal weighting B > G > R and non-linear combinations of pol_diff and pol_ratio.
Combining physics and learning: the foundation of polarization features remains deterministic physics formulas (pol_diff, pol_ratio); the learnable encoder only performs a lightweight transform on top, balancing interpretability with learnability.
Channel analysis directly drives the architecture: the separability ranking (pol_diff_B strongest, Sobel weakest) is in the direction expected from Fresnel physics (glass reflects slightly more at shorter wavelengths), and its conclusions directly determine both the channel selection and the motivation for the encoder design.