1. Design Goals
This architecture introduces polarization information into the stereo matching framework and addresses the representational limitations of “physics-based polarization features”.
Physics-based polarization features (pol_diff, pol_ratio, etc.) are produced by fixed formulas and suffer from three limitations in representational power:
- Fixed channel weights: the R / G / B channels are equally weighted and cannot be adjusted by their information content.
- Non-linear cross-channel combinations cannot be learned: pure physics formulas contain no learnable non-linear transforms.
- Channel-wise differences in information cannot be exploited: the separability analysis shows that polarization-channel discriminative power follows B > G > R (see Section 2), but fixed features cannot leverage this unequal weighting.
The design goal is to add a shallow learnable encoder after the physics-based features so that the model can learn channel weights and non-linear combinations from data. In addition, based on the separability analysis, only the discriminative channels are retained: the Sobel channels are dropped, reducing the polarization features from 12ch to 6ch.
2. Design Basis: Channel Analysis
This architecture is grounded in a separability analysis of polarization channels.
2.1 Analysis Method
For scenes with clear polarization signals, the glass / non-glass separability of each channel is analyzed:
Separability = |glass_mean - nonglass_mean| / pooled_std
A higher value means the glass and non-glass distributions of that channel overlap less, so the channel discriminates better between the two classes.
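The metric can be sketched in a few lines of NumPy. The source does not say how pooled_std is computed, so this sketch assumes the usual degrees-of-freedom-weighted pooled standard deviation; the function name `separability` and the toy data are illustrative only.

```python
import numpy as np


def separability(values, glass_mask):
    """|glass_mean - nonglass_mean| / pooled_std for a single channel.

    Assumption: pooled_std is the pooled standard deviation weighted by
    each class's degrees of freedom; the document does not specify this.
    """
    values = np.asarray(values, dtype=np.float64).ravel()
    mask = np.asarray(glass_mask, dtype=bool).ravel()
    glass, nonglass = values[mask], values[~mask]
    n_g, n_n = glass.size, nonglass.size
    # Pooled variance from the two unbiased per-class variances.
    pooled_var = (
        (n_g - 1) * glass.var(ddof=1) + (n_n - 1) * nonglass.var(ddof=1)
    ) / (n_g + n_n - 2)
    return abs(glass.mean() - nonglass.mean()) / np.sqrt(pooled_var)


# Toy channel: three non-glass pixels followed by three glass pixels.
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
mask = np.array([False, False, False, True, True, True])
sep = separability(vals, mask)
print(sep)
```

In practice `values` would be one channel of the polarization features over the analyzed scenes and `glass_mask` the matching glass annotation.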
2.2 Per-Channel Separability Ranking
| Rank | Channel | Separability | Rating |
|---|---|---|---|
| 1 | pol_diff_B | 1.14 | Best |
| 2 | pol_diff_G | 1.03 | Good |
| 3 | pol_diff_R | 0.77 | Decent |
| 4 | pol_ratio_B | 0.39 | Average |
| 5 | pol_ratio_G | 0.35 | Average |
| 6 | pol_ratio_R | 0.33 | Average |
| 7–12 | sobel_* | < 0.01 | Useless |
2.3 Per-Group Summary
| Group | Avg Separability | Conclusion |
|---|---|---|
| pol_diff | 0.98 | Core feature; must be retained |
| pol_ratio | 0.36 | Has some value; retain |
| sobel_x | 0.007 | Useless; remove |
| sobel_y | 0.007 | Useless; remove |
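The group averages for pol_diff and pol_ratio follow directly from the per-channel values in Section 2.2. A short check, using only the numbers from that table (the Sobel channels are omitted because only an upper bound is listed for them):

```python
# Per-channel separability from Section 2.2, in B, G, R order.
per_channel = {
    "pol_diff": [1.14, 1.03, 0.77],
    "pol_ratio": [0.39, 0.35, 0.33],
}

# Arithmetic mean per group, rounded to two decimals as in the summary table.
group_avg = {
    name: round(sum(vals) / len(vals), 2) for name, vals in per_channel.items()
}
print(group_avg)
```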
2.4 Conclusions of the Analysis
- Sobel is completely useless: separability < 0.01, indistinguishable from noise. Gradient features of glass boundaries are insufficient to distinguish glass from non-glass.
- pol_diff is the core: sep ~1.0, excellent glass / non-glass discriminative power.
- The blue channel is the strongest: B > G > R. This is consistent with Fresnel physics: the refractive index of glass is slightly higher at shorter wavelengths, so reflection is somewhat stronger in blue.
- pol_ratio has auxiliary value: sep ~0.35, not as strong as pol_diff but still contributes.
Physical interpretation:
- pol_diff = |I∥ - I⊥|: directly measures the polarization intensity difference; Fresnel reflection on glass surfaces produces a clear difference.
- pol_ratio = I∥/(I∥+I⊥): polarization ratio, related to the incidence angle.
- Sobel: edge information is intended to find glass boundaries, but glass / non-glass edge gradients differ little.
Based on this analysis, the architecture reduces pol_features from 12ch to 6ch (removing the useless Sobel) and adds a learnable encoder to exploit cross-channel unequal information such as B > G > R.
3. Architecture
A 2-layer shallow conv encoder is inserted between the physics-based features (6ch: pol_diff 3ch + pol_ratio 3ch) and the MotionEncoder, transforming 6ch into 32ch.
4. Components and Modules
4.1 PolEncoder (shallow learnable encoder)
    import torch.nn as nn
    import torch.nn.functional as F


    class PolEncoder(nn.Module):
        """Shallow learnable encoder for pol features."""

        def __init__(self, in_ch=6, hidden_ch=16, out_ch=32):
            super().__init__()
            # 3x3 convs with padding=1 keep the spatial size unchanged.
            self.conv1 = nn.Conv2d(in_ch, hidden_ch, 3, padding=1)
            self.conv2 = nn.Conv2d(hidden_ch, out_ch, 3, padding=1)

        def forward(self, x):
            # (B, 6, h, w) -> (B, 16, h, w) -> (B, 32, h, w)
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
            return x
Two conv layers: 6 → 16 → 32, each followed by ReLU. With the 3×3 kernels and biases shown, this is 880 + 4,640 = 5,520 parameters (about 5.5K).
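The parameter count of the two layers can be checked by hand. The sketch below counts weights and biases for a standard 2D convolution with the channel sizes and 3×3 kernels from the class above; the helper name `conv2d_params` is illustrative.

```python
def conv2d_params(in_ch, out_ch, kernel_size, bias=True):
    """Weights (out_ch * in_ch * k * k) plus one bias per output channel."""
    weights = out_ch * in_ch * kernel_size * kernel_size
    return weights + (out_ch if bias else 0)


conv1 = conv2d_params(6, 16, 3)   # 16 * 6 * 9 + 16
conv2 = conv2d_params(16, 32, 3)  # 32 * 16 * 9 + 32
total = conv1 + conv2
print(conv1, conv2, total)  # 880 4640 5520
```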
4.2 Polarization Features (6ch)
Polarization features are 6ch, retaining only the two discriminative groups (removing Sobel, whose separability is < 0.01):
- pol_diff (3ch): |I∥ - I⊥|
- pol_ratio (3ch): I∥ / (I∥ + I⊥ + ε)
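A minimal PyTorch sketch of the two formulas, assuming the parallel and perpendicular intensity images arrive as (B, 3, H, W) tensors. The function name `pol_features` and the input layout are assumptions for illustration; how the features are brought to the H/8 resolution listed in Section 5 is not covered here.

```python
import torch


def pol_features(i_par, i_perp, eps=1e-6):
    """Stack pol_diff (3ch) and pol_ratio (3ch) into one 6-channel tensor."""
    pol_diff = (i_par - i_perp).abs()            # |I_par - I_perp|
    pol_ratio = i_par / (i_par + i_perp + eps)   # eps keeps the denominator non-zero
    return torch.cat([pol_diff, pol_ratio], dim=1)


# Constant toy images: I_par = 0.75, I_perp = 0.25 everywhere.
i_par = torch.full((1, 3, 4, 4), 0.75)
i_perp = torch.full((1, 3, 4, 4), 0.25)
feats = pol_features(i_par, i_perp)
print(feats.shape)  # torch.Size([1, 6, 4, 4])
```

With these inputs the first three channels are 0.5 and the last three are very close to 0.75; ε only matters where both intensities are near zero.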
4.3 MotionEncoder (pol branch input dimension)
Because PolEncoder expands polarization features to 32ch, the input dimension of the MotionEncoder polarization branch is 32ch.
5. Tensor Dimensions
| Item | Shape | Description |
|---|---|---|
| pol_raw | (B, 6, H/8, W/8) | pol_diff(3) + pol_ratio(3) |
| PolEncoder conv1 output | (B, 16, H/8, W/8) | Conv(6→16) + ReLU |
| PolEncoder conv2 output | (B, 32, H/8, W/8) | Conv(16→32) + ReLU |
| MotionEncoder pol branch input | (B, 32, H/8, W/8) | PolEncoder output |
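The shapes in this table can be reproduced with a quick forward pass. The sketch repeats the PolEncoder layers from Section 4.1 so that it runs on its own; the batch size and image size are hypothetical, and random data stands in for the real polarization features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolEncoder(nn.Module):
    """Same layers as the PolEncoder in Section 4.1."""

    def __init__(self, in_ch=6, hidden_ch=16, out_ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, hidden_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_ch, out_ch, 3, padding=1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))


B, H, W = 2, 64, 96                           # hypothetical batch and image size
pol_raw = torch.rand(B, 6, H // 8, W // 8)    # stand-in for pol_diff(3) + pol_ratio(3)

enc = PolEncoder()
with torch.no_grad():
    hidden = F.relu(enc.conv1(pol_raw))       # conv1 output
    out = enc(pol_raw)                        # conv2 output, passed to the MotionEncoder pol branch

print(tuple(pol_raw.shape), tuple(hidden.shape), tuple(out.shape))
# (2, 6, 8, 12) (2, 16, 8, 12) (2, 32, 8, 12)
```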
6. Parameter Count
| Component | Parameters |
|---|---|
| PolEncoder | ~5.5K |
| MotionEncoder pol branch (input 32ch) | ~17K |
| Total added | ~15K |
7. Hyperparameters
| Hyperparameter | Value | Description |
|---|---|---|
| PolEncoder in_ch | 6 | pol_diff(3) + pol_ratio(3) |
| PolEncoder hidden_ch | 16 | Number of channels in the intermediate layer |
| PolEncoder out_ch | 32 | Number of output channels, injected into MotionEncoder |
| ε | 1e-6 | Denominator stabilizer for pol_ratio |
8. Design Decisions and Rationale
8.1 Adding a learnable encoder
Physics-based features have equally weighted channels and cannot learn non-linear combinations. A shallow encoder lets the model:
- Learn the B > G > R weighting.
- Learn an optimal combination of pol_diff and pol_ratio.
- Potentially discover more useful features through non-linear transforms.
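To illustrate the first point: equal channel weighting is one particular setting of a convolution's weights, and training is free to move away from it. The sketch below uses a 1×1 convolution as the simplest case (the PolEncoder's 3×3 kernels additionally mix spatial neighbors). The R, G, B channel order and the weights 0.2 / 0.3 / 0.5 are hypothetical, chosen only to show a B > G > R weighting.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 3, 5, 5)  # one 3-channel feature group, assumed R, G, B order

mix = nn.Conv2d(3, 1, kernel_size=1, bias=False)
with torch.no_grad():
    # Fixed, equal weighting: all three weights at 1/3 gives the channel mean.
    mix.weight.fill_(1.0 / 3.0)
    equal = mix(x)

    # Hypothetical weights a trained layer could settle on, favoring B over G over R.
    mix.weight.copy_(torch.tensor([0.2, 0.3, 0.5]).view(1, 3, 1, 1))
    weighted = mix(x)

print(torch.allclose(equal, x.mean(dim=1, keepdim=True), atol=1e-6))  # True
```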
8.2 Removing Sobel (12ch → 6ch)
The Channel Analysis shows Sobel separability < 0.01, indistinguishable from noise. Removing these six channels, half of the original 12, simplifies the input without discarding discriminative information. This is data-driven feature selection: not every physics-inspired feature is useful, and each must be validated with data.
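The selection rule can be written as a simple threshold on the group averages from Section 2.3. The 0.01 cutoff mirrors the "< 0.01" criterion used in the analysis; treating it as an explicit threshold, and the variable names, are assumptions of this sketch.

```python
# Average separability per feature group, from Section 2.3.
group_sep = {"pol_diff": 0.98, "pol_ratio": 0.36, "sobel_x": 0.007, "sobel_y": 0.007}

CHANNELS_PER_GROUP = 3  # each group has one channel per color (R, G, B)
THRESHOLD = 0.01        # groups below this are treated as noise (assumed cutoff)

kept = [name for name, sep in group_sep.items() if sep >= THRESHOLD]
dropped = [name for name in group_sep if name not in kept]
n_channels = CHANNELS_PER_GROUP * len(kept)

print(kept, dropped, n_channels)
# ['pol_diff', 'pol_ratio'] ['sobel_x', 'sobel_y'] 6
```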
8.3 Keep the encoder shallow
Only 2 conv layers (~5.5K parameters with the 3×3 kernels of Section 4.1) are used to keep the side information injection “lightweight”. The encoder’s role is to learn channel combinations, not to extract deep features.
9. Highlights
- Data-driven feature selection: a quantitative separability metric filters out Sobel channels whose discriminative power is close to noise (< 0.01), reducing polarization features from 12ch to 6ch and validating that “physics-inspired features are not necessarily useful”.
- A shallow learnable encoder complements the expressiveness of physics-based features: only 2 conv layers (~5.5K parameters with the 3×3 kernels of Section 4.1) let the model learn the unequal weighting B > G > R and the non-linear combination of pol_diff/pol_ratio.
- Combining physics and learning: the foundation of polarization features remains deterministic physics formulas (pol_diff, pol_ratio); the learnable encoder only performs a lightweight transform on top, balancing interpretability with learnability.
- Channel Analysis directly drives the architecture: the separability ranking (pol_diff_B strongest, Sobel weakest) is consistent with Fresnel physics (shorter wavelengths reflect more), and the conclusions directly determine channel selection and the motivation for the encoder design.