Blueprint · 2026

Shallow Learnable Polarization Encoder Architecture

A stereo matching architecture that adds a shallow learnable encoder after the physics-based polarization features, letting the model learn how to weight and combine channels before they are injected, via the MotionEncoder, into the GRU.

  • stereo matching
  • polarization
  • RAFT-Stereo

Using these blueprints

Everything here is an architecture proposal I designed and chose to publish openly. Free to use, adapt, or build on — no permission needed.

If one turns out useful and crediting is convenient, a link back to this site is appreciated. It's never required.

1. Design Goals

This architecture brings polarization information into a stereo matching framework and addresses the limited representational power of physics-based polarization features.

Physics-based polarization features (pol_diff, pol_ratio, etc.) are produced by fixed formulas and suffer from three limitations in representational power:

  1. Fixed channel weights: the R / G / B channels are equally weighted and cannot be adjusted by their information content.
  2. Non-linear cross-channel combinations cannot be learned: pure physics formulas contain no learnable non-linear transforms.
  3. Channel-wise differences in information cannot be exploited: the separability analysis shows that polarization-channel discriminative power follows B > G > R (see Section 2), but fixed features cannot leverage this unequal weighting.

The design goal of this architecture is to add a shallow learnable encoder after the physics-based features so that the model can learn channel weights and non-linear combinations. At the same time, based on the separability analysis, only the discriminative channels are retained: the Sobel channels are dropped, reducing the polarization features from 12ch to 6ch.


2. Design Basis: Channel Analysis

This architecture is grounded in a separability analysis of polarization channels.

2.1 Analysis Method

For scenes with clear polarization signals, the glass / non-glass separability of each channel is analyzed:

Separability = |glass_mean - nonglass_mean| / pooled_std

A higher value means the glass and non-glass distributions of that channel overlap less, so the channel is more useful for telling the two apart.
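The metric can be sketched in a few lines of plain Python. This is a minimal sketch, not the analysis code itself: the function name is hypothetical, the pixel values are made up, and `pooled_std` is assumed to be the root of the average of the two sample variances, since the pooling formula is not spelled out here.

```python
from math import sqrt
from statistics import mean, variance


def separability(glass_values, nonglass_values):
    """|glass_mean - nonglass_mean| / pooled_std for one channel.

    Assumption: pooled_std = sqrt((var_glass + var_nonglass) / 2),
    using sample variances.
    """
    pooled_std = sqrt((variance(glass_values) + variance(nonglass_values)) / 2)
    return abs(mean(glass_values) - mean(nonglass_values)) / pooled_std


# Toy per-pixel values for a single channel (illustrative only).
glass = [0.50, 0.60, 0.70]
nonglass = [0.10, 0.20, 0.30]
print(separability(glass, nonglass))  # roughly 4.0
```

In the toy data the means differ by 0.4 and both groups have a standard deviation of 0.1, so the score is about 4.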

2.2 Per-Channel Separability Ranking

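| Rank | Channel | Separability | Rating |
| --- | --- | --- | --- |
| 1 | pol_diff_B | 1.14 | Best |
| 2 | pol_diff_G | 1.03 | Good |
| 3 | pol_diff_R | 0.77 | Decent |
| 4 | pol_ratio_B | 0.39 | Average |
| 5 | pol_ratio_G | 0.35 | Average |
| 6 | pol_ratio_R | 0.33 | Average |
| 7–12 | sobel_* | < 0.01 | Useless |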

2.3 Per-Group Summary

| Group | Avg Separability | Conclusion |
| --- | --- | --- |
| pol_diff | 0.98 | Core feature; must be retained |
| pol_ratio | 0.36 | Has some value; retain |
| sobel_x | 0.007 | Useless; remove |
| sobel_y | 0.007 | Useless; remove |
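The pol_diff and pol_ratio group averages follow directly from the per-channel values in Section 2.2. A short check, using only the numbers from that table (the helper name is hypothetical):

```python
from statistics import mean

# Per-channel separability values copied from the ranking in Section 2.2.
channel_sep = {
    "pol_diff_B": 1.14, "pol_diff_G": 1.03, "pol_diff_R": 0.77,
    "pol_ratio_B": 0.39, "pol_ratio_G": 0.35, "pol_ratio_R": 0.33,
}


def group_average(prefix):
    """Mean separability over channels whose name starts with `prefix`."""
    return mean(v for k, v in channel_sep.items() if k.startswith(prefix))


for group in ("pol_diff", "pol_ratio"):
    print(group, round(group_average(group), 2))
# pol_diff 0.98
# pol_ratio 0.36
```

The Sobel groups are left out because only their upper bound (< 0.01) is listed per channel.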

2.4 Conclusions of the Analysis

  1. Sobel carries no usable signal: separability < 0.01 is indistinguishable from noise. Gradient features at glass boundaries are not enough to distinguish glass from non-glass.
  2. pol_diff is the core: separability ~1.0 (group average 0.98), the strongest glass / non-glass discriminative power of all groups.
  3. The blue channel is the strongest: B > G > R holds within both pol_diff and pol_ratio, which the analysis reads as consistent with Fresnel physics (shorter wavelengths reflect more strongly).
  4. pol_ratio has auxiliary value: separability ~0.35, weaker than pol_diff but still a contribution.

Physical interpretation:

  • pol_diff = |I∥ - I⊥|: directly measures the polarization intensity difference; Fresnel reflection on glass surfaces produces a clear difference.
  • pol_ratio = I∥ / (I∥ + I⊥ + ε): polarization ratio, related to the incidence angle; ε is a small constant that keeps the denominator away from zero.
  • Sobel: edge information is intended to find glass boundaries, but glass / non-glass edge gradients differ little.

Based on this analysis, the architecture reduces pol_features from 12ch to 6ch (removing the Sobel channels) and adds a learnable encoder to exploit the unequal information across channels, such as B > G > R.


3. Architecture

Architecture of the Shallow Learnable Pol Encoder

A 2-layer shallow conv encoder is inserted between the physics-based features (6ch: pol_diff 3ch + pol_ratio 3ch) and the MotionEncoder, transforming 6ch into 32ch.


4. Components and Modules

4.1 PolEncoder (shallow learnable encoder)

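import torch.nn as nn
import torch.nn.functional as F


class PolEncoder(nn.Module):
    """Shallow learnable encoder for pol features (6ch -> 16ch -> 32ch)."""

    def __init__(self, in_ch=6, hidden_ch=16, out_ch=32):
        super().__init__()
        # 3x3 convs with padding=1 preserve the spatial size of the input.
        self.conv1 = nn.Conv2d(in_ch, hidden_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_ch, out_ch, 3, padding=1)

    def forward(self, x):
        # x: (B, in_ch, H/8, W/8) physics-based pol features
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return x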

Two 3×3 conv layers: 6 → 16 → 32, each followed by ReLU. With biases this comes to 880 + 4,640 = 5,520 parameters (about 5.5K).
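The parameter count of the two layers can be worked out from the layer shapes alone. A small check in plain Python, where `conv2d_params` is a hypothetical helper and the 3×3 kernel size and bias terms are taken from the `nn.Conv2d` calls above:

```python
def conv2d_params(in_ch, out_ch, kernel_size, bias=True):
    """Number of learnable parameters in a standard 2-D convolution."""
    weights = in_ch * out_ch * kernel_size * kernel_size
    return weights + (out_ch if bias else 0)


conv1 = conv2d_params(6, 16, 3)     # 6*16*9 + 16 = 880
conv2 = conv2d_params(16, 32, 3)    # 16*32*9 + 32 = 4640
print(conv1, conv2, conv1 + conv2)  # 880 4640 5520
```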

4.2 Polarization Features (6ch)

Polarization features are 6ch, retaining only the two discriminative groups (removing Sobel, whose separability is < 0.01):

  • pol_diff (3ch): |I∥ - I⊥|
  • pol_ratio (3ch): I∥ / (I∥ + I⊥ + ε)
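The two formulas above can be sketched with NumPy as follows. This is an illustration under assumptions: the function name, the input names `i_par` / `i_perp`, and the (3, H, W) channel-first layout are hypothetical, and the downsampling to H/8 × W/8 is not covered.

```python
import numpy as np


def pol_features(i_par, i_perp, eps=1e-6):
    """Stack pol_diff (3ch) and pol_ratio (3ch) into a (6, H, W) array.

    i_par, i_perp: float arrays of shape (3, H, W) holding the parallel and
    perpendicular intensity images (hypothetical inputs).
    """
    pol_diff = np.abs(i_par - i_perp)            # |I_par - I_perp|
    pol_ratio = i_par / (i_par + i_perp + eps)   # eps keeps the denominator > 0
    return np.concatenate([pol_diff, pol_ratio], axis=0)


# Random non-negative intensities stand in for real polarization images.
rng = np.random.default_rng(0)
i_par = rng.random((3, 4, 5))
i_perp = rng.random((3, 4, 5))
feats = pol_features(i_par, i_perp)
print(feats.shape)  # (6, 4, 5)
```

For non-negative intensities, pol_ratio stays within [0, 1), while pol_diff scales with image brightness.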

4.3 MotionEncoder (pol branch input dimension)

Because PolEncoder expands polarization features to 32ch, the input dimension of the MotionEncoder polarization branch is 32ch.


5. Tensor Dimensions

| Item | Shape | Description |
| --- | --- | --- |
| pol_raw | (B, 6, H/8, W/8) | pol_diff(3) + pol_ratio(3) |
| PolEncoder conv1 output | (B, 16, H/8, W/8) | Conv(6→16) + ReLU |
| PolEncoder conv2 output | (B, 32, H/8, W/8) | Conv(16→32) + ReLU |
| MotionEncoder pol branch input | (B, 32, H/8, W/8) | PolEncoder output |
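These shapes can be confirmed with a quick PyTorch run. The sketch below rebuilds the same two layers as an `nn.Sequential` stand-in for PolEncoder; the batch size and the 32×40 feature-map size (a 256×320 image at 1/8 resolution) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Same layer shapes as PolEncoder, written as nn.Sequential for a quick check.
encoder = nn.Sequential(
    nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)

# Hypothetical sizes: batch of 2, feature map of 32x40 (i.e. H/8 x W/8).
pol_raw = torch.rand(2, 6, 32, 40)
with torch.no_grad():
    hidden = encoder[1](encoder[0](pol_raw))  # conv1 + ReLU
    out = encoder[3](encoder[2](hidden))      # conv2 + ReLU

print(tuple(hidden.shape))  # (2, 16, 32, 40)
print(tuple(out.shape))     # (2, 32, 32, 40)
```

Because both convolutions use 3×3 kernels with padding 1, only the channel dimension changes.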

6. Parameter Count

| Component | Parameters |
| --- | --- |
| PolEncoder | ~5.5K |
| MotionEncoder pol branch (input 32ch) | ~17K |
| Total added | ~15K |

7. Hyperparameters

| Hyperparameter | Value | Description |
| --- | --- | --- |
| PolEncoder in_ch | 6 | pol_diff(3) + pol_ratio(3) |
| PolEncoder hidden_ch | 16 | Number of channels in the intermediate layer |
| PolEncoder out_ch | 32 | Number of output channels, injected into MotionEncoder |
| ε | 1e-6 | Denominator stabilizer for pol_ratio |
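One way to keep these values in a single place is a small frozen dataclass. This is only a sketch: the class name `PolEncoderConfig` and the field name `eps` are hypothetical, and the defaults are the values from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolEncoderConfig:
    """Hyperparameters from the table above (class name is hypothetical)."""
    in_ch: int = 6        # pol_diff(3) + pol_ratio(3)
    hidden_ch: int = 16   # channels in the intermediate conv layer
    out_ch: int = 32      # channels passed to the MotionEncoder pol branch
    eps: float = 1e-6     # denominator stabilizer for pol_ratio


cfg = PolEncoderConfig()
print(cfg.in_ch, cfg.hidden_ch, cfg.out_ch, cfg.eps)
```

Freezing the dataclass prevents the channel counts from being changed after the model has been built from them.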

8. Design Decisions and Rationale

8.1 Adding a learnable encoder

Physics-based features have equally weighted channels and cannot learn non-linear combinations. A shallow encoder lets the model:

  1. Learn the B > G > R weighting.
  2. Learn an optimal combination of pol_diff and pol_ratio.
  3. Potentially discover more useful features through non-linear transforms.

8.2 Removing Sobel (12ch → 6ch)

The Channel Analysis shows Sobel separability < 0.01, indistinguishable from noise. Removing these channels, half of the original 12, makes the model cleaner. This is data-driven feature selection: not every physics-inspired feature is useful, and each must be validated with data.

8.3 Keep the encoder shallow

Only 2 conv layers (~5.5K parameters) are used to keep the side-information injection lightweight. The encoder’s role is to learn channel combinations, not to extract deep features.


9. Highlights

  • Data-driven feature selection: a quantitative separability metric filters out Sobel channels whose discriminative power is close to noise (< 0.01), reducing polarization features from 12ch to 6ch and validating that “physics-inspired features are not necessarily useful”.
  • A shallow learnable encoder complements the expressiveness of physics-based features: only 2 conv layers (~5.5K parameters) let the model learn the unequal weighting B > G > R and non-linear combinations of pol_diff and pol_ratio.
  • Combining physics and learning: the foundation of polarization features remains deterministic physics formulas (pol_diff, pol_ratio); the learnable encoder only performs a lightweight transform on top, balancing interpretability with learnability.
  • Channel Analysis directly drives the architecture: the separability ranking (pol_diff_B strongest, Sobel weakest) is consistent with Fresnel physics (shorter wavelengths reflect more), and the conclusions directly determine channel selection and the motivation for the encoder design.
