Blueprint · 2026

Polarization Attention + Gated Fusion Architecture

Model class: `StereoPolVolumeV2`
Document type: Architecture design specification (design only, no experimental results)

  • stereo matching
  • polarization
  • RAFT-Stereo

Using these blueprints

Everything here is an architecture proposal I designed and chose to publish openly. Free to use, adapt, or build on — no permission needed.

If one turns out useful and crediting is convenient, a link back to this site is appreciated. It's never required.

1. Design Goals

In polarization-aided stereo matching, a common baseline module handles the polarization signal in a simple way:

pol_diff = left - right  (simple subtraction)
pol_corr = query(pol_volume, disp)
output = concat(corr, pol_corr, disp) → encoder → GRU

This approach has the following issues:

  1. No learnable polarization feature extraction.
  2. No spatial attention mechanism (the model does not know where the glass is).
  3. Simple concat fusion, with no learned weighting between stereo and pol.

The design goal of this architecture is to address these three issues by introducing a spatial attention mechanism and learnable fusion: the model can mark glass locations and dynamically decide whether to trust stereo or pol information.

The core consists of two modules:

  • Polarization Attention: Generates a spatial attention map from pol_corr to mark glass locations.
  • Gated Fusion: Learns dynamic fusion weights between stereo and pol features.

Two optional extension modules can also be added:

  • Learnable Pol Encoder (optional): Uses a Conv encoder to learn richer polarization features rather than relying solely on raw subtraction.
  • Glass-aware Auxiliary Head (optional): Predicts a glass mask as an auxiliary task, useful for visualization and interpretability.

2. Architecture

Figure: Overall architecture of Polarization Attention + Gated Fusion

Data Flow

  1. The polarization features (pol_diff / pol_corr) enter PolarizationAttention, which produces pol_attention_map via 1×1 Conv + sigmoid (intended to mark glass locations).
  2. GatedFusion receives three inputs: corr, pol_corr, and pol_attention_map.
  3. The three are concatenated along the channel dimension and passed through gate_net, whose final sigmoid yields a single-channel gate in [0,1].
  4. corr and pol_corr are each projected by a 1×1 convolution to out_dim, giving corr_feat / pol_feat.
  5. fused = gate * corr_feat + (1 - gate) * pol_feat performs the weighted fusion.
  6. concat(fused, disp) is fed into the GRU.
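
The six steps above can be traced end to end with stand-in tensors. The following is a minimal PyTorch sketch under stated assumptions, not the blueprint's implementation: the sizes corr_dim = pol_dim = 36 and out_dim = 64 are picked only for illustration, all layers are randomly initialized, and step 1 is fed pol_corr to match the forward signature in Section 3.1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only; the blueprint does not fix these values.
B, H, W = 2, 8, 16
corr_dim, pol_dim, out_dim = 36, 36, 64

# Random stand-ins for the tensors produced by the correlation lookups.
corr = torch.randn(B, corr_dim, H, W)
pol_corr = torch.randn(B, pol_dim, H, W)
disp = torch.randn(B, 1, H, W)

# Same layer layout as PolarizationAttention and GatedFusion, untrained.
attn_net = nn.Sequential(
    nn.Conv2d(pol_dim, pol_dim // 4, 1), nn.ReLU(),
    nn.Conv2d(pol_dim // 4, 1, 1), nn.Sigmoid(),
)
gate_net = nn.Sequential(
    nn.Conv2d(corr_dim + pol_dim + 1, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1), nn.Sigmoid(),
)
corr_enhance = nn.Conv2d(corr_dim, out_dim, 1)
pol_enhance = nn.Conv2d(pol_dim, out_dim, 1)

with torch.no_grad():
    pol_attention_map = attn_net(pol_corr)                            # step 1
    gate_in = torch.cat([corr, pol_corr, pol_attention_map], dim=1)   # step 2
    gate = gate_net(gate_in)                                          # step 3
    corr_feat = corr_enhance(corr)                                    # step 4
    pol_feat = pol_enhance(pol_corr)
    fused = gate * corr_feat + (1 - gate) * pol_feat                  # step 5
    gru_input = torch.cat([fused, disp], dim=1)                       # step 6

print(gate_in.shape, gate.shape, gru_input.shape)
```

With these assumed sizes the gate input has 36 + 36 + 1 = 73 channels and the GRU input has out_dim + 1 = 65 channels.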

3. Components and Modules

3.1 PolarizationAttention

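import torch.nn as nn


class PolarizationAttention(nn.Module):
    """Generate a spatial attention map from polarization features (pol_diff / pol_corr)"""
    def __init__(self, in_channels, reduction=4):
        super().__init__()
        # Guard against a zero-width hidden layer when in_channels < reduction.
        hidden = max(1, in_channels // reduction)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 1),  # 1×1 channel compression
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),            # collapse to a single channel
            nn.Sigmoid()                        # squash to [0, 1]
        )

    def forward(self, pol_corr):
        # pol_corr: (B, in_channels, H, W)
        return self.conv(pol_corr)  # (B, 1, H, W)

  • Two 1×1 convolutions; the first compresses the channels by a factor of reduction (default 4).
  • The final Sigmoid produces a single-channel spatial attention map in [0,1].
  • Output shape: (B, 1, H, W).
  • Intuition: Regions with high polarization difference (glass) should receive higher weights, so the model knows “where to trust the polarization information”. The map is not multiplied onto the features; it enters GatedFusion as an extra input channel.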
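
The output shape, the [0,1] range, and the small size of this module can be checked by instantiating it on random input. The sketch below repeats the class so that it runs on its own; in_channels = 36 is an assumed value for illustration, not one the blueprint specifies.

```python
import torch
import torch.nn as nn


class PolarizationAttention(nn.Module):
    """Same layers as the module above, repeated so this snippet is self-contained."""
    def __init__(self, in_channels, reduction=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, pol_corr):
        return self.conv(pol_corr)


torch.manual_seed(0)
attn = PolarizationAttention(in_channels=36)   # 36 is an assumed channel count
pol_corr = torch.randn(2, 36, 8, 16)           # random stand-in for real features

with torch.no_grad():
    attn_map = attn(pol_corr)

n_params = sum(p.numel() for p in attn.parameters())
print(attn_map.shape)   # torch.Size([2, 1, 8, 16])
print(n_params)         # 343 = (36*9 + 9) + (9*1 + 1)
```

An untrained module carries no glass information; the map only becomes a glass indicator if training pushes it in that direction.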

3.2 GatedFusion

import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Learn the fusion weight between stereo and pol"""
    def __init__(self, corr_dim, pol_dim, out_dim):
        super().__init__()
        # Gate input: corr + pol_corr + the single-channel attention map.
        self.gate_net = nn.Sequential(
            nn.Conv2d(corr_dim + pol_dim + 1, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
            nn.Sigmoid()
        )
        # 1×1 projections bring both streams to out_dim so they can be summed.
        self.corr_enhance = nn.Conv2d(corr_dim, out_dim, 1)
        self.pol_enhance = nn.Conv2d(pol_dim, out_dim, 1)

    def forward(self, corr, pol_corr, pol_attn):
        gate = self.gate_net(torch.cat([corr, pol_corr, pol_attn], dim=1))  # (B, 1, H, W)
        corr_feat = self.corr_enhance(corr)    # (B, out_dim, H, W)
        pol_feat = self.pol_enhance(pol_corr)  # (B, out_dim, H, W)
        # The single-channel gate broadcasts across the out_dim channels.
        fused = gate * corr_feat + (1 - gate) * pol_feat
        return fused, gate

  • gate_net: Input channels are corr_dim + pol_dim + 1 (+1 for the single-channel pol_attn); the sequence is 3×3 Conv → ReLU → 1×1 Conv → Sigmoid, producing a single-channel gate.
  • corr_enhance / pol_enhance: Each is a 1×1 convolution projecting corr / pol_corr to out_dim.
  • forward returns fused and gate (the gate can be visualized).
  • Gate semantics: gate = 1 → trust stereo, gate = 0 → trust pol. The design intent is that, in glass regions, the learned gate lowers the stereo weight and raises the pol weight.
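
The gate semantics can be verified on stand-in tensors without any trained weights. This small sketch uses random tensors in place of the projected features, and the helper name fuse is hypothetical; it shows the two extremes and one intermediate value, with the (B, 1, H, W) gate broadcasting over the feature channels.

```python
import torch

torch.manual_seed(0)
B, out_dim, H, W = 1, 4, 2, 3
corr_feat = torch.randn(B, out_dim, H, W)   # stand-in for corr_enhance(corr)
pol_feat = torch.randn(B, out_dim, H, W)    # stand-in for pol_enhance(pol_corr)


def fuse(gate, corr_feat, pol_feat):
    """Convex combination used by GatedFusion; gate broadcasts over channels."""
    return gate * corr_feat + (1 - gate) * pol_feat


all_stereo = fuse(torch.ones(B, 1, H, W), corr_feat, pol_feat)
all_pol = fuse(torch.zeros(B, 1, H, W), corr_feat, pol_feat)
mostly_pol = fuse(torch.full((B, 1, H, W), 0.25), corr_feat, pol_feat)

print(torch.allclose(all_stereo, corr_feat))                            # True
print(torch.allclose(all_pol, pol_feat))                                # True
print(torch.allclose(mostly_pol, 0.25 * corr_feat + 0.75 * pol_feat))   # True
```

Because the gate has a single channel, every feature channel at a given pixel shares the same stereo/pol mixing weight.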

3.3 UpdateBlockV2

UpdateBlockV2 builds on the standard RAFT UpdateBlock and integrates PolarizationAttention and GatedFusion. The fused features are concatenated with disp to form the GRU input.
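
One way such a block could be wired is sketched below. This is an illustration under several assumptions rather than the blueprint's UpdateBlockV2: the ConvGRU cell, the disp_head, the class name UpdateBlockV2Sketch, and the sizes fused_dim = 64 and hidden_dim = 128 are all hypothetical, and the context features that a RAFT-style update block normally feeds to the GRU are omitted for brevity.

```python
import torch
import torch.nn as nn


class PolarizationAttention(nn.Module):
    def __init__(self, in_channels, reduction=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, pol_corr):
        return self.conv(pol_corr)


class GatedFusion(nn.Module):
    def __init__(self, corr_dim, pol_dim, out_dim):
        super().__init__()
        self.gate_net = nn.Sequential(
            nn.Conv2d(corr_dim + pol_dim + 1, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
            nn.Sigmoid(),
        )
        self.corr_enhance = nn.Conv2d(corr_dim, out_dim, 1)
        self.pol_enhance = nn.Conv2d(pol_dim, out_dim, 1)

    def forward(self, corr, pol_corr, pol_attn):
        gate = self.gate_net(torch.cat([corr, pol_corr, pol_attn], dim=1))
        fused = gate * self.corr_enhance(corr) + (1 - gate) * self.pol_enhance(pol_corr)
        return fused, gate


class ConvGRU(nn.Module):
    """Convolutional GRU cell in the RAFT style (hypothetical stand-in)."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                          # update gate
        r = torch.sigmoid(self.convr(hx))                          # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))   # candidate state
        return (1 - z) * h + z * q


class UpdateBlockV2Sketch(nn.Module):
    """Attention -> gated fusion -> concat with disp -> GRU -> disparity update."""
    def __init__(self, corr_dim, pol_dim, fused_dim=64, hidden_dim=128):
        super().__init__()
        self.attn = PolarizationAttention(pol_dim)
        self.fusion = GatedFusion(corr_dim, pol_dim, fused_dim)
        self.gru = ConvGRU(hidden_dim, fused_dim + 1)               # +1 for disp
        self.disp_head = nn.Conv2d(hidden_dim, 1, 3, padding=1)     # assumed head

    def forward(self, h, corr, pol_corr, disp):
        pol_attn = self.attn(pol_corr)
        fused, gate = self.fusion(corr, pol_corr, pol_attn)
        h = self.gru(h, torch.cat([fused, disp], dim=1))
        return h, self.disp_head(h), gate


torch.manual_seed(0)
block = UpdateBlockV2Sketch(corr_dim=36, pol_dim=36)
h0 = torch.zeros(1, 128, 8, 16)
corr = torch.randn(1, 36, 8, 16)
pol_corr = torch.randn(1, 36, 8, 16)
disp = torch.zeros(1, 1, 8, 16)
with torch.no_grad():
    h, delta_disp, gate = block(h0, corr, pol_corr, disp)
print(h.shape, delta_disp.shape, gate.shape)
```

Returning the gate alongside the hidden state keeps it available for visualization at every GRU iteration.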

3.4 Learnable Pol Encoder (Optional)

Figure: Data flow of the Learnable Pol Encoder

Uses a Conv encoder to learn a richer polarization feature representation rather than relying solely on raw subtraction.
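
The blueprint does not specify the encoder's layers, so the following is only one possible reading, sketched under explicit assumptions: a small two-layer Conv stack with weights shared between views, a two-channel polarization input per view, and an output that keeps the raw difference next to the learned one. The class name, channel counts, and layer layout are all hypothetical.

```python
import torch
import torch.nn as nn


class LearnablePolEncoderSketch(nn.Module):
    """Hypothetical encoder: learned features alongside the raw left-right difference."""
    def __init__(self, pol_channels=2, feat_dim=32):
        super().__init__()
        # Shared weights: the same stack encodes the left and the right view.
        self.net = nn.Sequential(
            nn.Conv2d(pol_channels, feat_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, pol_left, pol_right):
        raw_diff = pol_left - pol_right                          # baseline signal
        learned_diff = self.net(pol_left) - self.net(pol_right)  # learned signal
        # Keep both, so the model does not rely solely on raw subtraction.
        return torch.cat([raw_diff, learned_diff], dim=1)


torch.manual_seed(0)
encoder = LearnablePolEncoderSketch()
pol_left = torch.randn(2, 2, 8, 16)    # assumed 2-channel polarization input
pol_right = torch.randn(2, 2, 8, 16)
with torch.no_grad():
    pol_feat = encoder(pol_left, pol_right)
print(pol_feat.shape)   # torch.Size([2, 34, 8, 16])
```

With identical left and right inputs both differences vanish, which is a convenient sanity check for the shared-weight layout.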

3.5 Glass-aware Auxiliary Head (Optional)

Figure: Data flow of the Glass-aware Auxiliary Head

Predicts glass locations as an auxiliary task, useful for visualization and interpretability.
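
A minimal sketch of how such an auxiliary head could be attached is given below. The head layout, the choice of a 128-channel feature map as its input, the binary cross-entropy loss, and the weight aux_weight = 0.1 are assumptions made for illustration; the blueprint states only that a glass mask is predicted as an auxiliary task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlassHeadSketch(nn.Module):
    """Hypothetical auxiliary head that predicts per-pixel glass logits."""
    def __init__(self, in_dim=128, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_dim, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 1, 1),   # one logit per pixel
        )

    def forward(self, feat):
        return self.net(feat)          # (B, 1, H, W) logits, no sigmoid here


torch.manual_seed(0)
head = GlassHeadSketch()
feat = torch.randn(2, 128, 8, 16)                       # stand-in feature map
glass_gt = (torch.rand(2, 1, 8, 16) > 0.5).float()      # stand-in binary mask

logits = head(feat)
aux_loss = F.binary_cross_entropy_with_logits(logits, glass_gt)

main_loss = torch.tensor(0.0)     # placeholder for the disparity loss
aux_weight = 0.1                  # assumed weighting, not from the blueprint
total_loss = main_loss + aux_weight * aux_loss
total_loss.backward()

glass_prob = torch.sigmoid(logits.detach())   # probability map for visualization
print(logits.shape, float(aux_loss) >= 0.0)
```

Working with logits and applying the sigmoid only for visualization keeps the loss numerically stable.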


4. Tensor Dimensions

| Tensor | Shape | Description |
|---|---|---|
| pol_diff / pol_corr (input) | (B, in_channels, H, W) | Input to PolarizationAttention |
| pol_attention_map | (B, 1, H, W) | Spatial attention map |
| gate_net input | (B, corr_dim + pol_dim + 1, H, W) | Concat of corr / pol_corr / attn |
| gate | (B, 1, H, W) | Fusion weight |
| corr_feat / pol_feat | (B, out_dim, H, W) | Enhanced features |
| fused | (B, out_dim, H, W) | Fused output |

5. Hyperparameters

| Hyperparameter | Value | Description |
|---|---|---|
| pol_levels | 4 | Number of pyramid levels in the polarization volume |
| pol_radius | 4 | Lookup radius of the polarization volume |
| iters | 24 | Number of GRU iterations |
| reduction | 4 | Channel compression ratio in PolarizationAttention |
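
These values can be collected in a small config object, from which the channel widths follow. The sketch below assumes the polarization volume uses a 1D lookup like the correlation volume in RAFT-Stereo, with 2 * radius + 1 samples per pyramid level; that assumption, the class name PolConfig, and the derived properties are not part of the blueprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolConfig:
    """Hyperparameters from the table above; the class name is hypothetical."""
    pol_levels: int = 4
    pol_radius: int = 4
    iters: int = 24
    reduction: int = 4

    @property
    def pol_dim(self) -> int:
        # Assumption: 1D lookup with (2r + 1) samples per pyramid level.
        return self.pol_levels * (2 * self.pol_radius + 1)

    @property
    def attn_hidden(self) -> int:
        # Hidden width of PolarizationAttention when fed pol_dim channels.
        return self.pol_dim // self.reduction


cfg = PolConfig()
print(cfg.pol_dim)       # 36
print(cfg.attn_hidden)   # 9
```

Under this assumption pol_corr would carry 36 channels, and PolarizationAttention would compress them to 9 before producing the single-channel map.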

6. Design Decisions and Rationale

| Decision | Rationale |
|---|---|
| Introduce PolarizationAttention | Produces a spatial attention map so the model knows “where the glass is” |
| Use 1×1 conv with reduction compression | Keeps the attention module lightweight |
| Introduce GatedFusion | Makes stereo / pol fusion weights learnable, replacing simple concat |
| Gate takes corr, pol, and attn together | Fusion decisions consider all three sources, dynamically choosing which to trust |
| corr_enhance / pol_enhance project to the same out_dim | Both streams must share the same dimension to be combined by a weighted sum |
| Learnable Encoder and Auxiliary Head are optional | Core is attention + gated fusion; the encoder is a capacity extension, the head is for interpretability |

7. Highlights

  • Uses PolarizationAttention to derive a spatial attention map from the polarization difference, explicitly marking glass locations so the model “knows where to trust polarization information”.
  • GatedFusion learns a [0,1] gate that dynamically weights the stereo and pol streams, replacing static concat fusion.
  • The gate input covers corr, pol_corr, and the attention map together, so each fusion decision sees all three signals.
  • The attention module stays lightweight via 1×1 convolutions and channel reduction; the gate output can be directly visualized, providing interpretability.
  • Provides two optional modules, Learnable Pol Encoder and Glass-aware Auxiliary Head, which add polarization feature capacity and auxiliary supervision on top of the core architecture.
