1. Design Goals
This architecture addresses the detection of transparent glass in stereo matching.
The polarization signal (the two cameras capture I∥ and I⊥ respectively) has two possible uses, but their applicability differs:
- Polarization is unsuitable for matching: correlation measures texture similarity. Polarization causes I∥ ≠ I⊥ on glass surfaces, and glass itself lacks texture, so the Pol correlation has no clear peak and cannot support disparity search.
- Polarization is suitable for segmentation: |I∥ - I⊥| is separable between glass and non-glass, so it is a classification signal rather than a matching signal.
The design goal of this architecture is to use the right signal for the right task. Polarization is not asked to drive correlation matching; instead, it tells the network “where the glass is”, and the GRU learns a region-specific update strategy in glass regions (for example, being more conservative or using a different update magnitude).
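The matching-versus-segmentation argument above can be illustrated with a toy 1-D example. This is only a sketch under stated assumptions: the helper `correlation_curve`, the synthetic rows, and the intensity values 0.5 and 0.25 are hypothetical and are not taken from the architecture itself. It shows that a textured row produces a correlation peak at the true disparity, while a textureless glass row with I∥ ≠ I⊥ produces a flat curve, even though its absolute difference is clearly nonzero.

```python
import torch


def correlation_curve(left_row, right_row, max_disp):
    """Mean dot-product score for each candidate disparity d in [0, max_disp).

    Left pixel x is compared with right pixel x - d.
    """
    width = left_row.shape[0]
    scores = []
    for d in range(max_disp):
        scores.append((left_row[d:] * right_row[: width - d]).mean())
    return torch.stack(scores)


torch.manual_seed(0)
width, true_disp, max_disp = 256, 5, 16

# Textured surface: the right row is the left row shifted by the true disparity.
textured_left = torch.randn(width)
textured_right = torch.roll(textured_left, shifts=-true_disp)

# Toy glass surface (assumed values): no texture, and the two polarization
# channels differ in intensity.
glass_left = torch.full((width,), 0.5)    # stands in for I_parallel
glass_right = torch.full((width,), 0.25)  # stands in for I_perpendicular

textured_scores = correlation_curve(textured_left, textured_right, max_disp)
glass_scores = correlation_curve(glass_left, glass_right, max_disp)

# Matching view: a clear peak on texture, a flat curve on glass.
best_textured_disp = int(textured_scores.argmax())
glass_score_range = float(glass_scores.max() - glass_scores.min())

# Segmentation view: the absolute difference on glass is clearly nonzero.
glass_pol_diff = float((glass_left - glass_right).abs().mean())

print(best_textured_disp, glass_score_range, glass_pol_diff)
```

Real polarization images are noisier than this toy case, so the flat curve should be read as the limiting behavior rather than an exact prediction.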
2. Architecture
2.1 Glass-Aware Concept
2.2 Full Data Flow
3. Components and Modules
3.1 GlassSegmentationBranch (encoder-decoder)
A lightweight glass segmentation branch that takes the polarization difference |I∥ - I⊥| as input and outputs a glass probability map in [0,1]. It uses an encoder-decoder structure.
import torch
import torch.nn as nn


class GlassSegmentationBranch(nn.Module):
    """
    Lightweight glass segmentation branch
    Input: left (I∥) and right (I⊥) images; forward() forms |I∥ - I⊥| (polarization difference)
    Output: Glass probability map [0, 1]
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            # 3 -> 32
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # H/2
            # 32 -> 64
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # H/4
            # 64 -> 128 (input channels must match the 64 produced above)
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear'),  # H/2
            nn.Conv2d(128, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear'),  # H
            nn.Conv2d(64, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, left, right):
        # Polarization difference, shape (B, 3, H, W)
        pol_diff = torch.abs(left - right)
        # Encode to (B, 128, H/4, W/4)
        features = self.encoder(pol_diff)
        # Decode to probability map, shape (B, 1, H, W)
        glass_prob = self.decoder(features)
        return glass_prob
Structure:
- Encoder: 3 → 32 → 64 → 128, two MaxPool downsamplings to H/4, with BatchNorm.
- Decoder: two bilinear Upsamples, 128 → 64 → 32 → 1, ending with Sigmoid to produce the probability map.
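The structure above returns a map at the input resolution only when H and W are divisible by 4, because each MaxPool floors odd sizes while each Upsample exactly doubles. The helpers below are a hypothetical sketch (the names `branch_output_size` and `padding_to_multiple` are not part of the architecture) that traces the spatial size and reports how much padding would restore the match.

```python
def branch_output_size(height, width, num_pools=2):
    """Trace the spatial size through the encoder pools and decoder upsamples."""
    for _ in range(num_pools):
        height, width = height // 2, width // 2  # nn.MaxPool2d(2) floors odd sizes
    for _ in range(num_pools):
        height, width = height * 2, width * 2    # nn.Upsample(scale_factor=2)
    return height, width


def padding_to_multiple(height, width, multiple=4):
    """Rows and columns to add so both sizes are divisible by `multiple`."""
    return (-height) % multiple, (-width) % multiple


print(branch_output_size(480, 640))   # (480, 640): size is preserved
print(branch_output_size(30, 45))     # (28, 44): output is smaller than the input
print(padding_to_multiple(30, 45))    # (2, 3): pad to 32 x 48 before the branch
```

Padding the inputs (or resizing `glass_prob` back to the input size) is one possible remedy; which one suits the pipeline depends on how the disparity network itself pads its inputs.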
3.2 Three Options for Context Fusion
There are three candidate ways to inject the glass probability into the Context Network.
Option A: Concatenation
context = self.cnet(left) # [B, 128, H/4, W/4]
glass_prob_down = F.interpolate(glass_prob, scale_factor=0.25)
context = torch.cat([context, glass_prob_down], dim=1) # [B, 129, H/4, W/4]
Option B: Attention
context = self.cnet(left) # [B, 128, H/4, W/4]
glass_attention = 1.0 + 0.5 * glass_prob_down # [1.0, 1.5] range
context = context * glass_attention # Amplify features in glass regions
Option C: Separate heads (recommended)
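# GRU receives both context and glass_prob separately
# Let GRU learn how to use glass information
# Note: this split requires a 256-channel context (128 for net, 128 for inp),
# unlike the 128-channel context shown in Options A and B
net, inp = torch.split(context, [128, 128], dim=1)
inp = torch.cat([inp, glass_prob_down], dim=1) # Add glass info to inp -> 129 channels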
Option C is recommended: inject the glass probability into the GRU’s inp branch and let the GRU learn how to use the glass information on its own.
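The three options can be compared side by side on random tensors. This is a minimal sketch, assuming random stand-ins for the Context Network output; the function names (`fuse_concat`, `fuse_attention`, `fuse_separate_heads`) and the `gain` and `hidden_dim` parameters are hypothetical labels introduced here for illustration.

```python
import torch
import torch.nn.functional as F


def downsample_glass_prob(glass_prob):
    """Resize the full-resolution probability map to the H/4 context resolution."""
    return F.interpolate(glass_prob, scale_factor=0.25)


def fuse_concat(context, glass_prob_down):
    """Option A: append the probability map as one extra context channel."""
    return torch.cat([context, glass_prob_down], dim=1)


def fuse_attention(context, glass_prob_down, gain=0.5):
    """Option B: scale features by a factor in [1.0, 1.0 + gain]."""
    return context * (1.0 + gain * glass_prob_down)


def fuse_separate_heads(context, glass_prob_down, hidden_dim=128):
    """Option C: split context into net/inp and append the map to inp only."""
    net, inp = torch.split(
        context, [hidden_dim, context.shape[1] - hidden_dim], dim=1
    )
    inp = torch.cat([inp, glass_prob_down], dim=1)
    return net, inp


batch, height, width = 2, 32, 48
glass_prob = torch.rand(batch, 1, height, width)
glass_prob_down = downsample_glass_prob(glass_prob)

# Random stand-ins for the Context Network output (assumed channel counts).
context_128 = torch.randn(batch, 128, height // 4, width // 4)
context_256 = torch.randn(batch, 256, height // 4, width // 4)

option_a = fuse_concat(context_128, glass_prob_down)
option_b = fuse_attention(context_128, glass_prob_down)
net, inp = fuse_separate_heads(context_256, glass_prob_down)

print(option_a.shape, option_b.shape, net.shape, inp.shape)
```

With Option C, whatever layer consumes `inp` downstream has to accept one more input channel than before (129 instead of 128 in this sketch).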
3.3 Loss Function Design
def compute_loss(pred_disps, gt_disp, glass_prob, glass_mask):
    # 1. Disparity loss (standard)
    # For the glass weighting in step 3 to act per pixel, sequence_loss must return
    # an unreduced per-pixel map; a scalar would only be rescaled globally.
    disp_loss = sequence_loss(pred_disps, gt_disp)
    # 2. Glass segmentation loss
    # glass_mask must be a float tensor in [0, 1] with the same shape as glass_prob
    seg_loss = F.binary_cross_entropy(glass_prob, glass_mask)
    # 3. Glass-weighted disparity loss (optional)
    # Encourage the network to focus more on disparity accuracy in glass regions
    glass_weight = 1.0 + glass_mask  # [1.0, 2.0]
    weighted_disp_loss = (glass_weight * disp_loss).mean()
    # Total
    total_loss = weighted_disp_loss + 0.1 * seg_loss
    return total_loss
Three loss terms:
- Disparity loss: standard sequence loss.
- Glass segmentation loss: binary cross entropy between glass_prob and glass_mask.
- Glass-weighted disparity loss (optional): weighted in glass regions (weight in [1.0, 2.0]), encouraging the network to focus on disparity accuracy in glass regions.
Total loss: weighted_disp_loss + 0.1 * seg_loss.
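The glass weighting only acts per pixel if the disparity error is still a per-pixel map when the weight is applied. The sketch below shows one way to arrange that. It is not the document's `sequence_loss`: the function names, the L1 error, and the exponential iteration weight `gamma=0.9` are assumptions made for illustration.

```python
import math

import torch
import torch.nn.functional as F


def glass_weighted_sequence_loss(pred_disps, gt_disp, glass_mask, gamma=0.9):
    """L1 loss over a list of predictions, weighted per pixel and per iteration.

    pred_disps: list of (B, 1, H, W) tensors, one per GRU iteration.
    gt_disp, glass_mask: (B, 1, H, W) tensors; glass_mask holds 0/1 floats.
    gamma: assumed decay so that later iterations receive larger weights.
    """
    num_iters = len(pred_disps)
    pixel_weight = 1.0 + glass_mask  # in [1.0, 2.0]
    total = gt_disp.new_zeros(())
    for i, pred in enumerate(pred_disps):
        iter_weight = gamma ** (num_iters - 1 - i)
        per_pixel_error = (pred - gt_disp).abs()
        total = total + iter_weight * (pixel_weight * per_pixel_error).mean()
    return total


def compute_loss_sketch(pred_disps, gt_disp, glass_prob, glass_mask, seg_weight=0.1):
    """Weighted disparity loss plus seg_weight times the segmentation BCE."""
    disp_term = glass_weighted_sequence_loss(pred_disps, gt_disp, glass_mask)
    seg_term = F.binary_cross_entropy(glass_prob, glass_mask)
    return disp_term + seg_weight * seg_term


# Tiny synthetic example: half of the pixels are glass.
gt_disp = torch.zeros(1, 1, 4, 4)
glass_mask = torch.zeros(1, 1, 4, 4)
glass_mask[..., :2] = 1.0
pred_disps = [gt_disp + 2.0, gt_disp + 1.0]   # error shrinks over iterations
glass_prob = torch.full((1, 1, 4, 4), 0.5)    # uninformative segmentation output

disp_only = glass_weighted_sequence_loss(pred_disps, gt_disp, glass_mask)
total = compute_loss_sketch(pred_disps, gt_disp, glass_prob, glass_mask)
print(float(disp_only), float(total))
```

A real training loop would usually also mask out pixels without valid ground truth; that step is omitted here to keep the weighting logic visible.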
4. Tensor Dimensions
| Item | Shape | Description |
|---|---|---|
| pol_diff | (B, 3, H, W) | Absolute polarization difference, abs(I∥ - I⊥) |
| Glass encoder output | (B, 128, H/4, W/4) | Two MaxPool downsamplings |
| glass_prob | (B, 1, H, W) | Sigmoid probability map |
| glass_prob_down | (B, 1, H/4, W/4) | interpolate scale 0.25 |
| context (CNet output) | (B, 128, H/4, W/4) | RGB CNet |
| context (Option A after fusion) | (B, 129, H/4, W/4) | concat glass_prob_down |
| cost_rgb | (B, 36, …) | RGB correlation |
5. Hyperparameters
| Hyperparameter | Value | Description |
|---|---|---|
| Glass encoder channel sequence | 3→32→64→128 | Encoder channel counts per layer |
| Number of MaxPools | 2 | Downsample to H/4 |
| seg_loss weight | 0.1 | Weight of segmentation loss in total loss |
| Glass weight range | [1.0, 2.0] | Weight range for glass-weighted disparity loss |
| Attention range | [1.0, 1.5] | Feature amplification range for Option B |
6. Training Strategy
Option 1: End-to-end (recommended)
Train the whole network jointly so that the segmentation branch and disparity are learned together.
Option 2: Two-stage
- Stage 1: train the glass segmentation branch first (freeze the backbone).
- Stage 2: unfreeze the backbone and fine-tune the whole network.
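The two-stage schedule can be expressed by toggling `requires_grad` on parameter groups. The sketch below is illustrative only: `TinyStereoNet`, its attribute names `backbone` and `glass_branch`, and the helper `configure_stage` are hypothetical stand-ins for the real network.

```python
import torch.nn as nn


class TinyStereoNet(nn.Module):
    """Hypothetical stand-in with a backbone and a glass segmentation branch."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, 3, padding=1)
        self.glass_branch = nn.Conv2d(3, 1, 3, padding=1)


def set_trainable(module, trainable):
    """Enable or disable gradient updates for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def configure_stage(model, stage):
    """Return the parameters an optimizer should update in the given stage."""
    if stage == 1:
        # Stage 1: train only the glass segmentation branch.
        set_trainable(model.backbone, False)
        set_trainable(model.glass_branch, True)
    elif stage == 2:
        # Stage 2: unfreeze everything and fine-tune jointly.
        set_trainable(model, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
    return [p for p in model.parameters() if p.requires_grad]


model = TinyStereoNet()
stage1_params = configure_stage(model, 1)
stage2_params = configure_stage(model, 2)
print(len(stage1_params), len(stage2_params))
```

If the frozen backbone contains BatchNorm layers, it may also need to be put in `eval()` mode during Stage 1, since `requires_grad = False` does not stop the running statistics from updating.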
7. Design Decisions and Rationale
7.1 Use the right signal for the right task
The polarization signal is suitable for segmentation (|I∥ - I⊥| is separable between glass and non-glass) but not for matching (correlation has no peak). This architecture therefore does not let polarization enter the cost volume; it uses polarization for segmentation instead.
7.2 A lightweight branch is sufficient
Glass segmentation is a relatively easy binary classification task; it does not require a complex architecture. A 3–4 layer CNN encoder-decoder is enough.
7.3 Let the network learn the fusion
Inject the glass probability into the context (recommended: into the GRU’s inp branch) and let the GRU learn how to use the glass information, rather than using fixed rules to decide how glass regions should be handled.
7.4 Glass-aware behavior in the GRU
With the glass probability available, the GRU knows which regions are glass and can learn region-specific update strategies (for example, being more conservative in glass regions or using a different update magnitude) rather than treating all pixels uniformly.
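One way to picture this is with a RAFT-style convolutional GRU cell. The source does not specify the GRU's internals, so the class `ConvGRUCell`, its 3x3 kernels, and the channel counts below are assumptions. The point of the sketch is that the update gate `z` is computed per pixel from the input, so once the glass probability is part of `inp`, the cell has the means to learn smaller updates where that probability is high.

```python
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (assumed form, not the document's exact GRU)."""

    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        in_channels = hidden_dim + input_dim
        self.convz = nn.Conv2d(in_channels, hidden_dim, 3, padding=1)  # update gate
        self.convr = nn.Conv2d(in_channels, hidden_dim, 3, padding=1)  # reset gate
        self.convq = nn.Conv2d(in_channels, hidden_dim, 3, padding=1)  # candidate

    def forward(self, hidden, inp):
        hx = torch.cat([hidden, inp], dim=1)
        z = torch.sigmoid(self.convz(hx))  # per-pixel update magnitude in (0, 1)
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * hidden, inp], dim=1)))
        # Small z keeps the old state (a conservative update); large z adopts q.
        return (1.0 - z) * hidden + z * q


hidden_dim, inp_dim = 128, 128 + 1  # inp carries one extra glass-probability channel
cell = ConvGRUCell(hidden_dim, inp_dim)

hidden = torch.tanh(torch.randn(1, hidden_dim, 8, 12))
inp = torch.cat(
    [torch.randn(1, 128, 8, 12), torch.rand(1, 1, 8, 12)],  # features + glass prob
    dim=1,
)
new_hidden = cell(hidden, inp)
print(new_hidden.shape)
```

In a full stereo network the GRU input would also include correlation or motion features; they are left out here so the glass channel's path into the gates stays easy to follow.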
8. Highlights
- Signal-task alignment: explicitly recognizes that polarization is “suitable for segmentation, unsuitable for matching” and uses it for context-level glass detection rather than in the cost volume, so correlation is never asked to carry a task it cannot support.
- Lightweight segmentation branch: a 3–4 layer CNN encoder-decoder is enough to produce a glass probability map at low computational cost.
- Glass probability injected into context: injected via separate heads into the GRU’s inp branch, letting the GRU learn a region-specific update strategy in glass regions.
- Synergy of multiple losses: combining disparity loss, segmentation loss, and glass-weighted disparity loss supervises segmentation quality while reinforcing disparity accuracy in glass regions.