Glass-Aware Segmentation Branch Architecture

1. Design Goals

This architecture addresses the detection of transparent glass in stereo matching.

The polarization signal (the two cameras capture I∥ and I⊥ respectively) has two possible uses, but their applicability differs:

Polarization is unsuitable for matching: correlation measures texture similarity between the two views. On glass surfaces I∥ ≠ I⊥, and glass itself lacks texture, so the polarization correlation has no clear peak and cannot support disparity search.
Polarization is suitable for segmentation: |I∥ - I⊥| is separable between glass and non-glass, which makes it a classification signal rather than a matching signal.

The design goal of this architecture is to use the right signal for the right task. Rather than making polarization carry correlation matching, use it to tell the network “where the glass is”, and let the GRU learn a region-specific update strategy in glass regions (for example, being more conservative or using a different update magnitude).

2. Architecture

2.1 Glass-Aware Concept

Figure: Glass-Aware concept architecture

2.2 Full Data Flow

Figure: Glass-Aware full data flow

3. Components and Modules

3.1 GlassSegmentationBranch (encoder-decoder)

A lightweight glass segmentation branch that takes the polarization difference |I∥ - I⊥| as input and outputs a glass probability map in [0,1]. It uses an encoder-decoder structure.

import torch
import torch.nn as nn


class GlassSegmentationBranch(nn.Module):
    """
    Lightweight glass segmentation branch
    Input: |I∥ - I⊥| (polarization difference)
    Output: Glass probability map [0, 1]
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            # 3 -> 32
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # H/2

            # 32 -> 64
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # H/4

            # 64 -> 128
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
        )

        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),  # H/2
            nn.Conv2d(128, 64, 3, padding=1),
            nn.ReLU(inplace=True),

            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),  # H
            nn.Conv2d(64, 32, 3, padding=1),
            nn.ReLU(inplace=True),

            nn.Conv2d(32, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, left, right):
        # left, right: [B, 3, H, W] images from the two cameras (I∥ and I⊥)
        # Polarization difference
        pol_diff = torch.abs(left - right)

        # Encode
        features = self.encoder(pol_diff)

        # Decode to probability map
        glass_prob = self.decoder(features)

        return glass_prob

Structure:

Encoder: 3 → 32 → 64 → 128, two MaxPool downsamplings to H/4, with BatchNorm.
Decoder: two bilinear Upsamples, 128 → 64 → 32 → 1, ending with Sigmoid to produce the probability map.
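
One consequence of this structure is worth spelling out: MaxPool2d(2) floors odd sizes while each Upsample doubles exactly, so the probability map matches the input resolution only when H and W are multiples of 4. The following pure-Python sketch traces that arithmetic. The helper names are hypothetical and not part of the original code.

```python
def branch_output_size(size: int) -> int:
    """Spatial size after two MaxPool2d(2) and two Upsample(scale_factor=2) stages."""
    pooled = (size // 2) // 2   # MaxPool2d(2) floors odd sizes by default
    return pooled * 2 * 2       # each Upsample doubles the size


def pad_to_multiple_of_4(size: int) -> int:
    """Padding needed so the probability map matches the input size."""
    return (-size) % 4


for h in (480, 375):
    print(h, branch_output_size(h), pad_to_multiple_of_4(h))
```

If the input height or width is not a multiple of 4, one option is to pad the inputs by the amount above before the branch and crop the probability map afterwards.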

3.2 Three Options for Context Fusion

There are three candidate ways to inject the glass probability into the Context Network.

Option A: Concatenation

context = self.cnet(left)  # [B, 128, H/4, W/4]
glass_prob_down = F.interpolate(glass_prob, scale_factor=0.25)
context = torch.cat([context, glass_prob_down], dim=1)  # [B, 129, H/4, W/4]

Option B: Attention

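context = self.cnet(left)  # [B, 128, H/4, W/4]
glass_prob_down = F.interpolate(glass_prob, scale_factor=0.25)  # [B, 1, H/4, W/4]
glass_attention = 1.0 + 0.5 * glass_prob_down  # [1.0, 1.5] range
context = context * glass_attention  # Broadcasts over channels; amplifies features in glass regions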

Option C: Separate heads (recommended)

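# GRU receives both context and glass_prob separately
# Let GRU learn how to use glass information
# Note: this split requires a 256-channel context (128 hidden + 128 input),
# not the 128-channel context shown in Options A and B
glass_prob_down = F.interpolate(glass_prob, scale_factor=0.25)
net, inp = torch.split(context, [128, 128], dim=1)
inp = torch.cat([inp, glass_prob_down], dim=1)  # Add glass info to inp: [B, 129, H/4, W/4]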

Option C is recommended: inject the glass probability into the GRU’s inp branch and let the GRU learn how to use the glass information on its own.
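
Option C implies that whatever consumes inp must accept one extra input channel. The sketch below illustrates this with a minimal convolutional GRU cell. The cell, the 256-channel context, and the random tensors are illustrative assumptions rather than the actual update block, which in a full network would normally also receive correlation or motion features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (illustrative stand-in for the real update block)."""

    def __init__(self, hidden_dim: int, input_dim: int):
        super().__init__()
        in_ch = hidden_dim + input_dim
        self.convz = nn.Conv2d(in_ch, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(in_ch, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(in_ch, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))   # update gate
        r = torch.sigmoid(self.convr(hx))   # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))  # candidate state
        return (1 - z) * h + z * q


B, H, W = 2, 64, 96
context = torch.randn(B, 256, H // 4, W // 4)   # assumed 256-channel context
glass_prob = torch.rand(B, 1, H, W)             # stand-in for the branch output

glass_prob_down = F.interpolate(glass_prob, scale_factor=0.25,
                                mode="bilinear", align_corners=False)
net, inp = torch.split(context, [128, 128], dim=1)
inp = torch.cat([inp, glass_prob_down], dim=1)  # [B, 129, H/4, W/4]

gru = ConvGRUCell(hidden_dim=128, input_dim=129)  # input_dim grows by one channel
net = gru(net, inp)
print(net.shape)
```

The hidden state keeps its 128 channels, so only the first convolution of each gate changes size. This is one reason the option is cheap to retrofit.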

3.3 Loss Function Design

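def compute_loss(pred_disps, gt_disp, glass_prob, glass_mask):
    # BCE expects a float target with the same shape as glass_prob
    glass_mask = glass_mask.float()

    # 1. Disparity loss (standard)
    # NOTE: sequence_loss must return a per-pixel loss map [B, 1, H, W] here;
    # with a scalar, the weighting below only rescales the whole loss by the
    # mean weight instead of emphasizing glass pixels
    disp_loss = sequence_loss(pred_disps, gt_disp)

    # 2. Glass segmentation loss
    seg_loss = F.binary_cross_entropy(glass_prob, glass_mask)

    # 3. Glass-weighted disparity loss (optional)
    # Encourage the network to focus more on disparity accuracy in glass regions
    glass_weight = 1.0 + glass_mask  # [1.0, 2.0]
    weighted_disp_loss = (glass_weight * disp_loss).mean()

    # Total
    total_loss = weighted_disp_loss + 0.1 * seg_loss

    return total_loss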

Three loss terms:

Disparity loss: standard sequence loss.
Glass segmentation loss: binary cross entropy between glass_prob and glass_mask.
Glass-weighted disparity loss (optional): weighted in glass regions (weight in [1.0, 2.0]), encouraging the network to focus on disparity accuracy in glass regions.

Total loss: weighted_disp_loss + 0.1 * seg_loss.
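
The glass weighting only emphasizes glass pixels if it is applied to a per-pixel error map before reduction. A minimal sketch of that, assuming an L1 error and a RAFT-style exponentially decaying weight over iterations (gamma = 0.9 is an assumed value, and validity masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F


def glass_weighted_loss(pred_disps, gt_disp, glass_prob, glass_mask,
                        gamma=0.9, seg_weight=0.1):
    """pred_disps: list of [B, 1, H, W]; other tensors: [B, 1, H, W]."""
    glass_mask = glass_mask.float()
    glass_weight = 1.0 + glass_mask                 # 1.0 outside glass, 2.0 inside

    n = len(pred_disps)
    disp_loss = 0.0
    for i, pred in enumerate(pred_disps):
        iter_weight = gamma ** (n - 1 - i)          # later iterations count more
        per_pixel = (pred - gt_disp).abs()          # L1 error map, not yet reduced
        disp_loss = disp_loss + iter_weight * (glass_weight * per_pixel).mean()

    seg_loss = F.binary_cross_entropy(glass_prob, glass_mask)
    return disp_loss + seg_weight * seg_loss


B, H, W = 2, 32, 48
gt = torch.rand(B, 1, H, W) * 64
preds = [gt + torch.randn_like(gt) for _ in range(3)]
mask = torch.rand(B, 1, H, W) > 0.7
prob = torch.rand(B, 1, H, W)
print(glass_weighted_loss(preds, gt, prob, mask).item())
```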

4. Tensor Dimensions

Item	Shape	Description
pol_diff	(B, 3, H, W)	|I∥ - I⊥| (polarization difference)
Glass encoder output	(B, 128, H/4, W/4)	Two MaxPool downsamplings
glass_prob	(B, 1, H, W)	Sigmoid probability map
glass_prob_down	(B, 1, H/4, W/4)	interpolate scale 0.25
context (CNet output)	(B, 128, H/4, W/4)	RGB CNet
context (Option A after fusion)	(B, 129, H/4, W/4)	concat glass_prob_down
cost_rgb	(B, 36, …)	RGB correlation

5. Hyperparameters

Hyperparameter	Value	Description
Glass encoder channel sequence	3→32→64→128	Encoder channel counts per layer
Number of MaxPools	2	Downsample to H/4
seg_loss weight	0.1	Weight of segmentation loss in total loss
Glass weight range	[1.0, 2.0]	Weight range for glass-weighted disparity loss
Attention range	[1.0, 1.5]	Feature amplification range for Option B
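
For reference, the values in the table above could be gathered in a single configuration object. This is only a sketch: the class and field names are hypothetical, and the two ranges are expressed through the gains that produce them (1.0 + gain × value).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GlassBranchConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    encoder_channels: Tuple[int, ...] = (3, 32, 64, 128)
    num_maxpools: int = 2
    seg_loss_weight: float = 0.1
    glass_weight_gain: float = 1.0   # weight = 1.0 + gain * glass_mask  -> [1.0, 2.0]
    attention_gain: float = 0.5      # attention = 1.0 + gain * glass_prob -> [1.0, 1.5]

    @property
    def downsample_factor(self) -> int:
        # Each MaxPool2d(2) halves the resolution.
        return 2 ** self.num_maxpools


cfg = GlassBranchConfig()
print(cfg.downsample_factor)
```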

6. Training Strategy

Option 1: End-to-end (recommended)

Train the whole network jointly so that the segmentation branch and disparity are learned together.

Option 2: Two-stage

Stage 1: train the glass segmentation branch first (freeze the backbone).
Stage 2: unfreeze the backbone and fine-tune the whole network.
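
The two-stage schedule can be sketched by toggling requires_grad on the backbone parameters and rebuilding the optimizer for each stage. The toy model, the attribute names backbone and glass_branch, and the learning rates are placeholders chosen for illustration, not values from this design.

```python
import torch
import torch.nn as nn


class ToyStereoNet(nn.Module):
    """Hypothetical stand-in: `backbone` and `glass_branch` are assumed attribute names."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, 3, padding=1)
        self.glass_branch = nn.Conv2d(3, 1, 3, padding=1)


def set_backbone_trainable(model: nn.Module, trainable: bool) -> None:
    for p in model.backbone.parameters():
        p.requires_grad = trainable


def make_optimizer(model: nn.Module, lr: float) -> torch.optim.Optimizer:
    # Only hand trainable parameters to the optimizer.
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)


model = ToyStereoNet()

# Stage 1: freeze the backbone, train only the glass segmentation branch.
set_backbone_trainable(model, False)
stage1_opt = make_optimizer(model, lr=1e-3)   # placeholder learning rate

# Stage 2: unfreeze and fine-tune everything with a fresh optimizer.
set_backbone_trainable(model, True)
stage2_opt = make_optimizer(model, lr=1e-4)   # placeholder learning rate
```

Note that setting requires_grad to False does not stop BatchNorm layers from updating their running statistics. If the backbone contains BatchNorm, it may also need to be put in eval mode during Stage 1.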

7. Design Decisions and Rationale

7.1 Use the right signal for the right task

The polarization signal is suitable for segmentation (|I∥ - I⊥| is separable between glass and non-glass) but not for matching (correlation has no peak). This architecture therefore does not let polarization enter the cost volume; it uses polarization for segmentation instead.

7.2 A lightweight branch is sufficient

Glass segmentation is a relatively easy binary classification task; it does not require a complex architecture. A 3–4 layer CNN encoder-decoder is enough.

7.3 Let the network learn the fusion

Inject the glass probability into the context (recommended: into the GRU’s inp branch) and let the GRU learn how to use the glass information, rather than using fixed rules to decide how glass regions should be handled.

7.4 Glass-aware behavior in the GRU

With the glass probability available, the GRU knows which regions are glass and can learn region-specific update strategies — being more conservative in glass regions and using a different update magnitude — rather than treating all pixels uniformly.

8. Highlights

Signal-task alignment: explicitly recognizes that polarization is “suitable for segmentation, unsuitable for matching” and uses it for context-level glass detection rather than in the cost volume, avoiding asking correlation to bear a task it cannot support.
Lightweight segmentation branch: a 3–4 layer CNN encoder-decoder is enough to produce a glass probability map at low computational cost.
Glass probability injected into context: injected via separate heads into the GRU’s inp branch, letting the GRU learn a region-specific update strategy in glass regions.
Synergy of multiple losses: combining disparity loss, segmentation loss, and glass-weighted disparity loss supervises segmentation quality while reinforcing disparity accuracy in glass regions.