S2M2 Three-Stage Training Architecture — Po-Ting Lin (林柏廷)

1. Design Goals

When a stereo matching model is composed of “base model + auxiliary correction module”, joint training from the start runs into a fundamental training-order problem. The three-stage training plan is designed to solve precisely this problem.

1.1 Problem: the correction task is impossible

If the base model has never seen the target domain (e.g., the pretrained weights have never seen synthetic transparent-object data), training the correction module under such conditions produces the following failure chain:

[Figure: flow showing why the correction task is impossible]

Root cause: the correction's upper bound (max_correction) is far smaller than the base model's error, so even a perfect correction can fix only a small portion of it. The network discovers that "always outputting the upper bound" is a local optimum. Once the output sits in the saturated region of tanh, the gradient is near zero, so the module cannot learn spatial structure.

Core constraint: training the correction module while the base model has no understanding of the target domain makes the task itself impossible. The base model’s error must first be pushed below the correction’s upper bound before the correction task becomes feasible.
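The saturation argument above can be made concrete with a small numeric sketch. It assumes the correction has the form max_correction * tanh(raw); the source mentions tanh and an upper bound but does not give the exact formula, so this form, the function names, and the numbers are illustrative assumptions only.

```python
import math


def bounded_correction(raw: float, max_correction: float) -> float:
    """Assumed form: squash the raw output into (-max_correction, max_correction)."""
    return max_correction * math.tanh(raw)


def correction_grad(raw: float, max_correction: float) -> float:
    """Derivative of the bounded correction with respect to the raw output."""
    return max_correction * (1.0 - math.tanh(raw) ** 2)


def residual_after_best_correction(base_error: float, max_correction: float) -> float:
    """Error that remains even when the correction is applied at full magnitude."""
    return max(0.0, abs(base_error) - max_correction)


if __name__ == "__main__":
    max_correction = 2.0  # hypothetical bound, in pixels
    base_error = 20.0     # hypothetical error of an unadapted base model
    print("residual:", residual_after_best_correction(base_error, max_correction))
    for raw in (0.0, 2.0, 5.0):
        # The gradient shrinks quickly as |raw| grows into the saturated region.
        print(raw, bounded_correction(raw, max_correction),
              correction_grad(raw, max_correction))
```

With these hypothetical numbers, most of the base error survives any correction, and the loss keeps pushing raw toward the region where the gradient vanishes. Once the base error is below the bound, the residual can reach zero with raw still in the responsive range.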

2. Architecture: Three-Stage Training Plan

[Figure: three-stage training plan, Stage A/B/C flow]

3. Three-Stage Components and Design

3.1 Stage A — Baseline Domain Adaptation

Item	Setting
Goal	Let the base model first adapt to the target data domain
Input	Pretrained S2M2 weights
Training scope	Train only the base model; no polarization module
Expected effect	Error over transparent regions drops substantially, to below the correction’s upper bound
Key role	Establishes a reasonable foundation so the subsequent correction task becomes feasible

3.2 Stage B — Polarization Only Learning

Item	Setting
Goal	Let the correction module learn the pol_diff -> disparity correction mapping
Input	Stage A checkpoint
Training scope	Freeze the base model; train only the correction module
Warp mode	GT warp (ensures pol_diff quality)
Key role	base_error is already below max_correction, so the task is feasible

Freezing the base model also closes a shortcut: if the base were trainable, it could lower the loss by learning disparity directly, and the correction module would be under no pressure to learn from pol_diff.

3.3 Stage C — Finetune Integration

Item	Setting
Goal	Joint fine-tuning so the base and correction work together
Input	Stage B checkpoint
Training scope	Everything unfrozen; joint fine-tune
Warp mode	Switch to pred warp (simulates real inference)
Key role	Both modules are already trained, so joint tuning no longer lets one side shortcut the other

4. Why Three Stages Solve the Problem

Problem	How the three-stage plan handles it
Base error too large; correction cannot move it	Stage A first pushes the base error below the correction’s upper bound
Correction does not learn (task impossible)	Stage B trains the correction under conditions where the task is feasible
Shortcut problem (base steals credit)	Stage B freezes the base, forcing the correction to learn from pol_diff
Joint training collapse	By Stage C both sides are pretrained, so joint tuning is stable

The key feasibility constraint: before training the correction module, confirm that max_correction >= base_error. Otherwise the network cannot solve the task and will converge to a constant output. The role of Stage A is exactly to push base_error below max_correction.
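The feasibility constraint can be checked before Stage B with a few lines of code. The sketch below is a hedged illustration: the source states only max_correction >= base_error, so summarizing per-pixel errors by a high quantile, the function name check_feasibility, and the sample values are all assumptions.

```python
from typing import Sequence, Tuple


def check_feasibility(
    base_errors: Sequence[float],
    max_correction: float,
    quantile: float = 0.95,
) -> Tuple[bool, float]:
    """Compare max_correction with a high quantile of the absolute base error.

    Returns (feasible, reference_error). The quantile is an assumed way of
    turning many per-pixel errors into the single base_error of the constraint.
    """
    if not base_errors:
        raise ValueError("base_errors must not be empty")
    ordered = sorted(abs(e) for e in base_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    reference_error = ordered[index]
    return max_correction >= reference_error, reference_error


if __name__ == "__main__":
    max_correction = 2.0                       # hypothetical bound, in pixels
    before_stage_a = [12.0, 18.0, 25.0, 30.0]  # hypothetical transparent-region errors
    after_stage_a = [0.5, 0.8, 1.2, 1.6]       # hypothetical errors after adaptation
    print(check_feasibility(before_stage_a, max_correction))
    print(check_feasibility(after_stage_a, max_correction))
```

Running such a check on the Stage A checkpoint, restricted to the regions the correction is meant to fix, gives a simple gate for whether Stage B should start.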

5. Tensor / Training State Dimensions

Stage	Trainable modules	Frozen modules	Source of warp disparity
Stage A	base model (S2M2)	—	not applicable (no polarization module)
Stage B	correction module	base model	GT disparity
Stage C	base model + correction module	—	predicted disparity
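The table above maps naturally onto a small configuration object. The following is a minimal sketch, not the project's actual configuration: StageConfig, its field names, and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StageConfig:
    name: str
    train_base: bool             # base model (S2M2) receives gradient updates
    train_correction: bool       # correction module receives gradient updates
    use_correction: bool         # correction module is part of the forward pass
    warp_source: Optional[str]   # "gt", "pred", or None when no warp is needed


STAGES = {
    "A": StageConfig("Stage A", True, False, False, None),
    "B": StageConfig("Stage B", False, True, True, "gt"),
    "C": StageConfig("Stage C", True, True, True, "pred"),
}


def trainable_modules(stage: str) -> List[str]:
    cfg = STAGES[stage]
    names = []
    if cfg.train_base:
        names.append("base")
    if cfg.use_correction and cfg.train_correction:
        names.append("correction")
    return names


def frozen_modules(stage: str) -> List[str]:
    """Modules that take part in the forward pass but are not updated."""
    cfg = STAGES[stage]
    names = []
    if not cfg.train_base:
        names.append("base")
    if cfg.use_correction and not cfg.train_correction:
        names.append("correction")
    return names


if __name__ == "__main__":
    for key in ("A", "B", "C"):
        print(key, trainable_modules(key), frozen_modules(key), STAGES[key].warp_source)
```

In a PyTorch training loop, flags like these would typically be applied by calling requires_grad_(False) on the frozen module and passing only the trainable parameters to the optimizer.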

6. Polarization Injection Points

In the three-stage training, polarization information is injected into the correction module through the warp-based pol_diff. The three stages are a training-pipeline design: they do not change the injection point itself, but progressively bring the modules around it into a trainable state:

Stage A: the correction module that hosts the injection point is not yet enabled.
Stage B: the injection point is enabled; pol_diff is computed with GT warp; only the correction module is trained.
Stage C: the injection point is enabled; pol_diff is computed with pred warp; the full model is jointly trained.
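The difference between GT warp and pred warp in the stages above can be illustrated on a single image row. The source does not define how pol_diff is computed, so this sketch assumes one scalar polarization value per pixel, a left-to-right correspondence of x to x - d, nearest-neighbour sampling, and a plain subtraction; every name and value here is hypothetical.

```python
from typing import List, Sequence


def warp_row(right_row: Sequence[float], disparity_row: Sequence[float]) -> List[float]:
    """Sample the right view at x - d(x); out-of-range samples become 0.0."""
    width = len(right_row)
    warped = []
    for x, d in enumerate(disparity_row):
        src = int(round(x - d))
        warped.append(right_row[src] if 0 <= src < width else 0.0)
    return warped


def pol_diff_row(
    left_row: Sequence[float],
    right_row: Sequence[float],
    disparity_row: Sequence[float],
) -> List[float]:
    """Assumed pol_diff: left value minus the right value warped into the left view."""
    warped = warp_row(right_row, disparity_row)
    return [l - r for l, r in zip(left_row, warped)]


if __name__ == "__main__":
    left = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
    right = [12.0, 13.0, 14.0, 15.0, 0.0, 0.0]  # left content shifted by a disparity of 2
    gt_disparity = [2.0] * 6
    pred_disparity = [1.0] * 6                  # hypothetical inaccurate prediction
    print("GT warp:  ", pol_diff_row(left, right, gt_disparity))
    print("pred warp:", pol_diff_row(left, right, pred_disparity))
```

In this toy case both views carry identical values, so an aligned warp leaves zero difference away from the border. With real data, whatever remains under GT warp is the signal the correction module learns from, while an inaccurate disparity mixes alignment error into pol_diff. That is one way to read why Stage B uses GT warp and Stage C switches to pred warp only after both modules are trained.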

7. Design Decisions and Rationale

Decision	Rationale
Add Stage A (Domain Adaptation)	Pretrained weights have not seen the target domain; base_error must first be pushed below max_correction
Freeze base model in Stage B	Avoids the base taking a shortcut and stealing credit; forces correction to learn from pol_diff
GT warp in Stage B	Ensures pol_diff quality so the correction learns semantics under perfect alignment
Unfreeze everything for joint fine-tuning in Stage C	Both modules are already trained, so joint fine-tuning no longer lets one side shortcut the other
Switch to pred warp in Stage C	Simulates the real inference setting
Three-stage as a general pattern	Domain Adaptation -> Module-specific Learning -> Joint Fine-tune applies to any “base + auxiliary module” architecture

8. Highlights

Breaks an impossible task through training order: first lower the base error, then train the correction, so the correction is not forced to converge to a constant output while the base has no understanding of the domain.
Explicit feasibility constraint: before training, use max_correction >= base_error as the task-feasibility criterion, and treat saturation of the raw (pre-tanh) output as a danger signal.
Freezing the base prevents shortcut: Stage B freezes the base, forcing the correction to truly learn from pol_diff rather than letting the base shortcut and steal credit.
Pretrain each side before joint fine-tuning: by Stage C both modules are in place, avoiding the situation where one side has not yet learned and is overwhelmed by the other.
Transferable general training pattern: Domain Adaptation -> Module-specific Learning -> Joint Fine-tune applies to any two-component “base + auxiliary module” architecture.