Weight-Space Geometry of Offline Reasoning Training

An interactive companion to the paper

Abstract

Six losses, one base model, one fixed set of math rollouts. If the data is held identical, what does the choice of loss actually do to the weights? This page lets you turn the same knobs we did — pick method pairs, scrub across all 36 layers, and watch each geometric metric respond.

Reading the geometry

Every chart below is computed on the LoRA weight update ΔW — the small change each method writes into the base model — or on the representations that update produces. Four tools, each asking a different version of “are these two the same?”

Cosine similarity: Do two weight updates point the same way? +1 = identical direction, 0 = orthogonal (unrelated), below 0 = opposed. This is the headline number, taken on the stacked ΔW.
Principal angles: How far apart are the subspaces the two updates span — a basis-free generalization of cosine. A few degrees means effectively the same subspace; near 90° means disjoint.
Mode connectivity (linear, LMC): Interpolate between two trained adapters and watch the loss. A flat path means they sit in the same basin; a bump in the middle is a barrier separating two different solutions.
CKA (centered kernel alignment): Do the two models compute the same thing inside? Unlike the others, CKA compares hidden representations, not weights. ≈1 = near-identical computation; lower means the circuit has been rewired.

The six losses

Every method is trained on the same rollouts from Qwen3-4B-Instruct with attention-only LoRA (q, k, v, o; rank 32). They differ only in how the loss treats negatives, reward, and a reference policy.

The objectives, written out

Most of the objectives are token-level cross-entropy: the term −𝔼 log π_θ(y∣x) is exactly the CE between a rollout and the model. SFT, RFT and RIFT are the same CE, only reweighted per rollout: over all tokens, over positives only, or by reward. DFT reweights it by the stop-gradient probability. GRPO and DPO leave the CE form entirely.
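The reweighting can be sketched in a few lines. This is an illustration under assumptions, not the paper's training code: `token_logps` holds log π_θ(y_t ∣ x, y_<t) for one rollout, `reward` is assumed to be a scalar (binary in the example), the function name `ce_family_loss` is hypothetical, and the RIFT branch is only the plain "weight by reward" reading of the text. GRPO and DPO are left out because they are not of this form.

```python
import numpy as np

def ce_family_loss(token_logps, reward, method):
    """Per-rollout loss for the cross-entropy family (illustrative only).

    token_logps: 1-D array of log pi_theta(y_t | x, y_<t) for one rollout.
    reward:      scalar reward for the rollout (assumed binary here).
    """
    logps = np.asarray(token_logps, dtype=float)
    ce = -logps  # token-level cross-entropy terms
    if method == "sft":
        weights = np.ones_like(ce)                      # every token counts
    elif method == "rft":
        weights = np.full_like(ce, float(reward > 0))   # positives only
    elif method == "rift":
        weights = np.full_like(ce, float(reward))       # by reward (assumed form)
    elif method == "dft":
        # Token probability used as a constant weight. In an autograd
        # framework this factor would sit inside a stop-gradient.
        weights = np.exp(logps)
    else:
        raise ValueError(f"unknown method: {method}")
    return float(np.sum(weights * ce))
```

In this sketch the RFT and RIFT branches coincide for a 0/1 reward; they separate only when the reward is graded or signed.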

A map of directions

Start global. Stack each method's LoRA update into a single vector ΔW and measure the cosine between every pair. Three blocks of the matrix tell the whole story: a hot reward-weighted cluster (SFT / RFT / RIFT), a lukewarm Offline GRPO, and a cold, near-orthogonal DPO. Add the on-policy methods and they detach from everything offline.
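A minimal sketch of how such a matrix could be computed, assuming each adapter is available as a dict from module name to its LoRA factors `(B, A)`. The helper names `stacked_delta` and `cosine_matrix` are hypothetical, and the layout of the published analysis data may differ.

```python
import numpy as np

def stacked_delta(adapter):
    """Flatten every module's update dW = B @ A into one long vector.

    adapter: dict mapping module name -> (B, A), with B of shape (d_out, r)
             and A of shape (r, d_in). The LoRA scale alpha/r is omitted:
             a shared positive scale does not change the cosine.
    """
    return np.concatenate([(B @ A).ravel() for _, (B, A) in sorted(adapter.items())])

def cosine_matrix(adapters):
    """Pairwise cosine between the stacked updates of several methods."""
    names = list(adapters)
    vecs = np.stack([stacked_delta(adapters[n]) for n in names])
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return names, unit @ unit.T
```

Sorting by module name keeps the stacking order identical across methods, which the cosine needs in order to be meaningful.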

Layer by layer

A single number hides where methods agree. Here is the cosine of ΔW computed independently in each of the 36 transformer blocks. Toggle pairs and drag the slider: the SFT family is colinear from embedding to head, while DPO and online RL stay pinned near zero — and Offline GRPO peels away in the late layers.
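The per-block version is the same cosine restricted to one layer at a time. The sketch below assumes the updates are stored as dicts keyed by `(layer_index, module_name)`; that key format and the name `per_layer_cosine` are illustrative choices, not the page's actual data schema.

```python
import numpy as np

def per_layer_cosine(deltas_a, deltas_b, num_layers=36):
    """Cosine of the update within each transformer block.

    deltas_a, deltas_b: dicts mapping (layer_index, module_name) -> dW array,
    with identical keys and shapes for the two methods, and at least one
    module in every layer.
    """
    out = np.zeros(num_layers)
    for layer in range(num_layers):
        keys = sorted(k for k in deltas_a if k[0] == layer)
        va = np.concatenate([deltas_a[k].ravel() for k in keys])
        vb = np.concatenate([deltas_b[k].ravel() for k in keys])
        out[layer] = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
    return out
```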

Does it rewire the computation?

Cosine compares updates. CKA compares what the network actually computes — the hidden representations. Most methods leave them almost untouched (CKA ≈ 1). DPO is the exception: its representation similarity collapses in the final blocks, the fingerprint of a method that changes the circuit, not just the write direction. The layer slider is shared with the chart above.
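For reference, here is a sketch of linear CKA, the most common variant. The page does not say which kernel was used, so treat the linear choice as an assumption. `X` and `Y` are taken to be hidden activations from the two models at the same layer, on the same inputs, one row per example.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_examples, dim)."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)
```

Linear CKA is unchanged by rotating or uniformly rescaling either representation, so a drop below 1 reflects a change in what is represented rather than a change of basis.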

Same answer, different basis

Low cosine does not always mean a different solution. Decompose each ΔW and compare only the dominant output direction (the top left-singular vector u). Across the SFT family these stay aligned even where the raw vectors diverge — the updates point the same way in output space while differing in their input-side basis, an artifact of random LoRA initialization rather than a genuinely different circuit.
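A toy sketch makes the distinction concrete. The rank-1 updates below are synthetic (real adapters here are rank 32), and the function names are hypothetical. Both updates share the output direction `u` but use independent random input directions.

```python
import numpy as np

def top_output_direction(delta_w):
    """Top left-singular vector u of dW (the dominant output direction)."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, 0]

def top1_alignment(dw_a, dw_b):
    """|cos| between the two top output directions; the sign of u is arbitrary."""
    return float(abs(top_output_direction(dw_a) @ top_output_direction(dw_b)))

def raw_cosine(dw_a, dw_b):
    """Cosine between the two updates flattened into vectors."""
    num = dw_a.ravel() @ dw_b.ravel()
    return float(num / (np.linalg.norm(dw_a) * np.linalg.norm(dw_b)))

# Toy check: same output direction u, independent random input directions.
rng = np.random.default_rng(0)
u = rng.normal(size=64)
dw_1 = np.outer(u, rng.normal(size=512))
dw_2 = np.outer(u, rng.normal(size=512))
print(raw_cosine(dw_1, dw_2))      # small: the input-side vectors are unrelated
print(top1_alignment(dw_1, dw_2))  # close to 1: the output direction is shared
```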

How far apart are the subspaces?

Principal angles measure the gap between the subspaces two updates span, a basis-free version of cosine. SFT and RFT sit about 7° apart (effectively the same subspace); SFT and DPO open up to ~55°. Each bar is the median over 144 modules (36 layers × 4 attention projections); the whisker shows the spread of the worst of the top-10 angles.
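One standard way to compute principal angles is from the singular values of the product of two orthonormal bases. The sketch below assumes the subspaces are the top-k left-singular (output-side) subspaces of each ΔW, with k = 10 to match the "top-10" wording. Both choices are assumptions about the analysis, and `principal_angles_deg` is a hypothetical name.

```python
import numpy as np

def principal_angles_deg(dw_a, dw_b, k=10):
    """Principal angles in degrees, smallest first, between the top-k
    left-singular subspaces of two updates."""
    ua = np.linalg.svd(dw_a, full_matrices=False)[0][:, :k]
    ub = np.linalg.svd(dw_b, full_matrices=False)[0][:, :k]
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    cosines = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
```

The first entry is the top-1 principal angle; the last entry is the worst of the k angles.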

Size and rank of the move

Direction is only half of it. How far does each loss push, and how concentrated is the push? The SFT family travels far along a low-rank direction; DPO barely moves yet spreads that tiny step across a much higher effective rank — a small, broad nudge versus a large, focused shove.
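Both quantities fall out of the singular values of ΔW. The sketch below uses the Frobenius norm for size and the entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values). The page does not state which effective-rank definition was used, so that choice is an assumption.

```python
import numpy as np

def update_size_and_rank(delta_w, eps=1e-12):
    """Frobenius norm and entropy-based effective rank of one update."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    fro = float(np.sqrt(np.sum(s ** 2)))
    p = s / (s.sum() + eps)   # singular values as a probability distribution
    p = p[p > eps]            # drop numerically zero entries before the log
    eff_rank = float(np.exp(-np.sum(p * np.log(p))))
    return fro, eff_rank
```

A rank-1 update gives an effective rank of about 1; an update with r equal singular values gives about r.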

One basin or two?

Linearly interpolate between two trained adapters and watch the loss. A flat or monotone path means the two solutions share a basin; a bump in the middle is an energy barrier separating them. SFT ↔ Offline GRPO is barrier-free — same basin. Paths into DPO climb a wall.
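The interpolation itself is short. The sketch below treats each adapter as a flat parameter vector and defines the barrier as the largest excess of the path loss over the straight chord between the endpoint losses. Whether the study interpolates the LoRA factors or ΔW, and how it defines barrier height, is not stated here, so both are assumptions; `loss_fn` is a stand-in for evaluating the model.

```python
import numpy as np

def interpolation_barrier(theta_a, theta_b, loss_fn, num_points=11):
    """Loss along the straight line between two parameter vectors, plus the
    barrier height: the largest excess over the chord joining the endpoints."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    chord = (1 - alphas) * losses[0] + alphas * losses[-1]
    return alphas, losses, float(np.max(losses - chord))

# Toy losses: a double well has a barrier between its minima, a bowl does not.
double_well = lambda t: float(np.sum((t ** 2 - 1.0) ** 2))
bowl = lambda t: float(np.sum(t ** 2))
_, _, wall = interpolation_barrier(np.array([-1.0]), np.array([1.0]), double_well)
_, _, flat = interpolation_barrier(np.array([1.0, 0.0]), np.array([0.0, 1.0]), bowl)
```

Here `wall` is positive and `flat` is zero, matching the two cases described above.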

Does the geometry show up in accuracy?

Yes — and it inverts the usual intuition. The methods that move orthogonally to the SFT direction (DPO and on-policy RL) hold onto the base model's accuracy, while the colinear SFT family drags GSM8K below base. Online GRPO posts the best AIME26.

Is the geometry an artifact of seed or learning rate?

A fair worry: maybe the directions are just noise. They are not. Two seeds of the same loss produce a low raw weight-cosine, yet the cosine between their top-1 output directions stays at ~0.99. The disagreement is entirely in the input-side basis (random LoRA A-init), not in the solution. Separately, a 10× learning-rate change rotates ΔW rather than merely rescaling it, so DPO's smaller LR is genuinely part of its geometry.

Takeaways

The reward-weighted MLE family is one direction. SFT, RFT, and RIFT have cosine ≥ 0.94 and ~7° top-1 principal angle — interchangeable in weight space.
DFT diverges the most among offline losses despite seeing identical data — the stop-gradient reshaping matters geometrically.
Offline GRPO stays in the SFT basin but adds a large orthogonal late-layer component (up to ~86% off-SFT in the final blocks).
DPO is the outlier: near-orthogonal subspace, a mode-connectivity barrier, late-layer CKA collapse — and the best accuracy, at a 10× smaller learning rate.
On-policy RL is geometrically unlike everything offline. Online GRPO/DAPO are near-orthogonal to every offline loss and to each other: shared-rollout colinearity is partly an artifact of training on the same fixed data.

Base model Qwen3-4B-Instruct-2507 · attention-only LoRA (q, k, v, o; rank 32, alpha 64) · DeepScaleR math rollouts · math-verify reward. All metrics on this page are computed from the published analysis JSON. Code, adapters, and raw results ↗