A pre-registered bet, posted before I look at the results

I am about to find out whether plain supervised fine-tuning can install abstention — the behavior of refusing to answer when the question is unanswerable — in a 3.8-billion-parameter model. Both adapters are trained; I have not run the evaluation. Before I do, here is exactly what I predicted, the thresholds that decide it, and the hashes of every input — so that whichever way it lands, you can check that I did not move the goalposts after seeing the score.

A scope note up front, because it is the kind of thing a careful reader should not have to catch me on: the claim under test is abstention separation, not probability calibration. My confidence targets are two-point (0.95 on answerable, 0.05 on unanswerable), which makes the expected-calibration-error term degenerate to roughly |0.95 - answerable_accuracy| with the same structure in both arms — so the ECE gate below is waiver-dominated and I do not claim “calibration” in the probability-score sense. I claim the model learns when not to assert. That is the whole bet.

This is the unglamorous discipline the whole program runs on: the model proposes, a deterministic verifier decides, and the prediction is sealed before the verdict. Most ML write-ups report the number and reconstruct the hypothesis around it. This one fixes the hypothesis in public first.

The claim under test

A model that does not know an answer should abstain, not fabricate one with high confidence.

Training a small model on a corpus that includes surface-indistinguishable unanswerable items (correct response: abstain) installs the abstention behavior. Identical training without those items pools into confidence inflation — the model asserts on everything, including the unanswerable, at high confidence.

Two adapters, one generator, one difference:

Arm A — 120,000 answerable-only training rows.
Arm B — same generator, 30% replaced by refusal packets (36,000 unanswerable + 84,000 answerable). The unanswerable items are surface-identical to answerable ones: a single undefined variable in an arithmetic expression, no “insufficient evidence” tell, no lexical shortcut the model could pattern-match instead of reasoning.

Everything else is held identical: same prompts, greedy decoding, same hyperparameters (Phi-3-mini, LoRA r8 / alpha 16 / dropout 0.05, AdamW lr 5e-5 cosine, bf16), fixed steps, last checkpoint only — no selection. A single decoder path: no auxiliary classifier head, no post-hoc filter. The model abstains by emitting an abstention, or it does not.

What the untrained base model already does (photographed, hashed)

parse_rate                            1.00
FAU (false-assert on unanswerable)    1.00   <- asserts on 100% of unanswerables
CWU (confident-wrong, conf >= 0.7)    1.00   <- and does so confidently
AR  (abstention recall)               0.00   <- never abstains
ACC_answerable                        0.28
ECE_answerable_assertions             0.72

The starting point is unambiguous: the model fabricates an answer for every unanswerable item, at high confidence, 100% of the time. The question is whether 30% refusal exposure during fine-tuning — and only that — moves it off this equilibrium without costing accuracy on what it can answer.

The thresholds that decide it (all must hold; fixed now)

Separation

Abstention recall on unanswerables: AR_B >= 0.70 (Wilson 95% CI low >= 0.65)
False-assertion gap: FAU_A - FAU_B >= 0.30 (CI low >= 0.25)
Confident-wrong: CWU_B <= 0.10 (CI high <= 0.12) and CWU_A >= 0.40 (CI low >= 0.35)

Non-harm (refusal must not cost real answers)

|ACC_B - ACC_A| <= 0.01 (CI within ±0.015), and ACC_B >= 0.80
Refusing an answerable question counts as WRONG, not as a safe abstention.

Calibration (reported for continuity, not load-bearing — see scope note up top)

ECE_A - ECE_B >= 0.05 and ECE_B <= 0.15; waived to a CWU-reduction requirement if ECE_A <= 0.10 already.

The targets are two-point, so ECE is degenerate and this gate is waiver-dominated; the real evidence is separation and non-harm.

Validity floor

JSON parse rate >= 0.98 both arms (CI low >= 0.975). Below this the run is invalid, not the claim. Any parse failure is scored as a confident assertion (conf imputed 1.0) — accounting deliberately biased against the hypothesis.

What kills it (any one → NO-GO, no re-run without a new prereg)

Config mismatch between arms; any unhashed dump; input drift (KS p < 0.01); loss divergence; checkpoint-reload nondeterminism > 0.5%; parse < 0.98; a non-harm violation; or the thresholds simply not being met.

The hashes (so this post is checkable, not just earnest)

prereg document            4cb3c92bbc984e352505425aa0c1ab8672711e04564f7a7c9a773e2b17c0e33d
system prompt              f66ae776912d7918e0fee58e746b74a9b4bb7805ca5f98e88e670332c871370a
corpus/armA_train.jsonl    deddc88aa5a6ba1bf723c064c292d2dc7012c23814b51107f3ab0b0e569202e   <-- VERIFY: ONLY 63 HEX CHARS, RE-HASH BEFORE PUBLISHING
corpus/armB_train.jsonl    0fd0cf7c8063713d82f4e47dd20909b763b146f3eeb82111b51deb5e2a302ca6
corpus/test.jsonl          6787c82a53cb38cf0340ef79e132816707f765c283ebf5e68626841f1c6a888a
train/armA.jsonl (120k)    21c3f70f5830382fdc598f3816a8412857281b9f5eb5c3569127730f884a378e
train/armB.jsonl (120k)    7a7157b708295d0ae8b4cfd49957e79851777ab8c306deac05e94ca3e5c019c3

Seeds (fixed): answerable 42, unanswerable 99, dev {314, 3141}, test {777, 7771}, shuffle 2025
Environment: transformers 4.49.0, torch 2.11.0+cu130, single local GPU, zero cloud spend

The commitment

I will post the verdict — pass or fail, computed once, by the thresholds above — within 72 hours, with test-set numbers and Wilson confidence intervals. If it fails, that result ships too: “plain supervised fine-tuning does not install abstention at this scale” is a finding I have published once before, in a weaker form, and would publish again. The point of pinning the prediction in public is that the negative costs me exactly as much credibility to report as the positive earns — the only condition under which either number means anything.

Want to be told when the verdict lands, or to run the corpus generator and eval harness yourself? Reach out. The receipts are the product.

A pre-registered bet, posted before I look at the results

The claim under test

What the untrained base model already does (photographed, hashed)

The thresholds that decide it (all must hold; fixed now)

Separation

Non-harm (refusal must not cost real answers)

Calibration (reported for continuity, not load-bearing — see scope note up top)

Validity floor

What kills it (any one → NO-GO, no re-run without a new prereg)

The hashes (so this post is checkable, not just earnest)

The commitment

Like this:

Related

The claim under test

What the untrained base model already does (photographed, hashed)

The thresholds that decide it (all must hold; fixed now)

Separation

Non-harm (refusal must not cost real answers)

Calibration (reported for continuity, not load-bearing — see scope note up top)

Validity floor

What kills it (any one → NO-GO, no re-run without a new prereg)

The hashes (so this post is checkable, not just earnest)

The commitment

Share this:

Like this:

Related