The v3.1 vs v4 Decision Gate
A ten-day arc of research, evidence, correction, and verdict. Every entry anchors to a commit; corrections sit alongside releases because correction-driven tightening is part of the evidence.
April 2026
2026-04-10 - v3.1 Data Anchoring Extension shipped
Commit 099bcff. Four additive conventions layered on top of the v3 packet grammar: numeric bundles, entity anchors, causal operator split, summary plus breakdown packet pairs. Backward-compatible with v3 parsers - any unknown convention is safely ignored.
2026-04-11 - v3.1 claims tightened after review
Commit 8fc20c0, authoritative v3.1 spec. Two overclaims rolled back: the <- evidence rule no longer says "Always in ARG1" (now: "use it in the evidence field, not for causal implication"), and the summary claim no longer cites "zero compression cost" (now: "approximately compression-neutral on the 10-packet bakeoff set, +0.4% total chars"). Sets the precedent for the project's evidence-honesty discipline.
2026-04-11 - Production baseline measurement
Commit 6fed4dd. Measured compress.axlprotocol.org v0.9.0 against the authoritative 41K CloudKitchen memo using tiktoken(cl100k_base). Published numbers (2.90x chars, 1.40x tokens) replace earlier estimates (3.27x, 2.69x) that had come from partial reconstructions. Production measurement becomes the only citable ratio from this point forward.
2026-04-11 - Self-bootstrapped v3.1 measurement
Commit af6345b. LLM-bootstrapped v3.1 compression achieves 2.66x chars, 1.49x tokens, 90% round-trip fidelity - independently verifying that the spec is learnable from examples alone.
2026-04-11 - v3.2 Glyph Compression Layer drafted and cold-tested
Specs at commits f176046 and 312fe7d; cold-read data at commit f0a6bcc. Optional additive layer over v3.1 that replaces English labels inside packets with single-token CJK ideograms (金, 人, 上, 下, 大, 小, 高), Greek letters (Δ μ σ π α β γ δ ε λ ω), and math operators (↑ ↓ → ⟹). Two measured insights: (1) emoji are token poison (5 to 7 tokens each in cl100k_base); (2) CJK ideograms are 1 token each and do not trigger language-switching in cold reads. Cold-tested on three non-Anthropic models (Qwen 3.5, Gemini Flash, DeepSeek) with preliminary results in the 96 to 100 percent range on the CloudKitchen corpus under the legacy weighted F scorer. That scorer was later established to be corpus-specific, so v3.2 evidence remains preliminary pending re-scoring under the primary extractor. v3.2 never shipped as an independent v3-line extension; its glyph-palette thinking was absorbed into the v4 Kernel Router's domain modules (the 金 marker in v4's financial Rosetta is inherited from v3.2). Full details are in the v3.2 research brief.
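The glyph layer reduces to a reversible label-for-ideogram substitution. A minimal sketch of the idea, assuming a toy palette; the label names and this particular mapping are illustrative, not the shipped v3.2 table:

```python
# Illustrative sketch of the v3.2 glyph idea: swap multi-token English
# labels inside packets for single-token ideograms/operators.
# The label set and palette here are hypothetical, not the shipped spec.
GLYPHS = {
    "money": "金",      # CJK ideograms measure as 1 token in cl100k_base
    "person": "人",
    "increase": "↑",
    "decrease": "↓",
    "implies": "⟹",
    "mean": "μ",
    "delta": "Δ",
}

def glyph_encode(packet: str) -> str:
    """Replace known labels with their single-token glyphs."""
    for label, glyph in GLYPHS.items():
        packet = packet.replace(label, glyph)
    return packet

def glyph_decode(packet: str) -> str:
    """Inverse mapping; safe here because glyphs never occur in labels."""
    for label, glyph in GLYPHS.items():
        packet = packet.replace(glyph, label)
    return packet
```

The char savings are the point: each glyph is one character and, per the cold-read measurements above, one cl100k_base token, where the English label costs several of each.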
2026-04-11 - v4 Kernel Router blueprint
Commit 371094c. Architecture for pluggable domain Rosetta modules dispatched by a kernel router. Research prototype, not a replacement for v3.1.
2026-04-13 - v4 working prototype
Commit 0f65c95. All four target metrics met: char ratio at least 2.66x, token ratio at least 1.45x, round-trip fidelity at least 75%, stacked wire compression at least 7x.
2026-04-13 - Construction Rosetta module added
Commit 2ba79e1. First domain module beyond prose and financial. Demonstrates that the kernel-router plus pluggable-module architecture can accept a new domain without touching the kernel.
2026-04-13 to 2026-04-14 - Five pre-gate adversarial review rounds (R1 through R5, with an R4.1 sub-finding)
Dual-agent workflow: Claude Code writes, Codex (GPT 5.4) reviews adversarially, the operator steers. Each round tightens a specific invariant. The rounds are anchored by commits:
- R1 (0a5cad4): parser validation required; "AXL compliance" claims now mean kernel-parseable
- R2 (35e26d5): packet-grammar conformance (operation codes, field structure)
- R3 (6228281, be52755): shared canonical form layer plus envelope floor; runtime router gate plus hermetic tests
- R4 (099dbe6): canon_date error namespace isolation; stop stale router constant drift
- R4.1 sub-finding: tight drift tolerance (narrow range would have accepted stale 30 / 50 / 55 baselines)
- R5 (6961dec): tight drift detector tracking the full 58K corpus; sabotage-tested
Post-gate Codex reviews are commit-anchored rather than R-numbered. They appear below as their own dated entries at 5dcdabc, 7da8533, c7704a6, and 8980042. The clean-checkout verification protocol (git clone /tmp/axl-* && pytest) became standard practice during the R3 and R4 rounds and is used on every commit that changes runtime or evidence.
2026-04-14 - Gap 1: construction dollar plus date emitters
Commit 330f53a. Construction fidelity on the tracked 58K corpus: 41.43% to 50.57%. Dollars and dates go from 0% recall to 100%, wired through the shared canonical layer (canon_dollar, canon_date).
2026-04-14 - Gap 2: dimension cap plus canonical short-form recognizer
Commit ab092fa. Construction fidelity: 50.57% to 76.00%. Root cause diagnosed: the extractor did not recognize canonical short-form dimension units (sqft, sqm, standalone m). The fix was cleaner than the obvious "remove the dimension cap" answer because the real leak was an extractor/emitter unit asymmetry.
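A minimal sketch of the short-form recognizer the fix required; the pattern details and function name are assumptions, not the shipped extractor:

```python
import re

# Illustrative recognizer for canonical short-form dimension units
# (the asymmetry behind Gap 2: the emitter wrote 'sqft'/'sqm'/'m'
# but the extractor only knew long forms). Pattern is assumed.
DIM_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(sqft|sqm|ft|m)\b")

def extract_dims(text: str):
    """Return (value, unit) pairs for short-form dimension mentions."""
    return [(float(v), u) for v, u in DIM_RE.findall(text)]
```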
2026-04-14 - Negative-path gate test restored
Commit 623f0b8. Codex caught that the inverted gate test (construction clearing the 60% floor) no longer exercised the negative path. Added test_route_gate_demotes_below_floor which monkeypatches fidelity below the floor and verifies demotion to prose.
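The restored negative-path shape, reduced to a toy router; the router API here is an assumption, and the real test_route_gate_demotes_below_floor monkeypatches fidelity inside the repo rather than passing a dict:

```python
# Toy sketch of the gate invariant the restored test exercises.
# FIDELITY_FLOOR and the route() signature are assumptions.
FIDELITY_FLOOR = 0.60

def route(domain: str, fidelity: dict) -> str:
    """Demote any domain below the fidelity floor to prose."""
    return domain if fidelity.get(domain, 0.0) >= FIDELITY_FLOOR else "prose"

def test_route_gate_demotes_below_floor():
    # Fidelity patched just below the floor must demote to prose.
    assert route("construction", {"construction": 0.59}) == "prose"

def test_route_gate_passes_at_floor():
    assert route("construction", {"construction": 0.76}) == "construction"
```

The point of the restored test is exactly this asymmetry: a gate test that only ever sees passing fidelity proves nothing about demotion.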
2026-04-14 - Gap 3: artifact-driven routing
Commit 29800b4. Replaced the hardcoded _MEASURED_ROUTING_FIDELITY dict with _load_fidelity_artifact() reading from benchmarks/module_fidelity.json. Writer: run_module_bench.py. CI check: run_module_bench.py --check. The router only does schema validation; provenance fields (git_sha, generated_at, fixture_sha256) are for CI and humans, never runtime gating.
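The artifact-validation split can be sketched like this. The JSON field names come from the entry; the function shapes are assumptions:

```python
import json
from pathlib import Path

# Field names are from the commit; the validate/load split is assumed.
REQUIRED_FIELDS = {"git_sha", "generated_at", "fixture_sha256", "fidelity"}

def validate_fidelity_artifact(data: dict) -> dict:
    """Schema validation only: the router never gates on provenance.
    git_sha / generated_at / fixture_sha256 exist for CI and humans."""
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"fidelity artifact missing fields: {sorted(missing)}")
    return data["fidelity"]

def load_fidelity_artifact(path: Path) -> dict:
    """Read back what run_module_bench.py wrote."""
    return validate_fidelity_artifact(json.loads(Path(path).read_text()))
```

Keeping provenance out of runtime gating means a stale git_sha can fail CI loudly without ever changing routing behavior in production.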
2026-04-14 - Cold-read decision-gate kit (v3.1 vs v4)
Commit 9c3247e. First corpus of the decision gate. Kit contains source SHA256, v3.1 and v4 candidates, the operator-facing cold-read prompt, per-model seeds, scorer, metadata, README.
2026-04-14 - Corpus #1 initial result, then corrected
Commits 205a68f to 5dcdabc. First published result overclaimed at 32.01% Gemini recall. Review (and operator oversight) caught a file-handling bug: two LLM sessions had been concatenated into one save file. Corrected Gemini recall is 23.08%. Amended writeup explicitly acknowledges the error, establishes the clean-checkout protocol, and adds structural-warning guards to the scorer. Also establishes the precision-check rule: a clean v4 win requires Δrecall greater than 0 AND Δprecision greater than or equal to 0 simultaneously - no split-sign cases count.
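The precision-check rule is mechanical enough to state as code; the function name is illustrative:

```python
def clean_win(delta_recall: float, delta_precision: float) -> bool:
    """Precision-check rule from the corpus #1 writeup: a clean v4 win
    needs recall strictly up AND precision not down, simultaneously.
    Split-sign results (recall up, precision down) never count."""
    return delta_recall > 0 and delta_precision >= 0
```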
2026-04-14 - Anthropic-family models excluded from cold-read panels
Both Claude Haiku runs at 205a68f opened with explicit meta-commentary identifying the format by name ("This is a Rosetta v3 compositional compression format..."). Confirmed: Anthropic-family models are contaminated by training priors. Control set from this point forward: Gemini Flash, Qwen 3.5, Grok, DeepSeek.
2026-04-14 - Corpus #2 cold-read kit
Commit 4a5559b. Tech-spec corpus (construction 58K). Also adds structural guards to the scorer: mid-line header detection, duplicate h1/h2 detection, cover-page marker clustering, training-prior phrase detection.
2026-04-15 - Longer prompt plus expanded panel
Commit 99c584b. Prompt rewritten for explicit fidelity guidance and anti-meta-commentary clauses. Panel extended to 4 models (adds Grok and DeepSeek alongside Gemini and Qwen). Seeds regenerated.
2026-04-15 - Corpus #2 result: clean sweep on four models
Commit 3987aa3. v4 wins every model on recall AND precision. Clean average Δrecall +36.64 (range +12.85 to +59.71), Δprecision +43.96 (range +40.97 to +55.27). Architectural-generalization claim validated: the v4 modular-Rosetta architecture generalizes to a new domain with cross-model consistency.
2026-04-15 - Prose fallback invariant fix (real compression)
Commit d9f82bc. Discovered during corpus #3 preflight: the v4 prose module was a near-verbatim carrier, not a compressor. On a 35K narrative corpus it produced a blob larger than the source (36,397 chars). Fix: rewrite as a per-block keyword-signature compressor. Compression ratio on the same corpus becomes 2.83x chars, 1.41x tokens - within 13% of the production v3.1 ratio on identical content.
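A toy version of a per-block keyword-signature compressor, to show only the shape of the fix; the real ProseRosetta logic is richer and its internals are not shown here, so everything below (stopword list, cap, regex) is assumed:

```python
import re

# Hypothetical stopword list and keyword cap; illustrative only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "for", "was", "is"}

def block_signature(block: str, keep: int = 8) -> str:
    """Reduce one prose block to an ordered keyword signature:
    drop stopwords, keep first occurrences only, cap the count."""
    words, seen = [], set()
    for w in re.findall(r"[A-Za-z][\w'-]*|\$?\d[\d,.]*%?", block):
        key = w.lower()
        if key in STOPWORDS or key in seen:
            continue
        seen.add(key)
        words.append(w)
        if len(words) == keep:
            break
    return " ".join(words)
```

Emitting a bounded signature per block is what turns the near-verbatim carrier into an actual compressor: output length is capped per block instead of tracking source length.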
2026-04-16 - Corpus #3 kit plus initial result
Commits a7a9254 and 4184bfe. Prose-fallback test: museum repatriation narrative, 35K chars. Initial result: mixed - v4 wins recall (+20.97 avg) but loses precision (-11.40 avg). Grok shows the strongest per-model result; Qwen outlier at -30.27 precision flagged for diagnosis.
2026-04-16 - Qualified reversal of the fold-back conclusion
Commit b176ad2. docs/v4-research-document.md previously concluded "fold v4's formalism back into v3, not to replace v3 with v4" - the preliminary conclusion from before any cold-read evidence. The research doc now carries an AMENDMENT NOTICE at line 34 with the three-corpus evidence table and qualified verdict; the retired quote is preserved at line 207 with a "Retired 2026-04-16" marker. The full superseding verdict:
v4 replaces v3.1 on domain-backed content (corpora #1 and #2, both axes up). On prose fallback, v4 is a recall-favored tradeoff, not a clean replacement. The v4 runtime architecture is independently validated regardless of the wire-format succession question.
2026-04-16 - Prose envelope invariant enforced at runtime
Commit 7da8533 after Codex review of b176ad2. The research doc and CHANGELOG had described the prose no-expansion invariant as "enforced at runtime" - in fact it was not; ProseRosetta.compress() never measured serialized size. Codex reproduced an above-envelope counterexample (911 bytes to 1,762 bytes inflation). Fix: compress() now computes both keyword-signature and passthrough outputs and returns whichever serializes shorter. Adversarial regression tests added for three inflation-prone input shapes.
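The shorter-of-two-encodings fix can be sketched with toy serializations; plain string prefixes stand in for the real packet format, and the signature logic here is a stub:

```python
def compress_with_envelope(block: str) -> str:
    """Prose no-expansion invariant, enforced at runtime: compute both
    the keyword-signature output and a passthrough carrier, return
    whichever serializes shorter. Serializations here are toy stand-ins."""
    signature = "sig:" + " ".join(sorted(set(block.split()))[:6])
    passthrough = "raw:" + block
    return signature if len(signature) < len(passthrough) else passthrough
```

Measuring the serialized size of both candidates at call time is the difference between documenting the invariant and enforcing it, which is exactly the gap Codex's 911-to-1,762-byte counterexample exposed.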
2026-04-16 - Prose precision pass
Commit 595b743. Two targeted fixes based on a direct diagnosis of hallucinated entities in corpus #3 reconstructions: (1) word-aware entity aliasing with a standalone count, so tokens like Marchetti, Selvaria, March, and Redman get their own aliases when they appear standalone (these had been silently blocked by raw-string substring containment against longer multi-word entities); (2) header-derived block signatures now lowercase their title-case non-entity tokens, discouraging cold LLMs from re-rendering them as title-case phrases that false-match the extractor's two-caps-words entity regex.
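The standalone-count distinction in fix (1) reduces to word-boundary matching versus raw substring containment; this sketch uses hypothetical helper names:

```python
import re

def standalone_count(token: str, text: str) -> int:
    """Word-boundary occurrences only. The pre-fix bug was raw substring
    containment: 'Marchetti' contains 'March', which silently blocked
    March from getting its own alias. Boundary matching avoids that."""
    return len(re.findall(rf"\b{re.escape(token)}\b", text))

def needs_alias(token: str, text: str, floor: int = 1) -> bool:
    """Hypothetical gate: alias a token only if it appears standalone."""
    return standalone_count(token, text) >= floor
```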
2026-04-19 - Prose header acronym preservation
Commit c7704a6 after Codex review of 595b743. The header-lowercase rule over-applied; all-uppercase acronyms like UNESCO were being lowercased to unesco, which would suppress a real entity signal if an acronym appears only in a header. Fix: lowercase only mixed-case tokens under is_header; preserve all-uppercase acronyms. Metadata provenance corrected to match the actual commit that regenerated the blob.
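The corrected header rule is a one-line case distinction; the function name is illustrative:

```python
def normalize_header_token(token: str, is_header: bool = True) -> str:
    """Header-signature rule after c7704a6: lowercase only mixed-case
    tokens; all-uppercase acronyms (UNESCO) keep their case so a
    header-only acronym still reads as a real entity signal."""
    if is_header and not token.isupper():
        return token.lower()
    return token
```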
2026-04-20 - Corpus #3 precision pass result
Commit a6785c2. Precision gap closed by 76 percent: Δprecision improved from -11.40 to -2.71 while Δrecall held, +20.97 to +21.47. One model (Grok) flipped to a clean win on both axes (+16.17 / +5.87). Qwen's outlier is fixed (-30.27 to -4.77). Gemini's residual -8.44 is unchanged; the mechanism is identified as model-specific section-title re-capitalization. DeepSeek's reconstruction is flagged as format mimicry (it emitted OBS.95: and PRD.72: prefixes and copied the manifest line). Per the strict interpretive rule, the verdict remains narrowly mixed: near parity, no clean flip.
2026-04-20 - Scorer structural-mimicry guard
Commit 8980042. Closes the methodology gap that let DeepSeek's corpus #3 AXL op-code mimicry pass silently. Eight new regex patterns scanned anywhere in reconstructions (not just the first 600 chars) catch: AXL op codes with confidence suffixes (OBS.88:, PRD.72:, etc.), manifest identifiers (axl-core, @m.B.axl_v4, compressor_manifest), module markers (rosetta:prose, rosetta:financial, etc.), passthrough flag (^mode:passthrough), and format and version markers (fm:axl/, spec:3.1.0). Nine regression tests lock in the coverage for concatenation, opening contamination, structural mimicry, and clean-prose negative cases.
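Representative guard patterns, built from the markers named in this entry; the real scorer's eight patterns may differ in detail, and this sketch shows fewer than eight:

```python
import re

# Sketch of the structural-mimicry guard. Patterns are scanned over the
# whole reconstruction, not just its first 600 chars.
MIMICRY_PATTERNS = [
    re.compile(r"\b[A-Z]{3}\.\d{2}:"),       # op codes with confidence: OBS.88:, PRD.72:
    re.compile(r"\baxl-core\b"),             # manifest identifier
    re.compile(r"\bcompressor_manifest\b"),  # manifest identifier
    re.compile(r"\brosetta:(prose|financial|construction)\b"),  # module markers
    re.compile(r"^mode:passthrough", re.M),  # passthrough flag
    re.compile(r"\bfm:axl/"),                # format marker
    re.compile(r"\bspec:\d+\.\d+\.\d+"),     # version marker
]

def flag_structural_mimicry(reconstruction: str) -> bool:
    """True if the reconstruction copies AXL wire structure instead of
    reconstructing prose; a flagged run cannot score as a clean read."""
    return any(p.search(reconstruction) for p in MIMICRY_PATTERNS)
```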
Current state
v4 is the mainline successor path for domain-backed content. Corpora #1 and #2 both produce clean wins on recall and precision across four independent non-Anthropic models. The qualified reversal of the fold-back conclusion is documented in docs/v4-research-document.md with the full three-corpus evidence table.
Prose fallback remains the qualified slice. Corpus #3 post-precision-pass sits at +21.47 Δrecall / -2.71 Δprecision; recall-positive, precision-near-parity. Grok clean; Qwen outlier fixed; Gemini residual mechanism understood but not yet closed.
v3.1 remains the shipping production protocol. For hallucination-sensitive prose content, for cross-domain content without a specialized module, and for backward compatibility, v3.1 is the right tool. The research evidence strengthens v3.1's public case: its precision on narrative prose is 88 to 97 percent across four cold LLMs, and its compression is a measured 2.90x chars / 1.40x tokens against the authoritative benchmark corpus.
Methodology is strong enough to carry load-bearing public claims. Clean-checkout verification, precision-check rule, structural-warning guards across four contamination classes, independent regex extractor (no LLM scores itself), artifact-driven drift detection with CI --check mode. Every number in this timeline is reproducible from a clean clone of the research repository.