AXL / ROSETTA / V4.0.1 / RESEARCH

v4.0.1 Research Status

Research-stage public preview. Productization gated.

v4.0.1 is the public preview of a research-stage release. It is a qualified successor to v3.1 on domain-backed content and a recall-favored tradeoff on prose fallback. Productization is gated per CC-OPS-AXLSERVER directive sections 14-16.

Qualified Successor Verdict

The verdict below is the canonical AMENDMENT NOTICE language from docs/v4-research-document.md (2026-04-16), reproduced verbatim. It replaces the prior "fold v4 back into v3" conclusion that predated cold-read evidence. The amendment was authored when 201 tests were passing; the current count is 217.

  1. On domain-backed content (v4 has a dedicated Rosetta module), v4 replaces v3.1. Both recall and precision are materially higher on both home-turf corpora. Cross-model consistency held across four independent non-Anthropic models.
  2. On prose fallback (v4 has no domain module), v4 is a recall-favored tradeoff, not a clean replacement. v4's keyword-signature compression gives cold LLMs more entity hooks (recall up) but also leads LLMs to hallucinate more false entity mentions when reassembling prose from keyword spines (precision down). Precision-sensitive use cases on pure narrative prose may prefer v3.1 until this gap is closed.
  3. The v4 runtime architecture is independently validated. Kernel router, pluggable Rosetta modules, shared canonical form layer, artifact-driven gating, and drift detection are all implemented and under test discipline at 201 tests passing at the time of this amendment.
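The keyword-signature compression named in claim 2 can be illustrated with a minimal sketch. All names here (keyword_signature, STOPWORDS, the entity heuristic) are hypothetical illustrations of the described mechanism, not the actual v4 API:

```python
# Hypothetical sketch of keyword-signature compression for prose fallback.
# Function and constant names are illustrative, not the src/axl_v4/ interface.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was", "it"}

def keyword_signature(paragraph: str, max_keywords: int = 8) -> list[str]:
    """Reduce a paragraph to a ranked keyword spine.

    Capitalized tokens (likely entities) are kept first so a cold
    reader has entity hooks; remaining slots go to frequent content words.
    """
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", paragraph)
    entities = [t for t in tokens
                if t[0].isupper() and t.lower() not in STOPWORDS]
    content = [t.lower() for t in tokens
               if t.lower() not in STOPWORDS and not t[0].isupper()]
    ranked = list(dict.fromkeys(entities))  # entities first, first-seen order
    for word, _ in Counter(content).most_common():
        if len(ranked) >= max_keywords:
            break
        ranked.append(word)
    return ranked[:max_keywords]
```

A spine like this explains both halves of the tradeoff: entities survive compression (recall up), but a decompressing model reassembling prose from the spine may connect entities that never co-occurred in the source (precision down), which is the corpus #3 failure mode.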

Cold-Read Gate Evidence

Three corpora, four non-Anthropic models, executed 2026-04-14 to 2026-04-16. Primary scorer: measure_fidelity recall and precision. The interpretive rule, agreed with the operator, is that a clean win requires dRecall > 0 AND dPrecision >= 0 simultaneously; split-sign results are reported as mixed.

Corpus | Source                                    | Module         | dRecall (v4 - v3.1) | dPrecision | Verdict
#1     | CloudKitchen investment memo (41K chars)  | financial      | +15.02              | +14.54     | clean win
#2     | Construction technical spec (58K chars)   | construction   | +36.64              | +43.96     | clean win
#3     | Museum repatriation narrative (35K chars) | prose fallback | +20.97              | -11.40     | mixed

Corpora, seeds, reconstructions, scoring scripts, and per-corpus RESULTS files are committed under benchmarks/cold_read/, benchmarks/cold_read_corpus2/, and benchmarks/cold_read_corpus3/ in the axl-research repository.
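The interpretive rule reduces to a one-function sketch. The "regression" label for the remaining case is an assumption for completeness; the document itself only names "clean win" and "mixed":

```python
def gate_verdict(d_recall: float, d_precision: float) -> str:
    """Classify a v4-vs-v3.1 delta pair under the agreed interpretive rule:
    a clean win requires dRecall > 0 AND dPrecision >= 0 simultaneously;
    split-sign results are mixed; anything else is a regression (assumed label).
    """
    if d_recall > 0 and d_precision >= 0:
        return "clean win"
    if d_recall > 0 or d_precision > 0:
        return "mixed"
    return "regression"
```

Applied to the table above: corpus #1 (+15.02, +14.54) and #2 (+36.64, +43.96) classify as clean wins; corpus #3 (+20.97, -11.40) is split-sign and classifies as mixed.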

Test Status

The v4 implementation is under test discipline.

  • 217 of 217 tests passing in 2.66 seconds (verified 2026-04-25, session start)
  • Source freeze: tag v4.0.2-r6-freeze
  • HEAD commit: 51e75de (ci tiktoken install fix)
  • Test runner: pytest
  • Coverage: kernel parsing, router dispatch, all four implemented Rosetta modules (prose / financial / construction / code), canonical form layer, drift detection, fidelity scoring
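The kernel-router-plus-pluggable-modules architecture under test can be sketched as follows. The class and method names are illustrative assumptions, not the actual src/axl_v4/ interfaces; only the four module domains match the list above:

```python
# Hypothetical sketch of the kernel router pattern described above.
# APIs are illustrative, not the actual src/axl_v4/ implementation.
from typing import Callable

RosettaModule = Callable[[str], str]

class KernelRouter:
    def __init__(self, fallback: RosettaModule):
        self._modules: dict[str, RosettaModule] = {}
        self._fallback = fallback

    def register(self, domain: str, module: RosettaModule) -> None:
        self._modules[domain] = module

    def dispatch(self, domain: str, packet: str) -> str:
        """Route a packet to its domain module, else to the prose fallback."""
        return self._modules.get(domain, self._fallback)(packet)

router = KernelRouter(fallback=lambda p: f"prose:{p}")
router.register("financial", lambda p: f"financial:{p}")
router.register("construction", lambda p: f"construction:{p}")
router.register("code", lambda p: f"code:{p}")
```

This shape matches the gate results: content with a registered domain module gets domain-specific handling, while everything else falls through to the prose path whose precision gap corpus #3 exposed.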

The amendment notice was authored at 201 tests passing. The 217-test count reflects continued test additions between 2026-04-16 and 2026-04-25 without behavior regression. Substantive milestones in this window:

  • 51e75de ci tiktoken install fix (HEAD)
  • 343a9c1 explain archive_4184bfe
  • 50082c2 document module_fidelity.json scope
  • cd674c3 add Codex-authored decompression
  • 30370bd archive pre-gate-2026-04-14

Dual-Agent Discipline

v4 is built under a dual-agent research discipline. Two AI agents own non-overlapping scopes and challenge each other through documented adversarial review rounds. The operator (Diego Carranza) acts as final arbiter.

Specifications + Research Narrative

Claude Code Opus 4.7

Owns spec/v4-*.md, research documents, narrative framing, gate interpretation, public-facing copy.

Anchored to evidence-first writing. Refuses to ship claims that are not corpus-backed.

Implementation + Tests

OpenAI Codex GPT-5.4

Owns src/axl_v4/ implementation, test harness, fidelity scoring, decompression code (committed at cd674c3).

Anchored to test discipline. Refuses to merge claims that lack a passing test.

Adversarial review rounds (committed dialogue):

  • docs/codex-r1-challenges-response.md - Round 1 challenges from Codex with response
  • docs/codex-r2-counter-response.md - Round 2 counter-response
  • docs/cross-model-consensus.md - Multi-model evidence convergence (v3 baseline through v3.2)

The cold-read gate is the public output of this discipline: independent, third-party model evaluation across four non-Anthropic models, with documented exclusion of Anthropic-family models for training-prior contamination.

Anthropic-Contamination Exclusion

Anthropic-family models (Claude Haiku, Sonnet, Opus) were excluded from the cold-read gate after corpus #1. The exclusion was not arbitrary. It was forced by direct evidence of training-prior contamination.

"both Haiku runs opened with explicit meta-commentary identifying the format by name ('This is a Rosetta v3 compositional compression format...'), which is warm-with-priors, not cold recovery." (benchmarks/cold_read/RESULTS.md, quoted in the v4-research-document.md AMENDMENT NOTICE)

A cold-read gate measures what an LLM can recover from a compressed packet without prior knowledge of the format. A model that names the format on first read is not cold. Cross-model coverage requires models that have not been trained on the format documentation. Of the major model families:

  • Excluded (warm with priors): Anthropic Claude family (Haiku, Sonnet, Opus). Confirmed at corpus #1.
  • Included (cold): Gemini Flash (Google), Qwen 3.5 (Alibaba), Grok (xAI), DeepSeek (DeepSeek). All four scored consistently across all three corpora.
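The exclusion criterion itself is mechanical enough to sketch. The marker list below is a hypothetical illustration; the actual gate decision came from human review of the cold-read transcripts in benchmarks/cold_read/RESULTS.md:

```python
# Illustrative screening check for the warm-with-priors exclusion.
# Marker list is hypothetical, not the actual review criterion.
FORMAT_MARKERS = ("rosetta", "axl", "compositional compression")

def is_warm_with_priors(first_response: str) -> bool:
    """A model that names the format on first read is not cold."""
    lowered = first_response.lower()
    return any(marker in lowered for marker in FORMAT_MARKERS)
```

The quoted Haiku transcript fails this check immediately: naming "Rosetta v3 compositional compression format" on first read is format recognition, not cold recovery.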

The cross-model consistency on excluded-Anthropic data is the strongest available evidence that the cold-read gate measures format-independent recoverability rather than family-specific priors.

What v4 Does Not Yet Settle

The prose-fallback result (corpus #3) is from a single corpus. The qualification on claim 2 of the verdict ("v4 is a recall-favored tradeoff, not a clean replacement on prose") stands until all three of the following are complete:

  1. Substrate fix. A change to the prose-fallback module that closes the precision gap without giving back the recall advantage. Candidates under investigation: tighter keyword-signature token cap, entity-only signatures, explicit non-present-entity signals.
  2. Re-run corpus #3 after fix. Confirm the verdict flips from mixed to clean on both axes (recall AND precision positive).
  3. Additional prose corpus. At least one additional prose corpus confirms the clean result generalizes beyond a single document.

Until all three are complete, the qualification stands and v3.1 remains the precision-favored choice for narrative prose.
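The first candidate fix, a tighter keyword-signature token cap, can be sketched as a budget trim. This uses whitespace tokens purely for illustration (the project pins tiktoken for real token counts), and the function name is hypothetical:

```python
# Sketch of the "tighter token cap" candidate fix, using whitespace
# tokens for illustration rather than the project's real tokenizer.
def cap_signature(keywords: list[str], max_tokens: int) -> list[str]:
    """Trim a keyword spine to a token budget. A shorter spine gives a
    cold reader fewer chances to hallucinate entity links, trading a
    little recall for precision on prose fallback."""
    kept, used = [], 0
    for kw in keywords:
        cost = len(kw.split())
        if used + cost > max_tokens:
            break
        kept.append(kw)
        used += cost
    return kept
```

Whether a cap like this closes the corpus #3 precision gap without giving back the recall advantage is exactly what steps 2 and 3 above are designed to test.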

How to Follow the Work