v3.1 vs v4.0.1 - Side by Side
Productized stable vs research-stage qualified successor.
v3.1 is the productized stable release. v4.0.1 is the research-stage public preview, qualified successor on domain-backed content. Both share the same kernel format. The differences are above the kernel: vocabulary, dispatch, modules, and validation regime.
Capability Table
| Capability | v3.1 | v4.0.1 |
|---|---|---|
| Wire format | PKT[VER|CLS|SUB|TAG|ARG1|ARG2|META] | same (unchanged) |
| Kernel size | 75 lines | 94 lines |
| Compression vocabulary | One universal | Router + N modules |
| Domain modules | None | prose, financial, construction, code |
| Math operators | Yes (ARG2 math-arg grammar) | Yes (unchanged) |
| Ideographic composition (v3.2) | Yes | Yes (unchanged) |
| Code compression | No | Yes (lossy structural IR) |
| Drift detection | No | Yes (runtime architecture) |
| Artifact-driven gating | Manual review | Yes (artifact-anchored gate) |
| Backward compatibility | v3 packets parse | v3, v3.1, v3.2 packets parse |
| Tests passing | (legacy count) | 217 / 217 |
| Status | Productized stable | Research preview (qualified successor) |
| Public release label | v3.1 | v4.0.1 (frozen at v4.0.2-r6 / 51e75de) |
| Pip package | axl-core v0.10.x | Pending productization gate close |
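The shared wire format in the table above can be sketched as a minimal parser. The seven field names come from the capability table; the literal `PKT[...]` framing and unescaped pipe delimiter are assumptions, not the published grammar:

```python
from typing import Dict

FIELDS = ("VER", "CLS", "SUB", "TAG", "ARG1", "ARG2", "META")

def parse_packet(raw: str) -> Dict[str, str]:
    """Split a PKT[...|...] wire packet into its seven named fields.

    Assumes the pipe-delimited framing shown in the capability table;
    real packets may escape or quote delimiters differently.
    """
    if not (raw.startswith("PKT[") and raw.endswith("]")):
        raise ValueError("not a PKT frame")
    parts = raw[4:-1].split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))
```

Because both versions share this kernel format, a parser like this would accept v3.1 and v4.0.1 packets alike; the versions diverge only above the kernel.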
Cold-Read Gate Evidence
Three corpora, four non-Anthropic models, executed 2026-04-14 to 2026-04-16. Per the interpretive rule agreed with the operator, a clean win requires dRecall > 0 AND dPrecision >= 0 simultaneously; split-sign results are reported as mixed.
| Corpus | Source | Module | dRecall (v4 - v3.1) | dPrecision | Verdict |
|---|---|---|---|---|---|
| #1 | CloudKitchen investment memo (41K chars) | financial | +15.02 | +14.54 | clean win |
| #2 | Construction technical spec (58K chars) | construction | +36.64 | +43.96 | clean win |
| #3 | Museum repatriation narrative (35K chars) | prose fallback | +20.97 | -11.40 | mixed |
Source: docs/v4-research-document.md AMENDMENT NOTICE (2026-04-16). Per-corpus RESULTS files committed under benchmarks/cold_read/, benchmarks/cold_read_corpus2/, benchmarks/cold_read_corpus3/.
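The verdict rule above can be written out as a small function. The clean-win and mixed cases follow the stated rule; labeling the both-non-positive case a "loss" is an assumption, since the gate evidence only exercises the other two:

```python
def verdict(d_recall: float, d_precision: float) -> str:
    """Cold-read gate rule: a clean win needs dRecall > 0 AND
    dPrecision >= 0 simultaneously; split-sign results are mixed.
    The 'loss' label for the both-non-positive case is an assumption."""
    if d_recall > 0 and d_precision >= 0:
        return "clean win"
    if d_recall <= 0 and d_precision < 0:
        return "loss"
    return "mixed"
```

Applied to the table: corpus #1 (+15.02, +14.54) and corpus #2 (+36.64, +43.96) are clean wins; corpus #3 (+20.97, -11.40) is split-sign, hence mixed.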
Cross-Model Coverage
The cold-read gate ran across four independent non-Anthropic models. Anthropic-family models were excluded after corpus #1 confirmed training-prior contamination (Haiku named the format on first read).
| Model | Provider | Role |
|---|---|---|
| Gemini Flash | Google | Cold-read scorer |
| Qwen 3.5 | Alibaba | Cold-read scorer |
| Grok | xAI | Cold-read scorer |
| DeepSeek | DeepSeek | Cold-read scorer |
| Claude Haiku | Anthropic | Excluded (warm with priors) |
Evidence quote: "both Haiku runs opened with explicit meta-commentary identifying the format by name." Source: benchmarks/cold_read/RESULTS.md.
Compression Ratios
Token compression is roughly half what early thesis projections claimed
The v4 research document was originally written using estimated token counts and tests against a reconstructed partial memo. On 2026-04-11 a production measurement was run against the live compressor at compress.axlprotocol.org v0.9.0 using the authoritative 41,256-char CloudKitchen investment memo and tiktoken with the cl100k_base encoding. The corrected numbers are below.
| Metric | Prior (estimate) | Measured (production) |
|---|---|---|
| Char compression | 3.27x | 2.90x |
| Token compression | 2.69x | 1.40x |
| Tokens saved per message | ~2,500 | 2,582 |
Honest framing. Character compression is close to the estimate. Token compression is roughly half what the earlier tests suggested. Cost savings are real (about 29% of tokens, 2,582 tokens saved per message) but significantly smaller than the "4x token reduction" the original thesis implied. All token-related claims in older v4 material should be read as corrected by this notice.
Source: docs/v4-research-document.md CORRECTION NOTICE (lines 8-30) in axl-research @ v4.0.2-r6-freeze (commit 51e75de). Full production measurement: benchmarks/production_baseline.md. Repo is private; access via productization gate.
Three numbers matter: character compression, token compression, and cold-read semantic recovery (precision and recall). The first two come from production measurement; the third comes from the cold-read gate above. Char and token compression are different metrics, and the gap between them is the headline finding of the 2026-04-11 production baseline.
| Version | Metric | Ratio | Source |
|---|---|---|---|
| v3 baseline | Char compression | 3.06x | cross-model-consensus.md |
| v3.2 ideographic | Char compression | 4.28x | cross-model-consensus.md |
| v4 production (CloudKitchen 41K memo) | Char compression | 2.90x | v4-research-document.md (corrected 2026-04-11) |
| v4 production (CloudKitchen 41K memo) | Token compression (tiktoken cl100k_base) | 1.40x | v4-research-document.md (corrected 2026-04-11) |
| v4 production (CloudKitchen 41K memo) | Tokens saved per message | 2,582 | compress.axlprotocol.org v0.9.0 |
| v4 production (CloudKitchen 41K memo) | Token cost savings | ~29% | derived from 1.40x token ratio |
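The ~29% savings figure in the last row is derived directly from the 1.40x token ratio, as a short worked calculation shows:

```python
# Token cost savings implied by a compression ratio:
# a ratio of r means the compressed text uses 1/r of the original tokens.
token_ratio = 1.40                        # measured: original tokens / compressed tokens
savings_fraction = 1 - 1 / token_ratio    # fraction of tokens eliminated
print(f"{savings_fraction:.1%}")          # 28.6%, reported as ~29%
```

The same arithmetic applied to the original 2.69x estimate would have implied ~63% savings, which is why the correction matters for cost projections.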
Why Char and Token Ratios Diverge
Character compression measures the byte-level shrink of a packet relative to the source. Token compression measures the same shrink in tokenizer units (here, tiktoken with cl100k_base, the GPT-4-family encoding). AXL's compressed packets pack more semantic density per character, but tokenizers segment short symbolic sequences less efficiently than long natural-language strings. The result: a packet that is 2.90x shorter in characters is only 1.40x shorter in tokens. The gap between 2.90x and 1.40x is the cost of tokenization, not a flaw in the kernel.
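The two ratios can be reproduced with one helper that takes a pluggable token counter. For the production baseline the counter would be tiktoken's `cl100k_base` encoder (as in the provenance notes below); the function itself is a sketch, not the measurement harness:

```python
def compression_ratios(source: str, packet: str, count_tokens):
    """Return char and token compression ratios (source / compressed).

    count_tokens is any callable str -> int. For the production baseline
    it would be: lambda t: len(enc.encode(t)), with
    enc = tiktoken.get_encoding("cl100k_base").
    """
    return {
        "char": len(source) / len(packet),
        "token": count_tokens(source) / count_tokens(packet),
    }
```

Because dense symbolic packets segment into more tokens per character than natural prose, the token ratio lands below the char ratio, exactly the 2.90x vs 1.40x gap described above.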
The cold-read gate above measures something different again (semantic recovery via recall + precision) and is the primary v4 vs v3.1 evidence. Compression ratio is necessary but not sufficient: a packet that compresses well but does not survive cold decompression is not a win.
Measurement Provenance
- Tokenizer: `tiktoken` with `cl100k_base` encoding (GPT-4-family).
- Source corpus: CloudKitchen investment memo, 41,256 characters (authoritative full memo, not a reconstructed partial).
- Compressor: `compress.axlprotocol.org` v0.9.0 (production endpoint).
- Measurement date: 2026-04-11.
- Canonical record: CORRECTION NOTICE at the head of `docs/v4-research-document.md` (lines 8-30) and `benchmarks/production_baseline.md` in axl-research @ v4.0.2-r6-freeze (commit 51e75de). Repository is private; full audit access is gated by the productization checkpoint (CC-OPS-AXLSERVER directive sections 14-16). Public mirror lives under github.com/axlprotocol/axl-research with the same privacy posture.
What v3.1 Still Does Best
- Pure narrative prose. Corpus #3 shows v4 prose-fallback gives -11.40 dPrecision relative to v3.1. Until the prose-fallback gap is closed, v3.1 is the precision-favored choice for narrative prose.
- SLA-bound production paths. v3.1 carries the productized badge. v4.0.1 is research-stage until the productization gate (CC-OPS-AXLSERVER directive sections 14-16) closes.
- Tooling reach. The pip-published `axl-core` targets v3.1. Five existing MCP plugins target v3.1.
- Stability. v3.1 has a longer track record. v4 architectural primitives are validated under test discipline but the production deployment runway is shorter.
What v4.0.1 Adds
- Domain-aware compression. Financial and construction modules show clean wins on home-turf corpora.
- Code compression. First AXL surface for source-code packets. Lossy structural IR.
- Pluggable architecture. Adding a new domain means registering a new module name. No kernel change required.
- Drift detection. Runtime architecture catches vocabulary drift across module versions.
- Artifact-driven gating. The production gate is anchored to committed artifacts; gate evidence is committed alongside results.
- Cross-model validated. Cold-read gate ran across four non-Anthropic models with documented contamination exclusion.
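The pluggable-architecture point above can be sketched as a name-keyed module registry with prose fallback. All names and signatures here are hypothetical (the real router API is not public); the point is that adding a domain is one registration, with no kernel change:

```python
MODULES = {}

def register(name):
    """Register a compressor under its router name; adding a new
    domain is one decorator, no kernel change (names hypothetical)."""
    def wrap(fn):
        MODULES[name] = fn
        return fn
    return wrap

@register("prose")
def compress_prose(text):
    return f"PKT[4|prose|...|{len(text)}]"  # placeholder packet body

@register("financial")
def compress_financial(text):
    return f"PKT[4|fin|...|{len(text)}]"    # placeholder packet body

def route(domain, text):
    """Dispatch to the domain module, falling back to prose when no
    module matches (mirrors the corpus #3 'prose fallback' behavior)."""
    return MODULES.get(domain, MODULES["prose"])(text)
```

The fallback path is also why corpus #3 above is scored against the prose module rather than a dedicated narrative module.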
How to Choose
Choose v3.1 for narrative prose and SLA-bound production paths; choose the v4.0.1 preview where the domain modules have home-turf evidence (financial, construction, code). For a complete migration walkthrough see /rosetta/v4/migration/. For the full research-stage status and AMENDMENT NOTICE, see /rosetta/v4/research/.