v3.1 vs v4.0.1 - Side by Side
Productized stable vs research-stage qualified successor.
v3.1 is the productized stable release. v4.0.1 is the research-stage public preview, qualified successor on domain-backed content. Both share the same kernel format. The differences are above the kernel: vocabulary, dispatch, modules, and validation regime.
Capability Table
| Capability | v3.1 | v4.0.1 |
|---|---|---|
| Wire format | PKT[VER|CLS|SUB|TAG|ARG1|ARG2|META] | same (unchanged) |
| Kernel size | 75 lines | 94 lines |
| Compression vocabulary | One universal | Router + N modules |
| Domain modules | None | prose, financial, construction, code |
| Math operators | Yes (ARG2 math-arg grammar) | Yes (unchanged) |
| Ideographic composition (v3.2) | Yes | Yes (unchanged) |
| Code compression | No | Yes (lossy structural IR) |
| Drift detection | No | Yes (runtime architecture) |
| Artifact-driven gating | Manual review | Yes (artifact-anchored gate) |
| Backward compatibility | v3 packets parse | v3, v3.1, v3.2 packets parse |
| Tests passing | (legacy count) | 217 / 217 |
| Status | Productized stable | Research preview (qualified successor) |
| Public release label | v3.1 | v4.0.1 (frozen at v4.0.2-r6 / 51e75de) |
| Pip package | axl-core v0.10.x | Pending productization gate close |
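The shared wire format in the table above can be sketched as a minimal parser. The seven field names come from the capability table; the literal `PKT[...]` framing and unescaped pipe delimiter are assumptions, not the published grammar:

```python
from typing import Dict

FIELDS = ("VER", "CLS", "SUB", "TAG", "ARG1", "ARG2", "META")

def parse_packet(raw: str) -> Dict[str, str]:
    """Split a PKT[...|...] wire packet into its seven named fields.

    Assumes the pipe-delimited framing shown in the capability table;
    real packets may escape or quote delimiters differently.
    """
    if not (raw.startswith("PKT[") and raw.endswith("]")):
        raise ValueError("not a PKT frame")
    parts = raw[4:-1].split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))
```

Because both versions share this kernel format, a parser like this would accept v3.1 and v4.0.1 packets alike; the versions diverge only above the kernel.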
Cold-Read Gate Evidence
Three corpora, four non-Anthropic models, executed 2026-04-14 to 2026-04-16. Per the interpretive rule agreed with the operator, a clean win requires dRecall > 0 AND dPrecision >= 0 simultaneously; split-sign results are reported as mixed.
| Corpus | Source | Module | dRecall (v4 - v3.1) | dPrecision | Verdict |
|---|---|---|---|---|---|
| #1 | CloudKitchen investment memo (41K chars) | financial | +15.02 | +14.54 | clean win |
| #2 | Construction technical spec (58K chars) | construction | +36.64 | +43.96 | clean win |
| #3 | Museum repatriation narrative (35K chars) | prose fallback | +20.97 | -11.40 | mixed |
Source: docs/v4-research-document.md AMENDMENT NOTICE (2026-04-16). Per-corpus RESULTS files committed under benchmarks/cold_read/, benchmarks/cold_read_corpus2/, benchmarks/cold_read_corpus3/.
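The verdict rule above can be written out as a small function. The clean-win and mixed cases follow the stated rule; labeling the both-non-positive case a "loss" is an assumption, since the gate evidence only exercises the other two:

```python
def verdict(d_recall: float, d_precision: float) -> str:
    """Cold-read gate rule: a clean win needs dRecall > 0 AND
    dPrecision >= 0 simultaneously; split-sign results are mixed.
    The 'loss' label for the both-non-positive case is an assumption."""
    if d_recall > 0 and d_precision >= 0:
        return "clean win"
    if d_recall <= 0 and d_precision < 0:
        return "loss"
    return "mixed"
```

Applied to the table: corpus #1 (+15.02, +14.54) and corpus #2 (+36.64, +43.96) are clean wins; corpus #3 (+20.97, -11.40) is split-sign, hence mixed.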
Cross-Model Coverage
The cold-read gate ran across four independent non-Anthropic models. Anthropic-family models were excluded after corpus #1 confirmed training-prior contamination (Haiku named the format on first read).
| Model | Provider | Role |
|---|---|---|
| Gemini Flash | Google | Cold-read scorer |
| Qwen 3.5 | Alibaba | Cold-read scorer |
| Grok | xAI | Cold-read scorer |
| DeepSeek | DeepSeek | Cold-read scorer |
| Claude Haiku | Anthropic | Excluded (warm with priors) |
Evidence quote: "both Haiku runs opened with explicit meta-commentary identifying the format by name." Source: benchmarks/cold_read/RESULTS.md.
Compression Ratios
Token compression is roughly half what early thesis projections claimed
The v4 research document was originally written using estimated token counts and tests against a reconstructed partial memo. On 2026-04-11 a production measurement was run against the live compressor at compress.axlprotocol.org v0.9.0 using the authoritative 41,256-char CloudKitchen investment memo and tiktoken with the cl100k_base encoding. The corrected numbers are below.
| Metric | Prior (estimate) | Measured (production) |
|---|---|---|
| Char compression | 3.27x | 2.90x |
| Token compression | 2.69x | 1.40x |
| Tokens saved per message | ~2,500 | 2,582 |
Honest framing. Character compression is close to the estimate. Token compression is roughly half what the earlier tests suggested. Cost savings are real (about 29% of tokens, 2,582 tokens saved per message) but significantly smaller than the "4x token reduction" the original thesis implied. All token-related claims in older v4 material should be read as corrected by this notice.
Source: docs/v4-research-document.md CORRECTION NOTICE (lines 8-30) in axl-research @ v4.0.2-r6-freeze (commit 51e75de). Full production measurement: benchmarks/production_baseline.md. Repo is private; access via productization gate.
Three numbers matter: character compression, token compression, and cold-read semantic recovery (precision and recall). The first two come from production measurement; the third comes from the cold-read gate above. Char and token compression are different metrics, and the gap between them is the headline finding of the 2026-04-11 production baseline.
| Version | Metric | Ratio | Source |
|---|---|---|---|
| v3 baseline | Char compression | 3.06x | cross-model-consensus.md |
| v3.2 ideographic | Char compression | 4.28x | cross-model-consensus.md |
| v4 production (CloudKitchen 41K memo) | Char compression | 2.90x | v4-research-document.md (corrected 2026-04-11) |
| v4 production (CloudKitchen 41K memo) | Token compression (tiktoken cl100k_base) | 1.40x | v4-research-document.md (corrected 2026-04-11) |
| v4 production (CloudKitchen 41K memo) | Tokens saved per message | 2,582 | compress.axlprotocol.org v0.9.0 |
| v4 production (CloudKitchen 41K memo) | Token cost savings | ~29% | derived from 1.40x token ratio |
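The ~29% savings figure in the last row is derived directly from the 1.40x token ratio, as a short worked calculation shows:

```python
# Token cost savings implied by a compression ratio:
# a ratio of r means the compressed text uses 1/r of the original tokens.
token_ratio = 1.40                        # measured: original tokens / compressed tokens
savings_fraction = 1 - 1 / token_ratio    # fraction of tokens eliminated
print(f"{savings_fraction:.1%}")          # 28.6%, reported as ~29%
```

The same arithmetic applied to the original 2.69x estimate would have implied ~63% savings, which is why the correction matters for cost projections.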
Why Char and Token Ratios Diverge
Character compression measures the byte-level shrink of a packet relative to the source. Token compression measures the same shrink in tokenizer units (here, tiktoken with cl100k_base, the GPT-4-family encoding). AXL's compressed packets pack more semantic density per character, but tokenizers segment short symbolic sequences less efficiently than long natural-language strings. The result: a packet that is 2.90x shorter in characters is only 1.40x shorter in tokens. The gap between 2.90x and 1.40x is the cost of tokenization, not a flaw in the kernel.
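The two ratios can be reproduced with one helper that takes a pluggable token counter. For the production baseline the counter would be tiktoken's `cl100k_base` encoder (as in the provenance notes below); the function itself is a sketch, not the measurement harness:

```python
def compression_ratios(source: str, packet: str, count_tokens):
    """Return char and token compression ratios (source / compressed).

    count_tokens is any callable str -> int. For the production baseline
    it would be: lambda t: len(enc.encode(t)), with
    enc = tiktoken.get_encoding("cl100k_base").
    """
    return {
        "char": len(source) / len(packet),
        "token": count_tokens(source) / count_tokens(packet),
    }
```

Because dense symbolic packets segment into more tokens per character than natural prose, the token ratio lands below the char ratio, exactly the 2.90x vs 1.40x gap described above.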
The cold-read gate above measures something different again (semantic recovery via recall + precision) and is the primary v4 vs v3.1 evidence. Compression ratio is necessary but not sufficient: a packet that compresses well but does not survive cold decompression is not a win.
Measurement Provenance
- Tokenizer: `tiktoken` with `cl100k_base` encoding (GPT-4-family).
- Source corpus: CloudKitchen investment memo, 41,256 characters (authoritative full memo, not a reconstructed partial).
- Compressor: `compress.axlprotocol.org` v0.9.0 (production endpoint).
- Measurement date: 2026-04-11.
- Canonical record: CORRECTION NOTICE at the head of `docs/v4-research-document.md` (lines 8-30) and `benchmarks/production_baseline.md` in axl-research @ v4.0.2-r6-freeze (commit 51e75de). Repository is private; full audit access is gated by the productization checkpoint (CC-OPS-AXLSERVER directive sections 14-16). Public mirror lives under github.com/axlprotocol/axl-research with the same privacy posture.
What v3.1 Still Does Best
- Pure narrative prose. Corpus #3 shows v4 prose-fallback gives -11.40 dPrecision relative to v3.1. Until the prose-fallback gap is closed, v3.1 is the precision-favored choice for narrative prose.
- SLA-bound production paths. v3.1 carries the productized badge. v4.0.1 is research-stage until the productization gate (CC-OPS-AXLSERVER directive sections 14-16) closes.
- Tooling reach. The pip-published `axl-core` targets v3.1. Five existing MCP plugins target v3.1.
- Stability. v3.1 has a longer track record. v4 architectural primitives are validated under test discipline but the production deployment runway is shorter.
What v4.0.1 Adds
- Domain-aware compression. Financial and construction modules show clean wins on home-turf corpora.
- Code compression. First AXL surface for source-code packets. Lossy structural IR.
- Pluggable architecture. Adding a new domain means registering a new module name. No kernel change required.
- Drift detection. Runtime architecture catches vocabulary drift across module versions.
- Artifact-driven gating. The production gate is anchored to committed artifacts; gate evidence is committed alongside results.
- Cross-model validated. Cold-read gate ran across four non-Anthropic models with documented contamination exclusion.
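The pluggable-architecture point above can be sketched as a name-keyed module registry with prose fallback. All names and signatures here are hypothetical (the real router API is not public); the point is that adding a domain is one registration, with no kernel change:

```python
MODULES = {}

def register(name):
    """Register a compressor under its router name; adding a new
    domain is one decorator, no kernel change (names hypothetical)."""
    def wrap(fn):
        MODULES[name] = fn
        return fn
    return wrap

@register("prose")
def compress_prose(text):
    return f"PKT[4|prose|...|{len(text)}]"  # placeholder packet body

@register("financial")
def compress_financial(text):
    return f"PKT[4|fin|...|{len(text)}]"    # placeholder packet body

def route(domain, text):
    """Dispatch to the domain module, falling back to prose when no
    module matches (mirrors the corpus #3 'prose fallback' behavior)."""
    return MODULES.get(domain, MODULES["prose"])(text)
```

The fallback path is also why corpus #3 above is scored against the prose module rather than a dedicated narrative module.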
How to Choose
Choose v3.1 for narrative prose and SLA-bound production paths; choose the v4.0.1 preview where the domain modules have home-turf evidence (financial, construction, code). For a complete migration walkthrough see /rosetta/v4/migration/. For the full research-stage status and AMENDMENT NOTICE, see /rosetta/v4/research/.