AXL / ROSETTA / V3.1 / EVIDENCE

AXL Rosetta v3.1 - Evidence Brief

Status: Shipping. Authoritative at spec commit 8fc20c0 (2026-04-11).
Production endpoint: compress.axlprotocol.org/api/v1/compress-text (running v0.9.0).

Methodology update - 2026-04-22. The numbers in this brief (2.90x chars, 1.40x tokens on the CloudKitchen 41K corpus) were measured with tiktoken(cl100k_base) directly and remain authoritative. The production API field metrics.tokens_saved_pct was a char-ratio heuristic that overstated token savings by approximately 2.3x on short inputs and clamped negative results to zero. As of this date the API returns real tiktoken counts in new fields (tokens_in_cl100k, tokens_out_cl100k, tokens_saved_pct_cl100k, and the o200k equivalents), with the legacy fields marked deprecated: true and scheduled for removal in v0.11.0. Short-input compression (below ~20K chars) expands token count rather than reducing it - AXL Rosetta v3.1 is now positioned as a corpus-scale .md-relay protocol, not a short-prompt compressor. See the methodology post for the full correction.

What v3.1 is

AXL Rosetta v3.1 is a compositional compression protocol for LLM-to-LLM communication. The v3.1 Data Anchoring Extension added four additive conventions to the v3 base:

Numeric bundles - label[$value,qualifier,%] preserves value-with-context in a single compact token group
Entity anchors - @ent.XX aliases replace repeated named entities with short handles
Causal operator split - <- for evidence, => for causal, -> for numeric transitions
Summary + breakdown packet pairs - summary packet followed by itemized components

The extension is backward-compatible with v3 parsers: any v3-aware consumer ignores anchors and bundles it does not recognize.

Measured compression (production v0.9.0, CloudKitchen 41K corpus)

axis	source	compressed	ratio
characters	41,256	14,215	2.90x
tokens (cl100k_base)	9,028	6,446	1.40x

The token ratio is the measured value from the production API against the authoritative source, using the same tokenizer LLMs use for billing. Earlier estimated numbers (2.69x / 3.27x) came from partial reconstructions and were corrected to the production measurement in 2026-04-11. The production ratio is the only one worth citing externally.

Measured cold-read performance

Cold-read benchmarks put the v3.1 compressed output in front of non-Anthropic LLMs (Gemini Flash, Qwen 3.5, Grok, DeepSeek) with zero protocol spec, zero examples, and minimal reconstruction instructions. Fact recovery is measured with an independent regex extractor - no LLM scores itself.

v3.1 primary-scorer cold-read results across three corpora:

corpus	content	v3.1 recall (mean)	v3.1 precision (mean)
Museum 35K	narrative prose	38% (4 clean models)	92%
CloudKitchen 41K	investment memo	26% (2 clean models)	79%
Construction 58K	technical spec	23% (4 clean models)	42%

v3.1's precision on narrative prose is 88 to 97 percent across four independent non-Anthropic models (Gemini Flash, Qwen 3.5, Grok, DeepSeek). When what matters is "the facts it recovers are the right facts," v3.1 is the reliable choice on prose content. The pattern is consistent: v3.1's fragment-style packet format constrains cold-LLM reconstruction to the narrow context of each packet, which produces fewer recovered facts but higher hit-rate on the facts that do come through.

The numbers are from numeric-extractor measurements in the decision-gate results files at the commits where each corpus was scored. Commits are cited at the bottom of this brief.

Where v3.1 is the right tool

Hallucination-sensitive use cases on prose content. Legal, literary, journalism - any domain where a fact stated in reconstruction needs to actually be in the source. The fragment-style packet format constrains the receiver's reconstruction to the narrow context of each packet; the receiver hallucinates fewer adjacent facts than it would when assembling prose from a looser keyword-signature representation.
Cross-domain content without a specialized module. v3.1 handles financial, construction, narrative, and mixed-genre content with the same packet grammar. No router, no per-domain vocabulary - one compressor, one contract.
Single-producer to many-consumer topologies. Where the producer does not know the consumer's domain priors, v3.1's generic structure maximizes decompressibility across the long tail of consumer models.
Backward-compatible integration. v3 parsers still parse v3.1 output; unknown conventions are safely ignored.

Known limitations

Token inflation on structured technical content. On the construction 58K corpus, v3.1 produces more tokens (16,076) than the source (14,782). v3.1's generic fields do not capture domain-specific units, dimensions, or code references in recoverable form. For construction, regulatory, or other heavy-structure domains, a domain-matched compressor is more efficient.
Recall ceiling on home-turf corpora. Where a receiving LLM has priors for a specific domain, v3.1's fragment-style fields recover fewer facts than a structured-domain compressor. This is the recall-precision tradeoff; see the measured numbers above.
CloudKitchen-tuned secondary scorer. The original fidelity_score.py weighted F formula used hand-curated CloudKitchen keyword lists and is not portable across corpora. The primary scorer (measure_fidelity with independent regex extractor) is corpus-agnostic and is the authoritative signal.

How to use

curl -X POST https://compress.axlprotocol.org/api/v1/compress-text \
  -H 'Content-Type: application/json' \
  -d '{"text": "<your prose here>"}'

Response: {compressed, metrics, packets}.

compressed - the serialized packet stream
packets - the same, as a list of strings
metrics.ratio - character ratio (authoritative)
metrics.packets_count - packet count (authoritative)

Do not use metrics.input_tokens_est, metrics.output_tokens_est, or metrics.tokens_saved_pct. These are server-side estimates derived from character ratios, and they overstate token savings by approximately 2.3x on typical content (the server reports 65.5% savings where the real tiktoken measurement is 28.6%). This is a known server-side issue documented at benchmarks/production_baseline.md and scheduled for a future production update. For authoritative token counts, encode the input and output strings with tiktoken(cl100k_base) directly; every token-based claim in this brief uses that method.

For the full v3.1 specification including grammar and operation-code taxonomy, see spec/v3.1-data-anchoring.md in the research repository.

Evidence links

Spec: commit 8fc20c0 (tightens data-anchoring claims and provenance rule)
Production baseline measurement: benchmarks/production_baseline.md
Cold-read decision gate, corpus #1: benchmarks/cold_read/RESULTS.md
Cold-read decision gate, corpus #2: benchmarks/cold_read_corpus2/RESULTS.md
Cold-read decision gate, corpus #3: benchmarks/cold_read_corpus3/RESULTS_precision_pass.md

All cold-read kits are self-contained: source corpus, compressed candidates, model reconstructions, scorer, metadata with SHA256 hashes and generator-commit provenance. Every number in this brief is reproducible from a clean clone of the research repository.