AXL / ROSETTA / V3.1 / EVIDENCE

AXL Rosetta v3.1 - Evidence Brief

Status: Shipping. Authoritative at spec commit 8fc20c0 (2026-04-11).
Production endpoint: compress.axlprotocol.org/api/v1/compress-text (running v0.9.0).


What v3.1 is

AXL Rosetta v3.1 is a compositional compression protocol for LLM-to-LLM communication. The v3.1 Data Anchoring Extension added four additive conventions to the v3 base.

The extension is backward-compatible with v3 parsers: any v3-aware consumer ignores anchors and bundles it does not recognize.


Measured compression (production v0.9.0, CloudKitchen 41K corpus)

axis                   source    compressed    ratio
characters             41,256    14,215        2.90x
tokens (cl100k_base)    9,028     6,446        1.40x

The token ratio is the measured value from the production API against the authoritative source, using the same tokenizer LLMs use for billing. Earlier estimates (2.69x / 3.27x) came from partial reconstructions and were corrected to the production measurement on 2026-04-11. The production ratio is the only one worth citing externally.
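The gap between the character ratio and the token ratio can be checked directly from the measured counts above (a quick arithmetic sketch, nothing more):

```python
# Measured counts from the CloudKitchen 41K run (production v0.9.0).
SOURCE_CHARS, COMPRESSED_CHARS = 41_256, 14_215
SOURCE_TOKENS, COMPRESSED_TOKENS = 9_028, 6_446

char_ratio = SOURCE_CHARS / COMPRESSED_CHARS      # ~2.90x
token_ratio = SOURCE_TOKENS / COMPRESSED_TOKENS   # ~1.40x

# Character savings overstate token savings: the compressed packet
# format packs more characters into each token than prose does.
print(f"chars: {char_ratio:.2f}x, tokens: {token_ratio:.2f}x")
```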


Measured cold-read performance

Cold-read benchmarks present the v3.1 compressed output to non-Anthropic LLMs (Gemini Flash, Qwen 3.5, Grok, DeepSeek) with zero protocol spec, zero examples, and minimal reconstruction instructions. Fact recovery is measured with an independent regex extractor; no LLM scores itself.

v3.1 primary-scorer cold-read results across three corpora:

corpus             content           v3.1 recall (mean)     v3.1 precision (mean)
Museum 35K         narrative prose   38% (4 clean models)   92%
CloudKitchen 41K   investment memo   26% (2 clean models)   79%
Construction 58K   technical spec    23% (4 clean models)   42%

v3.1's precision on narrative prose is 88 to 97 percent across four independent non-Anthropic models (Gemini Flash, Qwen 3.5, Grok, DeepSeek). When what matters is that the facts recovered are the right facts, v3.1 is the reliable choice on prose content. The pattern is consistent: v3.1's fragment-style packet format constrains cold-LLM reconstruction to the narrow context of each packet, which produces fewer recovered facts but a higher hit rate on the facts that do come through.
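The recall/precision arithmetic behind these numbers is straightforward set math over extracted facts. The actual scorer (measure_fidelity) is in the research repository; the sketch below is a minimal stand-in whose regex and sample strings are illustrative, not the real extractor:

```python
import re

# Illustrative stand-in for the independent regex extractor: pulls
# numeric facts (optionally $-prefixed, %/x-suffixed) from text.
FACT_RE = re.compile(r"\$?\d(?:[\d,]*\d)?(?:\.\d+)?(?:%|x)?")

def extract_facts(text: str) -> set[str]:
    return set(FACT_RE.findall(text))

def score(source: str, reconstruction: str) -> tuple[float, float]:
    truth = extract_facts(source)
    found = extract_facts(reconstruction)
    hits = truth & found
    recall = len(hits) / len(truth) if truth else 0.0       # of the source facts, how many came back
    precision = len(hits) / len(found) if found else 0.0    # of the recovered facts, how many are real
    return recall, precision

src = "The memo cites $41,256 in costs, a 2.90x ratio, and 9,028 tokens."
rec = "Costs were $41,256 at a 2.90x ratio, with roughly 9,000 tokens."
r, p = score(src, rec)  # "9,000" is a hallucinated near-miss: it costs precision
```

Note how the hallucinated "9,000" counts against precision but not recall; that asymmetry is exactly what the precision column above rewards.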

The numbers are from numeric-extractor measurements in the decision-gate results files at the commits where each corpus was scored. Commits are cited at the bottom of this brief.


Where v3.1 is the right tool

  1. Hallucination-sensitive use cases on prose content. Legal, literary, journalism - any domain where a fact stated in reconstruction needs to actually be in the source. The fragment-style packet format constrains the receiver's reconstruction to the narrow context of each packet; the receiver hallucinates fewer adjacent facts than it would when assembling prose from a looser keyword-signature representation.
  2. Cross-domain content without a specialized module. v3.1 handles financial, construction, narrative, and mixed-genre content with the same packet grammar. No router, no per-domain vocabulary - one compressor, one contract.
  3. Single-producer to many-consumer topologies. Where the producer does not know the consumer's domain priors, v3.1's generic structure maximizes decompressibility across the long tail of consumer models.
  4. Backward-compatible integration. v3 parsers still parse v3.1 output; unknown conventions are safely ignored.
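The backward-compatibility contract in item 4 reduces to "skip what you don't recognize." The real packet grammar is in spec/v3.1-data-anchoring.md; the line-oriented "OP|payload" format and op names below are hypothetical, chosen only to show the skip-unknown behavior:

```python
KNOWN_OPS = {"PKT", "REF"}        # ops a v3-era parser understands (hypothetical names)
V31_OPS = {"ANCHOR", "BUNDLE"}    # v3.1 additions (hypothetical names)

def parse(stream: str, known: set[str] = KNOWN_OPS) -> list[tuple[str, str]]:
    out = []
    for line in stream.splitlines():
        op, _, payload = line.partition("|")
        if op in known:
            out.append((op, payload))
        # Unknown op: skipped silently. This is what makes v3.1 output
        # safe to feed to a v3 parser.
    return out

v31_output = "PKT|alpha\nANCHOR|a1\nPKT|beta\nBUNDLE|a1,a2"
legacy = parse(v31_output)                        # v3 parser: anchors/bundles dropped
full = parse(v31_output, KNOWN_OPS | V31_OPS)     # v3.1 parser: everything kept
```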

Known limitations

  1. Token inflation on structured technical content. On the construction 58K corpus, v3.1 produces more tokens (16,076) than the source (14,782). v3.1's generic fields do not capture domain-specific units, dimensions, or code references in recoverable form. For construction, regulatory, or other heavy-structure domains, a domain-matched compressor is more efficient.
  2. Recall ceiling on home-turf corpora. Where a receiving LLM has priors for a specific domain, v3.1's fragment-style fields recover fewer facts than a structured-domain compressor. This is the recall-precision tradeoff; see the measured numbers above.
  3. CloudKitchen-tuned secondary scorer. The original fidelity_score.py weighted F formula used hand-curated CloudKitchen keyword lists and is not portable across corpora. The primary scorer (measure_fidelity with independent regex extractor) is corpus-agnostic and is the authoritative signal.

How to use

curl -X POST https://compress.axlprotocol.org/api/v1/compress-text \
  -H 'Content-Type: application/json' \
  -d '{"text": "<your prose here>"}'

Response: {compressed, metrics, packets}.
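The same call from Python, using only the standard library (a minimal client sketch; the endpoint, request body, and response keys are the ones documented above, everything else is boilerplate):

```python
import json
import urllib.request

API_URL = "https://compress.axlprotocol.org/api/v1/compress-text"

def build_request(text: str) -> urllib.request.Request:
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def compress(text: str) -> dict:
    """POST the text; returns the parsed {compressed, metrics, packets} body."""
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```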

Do not use metrics.input_tokens_est, metrics.output_tokens_est, or metrics.tokens_saved_pct. These are server-side estimates derived from character ratios, and they overstate token savings by approximately 2.3x on typical content (the server reports 65.5% savings where the real tiktoken measurement is 28.6%). This is a known server-side issue documented at benchmarks/production_baseline.md and scheduled for a future production update. For authoritative token counts, encode the input and output strings with tiktoken(cl100k_base) directly; every token-based claim in this brief uses that method.

For the full v3.1 specification including grammar and operation-code taxonomy, see spec/v3.1-data-anchoring.md in the research repository.


Evidence links

All cold-read kits are self-contained: source corpus, compressed candidates, model reconstructions, scorer, metadata with SHA256 hashes and generator-commit provenance. Every number in this brief is reproducible from a clean clone of the research repository.