# AXL Rosetta v3.1 - Evidence Brief
Status: Shipping. Authoritative at spec commit 8fc20c0 (2026-04-11).
Production endpoint: compress.axlprotocol.org/api/v1/compress-text (running v0.9.0).
## What v3.1 is
AXL Rosetta v3.1 is a compositional compression protocol for LLM-to-LLM communication. The v3.1 Data Anchoring Extension added four additive conventions to the v3 base:
- Numeric bundles - `label[$value,qualifier,%]` preserves value-with-context in a single compact token group
- Entity anchors - `@ent.XX` aliases replace repeated named entities with short handles
- Causal operator split - `<-` for evidence, `=>` for causal, `->` for numeric transitions
- Summary + breakdown packet pairs - a summary packet followed by itemized components
The extension is backward-compatible with v3 parsers: any v3-aware consumer ignores anchors and bundles it does not recognize.
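The ignore-what-you-don't-recognize behavior can be sketched as follows. This is a minimal illustration, not the production parser: the regexes and the example packet syntax are assumptions for the sketch; the authoritative grammar is in spec/v3.1-data-anchoring.md.

```python
import re

# Illustrative token shapes for the v3.1 additions (assumed, not the
# production patterns).
BUNDLE = re.compile(r"\w+\[[^\]]*\]")   # numeric bundle, e.g. margin[$4.2M,net,%]
ANCHOR = re.compile(r"@ent\.\w+")       # entity anchor, e.g. @ent.CK

def v3_consume(packet: str) -> str:
    """A v3-era consumer: drop the v3.1 conventions it does not
    recognize and keep the rest of the packet intact."""
    packet = ANCHOR.sub("", packet)
    packet = BUNDLE.sub("", packet)
    return " ".join(packet.split())
```

A v3 packet with no v3.1 conventions passes through unchanged, which is the backward-compatibility contract in practice.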
## Measured compression (production v0.9.0, CloudKitchen 41K corpus)
| axis | source | compressed | ratio |
|---|---|---|---|
| characters | 41,256 | 14,215 | 2.90x |
| tokens (cl100k_base) | 9,028 | 6,446 | 1.40x |
The token ratio is the value measured from the production API against the authoritative source, using the same tokenizer LLMs use for billing. Earlier estimated numbers (2.69x / 3.27x) came from partial reconstructions and were corrected to the production measurement on 2026-04-11. The production ratio is the only one worth citing externally.
## Measured cold-read performance
Cold-read benchmarks put the v3.1 compressed output in front of non-Anthropic LLMs (Gemini Flash, Qwen 3.5, Grok, DeepSeek) with zero protocol spec, zero examples, and minimal reconstruction instructions. Fact recovery is measured with an independent regex extractor - no LLM scores itself.
v3.1 primary-scorer cold-read results across three corpora:
| corpus | content | v3.1 recall (mean) | v3.1 precision (mean) |
|---|---|---|---|
| Museum 35K | narrative prose | 38% (4 clean models) | 92% |
| CloudKitchen 41K | investment memo | 26% (2 clean models) | 79% |
| Construction 58K | technical spec | 23% (4 clean models) | 42% |
v3.1's precision on narrative prose is 88 to 97 percent across four independent non-Anthropic models (Gemini Flash, Qwen 3.5, Grok, DeepSeek). When what matters is "the facts it recovers are the right facts," v3.1 is the reliable choice on prose content. The pattern is consistent: v3.1's fragment-style packet format constrains cold-LLM reconstruction to the narrow context of each packet, which produces fewer recovered facts but higher hit-rate on the facts that do come through.
The numbers are from numeric-extractor measurements in the decision-gate results files at the commits where each corpus was scored. Commits are cited at the bottom of this brief.
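The extraction-and-scoring approach can be sketched as below. The production scorer ships in each cold-read kit; the regex and normalization here are illustrative assumptions, not the production pattern.

```python
import re

# Assumed numeric-fact pattern: integers with optional thousands
# separators, optional decimals, optional trailing percent sign.
NUM = re.compile(r"\d(?:[\d,]*\d)?(?:\.\d+)?%?")

def facts(text: str) -> set:
    """Numeric facts in the text, normalized by stripping separators."""
    return {m.replace(",", "") for m in NUM.findall(text)}

def recall_precision(source: str, reconstruction: str) -> tuple:
    """Score a reconstruction against its source: recall is the share of
    source facts recovered, precision the share of recovered facts that
    are actually in the source."""
    src, rec = facts(source), facts(reconstruction)
    hits = src & rec
    recall = len(hits) / len(src) if src else 0.0
    precision = len(hits) / len(rec) if rec else 0.0
    return recall, precision
```

Because the extractor is independent of any LLM, a reconstruction cannot inflate its own score by restating facts in looser language; only numerically matching facts count.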
## Where v3.1 is the right tool
- Hallucination-sensitive use cases on prose content. Legal, literary, journalism - any domain where a fact stated in reconstruction needs to actually be in the source. The fragment-style packet format constrains the receiver's reconstruction to the narrow context of each packet; the receiver hallucinates fewer adjacent facts than it would when assembling prose from a looser keyword-signature representation.
- Cross-domain content without a specialized module. v3.1 handles financial, construction, narrative, and mixed-genre content with the same packet grammar. No router, no per-domain vocabulary - one compressor, one contract.
- Single-producer to many-consumer topologies. Where the producer does not know the consumer's domain priors, v3.1's generic structure maximizes decompressibility across the long tail of consumer models.
- Backward-compatible integration. v3 parsers still parse v3.1 output; unknown conventions are safely ignored.
## Known limitations
- Token inflation on structured technical content. On the construction 58K corpus, v3.1 produces more tokens (16,076) than the source (14,782). v3.1's generic fields do not capture domain-specific units, dimensions, or code references in recoverable form. For construction, regulatory, or other heavy-structure domains, a domain-matched compressor is more efficient.
- Recall ceiling on home-turf corpora. Where a receiving LLM has priors for a specific domain, v3.1's fragment-style fields recover fewer facts than a structured-domain compressor. This is the recall-precision tradeoff; see the measured numbers above.
- CloudKitchen-tuned secondary scorer. The original `fidelity_score.py` weighted-F formula used hand-curated CloudKitchen keyword lists and is not portable across corpora. The primary scorer (`measure_fidelity` with the independent regex extractor) is corpus-agnostic and is the authoritative signal.
## How to use
```sh
curl -X POST https://compress.axlprotocol.org/api/v1/compress-text \
  -H 'Content-Type: application/json' \
  -d '{"text": "<your prose here>"}'
```
Response: {compressed, metrics, packets}.
- `compressed` - the serialized packet stream
- `packets` - the same content, as a list of strings
- `metrics.ratio` - character ratio (authoritative)
- `metrics.packets_count` - packet count (authoritative)
Do not use `metrics.input_tokens_est`, `metrics.output_tokens_est`, or `metrics.tokens_saved_pct`. These are server-side estimates derived from character ratios, and they overstate token savings by approximately 2.3x on typical content (the server reports 65.5% savings where the real tiktoken measurement is 28.6%). This is a known server-side issue, documented in benchmarks/production_baseline.md and scheduled for a future production update. For authoritative token counts, encode the input and output strings directly with tiktoken (`cl100k_base`); every token-based claim in this brief uses that method.
For the full v3.1 specification, including grammar and operation-code taxonomy, see `spec/v3.1-data-anchoring.md` in the research repository.
## Evidence links
- Spec: commit `8fc20c0` (tightens data-anchoring claims and provenance rule)
- Production baseline measurement: `benchmarks/production_baseline.md`
- Cold-read decision gate, corpus #1: `benchmarks/cold_read/RESULTS.md`
- Cold-read decision gate, corpus #2: `benchmarks/cold_read_corpus2/RESULTS.md`
- Cold-read decision gate, corpus #3: `benchmarks/cold_read_corpus3/RESULTS_precision_pass.md`
All cold-read kits are self-contained: source corpus, compressed candidates, model reconstructions, scorer, metadata with SHA256 hashes and generator-commit provenance. Every number in this brief is reproducible from a clean clone of the research repository.