AXL / POSTS / 2026-04-22

Why we are changing how we report compression

Date: 2026-04-22. Affected surface: compress.axlprotocol.org/api/v1/compress-text and the web form at compress.axlprotocol.org. Status: Corrected in production.

TL;DR

The metrics.tokens_saved_pct field on the public compress API was computed from a char/4 heuristic and clamped negative values to zero. On typical short inputs this overstated real token savings by approximately 2.3x - and hid the fact that below roughly 20,000 input characters, AXL actually expands token count instead of reducing it. We corrected the API to report real tiktoken counts, added a break-even advisory, and repositioned AXL Rosetta v3.1 as a corpus-scale .md-file relay protocol. The CloudKitchen 41K / Construction 58K numbers in the evidence brief were already measured with tiktoken directly and remain authoritative.

What was wrong

Until today, the production API returned these metrics on every compression:

{
  "metrics": {
    "input_tokens_est": 407,         // char/4 estimate
    "output_tokens_est": 110,        // char/4 estimate
    "tokens_saved": 297,             // clamped to zero on expansion
    "tokens_saved_pct": 73.0         // derived from the two estimates
  }
}

The problem was not that the numbers were rounded - it was that they were not measurements. They were derived from the assumption that one token corresponds to four characters, which is a rough heuristic for prose but systematically wrong for AXL output. AXL packets pack multiple semantic labels into short ASCII sigil sequences; tiktoken(cl100k_base) tokenizes them very differently than it tokenizes English prose of the same character length.
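The legacy computation can be reproduced in a few lines. This is a sketch, not the production code; the character counts in the example call are hypothetical, chosen only to round-trip the response shown above:

```python
def legacy_metrics(chars_in: int, chars_out: int) -> dict:
    """Reproduce the old char/4 estimate plus the max(0, ...) clamp."""
    est_in = chars_in // 4            # char/4 heuristic, not a measurement
    est_out = chars_out // 4
    saved = max(0, est_in - est_out)  # the clamp: expansion silently reads as 0
    pct = round(100 * saved / est_in, 1) if est_in else 0.0
    return {
        "input_tokens_est": est_in,
        "output_tokens_est": est_out,
        "tokens_saved": saved,
        "tokens_saved_pct": pct,
    }

# Hypothetical character counts that reproduce the response above
# (407 / 110 / 297 / 73.0):
print(legacy_metrics(1628, 440))

# Any input that expands under compression reports zero savings
# instead of a negative delta:
print(legacy_metrics(1628, 2000))
```

Note that nothing in this path ever touches a tokenizer: both "token" fields are pure character arithmetic.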

A simple check: paste a 1,857-character investment memo into the API, and measure the real token delta with tiktoken independently. What we found on 2026-04-22:

Metric            API claimed      Real (tiktoken cl100k)
Input tokens      465 (char/4)     431
Output tokens     331 (char/4)     611
Tokens saved      134              -180
Tokens saved %    +28.8%           -41.8%

The output used 42% more tokens than the input - yet the API reported a 29% saving. On a single short prompt, the absolute overstatement is roughly 70 percentage points; across the distribution of typical short inputs we see in production traffic, the multiplicative factor is close to 2.3x.
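The independent check is mechanical. A minimal sketch (the helper name is ours, and the tokenizer is injected as a callable so the same function works with tiktoken's encode or any other tokenizer):

```python
from typing import Callable, Sequence

def real_savings(tokenize: Callable[[str], Sequence],
                 src: str, out: str) -> dict:
    """Measure the token delta with an actual tokenizer. No clamping:
    a negative tokens_saved means the output expanded."""
    tin, tout = len(tokenize(src)), len(tokenize(out))
    saved = tin - tout
    pct = round(100 * saved / tin, 1) if tin else 0.0
    return {"tokens_in": tin, "tokens_out": tout,
            "tokens_saved": saved, "tokens_saved_pct": pct}

# With tiktoken (not bundled here):
#   enc = tiktoken.get_encoding("cl100k_base")
#   real_savings(enc.encode, memo_text, axl_output)
# On the memo above this yields 431 in, 611 out, -180 saved, -41.8%.
```

The same function run with an o200k_base encoding gives the second column of real numbers reported later in this post.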

Why the bug existed

Two decisions compounded:

  1. We used char/4 for speed - at the time, tiktoken was a heavier dependency we had not yet pulled in, and the char-ratio approximation was a placeholder that stayed longer than it should have.
  2. We clamped the result to max(0, saved) so the field would never display as negative. The intent was cosmetic - "compression shouldn't be negative" - but the practical effect was to hide expansion cases instead of surfacing them. Any input where AXL's fixed-overhead header (manifest + schema version + meta-packets, ~200 chars / ~60 tokens) exceeded the savings from entity aliasing on the short body silently reported zero savings when the honest answer was a negative delta.

The honesty brief and llms.txt already carried a "do not use metrics.input_tokens_est, metrics.output_tokens_est, or metrics.tokens_saved_pct" warning. That warning was correct, but a warning in a brief is not the same as a corrected API response. Anyone who called the endpoint and trusted the fields got the wrong number.

What is fixed

As of 2026-04-22, the API response carries real tokenizer counts as first-class fields:

{
  "metrics": {
    // Authoritative real-tokenizer counts
    "tokens_in_cl100k": 431,
    "tokens_out_cl100k": 611,
    "tokens_saved_cl100k": -180,
    "tokens_saved_pct_cl100k": -41.8,
    "tokens_in_o200k": 438,
    "tokens_out_o200k": 598,
    "tokens_saved_o200k": -160,
    "tokens_saved_pct_o200k": -36.5,

    // Legacy char/4 estimates - DEPRECATED, removed in v0.11.0
    "input_tokens_est": 465,
    "output_tokens_est": 331,
    "tokens_saved": 134,
    "tokens_saved_pct": 28.8,

    "deprecated": {
      "tokens_saved_pct": "Derived from char estimate, overstates by ~2.3x on short inputs. Use tokens_saved_pct_cl100k. Removed in v0.11.0.",
      ...
    }
  },
  "warning": {
    "will_expand_tokens": true,
    "below_break_even": true,
    "break_even_chars": 20000,
    "message": "Output uses more tokens than input (cl100k: -41.8%, o200k: -36.5%). AXL's fixed-overhead header dominates short inputs. Input is 1,857 chars, below the 20,000-char break-even threshold..."
  }
}

Three things changed: real tokenizer counts (cl100k and o200k) are now first-class fields, a warning object flags any input that will expand or that sits below the break-even threshold, and the legacy char/4 fields are formally deprecated.

The legacy fields are still present for one more release so anyone who pinned against them does not break on deploy. They carry a deprecated: true marker and a removal version. Scheduled removal: axl-compress v0.11.0.

Why the evidence brief numbers are still valid

The v3.1 evidence brief reports 2.90x character compression and 1.40x real token compression against the CloudKitchen 41K corpus, plus 92% precision on cold-read comprehension across four non-Anthropic LLMs. Those numbers were not computed by the buggy API field. They were measured by:

  1. Compressing the 41K corpus with axl-core v0.9.0.
  2. Running tiktoken(cl100k_base).encode() on both input and output independently.
  3. Committing the raw RESULTS files to the research repository, traceable by SHA from the research log at /research-log/.

In other words, the brief was already using the methodology we are now making the API default. The brief remains authoritative. What changes is the alignment between the brief and what the API returns on a live call.
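That methodology reduces to a small measurement step. A sketch (function name ours; the encode callable stands in for tiktoken's `encode`, and the dummy strings in any test are illustrative, not corpus data):

```python
def compression_ratios(src: str, out: str,
                       encode) -> dict:
    """Character ratio and real-token ratio, measured independently.
    The brief's headline numbers (2.90x char, 1.40x token) come from
    exactly this computation run over the full corpus files."""
    return {
        "char_ratio": round(len(src) / len(out), 2),
        "token_ratio": round(len(encode(src)) / len(encode(out)), 2),
    }

# With tiktoken:
#   enc = tiktoken.get_encoding("cl100k_base")
#   compression_ratios(corpus_text, compressed_text, enc.encode)
```

The gap between the two ratios is the whole story of the bug: character compression always looks better than token compression on AXL output, and char/4 collapsed the two into one number.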

The deeper shift: short prompts never worked

Fixing the measurement exposed the harder problem. It is not that AXL compression is weaker than claimed on short inputs. It is that AXL compression was never supposed to work on short inputs, and the individual-user web-demo framing quietly pretended otherwise.

The math: AXL's fixed-overhead header (manifest, schema version, meta-packets) costs roughly 60 tokens regardless of input length, and entity aliasing only pays for itself once the same entities recur often enough to amortize their alias definitions. On our production traffic, those two effects cross over near 20,000 input characters; a 500-word prompt (roughly 3,000 characters) sits nearly an order of magnitude below that line.

Selling a web form for users to paste 500-word prompts into and watch tokens shrink was never going to produce real savings, because the regime was wrong. The evidence brief proved AXL works at corpus scale. The web form was an easier-to-demo artifact that happened to land in a regime where the same engine loses.

What AXL is now

AXL Protocol remains the same grammar. What shifts is the product surface and the claim: AXL Rosetta v3.1 is now positioned as a corpus-scale .md-file relay protocol, not a paste-a-prompt web demo. The engine is unchanged; the claim is narrowed to the regime where it measurably wins.

If you built something against the old API

Your integration will continue to work until axl-compress v0.11.0 (scheduled for week 2-3 of this pivot). The legacy fields are still present with a deprecated: true flag. Migration is mechanical: replace metrics.tokens_saved_pct with metrics.tokens_saved_pct_cl100k (or the o200k variant), and be prepared for the number to be smaller or negative on short inputs. If your integration routes AXL output to a downstream LLM, the new number is the one that matches your actual token bill.
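For most callers the migration really is a one-line field swap. A defensive sketch (helper name ours) that prefers the real count and falls back to the deprecated field only when a pre-fix response lacks the new one:

```python
def saved_pct(metrics: dict, tokenizer: str = "cl100k") -> float:
    """Prefer the real-tokenizer savings figure; tolerate old responses."""
    key = f"tokens_saved_pct_{tokenizer}"
    if key in metrics:
        return metrics[key]  # may legitimately be negative on short inputs
    # Pre-v0.11.0 fallback: the deprecated char/4 field, known to overstate.
    return metrics["tokens_saved_pct"]

new_resp = {"tokens_saved_pct_cl100k": -41.8, "tokens_saved_pct": 28.8}
old_resp = {"tokens_saved_pct": 28.8}
```

Whatever wrapper you use, make sure the downstream logic accepts a negative value; code written against the clamped field has often never seen one.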

What we are committing to

  1. Every number we publish from here forward is measurable against tiktoken (or a named and reproducible tokenizer) by anyone with 10 minutes and the Python package.
  2. If we add a measurement field, we add the tokenizer name, the version, and the encode method call next to it.
  3. If a number turns out to be wrong, we correct it in public within 48 hours. This post is the current instance of that commitment.
  4. If Anthropic publishes a Claude-native tokenizer, we re-run every Claude-specific claim against it and publish a diff.

Short version: if you are calling compress.axlprotocol.org/api/v1/compress-text today, check the warning field and read tokens_saved_pct_cl100k. If you are looking to compress documentation corpora or multi-agent context pools at scale, wait two weeks - axl-corpus (the CLI, the relay.axlprotocol.org streaming API, and the self-host Docker image) ships in the week 2 tranche of the AXL Relay pivot.


References: v3.1 evidence brief | research log | v3.1 vs v4 decision gate | axl-core on GitHub