
01 / 04 — Compressor

AXL Compressor Evolution: v0.4.0 to v0.9.0

Published: April 2026 · Author: Diego Carranza · Period: March 19 - April 10, 2026 · Read time: ~14 min

AXL Protocol began as a concept on paper on January 29, 2026. After weeks of brainstorming sessions with multiple LLM architectures, infrastructure was provisioned on February 12, the domain was registered on March 10, and the site went live on March 16. Three days later, the first code shipped.

Abstract

This document traces the full technical evolution of the AXL compressor across six major versions and five patch releases spanning 22 days of active development. The compressor began as a deterministic NLP pipeline translating English prose into AXL v3 packet format and grew into a two-pass architecture with entity aliasing, vocabulary compression, and structural deduplication. Each version introduced specific capabilities, exposed new failure modes, and refined the core thesis: that semantic compression of structured claims requires attacking the emission layer, not just the extraction layer.

1. Background: The AXL Packet Format

AXL (Axiomatic Exchange Language) is a structured wire format for machine-readable semantic claims. A v3 packet takes the form:

ID:AGENT|OP.CONF|SUBJECT|ROLE|ARG2|TEMPORAL

Where ID is the packet identifier, AGENT the emitting agent, OP.CONF the operation code paired with a 0-100 confidence score, SUBJECT the claim subject, ROLE an optional semantic role, ARG2 the evidence or value field, and TEMPORAL the time reference.

The Rosetta v3 kernel is a 5,853-character grammar reference that allows any receiving agent to interpret packets without prior AXL knowledge. Self-bootstrapping means each compression output prepends this kernel, followed by a ---PACKETS--- separator.

The compression ratio metric used throughout this document is:

ratio = len(input_bytes) / len(output_bytes)

A ratio above 1.0 means the output is smaller than the input. A ratio of 1.92x means the AXL output is about 52% of the size of the original English (a 48% reduction).
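
In code, the metric is a one-liner; byte length is used so the ratio is encoding-independent (the helper name is illustrative, not part of axl-core):

```python
def compression_ratio(original: str, compressed: str) -> float:
    """Ratio > 1.0 means the output is smaller than the input."""
    return len(original.encode("utf-8")) / len(compressed.encode("utf-8"))

# A 1.92x ratio means the output is about 52% of the input size (1 / 1.92).
```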

2. v0.4.0 (March 19, 2026) - Genesis

2.1 Summary

v0.4.0 was the initial public release of axl-core to PyPI. It established the foundational architecture: a parser, emitter, validator, and translator for the v1 packet format.

2.2 Capabilities

The v0.4.0 release shipped with the v1 parser, emitter, validator, and translator, a zero-dependency pure-Python implementation, and a CLI for scripted validation.

2.3 Rosetta Domains (v1)

The 10 semantic domains defined in v0.4.0 remained stable through all subsequent versions:

Code   Domain
TRD    Trade / Commerce
SIG    Signal / Alert
COMM   Communication
OPS    Operations
SEC    Security
DEV    Development
RES    Research
REG    Regulatory
PAY    Payment
FUND   Funding

2.4 Design Decisions

The zero-dependency constraint was deliberate. The package needed to run in air-gapped environments and inside other packages without triggering dependency conflicts. All parsing used pure Python regex and string operations.

The CLI was designed for scripting and pipeline use. A typical validation command:

axl validate --file claims.axl --strict

3. v0.5.0 (March 29, 2026) - v3 Support

3.1 Summary

v0.5.0 introduced full support for the AXL v3 packet format, which extended the grammar to support richer semantic roles, evidence linking, and confidence scoring. The Rosetta v3 kernel was embedded as rosetta/v3.md.

3.2 Capabilities Added

v0.5.0 added a v3 parser and emitter, automatic v1/v3 format detection, and JSON lowering of packets, with the Rosetta v3 kernel embedded in the package.

3.3 The v3 Grammar Extension

The key structural change in v3 was the addition of the ROLE field and evidence prefix notation:

v1: ID:AGENT|OP.CONF|SUBJECT|ARG|TEMPORAL
v3: ID:AGENT|OP.CONF|SUBJECT|ROLE|ARG2|TEMPORAL

Evidence in ARG2 could now carry typed prefixes, such as <- for causal evidence and ^ for typed values.

3.4 JSON Lowering

The JSON representation allowed AXL packets to be transmitted in JSON-native systems:

{
  "id": "COMPRESS",
  "op": "OBS",
  "conf": 85,
  "subject": "Blood_oxygen_levels",
  "role": null,
  "arg2": ["^89%"],
  "temporal": "NOW"
}
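
A minimal lowering routine might look like the sketch below. It assumes the v3 field order and the subject sigil conventions (#, @, $) used throughout this document; the function name and prefix-stripping details are illustrative, not the library's actual API:

```python
def packet_to_json(packet: str) -> dict:
    # Field order per v3: ID:AGENT|OP.CONF|SUBJECT|ROLE|ARG2|TEMPORAL
    head, op_conf, subject, role, arg2, temporal = packet.split("|")
    op, conf = op_conf.split(".")
    return {
        "id": head.removeprefix("ID:"),
        "op": op,
        "conf": int(conf),
        "subject": subject.lstrip("#@$"),        # drop the subject-type sigil
        "role": role or None,                    # empty field lowers to null
        "arg2": arg2.split("+") if arg2 else [], # '+' separates multiple values
        "temporal": temporal,
    }
```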

4. v0.6.0 (April 7, 2026) - The Deterministic Compressor

4.1 Summary

v0.6.0 introduced english_to_v3(), the first deterministic English-to-AXL compressor. This was the primary innovation of the axl-core project: translating natural language into structured semantic packets using pure NLP heuristics, with no LLM calls.

4.2 The 7-Step Pipeline

The compressor was implemented as a sequential spaCy NLP pipeline:

  1. Sentence splitting: Input text broken into atomic sentences using spaCy's sentence boundary detection.
  2. NER extraction: Named entity recognition identified PERSON, ORG, GPE, MONEY, PERCENT, DATE, CARDINAL entities.
  3. Operation classification: Heuristic rules mapped linguistic patterns to AXL operation codes (OBS, PRD, SIG, MRG, etc.).
  4. Confidence scoring: Lexical hedging words reduced confidence; declarative statements scored higher.
  5. Temporal extraction: Date expressions and temporal adverbs mapped to AXL temporal fields.
  6. Evidence linking: Prepositional phrases and causal constructions extracted as ARG2 evidence.
  7. Packet emission: Fields assembled and formatted per v3 grammar.
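
As a rough sketch of steps 1, 3, 4, and 7, the following regex-only stand-in mimics the pipeline shape without the spaCy dependency. The subject heuristic, the hedging penalty, and all function names are illustrative, not the shipped implementation:

```python
import re

HEDGES = {"might", "could", "approximately", "probably"}  # sample hedging lexicon

def split_sentences(text: str) -> list[str]:
    # Step 1 stand-in: naive boundary detection instead of spaCy
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def classify_op(sentence: str) -> str:
    # Step 3 stand-in: a single heuristic rule (future tense -> prediction)
    return "PRD" if re.search(r"\bwill\b", sentence) else "OBS"

def score_confidence(sentence: str, base: int = 90) -> int:
    # Step 4: lexical hedging words reduce confidence
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return base - 15 * len(words & HEDGES)

def emit_packet(sentence: str) -> str:
    # Step 7: assemble a simplified v3 packet (empty ROLE and ARG2 fields)
    op, conf = classify_op(sentence), score_confidence(sentence)
    subject = "_".join(sentence.rstrip(".!?").split()[:3])
    return f"ID:COMPRESS|{op}.{conf}|#{subject}|||NOW"
```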

4.3 Example Output

Input:  "The patient has a blood oxygen level of 89%"
Output: ID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW

The # prefix denotes a measurement or clinical observation subject. The ^ prefix labels the value. The confidence of 90 reflects a declarative, unhedged statement.

4.4 No LLM Dependency

The zero-LLM constraint was a deliberate architectural choice. The compressor needed to run at the edge, in resource-constrained environments, and with predictable latency. spaCy's small English model (25MB) provided sufficient NLP capability for the initial pipeline.

5. v0.6.1 (April 7, 2026) - Self-Bootstrapping

5.1 Summary

v0.6.1 modified the compressor output format to prepend the Rosetta v3 kernel before all packets. This made AXL compression outputs self-contained: any receiving agent could parse the output without prior AXL knowledge.

5.2 Output Format

Every english_to_v3() call now produced:

[Rosetta v3 kernel - 5,853 characters of grammar reference]
---PACKETS---
ID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW
ID:COMPRESS|SIG.75|@Doctor||<-elevated_concern|NOW
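
On the receiving side, isolating the payload requires nothing beyond the separator convention; a minimal sketch (the function name is hypothetical):

```python
def split_bundle(bundle: str) -> tuple[str, list[str]]:
    """Separate the embedded grammar kernel from the semantic payload."""
    kernel, sep, payload = bundle.partition("---PACKETS---")
    if not sep:
        raise ValueError("not a self-bootstrapping AXL bundle")
    packets = [line for line in payload.strip().splitlines() if line.strip()]
    return kernel.strip(), packets
```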

5.3 Significance

This was the zero-configuration receiver guarantee. A fresh LLM instance with no AXL training could receive a compression output and interpret all packets correctly because the grammar was embedded in the output itself. The design followed the principle of self-describing data formats - the message carries its own schema.

The ---PACKETS--- separator allowed receivers to isolate the grammar section from the semantic payload.

6. v0.7.0 (April 8, 2026) - The Decompressor

6.1 Summary

v0.7.0 introduced the decompressor: a deterministic inverse of the compressor. Given an AXL packet or bundle, v3_to_english() produced human-readable English. This completed the round-trip: English -> AXL -> English.

6.2 Decompressor Architecture

The decompressor was built around four components:

6.3 Evidence Extraction Rewrite

Evidence extraction was redesigned around four pattern groups:

  1. Causal patterns: "because", "due to", "as a result of" - mapped to <- prefix.
  2. Attribution patterns: "according to", "reported by" - mapped to entity reference.
  3. Dependency patterns: "based on", "contingent on" - mapped to conditional evidence.
  4. Contradiction patterns: "despite", "although", "however" - mapped to counter-evidence.

A spaCy dependency tree fallback handled sentences that matched none of the four groups.
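
A minimal sketch of the group dispatch, using a subset of the trigger phrases named above (the full internal lists and the function name are assumptions):

```python
# Trigger phrases per pattern group (subset shown for illustration)
PATTERN_GROUPS = {
    "causal":        ("because", "due to", "as a result of"),
    "attribution":   ("according to", "reported by"),
    "dependency":    ("based on", "contingent on"),
    "contradiction": ("despite", "although", "however"),
}

def classify_evidence(clause: str):
    """Return the first matching pattern group, or None for the fallback path."""
    lowered = clause.lower()
    for group, triggers in PATTERN_GROUPS.items():
        if any(trigger in lowered for trigger in triggers):
            return group
    return None  # caller falls back to dependency-tree analysis
```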

6.4 Confidence Scoring Rewrite

Confidence scoring became operation-aware. Base scores varied by operation type:

Operation   Base Confidence
OBS         85
SIG         70
PRD         65
MRG         60

A 23-word hedging dictionary reduced confidence when matched. Words like "might", "could", "approximately", "expected to" each subtracted from the base score.
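
A sketch of operation-aware scoring using the base table above; the per-match penalty and the sampled hedging terms are illustrative, since the full 23-word dictionary is not reproduced here:

```python
BASE_CONFIDENCE = {"OBS": 85, "SIG": 70, "PRD": 65, "MRG": 60}

# Sample of the hedging dictionary; the penalty per match is an assumption.
HEDGING_TERMS = ("might", "could", "approximately", "expected to", "probably")

def score_confidence(op: str, sentence: str, penalty: int = 10) -> int:
    lowered = sentence.lower()
    hits = sum(term in lowered for term in HEDGING_TERMS)
    return max(0, BASE_CONFIDENCE[op] - penalty * hits)
```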

6.5 Bundle Manifest

Every compression output now appended a bundle manifest (loss contract) listing fields omitted during compression:

---MANIFEST---
sentences_processed: 12
packets_emitted: 8
fields_dropped: [adjectives, adverbs, parentheticals]

6.6 NER Value Prefix Map

ARG2 values were labeled with type prefixes to preserve semantic meaning:

Prefix    Type
^amt:     Currency amount
^pct:     Percentage
^count:   Integer count
^qty:     Quantity with unit
^date:    Date expression

6.7 Test Suite

77 passing tests. The v0.7.0 release was the first version with full round-trip test coverage.

6.8 Known Bugs (Identified, Not Fixed)

The following bugs were documented in release notes but deferred: the DATE/year guard (years mangled by the numeric compactor), word-scale normalization (amounts like "5 million dollars" not collapsed), pronoun subject rejection, and semantic subject ranking.

7. v0.8.0 (April 8, 2026) - The GPT Code Review

7.1 Summary

v0.8.0 was driven by an external code review conducted by GPT-4. The review identified 7 bugs - 4 matching the known list from v0.7.0 and 3 previously undetected. All 7 were fixed.

7.2 Bug Inventory

Bug 1 - DATE/year guard (KNOWN)

Years matching the regex (18|19|20|21)\d{2} were being passed through the numeric compactor, which converted "2025" to "2.0K". Fix: pre-screen tokens against the year pattern before numeric normalization.

Bug 2 - Word-scale normalization (KNOWN)

"5 million dollars" was not being collapsed to "5M". Fix: implemented word-scale detection for million, billion, thousand with currency context.
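
The two numeric fixes can be sketched together. The year regex is the one quoted above; the scale table, thresholds, and function names are illustrative:

```python
import re

YEAR_PATTERN = re.compile(r"^(18|19|20|21)\d{2}$")
SCALES = {"thousand": "K", "million": "M", "billion": "B"}

def compact_numeric(token: str) -> str:
    """Bug 1 fix: years bypass the numeric compactor entirely."""
    if YEAR_PATTERN.match(token):
        return token                      # "2025" stays "2025", never "2.0K"
    n = float(token)
    if n >= 1_000_000:
        return f"{n / 1_000_000:g}M"
    if n >= 1_000:
        return f"{n / 1_000:g}K"
    return token

def collapse_word_scale(text: str) -> str:
    """Bug 2 fix: '5 million dollars' -> '5M'."""
    return re.sub(
        r"(\d+(?:\.\d+)?)\s+(thousand|million|billion)(\s+dollars)?",
        lambda m: m.group(1) + SCALES[m.group(2)],
        text,
    )
```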

Bug 3 - Pronoun subject rejection (KNOWN)

First and third-person pronouns were valid NER extractions but semantically meaningless as packet subjects. Fix: rejection list [I, it, they, we, he, she, this, that] applied before subject selection. When a pronoun was the grammatical subject, the semantic object was extracted instead.

Bug 4 - Semantic subject ranking (KNOWN)

MONEY and PERCENT entities were being ranked equally with PERSON and ORG. Fix: implemented explicit scoring:

Bug 5 - Safer evidence fallback (SILENT)

Generic prepositional phrases like "by 2025" were being extracted as causal evidence with the <- prefix. This was incorrect - temporal prepositions are not causal. Fix: prepositions without a semantic trigger word were excluded from causal evidence extraction.

Bug 6 - Synthetic MRG disabled (SILENT)

The compressor was generating synthetic merge operation packets with made-up growth targets (e.g., RE:5+3+30%) that had no basis in the input text. Fix: synthetic MRG emission disabled entirely.

Bug 7 - Atomic fact splitting (SILENT)

Complex coordinated sentences ("X does A and B and C") were being emitted as a single packet with all three facts concatenated. Fix: coordinated clauses are split into separate packets.

7.3 Before/After Example

Input: "By 2025, revenue will reach 5 million dollars."

v0.7.0: ID:COMPRESS|PRD.75|^2025|<-By_2025|^date:2.0K+^amt:5milliondollars|NOW

v0.8.0: ID:COMPRESS|PRD.75|$revenue||^amt:5M+^date:2025|NOW

The v0.8.0 output is shorter (51 chars vs 70 chars), uses the correct subject ($revenue instead of ^2025), eliminates the spurious causal evidence (<-By_2025), and correctly formats the amount (5M) and year (2025).

7.4 Test Suite

80 passing tests. The 3 additional tests covered the silent bugs discovered by GPT-4.

8. v0.8.1 (April 10, 2026) - The Density Crisis

8.1 Summary

The quality fixes in v0.8.0 produced a compression ratio collapse. On a 40,000-character CloudKitchen internal memo, packet count rose from 208 (v0.7.0) to 380 (v0.8.0), and compression ratio fell from 1.92x to 1.34x. v0.8.1 attempted to recover density through clause-level packing.

8.2 Root Cause Analysis

The atomic fact splitting fix (Bug 7) was the primary driver. A single complex sentence previously emitting 1 packet now emitted 3-5 packets. Additionally, the concatenated value fix (Bug 4) and the pronoun rejection fix (Bug 3) both produced longer, more explicit packets because the compressor could no longer take shortcuts.

8.3 Packing Helpers Added

Three new internal functions:

8.4 Packing Limits

Packing was capped at 3 facts and 64 characters per packed packet.

8.5 Mini Kernel

The Rosetta kernel prepended to every output was reduced from 5,853 characters to 958 characters by removing inline examples and keeping only grammar rules.

8.6 Results

CloudKitchen memo (40K chars):
v0.7.0: 208 packets, ratio 1.92x
v0.8.0: 380 packets, ratio 1.34x
v0.8.1: 380 packets, ratio 1.34x

Packet count was unchanged. The packing heuristics had no measurable effect.

8.7 Lesson

The packing limits were set at 3 facts and 64 characters. The average correctly-extracted fact in v0.8.0 was 18-22 characters. Three facts at 20 characters each = 60 characters, already near the limit. The helpers were firing at nearly every sentence, but the result was the same packet count because splitting was happening upstream (at the sentence and clause level) faster than packing could consolidate downstream.
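
The arithmetic can be made concrete with a toy guard (the function name and the '+' joining convention are illustrative):

```python
def can_pack(facts, max_facts=3, max_chars=64):
    """v0.8.1-style packing guard: join facts only if both limits hold."""
    return len(facts) <= max_facts and len("+".join(facts)) <= max_chars

avg_fact = "f" * 20                 # average extracted fact: 18-22 chars
print(can_pack([avg_fact] * 3))     # 3 facts, 62 joined chars: barely fits
print(can_pack([avg_fact] * 4))     # a fourth average fact always overflows
```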

9. v0.8.2 (April 10, 2026) - Adjusted Limits

9.1 Summary

v0.8.2 raised the packing limits and added two refinements, attempting to recover ratio through more permissive consolidation.

9.2 Changes

The fact limit rose from 3 to 4 and the character limit from 64 to 78. Low-information qualifier words were dropped from ARG2, and role fields were reduced to lemmas.

9.3 Results

CloudKitchen memo (40K chars):
v0.8.1: 380 packets, ratio 1.34x
v0.8.2: 380 packets, ratio 1.39x

Packet count identical. Ratio improved marginally (+0.05x), driven by the qualifier drop and lemmatized roles rather than by packing.

9.4 Lesson

The problem was not packing limits. A 4-fact, 78-character limit was already permissive for the extraction quality the pipeline was producing. The bottleneck was architectural: the emission layer was producing English-with-pipes rather than genuinely compressed semantic notation. Subject names were full English phrases. Evidence was full prepositional phrases. The packet format allowed short values but nothing enforced short values.

10. v0.9.0 (April 10, 2026) - Architecture Redesign

10.1 Summary

v0.9.0 abandoned incremental packing fixes and redesigned the emission architecture. Two key changes drove the improvement: entity aliasing (vocabulary compression) and same-subject merging (structural deduplication).

10.2 Two-Pass Architecture

The compressor now operated in two passes over the document:

Pass 1 - Document Scan: the full document is scanned to count named-entity frequency, build the entity registry, and assign aliases to entities appearing more than twice.

Pass 2 - Packed Emission: packets are emitted using the registered aliases, compressed verb:object evidence notation, and same-subject merging.

10.3 Entity Registry

Named entities appearing more than twice in the document were assigned aliases:

CloudKitchen -> CK
Marcus Chen  -> MC
San Francisco -> SF
Q1 2025      -> Q1

The mapping was emitted as the first packet in every output - the ontology manifest:

ID:C|@m.O.doc||^df:CK=CloudKitchen+MC=Marcus_Chen+SF=San_Francisco|NOW

Any receiver could reconstruct full entity names from this manifest.
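
A sketch of the registry pass, assuming initials-based aliases and the manifest layout shown above; collision handling is omitted and all names are illustrative:

```python
import re
from collections import Counter

def build_registry(mentions, min_count=3):
    """Alias entities mentioned more than twice, using word/CamelCase initials."""
    registry = {}
    for name, count in Counter(mentions).items():
        if count < min_count:
            continue
        words = re.findall(r"[A-Z][a-z]+", name) or name.split()
        registry[name] = "".join(w[0].upper() for w in words)
    return registry

def emit_manifest(registry):
    """First packet of every output: the ontology manifest."""
    pairs = "+".join(f"{alias}={name.replace(' ', '_')}"
                     for name, alias in registry.items())
    return f"ID:C|@m.O.doc||^df:{pairs}|NOW"
```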

10.4 Compressed Subjects

Subjects in v0.9.0 used aliases where registered:

v0.8.x: ID:COMPRESS|OBS.90|@Sales_team||^targets:30%|NOW
v0.9.0: ID:C|OBS.90|@CK.sales||^targets:30%|NOW

CK.sales communicates both the parent entity (CloudKitchen, via alias) and the sub-entity (sales team) in 8 characters versus 10 for Sales_team, while adding the organizational context that was previously absent.

10.5 Agent ID Compression

The agent identifier was shortened from "COMPRESS" to "C":

v0.8.x: ID:COMPRESS|... (11 chars for the ID:AGENT field)
v0.9.0: ID:C|...        (4 chars for the ID:AGENT field)

7 characters saved per packet. On a 380-packet document, this alone saves 2,660 characters.

10.6 Evidence Compression

Evidence in ARG2 was reformatted to verb:object notation with a 30-character cap:

v0.8.x: <-due_to_declining_foot_traffic_in_downtown_locations
v0.9.0: <-declining:foot_traffic

The semantic content is preserved (causal relationship, verb, object) with substantial length reduction.

10.7 Same-Subject Merging

Adjacent packets sharing the same subject, operation type, and temporal reference were merged:

Before merging:
ID:C|OBS.85|@CK.revenue||^amt:2M|Q1
ID:C|OBS.85|@CK.revenue||^pct:-12%|Q1

After merging:
ID:C|OBS.85|@CK.revenue||^amt:2M+^pct:-12%|Q1

This is structurally sound because both facts are observations about the same entity in the same time window. The merged packet is unambiguous.
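
A sketch of the merge rule, assuming the six-field packet layout and the '+' joiner as the ARG2 separator (the function name is illustrative):

```python
def merge_adjacent(packets):
    """Merge adjacent packets sharing agent, OP.CONF, subject, role, and temporal."""
    merged = []
    for packet in packets:
        fields = packet.split("|")   # [ID:AGENT, OP.CONF, SUBJECT, ROLE, ARG2, TEMPORAL]
        if merged:
            prev = merged[-1]
            if prev[:4] == fields[:4] and prev[5] == fields[5]:
                prev[4] = prev[4] + "+" + fields[4]   # concatenate ARG2 values
                continue
        merged.append(fields)
    return ["|".join(fields) for fields in merged]
```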

10.8 Mini Kernel Final

The kernel prepended to outputs was reduced to 376 characters - the minimum grammar specification sufficient for a receiver to parse all v3 packet fields. This represented a 94% reduction from the 5,853-character full kernel shipped in v0.6.1.

10.9 Before/After Comparison

Full subject example:
v0.8.x: ID:COMPRESS|OBS.90|@Sales_team||^targets:30%+^date:Q1_2025|NOW
         62 characters

v0.9.0: ID:C|OBS.90|@CK.sales||^targets:30%+^date:Q1_2025|NOW
         53 characters

9-character reduction (15%) on this single packet. Multiplied across a 380-packet document, with additional savings from evidence compression and same-subject merging, the cumulative effect is substantial.

11. The Wrong Flank Lesson

The v0.8.0 through v0.8.2 development arc attacked the extraction layer - the part of the pipeline responsible for identifying entities, classifying operations, and scoring confidence. Each of the 7 bug fixes in v0.8.0 was technically correct.

But every fix made the output longer. Rejecting pronouns forced the compressor to find a longer, more explicit subject. Splitting coordinated facts produced more packets. Labeling amounts correctly produced longer ARG2 values.

The extraction layer was not the bottleneck. The emission layer was producing English-with-pipes: packet fields that contained full English phrases with underscores substituted for spaces. This is not compression - it is reformatting.

The analogy to compiler development is direct: the compressor's parser (extraction) was being optimized while the code generator (emission) was producing verbose output. The quality of parsing is irrelevant if the code generator wastes the parsed information.

v0.9.0 attacked the right layer: entity aliasing shortened repeated names, agent ID compression and the 376-character mini kernel cut fixed overhead, evidence compression capped ARG2 notation, and same-subject merging let multiple facts share one structural frame.

None of these changes touched extraction quality. The extraction layer from v0.8.0 was carried forward unchanged into v0.9.0.

12. The Density vs Quality Tradeoff

The three-version arc illustrates a recurring tension in lossy compression design:

v0.7.0: Ratio 1.92x, extraction quality poor. The compressor achieved density by taking shortcuts - accepting pronouns as subjects, concatenating values without separators, emitting synthetic packets not grounded in input text. These shortcuts reduced output size but degraded semantic fidelity. A receiving agent parsing a v0.7.0 bundle would encounter subjects like "it" and values like "2.0K" that do not correspond to recoverable claims.

v0.8.0: Ratio 1.34x, extraction quality correct. Fixing the extraction bugs restored semantic fidelity but each fix eliminated a shortcut that had been compressing the output. The result was longer packets and more of them.

v0.9.0: Targets both. The extraction quality from v0.8.0 is preserved. Density is recovered not by reintroducing extraction shortcuts but by compressing the notation itself.

The fundamental insight of v0.9.0 is that AXL compression wins through two mechanisms:

  1. Vocabulary compression: Short aliases for repeated entities reduce the character cost of every packet that mentions those entities. A document with 40 mentions of "CloudKitchen" saves 10 characters per mention by aliasing to "CK".
  2. Structural deduplication: Same-subject merging eliminates the per-packet overhead (ID, agent, operation, confidence, temporal) for facts that share a subject context. Instead of paying 30-40 characters of structural overhead per fact, multiple facts share one structural frame.

Neither of these mechanisms requires accepting incorrect extractions. They are output-layer optimizations orthogonal to extraction quality.
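
Under a simple cost model (one '+'-separated manifest entry per alias, which is an assumption), the net saving for the vocabulary-compression example works out as follows:

```python
def aliasing_savings(name: str, alias: str, mentions: int) -> int:
    """Net characters saved by aliasing, minus the manifest entry cost."""
    manifest_cost = len(f"{alias}={name}") + 1   # entry plus '+' separator
    return mentions * (len(name) - len(alias)) - manifest_cost

# 40 mentions x 10 chars saved, minus a 16-char manifest entry
print(aliasing_savings("CloudKitchen", "CK", 40))
```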

13. Version Summary Table

Version  Date    Key Change                    Tests  Ratio (40K)
v0.4.0   Mar 19  Initial release, v1 format    42     N/A
v0.5.0   Mar 29  v3 support, auto-detection    66     N/A
v0.6.0   Apr 7   english_to_v3() compressor    N/A    N/A
v0.6.1   Apr 7   Self-bootstrapping kernel     N/A    N/A
v0.7.0   Apr 8   Decompressor, round-trip      77     1.92x
v0.8.0   Apr 8   7 bug fixes, GPT review       80     1.34x
v0.8.1   Apr 10  Clause packing, mini kernel   80     1.34x
v0.8.2   Apr 10  Raised limits, lemma roles    80     1.39x
v0.9.0   Apr 10  Entity aliasing, merging      TBD    TBD

14. Open Problems

The following problems remain open as of v0.9.0:

AXL Protocol Inc.