01 / 04 — Compressor
AXL Compressor Evolution: v0.4.0 to v0.9.0
AXL Protocol was conceived on paper on January 29, 2026. After weeks of brainstorming sessions with multiple LLM architectures, infrastructure was provisioned on February 12, the domain was registered on March 10, and the site went live on March 16. Three days later, the first code shipped.
Abstract
This document traces the full technical evolution of the AXL compressor across six major versions and three patch releases spanning 22 days of active development. The compressor began as a deterministic NLP pipeline translating English prose into AXL v3 packet format and grew into a two-pass architecture with entity aliasing, vocabulary compression, and structural deduplication. Each version introduced specific capabilities, exposed new failure modes, and refined the core thesis: that semantic compression of structured claims requires attacking the emission layer, not just the extraction layer.
1. Background: The AXL Packet Format
AXL (Axiomatic Exchange Language) is a structured wire format for machine-readable semantic claims. A v3 packet takes the form:
ID:AGENT|OP.CONF|SUBJECT|ROLE|ARG2|TEMPORAL
Where:
- ID - sender identifier
- AGENT - compression agent name
- OP - operation code (OBS, PRD, SIG, etc.)
- CONF - confidence score (0-100)
- SUBJECT - primary entity
- ROLE - semantic role annotation (optional)
- ARG2 - evidence, values, or linked facts
- TEMPORAL - time reference (NOW, PAST, FUTURE, or date)
The Rosetta v3 kernel is a 5,853-character grammar reference that allows any receiving agent to interpret packets without prior AXL knowledge. Self-bootstrapping means each compression output prepends this kernel, followed by a ---PACKETS--- separator.
The compression ratio metric used throughout this document is:
ratio = len(input_bytes) / len(output_bytes)
A ratio above 1.0 means the output is smaller than the input. A ratio of 1.92x means the AXL output is roughly 52% the size of the original English (a 48% reduction).
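The metric can be checked with a trivial helper (hypothetical; not part of axl-core), measuring both sides in UTF-8 bytes:

```python
def compression_ratio(input_text: str, output_text: str) -> float:
    """Input size divided by output size, both measured in UTF-8 bytes."""
    return len(input_text.encode("utf-8")) / len(output_text.encode("utf-8"))

# 192 input bytes against 100 output bytes gives the 1.92x ratio cited above:
# the output is 100/192, i.e. roughly 52% of the original size.
print(compression_ratio("x" * 192, "y" * 100))  # 1.92
```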
2. v0.4.0 (March 19, 2026) - Genesis
2.1 Summary
v0.4.0 was the initial public release of axl-core to PyPI. It established the foundational architecture: a parser, emitter, validator, and translator for the v1 packet format.
2.2 Capabilities
The v0.4.0 release shipped with:
- Parser: Read AXL v1 packets from raw strings and structured files.
- Emitter: Produce canonical v1 packet strings from Python data structures.
- Validator: Schema-level validation of packet fields against v1 grammar rules.
- Translator: Cross-domain packet translation using the Rosetta domain map.
- CLI: Four subcommands - parse, validate, translate, emit.
- Test suite: 42 passing tests covering all four components.
- Zero runtime dependencies: No third-party packages required.
2.3 Rosetta Domains (v1)
The 10 semantic domains defined in v0.4.0 remained stable through all subsequent versions:
| Code | Domain |
|---|---|
| TRD | Trade / Commerce |
| SIG | Signal / Alert |
| COMM | Communication |
| OPS | Operations |
| SEC | Security |
| DEV | Development |
| RES | Research |
| REG | Regulatory |
| PAY | Payment |
| FUND | Funding |
2.4 Design Decisions
The zero-dependency constraint was deliberate. The package needed to run in air-gapped environments and inside other packages without triggering dependency conflicts. All parsing used pure Python regex and string operations.
The CLI was designed for scripting and pipeline use. A typical validation command:
axl validate --file claims.axl --strict
3. v0.5.0 (March 29, 2026) - v3 Support
3.1 Summary
v0.5.0 introduced full support for the AXL v3 packet format, which extended the grammar to support richer semantic roles, evidence linking, and confidence scoring. The Rosetta v3 kernel was embedded as rosetta/v3.md.
3.2 Capabilities Added
- v3 parser, emitter, validator, translator: Full parallel implementation for the extended grammar.
- Version auto-detection: detect_version() inspected packet structure to determine v1 vs v3 format without caller configuration.
- JSON lowering: Packets could be serialized to application/vnd.axl+json per RFC 8785 canonical JSON.
- Backward compatibility: All v1 packets remained valid inputs. The CLI defaulted to auto-detection.
- Test suite: Expanded to 66 passing tests.
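The actual detect_version() heuristic is not documented here; a plausible sketch keys on field count, since v3 added the ROLE field (six pipe-delimited fields instead of five):

```python
def detect_version(packet: str) -> int:
    """Guess the AXL version from field count: v3 added ROLE (6 fields vs 5)."""
    n_fields = len(packet.split("|"))
    if n_fields == 6:
        return 3
    if n_fields == 5:
        return 1
    raise ValueError(f"unrecognized packet shape: {n_fields} fields")

print(detect_version("ID:AGENT|OBS.90|Subject|arg|NOW"))    # 1
print(detect_version("ID:AGENT|OBS.90|Subject||^89%|NOW"))  # 3
```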
3.3 The v3 Grammar Extension
The key structural change in v3 was the addition of the ROLE field and evidence prefix notation:
v1: ID:AGENT|OP.CONF|SUBJECT|ARG|TEMPORAL
v3: ID:AGENT|OP.CONF|SUBJECT|ROLE|ARG2|TEMPORAL
Evidence in ARG2 could now carry typed prefixes:
- <- - causal evidence ("because of X")
- -> - output or consequence
- ^ - labeled value (e.g., ^amt:5M)
- @ - entity reference
3.4 JSON Lowering
The JSON representation allowed AXL packets to be transmitted in JSON-native systems:
{
"id": "COMPRESS",
"op": "OBS",
"conf": 85,
"subject": "Blood_oxygen_levels",
"role": null,
"arg2": ["^89%"],
"temporal": "NOW"
}
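Python's standard json module approximates the RFC 8785 rules for this all-ASCII packet: sorted keys plus compact separators. (Full JCS also specifies number and string serialization details that json.dumps does not guarantee in general.)

```python
import json

packet = {
    "id": "COMPRESS", "op": "OBS", "conf": 85,
    "subject": "Blood_oxygen_levels", "role": None,
    "arg2": ["^89%"], "temporal": "NOW",
}
# Sorted keys and no whitespace give a stable, canonical-style byte sequence.
canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
print(canonical)
# {"arg2":["^89%"],"conf":85,"id":"COMPRESS","op":"OBS",...
```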
4. v0.6.0 (April 7, 2026) - The Deterministic Compressor
4.1 Summary
v0.6.0 introduced english_to_v3(), the first deterministic English-to-AXL compressor. This was the primary innovation of the axl-core project: translating natural language into structured semantic packets using pure NLP heuristics, with no LLM calls.
4.2 The 7-Step Pipeline
The compressor was implemented as a sequential spaCy NLP pipeline:
- Sentence splitting: Input text broken into atomic sentences using spaCy's sentence boundary detection.
- NER extraction: Named entity recognition identified PERSON, ORG, GPE, MONEY, PERCENT, DATE, CARDINAL entities.
- Operation classification: Heuristic rules mapped linguistic patterns to AXL operation codes (OBS, PRD, SIG, MRG, etc.).
- Confidence scoring: Lexical hedging words reduced confidence; declarative statements scored higher.
- Temporal extraction: Date expressions and temporal adverbs mapped to AXL temporal fields.
- Evidence linking: Prepositional phrases and causal constructions extracted as ARG2 evidence.
- Packet emission: Fields assembled and formatted per v3 grammar.
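A toy single-sentence rendition of the pipeline, with regex stand-ins for the spaCy steps (the subject is hard-coded here; the real pipeline derives it via NER):

```python
import re

def compress_sentence(sentence: str) -> str:
    """Toy version of the 7-step pipeline for one sentence (regex stand-ins
    for spaCy NER and dependency parsing)."""
    # Step 4: hedging words lower confidence from a declarative baseline of 90.
    hedged = re.search(r"\b(might|could|may|approximately)\b", sentence, re.I)
    conf = 65 if hedged else 90
    # Steps 2 and 6: extract a percentage as a labeled ARG2 value.
    pct = re.search(r"(\d+)\s*%", sentence)
    arg2 = f"^{pct.group(1)}%" if pct else ""
    # Steps 3, 5, 7: declarative present tense -> OBS / NOW; emit the packet.
    subject = "Blood_oxygen_levels"  # hard-coded; the real pipeline uses NER (step 2)
    return f"ID:COMPRESS|OBS.{conf}|#{subject}||{arg2}|NOW"

print(compress_sentence("The patient has a blood oxygen level of 89%"))
# ID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW
```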
4.3 Example Output
Input: "The patient has a blood oxygen level of 89%"
Output: ID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW
The # prefix denotes a measurement or clinical observation subject. The ^ prefix labels the value. The confidence of 90 reflects a declarative, unhedged statement.
4.4 No LLM Dependency
The zero-LLM constraint was a deliberate architectural choice. The compressor needed to run at the edge, in resource-constrained environments, and with predictable latency. spaCy's small English model (25MB) provided sufficient NLP capability for the initial pipeline.
5. v0.6.1 (April 7, 2026) - Self-Bootstrapping
5.1 Summary
v0.6.1 modified the compressor output format to prepend the Rosetta v3 kernel before all packets. This made AXL compression outputs self-contained: any receiving agent could parse the output without prior AXL knowledge.
5.2 Output Format
Every english_to_v3() call now produced:
[Rosetta v3 kernel - 5,853 characters of grammar reference]
---PACKETS---
ID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW
ID:COMPRESS|SIG.75|@Doctor||<-elevated_concern|NOW
5.3 Significance
This was the zero-configuration receiver guarantee. A fresh LLM instance with no AXL training could receive a compression output and interpret all packets correctly because the grammar was embedded in the output itself. The design followed the principle of self-describing data formats - the message carries its own schema.
The ---PACKETS--- separator allowed receivers to isolate the grammar section from the semantic payload.
6. v0.7.0 (April 8, 2026) - The Decompressor
6.1 Summary
v0.7.0 introduced the decompressor: a deterministic inverse of the compressor. Given an AXL packet or bundle, v3_to_english() produced human-readable English. This completed the round-trip: English -> AXL -> English.
6.2 Decompressor Architecture
The decompressor was built around four components:
- parse_packet(): Field-level parser for v3 packet strings.
- strip_kernel(): Separated the Rosetta kernel from the packet payload at the ---PACKETS--- boundary.
- format_decompressed(): Template-based claim reconstruction per operation type.
- Receipt mode: 0.3ms processing time, zero LLM calls, deterministic output.
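Minimal sketches of the first two components, assuming the signatures shown (the real implementations are not reproduced here):

```python
SEPARATOR = "---PACKETS---"

def strip_kernel(bundle: str) -> tuple[str, list[str]]:
    """Split a self-bootstrapping bundle into (kernel, packet lines)."""
    kernel, _, payload = bundle.partition(SEPARATOR)
    return kernel.strip(), [ln for ln in payload.strip().splitlines() if ln]

def parse_packet(packet: str) -> dict:
    """Field-level parse of a v3 packet string into a dict."""
    ident, op_conf, subject, role, arg2, temporal = packet.split("|")
    op, conf = op_conf.split(".")
    return {
        "id": ident, "op": op, "conf": int(conf), "subject": subject,
        "role": role or None, "arg2": arg2.split("+") if arg2 else [],
        "temporal": temporal,
    }

bundle = "grammar rules...\n---PACKETS---\nID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW\n"
_, packets = strip_kernel(bundle)
print(parse_packet(packets[0])["arg2"])  # ['^89%']
```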
6.3 Evidence Extraction Rewrite
Evidence extraction was redesigned around four pattern groups:
- Causal patterns: "because", "due to", "as a result of" - mapped to <- prefix.
- Attribution patterns: "according to", "reported by" - mapped to entity reference.
- Dependency patterns: "based on", "contingent on" - mapped to conditional evidence.
- Contradiction patterns: "despite", "although", "however" - mapped to counter-evidence.
A spaCy dependency tree fallback handled sentences that matched none of the four groups.
6.4 Confidence Scoring Rewrite
Confidence scoring became operation-aware. Base scores varied by operation type:
| Operation | Base Confidence |
|---|---|
| OBS | 85 |
| SIG | 70 |
| PRD | 65 |
| MRG | 60 |
A 23-word hedging dictionary reduced confidence when matched. Words like "might", "could", "approximately", "expected to" each subtracted from the base score.
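A sketch of operation-aware scoring with an excerpt of the hedging dictionary (the per-hedge penalties are assumptions; only the base scores come from the table above):

```python
BASE_CONF = {"OBS": 85, "SIG": 70, "PRD": 65, "MRG": 60}
# Excerpt of the 23-entry hedging dictionary; the penalty values are illustrative.
HEDGES = {"might": 15, "could": 15, "approximately": 10, "expected to": 10}

def score_confidence(op: str, sentence: str) -> int:
    """Start from the operation's base score; subtract per matched hedge."""
    conf = BASE_CONF.get(op, 60)
    lowered = sentence.lower()
    for hedge, penalty in HEDGES.items():
        if hedge in lowered:
            conf -= penalty
    return max(conf, 0)

print(score_confidence("OBS", "Revenue reached 5M"))      # 85
print(score_confidence("PRD", "Revenue might reach 5M"))  # 50
```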
6.5 Bundle Manifest
Every compression output now appended a bundle manifest (loss contract) listing fields omitted during compression:
---MANIFEST---
sentences_processed: 12
packets_emitted: 8
fields_dropped: [adjectives, adverbs, parentheticals]
6.6 NER Value Prefix Map
ARG2 values were labeled with type prefixes to preserve semantic meaning:
| Prefix | Type |
|---|---|
| ^amt: | Currency amount |
| ^pct: | Percentage |
| ^count: | Integer count |
| ^qty: | Quantity with unit |
| ^date: | Date expression |
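A simplified labeling function illustrating the prefix map (the matching rules are assumptions; note that the year check must run before the generic integer check):

```python
import re

def label_value(text: str) -> str:
    """Attach a type prefix to an extracted value (simplified, assumed rules)."""
    if re.fullmatch(r"\$?\d+(\.\d+)?[KMB]", text):
        return f"^amt:{text.lstrip('$')}"
    if text.endswith("%"):
        return f"^pct:{text}"
    # Year check must precede the integer check, or "2025" becomes a count.
    if re.fullmatch(r"(18|19|20|21)\d{2}", text):
        return f"^date:{text}"
    if text.isdigit():
        return f"^count:{text}"
    return f"^{text}"

print(label_value("$5M"))   # ^amt:5M
print(label_value("12%"))   # ^pct:12%
print(label_value("2025"))  # ^date:2025
```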
6.7 Test Suite
77 passing tests. The v0.7.0 release was the first version with full round-trip test coverage.
6.8 Known Bugs (Identified, Not Fixed)
The following bugs were documented in release notes but deferred:
- Year compaction: Dates like "2025" were being formatted as "2.0K" by the numeric normalizer.
- Pronoun subjects: "I", "it", "they" were extracted as packet subjects instead of being rejected.
- Number-over-org ranking: MONEY and CARDINAL entities were outranking PERSON and ORG in subject selection.
- Concatenated values: Multiple values were being joined without separators, producing unparseable ARG2 fields.
7. v0.8.0 (April 8, 2026) - The GPT Code Review
7.1 Summary
v0.8.0 was driven by an external code review conducted by GPT-4. The review identified 7 bugs - 4 matching the known list from v0.7.0 and 3 previously undetected. All 7 were fixed.
7.2 Bug Inventory
Bug 1 - DATE/year guard (KNOWN)
Years matching the regex (18|19|20|21)\d{2} were being passed through the numeric compactor, which converted "2025" to "2.0K". Fix: pre-screen tokens against the year pattern before numeric normalization.
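The guard can be sketched as a pre-screen in the numeric compactor (a hypothetical reconstruction; the scale thresholds are illustrative):

```python
import re

YEAR_RE = re.compile(r"^(18|19|20|21)\d{2}$")

def compact_number(token: str) -> str:
    """Numeric compactor with the Bug 1 fix: years pass through untouched."""
    if YEAR_RE.match(token):
        return token
    n = float(token)
    if n >= 1_000_000:
        return f"{n / 1_000_000:g}M"
    if n >= 1_000:
        return f"{n / 1_000:g}K"
    return token

print(compact_number("2025"))     # 2025 (no longer "2.0K")
print(compact_number("5000000"))  # 5M
```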
Bug 2 - Word-scale normalization (KNOWN)
"5 million dollars" was not being collapsed to "5M". Fix: implemented word-scale detection for million, billion, thousand with currency context.
Bug 3 - Pronoun subject rejection (KNOWN)
First and third-person pronouns were valid NER extractions but semantically meaningless as packet subjects. Fix: rejection list [I, it, they, we, he, she, this, that] applied before subject selection. When a pronoun was the grammatical subject, the semantic object was extracted instead.
Bug 4 - Semantic subject ranking (KNOWN)
MONEY and PERCENT entities were being ranked equally with PERSON and ORG. Fix: implemented explicit scoring:
- ORG, PERSON, GPE: +110 points
- MONEY, PERCENT, CARDINAL: -40 points
Bug 5 - Safer evidence fallback (SILENT)
Generic prepositional phrases like "by 2025" were being extracted as causal evidence with the <- prefix. This was incorrect - temporal prepositions are not causal. Fix: prepositions without a semantic trigger word were excluded from causal evidence extraction.
Bug 6 - Synthetic MRG disabled (SILENT)
The compressor was generating synthetic merge operation packets with made-up growth targets (e.g., RE:5+3+30%) that had no basis in the input text. Fix: synthetic MRG emission disabled entirely.
Bug 7 - Atomic fact splitting (SILENT)
Complex coordinated sentences ("X does A and B and C") were being emitted as a single packet with all three facts concatenated. Fix: coordinated clauses are split into separate packets.
7.3 Before/After Example
Input: "By 2025, revenue will reach 5 million dollars."
v0.7.0: ID:COMPRESS|PRD.75|^2025|<-By_2025|^date:2.0K+^amt:5milliondollars|NOW
v0.8.0: ID:COMPRESS|PRD.75|$revenue||^amt:5M+^date:2025|NOW
The v0.8.0 output is shorter (51 chars vs 70 chars), uses the correct subject ($revenue instead of ^2025), eliminates the spurious causal evidence (<-By_2025), and correctly formats the amount (5M) and year (2025).
7.4 Test Suite
80 passing tests. The 3 additional tests covered the silent bugs discovered by GPT-4.
8. v0.8.1 (April 10, 2026) - The Density Crisis
8.1 Summary
The quality fixes in v0.8.0 produced a compression ratio collapse. On a 40,000-character CloudKitchen internal memo, packet count rose from 208 (v0.7.0) to 380 (v0.8.0), and compression ratio fell from 1.92x to 1.34x. v0.8.1 attempted to recover density through clause-level packing.
8.2 Root Cause Analysis
The atomic fact splitting fix (Bug 7) was the primary driver. A single complex sentence previously emitting 1 packet now emitted 3-5 packets. Additionally, the value-formatting fixes and the pronoun rejection fix (Bug 3) both produced longer, more explicit packets because the compressor could no longer take shortcuts.
8.3 Packing Helpers Added
Three new internal functions:
- _find_clause_anchor(): Identified the main verb of a subordinate clause for use as a merge candidate.
- _extract_clause_subject(): Pulled the grammatical subject of a clause for co-reference tracking.
- _can_merge_fact(): Decision function - returns True if two facts share subject and operation and can be packed into one ARG2.
8.4 Packing Limits
- Maximum 3 facts per packet (ARG2 field).
- 64-character budget for the ARG2 field.
8.5 Mini Kernel
The Rosetta kernel prepended to every output was reduced from 5,853 characters to 958 characters by removing inline examples and keeping only grammar rules.
8.6 Results
CloudKitchen memo (40K chars):
v0.7.0: 208 packets, ratio 1.92x
v0.8.0: 380 packets, ratio 1.34x
v0.8.1: 380 packets, ratio 1.34x
Packet count was unchanged. The packing heuristics had no measurable effect.
8.7 Lesson
The packing limits were set at 3 facts and 64 characters. The average correctly-extracted fact in v0.8.0 was 18-22 characters. Three facts at 20 characters each = 60 characters, already near the limit. The helpers were firing at nearly every sentence, but the result was the same packet count because splitting was happening upstream (at the sentence and clause level) faster than packing could consolidate downstream.
9. v0.8.2 (April 10, 2026) - Adjusted Limits
9.1 Summary
v0.8.2 raised the packing limits and added two refinements, attempting to recover ratio through more permissive consolidation.
9.2 Changes
- Raised limits: Maximum 4 facts per packet (up from 3), 78-character ARG2 budget (up from 64).
- Qualifier handling: Qualifying clauses ("which was announced last quarter") no longer forced a packet split. They were either dropped or appended as non-evidence context.
- Role inference: Verb lemmas used for role labels instead of raw forms, improving consistency (report instead of reported, claim instead of claiming).
9.3 Results
CloudKitchen memo (40K chars):
v0.8.1: 380 packets, ratio 1.34x
v0.8.2: 380 packets, ratio 1.39x
Packet count identical. Ratio improved marginally (0.05x) due to the mini kernel size reduction compounding with the qualifier drop.
9.4 Lesson
The problem was not packing limits. A 4-fact, 78-character limit was already permissive for the extraction quality the pipeline was producing. The bottleneck was architectural: the emission layer was producing English-with-pipes rather than genuinely compressed semantic notation. Subject names were full English phrases. Evidence was full prepositional phrases. The packet format allowed short values but nothing enforced short values.
10. v0.9.0 (April 10, 2026) - Architecture Redesign
10.1 Summary
v0.9.0 abandoned incremental packing fixes and redesigned the emission architecture. Two key changes drove the improvement: entity aliasing (vocabulary compression) and same-subject merging (structural deduplication).
10.2 Two-Pass Architecture
The compressor now operated in two passes over the document:
Pass 1 - Document Scan:
- Extract all named entities across the full document.
- Assign 2-3 character aliases to entities appearing more than twice.
- Build the entity registry.
- Emit ontology manifest packet.
Pass 2 - Packed Emission:
- Replace entity names with aliases at emission time.
- Compress evidence into verb:object notation with a 30-character max.
- Merge adjacent packets sharing the same (subject, operation, temporal) triplet.
10.3 Entity Registry
Named entities appearing more than twice in the document were assigned aliases:
CloudKitchen -> CK
Marcus Chen -> MC
San Francisco -> SF
Q1 2025 -> Q1
The mapping was emitted as the first packet in every output - the ontology manifest:
ID:C|@m.O.doc||^df:CK=CloudKitchen+MC=Marcus_Chen+SF=San_Francisco|NOW
Any receiver could reconstruct full entity names from this manifest.
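Pass 1 can be sketched as a frequency count plus a naive aliasing rule (taking the capital letters of the name; the real scheme, including collision handling, is not specified):

```python
from collections import Counter

def build_registry(entities: list[str], min_mentions: int = 3) -> dict[str, str]:
    """Pass 1 sketch: alias entities appearing more than twice."""
    registry = {}
    for name, count in Counter(entities).items():
        if count >= min_mentions:
            # Naive alias: the capital letters of the name ("CloudKitchen" -> "CK").
            alias = "".join(ch for ch in name if ch.isupper()) or name[:2].upper()
            registry[name] = alias
    return registry

mentions = ["CloudKitchen"] * 5 + ["Marcus Chen"] * 3 + ["Berlin"]
print(build_registry(mentions))  # {'CloudKitchen': 'CK', 'Marcus Chen': 'MC'}
```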
10.4 Compressed Subjects
Subjects in v0.9.0 used aliases where registered:
v0.8.x: ID:COMPRESS|OBS.90|@Sales_team||^targets:30%|NOW
v0.9.0: ID:C|OBS.90|@CK.sales||^targets:30%|NOW
CK.sales communicates both the parent entity (CloudKitchen, via alias) and the sub-entity (sales team) in 8 characters versus 10 for Sales_team, while adding the organizational context that was previously absent.
10.5 Agent ID Compression
The agent identifier was shortened from "COMPRESS" to "C":
v0.8.x: ID:COMPRESS|... ("COMPRESS" - 8 chars)
v0.9.0: ID:C|... ("C" - 1 char)
7 characters saved per packet. On a 380-packet document, this alone saves 2,660 characters.
10.6 Evidence Compression
Evidence in ARG2 was reformatted to verb:object notation with a 30-character cap:
v0.8.x: <-due_to_declining_foot_traffic_in_downtown_locations
v0.9.0: <-declining:foot_traffic
The semantic content is preserved (causal relationship, verb, object) with substantial length reduction.
10.7 Same-Subject Merging
Adjacent packets sharing the same subject, operation type, and temporal reference were merged:
Before merging:
ID:C|OBS.85|@CK.revenue||^amt:2M|Q1
ID:C|OBS.85|@CK.revenue||^pct:-12%|Q1
After merging:
ID:C|OBS.85|@CK.revenue||^amt:2M+^pct:-12%|Q1
This is structurally sound because both facts are observations about the same entity in the same time window. The merged packet is unambiguous.
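The merge rule can be sketched as a single pass over adjacent packets (the packet representation is assumed):

```python
def merge_adjacent(packets: list[dict]) -> list[dict]:
    """Merge adjacent packets sharing (subject, op, temporal) by joining ARG2."""
    merged: list[dict] = []
    for p in packets:
        prev = merged[-1] if merged else None
        if prev and all(prev[k] == p[k] for k in ("subject", "op", "temporal")):
            prev["arg2"] = prev["arg2"] + "+" + p["arg2"]  # share one structural frame
        else:
            merged.append(dict(p))
    return merged

packets = [
    {"subject": "@CK.revenue", "op": "OBS", "temporal": "Q1", "arg2": "^amt:2M"},
    {"subject": "@CK.revenue", "op": "OBS", "temporal": "Q1", "arg2": "^pct:-12%"},
]
print(merge_adjacent(packets)[0]["arg2"])  # ^amt:2M+^pct:-12%
```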
10.8 Mini Kernel Final
The kernel prepended to outputs was reduced to 376 characters - the minimum grammar specification sufficient for a receiver to parse all v3 packet fields. This represented a 94% reduction from the 5,853-character full kernel shipped in v0.6.1.
10.9 Before/After Comparison
Full subject example:
v0.8.x: ID:COMPRESS|OBS.90|@Sales_team||^targets:30%+^date:Q1_2025|NOW
62 characters
v0.9.0: ID:C|OBS.90|@CK.sales||^targets:30%+^date:Q1_2025|NOW
53 characters
9-character reduction (roughly 15%) on this single packet. Multiplied across a 380-packet document, with additional savings from evidence compression and same-subject merging, the cumulative effect is substantial.
11. The Wrong Flank Lesson
The v0.8.0 through v0.8.2 development arc attacked the extraction layer - the part of the pipeline responsible for identifying entities, classifying operations, and scoring confidence. Each of the 7 bug fixes in v0.8.0 was technically correct:
- Year dates were genuinely being misformatted.
- Pronoun subjects were genuinely semantically wrong.
- ORG entities genuinely should outrank MONEY entities as packet subjects.
But every fix made the output longer. Rejecting pronouns forced the compressor to find a longer, more explicit subject. Splitting coordinated facts produced more packets. Labeling amounts correctly produced longer ARG2 values.
The extraction layer was not the bottleneck. The emission layer was producing English-with-pipes: packet fields that contained full English phrases with underscores substituted for spaces. This is not compression - it is reformatting.
The analogy to compiler development is direct: the compressor's parser (extraction) was being optimized while the code generator (emission) was producing verbose output. The quality of parsing is irrelevant if the code generator wastes the parsed information.
v0.9.0 attacked the right layer:
- Entity aliasing reduces vocabulary size at the emission stage.
verb:objectevidence notation reduces field length at emission.- Same-subject merging reduces packet count at emission.
- Agent ID shortening reduces per-packet overhead at emission.
None of these changes touched extraction quality. The extraction layer from v0.8.0 was carried forward unchanged into v0.9.0.
12. The Density vs Quality Tradeoff
The three-version arc illustrates a recurring tension in lossy compression design:
v0.7.0: Ratio 1.92x, extraction quality poor. The compressor achieved density by taking shortcuts - accepting pronouns as subjects, concatenating values without separators, emitting synthetic packets not grounded in input text. These shortcuts reduced output size but degraded semantic fidelity. A receiving agent parsing a v0.7.0 bundle would encounter subjects like "it" and values like "2.0K" that do not correspond to recoverable claims.
v0.8.0: Ratio 1.34x, extraction quality correct. Fixing the extraction bugs restored semantic fidelity but each fix eliminated a shortcut that had been compressing the output. The result was longer packets and more of them.
v0.9.0: Targets both. The extraction quality from v0.8.0 is preserved. Density is recovered not by reintroducing extraction shortcuts but by compressing the notation itself.
The fundamental insight of v0.9.0 is that AXL compression wins through two mechanisms:
- Vocabulary compression: Short aliases for repeated entities reduce the character cost of every packet that mentions those entities. A document with 40 mentions of "CloudKitchen" saves 10 characters per mention by aliasing to "CK".
- Structural deduplication: Same-subject merging eliminates the per-packet overhead (ID, agent, operation, confidence, temporal) for facts that share a subject context. Instead of paying 30-40 characters of structural overhead per fact, multiple facts share one structural frame.
Neither of these mechanisms requires accepting incorrect extractions. They are output-layer optimizations orthogonal to extraction quality.
13. Version Summary Table
| Version | Date | Key Change | Tests | Ratio (40K) |
|---|---|---|---|---|
| v0.4.0 | Mar 19 | Initial release, v1 format | 42 | N/A |
| v0.5.0 | Mar 29 | v3 support, auto-detection | 66 | N/A |
| v0.6.0 | Apr 7 | english_to_v3() compressor | N/A | N/A |
| v0.6.1 | Apr 7 | Self-bootstrapping kernel | N/A | N/A |
| v0.7.0 | Apr 8 | Decompressor, round-trip | 77 | 1.92x |
| v0.8.0 | Apr 8 | 7 bug fixes, GPT review | 80 | 1.34x |
| v0.8.1 | Apr 10 | Clause packing, mini kernel | 80 | 1.34x |
| v0.8.2 | Apr 10 | Raised limits, lemma roles | 80 | 1.39x |
| v0.9.0 | Apr 10 | Entity aliasing, merging | TBD | TBD |
14. Open Problems
The following problems remain open as of v0.9.0:
- Cross-sentence co-reference: Pronouns in later sentences referring to entities established earlier are currently dropped rather than resolved. A co-reference pass would recover subject identity across sentence boundaries.
- Nested entity disambiguation: "Marcus Chen, VP of Sales at CloudKitchen" contains three entities in one noun phrase. The current NER pipeline extracts PERSON and ORG but loses the role relationship.
- Confidence calibration: The hedging dictionary was assembled heuristically. No calibration against human-labeled confidence scores has been performed. The base scores per operation type are estimates.
- Alias collision: With 2-3 character aliases, collision is possible in documents with many proper nouns. No collision resolution strategy is currently implemented.
- Decompressor parity: The decompressor was not updated to handle the v0.9.0 alias format. A receiving agent parsing v0.9.0 output must expand aliases using the ontology manifest before passing packets to
v3_to_english().