01 / 04 — Compressor
AXL Compressor Evolution: v0.4.0 to v0.9.0
AXL Protocol was conceived on paper on January 29, 2026. After weeks of brainstorming sessions with multiple LLM architectures, infrastructure was provisioned on February 12, the domain was registered on March 10, and the site went live on March 16. Three days later, the first code shipped.
Abstract
This document traces the full technical evolution of the AXL compressor across six major versions and three patch releases spanning 22 days of active development. The compressor began as a deterministic NLP pipeline translating English prose into AXL v3 packet format and grew into a two-pass architecture with entity aliasing, vocabulary compression, and structural deduplication. Each version introduced specific capabilities, exposed new failure modes, and refined the core thesis: that semantic compression of structured claims requires attacking the emission layer, not just the extraction layer.
1. Background: The AXL Packet Format
AXL (Axiomatic Exchange Language) is a structured wire format for machine-readable semantic claims. A v3 packet takes the form:
ID:AGENT|OP.CONF|SUBJECT|ROLE|ARG2|TEMPORAL
Where:
- ID - sender identifier
- AGENT - compression agent name
- OP - operation code (OBS, PRD, SIG, etc.)
- CONF - confidence score (0-100)
- SUBJECT - primary entity
- ROLE - semantic role annotation (optional)
- ARG2 - evidence, values, or linked facts
- TEMPORAL - time reference (NOW, PAST, FUTURE, or date)
The Rosetta v3 kernel is a 5,853-character grammar reference that allows any receiving agent to interpret packets without prior AXL knowledge. Self-bootstrapping means each compression output prepends this kernel, followed by a ---PACKETS--- separator.
The compression ratio metric used throughout this document is:
ratio = len(input_bytes) / len(output_bytes)
A ratio above 1.0 means the output is smaller than the input. A ratio of 1.92x means the AXL output is roughly 52% the size of the original English (a 48% reduction).
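The metric can be checked with a trivial helper (hypothetical; not part of axl-core), measuring both sides in UTF-8 bytes:

```python
def compression_ratio(input_text: str, output_text: str) -> float:
    """Input size divided by output size, both measured in UTF-8 bytes."""
    return len(input_text.encode("utf-8")) / len(output_text.encode("utf-8"))

# 192 input bytes against 100 output bytes gives the 1.92x ratio cited above:
# the output is 100/192, i.e. roughly 52% of the original size.
print(compression_ratio("x" * 192, "y" * 100))  # 1.92
```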
2. v0.4.0 (March 19, 2026) - Genesis
2.1 Summary
v0.4.0 was the initial public release of axl-core to PyPI. It established the foundational architecture: a parser, emitter, validator, and translator for the v1 packet format.
2.2 Capabilities
The v0.4.0 release shipped with:
- Parser: Read AXL v1 packets from raw strings and structured files.
- Emitter: Produce canonical v1 packet strings from Python data structures.
- Validator: Schema-level validation of packet fields against v1 grammar rules.
- Translator: Cross-domain packet translation using the Rosetta domain map.
- CLI: Four subcommands - parse, validate, translate, emit.
- Test suite: 42 passing tests covering all four components.
- Zero runtime dependencies: No third-party packages required.
2.3 Rosetta Domains (v1)
The 10 semantic domains defined in v0.4.0 remained stable through all subsequent versions:
| Code | Domain |
|---|---|
| TRD | Trade / Commerce |
| SIG | Signal / Alert |
| COMM | Communication |
| OPS | Operations |
| SEC | Security |
| DEV | Development |
| RES | Research |
| REG | Regulatory |
| PAY | Payment |
| FUND | Funding |
2.4 Design Decisions
The zero-dependency constraint was deliberate. The package needed to run in air-gapped environments and inside other packages without triggering dependency conflicts. All parsing used pure Python regex and string operations.
The CLI was designed for scripting and pipeline use. A typical validation command:
axl validate --file claims.axl --strict
3. v0.5.0 (March 29, 2026) - v3 Support
3.1 Summary
v0.5.0 introduced full support for the AXL v3 packet format, which extended the grammar to support richer semantic roles, evidence linking, and confidence scoring. The Rosetta v3 kernel was embedded as rosetta/v3.md.
3.2 Capabilities Added
- v3 parser, emitter, validator, translator: Full parallel implementation for the extended grammar.
- Version auto-detection: detect_version() inspected packet structure to determine v1 vs v3 format without caller configuration.
- JSON lowering: Packets could be serialized to application/vnd.axl+json per RFC 8785 canonical JSON.
- Backward compatibility: All v1 packets remained valid inputs. The CLI defaulted to auto-detection.
- Test suite: Expanded to 66 passing tests.
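The actual detect_version() heuristic is not documented here; a plausible sketch keys on field count, since v3 added the ROLE field (six pipe-delimited fields instead of five):

```python
def detect_version(packet: str) -> int:
    """Guess the AXL version from field count: v3 added ROLE (6 fields vs 5)."""
    n_fields = len(packet.split("|"))
    if n_fields == 6:
        return 3
    if n_fields == 5:
        return 1
    raise ValueError(f"unrecognized packet shape: {n_fields} fields")

print(detect_version("ID:AGENT|OBS.90|Subject|arg|NOW"))    # 1
print(detect_version("ID:AGENT|OBS.90|Subject||^89%|NOW"))  # 3
```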
3.3 The v3 Grammar Extension
The key structural change in v3 was the addition of the ROLE field and evidence prefix notation:
v1: ID:AGENT|OP.CONF|SUBJECT|ARG|TEMPORAL
v3: ID:AGENT|OP.CONF|SUBJECT|ROLE|ARG2|TEMPORAL
Evidence in ARG2 could now carry typed prefixes:
- <- - causal evidence ("because of X")
- -> - output or consequence
- ^ - labeled value (e.g., ^amt:5M)
- @ - entity reference
3.4 JSON Lowering
The JSON representation allowed AXL packets to be transmitted in JSON-native systems:
{
"id": "COMPRESS",
"op": "OBS",
"conf": 85,
"subject": "Blood_oxygen_levels",
"role": null,
"arg2": ["^89%"],
"temporal": "NOW"
}
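Python's standard json module approximates the RFC 8785 rules for this all-ASCII packet: sorted keys plus compact separators. (Full JCS also specifies number and string serialization details that json.dumps does not guarantee in general.)

```python
import json

packet = {
    "id": "COMPRESS", "op": "OBS", "conf": 85,
    "subject": "Blood_oxygen_levels", "role": None,
    "arg2": ["^89%"], "temporal": "NOW",
}
# Sorted keys and no whitespace give a stable, canonical-style byte sequence.
canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
print(canonical)
# {"arg2":["^89%"],"conf":85,"id":"COMPRESS","op":"OBS",...
```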
4. v0.6.0 (April 7, 2026) - The Deterministic Compressor
4.1 Summary
v0.6.0 introduced english_to_v3(), the first deterministic English-to-AXL compressor. This was the primary innovation of the axl-core project: translating natural language into structured semantic packets using pure NLP heuristics, with no LLM calls.
4.2 The 7-Step Pipeline
The compressor was implemented as a sequential spaCy NLP pipeline:
- Sentence splitting: Input text broken into atomic sentences using spaCy's sentence boundary detection.
- NER extraction: Named entity recognition identified PERSON, ORG, GPE, MONEY, PERCENT, DATE, CARDINAL entities.
- Operation classification: Heuristic rules mapped linguistic patterns to AXL operation codes (OBS, PRD, SIG, MRG, etc.).
- Confidence scoring: Lexical hedging words reduced confidence; declarative statements scored higher.
- Temporal extraction: Date expressions and temporal adverbs mapped to AXL temporal fields.
- Evidence linking: Prepositional phrases and causal constructions extracted as ARG2 evidence.
- Packet emission: Fields assembled and formatted per v3 grammar.
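A toy single-sentence rendition of the pipeline, with regex stand-ins for the spaCy steps (the subject is hard-coded here; the real pipeline derives it via NER):

```python
import re

def compress_sentence(sentence: str) -> str:
    """Toy version of the 7-step pipeline for one sentence (regex stand-ins
    for spaCy NER and dependency parsing)."""
    # Step 4: hedging words lower confidence from a declarative baseline of 90.
    hedged = re.search(r"\b(might|could|may|approximately)\b", sentence, re.I)
    conf = 65 if hedged else 90
    # Steps 2 and 6: extract a percentage as a labeled ARG2 value.
    pct = re.search(r"(\d+)\s*%", sentence)
    arg2 = f"^{pct.group(1)}%" if pct else ""
    # Steps 3, 5, 7: declarative present tense -> OBS / NOW; emit the packet.
    subject = "Blood_oxygen_levels"  # hard-coded; the real pipeline uses NER (step 2)
    return f"ID:COMPRESS|OBS.{conf}|#{subject}||{arg2}|NOW"

print(compress_sentence("The patient has a blood oxygen level of 89%"))
# ID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW
```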
4.3 Example Output
Input: "The patient has a blood oxygen level of 89%"
Output: ID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW
The # prefix denotes a measurement or clinical observation subject. The ^ prefix labels the value. The confidence of 90 reflects a declarative, unhedged statement.
4.4 No LLM Dependency
The zero-LLM constraint was a deliberate architectural choice. The compressor needed to run at the edge, in resource-constrained environments, and with predictable latency. spaCy's small English model (25MB) provided sufficient NLP capability for the initial pipeline.
5. v0.6.1 (April 7, 2026) - Self-Bootstrapping
5.1 Summary
v0.6.1 modified the compressor output format to prepend the Rosetta v3 kernel before all packets. This made AXL compression outputs self-contained: any receiving agent could parse the output without prior AXL knowledge.
5.2 Output Format
Every english_to_v3() call now produced:
[Rosetta v3 kernel - 5,853 characters of grammar reference]
---PACKETS---
ID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW
ID:COMPRESS|SIG.75|@Doctor||<-elevated_concern|NOW
5.3 Significance
This was the zero-configuration receiver guarantee. A fresh LLM instance with no AXL training could receive a compression output and interpret all packets correctly because the grammar was embedded in the output itself. The design followed the principle of self-describing data formats - the message carries its own schema.
The ---PACKETS--- separator allowed receivers to isolate the grammar section from the semantic payload.
6. v0.7.0 (April 8, 2026) - The Decompressor
6.1 Summary
v0.7.0 introduced the decompressor: a deterministic inverse of the compressor. Given an AXL packet or bundle, v3_to_english() produced human-readable English. This completed the round-trip: English -> AXL -> English.
6.2 Decompressor Architecture
The decompressor was built around four components:
- parse_packet(): Field-level parser for v3 packet strings.
- strip_kernel(): Separated the Rosetta kernel from the packet payload at the ---PACKETS--- boundary.
- format_decompressed(): Template-based claim reconstruction per operation type.
- Receipt mode: 0.3ms processing time, zero LLM calls, deterministic output.
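Minimal sketches of the first two components, assuming the signatures shown (the real implementations are not reproduced here):

```python
SEPARATOR = "---PACKETS---"

def strip_kernel(bundle: str) -> tuple[str, list[str]]:
    """Split a self-bootstrapping bundle into (kernel, packet lines)."""
    kernel, _, payload = bundle.partition(SEPARATOR)
    return kernel.strip(), [ln for ln in payload.strip().splitlines() if ln]

def parse_packet(packet: str) -> dict:
    """Field-level parse of a v3 packet string into a dict."""
    ident, op_conf, subject, role, arg2, temporal = packet.split("|")
    op, conf = op_conf.split(".")
    return {
        "id": ident, "op": op, "conf": int(conf), "subject": subject,
        "role": role or None, "arg2": arg2.split("+") if arg2 else [],
        "temporal": temporal,
    }

bundle = "grammar rules...\n---PACKETS---\nID:COMPRESS|OBS.90|#Blood_oxygen_levels||^89%|NOW\n"
_, packets = strip_kernel(bundle)
print(parse_packet(packets[0])["arg2"])  # ['^89%']
```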
6.3 Evidence Extraction Rewrite
Evidence extraction was redesigned around four pattern groups:
- Causal patterns: "because", "due to", "as a result of" - mapped to <- prefix.
- Attribution patterns: "according to", "reported by" - mapped to entity reference.
- Dependency patterns: "based on", "contingent on" - mapped to conditional evidence.
- Contradiction patterns: "despite", "although", "however" - mapped to counter-evidence.
A spaCy dependency tree fallback handled sentences that matched none of the four groups.
6.4 Confidence Scoring Rewrite
Confidence scoring became operation-aware. Base scores varied by operation type:
| Operation | Base Confidence |
|---|---|
| OBS | 85 |
| SIG | 70 |
| PRD | 65 |
| MRG | 60 |
A 23-word hedging dictionary reduced confidence when matched. Words like "might", "could", "approximately", "expected to" each subtracted from the base score.
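A sketch of operation-aware scoring with an excerpt of the hedging dictionary (the per-hedge penalties are assumptions; only the base scores come from the table above):

```python
BASE_CONF = {"OBS": 85, "SIG": 70, "PRD": 65, "MRG": 60}
# Excerpt of the 23-entry hedging dictionary; the penalty values are illustrative.
HEDGES = {"might": 15, "could": 15, "approximately": 10, "expected to": 10}

def score_confidence(op: str, sentence: str) -> int:
    """Start from the operation's base score; subtract per matched hedge."""
    conf = BASE_CONF.get(op, 60)
    lowered = sentence.lower()
    for hedge, penalty in HEDGES.items():
        if hedge in lowered:
            conf -= penalty
    return max(conf, 0)

print(score_confidence("OBS", "Revenue reached 5M"))      # 85
print(score_confidence("PRD", "Revenue might reach 5M"))  # 50
```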
6.5 Bundle Manifest
Every compression output now appended a bundle manifest (loss contract) listing fields omitted during compression:
---MANIFEST---
sentences_processed: 12
packets_emitted: 8
fields_dropped: [adjectives, adverbs, parentheticals]
6.6 NER Value Prefix Map
ARG2 values were labeled with type prefixes to preserve semantic meaning:
| Prefix | Type |
|---|---|
| ^amt: | Currency amount |
| ^pct: | Percentage |
| ^count: | Integer count |
| ^qty: | Quantity with unit |
| ^date: | Date expression |
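A simplified labeling function illustrating the prefix map (the matching rules are assumptions; note that the year check must run before the generic integer check):

```python
import re

def label_value(text: str) -> str:
    """Attach a type prefix to an extracted value (simplified, assumed rules)."""
    if re.fullmatch(r"\$?\d+(\.\d+)?[KMB]", text):
        return f"^amt:{text.lstrip('$')}"
    if text.endswith("%"):
        return f"^pct:{text}"
    # Year check must precede the integer check, or "2025" becomes a count.
    if re.fullmatch(r"(18|19|20|21)\d{2}", text):
        return f"^date:{text}"
    if text.isdigit():
        return f"^count:{text}"
    return f"^{text}"

print(label_value("$5M"))   # ^amt:5M
print(label_value("12%"))   # ^pct:12%
print(label_value("2025"))  # ^date:2025
```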
6.7 Test Suite
77 passing tests. The v0.7.0 release was the first version with full round-trip test coverage.
6.8 Known Bugs (Identified, Not Fixed)
The following bugs were documented in release notes but deferred:
- Year compaction: Dates like "2025" were being formatted as "2.0K" by the numeric normalizer.
- Pronoun subjects: "I", "it", "they" were extracted as packet subjects instead of being rejected.
- Number-over-org ranking: MONEY and CARDINAL entities were outranking PERSON and ORG in subject selection.
- Concatenated values: Multiple values were being joined without separators, producing unparseable ARG2 fields.
7. v0.8.0 (April 8, 2026) - The GPT Code Review
7.1 Summary
v0.8.0 was driven by an external code review conducted by GPT-4. The review identified 7 bugs - 4 matching the known list from v0.7.0 and 3 previously undetected. All 7 were fixed.
7.2 Bug Inventory
Bug 1 - DATE/year guard (KNOWN)
Years matching the regex (18|19|20|21)\d{2} were being passed through the numeric compactor, which converted "2025" to "2.0K". Fix: pre-screen tokens against the year pattern before numeric normalization.
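The guard can be sketched as a pre-screen in the numeric compactor (a hypothetical reconstruction; the scale thresholds are illustrative):

```python
import re

YEAR_RE = re.compile(r"^(18|19|20|21)\d{2}$")

def compact_number(token: str) -> str:
    """Numeric compactor with the Bug 1 fix: years pass through untouched."""
    if YEAR_RE.match(token):
        return token
    n = float(token)
    if n >= 1_000_000:
        return f"{n / 1_000_000:g}M"
    if n >= 1_000:
        return f"{n / 1_000:g}K"
    return token

print(compact_number("2025"))     # 2025 (no longer "2.0K")
print(compact_number("5000000"))  # 5M
```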
Bug 2 - Word-scale normalization (KNOWN)
"5 million dollars" was not being collapsed to "5M". Fix: implemented word-scale detection for million, billion, thousand with currency context.
Bug 3 - Pronoun subject rejection (KNOWN)
First and third-person pronouns were valid NER extractions but semantically meaningless as packet subjects. Fix: rejection list [I, it, they, we, he, she, this, that] applied before subject selection. When a pronoun was the grammatical subject, the semantic object was extracted instead.
Bug 4 - Semantic subject ranking (KNOWN)
MONEY and PERCENT entities were being ranked equally with PERSON and ORG. Fix: implemented explicit scoring:
- ORG, PERSON, GPE: +110 points
- MONEY, PERCENT, CARDINAL: -40 points
Bug 5 - Safer evidence fallback (SILENT)
Generic prepositional phrases like "by 2025" were being extracted as causal evidence with the <- prefix. This was incorrect - temporal prepositions are not causal. Fix: prepositions without a semantic trigger word were excluded from causal evidence extraction.
Bug 6 - Synthetic MRG disabled (SILENT)
The compressor was generating synthetic merge operation packets with made-up growth targets (e.g., RE:5+3+30%) that had no basis in the input text. Fix: synthetic MRG emission disabled entirely.
Bug 7 - Atomic fact splitting (SILENT)
Complex coordinated sentences ("X does A and B and C") were being emitted as a single packet with all three facts concatenated. Fix: coordinated clauses are split into separate packets.
7.3 Before/After Example
Input: "By 2025, revenue will reach 5 million dollars."
v0.7.0: ID:COMPRESS|PRD.75|^2025|<-By_2025|^date:2.0K+^amt:5milliondollars|NOW
v0.8.0: ID:COMPRESS|PRD.75|$revenue||^amt:5M+^date:2025|NOW
The v0.8.0 output is shorter (51 chars vs 70 chars), uses the correct subject ($revenue instead of ^2025), eliminates the spurious causal evidence (<-By_2025), and correctly formats the amount (5M) and year (2025).
7.4 Test Suite
80 passing tests. The 3 additional tests covered the silent bugs discovered by GPT-4.
8. v0.8.1 (April 10, 2026) - The Density Crisis
8.1 Summary
The quality fixes in v0.8.0 produced a compression ratio collapse. On a 40,000-character CloudKitchen internal memo, packet count rose from 208 (v0.7.0) to 380 (v0.8.0), and compression ratio fell from 1.92x to 1.34x. v0.8.1 attempted to recover density through clause-level packing.
8.2 Root Cause Analysis
The atomic fact splitting fix (Bug 7) was the primary driver. A single complex sentence previously emitting 1 packet now emitted 3-5 packets. Additionally, the value-formatting fixes and the pronoun rejection fix (Bug 3) both produced longer, more explicit packets because the compressor could no longer take shortcuts.
8.3 Packing Helpers Added
Three new internal functions:
- _find_clause_anchor(): Identified the main verb of a subordinate clause for use as a merge candidate.
- _extract_clause_subject(): Pulled the grammatical subject of a clause for co-reference tracking.
- _can_merge_fact(): Decision function - returns True if two facts share subject and operation and can be packed into one ARG2.
8.4 Packing Limits
- Maximum 3 facts per packet (ARG2 field).
- 64-character budget for the ARG2 field.
8.5 Mini Kernel
The Rosetta kernel prepended to every output was reduced from 5,853 characters to 958 characters by removing inline examples and keeping only grammar rules.
8.6 Results
CloudKitchen memo (40K chars):
v0.7.0: 208 packets, ratio 1.92x
v0.8.0: 380 packets, ratio 1.34x
v0.8.1: 380 packets, ratio 1.34x
Packet count was unchanged. The packing heuristics had no measurable effect.
8.7 Lesson
The packing limits were set at 3 facts and 64 characters. The average correctly-extracted fact in v0.8.0 was 18-22 characters. Three facts at 20 characters each = 60 characters, already near the limit. The helpers were firing at nearly every sentence, but the result was the same packet count because splitting was happening upstream (at the sentence and clause level) faster than packing could consolidate downstream.
9. v0.8.2 (April 10, 2026) - Adjusted Limits
9.1 Summary
v0.8.2 raised the packing limits and added two refinements, attempting to recover ratio through more permissive consolidation.
9.2 Changes
- Raised limits: Maximum 4 facts per packet (up from 3), 78-character ARG2 budget (up from 64).
- Qualifier handling: Qualifying clauses ("which was announced last quarter") no longer forced a packet split. They were either dropped or appended as non-evidence context.
- Role inference: Verb lemmas used for role labels instead of raw forms, improving consistency (report instead of reported, claim instead of claiming).
9.3 Results
CloudKitchen memo (40K chars):
v0.8.1: 380 packets, ratio 1.34x
v0.8.2: 380 packets, ratio 1.39x
Packet count identical. Ratio improved marginally (0.05x) due to the mini kernel size reduction compounding with the qualifier drop.
9.4 Lesson
The problem was not packing limits. A 4-fact, 78-character limit was already permissive for the extraction quality the pipeline was producing. The bottleneck was architectural: the emission layer was producing English-with-pipes rather than genuinely compressed semantic notation. Subject names were full English phrases. Evidence was full prepositional phrases. The packet format allowed short values but nothing enforced short values.
10. v0.9.0 (April 10, 2026) - Architecture Redesign
10.1 Summary
v0.9.0 abandoned incremental packing fixes and redesigned the emission architecture. Two key changes drove the improvement: entity aliasing (vocabulary compression) and same-subject merging (structural deduplication).
10.2 Two-Pass Architecture
The compressor now operated in two passes over the document:
Pass 1 - Document Scan:
- Extract all named entities across the full document.
- Assign 2-3 character aliases to entities appearing more than twice.
- Build the entity registry.
- Emit ontology manifest packet.
Pass 2 - Packed Emission:
- Replace entity names with aliases at emission time.
- Compress evidence into verb:object notation with a 30-character max.
- Merge adjacent packets sharing the same (subject, operation, temporal) triplet.
10.3 Entity Registry
Named entities appearing more than twice in the document were assigned aliases:
CloudKitchen -> CK
Marcus Chen -> MC
San Francisco -> SF
Q1 2025 -> Q1
The mapping was emitted as the first packet in every output - the ontology manifest:
ID:C|@m.O.doc||^df:CK=CloudKitchen+MC=Marcus_Chen+SF=San_Francisco|NOW
Any receiver could reconstruct full entity names from this manifest.
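Pass 1 can be sketched as a frequency count plus a naive aliasing rule (taking the capital letters of the name; the real scheme, including collision handling, is not specified):

```python
from collections import Counter

def build_registry(entities: list[str], min_mentions: int = 3) -> dict[str, str]:
    """Pass 1 sketch: alias entities appearing more than twice."""
    registry = {}
    for name, count in Counter(entities).items():
        if count >= min_mentions:
            # Naive alias: the capital letters of the name ("CloudKitchen" -> "CK").
            alias = "".join(ch for ch in name if ch.isupper()) or name[:2].upper()
            registry[name] = alias
    return registry

mentions = ["CloudKitchen"] * 5 + ["Marcus Chen"] * 3 + ["Berlin"]
print(build_registry(mentions))  # {'CloudKitchen': 'CK', 'Marcus Chen': 'MC'}
```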
10.4 Compressed Subjects
Subjects in v0.9.0 used aliases where registered:
v0.8.x: ID:COMPRESS|OBS.90|@Sales_team||^targets:30%|NOW
v0.9.0: ID:C|OBS.90|@CK.sales||^targets:30%|NOW
CK.sales communicates both the parent entity (CloudKitchen, via alias) and the sub-entity (sales team) in 8 characters versus 10 for Sales_team, while adding the organizational context that was previously absent.
10.5 Agent ID Compression
The agent identifier was shortened from "COMPRESS" to "C":
v0.8.x: ID:COMPRESS|... ("COMPRESS" - 8 chars)
v0.9.0: ID:C|... ("C" - 1 char)
7 characters saved per packet. On a 380-packet document, this alone saves 2,660 characters.
10.6 Evidence Compression
Evidence in ARG2 was reformatted to verb:object notation with a 30-character cap:
v0.8.x: <-due_to_declining_foot_traffic_in_downtown_locations
v0.9.0: <-declining:foot_traffic
The semantic content is preserved (causal relationship, verb, object) with substantial length reduction.
10.7 Same-Subject Merging
Adjacent packets sharing the same subject, operation type, and temporal reference were merged:
Before merging:
ID:C|OBS.85|@CK.revenue||^amt:2M|Q1
ID:C|OBS.85|@CK.revenue||^pct:-12%|Q1
After merging:
ID:C|OBS.85|@CK.revenue||^amt:2M+^pct:-12%|Q1
This is structurally sound because both facts are observations about the same entity in the same time window. The merged packet is unambiguous.
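The merge rule can be sketched as a single pass over adjacent packets (the packet representation is assumed):

```python
def merge_adjacent(packets: list[dict]) -> list[dict]:
    """Merge adjacent packets sharing (subject, op, temporal) by joining ARG2."""
    merged: list[dict] = []
    for p in packets:
        prev = merged[-1] if merged else None
        if prev and all(prev[k] == p[k] for k in ("subject", "op", "temporal")):
            prev["arg2"] = prev["arg2"] + "+" + p["arg2"]  # share one structural frame
        else:
            merged.append(dict(p))
    return merged

packets = [
    {"subject": "@CK.revenue", "op": "OBS", "temporal": "Q1", "arg2": "^amt:2M"},
    {"subject": "@CK.revenue", "op": "OBS", "temporal": "Q1", "arg2": "^pct:-12%"},
]
print(merge_adjacent(packets)[0]["arg2"])  # ^amt:2M+^pct:-12%
```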
10.8 Mini Kernel Final
The kernel prepended to outputs was reduced to 376 characters - the minimum grammar specification sufficient for a receiver to parse all v3 packet fields. This represented a 94% reduction from the 5,853-character full kernel shipped in v0.6.1.
10.9 Before/After Comparison
Full subject example:
v0.8.x: ID:COMPRESS|OBS.90|@Sales_team||^targets:30%+^date:Q1_2025|NOW
62 characters
v0.9.0: ID:C|OBS.90|@CK.sales||^targets:30%+^date:Q1_2025|NOW
53 characters
9-character reduction (roughly 15%) on this single packet. Multiplied across a 380-packet document, with additional savings from evidence compression and same-subject merging, the cumulative effect is substantial.
11. The Wrong Flank Lesson
The v0.8.0 through v0.8.2 development arc attacked the extraction layer - the part of the pipeline responsible for identifying entities, classifying operations, and scoring confidence. Each of the 7 bug fixes in v0.8.0 was technically correct:
- Year dates were genuinely being misformatted.
- Pronoun subjects were genuinely semantically wrong.
- ORG entities genuinely should outrank MONEY entities as packet subjects.
But every fix made the output longer. Rejecting pronouns forced the compressor to find a longer, more explicit subject. Splitting coordinated facts produced more packets. Labeling amounts correctly produced longer ARG2 values.
The extraction layer was not the bottleneck. The emission layer was producing English-with-pipes: packet fields that contained full English phrases with underscores substituted for spaces. This is not compression - it is reformatting.
The analogy to compiler development is direct: the compressor's parser (extraction) was being optimized while the code generator (emission) was producing verbose output. The quality of parsing is irrelevant if the code generator wastes the parsed information.
v0.9.0 attacked the right layer:
- Entity aliasing reduces vocabulary size at the emission stage.
verb:objectevidence notation reduces field length at emission.- Same-subject merging reduces packet count at emission.
- Agent ID shortening reduces per-packet overhead at emission.
None of these changes touched extraction quality. The extraction layer from v0.8.0 was carried forward unchanged into v0.9.0.
12. The Density vs Quality Tradeoff
The three-version arc illustrates a recurring tension in lossy compression design:
v0.7.0: Ratio 1.92x, extraction quality poor. The compressor achieved density by taking shortcuts - accepting pronouns as subjects, concatenating values without separators, emitting synthetic packets not grounded in input text. These shortcuts reduced output size but degraded semantic fidelity. A receiving agent parsing a v0.7.0 bundle would encounter subjects like "it" and values like "2.0K" that do not correspond to recoverable claims.
v0.8.0: Ratio 1.34x, extraction quality correct. Fixing the extraction bugs restored semantic fidelity but each fix eliminated a shortcut that had been compressing the output. The result was longer packets and more of them.
v0.9.0: Targets both. The extraction quality from v0.8.0 is preserved. Density is recovered not by reintroducing extraction shortcuts but by compressing the notation itself.
The fundamental insight of v0.9.0 is that AXL compression wins through two mechanisms:
- Vocabulary compression: Short aliases for repeated entities reduce the character cost of every packet that mentions those entities. A document with 40 mentions of "CloudKitchen" saves 10 characters per mention by aliasing to "CK".
- Structural deduplication: Same-subject merging eliminates the per-packet overhead (ID, agent, operation, confidence, temporal) for facts that share a subject context. Instead of paying 30-40 characters of structural overhead per fact, multiple facts share one structural frame.
Neither of these mechanisms requires accepting incorrect extractions. They are output-layer optimizations orthogonal to extraction quality.
13. Version Summary Table
| Version | Date | Key Change | Tests | Ratio (40K) |
|---|---|---|---|---|
| v0.4.0 | Mar 19 | Initial release, v1 format | 42 | N/A |
| v0.5.0 | Mar 29 | v3 support, auto-detection | 66 | N/A |
| v0.6.0 | Apr 7 | english_to_v3() compressor | N/A | N/A |
| v0.6.1 | Apr 7 | Self-bootstrapping kernel | N/A | N/A |
| v0.7.0 | Apr 8 | Decompressor, round-trip | 77 | 1.92x |
| v0.8.0 | Apr 8 | 7 bug fixes, GPT review | 80 | 1.34x |
| v0.8.1 | Apr 10 | Clause packing, mini kernel | 80 | 1.34x |
| v0.8.2 | Apr 10 | Raised limits, lemma roles | 80 | 1.39x |
| v0.9.0 | Apr 10 | Entity aliasing, merging | TBD | TBD |
14. Open Problems
The following problems remain open as of v0.9.0:
- Cross-sentence co-reference: Pronouns in later sentences referring to entities established earlier are currently dropped rather than resolved. A co-reference pass would recover subject identity across sentence boundaries.
- Nested entity disambiguation: "Marcus Chen, VP of Sales at CloudKitchen" contains three entities in one noun phrase. The current NER pipeline extracts PERSON and ORG but loses the role relationship.
- Confidence calibration: The hedging dictionary was assembled heuristically. No calibration against human-labeled confidence scores has been performed. The base scores per operation type are estimates.
- Alias collision: With 2-3 character aliases, collision is possible in documents with many proper nouns. No collision resolution strategy is currently implemented.
- Decompressor parity: The decompressor was not updated to handle the v0.9.0 alias format. A receiving agent parsing v0.9.0 output must expand aliases using the ontology manifest before passing packets to
v3_to_english().