# Headroom Cuts LLM Token Use 60–95% via Context Compression

> Headroom, an open-source context compression library published by chopratejas on GitHub, preprocesses tool outputs, logs, RAG chunks, files, and conversation history before they reach a large language model, reducing token counts by 60 to 95 percent while preserving answer quality. The project ships as a Python and TypeScript library, an OpenAI-compatible proxy, an agent wrapper, and an MCP server.

- Canonical URL: https://agentry.press/news/headroom-cuts-llm-token-use-6095-via-context-compression/
- Type: News
- Published: 2026-06-19
- By: agentry
- Tags: context-compression, open-source, agent-engineering, mcp-server, langchain, token-optimization

---

## What Headroom Does

Headroom is an open-source context compression layer that intercepts the inputs an AI agent sends to a large language model and reduces their token count before the model ever processes them. The library targets tool outputs, logs, RAG chunks, files, and conversation history, compressing each category before LLM ingestion while leaving the model's answers unchanged [1].

The project's repository documents a concrete example: a 10,144-token input reduced to 1,260 tokens, with the same FATAL error identified in the output as would have been found in the uncompressed version [1]. The stated compression range across workloads is 60 to 95 percent.

The project runs locally, meaning data does not leave the operator's environment during compression [1].

## Four Deployment Modes

Headroom ships with four distinct integration paths, each suited to a different adoption scenario.

The library mode exposes a `compress(messages)` function available in both Python and TypeScript, allowing developers to call compression inline within an existing application without changing the surrounding architecture [1].

The proxy mode launches with `headroom proxy --port 8787` and presents an OpenAI-compatible endpoint. Because it requires zero code changes, any application or agent already targeting an OpenAI-style API can route through Headroom without modification, regardless of the programming language in use [1].

The agent-wrap mode accepts a single command, `headroom wrap claude|codex|cursor|aider|copilot`, and wraps a named agent tool directly. This path is aimed at developers using CLI-based coding assistants who want compression without touching application code [1].

The MCP server mode exposes three tools, `headroom_compress`, `headroom_retrieve`, and `headroom_stats`, to any MCP client, integrating compression into the Model Context Protocol ecosystem [1].

## Six Compression Algorithms and Reversibility

Headroom's internal pipeline routes content through a CacheAligner and ContentRouter before applying one of several compression strategies. The documented algorithms include SmartCrusher for JSON payloads, CodeCompressor using AST-based analysis for source code, and Kompress-base, a Hugging Face-backed model for general text [1].

A mechanism called Compressed Context Retrieval (CCR) ensures that original content is never deleted. When the LLM needs a full original, it can retrieve it on demand through the retrieval tool exposed by the pipeline [1]. This reversibility property is positioned as a safeguard against information loss in long-running or multi-step agent sessions.

## Cross-Agent Memory and Learning

Headroom maintains a shared memory store that persists across sessions involving different models, including Anthropic Claude, OpenAI Codex, and Google Gemini. The store performs automatic deduplication, preventing the same content from being stored and compressed redundantly across agents [1].

A separate feature, `headroom learn`, mines failed agent sessions and writes corrections to `CLAUDE.md` or `AGENTS.md` files [1]. This mechanism allows teams to accumulate session-level corrections in a form that agent frameworks can read on subsequent runs.

## Supported Agents and Practical Fit

The project explicitly targets Claude Code, Cursor, Codex, Aider, and GitHub Copilot through the agent-wrap interface. Framework-level integrations cover LangChain, Agno, and Strands, as well as custom agent code [1].

For teams running high-volume or long-context workloads, the practical implications center on three variables: token cost per request, latency from prompt size, and available context window. Compressing inputs by 60 to 95 percent before they reach the model directly reduces billable tokens and can free context window space for additional tool outputs or history in agents that otherwise hit length limits [1].

Migration friction varies by path. The proxy mode requires no code changes, making it the lowest-effort entry point. The library and agent-wrap modes require either a function call or a CLI command, respectively. The MCP server mode requires an MCP-capable client.

## FAQ

**Q. Does Headroom send data to an external service during compression?**
The repository states that Headroom runs locally and that data stays in the operator's environment [1]. No external compression API is described in the available documentation.

**Q. What happens if the compressed context omits a detail the LLM needs?**
The Compressed Context Retrieval mechanism retains originals and provides the LLM with a retrieval tool to fetch full content on demand, so the model is not limited to the compressed version alone [1].

**Q. Which integration path requires the fewest code changes?**
The proxy mode, launched with `headroom proxy --port 8787`, is documented as requiring zero code changes and works with any language targeting an OpenAI-compatible endpoint [1].

**Q. Does the shared memory store work across different LLM providers?**
The repository documents cross-agent memory spanning Claude, Codex, and Gemini with automatic deduplication, suggesting the store is model-agnostic at the session level [1].

**Q. Is TypeScript support available, or is this Python-only?**
Headroom ships as both a Python and TypeScript library, with the `compress(messages)` function available in both languages [1].

## Key Takeaways

- Headroom compresses tool outputs, logs, RAG chunks, files, and conversation history before LLM ingestion, with documented token reductions of 60 to 95 percent [1].
- Four deployment modes (library, proxy, agent-wrap, MCP server) offer a range of integration paths from zero-code-change proxy routing to inline function calls [1].
- Compressed Context Retrieval preserves originals locally and makes them available for on-demand retrieval, maintaining reversibility [1].
- A shared cross-agent memory store with automatic deduplication spans Claude, Codex, and Gemini sessions, and the `headroom learn` feature mines failed sessions to write corrections back to agent instruction files [1].
- Supported frameworks include LangChain, Agno, and Strands, with agent-wrap support for Claude Code, Cursor, Codex, Aider, and Copilot [1].

## Sources

1. https://github.com/chopratejas/headroom
