Letta is an agent runtime where memory lives inside the agent. HeurChain is memory infrastructure external to any agent — ChatGPT, Claude, Cursor, your own code. Different scopes; both legitimate. Here's how each fits.
Dataset: LongMemEval-S (ICLR 2025), 500 questions across 6 reasoning categories.
HeurChain numbers: measured on the heurchain-benchmarks harness (sharded_bench.py and multitenant_bench.py) against the broker in the main repo. Both are public; you can rerun the whole thing.
Letta numbers: taken from the MemGPT paper (arXiv 2310.08560) and current Letta docs where available; left blank otherwise. We do not fabricate competitor numbers.
What we measured: retrieval R@k, MRR, NDCG@10, p50/p95 latency, and end-to-end QA accuracy with three independent judge models.
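For readers unfamiliar with the retrieval metrics listed above, here is a minimal sketch of how R@k, MRR, and NDCG@10 are commonly computed for a single question. It is not the code from the heurchain-benchmarks harness; the function names are illustrative, and it assumes binary relevance with R@k defined as the fraction of relevant items found in the top k. The harness may define these differently.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG with binary gains: DCG of this ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy data, not benchmark output: two relevant items, found at ranks 2 and 4.
ranked = ["a", "b", "c", "d"]
relevant = ["b", "d"]
r_at_2 = recall_at_k(ranked, relevant, 2)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, 10)
```

MRR is the mean of `reciprocal_rank` over all questions; the other two metrics are averaged over questions the same way.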
Cross-judge QA validation (May 2026): We ran the same retrieved facts through three judges from independent model families — full results published here. Mean QA accuracy (6 categories × 30 tasks): Local 14B 32.8%, DeepSeek V3.1 671B 31.7%, Kimi K2.6 28.3%. The two frontier judges agreed with each other on 87.8% of per-question verdicts, validating each as an independent judge. The local-14B mean was confirmed directionally correct within 4.5 pp of frontier judges — no inflation at the headline level.
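The per-question agreement figure above is a simple pairwise statistic. As a hedged sketch, assuming each judge's output has been reduced to one boolean verdict per question in the same question order, it could be computed as follows. The judge names and verdicts are made-up toy data, not results from the run.

```python
from itertools import combinations


def accuracy(verdicts):
    """Share of questions a judge marked as correctly answered."""
    return sum(verdicts) / len(verdicts)


def agreement_rate(a, b):
    """Share of questions on which two judges returned the same verdict."""
    if len(a) != len(b):
        raise ValueError("judges must score the same set of questions")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(verdicts_by_judge):
    """Agreement rate for every unordered pair of judges."""
    return {
        (j1, j2): agreement_rate(verdicts_by_judge[j1], verdicts_by_judge[j2])
        for j1, j2 in combinations(sorted(verdicts_by_judge), 2)
    }


# Hypothetical verdicts for four questions (True = answer judged correct).
toy_verdicts = {
    "judge_a": [True, False, True, True],
    "judge_b": [True, False, False, True],
    "judge_c": [False, False, False, True],
}
agreement = pairwise_agreement(toy_verdicts)
```

Raw agreement does not correct for chance; a statistic such as Cohen's kappa would be a stricter check if the verdict distribution is skewed.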
What the per-category swings showed: the cross-judge run exposed a v2 fact-extraction quality bottleneck (specific entity-action assignments stripped to meta-summaries) on multi-session, knowledge-update, and temporal-reasoning categories. Where extraction preserves the answer-bearing detail, all three judges converge. Where it doesn't, the local 14B "won" by confabulating answers the local 14B judge then accepted — frontier judges honestly refused. Smoking-gun example in the writeup.
What we still owe: a v3 extraction prompt that preserves entity-action-value triples, and a closed-weight frontier judge run (Claude Sonnet 4.6 via Anthropic API) for additional independent validation. Both queued.
Bias disclosure: this is our internal harness, written by us. Of course it favors what we built well. The cross-judge run is the way we expose that bias and report it honestly. If you're evaluating both, the most reliable move is to run them on your data.
Letta doesn't publish standalone retrieval R@k — memory is one layer of a full agent runtime, evaluated end-to-end on task completion. Comparing retrieval directly would mean instrumenting Letta's internal layer, which we haven't done. Our numbers below are HeurChain on LongMemEval-S; treat the cross-system comparison as illustrative of architecture, not of "who wins."
Where this comparison gets muddy: Letta is an agent runtime, not a memory-only service. Comparing retrieval R@k apples-to-apples is category-mixing — Letta's memory is evaluated as part of the whole agent on task benchmarks (MMLU-style, AgentBench-style), not as a standalone retriever. Pick Letta if you want a memory-aware agent runtime. Pick HeurChain if you want a memory layer your existing agents can call into.
| Metric | HeurChain (dense) | HeurChain (hybrid α=0.9) | Letta |
|---|---|---|---|
| R@1 | 0.543 | 0.542 | — |
| R@5 | 0.939 | 0.933 | — |
| R@10 | 0.972 | 0.978 | — |
| MRR | 0.911 | 0.913 | — |
| NDCG@10 | 0.911 | 0.914 | — |
These latencies come from different harnesses on different deployment topologies, so they are not strictly apples-to-apples. The multi-tenant Docker number is the closest analog to what a SaaS would actually serve. The in-process number shows what the algorithm itself is capable of with the network removed.
| System / configuration | P95 latency | Source | What it actually measures |
|---|---|---|---|
| HeurChain — multi-tenant load (Docker, 10 tenants concurrent) | 20.5 ms | This benchmark | Closest to production SaaS scenario |
| Letta agent loop (memory I/O) | Embedded | Letta docs | Not exposed as standalone metric |
| HeurChain — dense, in-process | 35 ms | This benchmark | Algorithm-only ceiling; no network |
| HeurChain — BM25 only | 4.6 ms | This benchmark | Keyword-only path; useful for hot queries |
| Mem0 (reference) | 200 ms | Mem0 paper Table 1 | Search latency; stack-specific |
| LangMem (reference) | 59,820 ms | Mem0 paper Table 1 | Vector scan; broken at LongMemEval scale |
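If you rerun the latency comparison against your own deployment, the sketch below shows one way to collect per-call timings and reduce them to p50/p95. It is an illustration, not the harness code: `measure_latencies_ms` and `percentile` are hypothetical helper names, the nearest-rank percentile method is an assumption (other tools interpolate and will give slightly different values), and `fn` stands in for whatever search call you are timing.

```python
import math
import time


def measure_latencies_ms(fn, n_calls=200):
    """Call fn repeatedly and return the wall-clock time of each call in ms."""
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Stand-in workload; replace the lambda with a real query against your system.
latencies = measure_latencies_ms(lambda: sum(range(1000)), n_calls=50)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

Timing the call from the client side includes network and serialization cost, which is why the Docker and in-process rows above are not directly comparable.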
| | HeurChain | Letta |
|---|---|---|
| Retrieval method | BM25 + dense (bge-m3) + RRF (tunable α) | Tiered virtual memory (core / archival / recall) with dense retrieval over archival |
| Storage backend | Redis (vectors + BM25) + SQLite (metadata) | Postgres + pgvector for archival memory |
| Position in stack | External memory service called by any agent (ChatGPT, Claude, Cursor, custom) | Memory lives inside the Letta agent runtime — agent + memory ship together |
| Multi-tenant model | Per-tenant namespace + agent_id sub-isolation; published zero-leak verification | Per-agent isolation within a Letta server; multi-tenant deployment is your responsibility |
| Self-hosted option | Single Go binary + Redis + SQLite | Letta server + Postgres + pgvector via Docker Compose |
| API surface | REST + MCP SSE — auto-discovered by Claude Code, ChatGPT Apps | Letta REST API (agent-centric) + Python SDK |
| Coupling to LLM choice | None — pure retrieval infrastructure | Letta loop expects an LLM with tool-calling; ships with provider integrations |
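To make the "BM25 + dense + RRF (tunable α)" row concrete, here is a minimal sketch of weighted reciprocal rank fusion. This is not HeurChain's implementation. It assumes α is the weight on the dense ranking and 1 − α the weight on BM25, and it uses the conventional RRF constant k = 60; the broker may define both differently.

```python
def weighted_rrf(dense_ranked, bm25_ranked, alpha=0.9, k=60):
    """Fuse two ranked lists of document ids with weighted reciprocal rank fusion.

    Assumption: alpha weights the dense list and (1 - alpha) weights BM25.
    Each list contributes weight / (k + rank) for every document it contains.
    """
    scores = {}
    for weight, ranking in ((alpha, dense_ranked), (1.0 - alpha, bm25_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings, not benchmark data.
dense = ["a", "b", "c"]
bm25 = ["c", "a", "d"]
fused = weighted_rrf(dense, bm25, alpha=0.9)
```

Because RRF uses ranks rather than raw scores, the two retrievers do not need comparable score scales, which is one common reason to prefer it over a weighted sum of scores.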
We're not going to pretend HeurChain wins on every dimension. These are real cases where Letta is the better fit:
Most readers should pick on architecture fit, not price. Letta Cloud has a free tier; the runtime is Apache-2.0 open source. HeurChain self-host is MIT-licensed. The numbers below exist so you can see them, not because we think they should drive your decision.
| If you... | HeurChain | Letta |
|---|---|---|
| Hobby / kicking the tires | Free self-host (MIT) | Free tier (Letta Cloud) or self-host (Apache 2.0) |
| Solo developer, managed | $5/mo (Solo — memory only) | Letta Cloud paid plans (full runtime included) |
| Team, shared workspace | $49.99/mo (Workgroup — memory only) | Letta team pricing varies |
| Enterprise — SOC2, SAML | Custom | Custom |
Apples vs oranges note: Letta's price covers a full agent runtime (LLM orchestration, tool calling, memory — the whole loop). HeurChain's price covers only the memory layer; you bring the agent. If you don't have an agent runtime yet, Letta gives you more per dollar. If you already have one in ChatGPT / Claude / Cursor, HeurChain adds memory without forcing you to adopt a new runtime.
To reproduce the numbers, run `python3 sharded_bench.py` for the single-tenant baseline and `python3 multitenant_bench.py --mode load --max-tenants 10` for the Docker multi-tenant number. You can also self-host the same binary for free. If Letta fits your use case better, use Letta; we'd rather you pick the right tool.