Both are memory layers for AI agents, compared here on the LongMemEval-S benchmark. Mem0 is the better-known, peer-reviewed incumbent with a knowledge-graph variant. HeurChain is a faster, simpler, MIT-licensed alternative. Here's how the differences actually shape up.
Dataset: LongMemEval-S (ICLR 2025), 500 questions across 6 reasoning categories.
HeurChain numbers: measured on the heurchain-benchmarks harness (sharded_bench.py and multitenant_bench.py) against the broker in the main repo. Both are public; you can rerun the whole thing.
Mem0 numbers: from arXiv 2504.19413 (ECAI 2025) where available; left blank otherwise. We do not fabricate competitor numbers.
What we measured: retrieval R@k, MRR, NDCG@10, p50/p95 latency, and end-to-end QA accuracy with three independent judge models.
Cross-judge QA validation (May 2026): We ran the same retrieved facts through three judges from independent model families — full results published here. Mean QA accuracy (6 categories × 30 tasks): Local 14B 32.8%, DeepSeek V3.1 671B 31.7%, Kimi K2.6 28.3%. The two frontier judges agreed with each other on 87.8% of per-question verdicts, validating each as an independent judge. The local-14B mean was confirmed directionally correct within 4.5 pp of frontier judges — no inflation at the headline level.
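For readers who want to check figures like the per-judge means and the 87.8% agreement rate on their own runs, here is a minimal sketch of the arithmetic. It is not the harness's code: the judge names, the boolean verdict format, and the five toy verdicts are all assumptions made for illustration.

```python
from itertools import combinations


def accuracy(verdicts):
    """Fraction of questions a judge marked as answered correctly."""
    return sum(verdicts) / len(verdicts)


def agreement(a, b):
    """Fraction of questions on which two judges gave the same verdict."""
    if len(a) != len(b):
        raise ValueError("both judges must score the same questions")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy verdicts (True = judged correct) for five questions. Not real data;
# the judge names are hypothetical placeholders.
verdicts = {
    "judge_a": [True, False, True, False, False],
    "judge_b": [True, False, False, False, False],
    "judge_c": [True, True, True, False, False],
}

per_judge_accuracy = {name: accuracy(v) for name, v in verdicts.items()}
pairwise_agreement = {
    (m, n): agreement(verdicts[m], verdicts[n])
    for m, n in combinations(verdicts, 2)
}
```

Raw agreement like this does not correct for chance, so two judges that both reject most answers will look more aligned than they are. A chance-corrected statistic would be a reasonable addition if you rerun the comparison.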
What the per-category swings showed: the cross-judge run exposed a v2 fact-extraction quality bottleneck (specific entity-action assignments stripped to meta-summaries) on multi-session, knowledge-update, and temporal-reasoning categories. Where extraction preserves the answer-bearing detail, all three judges converge. Where it doesn't, the local 14B "won" by confabulating answers the local 14B judge then accepted — frontier judges honestly refused. Smoking-gun example in the writeup.
What we still owe: a v3 extraction prompt that preserves entity-action-value triples, and a closed-weight frontier judge run (Claude Sonnet 4.6 via Anthropic API) for additional independent validation. Both queued.
Bias disclosure: this is our internal harness, written by us. Of course it favors what we built well. The cross-judge run is the way we expose that bias and report it honestly. If you're evaluating both, the most reliable move is to run them on your data.
Retrieval-quality numbers. We publish R@k / MRR / NDCG; Mem0's paper publishes end-to-end QA accuracy with an LLM-as-judge (different metric family). The two cannot be merged into a single ranking, so we show both honestly.
Where this comparison gets muddy: Mem0's paper reports task QA accuracy with a GPT-4o judge, not retrieval R@k. Our table is retrieval-only because we don't want to introduce judge-model bias. Our internal QA accuracy with a 14B answerer is 38%; projected to GPT-4o, our numbers land near Mem0's — but until we publish the LLM-judge run, treat that as a hypothesis, not a result.
| Metric | HeurChain (dense) | HeurChain (hybrid α=0.9) | Mem0 (base) |
|---|---|---|---|
| R@1 | 0.543 | 0.542 | — |
| R@5 | 0.939 | 0.933 | — |
| R@10 | 0.972 | 0.978 | — |
| MRR | 0.911 | 0.913 | — |
| NDCG@10 | 0.911 | 0.914 | — |
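If you want to compute the same metric family on your own data, the sketch below shows one common way to define R@k, MRR, and NDCG@k for a single query with binary relevance. It is an illustration, not the benchmark harness: the function names and document IDs are hypothetical, and recall is assumed to mean the fraction of relevant items found in the top k, which may differ from the harness's exact definition.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy query: four retrieved IDs, two of which are relevant (made-up data).
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

r_at_2 = recall_at_k(ranked, relevant, 2)
rr = reciprocal_rank(ranked, relevant)
ndcg_at_4 = ndcg_at_k(ranked, relevant, 4)
```

MRR is then the mean of `reciprocal_rank` over all queries, and the table-style R@k and NDCG@10 figures are means of the per-query values.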
The latencies below come from different harnesses running on different deployment topologies, so they are not strictly apples-to-apples. The multi-tenant Docker number is the closest analog to what a SaaS would actually serve. The in-process number shows what the algorithm itself is capable of with the network removed.
| System / configuration | P95 latency | Source | What it actually measures |
|---|---|---|---|
| HeurChain — multi-tenant load (Docker, 10 tenants concurrent) | 20.5 ms | This benchmark | Closest to production SaaS scenario |
| HeurChain — dense, in-process | 35 ms | This benchmark | Algorithm-only ceiling; no network |
| HeurChain — BM25 only | 4.6 ms | This benchmark | Keyword-only path; useful for hot queries |
| Mem0 (reference) | 200 ms | Mem0 paper Table 1 | Search latency; stack-specific |
| LangMem (reference) | 59,820 ms | Mem0 paper Table 1 | Vector scan; broken at LongMemEval scale |
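Because p50 and p95 can be computed in several slightly different ways, it helps to pin down one definition before comparing numbers across harnesses. The sketch below uses the nearest-rank method on a list of per-request timings. This is an assumed definition for illustration, and the sample values are invented; neither the harness nor the Mem0 paper is confirmed to use it.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Ten made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 90, 13, 16, 12, 15, 14]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

With only ten samples the p95 is simply the slowest request, which is why tail latencies need a large sample count to be meaningful.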
| Feature | HeurChain | Mem0 |
|---|---|---|
| Retrieval method | BM25 + dense (bge-m3) + RRF (tunable α) | Dense vector; hybrid added Apr 2026 (BM25 + vector + entity) |
| Storage backend | Redis (vectors + BM25) + SQLite (metadata) | Vector DB (Qdrant / pgvector) + optional Neo4j (for Mem0g) |
| Temporal awareness | Sequence-tagged facts (on roadmap) | Flat vector storage in base; Mem0g graph variant adds entity timeline |
| Multi-tenant model | Per-tenant namespace + agent_id sub-isolation; published zero-leak verification | 4-scope model (user_id / agent_id / run_id / org_id) |
| Self-hosted option | Single Go binary + Redis + SQLite | Open-source Python library; you supply Postgres + Qdrant + optionally Neo4j |
| API surface | REST + MCP SSE — auto-discovered by Claude Code, ChatGPT Apps | Python SDK + REST; MCP support varies |
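To make the "BM25 + dense + RRF (tunable α)" row concrete, here is a minimal sketch of weighted reciprocal rank fusion. How α actually enters HeurChain's fusion is not stated above, so this assumes α weights the dense ranking and 1 - α weights the BM25 ranking. The function name, the document IDs, and the constant k = 60 (a common default for RRF) are likewise assumptions, not the broker's implementation.

```python
def weighted_rrf(dense_ranking, bm25_ranking, alpha=0.9, k=60):
    """Fuse two ranked lists of document IDs.

    Each list contributes weight / (k + rank) per document, where the dense
    list has weight alpha and the BM25 list has weight (1 - alpha).
    """
    scores = {}
    weighted_lists = ((alpha, dense_ranking), (1.0 - alpha, bm25_ranking))
    for weight, ranking in weighted_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from the two retrievers.
dense = ["a", "b", "c"]
bm25 = ["c", "a", "d"]

fused = weighted_rrf(dense, bm25, alpha=0.9)
```

Because the fusion works on ranks rather than raw scores, the BM25 and embedding scores never need to be normalized onto a shared scale. Setting α to 1.0 reduces to the dense ordering, and 0.0 to the BM25 ordering.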
We're not going to pretend HeurChain wins on every dimension. These are real cases where Mem0 is the better fit:
Most readers should pick on architecture fit, not price. Mem0 has a free hosted tier with usage limits, plus paid tiers from ~$249/mo for production use. HeurChain self-host is MIT-licensed. The numbers below exist so you can see them, not because we think they should drive your decision.
| If you... | HeurChain | Mem0 |
|---|---|---|
| Hobby / kicking the tires | Free self-host (MIT) | Free tier (Mem0 hosted, with usage limits) |
| Solo developer, managed | $5/mo (Solo) | Paid tiers from ~$249/mo for production |
| Team, shared workspace | $49.99/mo (Workgroup) | Mem0 team pricing varies |
| Enterprise — SOC2, SAML | Custom | Custom |
Both projects are free to self-host. If you want a managed solo plan, HeurChain's $5/mo is materially cheaper than Mem0's paid tiers. But if your usage fits Mem0's free tier and you don't need its paywalled advanced features, that's a legitimate path too.
Run python3 sharded_bench.py for the single-tenant baseline and python3 multitenant_bench.py --mode load --max-tenants 10 for the Docker multi-tenant number.
Or self-host the same binary for free. If Mem0 fits your use case better, use Mem0; we'd rather you pick the right tool.