Cognee builds a knowledge graph from your documents and reasons over it. HeurChain ranks and returns relevant passages with hybrid search. Same input, different output shape. Here's how to choose.
Dataset: LongMemEval-S (ICLR 2025), 500 questions across 6 reasoning categories.
HeurChain numbers: measured on the heurchain-benchmarks harness (sharded_bench.py and multitenant_bench.py) against the broker in the main repo. Both are public; you can rerun the whole thing.
Cognee numbers: from the topoteretes/cognee GitHub repo and cognee.ai docs where available; left blank otherwise. We do not fabricate competitor numbers.
What we measured: retrieval R@k, MRR, NDCG@10, p50/p95 latency, and end-to-end QA accuracy with three independent judge models.
Cross-judge QA validation (May 2026): We ran the same retrieved facts through three judges from independent model families — full results published here. Mean QA accuracy (6 categories × 30 tasks): Local 14B 32.8%, DeepSeek V3.1 671B 31.7%, Kimi K2.6 28.3%. The two frontier judges agreed with each other on 87.8% of per-question verdicts, validating each as an independent judge. The local-14B mean was confirmed directionally correct within 4.5 pp of frontier judges — no inflation at the headline level.
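The agreement and gap figures above are simple to recompute from per-question verdicts. The following is a minimal sketch, assuming each judge emits one boolean verdict per question; the function name and the toy verdict lists are hypothetical and do not reflect the harness's actual data format.

```python
def agreement_rate(verdicts_a, verdicts_b):
    """Share of questions on which two judges returned the same verdict."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("both judges must score the same questions")
    if not verdicts_a:
        return 0.0
    same = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return same / len(verdicts_a)


# Toy verdicts (made up for illustration): True = judge accepted the answer.
judge_a = [True, False, False, True, False]
judge_b = [True, False, True, True, False]
toy_agreement = agreement_rate(judge_a, judge_b)

# Mean QA accuracy per judge, as reported in the text (percent).
means = {"Local 14B": 32.8, "DeepSeek V3.1 671B": 31.7, "Kimi K2.6": 28.3}

# Largest gap between the local judge and any frontier judge, in points.
max_gap_pp = max(
    means["Local 14B"] - value
    for name, value in means.items()
    if name != "Local 14B"
)
```

On the reported means, the largest gap is 32.8 minus 28.3, which is the 4.5 pp quoted above. If agreement was computed over all 180 tasks (an assumption), 87.8% would correspond to roughly 158 matching verdicts.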
What the per-category swings showed: the cross-judge run exposed a v2 fact-extraction quality bottleneck (specific entity-action assignments stripped to meta-summaries) on multi-session, knowledge-update, and temporal-reasoning categories. Where extraction preserves the answer-bearing detail, all three judges converge. Where it doesn't, the local 14B "won" by confabulating answers the local 14B judge then accepted — frontier judges honestly refused. Smoking-gun example in the writeup.
What we still owe: a v3 extraction prompt that preserves entity-action-value triples, and a closed-weight frontier judge run (Claude Sonnet 4.6 via Anthropic API) for additional independent validation. Both queued.
Bias disclosure: this is our internal harness, written by us. Of course it favors what we built well. The cross-judge run is the way we expose that bias and report it honestly. If you're evaluating both, the most reliable move is to run them on your data.
Cognee evaluates on knowledge-graph benchmarks (entity recall, multi-hop reasoning) — not retrieval R@k. We don't have apples-to-apples Cognee LongMemEval-S numbers to publish, and we won't fabricate them. The table below is HeurChain's measured performance.
Where this comparison gets muddy: Cognee's strength is knowledge-graph reasoning, not flat-passage retrieval. Comparing the two on R@k is category-mixing. Pick Cognee if your queries require entity relationship traversal ("what companies has this VC invested in alongside Acme?"). Pick HeurChain if your queries are "give me the most relevant context for this prompt." Both are legitimate; they're solving different problems with overlapping interfaces.
| Metric | HeurChain (dense) | HeurChain (hybrid α=0.9) | Cognee |
|---|---|---|---|
| R@1 | 0.543 | 0.542 | — |
| R@5 | 0.939 | 0.933 | — |
| R@10 | 0.972 | 0.978 | — |
| MRR | 0.911 | 0.913 | — |
| NDCG@10 | 0.911 | 0.914 | — |
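For readers who want to check the metrics in the table above on their own data, here is a minimal sketch of R@k, reciprocal rank, and binary-relevance NDCG@k for a single question. It uses the textbook definitions; whether the heurchain-benchmarks harness computes them exactly this way (for example, how it handles questions with several relevant passages) is an assumption. MRR is the mean of the reciprocal rank over all questions.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical question with two relevant sessions, s1 and s2.
ranked = ["s3", "s1", "s7", "s2"]
relevant = ["s1", "s2"]
r_at_1 = recall_at_k(ranked, relevant, 1)
r_at_5 = recall_at_k(ranked, relevant, 5)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, 10)
```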
These latencies come from different harnesses running on different deployment topologies, so they are not strictly apples-to-apples. The multi-tenant Docker number is the closest analog to what a SaaS would actually serve. The in-process number shows what the algorithm itself is capable of with the network removed.
| System / configuration | P95 latency | Source | What it actually measures |
|---|---|---|---|
| HeurChain — multi-tenant load (Docker, 10 tenants concurrent) | 20.5 ms | This benchmark | Closest to production SaaS scenario |
| Cognee (graph query) | Variable | Cognee docs | Depends on graph depth + backend |
| HeurChain — dense, in-process | 35 ms | This benchmark | Algorithm-only ceiling; no network |
| HeurChain — BM25 only | 4.6 ms | This benchmark | Keyword-only path; useful for hot queries |
| Mem0 (reference) | 200 ms | Mem0 paper Table 1 | Search latency; stack-specific |
| LangMem (reference) | 59,820 ms | Mem0 paper Table 1 | Vector scan; broken at LongMemEval scale |
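If you collect your own timings to compare against the table above, a P95 is straightforward to compute. This sketch uses the nearest-rank method, which is one of several percentile conventions; the method the benchmark harness uses is not stated here, and the sample values are made up.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-query latencies in milliseconds.
samples = [3.9, 4.1, 4.4, 5.0, 5.2, 6.0, 7.5, 9.0, 12.0, 18.0]
p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
```

With only ten samples the P95 is simply the slowest query, which is why tail latencies need many more measurements than medians to be stable.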
| | HeurChain | Cognee |
|---|---|---|
| Retrieval method | BM25 + dense (bge-m3) + RRF (tunable α) | LLM-driven entity + relationship extraction into a graph; traversal + vector search |
| Storage backend | Redis (vectors + BM25) + SQLite (metadata) | Graph DB (Kuzu / Neo4j / FalkorDB) + vector DB (LanceDB / Qdrant / others) |
| Ingestion model | Embed-only — lightweight | LLM call per document for entity extraction — heavyweight (paid at ingest) |
| Query model | Single retrieval call returns ranked passages | Cognee Search API: multi-step graph traversal returning structured results |
| Multi-tenant model | Per-tenant namespace + agent_id sub-isolation; published zero-leak verification | Operator-managed; depends on which graph DB + vector DB backends you wire up |
| Self-hosted option | Single Go binary + Redis + SQLite | Python service + your choice of graph DB + vector DB |
| API surface | REST + MCP SSE — auto-discovered by Claude Code, ChatGPT Apps | Python SDK + REST; MCP integration available |
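To make the "BM25 + dense + RRF (tunable α)" row concrete, here is a minimal sketch of one way to weight reciprocal rank fusion. It assumes α scales the dense ranking and 1 − α scales the BM25 ranking, with the conventional smoothing constant k = 60. Both choices are assumptions for illustration, and the function is hypothetical rather than HeurChain's actual implementation.

```python
def weighted_rrf(dense_ranked, bm25_ranked, alpha=0.9, k=60):
    """Fuse two ranked ID lists with weighted reciprocal rank fusion.

    Each list contributes weight / (k + rank) per document; alpha weights
    the dense list and (1 - alpha) weights the BM25 list (assumed scheme).
    """
    scores = {}
    for weight, ranking in ((alpha, dense_ranked), (1.0 - alpha, bm25_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
dense = ["a", "b", "c"]
bm25 = ["c", "a", "d"]
fused = weighted_rrf(dense, bm25, alpha=0.9)
dense_only = weighted_rrf(dense, bm25, alpha=1.0)
```

Under this scheme, α = 0.9 lets a strong keyword match nudge a passage upward (here "c" overtakes "b") while the dense ordering still dominates.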
We're not going to pretend HeurChain wins on every dimension. These are real cases where Cognee is the better fit:
Most readers should pick on architecture fit, not price. Cognee is Apache-2.0 open source; Cognee Cloud has a free tier with limits. HeurChain self-host is MIT-licensed. The numbers below exist so you can see them, not because we think they should drive your decision.
| If you... | HeurChain | Cognee |
|---|---|---|
| Hobby / kicking the tires | Free self-host (MIT) | Free self-host (Apache 2.0) or Cognee Cloud free tier |
| Solo developer, managed | $5/mo (Solo) | Cognee Cloud paid tiers |
| Team, shared workspace | $49.99/mo (Workgroup) | Cognee Cloud team pricing |
| Enterprise — SOC2, SAML | Custom | Custom |
Both have free options. The real cost difference at scale is ingestion: Cognee runs an LLM call per document at ingest time for entity extraction (~$0.005-0.015 per doc at GPT-4o prices, depending on size). HeurChain ingestion is embed-only. If you're indexing high-volume content streams, that compounds; if you're indexing a few hundred documents, it's noise.
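To see where that ingestion cost stops being noise, a back-of-the-envelope calculation helps. The sketch below uses only the per-document range quoted above; the function name and corpus sizes are hypothetical. It leaves out embedding cost, which applies to either system and for which no figure is given here.

```python
def ingest_cost_range(num_docs, low_per_doc=0.005, high_per_doc=0.015):
    """Return (low, high) LLM extraction cost in USD for num_docs documents,
    using the per-document range quoted in the text."""
    return num_docs * low_per_doc, num_docs * high_per_doc


# Hypothetical corpus sizes.
small_low, small_high = ingest_cost_range(300)
large_low, large_high = ingest_cost_range(1_000_000)
```

At 300 documents the range is about $1.50 to $4.50; at a million documents it is about $5,000 to $15,000, before any re-ingestion when documents change.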
Run python3 sharded_bench.py for the single-tenant baseline and python3 multitenant_bench.py --mode load --max-tenants 10 for the Docker multi-tenant number. Or self-host the same binary for free. If Cognee fits your use case better, use Cognee; we'd rather you pick the right tool.