Is HeurChain faster than Zep?

On retrieval latency under multi-tenant load, yes — by roughly 8x. HeurChain measures 20.5ms p95 under a 10-tenant Docker stack; Zep/Graphiti's published number is roughly 300ms. The architectural reason is that Graphiti's temporal-KG traversal is in the query hot path, whereas HeurChain's hybrid retrieval is index lookups. These numbers come from different harnesses — see our methodology section.

Does HeurChain support temporal facts like Zep?

Not as a first-class architectural primitive — that's deliberate. HeurChain stores sequence-tagged facts and recency-ranks them in the dense retriever, but does not model explicit fact validity windows the way Graphiti does. If your domain is heavily temporal (contracts, medical records, evolving entities), Zep is the right tool. For most general agent memory, the temporal overhead isn't earning its keep.

How much does HeurChain cost compared to Zep?

Both have free options: HeurChain self-host is MIT, Graphiti self-host is Apache-2.0, Zep Cloud has a free tier. For managed, HeurChain Solo is $5/month. Self-hosting Zep means operating Neo4j, Postgres, and the Zep services; HeurChain is one binary plus Redis + SQLite. Pick based on what your ops team already runs.

When is Zep the better choice?

Four cases. First, domains with explicit fact validity that need temporal querying (contracts, medical, evolving entities). Second, multi-hop graph traversal queries that return paths, not passages. Third, teams already operating Neo4j in production. Fourth, when procurement values peer-reviewed academic backing.

When is HeurChain the better choice?

Four cases. First, when your queries are "give me relevant context," not graph traversal — the common case. Second, latency-sensitive agent loops. Third, single-binary self-hosting without a graph DB. Fourth, multi-tenant SaaS with auditable isolation requirements.

HeurChain vs Zep — Honest Comparison

Methodology — read this first

How we measured this comparison

Dataset: LongMemEval-S (ICLR 2025), 500 questions across 6 reasoning categories.

HeurChain numbers: measured on the heurchain-benchmarks harness (sharded_bench.py and multitenant_bench.py) against the broker in the main repo. Both are public; you can rerun the whole thing.

Zep numbers: from arXiv 2501.13956 (Zep / Graphiti) where available; left blank otherwise. We do not fabricate competitor numbers.

What we measured: retrieval R@k, MRR, NDCG@10, p50/p95 latency, and end-to-end QA accuracy with three independent judge models.

Cross-judge QA validation (May 2026): We ran the same retrieved facts through three judges from independent model families; full results published here. Mean QA accuracy (6 categories × 30 tasks): Local 14B 32.8%, DeepSeek V3.1 671B 31.7%, Kimi K2.6 28.3%. The two frontier judges agreed with each other on 87.8% of per-question verdicts, which supports treating each as an independent check. The local 14B mean sits within 4.5 pp of both frontier judges (1.1 pp above DeepSeek, 4.5 pp above Kimi), so there is no inflation at the headline level.
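To make the two summary statistics above concrete, here is a minimal sketch of how per-judge accuracy and pairwise agreement can be computed from per-question verdicts. The function names and the toy verdict lists are our illustration only; they are not taken from the heurchain-benchmarks harness and are not benchmark data.

```python
def accuracy(verdicts: list[bool]) -> float:
    """Fraction of questions a judge marked as answered correctly."""
    return sum(verdicts) / len(verdicts)


def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of questions on which two judges gave the same verdict."""
    if len(a) != len(b):
        raise ValueError("both judges must score the same questions")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy verdicts for five questions (illustrative, not real results).
judge_a = [True, False, True, False, False]
judge_b = [True, False, False, False, False]

print(accuracy(judge_a))            # 0.4
print(accuracy(judge_b))            # 0.2
print(agreement(judge_a, judge_b))  # 0.8
```

Agreement is computed per question, so two judges can have similar mean accuracy and still disagree on many individual verdicts. That is why both numbers are worth reporting.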

What the per-category swings showed: the cross-judge run exposed a v2 fact-extraction quality bottleneck (specific entity-action assignments stripped to meta-summaries) on multi-session, knowledge-update, and temporal-reasoning categories. Where extraction preserves the answer-bearing detail, all three judges converge. Where it doesn't, the local 14B "won" by confabulating answers the local 14B judge then accepted — frontier judges honestly refused. Smoking-gun example in the writeup.

What we still owe: a v3 extraction prompt that preserves entity-action-value triples, and a closed-weight frontier judge run (Claude Sonnet 4.6 via Anthropic API) for additional independent validation. Both queued.

Bias disclosure: this is our internal harness, written by us. Of course it favors what we built well. The cross-judge run is the way we expose that bias and report it honestly. If you're evaluating both, the most reliable move is to run them on your data.

Retrieval p95 (closest available comparison)

~8×

HeurChain 20.5 ms p95 under multi-tenant Docker load vs roughly 300 ms for Zep/Graphiti's published numbers (temporal-KG traversal in the hot path). Numbers from different harnesses — see methodology.

Inspectable

Open

Harness + every benchmark number in a public repo. Reproduce or refute on your own data.

Architecture

No Neo4j

HeurChain self-host is one Go binary + Redis + SQLite. Zep self-host is Neo4j + Postgres + Zep services. That's an ops profile choice, not a quality judgment.

Retrieval quality

Retrieval-only metrics on LongMemEval-S

Zep's published evaluations focus on task accuracy with an LLM judge, not retrieval R@k. We don't have Zep R@k numbers to compare — fabricating them would be exactly the kind of "trust me bro" benchmarking we're trying to avoid.

Where this comparison gets muddy: Zep publishes task-completion accuracy (DMR / LongMemEval-style with an LLM judge), not retrieval R@k. Mixing the two would mislead. On retrieval-only metrics our numbers are what we measured; on Zep's published axes, Graphiti is genuinely strong at temporal reasoning. They're solving overlapping but different problems.

Metric	HeurChain (dense)	HeurChain (hybrid α=0.9)	Zep (Graphiti)
R@1	0.543	0.542	—
R@5	0.939	0.933	—
R@10	0.972	0.978	—
MRR	0.911	0.913	—
NDCG@10	0.911	0.914	—
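For readers who want to check metrics like these on their own data, here is a minimal sketch of R@k, MRR, and binary-relevance NDCG@k. It assumes R@k means the fraction of a question's relevant items found in the top k, and that each ranked list contains unique ids. The harness may define these differently, so treat the function names and definitions as our illustration rather than its implementation.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Two toy queries: (ranked result ids, set of relevant ids).
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d9"], {"d5"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
r_at_1 = sum(recall_at_k(r, rel, 1) for r, rel in runs) / len(runs)
print(mrr)     # 0.75
print(r_at_1)  # 0.5
```

Under this definition R@1 can sit well below MRR when a question has several relevant items, because a single top hit only recovers part of the relevant set.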

Latency

P95 retrieval latency in context

Latencies from different harnesses on different deployment topologies — not strictly apples-to-apples. The multi-tenant Docker number is the closest analog to what a SaaS would actually serve. The in-process number shows what the algorithm itself is capable of with the network removed.

System / configuration	P95 latency	Source	What it actually measures
HeurChain — multi-tenant load (Docker, 10 tenants concurrent)	20.5 ms	This benchmark	Closest to production SaaS scenario
Zep / Graphiti	~300 ms	arXiv 2501.13956	Temporal KG traversal in critical path
HeurChain — dense, in-process	35 ms	This benchmark	Algorithm-only ceiling; no network
HeurChain — BM25 only	4.6 ms	This benchmark	Keyword-only path; useful for hot queries
Mem0 (reference)	200 ms	Mem0 paper Table 1	Search latency; stack-specific
LangMem (reference)	59,820 ms	Mem0 paper Table 1	Vector scan; broken at LongMemEval scale
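If you collect your own latency samples, p50 and p95 can be computed as below. This is a minimal sketch using the nearest-rank definition of a percentile; the harness may interpolate instead, and the sample values are made up for illustration.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Ten made-up request latencies in milliseconds.
samples = [12.0, 14.5, 13.1, 20.5, 11.8, 15.2, 13.9, 12.7, 16.4, 14.0]

print(percentile(samples, 50))  # 13.9
print(percentile(samples, 95))  # 20.5
```

With only ten samples the p95 is simply the slowest request, so a tail percentile is only meaningful once you have a few hundred requests or more.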

Under the hood

Architecture comparison

	HeurChain	Zep
Retrieval method	BM25 + dense (bge-m3) + RRF (tunable α)	Temporal knowledge graph (Graphiti) — episodic + semantic node search
Storage backend	Redis (vectors + BM25) + SQLite (metadata)	Neo4j (graph) + Postgres + embedding store
Temporal awareness	Sequence-tagged facts (on roadmap)	First-class temporal facts with validity periods (Graphiti's whole point)
Multi-tenant model	Per-tenant namespace + agent_id sub-isolation; published zero-leak verification	Per-graph isolation; multi-tenant via Neo4j namespace setup
Self-hosted option	Single Go binary + Redis + SQLite	Docker Compose: Neo4j + Postgres + Zep services
API surface	REST + MCP SSE — auto-discovered by Claude Code, ChatGPT Apps	Python / TypeScript SDK + REST
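The "BM25 + dense + RRF (tunable α)" row can be illustrated with a small weighted reciprocal rank fusion sketch. We are assuming here that α weights the dense ranking and 1 − α weights the BM25 ranking, and that the usual RRF constant k = 60 applies. Both are assumptions made for this example, and `fuse_rrf` is a hypothetical name, not the broker's API.

```python
def fuse_rrf(
    dense: list[str],
    bm25: list[str],
    alpha: float = 0.9,
    k: int = 60,
) -> list[str]:
    """Weighted reciprocal rank fusion of two ranked id lists.

    Each list contributes weight / (k + rank) per document; alpha is the
    assumed weight of the dense list, 1 - alpha that of the BM25 list.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scores: dict[str, float] = {}
    for weight, ranking in ((alpha, dense), (1.0 - alpha, bm25)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]

print(fuse_rrf(dense_hits, bm25_hits, alpha=0.9))  # ['a', 'c', 'b', 'd']
print(fuse_rrf(dense_hits, bm25_hits, alpha=1.0))  # ['a', 'b', 'c', 'd']
```

Because fusion works on ranks rather than raw scores, both inputs are plain index lookups and no score normalization between BM25 and the embedding model is needed.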

Honest assessment

When Zep is the better choice

We're not going to pretend HeurChain wins on every dimension. These are real cases where Zep is the better fit:

You need first-class temporal fact validity. If your domain is dense with facts that decay or get superseded (medical records, contract terms, evolving entity relationships), Graphiti's temporal-KG model is the right architecture. It's not over-engineering when the problem genuinely requires it.
You want explicit graph traversal. Multi-hop entity queries ("who reported to whose manager in Q3?") are Graphiti's home turf. HeurChain returns ranked passages; Graphiti returns paths.
You're already running Neo4j in production. If Neo4j is in your stack and your team operates it well, Zep slots in cleanly. The operational overhead that's a downside for greenfield teams is a non-issue for you.
You want established academic backing. The Graphiti paper is peer-reviewed and the team has been publishing in this space for some time. If procurement values that, it's a real factor.
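The first case above comes down to a difference in data shape. The sketch below contrasts a fact with an explicit validity window against a sequence-tagged fact that is only recency-ranked. Both classes are hypothetical illustrations of the two ideas; they are not Graphiti's schema and not HeurChain's storage format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WindowedFact:
    """A fact that is true only within an explicit validity window."""
    text: str
    valid_from: date
    valid_to: Optional[date]  # None means "still valid"


@dataclass
class SequencedFact:
    """A fact tagged only with the order in which it was recorded."""
    text: str
    seq: int


def valid_on(facts: list[WindowedFact], day: date) -> list[str]:
    """Facts whose window contains the given day (end date exclusive)."""
    return [
        f.text
        for f in facts
        if f.valid_from <= day and (f.valid_to is None or day < f.valid_to)
    ]


def most_recent(facts: list[SequencedFact], n: int = 1) -> list[str]:
    """The n most recently recorded facts, newest first."""
    return [f.text for f in sorted(facts, key=lambda f: f.seq, reverse=True)[:n]]


windowed = [
    WindowedFact("Rate is 4%", date(2023, 1, 1), date(2024, 1, 1)),
    WindowedFact("Rate is 5%", date(2024, 1, 1), None),
]
sequenced = [SequencedFact("Rate is 4%", 1), SequencedFact("Rate is 5%", 2)]

print(valid_on(windowed, date(2023, 6, 1)))  # ['Rate is 4%']
print(most_recent(sequenced))                # ['Rate is 5%']
```

The windowed shape can answer "what was true on a given date"; the sequenced shape can only say what was recorded last. If your questions look like the first kind, that is the signal to prefer a temporal model.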

Where HeurChain fits

When HeurChain is the better choice

Most agent memory queries are "give me relevant context". Not graph traversal. If your queries are "user said X recently, what related prior context exists," hybrid retrieval is the right tool. Temporal-KG overhead is wasted compute here.
Latency-sensitive agent loops. ~8× faster on retrieval. Compounds when agents hit memory many times per turn.
You want one binary, not a Neo4j+Postgres+services stack. Single Go binary + Redis + SQLite. €20/mo CPX31. Done.
Multi-tenant SaaS with auditable isolation. Per-tenant namespacing with published zero-leak verification across 90 probe queries.
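As a rough picture of what per-tenant namespacing with a leak probe looks like, here is a minimal sketch. The key layout, the `Scope` type, and the `leaked` helper are all hypothetical and chosen for this example; they do not describe HeurChain's actual key format or its published verification procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical isolation scope: a tenant plus an agent within it."""
    tenant_id: str
    agent_id: str


def _part(value: str) -> str:
    """Reject ids that are empty or contain the separator, so that one
    tenant's id can never be crafted to look like another's prefix."""
    if not value or ":" in value:
        raise ValueError(f"invalid id: {value!r}")
    return value


def storage_key(scope: Scope, fact_id: str) -> str:
    """Build a namespaced key (illustrative layout only)."""
    return (
        f"tenant:{_part(scope.tenant_id)}"
        f":agent:{_part(scope.agent_id)}"
        f":fact:{_part(fact_id)}"
    )


def leaked(result_keys: list[str], scope: Scope) -> list[str]:
    """Return any result keys that fall outside the caller's tenant."""
    prefix = f"tenant:{_part(scope.tenant_id)}:"
    return [key for key in result_keys if not key.startswith(prefix)]


acme = Scope("acme", "support-bot")
globex = Scope("globex", "support-bot")
acme_key = storage_key(acme, "f1")
globex_key = storage_key(globex, "f1")

print(acme_key)                                # tenant:acme:agent:support-bot:fact:f1
print(leaked([acme_key], acme))                # []
print(leaked([acme_key, globex_key], acme))    # ['tenant:globex:agent:support-bot:fact:f1']
```

A probe run in this style issues queries as one tenant and asserts that `leaked` returns an empty list for every result set.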

Pricing

Cost — for reference, not the headline

Most readers should pick on architecture fit, not price. Zep Cloud has a free tier with limits and Hobby/Pro paid tiers; Graphiti is Apache-2.0 open source. HeurChain self-host is MIT-licensed. The numbers below exist so you can see them, not because we think they should drive your decision.

If you...	HeurChain	Zep
Hobby / kicking the tires	Free self-host (MIT)	Free tier (Zep Cloud) or self-host Graphiti (Apache 2.0)
Solo developer, managed	$5/mo (Solo)	Zep Cloud Hobby / Pro tiers
Team, shared workspace	$49.99/mo (Workgroup)	Zep Cloud team pricing
Enterprise — SOC2, SAML	Custom	Custom

Both have free options. If you're already running Neo4j in production, Zep self-host is essentially incremental. If you're not, the operational footprint matters.

Don't take our word

Reproduce these numbers yourself

Clone heurchain-benchmarks and the main HeurChain repo; pull the LongMemEval-S dataset (instructions in the README).
Run python3 sharded_bench.py for the single-tenant baseline, then python3 multitenant_bench.py --mode load --max-tenants 10 for the Docker multi-tenant number.
Re-run on your data — your conversation logs, your documents. Public benchmarks correlate with real workloads, but they're not the same thing.
If our numbers don't reproduce on your hardware, open an issue. We'll fix the bug or correct the published number.

HeurChain vs Zep: different shapes, different jobs.

If HeurChain is a fit, the easiest start is the Solo plan.