HeurChain team

The 512-token chunk pattern

When you write text into a vector memory system, the first decision you face is how to split it up. The chunk size you choose has a larger effect on retrieval quality than almost any other parameter — including the choice of embedding model.

We use 512 tokens with a 10% overlap for HeurChain's default chunking. Here is why, and when to deviate from it.

Why chunk size matters

A vector embedding is a fixed-length representation of a chunk of text. The embedding captures the semantic content of that chunk — not of the surrounding document. When you query the memory store, the retrieval system computes similarity between your query embedding and all stored chunk embeddings, then returns the top-k most similar chunks.
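
To make that concrete, here is a minimal sketch of the retrieval step (not HeurChain's internals; it assumes embeddings are already L2-normalized, so a dot product is cosine similarity):

import numpy as np

def top_k(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k stored chunks most similar to the query.

    Assumes all embeddings are L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = chunk_embs @ query_emb        # one similarity score per stored chunk
    return np.argsort(scores)[::-1][:k]    # indices of the k highest scores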

The problem with large chunks is that a single embedding has to represent too many ideas at once. If you store a 2,000-token passage that covers three distinct topics, the embedding is a blurred average of all three. A query about one of the three topics will get a moderate score rather than a high score, and you may miss it in favor of smaller, more focused chunks that score higher.

The problem with small chunks is the opposite: each chunk is too narrow to include enough context for the embedding to be meaningful. A 50-token chunk might be a single sentence that makes no sense without the surrounding paragraphs.

512 tokens is the empirical sweet spot for general-purpose agent memory. It is small enough that each chunk covers one idea, large enough that the embedding has meaningful context.

The bge-base-en-v1.5 context window

HeurChain uses BAAI's bge-base-en-v1.5 as its embedding model. This model has a maximum context window of 512 tokens. Anything longer than 512 tokens is silently truncated.

This is a hard constraint, not a soft guideline. If you try to embed a 1,000-token chunk with bge-base-en-v1.5, you get an embedding that represents only the first 512 tokens. The second half of your text is invisible to the retrieval system.

This is why our default chunk size matches the model context window exactly. Going over 512 tokens does not produce a richer embedding — it produces a corrupted one.
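
If you want to verify the constraint yourself, count tokens with the model's own tokenizer before embedding. A minimal sketch using the HuggingFace tokenizer for bge-base-en-v1.5:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")

def fits_context(text: str, max_tokens: int = 512) -> bool:
    """True if the text can be embedded without truncation."""
    # Count tokens exactly as the model sees them, including the
    # special tokens ([CLS], [SEP]) the tokenizer adds.
    n_tokens = len(tokenizer(text, truncation=False)["input_ids"])
    return n_tokens <= max_tokens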

The overlap buffer

We add a 10% overlap between adjacent chunks: the last 51 tokens of chunk N become the first 51 tokens of chunk N+1. This prevents a concept that spans a chunk boundary from being split in a way that makes both halves meaningless.

The trade-off is storage: with 10% overlap, a 5,000-token document produces 11 chunks instead of 10. For most agent memory use cases, this is a fine trade.
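
The arithmetic is easy to check. Each chunk after the first advances by 512 - 51 = 461 tokens, so the chunk count for a document is:

import math

def n_chunks(doc_tokens: int, max_tokens: int = 512, overlap: int = 51) -> int:
    """Chunk count for overlapping chunking (matches chunk_text below)."""
    if doc_tokens <= max_tokens:
        return 1
    stride = max_tokens - overlap
    # The first chunk covers max_tokens; each later chunk adds stride new tokens.
    return 1 + math.ceil((doc_tokens - max_tokens) / stride)

print(n_chunks(5000))              # 11 with the default 10% overlap
print(n_chunks(5000, overlap=0))   # 10 without overlap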

A complete chunking implementation

Here is the chunking function we use internally, simplified:

from typing import List

import httpx

def chunk_text(text: str, max_tokens: int = 512, overlap_tokens: int = 51) -> List[str]:
    """
    Split text into overlapping chunks of at most max_tokens tokens.
    Token count is approximated as words (close enough for English prose;
    use a proper tokenizer if precision matters).
    """
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")

    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)

        if end == len(words):
            break

        # Step forward, keeping overlap_tokens words from the previous chunk
        start = end - overlap_tokens

    return chunks


def store_document(text: str, source: str, api_key: str) -> int:
    """
    Chunk a document and store each chunk in HeurChain.
    Returns the number of chunks stored.
    """
    chunks = chunk_text(text)
    stored = 0

    for i, chunk in enumerate(chunks):
        resp = httpx.post(
            "https://api.heurchain.com/store",
            json={
                "text": chunk,
                "source": source,
                "metadata": {"chunk_index": i, "total_chunks": len(chunks)}
            },
            headers={"Authorization": f"Bearer {api_key}"}
        )
        resp.raise_for_status()
        stored += 1

    return stored

This is the pattern we run in production for ingesting session summaries, research notes, and code documentation. The word-based token approximation is usually close enough for English prose, though it undercounts tokens; if you are ingesting code or multilingual content, or need a hard guarantee against truncation, swap in a proper tokenizer (tiktoken or HuggingFace tokenizers).
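
The token-exact variant is a small change. A sketch using the bge tokenizer (tiktoken works the same way with its encode/decode pair); note that slices can start mid-word, which is usually acceptable for retrieval:

from typing import List

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")

def chunk_text_exact(text: str, max_tokens: int = 510, overlap_tokens: int = 51) -> List[str]:
    """Split text into overlapping chunks of at most max_tokens real tokens.

    Defaults to 510 to leave room for the [CLS]/[SEP] tokens the model
    adds at embedding time.
    """
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(token_ids):
        end = min(start + max_tokens, len(token_ids))
        # Decode the token slice back to text for storage.
        chunks.append(tokenizer.decode(token_ids[start:end]))
        if end == len(token_ids):
            break
        start = end - overlap_tokens
    return chunks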

When to use smaller chunks

512 tokens is not always right. Use smaller chunks when:

  • Your content is highly structured. Code functions, database schema fields, and individual FAQ answers are often better stored as atomic units — whatever the natural boundary is, regardless of token count. (See the sketch after this list.)
  • You need precise attribution. If you want retrieved results to point to a specific line or paragraph, smaller chunks give you finer-grained provenance.
  • Your embedding model has a smaller context window. Some smaller/faster models have 256-token windows. Match the chunk size to the model.
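
For the structured-content case, boundary-based chunking replaces the token counter entirely. A sketch for Python source, one chunk per top-level function or class, using the standard library's ast module:

import ast
from typing import List

def chunk_python_source(source: str) -> List[str]:
    """One chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the node's exact source text.
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks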

When to use larger chunks or full-document embedding

Sometimes you want the full document's semantic meaning rather than granular retrieval. If you are building a "which document is most relevant to this query" system rather than a "which passage is most relevant" system, consider a two-stage approach: embed the full document (or a document-level summary) for the first pass, then use chunk-level retrieval only within the top-ranked documents.

HeurChain's /query endpoint returns individual chunk results by default. Two-stage retrieval requires you to implement the document re-ranking step yourself — call /query, group results by source, pick the top source, then do a second query or linear scan over that source's chunks.
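
A minimal sketch of that re-ranking step, assuming the /query response carries per-chunk "source" and "score" fields and accepts a "top_k" parameter (check the API reference for the exact shapes):

from collections import defaultdict

import httpx

def two_stage_query(query: str, api_key: str, k: int = 20) -> list:
    """Pick the most relevant source document, then return its chunks."""
    resp = httpx.post(
        "https://api.heurchain.com/query",
        json={"query": query, "top_k": k},   # parameter name is an assumption
        headers={"Authorization": f"Bearer {api_key}"},
    )
    resp.raise_for_status()
    results = resp.json()["results"]         # assumed response shape

    # Stage 1: score each source by the sum of its chunks' scores.
    by_source = defaultdict(float)
    for r in results:
        by_source[r["source"]] += r["score"]
    top_source = max(by_source, key=by_source.get)

    # Stage 2: keep only the winning document's chunks, best first.
    return sorted(
        (r for r in results if r["source"] == top_source),
        key=lambda r: r["score"],
        reverse=True,
    )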

The practical takeaway

If you are getting started with HeurChain and are not sure what chunk size to use: 512 tokens with 10% overlap. It works for most agent memory use cases. Track your retrieval quality on a set of test queries, and adjust if you find systematic misses.

The chunking decision is worth revisiting after you have seen what kinds of queries your agents actually make. Most production systems converge on something in the 256–512 range. Below 128 tokens you are usually hurting more than helping.

Start building at heurchain.com/pricing.
