Database 3 min read

Schema Design in a NoSQL World: What We Learned Building KDB

Table of contents

    Building KDB, our NoSQL database, put us on both sides of an old argument. As database authors, we designed the engine that makes flexible schemas possible. As consultants, we get called in when that flexibility has produced a data swamp: fourteen shapes of the same document, written by five generations of application code, and no one left who remembers which fields mean what.

    The conclusion from both vantage points is the same. "Schemaless" describes the storage engine, not the data. The schema always exists - the only choice is whether it is written down and enforced, or reconstructed from production incidents.

    Design from access patterns, not from entities

    Relational modeling starts with entities and normalizes; queries come later, served by whatever indexes you add. Document modeling runs in the other direction, and fighting this is the root mistake in most struggling deployments we audit.

    The working method: enumerate the queries first. For each, write down its shape, frequency, and latency requirement - then design documents so that the hottest queries are single-document reads.

    If the product screen shows an order with its line items, the order document contains its line items. Normalizing them into a separate collection because relational habit says so buys you application-side joins, N+1 reads, and no transactional guarantee around them.

    Denormalization is a contract, not a shortcut

    Duplicating data across documents is how document stores achieve read performance, but every duplicated field is a promise that something keeps the copies consistent. The failures we see are never the duplication itself - they are duplication without an owner.

    Our rules in KDB deployments:

    • Every denormalized field has a named authoritative source and a documented propagation path
    • Propagation is asynchronous and idempotent, so it can be replayed after failures
    • A reconciliation job compares copies against sources on a schedule and reports drift as a metric, not a mystery

    Drift will happen. The difference between a mature deployment and a swamp is whether drift is measured.

    Version every document, migrate lazily

    Schema change without ALTER TABLE is the promise, but old documents do not disappear. The pattern that scales, and the one we designed KDB's update primitives around:

    1. Every document carries a schema_version field from day one
    2. Readers accept the current version and N previous versions, upgrading in memory
    3. Writes always persist the current version, so documents migrate lazily as they are touched
    4. A background sweep upgrades cold documents, then reader support for old versions is deleted

    Step four is the discipline step. Reader-side compatibility code accretes forever if nothing retires it - we have audited codebases carrying seven versions of upgrade logic, none of which had matching documents left in storage. Version distribution should be a dashboard metric; when a version hits zero, its code gets deleted.

    Validate at the boundary

    The most valuable line of defense costs the least: validate documents at the application boundary before they are written. A malformed document rejected at write time is a bug ticket; the same document discovered at read time, eighteen months later, is an incident with data-repair work attached.

    What the engine work taught us

    Designing KDB's storage layer made one thing vivid: the engine cheerfully persists anything. Consistency of meaning was never the database's job - not in PostgreSQL, where the type system merely helps, and not in any document store. It has always belonged to the team.

    Flexible schemas move that responsibility from the database to your engineering culture. That is a fine trade, and often the right one - as long as you notice you have made it.

    Copied