When you ask a search service to "create an index," it reads like a single write to a single place. While building a managed search engine in Rust this year, I started treating that one word as three separate facts, each owned by a different system: the record that an index exists, the index itself, and a durable copy for when the machine holding it dies.
Pulling those apart changed how I reason about failure and scaling, and it sharpened my sense of what a relational database is actually good for. Here is the structure I settled on, and the path a single create request takes through it.
  create request
        |
        v
  Control plane  .  Postgres          does it exist? who owns it? quota
        |
        |  commit first (cheap, authoritative)
        v
  Data plane     .  tantivy           the inverted index, BM25, local disk
        |
        |  materialize second (expensive, can fail)
        v
  Durability     .  object storage    authoritative copy, survives the node
Three planes, one job each
The control plane is Postgres. It holds metadata and nothing else: which indexes exist, which tenant owns each one, the field configuration, and the quota. The table is tiny.
CREATE TABLE indexes (
    id         TEXT PRIMARY KEY,                   -- a uuid, the real identifier
    org_id     TEXT NOT NULL,                      -- the owning tenant
    name       TEXT NOT NULL,                      -- a human label, not unique
    settings   JSONB NOT NULL,                     -- field config, stored as an opaque blob
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX indexes_org_idx ON indexes (org_id);
Look at what is missing. There is no document text and no tsvector. The settings column is JSONB, but I never query into it. It gets read back whole and parsed in the application. Postgres knows that an index exists and who owns it. It has no idea how to search one.
That indexes_org_idx is worth an aside, because the word "index" trips people up here. It is a plain B-tree that speeds up "list the indexes for this tenant." It has nothing to do with full-text search. Two unrelated meanings of the same word, sitting in the same paragraph.
The data plane is tantivy, a Lucene-class search library for Rust. The real inverted index lives here, on local disk, memory-mapped. Document text, the BM25 postings, the term dictionary, the stored originals: all on disk in the data plane, none of it in Postgres.
Durability is object storage, an S3-compatible bucket. I treat the local tantivy files as a working copy. The authoritative durable copy lives in the bucket. If the box holding the working copy vanishes, the next boot rebuilds it from object storage.
One more piece sits in memory: a map from index id to its open tantivy handle, so a query does not reopen files on every request. That map is not a source of truth. It is reconstructed from the control plane at startup.
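As a rough illustration of that in-memory piece, here is a minimal sketch of a handle map. The names `LiveMap`, `OpenIndex`, `publish`, and `get` are hypothetical, `OpenIndex` is only a stand-in for a real tantivy handle, and the tenant check on lookup is an assumption rather than something the engine is confirmed to do.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for an open tantivy index handle (hypothetical type).
#[derive(Debug)]
pub struct OpenIndex {
    pub dir: String,
}

/// One entry in the live map: the owning tenant plus the open handle.
pub struct Entry {
    pub org_id: String,
    pub index: Arc<OpenIndex>,
}

/// In-memory cache of open handles. It is never a source of truth:
/// it can be dropped and rebuilt from the catalog at any time.
#[derive(Default)]
pub struct LiveMap {
    inner: RwLock<HashMap<String, Entry>>,
}

impl LiveMap {
    /// Publish a handle once the index has been opened.
    pub fn publish(&self, id: &str, org_id: &str, index: OpenIndex) {
        let entry = Entry {
            org_id: org_id.to_string(),
            index: Arc::new(index),
        };
        self.inner.write().unwrap().insert(id.to_string(), entry);
    }

    /// Look up a handle for a query. Returns None both for an unknown id
    /// and for an id owned by another tenant (assumed behavior).
    pub fn get(&self, id: &str, org_id: &str) -> Option<Arc<OpenIndex>> {
        let map = self.inner.read().unwrap();
        map.get(id)
            .filter(|entry| entry.org_id == org_id)
            .map(|entry| Arc::clone(&entry.index))
    }
}
```

Handing out an `Arc` means a query holds the read lock only long enough to clone the pointer, not for the duration of the search.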
Walking a create through all three
The HTTP handler does almost nothing interesting. It checks that the API key has write permission and belongs to a tenant, then hands off to the index manager. The ordering inside the manager is where the design lives.
pub async fn create(&self, org_id: &str, settings: IndexSettings) -> Result<String> {
    validate(&settings)?; // name length, language, vector dims
    let id = new_uuid();
    // 1. Control plane commits first. This insert is also the quota gate.
    if !self.catalog.insert(&id, org_id, &settings).await? {
        bail!("index quota reached");
    }
    // 2. Data plane materializes second.
    let index = match FtsIndex::open(&self.dir_for(&id), settings.clone()) {
        Ok(index) => index,
        Err(e) => {
            // Opening the on-disk index failed. Roll back the catalog row
            // so we never leave a record pointing at nothing.
            self.catalog.delete(&id).await?;
            return Err(e);
        }
    };
    // 3. Publish the open handle so queries can find it.
    self.live.write().insert(id.clone(), Entry { org_id: org_id.into(), index });
    Ok(id)
}
The shape here is simple to say: commit the cheap authoritative thing first, materialize the expensive thing second, and undo the first if the second fails. The catalog row is a few bytes in a transactional database. Opening a tantivy index touches the filesystem and can fail for boring reasons like no disk, bad permissions, or a corrupt directory. Writing the row first means the quota is enforced atomically. Rolling it back on failure avoids the worst state in this whole system: a row in Postgres that claims an index exists with no data behind it.
After the manager returns, the handler does one more thing. It syncs the new index directory to object storage, and if that sync fails it returns 503 instead of 200. I would rather tell the caller "this did not durably succeed" than report success for something that only exists on one disk.
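The last step of the handler can be sketched as follows. This is a minimal sketch under stated assumptions: `finish_create` is a hypothetical name, the sync step is passed in as a closure instead of a real object-storage client, and only the 200 and 503 status codes come from the description above.

```rust
/// Decide the HTTP status once the index manager has returned an id.
/// `sync_dir` stands in for "upload the index directory to the bucket";
/// the closure shape is an assumption made for illustration.
pub fn finish_create(
    index_id: &str,
    sync_dir: impl FnOnce(&str) -> Result<(), String>,
) -> u16 {
    match sync_dir(index_id) {
        // The index now exists on local disk and in object storage.
        Ok(()) => 200,
        // The index exists on one disk only, so do not claim success.
        Err(_) => 503,
    }
}
```

A 503 here tells the caller that the request was valid but did not durably succeed, which leaves room for a retry.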
The quota check is a single statement
The quota gate looks small, and getting it wrong is a classic race. The naive version reads the count, compares it to the limit, then inserts:
count = SELECT count(*) ... // says 4 of 5 used
if count < limit: INSERT ... // ok, insert
Run two creates at the same moment and both read "4 of 5," both pass the check, and both insert. Now the tenant has 6 indexes on a limit of 5. The check and the write were two steps with a gap between them, and concurrency lives in that gap.
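The first step of the fix is to make the check and the insert the same statement, so there is no application round trip between them:
INSERT INTO indexes (id, org_id, name, settings)
SELECT $1, $2, $3, $4
WHERE (SELECT count(*) FROM indexes WHERE org_id = $2)
    < (SELECT max_indexes FROM organizations WHERE id = $2);
If the tenant is at the limit, the WHERE is false, no row goes in, and rows_affected() comes back as 0. The application reads that zero as "quota reached." One caveat: a single statement shrinks the window but does not close it by itself. Under Postgres's default READ COMMITTED isolation, two concurrent statements each count against their own snapshot, and neither sees the other's uncommitted row, so both can still insert. Closing the gap completely also takes a serialization point: run the insert at SERIALIZABLE isolation and retry on a serialization failure, or take a per-tenant lock (for example pg_advisory_xact_lock) in the same transaction before the insert.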
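To make the property concrete, here is an in-process analogue in Rust. It is a sketch, not the engine's code and not a model of Postgres: `Catalog`, `try_insert`, and `concurrent_creates` are hypothetical names, and a `Mutex` plays the role of the serialization point, so the count and the insert happen as one indivisible step.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// In-memory stand-in for one tenant's rows in the catalog.
pub struct Catalog {
    limit: usize,
    rows: Mutex<Vec<String>>,
}

impl Catalog {
    pub fn new(limit: usize) -> Self {
        Catalog { limit, rows: Mutex::new(Vec::new()) }
    }

    /// Check and insert under one lock. Returns false when the quota is
    /// reached, mirroring a rows_affected() of 0.
    pub fn try_insert(&self, id: String) -> bool {
        let mut rows = self.rows.lock().unwrap();
        if rows.len() < self.limit {
            rows.push(id);
            true
        } else {
            false
        }
    }

    pub fn count(&self) -> usize {
        self.rows.lock().unwrap().len()
    }
}

/// Fire `attempts` concurrent creates and return how many succeeded.
pub fn concurrent_creates(catalog: Arc<Catalog>, attempts: usize) -> usize {
    let handles: Vec<_> = (0..attempts)
        .map(|i| {
            let catalog = Arc::clone(&catalog);
            thread::spawn(move || catalog.try_insert(format!("idx-{}", i)))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|won| *won)
        .count()
}
```

However the threads interleave, the number of winners never exceeds the limit, because no thread can observe the count without also holding the right to insert.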
Why split metadata from the index at all
The split costs something. Every create now writes to two systems, and I have to keep them consistent. The payoff shows up in three places.
Search traffic never touches the control-plane database. A burst of queries hits the in-process tantivy reader and the local disk. It does not compete for connections with the database that authenticates requests and enforces quotas. The thing under the most load and the thing that must stay correct are physically different systems, and they scale on different axes.
Failure has a clear blast radius. Losing the data plane is annoying but recoverable, since the durable copy is in object storage and the catalog still knows what should exist. Losing the control plane is the real outage, which is exactly why I keep it small, transactional, and boring. A few kilobytes per index in Postgres is cheap insurance for the gigabytes of index data it describes.
Boot becomes a reconciliation. On startup the engine reads every row from the catalog and, for each one, opens the local copy or restores it from object storage if the local copy is missing. Then it sweeps the bucket for leftover data that has no matching catalog row, which is how a delete that got interrupted halfway gets cleaned up. The control plane is the list of what should exist, and startup makes reality match the list.
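The reconciliation can be sketched as a pure planning function over three sets of ids. The names `Action`, `plan_boot`, and `ids` are hypothetical, and the `Missing` case (a catalog row with no local or bucket copy) is an assumption about a state the description above does not cover.

```rust
use std::collections::BTreeSet;

/// One step the engine would take at boot.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Catalog row exists and the local working copy is present.
    OpenLocal(String),
    /// Catalog row exists, local copy is gone, bucket has the durable copy.
    RestoreFromBucket(String),
    /// Catalog row exists but no copy was found anywhere (assumed case).
    Missing(String),
    /// Bucket data with no catalog row, e.g. an interrupted delete.
    DeleteOrphan(String),
}

/// Compare what should exist (catalog) with what does exist (local, bucket).
pub fn plan_boot(
    catalog: &BTreeSet<String>,
    local: &BTreeSet<String>,
    bucket: &BTreeSet<String>,
) -> Vec<Action> {
    let mut plan = Vec::new();
    for id in catalog {
        if local.contains(id) {
            plan.push(Action::OpenLocal(id.clone()));
        } else if bucket.contains(id) {
            plan.push(Action::RestoreFromBucket(id.clone()));
        } else {
            plan.push(Action::Missing(id.clone()));
        }
    }
    // Sweep: anything in the bucket that the catalog does not list.
    for id in bucket.difference(catalog) {
        plan.push(Action::DeleteOrphan(id.clone()));
    }
    plan
}

/// Small helper for building id sets in examples.
pub fn ids(list: &[&str]) -> BTreeSet<String> {
    list.iter().map(|s| s.to_string()).collect()
}
```

Keeping the plan separate from its execution makes the boot logic easy to test without a database, a disk, or a bucket.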
What a field actually turns into
One detail from the data plane is worth showing, because it explains why search and a database want different things from the same string. When I create a text field that is both searchable and filterable, it becomes two physical fields in tantivy:
title -> analyzed field (tokenized, lowercased, stemmed) for BM25 matching
title__kw -> keyword field (the raw string, untouched) for exact filters
Full-text matching wants "Running" to find "runs," so it tokenizes and stems. An exact filter on a brand name wants "ACME" to mean exactly "ACME," with no stemming and no folding. The same input needs opposite treatment depending on the question, so it gets stored twice, once for each job. Vectors are a third case. They never enter the tantivy schema at all; each vector field gets its own approximate-nearest-neighbor store on disk, beside the text index but separate from it.
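A small sketch of the expansion, using only the standard library and not tantivy's schema API. `expand_text_field`, `PhysicalField`, `Kind`, and `analyze` are hypothetical names; the toy analyzer only tokenizes and lowercases (stemming is left out), and the handling of a field that is filterable but not searchable is an assumption.

```rust
/// How a physical field treats its input.
#[derive(Debug, PartialEq, Eq)]
pub enum Kind {
    /// Tokenized and normalized, used for BM25 matching.
    Analyzed,
    /// Raw string kept untouched, used for exact filters.
    Keyword,
}

#[derive(Debug, PartialEq, Eq)]
pub struct PhysicalField {
    pub name: String,
    pub kind: Kind,
}

/// Expand one logical text field into the physical fields it needs.
pub fn expand_text_field(name: &str, searchable: bool, filterable: bool) -> Vec<PhysicalField> {
    let mut fields = Vec::new();
    if searchable {
        fields.push(PhysicalField { name: name.to_string(), kind: Kind::Analyzed });
    }
    if filterable {
        // The "__kw" suffix follows the naming shown above.
        fields.push(PhysicalField { name: format!("{}__kw", name), kind: Kind::Keyword });
    }
    fields
}

/// Toy analyzer: split on non-alphanumeric characters and lowercase.
/// A real analyzer would also stem; that step is omitted here.
pub fn analyze(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}
```

The analyzed form of "ACME Running-Shoes" loses its case and punctuation, which is what matching wants and what an exact filter cannot tolerate, hence the second field.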
What it costs to run this way
The architecture has real rough edges, and I would rather name them than pretend otherwise.
Writes commit synchronously. Each batch fsyncs and reloads the reader so documents are searchable the instant the call returns. That is good for correctness and bad for tiny frequent writes, so bulk loads have to batch thousands of documents per request or throughput falls off a cliff.
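On the client side, the batching advice amounts to chunking the input before sending it. The sketch below only counts commits and measures nothing; `Writer`, `write_batch`, and `bulk_load` are hypothetical names standing in for "one request, one synchronous commit."

```rust
/// Stand-in for the write path: each batch costs exactly one commit
/// (fsync plus reader reload in the real engine).
#[derive(Default)]
pub struct Writer {
    pub commits: usize,
    pub docs: usize,
}

impl Writer {
    pub fn write_batch(&mut self, batch: &[String]) {
        self.docs += batch.len();
        self.commits += 1;
    }
}

/// Send `docs` in batches of `batch_size`, so the number of commits is
/// ceil(docs.len() / batch_size) instead of docs.len().
pub fn bulk_load(writer: &mut Writer, docs: &[String], batch_size: usize) {
    // chunks() panics on a size of 0, so clamp to at least 1.
    for batch in docs.chunks(batch_size.max(1)) {
        writer.write_batch(batch);
    }
}
```

With 10,000 documents, a batch size of 1 pays for 10,000 commits while a batch size of 2,000 pays for 5, which is the difference the paragraph above describes.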
Cold starts pay for the durability. After a deploy, an index restores from object storage and pages in from disk, so the first handful of queries run slower until the cache warms. Warm and cold are genuinely different latency regimes.
A single index lives on a single node. There is no sharding of one index across machines yet. That is fine for tens of millions of documents on one large server and wrong for trillions. It is a deliberate ceiling, and worth stating plainly rather than hiding.
The takeaway
If there is one rule worth keeping from this, it is the ordering. Commit the cheap, authoritative fact first. Materialize the expensive thing second. Roll back the first if the second fails.
The record that an index exists belongs in a small transactional database. The index itself belongs in a purpose-built search library on fast local disk. The durable copy belongs in object storage. Wire a create through the three in that order and you get independent scaling and clean recovery for very little extra code, while the database goes back to doing the one thing it is best at: being the boring, correct source of truth for what exists.
I build backend systems with this kind of separation: control planes, search, and correctness under concurrency. If that is the problem in front of you, get in touch.