Document ingest
Updates that produce reviewable deltas.
Documents change. UrBrain stores each change as a durable version, compares it against the prior version through an internal StructuredMerge service, and applies the resulting upserts and tombstones to the memory index.
The ingest loop
A document arrives through an upload, an API call, or a connector. UrBrain stores the new version as the canonical raw record with a content hash, source URI, and workspace ownership. It then invokes StructuredMerge as an internal service to compare the prior version or manifest with the new one.
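The canonical raw record described above can be sketched as follows. This is a minimal illustration, not UrBrain's actual storage schema; the names `DocumentVersion` and `store_new_version` are hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class DocumentVersion:
    workspace_id: str   # workspace ownership
    source_uri: str     # where the document came from
    content_hash: str   # makes the version durable and deduplicable
    raw_bytes: bytes    # the canonical raw record

def store_new_version(workspace_id: str, source_uri: str, raw: bytes) -> DocumentVersion:
    # Hashing the raw bytes means re-uploading identical content
    # yields the same hash, so the version store can deduplicate.
    digest = hashlib.sha256(raw).hexdigest()
    return DocumentVersion(workspace_id, source_uri, digest, raw)
```

The content hash computed here is the same value the importer later uses to decide whether a chunk's memory needs updating.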
StructuredMerge returns a pgvector_queue JSONL of upsert artifacts and a deletes_jsonl of removed chunks. The UrBrain importer turns upserts into memories (or updates existing memories whose content hash has changed) and tombstones the deletes, so retrieval reflects the new document immediately.
connector, upload, or API
-> UrBrain stores new document version
-> UrBrain invokes internal StructuredMerge ingest/delta service
-> StructuredMerge compares previous version/manifest with new version
-> StructuredMerge returns pgvector_queue + deletes_jsonl + verification
-> UrBrain imports upserts and tombstones deletes
-> UrBrain embeds, indexes, and retrieves memories
What StructuredMerge produces
StructuredMerge is a deterministic transformation engine. Given a prior version and a current version of a source, it produces:
- parsed and chunked content with stable chunk IDs and byte spans;
- content hashes per chunk so UrBrain can skip unchanged work;
- upsert artifacts ready for embedding and indexing;
- delete artifacts identifying chunks that no longer exist;
- verification metadata that allows an importer to replay the same job and arrive at the same artifacts.
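To make the artifact list above concrete, here is what individual records in the pgvector_queue JSONL and deletes_jsonl might look like. The field names and values are illustrative assumptions, not the actual StructuredMerge schema.

```python
import json

# Hypothetical upsert record: one parsed chunk, ready for embedding.
upsert = {
    "chunk_id": "doc-42:0007",        # stable across versions
    "byte_span": [1024, 2048],        # location in the parsed source
    "content_hash": "sha256:ab12cd",  # lets the importer skip unchanged work
    "text": "Chapter 2 begins here.",
}

# Hypothetical delete record: a chunk that no longer exists in the source.
delete = {"chunk_id": "doc-42:0031"}

# Hypothetical verification metadata: enough to replay the job
# and arrive at the same artifacts.
verification = {"job_id": "job-9f", "input_hashes": ["sha256:ab12cd"]}

pgvector_queue_line = json.dumps(upsert)
deletes_jsonl_line = json.dumps(delete)
```

Each JSONL line is self-describing, so the importer can process upserts and deletes independently and in any order within a batch.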
The same engine handles cleanup-style updates, format conversions, and structural rewrites. Because the chunk IDs are stable across versions where the underlying content has not moved, the importer avoids re-embedding sections that did not change.
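The skip-unchanged behavior can be sketched as a planning step over two chunk maps keyed by stable chunk ID. This is a simplified model, assuming chunks are plain strings; `plan_embedding_work` is a hypothetical name.

```python
import hashlib

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def plan_embedding_work(prev: dict, curr: dict) -> list:
    # prev and curr map a stable chunk_id -> chunk text.
    # Only new chunks, or chunks whose content hash changed,
    # need to be re-embedded.
    todo = []
    for chunk_id, text in curr.items():
        if chunk_id not in prev or chunk_hash(prev[chunk_id]) != chunk_hash(text):
            todo.append(chunk_id)
    return todo
```

Because chunk IDs stay stable where content has not moved, a one-paragraph edit in a long document produces a to-do list of one chunk rather than a full re-embed.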
Idempotent imports
The importer is keyed by (workspace_id, source_system, source_ref, source_hash). Replays of the same artifact set converge to the same memory rows. Partial failures resume cleanly, and a re-run after a deploy never produces duplicate vectors for content that was already indexed.
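The idempotency key can be modeled as a simple keyed write: replaying the same artifact set lands on the same key and overwrites rather than appends. A minimal sketch, with `import_artifacts` and the in-memory `store` as illustrative stand-ins for the real importer:

```python
def import_artifacts(store, workspace_id, source_system, source_ref, source_hash, artifacts):
    # The composite key makes the import idempotent: a replay of the
    # same artifact set converges to the same rows instead of
    # producing duplicate vectors.
    key = (workspace_id, source_system, source_ref, source_hash)
    store[key] = {a["chunk_id"]: a for a in artifacts}
    return store[key]
```

A new document version changes source_hash, so it lands under a fresh key; a crash-and-retry of the same version does not.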
Tombstones follow the same path. When a chunk is removed from a source, the matching memories are marked tombstoned in a single transaction with the upsert batch, so retrieval never sees a partially applied delta.
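The single-transaction guarantee can be illustrated with SQLite standing in for the real store (UrBrain's actual database and schema are not specified here; the table name `memories` is an assumption):

```python
import sqlite3

def apply_delta_atomically(conn, upserts, tombstone_ids):
    # Upserts and tombstones commit together in one transaction,
    # so retrieval never observes a partially applied delta.
    with conn:  # commits on success, rolls back entirely on failure
        for chunk_id, text, content_hash in upserts:
            conn.execute(
                "INSERT INTO memories (chunk_id, text, content_hash, tombstoned) "
                "VALUES (?, ?, ?, 0) "
                "ON CONFLICT(chunk_id) DO UPDATE SET text=excluded.text, "
                "content_hash=excluded.content_hash, tombstoned=0",
                (chunk_id, text, content_hash),
            )
        conn.executemany(
            "UPDATE memories SET tombstoned = 1 WHERE chunk_id = ?",
            [(c,) for c in tombstone_ids],
        )
```

If any statement in the batch fails, the `with conn:` block rolls the whole delta back, leaving the previous version's memories intact.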
Standalone ingest, later
For the first product, StructuredMerge is internal: customers subscribe to UrBrain and get version-aware document memory without managing a separate ingest subscription. The same engine can later be exposed as a standalone API for teams that want deterministic diff and merge artifacts outside the memory product — CI pipelines, code review systems, RAG preprocessors, and binary-format handlers are all reasonable downstream consumers.