Document ingest
Updates that produce reviewable deltas.
Documents change. UrBrain stores each change as a durable version, compares it against the prior version through an internal StructuredMerge service, and applies the resulting upserts and tombstones to the memory index.
The ingest loop
A document arrives through an upload, an API call, or a connector. UrBrain stores the new version as the canonical raw record with a content hash, source URI, and workspace ownership. It then invokes StructuredMerge as an internal service to compare the prior version or manifest with the new one.
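The canonical raw record described above can be sketched as follows. This is a minimal illustration, not UrBrain's actual storage schema; the names `DocumentVersion` and `store_new_version` are hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class DocumentVersion:
    workspace_id: str   # workspace ownership
    source_uri: str     # where the document came from
    content_hash: str   # makes the version durable and deduplicable
    raw_bytes: bytes    # the canonical raw record

def store_new_version(workspace_id: str, source_uri: str, raw: bytes) -> DocumentVersion:
    # Hashing the raw bytes means re-uploading identical content
    # yields the same hash, so the version store can deduplicate.
    digest = hashlib.sha256(raw).hexdigest()
    return DocumentVersion(workspace_id, source_uri, digest, raw)
```

The content hash computed here is the same value the importer later uses to decide whether a chunk's memory needs updating.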
StructuredMerge returns a pgvector_queue JSONL of upsert artifacts and a deletes_jsonl of removed chunks. The UrBrain importer turns upserts into memories (or updates existing memories whose content hash has changed) and tombstones the deletes, so retrieval reflects the new document immediately.
connector, upload, or API
-> UrBrain stores new document version
-> UrBrain invokes internal StructuredMerge ingest/delta service
-> StructuredMerge compares previous version/manifest with new version
-> StructuredMerge returns pgvector_queue + deletes_jsonl + verification
-> UrBrain imports upserts and tombstones deletes
-> UrBrain embeds, indexes, and retrieves memories
What StructuredMerge produces
StructuredMerge is a deterministic transformation engine. Given a prior version and a current version of a source, it produces:
- parsed and chunked content with stable chunk IDs and byte spans;
- content hashes per chunk so UrBrain can skip unchanged work;
- upsert artifacts ready for embedding and indexing;
- delete artifacts identifying chunks that no longer exist;
- verification metadata that allows an importer to replay the same job and arrive at the same artifacts.
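To make the artifact list above concrete, here is what individual records in the pgvector_queue JSONL and deletes_jsonl might look like. The field names and values are illustrative assumptions, not the actual StructuredMerge schema.

```python
import json

# Hypothetical upsert record: one parsed chunk, ready for embedding.
upsert = {
    "chunk_id": "doc-42:0007",        # stable across versions
    "byte_span": [1024, 2048],        # location in the parsed source
    "content_hash": "sha256:ab12cd",  # lets the importer skip unchanged work
    "text": "Chapter 2 begins here.",
}

# Hypothetical delete record: a chunk that no longer exists in the source.
delete = {"chunk_id": "doc-42:0031"}

# Hypothetical verification metadata: enough to replay the job
# and arrive at the same artifacts.
verification = {"job_id": "job-9f", "input_hashes": ["sha256:ab12cd"]}

pgvector_queue_line = json.dumps(upsert)
deletes_jsonl_line = json.dumps(delete)
```

Each JSONL line is self-describing, so the importer can process upserts and deletes independently and in any order within a batch.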
The same engine handles cleanup-style updates, format conversions, and structural rewrites. Because the chunk IDs are stable across versions where the underlying content has not moved, the importer avoids re-embedding sections that did not change.
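The skip-unchanged behavior can be sketched as a planning step over two chunk maps keyed by stable chunk ID. This is a simplified model, assuming chunks are plain strings; `plan_embedding_work` is a hypothetical name.

```python
import hashlib

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def plan_embedding_work(prev: dict, curr: dict) -> list:
    # prev and curr map a stable chunk_id -> chunk text.
    # Only new chunks, or chunks whose content hash changed,
    # need to be re-embedded.
    todo = []
    for chunk_id, text in curr.items():
        if chunk_id not in prev or chunk_hash(prev[chunk_id]) != chunk_hash(text):
            todo.append(chunk_id)
    return todo
```

Because chunk IDs stay stable where content has not moved, a one-paragraph edit in a long document produces a to-do list of one chunk rather than a full re-embed.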
Idempotent imports
The importer is keyed by (workspace_id, source_system, source_ref, source_hash). Replays of the same artifact set converge to the same memory rows. Partial failures resume cleanly, and a re-run after a deploy never produces duplicate vectors for content that was already indexed.
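The idempotency key can be modeled as a simple keyed write: replaying the same artifact set lands on the same key and overwrites rather than appends. A minimal sketch, with `import_artifacts` and the in-memory `store` as illustrative stand-ins for the real importer:

```python
def import_artifacts(store, workspace_id, source_system, source_ref, source_hash, artifacts):
    # The composite key makes the import idempotent: a replay of the
    # same artifact set converges to the same rows instead of
    # producing duplicate vectors.
    key = (workspace_id, source_system, source_ref, source_hash)
    store[key] = {a["chunk_id"]: a for a in artifacts}
    return store[key]
```

A new document version changes source_hash, so it lands under a fresh key; a crash-and-retry of the same version does not.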
Tombstones follow the same path. When a chunk is removed from a source, the matching memories are marked tombstoned in a single transaction with the upsert batch, so retrieval never sees a partially applied delta.
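The single-transaction guarantee can be illustrated with SQLite standing in for the real store (UrBrain's actual database and schema are not specified here; the table name `memories` is an assumption):

```python
import sqlite3

def apply_delta_atomically(conn, upserts, tombstone_ids):
    # Upserts and tombstones commit together in one transaction,
    # so retrieval never observes a partially applied delta.
    with conn:  # commits on success, rolls back entirely on failure
        for chunk_id, text, content_hash in upserts:
            conn.execute(
                "INSERT INTO memories (chunk_id, text, content_hash, tombstoned) "
                "VALUES (?, ?, ?, 0) "
                "ON CONFLICT(chunk_id) DO UPDATE SET text=excluded.text, "
                "content_hash=excluded.content_hash, tombstoned=0",
                (chunk_id, text, content_hash),
            )
        conn.executemany(
            "UPDATE memories SET tombstoned = 1 WHERE chunk_id = ?",
            [(c,) for c in tombstone_ids],
        )
```

If any statement in the batch fails, the `with conn:` block rolls the whole delta back, leaving the previous version's memories intact.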
Standalone ingest, later
For the first product, StructuredMerge is internal: customers subscribe to UrBrain and get version-aware document memory without managing a separate ingest subscription. The same engine can later be exposed as a standalone API for teams that want deterministic diff and merge artifacts outside the memory product — CI pipelines, code review systems, RAG preprocessors, and binary-format handlers are all reasonable downstream consumers.