RFC-001: Workflow Run Model

Proposes a durable, replayable run model for Vectorbea workflows: runs, steps, checkpoints, and how they relate to the event history.

Status

Accepted and implemented (v1, January 2026). Superseded in part by refinements described in later articles, particularly around event payload versioning.

Vectorbea workflows can run for minutes to days, involve external side effects, and need to survive process restarts. We need a data model for "a run" that supports resumption, retries, and inspection without relying on in-memory state.

Problem

Our prototype held run state in memory inside the worker process executing it. A crashed or redeployed worker therefore lost track of its in-progress runs entirely: they neither completed nor failed cleanly; they simply stopped existing from the system's point of view.

Goals

A run's state must be fully reconstructable from persisted data at any time.
Any worker must be able to pick up any run at its last known-good point.
The model must support step-level retry without re-running completed steps.
The model must be inspectable: a human should be able to ask "what is run X doing right now and how did it get there?" and get a complete answer.
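
To make the step-level retry goal concrete, here is a minimal sketch in Python. It assumes steps are identified by a 1-based sequence number and that the set of completed sequence numbers stands in for persisted data; the names run_until_failure, completed, and run_step are hypothetical and not part of the design.

```python
def run_until_failure(steps, completed, run_step):
    """Execute steps in order, skipping any whose sequence number is in `completed`.

    `completed` stands in for persisted data: any worker holding it can resume.
    Returns the sequence number of the step that failed, or None if all finished.
    """
    for seq, name in enumerate(steps, start=1):
        if seq in completed:
            continue  # finished in an earlier attempt; never re-run
        try:
            run_step(name)
        except Exception:
            return seq  # stop here; a later call retries from this step
        completed.add(seq)  # a real engine would persist this before moving on
    return None
```

Calling the function a second time with the same completed set retries only the failed step and the ones after it, which is the behavior the third goal asks for.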

Non-goals

Sub-second step transitions. We accept added latency from persistence in exchange for durability.
Distributed transactions across external systems. Steps are responsible for their own idempotency (retry semantics are covered in a later RFC).
Supporting workflow definitions that change shape mid-run. A run executes against a fixed snapshot of its workflow definition.

Proposed design

A runs table holds one row per execution: run ID, workflow version, status, and timestamps. A checkpoints table holds periodic snapshots of accumulated run state, keyed by run ID and step sequence number. An append-only events table (see the event history article for detail) records everything that happens.
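
The three tables could be sketched as follows, using SQLite from Python purely for illustration. Every column name and type here is an assumption (the RFC only names run ID, workflow version, status, timestamps, and step sequence number), and the helpers open_store and latest_checkpoint are hypothetical.

```python
import sqlite3

# Illustrative schema only; column names and types are assumptions.
SCHEMA = """
CREATE TABLE runs (
    run_id           TEXT PRIMARY KEY,
    workflow_version TEXT NOT NULL,
    status           TEXT NOT NULL,
    created_at       TEXT NOT NULL,
    updated_at       TEXT NOT NULL
);
CREATE TABLE checkpoints (
    run_id   TEXT NOT NULL REFERENCES runs(run_id),
    step_seq INTEGER NOT NULL,
    state    TEXT NOT NULL,  -- serialized snapshot of accumulated run state
    PRIMARY KEY (run_id, step_seq)
);
CREATE TABLE events (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- append order
    run_id   TEXT NOT NULL REFERENCES runs(run_id),
    step_seq INTEGER NOT NULL,
    kind     TEXT NOT NULL,
    payload  TEXT NOT NULL
);
"""


def open_store(path=":memory:"):
    """Create the three tables in a fresh SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def latest_checkpoint(conn, run_id):
    """Return (step_seq, state) for the newest checkpoint of a run, or None."""
    return conn.execute(
        "SELECT step_seq, state FROM checkpoints "
        "WHERE run_id = ? ORDER BY step_seq DESC LIMIT 1",
        (run_id,),
    ).fetchone()
```

Keying checkpoints by (run_id, step_seq) makes "the latest checkpoint for run X" a single indexed lookup, which is the read the engine performs most often under this model.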

The engine's "what should happen next" logic always derives its answer from the latest checkpoint plus any events appended after it, never from an in-memory cache that could disagree with persisted state.
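
A minimal sketch of that derivation, under stated assumptions: a checkpoint at step N is taken to cover all steps through N, and the event kinds "step_completed" and "step_failed" are hypothetical placeholders rather than the actual event vocabulary. The function rebuild_state and both dataclasses are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Checkpoint:
    step_seq: int  # assumed: state covers every step up to and including this one
    state: dict


@dataclass(frozen=True)
class Event:
    step_seq: int
    kind: str  # hypothetical kinds: "step_completed", "step_failed"
    payload: dict = field(default_factory=dict)


def rebuild_state(checkpoint, events):
    """Fold events newer than the checkpoint into its state.

    Returns (state, next_step_seq). Nothing is read from process memory, so any
    worker given the same persisted inputs derives the same answer.
    """
    state = dict(checkpoint.state)  # copy, so the checkpoint itself stays untouched
    last_completed = checkpoint.step_seq
    for ev in sorted(events, key=lambda e: e.step_seq):
        if ev.step_seq <= checkpoint.step_seq:
            continue  # already reflected in the checkpoint
        if ev.kind == "step_completed":
            state.update(ev.payload)
            last_completed = ev.step_seq
        # other kinds (e.g. a failure) leave state unchanged; the step is retried
    return state, last_completed + 1
```

For example, with a checkpoint at step 2, a completion event for step 3, and a failure event for step 4, the function returns the merged state through step 3 and reports step 4 as the next step to run.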

Alternatives considered

In-memory state with periodic snapshotting to disk. Faster during normal operation, but reconstruction after a crash requires figuring out how stale the last snapshot was and replaying an unknown amount of in-memory-only history that no longer exists. Rejected: this reintroduces the exact problem we're trying to solve.

A single mutable run_state JSON column, updated in place per step. Simpler to query, but provides no audit trail and is vulnerable to partial-write corruption (a crash mid-update leaves the column in an inconsistent state with no way to detect it). Rejected in favor of an append-only log plus derived projections.

Tradeoffs

Persisting state at every step boundary adds write load and latency compared to an in-memory model. We accept this because the alternative, losing track of in-progress work, is unacceptable for a product whose core promise is durability. The write load has, so far, not been a bottleneck at our current scale; we'd revisit this design if it became one.

Open questions

How do we handle workflow definitions that need to be upgraded while runs against the old version are still in flight? Currently, in-flight runs continue against their original snapshot, and new runs use the new version, but this means two versions can be "live" simultaneously, which has implications for debugging and support.
At what scale does per-step persistence become a bottleneck, and what's the mitigation: batching writes, a faster storage tier for checkpoints, or something else?