RFC
RFC-002: Event History Model
Proposing an append-only, typed event log as the authoritative source of truth for workflow run state, replacing a mutable run-status table.
Status
Accepted and implemented (v1, January 2026). Event payload versioning proposed as a follow-up (tracked separately, not yet an RFC at time of writing).
Context
RFC-001 established the run model: runs, checkpoints, and a basic event log. In practice, the
event log has been treated as a secondary, "for debugging" artifact, while a runs.status
column has been the engine's actual source of truth. The two have begun to disagree after
crashes, and reconciling them has become a recurring debugging task.
Problem
Two representations of "what state is this run in" can drift, and when they do, neither one is trustworthy without cross-referencing the other. We need exactly one source of truth, and a principled way to derive everything else from it.
Goals
- Establish the event log as the sole authoritative record of run state.
- Make "current state" a deterministic function of the event log (full or partial replay).
- Preserve the ability to query "current state" cheaply for UI purposes (a run list shouldn't require replaying every run's full history on every page load).
- Produce, as a side effect, a trustworthy audit trail, without building a separate audit system.
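
As an illustration of the second goal, here is a minimal sketch of "current state" as a pure fold over a run's events, with partial replay resuming from a previously computed state. The Event and RunState shapes, the seq field, the status names ("PENDING", "RUNNING", and so on), and the "step" payload key are all assumptions made for the example, not the engine's actual types.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    seq: int        # position in the run's log (assumed field)
    type: str       # e.g. "RUN_STARTED"
    payload: dict


@dataclass(frozen=True)
class RunState:
    status: str = "PENDING"             # assumed status names
    current_step: Optional[str] = None
    last_seq: int = 0                   # highest event seq already applied


def apply(state: RunState, event: Event) -> RunState:
    """Pure transition: the same state and event always give the same result."""
    if event.type == "RUN_STARTED":
        state = replace(state, status="RUNNING")
    elif event.type == "STEP_STARTED":
        state = replace(state, current_step=event.payload["step"])
    elif event.type == "STEP_COMPLETED":
        state = replace(state, current_step=None)
    elif event.type == "RUN_COMPLETED":
        state = replace(state, status="COMPLETED")
    elif event.type == "RUN_FAILED":
        state = replace(state, status="FAILED")
    # Event types not listed above leave status and step untouched.
    return replace(state, last_seq=event.seq)


def replay(events: Iterable[Event], start: RunState = RunState()) -> RunState:
    """Full replay by default; partial replay when `start` is a saved state."""
    state = start
    for event in events:
        if event.seq > state.last_seq:  # skip events the start state already covers
            state = apply(state, event)
    return state


log = [
    Event(1, "RUN_STARTED", {}),
    Event(2, "STEP_STARTED", {"step": "fetch"}),
    Event(3, "STEP_COMPLETED", {"step": "fetch"}),
    Event(4, "RUN_COMPLETED", {}),
]
saved = replay(log[:2])                 # state as of seq 2
assert replay(log) == replay(log, start=saved)
```

Because apply has no side effects and reads nothing but its arguments, full replay and replay from a saved state converge on the same result, which is the property the goal asks for.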
Non-goals
- Real-time event streaming to the frontend (polling is acceptable for v1; streaming can be layered on later without changing the underlying model).
- Cross-run event correlation or analytics. This RFC is about per-run state, not aggregate reporting.
Proposed design
Make the events table append-only and authoritative. Define a closed set of typed events
(RUN_STARTED, STEP_STARTED, STEP_COMPLETED, STEP_FAILED, RETRY_SCHEDULED,
CHECKPOINT_WRITTEN, APPROVAL_REQUESTED, APPROVAL_GRANTED, RUN_COMPLETED, RUN_FAILED,
and others as needed), each with a versioned JSON payload schema. Maintain a run_summary
projection table, derived from the event log, rebuildable from scratch, for cheap UI queries.
Remove the mutable runs.status column; replace all reads of it with reads of the projection,
and all writes to it with event appends that the projection consumes.
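
The shape of this design can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions rather than the implemented schema: the column names, the seq numbering, the status values in STATUS_FOR, and the helper names (append_event, project, rebuild_projection) are all hypothetical. The triggers are one possible way to enforce append-only behaviour at the storage layer.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE events (
    run_id  TEXT    NOT NULL,
    seq     INTEGER NOT NULL,
    type    TEXT    NOT NULL,
    payload TEXT    NOT NULL,
    PRIMARY KEY (run_id, seq)
);
CREATE TABLE run_summary (
    run_id   TEXT PRIMARY KEY,
    status   TEXT NOT NULL,
    last_seq INTEGER NOT NULL
);
-- Reject any attempt to rewrite or delete history.
CREATE TRIGGER events_no_update BEFORE UPDATE ON events
BEGIN SELECT RAISE(ABORT, 'events is append-only'); END;
CREATE TRIGGER events_no_delete BEFORE DELETE ON events
BEGIN SELECT RAISE(ABORT, 'events is append-only'); END;
"""

# Assumed mapping from event type to projected status.
STATUS_FOR = {
    "RUN_STARTED": "RUNNING",
    "APPROVAL_REQUESTED": "WAITING_APPROVAL",
    "APPROVAL_GRANTED": "RUNNING",
    "RUN_COMPLETED": "COMPLETED",
    "RUN_FAILED": "FAILED",
}


def project(conn, run_id, seq, event_type):
    """Fold one event into run_summary."""
    status = STATUS_FOR.get(event_type)
    if status is None:
        # Event does not change status; only record that it was consumed.
        conn.execute(
            "UPDATE run_summary SET last_seq = ? WHERE run_id = ?", (seq, run_id)
        )
    else:
        conn.execute(
            "INSERT OR REPLACE INTO run_summary (run_id, status, last_seq) "
            "VALUES (?, ?, ?)",
            (run_id, status, seq),
        )


def append_event(conn, run_id, event_type, payload):
    """Append an event and update the projection in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        (seq,) = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM events WHERE run_id = ?",
            (run_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO events (run_id, seq, type, payload) VALUES (?, ?, ?, ?)",
            (run_id, seq, event_type, json.dumps(payload)),
        )
        project(conn, run_id, seq, event_type)
    return seq


def rebuild_projection(conn):
    """Discard run_summary and re-derive it from the event log alone."""
    with conn:
        conn.execute("DELETE FROM run_summary")
        rows = conn.execute(
            "SELECT run_id, seq, type FROM events ORDER BY run_id, seq"
        ).fetchall()
        for run_id, seq, event_type in rows:
            project(conn, run_id, seq, event_type)
```

Every write goes through append_event and every UI read goes to run_summary, so nothing writes status directly. Since rebuild_projection reads only the events table, a projection that is suspected to be stale or corrupt can be thrown away and regenerated.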
Alternatives considered
Keep the mutable status column as a cache, with the event log as a "true" backup. This is roughly the status quo, and the status quo is the problem: a cache that can disagree with the thing it's caching is not a cache; it's a second source of truth wearing a cache's clothes.
Use a general-purpose event sourcing framework. We considered adopting an off-the-shelf event sourcing library to get snapshotting, projections, and replay machinery for free. We decided our needs were narrow enough (one aggregate type, the run; a small, closed set of event types) that a bespoke, minimal implementation would be easier to understand, debug, and extend than a general framework whose abstractions we would have to learn and then constrain ourselves to.
Tradeoffs
Append-only storage means the table only grows, and "current state" requires either replay or maintained projections: more moving parts than a single mutable table. We accept this because it eliminates an entire category of bugs (the two-sources-of-truth class) and gives us audit and replay essentially for free, which we'd otherwise have to build separately.
Open questions
- How do we evolve event payload schemas over time without breaking replay of historical events? We're leaning toward an explicit schema_version field per event type plus migration functions invoked during replay, but haven't fully specified this.
- At what volume does the events table need partitioning or archival, and what does "archival" mean for a table that's also your audit trail?
- Should projections be rebuilt synchronously (on write) or asynchronously (via a queue)? We currently do the former for simplicity; the latter would reduce write-path latency at the cost of eventual consistency in the UI.
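
For the first open question, one possible shape of "schema_version plus migration functions invoked during replay" is sketched below. Since the approach is not yet specified, everything here is an assumption for discussion: the registry, the decorator, the upcast helper, and in particular the STEP_FAILED payload fields ("msg", "error", "retryable") and version numbers are invented for the example.

```python
from typing import Callable, Dict, Tuple

Payload = dict
Migration = Callable[[Payload], Payload]

# (event_type, from_version) -> function that returns the payload at from_version + 1
MIGRATIONS: Dict[Tuple[str, int], Migration] = {}

# Latest schema version per event type (hypothetical values).
CURRENT_VERSION: Dict[str, int] = {"STEP_FAILED": 3}


def migration(event_type: str, from_version: int):
    """Register a one-step migration for an event type."""
    def register(fn: Migration) -> Migration:
        MIGRATIONS[(event_type, from_version)] = fn
        return fn
    return register


@migration("STEP_FAILED", 1)
def _step_failed_v1_to_v2(payload: Payload) -> Payload:
    # Hypothetical change: v2 renamed "msg" to "error".
    out = {k: v for k, v in payload.items() if k != "msg"}
    out["error"] = payload["msg"]
    return out


@migration("STEP_FAILED", 2)
def _step_failed_v2_to_v3(payload: Payload) -> Payload:
    # Hypothetical change: v3 added "retryable", defaulting to False.
    return {**payload, "retryable": payload.get("retryable", False)}


def upcast(event_type: str, schema_version: int, payload: Payload) -> Payload:
    """Bring a stored payload up to the current version, one step at a time.

    The stored event is never modified; the upgraded payload exists only
    in memory for the duration of the replay.
    """
    target = CURRENT_VERSION.get(event_type, schema_version)
    version = schema_version
    while version < target:
        payload = MIGRATIONS[(event_type, version)](payload)
        version += 1
    return payload


stored = {"step": "fetch", "msg": "timeout"}   # as written under v1
upgraded = upcast("STEP_FAILED", 1, stored)
assert upgraded == {"step": "fetch", "error": "timeout", "retryable": False}
assert stored == {"step": "fetch", "msg": "timeout"}  # history untouched
```

Chaining single-step migrations means a new version needs only one new function, and the replay code only ever has to understand the latest payload shape. The cost is that every migration must be kept for as long as events at that version remain in the log.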