BYOK Architecture for an AI SaaS: Benefits, Risks, and Boundaries

Vectorbea supports "bring your own key" (BYOK) for LLM providers: customers can connect their own OpenAI, Anthropic, or other provider accounts, and workflows run against those credentials rather than ones we provide. This post is about why we built it that way, what it buys everyone involved, and the security thinking behind it. I'm deliberately not going to describe our actual secret storage implementation. Partly that's because security through obscurity has limits whether or not we publish the details, but mostly it's because the category of decisions matters more here than our specific answers, which will keep evolving.

Why BYOK at all

Three reasons came up repeatedly in conversations with early users, and they're worth separating because they imply different design priorities:

Cost ownership. Teams want LLM spend to show up on their provider bill, with their negotiated rates and their spend controls, not marked up and bundled into a SaaS invoice.
Compliance and data handling. Some organizations have approved vendor lists, data processing agreements, or regional requirements that dictate which LLM provider (and sometimes which region) their data can touch. BYOK lets the workflow run against a provider relationship the customer has already vetted.
Rate limits and quotas that are actually theirs. A shared pool of provider capacity means noisy neighbors. A customer's own key means their usage competes only with their own usage.

The tradeoff customers are making

BYOK shifts cost and compliance ownership to the customer, which is exactly what they're asking for, but it also means they now depend on their provider relationship being healthy. If their key is rate-limited or their account is suspended, that's now a Vectorbea-visible problem even though it's not a Vectorbea problem to fix. We try to make this distinction clear in the product, because "the workflow failed" and "your OpenAI account hit a spending cap" need very different responses.

What it costs us in complexity

Supporting BYOK is not "add a field for an API key." It changes the shape of several systems:

Secret handling becomes a first-class concern rather than something we can defer. Customer credentials for their paid accounts are now flowing through our infrastructure, and a leak isn't just our problem; it's a bill on someone else's card.
Error attribution gets harder. A failed LLM call could be our bug, the provider's outage, or the customer's account/quota issue. The workflow run, the event history, and the UI all need to be able to say which, or at least narrow it down honestly.
Testing and staging require either mocking providers convincingly or maintaining test credentials that behave like real ones, and real provider behavior (rate limits, streaming quirks, model deprecations) is exactly what's hardest to mock well.
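
To make the error attribution point concrete, here is a minimal sketch of classifying a failed provider call by its HTTP status. It is not our implementation. The names (FailureSource, attribute_failure) are hypothetical, and the status-to-owner mapping is an assumption: real providers differ, and a 429 in particular can mean either the customer's quota or provider-side overload, so a production version would also need to inspect the provider's error body.

```python
from enum import Enum
from typing import Optional


class FailureSource(Enum):
    CUSTOMER_ACCOUNT = "customer_account"  # key, quota, or billing on the customer's side
    PROVIDER = "provider"                  # provider outage or overload
    PLATFORM = "platform"                  # our own bug or a malformed request
    UNKNOWN = "unknown"                    # say so honestly when we can't tell


def attribute_failure(status: Optional[int]) -> FailureSource:
    """Map the HTTP status of a failed provider call to its most likely owner.

    The mapping is an illustrative assumption, not any provider's documented contract.
    """
    if status is None:
        # No response at all (timeout, connection reset): could be either side.
        return FailureSource.UNKNOWN
    if status in (401, 403, 429):
        # Rejected key, denied permission, or a rate/quota limit on the customer's key.
        return FailureSource.CUSTOMER_ACCOUNT
    if status >= 500:
        return FailureSource.PROVIDER
    if 400 <= status < 500:
        # Any other 4xx most likely means we built a bad request.
        return FailureSource.PLATFORM
    return FailureSource.UNKNOWN
```

The UNKNOWN case is the design choice that matters. Narrowing it down honestly means the UI is allowed to say "we couldn't tell" rather than guessing at an owner.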

Tradeoff

We could have shipped faster by only supporting our own pooled provider keys. BYOK delayed several other features because it forced us to get secret handling right before we needed it for anything else. In hindsight this was the correct order of operations: retrofitting careful secret handling onto a system that already has customer credentials flowing through it is a much worse position to be in than building it in from the start.

Boundaries we think any BYOK system needs

Without describing our specific mechanisms, here's the shape of the boundaries we think matter, roughly in order of how non-negotiable they feel to us:

Keys are write-only from the customer's perspective. Once entered, a key should never be displayed again in full, not to the customer, not to support staff, not in logs. If someone needs to verify a key is still valid, that's a "test connection" action against the provider, not a "show me the key" action.
Encryption at rest with access scoped tightly, such that the application code path that uses a key to make a provider call is different from, and much more restricted than, any path that could conceivably export or display it.
No keys in logs, ever, including error logs, including stack traces, including anything that might get forwarded to a third-party observability tool. This requires active scrubbing, not just discipline, because the moment a key reaches a logging call by accident, it's effectively public.
Per-customer isolation strong enough that a bug in one customer's workflow cannot expose another customer's key. This is as much about how workflows are sandboxed and how secrets are injected at execution time as it is about storage.
Revocation has to be immediate and verifiable. A customer rotating or removing a key needs confidence that the old one stops being usable now, not "eventually, once caches expire."
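
As an illustration of what "active scrubbing" for logs can look like, here is a minimal sketch using Python's standard logging module. It is not our mechanism. The key pattern is an assumption (it only matches strings that look like "sk-..." keys), and RedactingFormatter, build_logger, and leak_attempt are hypothetical names. The sketch redacts the fully formatted output, so it covers the traceback text as well as the message, and then deliberately tries to leak a fake key to check that the scrubbing holds.

```python
import io
import logging
import re

# Assumed pattern: matches "sk-"-style keys only. A real list must cover every
# provider's key format that you accept.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{8,}")


class RedactingFormatter(logging.Formatter):
    """Redact key-like strings from the final formatted record, traceback included."""

    def format(self, record: logging.LogRecord) -> str:
        return KEY_PATTERN.sub("[REDACTED]", super().format(record))


def build_logger(stream: io.StringIO) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(RedactingFormatter("%(levelname)s %(message)s"))
    logger = logging.getLogger("byok.sketch")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from handlers without the formatter
    return logger


def leak_attempt() -> str:
    """Try to leak a fake key via a log argument and via an exception message."""
    stream = io.StringIO()
    logger = build_logger(stream)
    fake_key = "sk-" + "a" * 24
    logger.info("calling provider with key %s", fake_key)
    try:
        raise ValueError(f"provider rejected {fake_key}")
    except ValueError:
        logger.exception("provider call failed")
    return stream.getvalue()
```

A formatter only protects the handlers it is attached to, so every handler, including any that forward to a third-party tool, needs the same treatment. Pattern matching is also a backstop rather than a guarantee: a key in a format the pattern doesn't know about will pass straight through.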

A short security checklist for anyone building BYOK

If you're evaluating whether to add BYOK to a product, these are the questions I'd want satisfying answers to before writing any code:

Where does the key live between "customer enters it" and "worker uses it to make a call," and who/what can read it at each point in that path?
What gets logged at each step of using the key, and have you actually tried to make a key show up in a log to confirm your scrubbing works?
If an attacker compromised one application server, what's the blast radius: one customer's keys, or all of them?
How does a customer prove to themselves that revoking a key actually stopped it from being used, without just trusting your word for it?
What happens to in-flight workflow runs when a key is rotated or removed mid-run?
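
One possible answer to the last two questions is to have workers look the key up at the moment of each provider call instead of carrying it for the lifetime of a run. The sketch below assumes exactly that and nothing more: KeyStore is a hypothetical in-memory stand-in for a real secret store, and with_key, run_step, and KeyRevoked are illustrative names, not our implementation.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class KeyRevoked(Exception):
    """Raised when a step needs a key the customer has already removed."""


class KeyStore:
    """In-memory stand-in for a secret store (an assumption for illustration)."""

    def __init__(self) -> None:
        self._keys: Dict[str, str] = {}

    def put(self, customer_id: str, key: str) -> None:
        # Storing a new key replaces the old one, so rotation takes effect at once.
        self._keys[customer_id] = key

    def revoke(self, customer_id: str) -> None:
        self._keys.pop(customer_id, None)

    def with_key(self, customer_id: str, call: Callable[[str], T]) -> T:
        # Look the key up per call; nothing is cached on the worker between steps.
        key = self._keys.get(customer_id)
        if key is None:
            raise KeyRevoked(customer_id)
        return call(key)


def run_step(store: KeyStore, customer_id: str, prompt: str) -> str:
    # The lambda stands in for a real provider call made with the key.
    return store.with_key(customer_id, lambda key: f"provider called for {prompt!r}")
```

With this shape, a run that is between steps when the key is removed fails its next step with a distinct KeyRevoked error, which can be attributed to the customer's action rather than reported as a generic workflow failure. It does not help with a request that is already in flight at the provider; that one completes or fails on the provider's terms.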

Lesson learned

The hardest part of BYOK wasn't the cryptography. It was making sure that every code path that could conceivably touch a key, including the ones written six months later by someone who didn't know the rules, went through the same narrow, audited interface. That's a discipline problem as much as a technical one, and it's the kind of thing that benefits from a boring, repetitive code review checklist more than from clever architecture.
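
One cheap guardrail that supports that kind of review is a wrapper type that refuses to print itself. The sketch below is a generic pattern, not our interface, and SecretKey and reveal are hypothetical names. It is also not a security boundary, since Python code can still reach the private attribute. What it does is make accidental exposure through print, f-strings, or log arguments show a placeholder, and give reviewers a single method name to search for.

```python
class SecretKey:
    """Holds a provider key and hides it from repr(), str(), and f-strings."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    def __repr__(self) -> str:
        return "SecretKey(****)"

    # str() and default formatting fall back to the same redacted text.
    __str__ = __repr__

    def reveal(self) -> str:
        """The only intended way to get the raw key; every call site is reviewable."""
        return self._value
```

For example, `f"using {SecretKey('sk-example')}"` evaluates to `using SecretKey(****)`, while the one code path that actually calls the provider has to call reveal() explicitly.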

Next: how Vectorbea scales workers using Redis Streams, consumer groups, the pending entries list, retry and dead-letter handling, and the point at which we'd reach for something heavier like Kafka.
