Reliability
Governance that never becomes your outage.
At your latency budget.
KoraSafe runs as inline infrastructure, not as a blocking approval. Circuit breakers, per-tenant rate limits, graceful degradation, and an SLA your SRE team already knows how to read.
Live status
Every request path, measured and open
Governance runs at the inference gateway, not in a side approval queue. Latency, dispatch, durability, and vendor breakers are visible in real time so your SRE team can read them the same way they read the rest of your stack.
Gateway, live
Policy decision latency stays under your budget
Policy decisions carry a P99 budget of 200ms with a 500ms hard timeout, so a slow policy never blocks your request path.
Orchestrator
A plan ships before any specialist runs
Every request gets a plan before any specialist runs. Your team sees the path, then the work dispatches. Over 2.1M plans dispatched with autonomy tiers enforced on every step.
Audit, WORM
Write durability you can show a regulator
Every audit record lands on disk in multiple regions before acknowledgment. Tamper-evident by default, provable on demand with Merkle checkpoints that verify offline.
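"Provable on demand" can be checked offline. The sketch below shows one way a checkpoint verification might work; the tree construction (SHA-256 leaves, duplicate-last node at odd levels) is an assumed scheme for illustration, not KoraSafe's documented format:

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise up to a single root; an odd node is paired with itself."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i] + (level[i + 1] if i + 1 < len(level) else level[i])
            nxt.append(hashlib.sha256(pair).digest())
        level = nxt
    return level[0]

def verify_checkpoint(records: list[bytes], checkpoint_hex: str) -> bool:
    # Recompute the root from exported audit records and compare to the published checkpoint.
    return merkle_root(records).hex() == checkpoint_hex
```

Because verification needs only the exported records and the published checkpoint hash, a regulator can run it with no network access to the platform.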
Anthropic
Breaker closed, Bedrock warm
Error rate 0.02%, retries 0.3%, fallback to Bedrock configured. Breaker state, latency, and fallback routing visible to your SRE dashboard.
OpenAI
Breaker closed, Azure OpenAI warm
Error rate 0.04%, p95 latency 420ms, fallback to Azure OpenAI configured. Automatic vendor failover on breaker trips with no application-level handling.
Webhook
At-least-once delivery with a bounded replay window
HMAC-signed with a five-minute replay window. Seven retries with exponential backoff. No drops over 30 days.
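A receiver-side check for the signature and replay window above could look like this; the timestamp-plus-body signing layout and function names are assumptions for illustration, not KoraSafe's actual wire format:

```python
import hashlib
import hmac
import time

REPLAY_WINDOW_S = 300  # the five-minute replay window from the delivery guarantees

def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    # The signature covers timestamp + body, so a captured payload
    # cannot be replayed later with a fresh timestamp.
    msg = str(timestamp).encode() + b"." + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(secret: bytes, timestamp: int, body: bytes, signature: str, now=None) -> bool:
    now = time.time() if now is None else now
    if abs(now - timestamp) > REPLAY_WINDOW_S:
        return False  # outside the replay window, reject even with a valid signature
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)  # constant-time comparison
```

Idempotent handling on the consumer side (keyed on the delivery id) is what makes the at-least-once retries safe to absorb.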
Service-level objectives
SLOs your runbook can cite
Availability: 99.99% quarterly. Measured at the edge, not inside the VPC.
Policy decision latency: under 200 milliseconds for enforced calls. P95 under 450 milliseconds.
Orchestrator dispatch: 99.9% successful plan-to-first-agent within 300 milliseconds.
Audit durability: fsynced to disk before acknowledging. Replicated across zones.
Webhook delivery: 99.9% at-least-once within five minutes. Signed and idempotent.
Policy pack regeneration: under one hour from registry change to regenerated pack.
Circuit breakers
Every external call wrapped
LLM providers
Anthropic, OpenAI, and tenant fallbacks
Every LLM endpoint sits behind a per-provider circuit breaker with closed, half-open, and open states and exponential backoff. Thresholds are configurable per tenant. When a breaker opens, the orchestrator routes to the tenant-declared fallback provider and emits a registry event so the audit trail reflects the degraded path.
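The closed, half-open, and open states with exponential backoff can be sketched in a few lines; the thresholds, backoff constants, and fallback routine below are illustrative defaults, not KoraSafe's actual configuration:

```python
import time

class CircuitBreaker:
    """Per-provider breaker: closed -> open after repeated failures,
    open -> half-open once the backoff window elapses."""

    def __init__(self, failure_threshold=5, base_backoff_s=1.0, max_backoff_s=60.0):
        self.failure_threshold = failure_threshold
        self.base_backoff_s = base_backoff_s
        self.max_backoff_s = max_backoff_s
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def _backoff(self):
        # Exponential backoff keyed on consecutive failures past the threshold.
        exponent = self.failures - self.failure_threshold
        return min(self.base_backoff_s * (2 ** exponent), self.max_backoff_s)

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self._backoff():
                self.state = "half-open"  # let a probe request through
                return True
            return False
        return True  # closed or half-open

    def record_success(self):
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

def call_llm(primary, fallback, breakers, send):
    # Try the primary provider, then the tenant-declared fallback,
    # skipping any provider whose breaker is open.
    for name in (primary, fallback):
        if breakers[name].allow_request():
            try:
                resp = send(name)
                breakers[name].record_success()
                return resp
            except Exception:
                breakers[name].record_failure()
    raise RuntimeError("all providers open or failing")
```

The registry event on a breaker trip would be emitted alongside the state change in `record_failure`, so the audit trail records exactly when the degraded path began.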
Data stores
Postgres, Redis, and object storage
Connection pools are watched per tenant. Slow-query breakers cap long tails and return bounded errors instead of starving the pool. Cache staleness is declared per endpoint so the gateway knows how long a stale read is acceptable before a full fetch.
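A per-endpoint staleness declaration might be consulted like this; the endpoint names and budgets below are hypothetical, chosen only to show the shape of the decision:

```python
import time
from dataclasses import dataclass

# Hypothetical per-endpoint staleness budgets (seconds); names are illustrative.
STALENESS_BUDGET_S = {
    "/policies": 30.0,  # policy packs tolerate brief staleness
    "/audit": 0.0,      # audit reads must always hit the store
}

@dataclass
class CacheEntry:
    value: object
    fetched_at: float

def read(endpoint: str, cache: dict, fetch, now=None):
    now = time.time() if now is None else now
    budget = STALENESS_BUDGET_S.get(endpoint, 0.0)
    entry = cache.get(endpoint)
    if entry is not None and now - entry.fetched_at <= budget:
        return entry.value  # stale-but-acceptable read, no backend hit
    value = fetch(endpoint)  # full fetch; bounded errors handled by the caller
    cache[endpoint] = CacheEntry(value, now)
    return value
```

Declaring the budget next to the endpoint keeps the trade-off explicit: a degraded backend serves slightly stale policy reads instead of queueing behind a saturated pool.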
MCP upstreams
Partner MCP servers
Each MCP origin gets its own breaker so one noisy partner cannot poison the rest. Signed webhooks degrade to bounded-backoff queues instead of dropping, and payloads stay replayable against the 5-minute HMAC-SHA256 replay window.
Notifiers
Slack, Teams, and email
Notifier dispatch is breaker-aware with tenant-level failure caps. If Slack degrades, warnings are buffered and fall through to a backup channel, and PagerDuty still pages on T4 breaks even when softer notifiers are open. Operators see the degraded path in the status panel.
Auditable decisions
Trace every decision back to the request
[2026-04-18T14:22:04.812Z] req_9f3a gateway calls orchestrator.plan
[2026-04-18T14:22:04.851Z] req_9f3a orchestrator dispatches agents.run(PII Sentinel)
[2026-04-18T14:22:04.912Z] req_9f3a pii-sentinel returns 2 matches, redact-strict
[2026-04-18T14:22:04.983Z] req_9f3a orchestrator applies policy(eu-ai-act)
[2026-04-18T14:22:05.041Z] req_9f3a policy allow, tier=high-risk
[2026-04-18T14:22:05.098Z] req_9f3a gateway response 200 (286ms)
// checkpoint: a8f3c2e1d4... @ 2026-04-18T14:25:00Z
Incidents
What we do when something breaks
Error-budget burn crosses a fast-burn threshold and Axiom pages the on-call within 60 seconds. Alerts carry the stable request id, the SLO that burned, the service, and the blast-radius estimate, so the responder starts triage with context instead of grep.
The Incident Agent auto-launches the matching runbook, cuts a dedicated Slack channel, assigns incident commander, scribe, and comms, and posts the first customer status update. Roles are declared per runbook so handoffs during long incidents stay unambiguous.
Mitigation precedes root cause. The IC picks a mitigation from the runbook, the agent executes it under the tenant tier, and customer status posts every 20 minutes until green. The fix lands in a follow-up PR once the system is stable and traffic is normal.
Every T3+ incident gets a blameless postmortem within five business days. Action items land in the linked tracker with owners and due dates. Design partners and Founding Members see the full document, including timeline, blast radius, and corrective controls.