p99 142ms · uptime 99.97%

Reliability

Governance that never becomes your outage.
At your latency budget.

KoraSafe runs as inline infrastructure, not as a blocking approval queue. Circuit breakers, per-tenant rate limits, graceful degradation, and an SLA your SRE team already knows how to read.

99.99%
API availability SLO
Quarterly target
<200ms
P50 policy decision
At the gateway
<5s
Time-to-detect on breakers
Axiom-backed telemetry
1.0
Error budget burn multiplier
Alerts wire to oncall

Live status

Every request path, measured and open

Governance runs at the inference gateway, not in a side approval queue. Latency, dispatch, durability, and vendor breakers are visible in real time so your SRE team can read them the same way they read the rest of your stack.

Gateway, live
Policy decision latency stays under your budget
p50 38ms, burn 0.3

P99 budget 200ms with a 500ms hard timeout, so a slow policy decision never blocks your request path; the timeout pattern is sketched after this panel.

p50
38ms
p99
142ms
Throughput
8,212 rps
SLO
99.99%
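
A minimal sketch of that hard-timeout pattern, assuming a Promise.race-style ceiling; the names decideWithBudget and evaluate, and the "timeout" verdict, are illustrative, not KoraSafe's actual API.

sketch (TypeScript)
type Verdict = "allow" | "deny" | "timeout";

// Hard-timeout wrapper: whichever settles first wins, so a slow policy
// decision can never hold the request path past the 500ms ceiling.
async function decideWithBudget(
  evaluate: () => Promise<"allow" | "deny">, // the policy call (hypothetical)
  hardTimeoutMs = 500,
): Promise<Verdict> {
  const ceiling = new Promise<"timeout">((resolve) =>
    setTimeout(() => resolve("timeout"), hardTimeoutMs),
  );
  return Promise.race([evaluate(), ceiling]);
}
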
Orchestrator
A plan ships before any specialist runs
99.94% plan-to-first-agent, 7d

Every request gets a plan before any specialist runs. Your team sees the path, then the work dispatches. Over 2.1M plans dispatched with autonomy tiers enforced on every step.

Dispatch
99.94%
Plans, 7d
2.1M
Autonomy
Enforced
Window
Rolling 7d
Audit, WORM
Write durability you can show a regulator
100%, 0 rollbacks, 30d

Every audit record lands on disk in multiple regions before acknowledgment. Tamper-evident by default, provable on demand with Merkle checkpoints that verify offline; a verification sketch follows this panel.

Durability
100%
Rollbacks
0
Tamper breaks
0
Checkpoint misses
0
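
How such a checkpoint could verify offline, as a minimal sketch: leaves are SHA-256 hashes of exported audit records, pairs hash up to a root, and the root is compared to the checkpoint published in the trace. The exact leaf and node encoding here is an assumption, not KoraSafe's documented format.

sketch (TypeScript)
import { createHash } from "node:crypto";

const sha256 = (data: Buffer): Buffer =>
  createHash("sha256").update(data).digest();

// Fold hashed leaves pairwise up to a single root, duplicating an odd tail.
function merkleRoot(leaves: Buffer[]): Buffer {
  if (leaves.length === 0) throw new Error("empty tree");
  let level = leaves;
  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(sha256(Buffer.concat([level[i], level[i + 1] ?? level[i]])));
    }
    level = next;
  }
  return level[0];
}

// An auditor recomputes the root from exported records and compares it to
// the published checkpoint hash, with no network access required.
function verifyCheckpoint(records: Buffer[], checkpointHex: string): boolean {
  return merkleRoot(records.map(sha256)).toString("hex") === checkpointHex;
}
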
Anthropic
Breaker closed, Bedrock warm
healthy, 30d

Error rate 0.02%, retries 0.3%, fallback to Bedrock configured. Breaker state, latency, and fallback routing visible to your SRE dashboard.

OpenAI
Breaker closed, Azure OpenAI warm
healthy, 30d

Error rate 0.04%, p95 latency 420ms, fallback to Azure OpenAI configured. Automatic vendor failover on breaker trips with no application-level handling.

Webhook
At-least-once delivery with a bounded replay window
99.98% under 90s

HMAC-signed with a five-minute replay window. Seven retries with exponential backoff. No drops over 30 days.
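
Receiver-side verification under that scheme might look like the following minimal sketch; the timestamp header and the `${timestamp}.${body}` signing layout are assumptions, not KoraSafe's documented format.

sketch (TypeScript)
import { createHmac, timingSafeEqual } from "node:crypto";

const REPLAY_WINDOW_MS = 5 * 60 * 1000; // the five-minute replay window

function verifyWebhook(
  body: string,
  timestampHeader: string, // unix seconds, e.g. "1744984924"
  signatureHeader: string, // hex HMAC-SHA256 of `${timestamp}.${body}`
  secret: string,
): boolean {
  // Reject anything outside the replay window before touching the HMAC.
  const ageMs = Date.now() - Number(timestampHeader) * 1000;
  if (!Number.isFinite(ageMs) || ageMs < 0 || ageMs > REPLAY_WINDOW_MS) {
    return false;
  }
  const expected = createHmac("sha256", secret)
    .update(`${timestampHeader}.${body}`)
    .digest();
  const given = Buffer.from(signatureHeader, "hex");
  // Constant-time compare; a length mismatch fails without throwing.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
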

Service-level objectives

SLOs your runbook can cite

Gateway availability

99.99% quarterly. Measured at the edge, not inside the VPC.

P50 policy decision

Under 200 milliseconds for enforced calls. P95 under 450 milliseconds.

Orchestrator dispatch

99.9% successful plan-to-first-agent within 300 milliseconds.

Audit write durability

Fsynced to disk before acknowledging. Replicated across zones.

Webhook delivery

99.9% at-least-once within five minutes. Signed and idempotent.

Evidence pack freshness

Under one hour from registry change to regenerated pack.

Circuit breakers

Every external call wrapped

LLM providers
Anthropic, OpenAI, and tenant fallbacks
Per-provider, exp backoff

Every LLM endpoint sits behind a per-provider circuit breaker with closed, half-open, and open states and exponential backoff. Thresholds are configurable per tenant. When a breaker opens, the orchestrator routes to the tenant-declared fallback provider and emits a registry event so the audit trail reflects the degraded path. The state machine is sketched after this panel.

States
Closed, half-open, open
Backoff
Exponential with jitter
Fallback
Tenant-declared
Audit
Degradation logged
Anthropic · OpenAI · Bedrock · Azure OpenAI · Google
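
The state machine itself fits in a few lines. A minimal sketch with illustrative thresholds; per-tenant configuration and the registry event on degradation are omitted.

sketch (TypeScript)
type BreakerState = "closed" | "half-open" | "open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private baseMs = 1_000; // backoff base, doubles on each trip
  private waitMs = 1_000; // jittered wait before the next probe

  constructor(private threshold = 5, private maxMs = 60_000) {}

  async call<T>(primary: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.waitMs) return fallback();
      this.state = "half-open"; // let a single probe through
    }
    try {
      const result = await primary();
      this.state = "closed"; // probe succeeded: close and reset backoff
      this.failures = 0;
      this.baseMs = this.waitMs = 1_000;
      return result;
    } catch {
      this.trip();
      return fallback(); // tenant-declared fallback provider
    }
  }

  private trip(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = Date.now();
      this.baseMs = Math.min(this.baseMs * 2, this.maxMs);
      this.waitMs = this.baseMs * (0.5 + Math.random() / 2); // backoff with jitter
    }
  }
}
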
Data stores
Postgres, Redis, and object storage
Pool and query watchdogs

Connection pools are watched per tenant. Slow-query breakers cap long tails and return bounded errors instead of starving the pool, as the sketch after this panel shows. Cache staleness is declared per endpoint so the gateway knows how long a stale read is acceptable before a full fetch.

Pool watchdog
Per tenant
Slow query
Bounded and capped
Cache TTL
Per endpoint
Storage
fsync, 3-zone
Postgres · Redis · S3 · GCS
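
One way to get a bounded error instead of a starved pool, sketched with node-postgres; the pool size and the 2-second cap are illustrative, not KoraSafe's settings.

sketch (TypeScript)
import { Pool } from "pg";

const pool = new Pool({
  max: 10,                        // per-tenant connection cap (illustrative)
  connectionTimeoutMillis: 1_000, // fail fast when the pool is saturated
  statement_timeout: 2_000,       // Postgres cancels statements past 2s
});

// Callers get a bounded error; the connection is freed rather than held
// by a runaway query.
async function boundedQuery(sql: string, params: unknown[] = []) {
  try {
    return await pool.query(sql, params);
  } catch (err) {
    throw new Error(`query capped or pool exhausted: ${(err as Error).message}`);
  }
}
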
MCP upstreams
Partner MCP servers
Per-origin, signed

Each MCP origin gets its own breaker so one noisy partner cannot poison the rest. Signed webhooks degrade to bounded-backoff queues instead of dropping (sketched after this panel), and payloads stay replayable against the 5-minute HMAC-SHA256 replay window.

Isolation
Per MCP origin
Webhooks
HMAC-SHA256
Replay window
5 min
Drop policy
Queue, never drop
MCP 2025-06 · A2A · Webhooks · Server-Sent Events
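
Degrade-to-queue, sketched minimally: a failed delivery re-enters a bounded-backoff queue with a due time instead of being dropped. The in-memory queue and the 60-second backoff cap are assumptions for illustration.

sketch (TypeScript)
type Delivery = { payload: string; attempt: number; notBefore: number };

const queue: Delivery[] = [];

function enqueue(payload: string, attempt = 0): void {
  const backoffMs = Math.min(1_000 * 2 ** attempt, 60_000); // bounded backoff
  queue.push({ payload, attempt, notBefore: Date.now() + backoffMs });
}

// Drain on a timer: skip deliveries that are not due, retry the rest, and
// re-enqueue failures with a longer backoff. Nothing is ever dropped.
async function drain(send: (payload: string) => Promise<void>): Promise<void> {
  const now = Date.now();
  for (const d of queue.splice(0)) {
    if (d.notBefore > now) {
      queue.push(d);
      continue;
    }
    try {
      await send(d.payload);
    } catch {
      enqueue(d.payload, d.attempt + 1);
    }
  }
}
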
Notifiers
Slack, Teams, and email
Breaker-aware, paging preserved

Notifier dispatch is breaker-aware with tenant-level failure caps. If Slack degrades, warnings are buffered and fall through to a backup channel, and PagerDuty still pages on T4 breaks even when softer notifiers are open. Operators see the degraded path in the status panel.

Caps
Tenant-level
Fallback
Backup channel
Pager
Always on T4
Status
Surfaced inline
Slack · Teams · Email · PagerDuty · Opsgenie

Auditable decisions

Trace every decision back to the request

trace (truncated)
[2026-04-18T14:22:04.812Z] req_9f3a gateway      calls orchestrator.plan
[2026-04-18T14:22:04.851Z] req_9f3a orchestrator dispatches agents.run(PII Sentinel)
[2026-04-18T14:22:04.912Z] req_9f3a pii-sentinel returns 2 matches, redact-strict
[2026-04-18T14:22:04.983Z] req_9f3a orchestrator applies policy(eu-ai-act)
[2026-04-18T14:22:05.041Z] req_9f3a policy       allow, tier=high-risk
[2026-04-18T14:22:05.098Z] req_9f3a gateway      response 200 (286ms)
// checkpoint: a8f3c2e1d4... @ 2026-04-18T14:25:00Z

Incidents

What we do when something breaks

Detect
SLO burn
Burn-rate alert on Axiom, oncall paged in under a minute.

Error-budget burn crosses a fast-burn threshold and Axiom pages the oncall within 60 seconds. Alerts carry the stable request id, the SLO that burned, the service, and the blast radius estimate so the responder starts triage with context instead of grep. The burn-rate arithmetic is sketched after this panel.

Signal
SLO fast-burn
Page latency
< 60 s
Context
req_id, SLO, blast
Sink
Axiom, PagerDuty
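
The arithmetic behind a fast-burn page, as a minimal sketch; the 14.4x threshold is the common multi-window default from SRE practice, an assumption rather than KoraSafe's configured value.

sketch (TypeScript)
// Burn rate = observed error ratio / allowed error budget.
// 1.0 means the budget is being spent exactly on pace for the window.
function burnRate(errorRatio: number, slo: number): number {
  return errorRatio / (1 - slo);
}

const FAST_BURN = 14.4; // common 1-hour fast-burn default (assumption)

// Example: 0.05% errors against a 99.99% SLO burns budget at ~5x pace,
// elevated but below the fast-burn page.
const rate = burnRate(0.0005, 0.9999); // ≈ 5
if (rate > FAST_BURN) {
  // page oncall with req_id, the SLO that burned, service, blast radius
}
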
Triage
Runbook
Incident Agent opens the runbook, assigns roles, posts status.

The Incident Agent auto-launches the matching runbook, cuts a dedicated Slack channel, assigns incident commander, scribe, and comms, and posts the first customer status update. Roles are declared per runbook so handoffs during long incidents stay unambiguous.

Runbook
Auto-launched
Channel
Per-incident Slack
Roles
IC, scribe, comms
First status
< 10 min
Mitigate first
Stabilize
Stabilize before fixing. Customer status every 20 minutes.

Mitigation precedes root cause. The IC picks a mitigation from the runbook, the agent executes it under the tenant tier, and customer status posts every 20 minutes until green. The fix lands in a follow-up PR once the system is stable and traffic is normal.

Order
Mitigate, then fix
Cadence
20-min updates
Exec
Runbook under tenant tier
Close
On green, not on fix
Blameless
Postmortem
Write-up, action items, shared with design partners.

Every T3+ incident gets a blameless postmortem within five business days. Action items land in the linked tracker with owners and due dates. Design partners and Founding Members see the full document, including timeline, blast radius, and corrective controls.

SLA
5 business days
Tone
Blameless
Audience
Design partners, Founding Members
Actions
Owners and due dates