Reliability
Governance that never becomes your outage.
At your latency budget.
KoraSafe runs as inline infrastructure, not as a blocking approval. Circuit breakers, per-tenant rate limits, graceful degradation, and an SLA your SRE team already knows how to read.
Live status
Every request path, measured and open
Governance runs at the inference gateway, not in a side approval queue. Latency, dispatch, durability, and vendor breakers are visible in real time so your SRE team can read them the same way they read the rest of your stack.
Gateway, live
Policy decision latency stays under your budget
Policy decisions carry a P99 budget of 200ms with a 500ms hard timeout, so a slow policy never blocks your request path.
Orchestrator
A plan ships before any specialist runs
Every request gets a plan before any specialist runs. Your team sees the path, then the work dispatches. Over 2.1M plans dispatched with autonomy tiers enforced on every step.
Audit, WORM
Write durability you can show a regulator
Every audit record lands on disk in multiple regions before acknowledgment. Tamper-evident by default, provable on demand with Merkle checkpoints that verify offline.
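"Provable on demand" can be checked offline. The sketch below shows one way a checkpoint verification might work; the tree construction (SHA-256 leaves, duplicate-last node at odd levels) is an assumed scheme for illustration, not KoraSafe's documented format:

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise up to a single root; an odd node is paired with itself."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i] + (level[i + 1] if i + 1 < len(level) else level[i])
            nxt.append(hashlib.sha256(pair).digest())
        level = nxt
    return level[0]

def verify_checkpoint(records: list[bytes], checkpoint_hex: str) -> bool:
    # Recompute the root from exported audit records and compare to the published checkpoint.
    return merkle_root(records).hex() == checkpoint_hex
```

Because verification needs only the exported records and the published checkpoint hash, a regulator can run it with no network access to the platform.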
Anthropic
Breaker closed, Bedrock warm
Error rate 0.02%, retries 0.3%, fallback to Bedrock configured. Breaker state, latency, and fallback routing visible to your SRE dashboard.
OpenAI
Breaker closed, Azure OpenAI warm
Error rate 0.04%, p95 latency 420ms, fallback to Azure OpenAI configured. Automatic vendor failover on breaker trips with no application-level handling.
Webhook
At-least-once delivery with a bounded replay window
HMAC-signed with a five-minute replay window. Seven retries with exponential backoff. No drops over 30 days.
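A receiver-side check for the signature and replay window above could look like this; the timestamp-plus-body signing layout and function names are assumptions for illustration, not KoraSafe's actual wire format:

```python
import hashlib
import hmac
import time

REPLAY_WINDOW_S = 300  # the five-minute replay window from the delivery guarantees

def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    # The signature covers timestamp + body, so a captured payload
    # cannot be replayed later with a fresh timestamp.
    msg = str(timestamp).encode() + b"." + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(secret: bytes, timestamp: int, body: bytes, signature: str, now=None) -> bool:
    now = time.time() if now is None else now
    if abs(now - timestamp) > REPLAY_WINDOW_S:
        return False  # outside the replay window, reject even with a valid signature
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)  # constant-time comparison
```

Idempotent handling on the consumer side (keyed on the delivery id) is what makes the at-least-once retries safe to absorb.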
Service-level objectives
SLOs your runbook can cite
Availability: 99.99% quarterly. Measured at the edge, not inside the VPC.
Policy decision latency: under 200 milliseconds for enforced calls. P95 under 450 milliseconds.
Orchestrator dispatch: 99.9% successful plan-to-first-agent within 300 milliseconds.
Audit durability: fsynced to disk before acknowledging. Replicated across zones.
Webhook delivery: 99.9% at-least-once within five minutes. Signed and idempotent.
Policy pack regeneration: under one hour from registry change to regenerated pack.
Circuit breakers
Every external call wrapped
LLM providers
Anthropic, OpenAI, and tenant fallbacks
Every LLM endpoint sits behind a per-provider circuit breaker with closed, half-open, and open states and exponential backoff. Thresholds are configurable per tenant. When a breaker opens, the orchestrator routes to the tenant-declared fallback provider and emits a registry event so the audit trail reflects the degraded path.
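The closed, half-open, and open states with exponential backoff can be sketched in a few lines; the thresholds, backoff constants, and fallback routine below are illustrative defaults, not KoraSafe's actual configuration:

```python
import time

class CircuitBreaker:
    """Per-provider breaker: closed -> open after repeated failures,
    open -> half-open once the backoff window elapses."""

    def __init__(self, failure_threshold=5, base_backoff_s=1.0, max_backoff_s=60.0):
        self.failure_threshold = failure_threshold
        self.base_backoff_s = base_backoff_s
        self.max_backoff_s = max_backoff_s
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def _backoff(self):
        # Exponential backoff keyed on consecutive failures past the threshold.
        exponent = self.failures - self.failure_threshold
        return min(self.base_backoff_s * (2 ** exponent), self.max_backoff_s)

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self._backoff():
                self.state = "half-open"  # let a probe request through
                return True
            return False
        return True  # closed or half-open

    def record_success(self):
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

def call_llm(primary, fallback, breakers, send):
    # Try the primary provider, then the tenant-declared fallback,
    # skipping any provider whose breaker is open.
    for name in (primary, fallback):
        if breakers[name].allow_request():
            try:
                resp = send(name)
                breakers[name].record_success()
                return resp
            except Exception:
                breakers[name].record_failure()
    raise RuntimeError("all providers open or failing")
```

The registry event on a breaker trip would be emitted alongside the state change in `record_failure`, so the audit trail records exactly when the degraded path began.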
Data stores
Postgres, Redis, and object storage
Connection pools are watched per tenant. Slow-query breakers cap long tails and return bounded errors instead of starving the pool. Cache staleness is declared per endpoint so the gateway knows how long a stale read is acceptable before a full fetch.
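A per-endpoint staleness declaration might be consulted like this; the endpoint names and budgets below are hypothetical, chosen only to show the shape of the decision:

```python
import time
from dataclasses import dataclass

# Hypothetical per-endpoint staleness budgets (seconds); names are illustrative.
STALENESS_BUDGET_S = {
    "/policies": 30.0,  # policy packs tolerate brief staleness
    "/audit": 0.0,      # audit reads must always hit the store
}

@dataclass
class CacheEntry:
    value: object
    fetched_at: float

def read(endpoint: str, cache: dict, fetch, now=None):
    now = time.time() if now is None else now
    budget = STALENESS_BUDGET_S.get(endpoint, 0.0)
    entry = cache.get(endpoint)
    if entry is not None and now - entry.fetched_at <= budget:
        return entry.value  # stale-but-acceptable read, no backend hit
    value = fetch(endpoint)  # full fetch; bounded errors handled by the caller
    cache[endpoint] = CacheEntry(value, now)
    return value
```

Declaring the budget next to the endpoint keeps the trade-off explicit: a degraded backend serves slightly stale policy reads instead of queueing behind a saturated pool.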
MCP upstreams
Partner MCP servers
Each MCP origin gets its own breaker so one noisy partner cannot poison the rest. Signed webhooks degrade to bounded-backoff queues instead of dropping, and payloads stay replayable against the 5-minute HMAC-SHA256 replay window.
Notifiers
Slack, Teams, and email
Notifier dispatch is breaker-aware with tenant-level failure caps. If Slack degrades, warnings are buffered and fall through to a backup channel, and PagerDuty still pages on T4 breaks even when softer notifiers are open. Operators see the degraded path in the status panel.
Auditable decisions
Trace every decision back to the request
[2026-04-18T14:22:04.812Z] req_9f3a gateway calls orchestrator.plan
[2026-04-18T14:22:04.851Z] req_9f3a orchestrator dispatches agents.run(PII Sentinel)
[2026-04-18T14:22:04.912Z] req_9f3a pii-sentinel returns 2 matches, redact-strict
[2026-04-18T14:22:04.983Z] req_9f3a orchestrator applies policy(eu-ai-act)
[2026-04-18T14:22:05.041Z] req_9f3a policy allow, tier=high-risk
[2026-04-18T14:22:05.098Z] req_9f3a gateway response 200 (286ms)
// checkpoint: a8f3c2e1d4... @ 2026-04-18T14:25:00Z
Incidents
What we do when something breaks
Error-budget burn crosses a fast-burn threshold and Axiom pages the on-call within 60 seconds. Alerts carry the stable request id, the SLO that burned, the service, and the blast-radius estimate, so the responder starts triage with context instead of grep.
The Incident Agent auto-launches the matching runbook, cuts a dedicated Slack channel, assigns incident commander, scribe, and comms, and posts the first customer status update. Roles are declared per runbook so handoffs during long incidents stay unambiguous.
Mitigation precedes root cause. The IC picks a mitigation from the runbook, the agent executes it under the tenant tier, and customer status posts every 20 minutes until green. The fix lands in a follow-up PR once the system is stable and traffic is normal.
Every T3+ incident gets a blameless postmortem within five business days. Action items land in the linked tracker with owners and due dates. Design partners and Founding Members see the full document, including timeline, blast radius, and corrective controls.