In development

Native classifiers

How native classifiers detect AI risk.

Four built-in runtime classifiers cover PII detection, hallucination, toxicity, and prompt injection. Each runs inline on the telemetry stream and returns a structured verdict. The classifiers operate independently of one another, and their verdicts are combined via precedence rules.

Availability note

The native classifier bundle ships as part of the native edge shipper, currently in development. The specifications below describe the target state and preview implementation. Accuracy figures come from KoraSafe internal evaluation datasets. Independent third-party evaluation is planned before GA release.

Classifiers

Classifiers and risk surfaces

| Detector | Version | Technique | Accuracy | P99 latency | What it detects |
|---|---|---|---|---|---|
| PII-detect | 3.2.0 | Pattern + ML | 99.4% | 1.4 ms | Personally identifiable information in prompts and responses: names, emails, phone numbers, SSNs, financial identifiers, health data markers, and configurable custom patterns. |
| Hallucination | 2.1.4 | NLI ensemble | 91.8% | 8.2 ms | Factual inconsistencies and unsupported claims in model output. Natural language inference (NLI) scores entailment between claims and the grounding context. Highest latency of the four, due to the NLI pass. |
| Toxicity | 4.0.1 | Small LM | 96.2% | 6.1 ms | Harmful, hateful, or policy-violating content in generated output. Runs a small fine-tuned language model classifier. Policy thresholds are configurable per system and sector pack. |
| Prompt-injection | 2.3.1 | Pattern + ML | 96.8% | 2.1 ms | Instruction override attempts in user input: jailbreaks, role-hijack prompts, and indirect injection via documents or tool outputs. Combines pattern matching with ML scoring. |

Accuracy figures come from KoraSafe internal evaluation sets. P99 latency is measured at single-classifier throughput on reference hardware. Combined latency depends on which classifiers are enabled per policy.
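As a rough illustration of how combined latency varies with the enabled set, the sketch below computes two back-of-envelope budgets from the P99 figures in the table. The helper names are hypothetical, and the source does not say whether classifiers run sequentially or concurrently, so both cases are shown. Summing per-classifier P99 values gives a planning figure, not a true combined P99.

```typescript
// P99 figures copied from the table above; everything else here is an assumption.
type DetectorId = "PII-detect" | "Hallucination" | "Toxicity" | "Prompt-injection";

const P99_LATENCY_MS: Record<DetectorId, number> = {
  "PII-detect": 1.4,
  "Hallucination": 8.2,
  "Toxicity": 6.1,
  "Prompt-injection": 2.1,
};

// Rough budget if the enabled classifiers run one after another (assumed).
function sequentialBudgetMs(enabled: DetectorId[]): number {
  return enabled.reduce((sum, id) => sum + P99_LATENCY_MS[id], 0);
}

// Rough budget if they run concurrently (assumed): the slowest one dominates.
function parallelBudgetMs(enabled: DetectorId[]): number {
  return enabled.reduce((max, id) => Math.max(max, P99_LATENCY_MS[id]), 0);
}

// Example policy that enables only the two pattern + ML classifiers.
const enabledByPolicy: DetectorId[] = ["PII-detect", "Prompt-injection"];
const sequentialMs = sequentialBudgetMs(enabledByPolicy); // roughly 3.5
const parallelMs = parallelBudgetMs(enabledByPolicy);     // 2.1
```

Under either assumption, enabling the Hallucination classifier has the largest effect on the budget, since its NLI pass carries the highest P99.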

Verdict schema

What each classifier returns

Every classifier produces a structured verdict object. Multiple classifiers can fire on the same event; verdicts combine via the precedence rules described below.

{
  action: "pass" | "flag" | "block",
  detector_id: string,               // e.g. "PII-detect"
  detector_version: string,          // e.g. "3.2.0"
  confidence: number,                // 0-100
  reason: string,                    // human-readable explanation
  evidence_span_references: array,   // span refs into the source payload
  policy_threshold: number,          // threshold that triggered the action
  source: "native:c13"
}
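A consumer of these verdicts might model the schema as a TypeScript interface with a runtime guard. The following is a minimal sketch, not the shipped type definitions: the names `Verdict` and `isVerdict` are hypothetical, the element type of `evidence_span_references` is left as `unknown` because the schema does not specify it, and the sample values are invented for illustration.

```typescript
type Action = "pass" | "flag" | "block";

interface Verdict {
  action: Action;
  detector_id: string;
  detector_version: string;
  confidence: number;                  // 0-100
  reason: string;
  evidence_span_references: unknown[]; // element shape is not given in the schema
  policy_threshold: number;
  source: "native:c13";
}

// Runtime check that an untyped payload matches the schema above.
function isVerdict(value: unknown): value is Verdict {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const confidence = v["confidence"];
  return (
    (v["action"] === "pass" || v["action"] === "flag" || v["action"] === "block") &&
    typeof v["detector_id"] === "string" &&
    typeof v["detector_version"] === "string" &&
    typeof confidence === "number" && confidence >= 0 && confidence <= 100 &&
    typeof v["reason"] === "string" &&
    Array.isArray(v["evidence_span_references"]) &&
    typeof v["policy_threshold"] === "number" &&
    v["source"] === "native:c13"
  );
}

// Illustrative payload only; these values are made up.
const sample: unknown = {
  action: "flag",
  detector_id: "PII-detect",
  detector_version: "3.2.0",
  confidence: 87,
  reason: "Email address found in prompt",
  evidence_span_references: [],
  policy_threshold: 80,
  source: "native:c13",
};
const sampleIsValid = isVerdict(sample);
```

The guard rejects payloads with an unknown action or a confidence outside 0-100, which keeps malformed records out of any downstream combination step.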
Precedence rules

Combining verdicts from multiple classifiers

When multiple classifiers fire on the same telemetry event, verdicts are combined using a deterministic precedence order:

1. Block beats flag

A block verdict from any classifier takes precedence over a flag from another. The combined action is block, using the highest-confidence block verdict.

2. Confidence wins within action class

When two classifiers return the same action level (both block, or both flag), the one with higher confidence is used as the primary verdict record.

3. Cross-source confirmation lift

When two independent classifiers detect evidence on the same span, confidence is boosted on the combined verdict to reflect corroborating signal from different detection techniques.
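The three rules above can be sketched as a single combining function. This is one possible reading under stated assumptions, not the shipped implementation: `combineVerdicts` and `CONFIRMATION_LIFT` are hypothetical names, the lift of 5 points is a placeholder because the source does not state the boost size, span references are assumed to be comparable string ids, and only the verdict fields needed for combination are modeled.

```typescript
type Action = "pass" | "flag" | "block";

interface Verdict {
  action: Action;
  detector_id: string;
  confidence: number;                 // 0-100
  evidence_span_references: string[]; // assumed: spans are string ids
}

const ACTION_RANK: Record<Action, number> = { pass: 0, flag: 1, block: 2 };
const CONFIRMATION_LIFT = 5; // hypothetical; the boost size is not specified

function combineVerdicts(verdicts: Verdict[]): Verdict | undefined {
  if (verdicts.length === 0) return undefined;

  // Rules 1 and 2: the strongest action wins, then the highest confidence.
  const primary = verdicts.reduce((best, v) => {
    const byAction = ACTION_RANK[v.action] - ACTION_RANK[best.action];
    if (byAction !== 0) return byAction > 0 ? v : best;
    return v.confidence > best.confidence ? v : best;
  });

  // Rule 3: a different detector reports flag/block evidence on a shared span.
  const spans = new Set(primary.evidence_span_references);
  const corroborated = verdicts.some(
    (v) =>
      v.detector_id !== primary.detector_id &&
      v.action !== "pass" &&
      v.evidence_span_references.some((s) => spans.has(s)),
  );

  const confidence = corroborated
    ? Math.min(100, primary.confidence + CONFIRMATION_LIFT)
    : primary.confidence;
  return { ...primary, confidence };
}

// Illustrative input; detector ids come from the table, all other values are made up.
const combined = combineVerdicts([
  { action: "flag", detector_id: "PII-detect", confidence: 95, evidence_span_references: ["span-1"] },
  { action: "block", detector_id: "Prompt-injection", confidence: 80, evidence_span_references: ["span-1"] },
  { action: "block", detector_id: "Toxicity", confidence: 70, evidence_span_references: ["span-2"] },
]);
// combined: block from Prompt-injection with confidence 85 (80 plus the assumed lift)
```

In this example the 95-confidence flag loses to both blocks, the 80-confidence block beats the 70-confidence block, and the PII-detect flag on the same span triggers the lift.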

Throughput

Scale characteristics

~12,000 events/sec: classifier throughput at preview scale

4 shippers: telemetry fleet in preview configuration

k ≥ 5: k-value protection on egress events

K-value tracking enforces a k-anonymity-style minimum cohort size on egress. Events whose cohort falls below k = 5 are held pending cohort growth or withheld from aggregate outputs.
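The egress threshold can be pictured as a simple gate on cohort size. The sketch below is illustrative only: `EgressEvent`, `cohortSize`, `egressDecision`, and `partitionForEgress` are hypothetical names, and how k is actually tracked per event is not described in the source.

```typescript
const K_MIN = 5; // from the "k ≥ 5" figure above

type EgressDecision = "release" | "hold";

// Hypothetical shape: cohortSize is the k-value tracked for the event's cohort.
interface EgressEvent {
  event_id: string;
  cohortSize: number;
}

// Release only when the cohort meets the minimum k; otherwise hold the event.
function egressDecision(event: EgressEvent, kMin: number = K_MIN): EgressDecision {
  return event.cohortSize >= kMin ? "release" : "hold";
}

// Split a batch into events that may leave and events that must wait.
function partitionForEgress(events: EgressEvent[]): { released: EgressEvent[]; held: EgressEvent[] } {
  const released: EgressEvent[] = [];
  const held: EgressEvent[] = [];
  for (const e of events) {
    if (egressDecision(e) === "release") {
      released.push(e);
    } else {
      held.push(e);
    }
  }
  return { released, held };
}
```

A held event would be re-evaluated once its cohort grows to at least five, or excluded from aggregate outputs if it never does.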

Limitations

Known scope boundaries

Document version: classifiers-preview-v1

Published by: KoraSafe Research

Last reviewed: 2026 Q2

Corresponds to: Native edge shipper, in development (pre-GA)
