AI Safety and Behavior

How AI Employees Behave

Transparency

Customers always know they are interacting with an AI. Every AI employee is clearly identified in the interface, with a profile that lists its capabilities and the tools it has been given access to. AI-generated content is labeled as such and never disguised as human work. Every action an AI employee takes is logged and visible to administrators through the activity feed. There are no hidden AI actions on this platform.

Guardrails and Content Safety

Inputs and outputs flow through a guardrail layer that detects and blocks prompt injection attempts, jailbreak patterns, and unsafe content. Personally identifiable information is detected and redacted from outputs by a dedicated detection engine. Organization administrators can configure guardrail policies to fit their own risk tolerance. Tighter or looser thresholds are available without redeploying the platform.

Human Oversight

Sensitive actions require explicit human approval before execution. The approval gateway pauses the AI workflow, surfaces the requested action with full context, and resumes only when the responsible human accepts it. This applies to high-blast-radius operations such as sending email, making payments, or modifying production systems. Tool execution is sandboxed and capped. Recursion limits prevent runaway loops. Per-action and per-day budgets prevent unintended cost or message storms.

Quality and Evaluation

AI behavior is measured against a continuous evaluation suite that probes for jailbreak resistance, factual accuracy, PII leakage, and tool misuse. The eval harness runs on demand against new prompts and skill changes, so behavioral regressions are caught before they reach customers. When the AI is unsure, it acknowledges the uncertainty rather than fabricating an answer. Source citations are surfaced for knowledge-grounded answers wherever the underlying tool supports them.

What this means for customers

AI employees are always clearly labeled as AI
Configurable guardrails for prompt injection, PII, and unsafe content
Human approval required for sensitive actions
Sandboxed tool execution with budgets and recursion limits
Continuous evaluation against jailbreak, accuracy, and safety probes