CLOSED-LOOP AGENTS. MONITOR, DRIFT-CORRECT, ADAPT IN-TASK.
Agents read their own metrics from the PostgreSQL tables auditors query, score drift against configured thresholds, and swap skills mid-task. One trace_id resolves the full request chain from user prompt to tool execution to cost.
Feedback Table
An agent that does not record what it just did cannot correct itself on the next turn, and a CTO asking "what did it do?" three weeks later has nowhere to read from. systemprompt.io writes every MCP tool call, AI request, and engagement event as a structured row in PostgreSQL before the response returns, so the feedback surface and the audit surface are the same table. Each row carries correlation fields (id, user_id, session_id, task_id, trace_id, context_id, client_id, timestamp). A staff engineer joining a tool execution to an AI request to a session does it with a single index hit, not a free-text log grep.
MCP executions record tool name, server name, input, output, status, execution time in milliseconds, and error message. AI requests record provider, model, input tokens, output tokens, cost in microdollars (integers, so aggregation does not accumulate floating-point error), latency, and status. Engagement rows capture per-page behaviour: scroll depth, rage and dead clicks, copy events, tab switches, visible versus hidden time, plus a reading-pattern label. The spread exists because bot scoring, cost analytics, and session replay all read from the same row, so a CISO auditing a suspicious tool call and a CTO reviewing cost attribution never disagree on what happened.
The data lives in your PostgreSQL instance, in tables you query with SQL, export to a warehouse, or forward to a SIEM. Batch ingestion for high-throughput writers is handled by a dedicated input struct named in the reference below. Your data, your database, your compliance boundary, no SaaS handoff.
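A sketch of why the cost column stays integral end to end — the struct and field names here are illustrative, not the real schema, which carries the full correlation set (id, user_id, session_id, task_id, trace_id, context_id, client_id, timestamp):

```rust
// Illustrative row shape for an AI-request event. Integer microdollars sum
// exactly, so a rollup over millions of rows never accumulates the
// floating-point residue a DOUBLE cost column would.
#[allow(dead_code)]
struct AiRequestRow {
    trace_id: &'static str,
    model: &'static str,
    input_tokens: u64,
    output_tokens: u64,
    cost_microdollars: u64, // $1.00 == 1_000_000 microdollars
}

fn total_cost_microdollars(rows: &[AiRequestRow]) -> u64 {
    rows.iter().map(|r| r.cost_microdollars).sum()
}

fn main() {
    let rows = [
        AiRequestRow { trace_id: "t-1", model: "model-a", input_tokens: 900, output_tokens: 120, cost_microdollars: 1_340 },
        AiRequestRow { trace_id: "t-1", model: "model-b", input_tokens: 2_100, output_tokens: 480, cost_microdollars: 7_025 },
    ];
    let total = total_cost_microdollars(&rows);
    // Convert to dollars only at the display edge, never during aggregation.
    println!("trace t-1 cost: {} microdollars (${:.6})", total, total as f64 / 1_000_000.0);
    assert_eq!(total, 8_365);
}
```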
- One log answers every auditor's question — Every MCP tool call writes tool_name, server_name, input, output, status, execution_time_ms, and error_message, keyed by trace_id and task_id. A CISO asking which tool ran for which user reads one table. A staff engineer debugging a failure follows the same row. Both filter through the logs trace CLI without writing SQL.
- Multi-agent handoffs on one thread — When one agent hands off to another over A2A, both turns emit events on the same context, joined by trace_id. The audit trail shows which agent produced which artefact and at which point, so a CTO asking "was the second agent running with stale instructions?" reads the sequence directly. Without this, handoffs look like two unrelated tasks.
- Engagement events feed the scoring loop — Per-interaction events capture scroll velocity, direction changes, rage and dead clicks, copy events, tab switches, and visible versus hidden time. Those fields are the input to bot scoring in the next section. They exist so a suspicious session can be flagged on behaviour, not just request volume.
- events.rs AnalyticsEventType enum and the EngagementEventData struct holding the per-interaction behavioural fields
- engagement.rs EngagementEvent row storing per-interaction behaviour plus identity columns and a reading_pattern label
- log_entry.rs Structured log entry with correlation fields per event for indexed SIEM joins
- context.rs ContextStateEvent variants that propagate handoffs across agents on a shared context
- tool_queries.rs Tool execution queries with time, name, server, and status filters behind the CLI
- types.rs CreateSessionParams and the batch ingestion input used for high-throughput writers
Self-Reading Agents
An agent that cannot read its own metrics cannot adapt, and a team asking "is this agent getting slower or more expensive?" ends up polling dashboards instead of answering the question at runtime. systemprompt.io exposes analytics repository methods that take a start, an end, and an optional agent-name filter and return total agents, total tasks, completed, failed, and average execution time. Companion methods return AI-request totals with cost in microdollars, and a per-task query returns started-at, status, and execution time so the caller folds its own trends. Agent-name filtering uses case-insensitive pattern matching, so an agent scopes a query to itself, to a peer, or to a family of agents.
Cost analytics live next door. A summary call returns total requests, total cost, and total tokens for a window. Breakdown calls return the same numbers split by model, provider, or agent. A time-series call returns points an agent feeds into its own trend calculation. Costs stay in microdollars the whole way so aggregation does not lose precision, and the per-provider and per-model rollups are typed rows, not free-form JSON, so a staff engineer verifies the schema without reading a single dashboard.
The same methods back the CLI and are exposed as MCP tools the agent calls mid-task. An agent that watches its own error rate cross a threshold switches tools on the next step. An agent whose cost per request climbs switches to a cheaper model without a human in the loop. A staff engineer asking "can I verify this?" reads the repository modules listed below and the CLI command set in the analytics commands directory.
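The in-task loop this enables can be sketched as follows — the enum, struct, function, and threshold values are all invented for illustration; the real repository methods return richer typed rows than this:

```rust
// Hypothetical decision step an agent might run after reading its own
// window stats mid-task.
#[derive(Debug, PartialEq)]
enum Adjustment {
    Keep,
    SwapTool,  // error rate crossed its ceiling
    SwapModel, // cost per request crossed its ceiling
}

struct WindowStats {
    requests: u64,
    failed: u64,
    cost_microdollars: u64,
}

fn next_step(stats: &WindowStats, max_error_rate: f64, max_cost_per_request: u64) -> Adjustment {
    let requests = stats.requests.max(1); // guard an empty window
    let error_rate = stats.failed as f64 / requests as f64;
    if error_rate > max_error_rate {
        return Adjustment::SwapTool;
    }
    if stats.cost_microdollars / requests > max_cost_per_request {
        return Adjustment::SwapModel;
    }
    Adjustment::Keep
}

fn main() {
    // 2 failures in 40 requests under a 10% ceiling: keep going.
    let healthy = WindowStats { requests: 40, failed: 2, cost_microdollars: 80_000 };
    assert_eq!(next_step(&healthy, 0.10, 5_000), Adjustment::Keep);

    // Cost climbed past 5_000 microdollars per request: switch models.
    let pricey = WindowStats { requests: 40, failed: 2, cost_microdollars: 400_000 };
    assert_eq!(next_step(&pricey, 0.10, 5_000), Adjustment::SwapModel);
    println!("ok");
}
```

The point of the sketch is the ordering: the check runs inside the task that read the metric, so the correction lands on the next step.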
- One interface for CLI and agent — The analytics repositories back both the systemprompt analytics agents stats CLI and the MCP tools an agent invokes at runtime. The agent and the operator read from the same schema, so a CTO and a running agent never disagree on what the metrics say.
- Typed cost rollups — Per-provider and per-model statistics return as typed rows carrying total requests, input and output tokens, cost in microdollars, and average latency. A cost spike is detectable without leaving the repository layer, so an agent responds before a finance review notices.
- Measure, then swap in the same task — After reading its own numbers, an agent swaps instructions through the skill injector, reloads prior turns through the context service, or hands off to another agent over A2A. The adjustment runs inside the same task that read the metric, so drift is caught on the next step rather than in next week's postmortem.
- stats_queries.rs The agent analytics entry points returning stats, AI stats, and per-task rows for trend fitting
- costs.rs Cost analytics with summary, model, provider, agent, and time-series queries in microdollars
- models.rs AiRequestStats, ProviderStatsRow, and ModelStatsRow rolling up requests, tokens, cost, and latency
- analytics/ CLI entry points (agents, costs, conversations, tools, traffic) sharing runtime repositories
- detail_queries.rs Per-agent detail queries for a single-agent deep-dive
Inline Behaviour Score
Counting pageviews tells a CISO nothing about who is actually reading, and a scanner walking the site at 3am looks identical to a careful user in a traffic dashboard. systemprompt.io scores every session while it runs: behavioural checks against the session row each add a specific point value to a running score. The score turns a session suspicious at a configurable threshold, picked so one strong signal alone trips the flag and two weaker ones stacked together do the same. Once flagged, the suspicious marker is written back to the session row with the joined signal names recorded, so an auditor reading the row later sees exactly which checks fired.
Two checks flip a session on their own because they are near-certain evidence of scripted behaviour. A high request count in a single session is rare enough to warrant an immediate flag. A ghost session (no landing page, no entry URL, zero requests after a settling window) is typical of a crawler that hit the page and abandoned, and activity of that shape does not need a second confirmation to be suspicious. Two more checks (high page coverage across the site, and a user agent reporting an outdated browser major version) are strong signals, but a real user can plausibly trigger either one, so each needs any other signal alongside it to clear the threshold.
Lower-weight checks catch the patterns that only mean anything in combination: sequential navigation, multiple sessions under one browser fingerprint, absence of JavaScript analytics events after several requests, regular inter-request timing, and high pages-per-minute. Alongside this, a small anomaly detection service carries thresholds for runtime metrics (requests per minute, sessions per fingerprint, error rate), and a trend check fires critical at a sustained multiple of the rolling average because sustained spikes almost always indicate a runaway loop rather than a burst of real traffic. Thresholds update at runtime, so a security team tightens them without a redeploy and the audit row still records which threshold caught the session.
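The weighting scheme can be sketched like this — the point values and threshold are invented for illustration; the real constants live in behavioral_detector/mod.rs and may differ:

```rust
#[derive(Clone, Copy, Debug)]
enum Signal {
    HighRequestCount, // near-certain: trips alone
    GhostSession,     // near-certain: trips alone
    HighPageCoverage, // strong: needs one companion signal
    OutdatedBrowser,  // strong: needs one companion signal
    SequentialNav,
    SharedFingerprint,
    NoJsEvents,
    RegularTiming,
    HighPagesPerMinute,
}

// Illustrative weights: 10 trips alone, 8 needs any companion, 5s pair up.
fn points(signal: Signal) -> u32 {
    use Signal::*;
    match signal {
        HighRequestCount | GhostSession => 10,
        HighPageCoverage | OutdatedBrowser => 8,
        _ => 5,
    }
}

const SUSPICIOUS_THRESHOLD: u32 = 10;

fn is_suspicious(signals: &[Signal]) -> bool {
    signals.iter().map(|&s| points(s)).sum::<u32>() >= SUSPICIOUS_THRESHOLD
}

fn main() {
    // One near-certain signal flips the session on its own.
    assert!(is_suspicious(&[Signal::GhostSession]));
    // A strong signal alone stays under threshold...
    assert!(!is_suspicious(&[Signal::OutdatedBrowser]));
    // ...but pairing it with any other signal clears it.
    assert!(is_suspicious(&[Signal::OutdatedBrowser, Signal::SequentialNav]));
    // Two weaker signals stacked together do the same.
    assert!(is_suspicious(&[Signal::RegularTiming, Signal::HighPagesPerMinute]));
    println!("ok");
}
```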
- Inline scoring, not post-hoc flagging — Each behavioural check adds a specific point value to a running score and a session turns suspicious at threshold. A ghost session trips on its own. An outdated-browser signal paired with sequential navigation clears the mark together. Enforcement happens on the session row before a weekly report would even render.
- Anomaly thresholds you tighten live — Runtime metrics (requests per minute, sessions per fingerprint, error rate) ship with warning and critical thresholds, and the trend check fires critical on sustained multiples of the rolling average because sustained spikes are the shape of a runaway loop. A security engineer tightens a threshold at runtime, so incident response is minutes, not a release cycle.
- One session row, one reason string — The suspicious flag lands on the session row alongside the joined signal names, so an auditor reading the row a week later sees exactly which checks fired. A CISO answering "why was this session blocked?" reads the reason field directly, not a separate incident tracker.
- behavioral_detector/mod.rs The behavioural detector entry point, the scoring point values, and the threshold constant
- behavioral_detector/checks.rs The check functions, each mapping a behavioural signal to a point value
- behavioral_detector/types.rs The signal enum variants and the analysis result struct written back to the session row
- anomaly_detection.rs The runtime anomaly service: default metrics, warning and critical levels, and trend-anomaly trigger
- session/behavioral.rs Repository functions that write the suspicious flag and reason string to the session row
- session/queries.rs Session queries including per-session velocity used for live throttling decisions
- events.rs (reading_pattern) The reading-pattern classification attached to each engagement event
Hot-Swap Skills
A feedback loop only closes if a measurement can change behaviour without shipping a release. When an agent picks up an in-flight task, the context service reconstructs the full message history and the serialised artefacts from PostgreSQL, so the next turn starts from exactly the state the previous turn ended on. Typed lifecycle events (tool execution completed, task status changed, artefact created, skill loaded, context created, updated, and deleted, heartbeat, and a current-agent signal) fire on the shared context as work progresses, so one agent's output becomes another agent's input on the same logical thread rather than through a side-channel message.
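A shape for that event set, paraphrased from the variants listed above — the variant names and payloads are guesses at the real context.rs definitions, which may spell them differently:

```rust
// Paraphrased lifecycle events on a shared context; payloads are illustrative.
#[allow(dead_code)]
#[derive(Debug)]
enum ContextStateEvent {
    ToolExecutionCompleted { tool_name: String },
    TaskStatusChanged { status: String },
    ArtefactCreated { artefact_id: String },
    SkillLoaded { skill_id: String },
    ContextCreated,
    ContextUpdated,
    ContextDeleted,
    Heartbeat,
    CurrentAgent { agent_name: String },
}

// A downstream agent subscribed to the context reacts to the events that
// matter for its next turn and ignores housekeeping signals.
fn describe(event: &ContextStateEvent) -> String {
    match event {
        ContextStateEvent::ArtefactCreated { artefact_id } => {
            format!("pick up artefact {artefact_id} as input")
        }
        ContextStateEvent::CurrentAgent { agent_name } => {
            format!("control handed to {agent_name}")
        }
        ContextStateEvent::Heartbeat => "no action".to_string(),
        other => format!("observed: {other:?}"),
    }
}

fn main() {
    let event = ContextStateEvent::ArtefactCreated { artefact_id: "a-42".into() };
    println!("{}", describe(&event));
    assert_eq!(describe(&event), "pick up artefact a-42 as input");
}
```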
Instructions change at runtime through the skill injector. A call takes the base prompt and an optional skill identifier, loads the skill body, and appends it under a "Writing Guidance" header before returning. A missing skill logs a warning and returns the base prompt unchanged, so a broken instruction never breaks a task. A metadata variant returns the same enhanced prompt with the skill's name, description, and tag data for the caller to record. A staff engineer asking "can I change how this agent behaves without a deploy?" reads the injector module listed below. Skills live on disk as Markdown with YAML frontmatter. An ingestion service scans the directory on startup, parses the frontmatter, strips it, and upserts to PostgreSQL. Editing a file and rerunning ingestion is the whole change flow.
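The fallback behaviour described above can be sketched like this — the function signature, the in-memory skill store, and the exact header formatting are illustrative, not the injector's real API:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the skill store; the real skill bodies are
// Markdown upserted to PostgreSQL by the ingestion service.
fn inject_skill(
    base_prompt: &str,
    skill_id: Option<&str>,
    skills: &HashMap<String, String>,
) -> String {
    match skill_id {
        Some(id) => match skills.get(id) {
            // Found: append the skill body under the documented header.
            Some(body) => format!("{base_prompt}\n\n## Writing Guidance\n\n{body}"),
            // Missing: warn and fall back, so a broken instruction never
            // breaks a running task.
            None => {
                eprintln!("warn: skill '{id}' not found, returning base prompt unchanged");
                base_prompt.to_string()
            }
        },
        None => base_prompt.to_string(),
    }
}

fn main() {
    let mut skills = HashMap::new();
    skills.insert("concise".to_string(), "Prefer short sentences.".to_string());

    let enhanced = inject_skill("You are a writing agent.", Some("concise"), &skills);
    assert!(enhanced.contains("## Writing Guidance"));

    // A missing skill degrades safely to the base prompt.
    let fallback = inject_skill("You are a writing agent.", Some("missing"), &skills);
    assert_eq!(fallback, "You are a writing agent.");
    println!("ok");
}
```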
Every step an agent takes is recorded by the execution tracking service. The module exposes methods covering the lifecycle (understanding, planning, skill usage, tool execution, completion, and the failure and completion terminals), and each method writes a step row that a CTO reviewing "what did the agent decide to do, and in what order?" reads directly. Configuration changes through the CLI take effect on the next task without a restart, so a governance adjustment after an incident ships in minutes, not on the next release train.
- Handoffs replay from shared state — Message history and artefacts reconstruct from PostgreSQL, and the lifecycle-event enum propagates state across agents on the same context. A staff engineer debugging a multi-agent workflow reads one thread, not two. Handoffs cannot drop state and leave agents starting the next turn blind.
- Swap instructions mid-task — The skill injector appends a loaded skill body to the base prompt under a "Writing Guidance" header. A failed load logs a warning and returns the base prompt unchanged, so a missing skill never breaks a running task. A CTO asking "can we change agent behaviour without a deploy?" gets yes, with the fallback logic in source.
- Every step a queryable row — The execution tracking service writes a row per step across understanding, planning, skill usage, tool execution, and completion. An auditor reconstructing an agent's reasoning reads the step sequence directly, ordered by task. The service is named in the reference below.
- context.rs (models) Lifecycle-event enum propagating state across agents on a shared context
- context.rs (service) The context service that reconstructs message history and artefacts from PostgreSQL for in-flight task resumption
- skill_injector.rs Skill injector that appends loaded skill content to the base prompt, safe-fallback on miss
- ingestion.rs Ingestion service: scans skill directories at startup, parses frontmatter, upserts to PostgreSQL
- execution_tracking.rs Execution tracking service: lifecycle methods writing per-step rows across content types
- skill.rs The skill model: id, name, description, instructions, tags, category
Single-Trace Audit
An auditor at 3am needs to answer one question: what did the agent do, and why. A cascade of queries across five log stores is not an answer. systemprompt.io exposes a trace-query service with a single call that runs concurrent queries against one trace identifier and returns log events, AI-request events, MCP execution events, execution-step events, the matching summaries, and the associated task identifier in one response. A CISO asking "can I prove this in an audit?" runs one query against the audit table and the answer is an exportable row, not a stitched-together ticket.
Audit depth comes from reconstruction calls that work from the same trace. One call returns the full AI conversation from the request-message table: role, content, and sequence number for every message in order. Another returns the tool name and input payload for every tool call the agent made, in order. A third joins MCP tool executions to AI-request tool calls so each call carries the MCP server that handled it, its status, and its execution time. The audit lookup row ties provider, model, input tokens, output tokens, cost in microdollars, latency, task_id, and trace_id into a single record a SOC 2 auditor reads directly. Partial prefix matching on identifiers means a short prefix is enough to resolve a request, so a responder pasting a log snippet into the query lands on the right row.
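The unique-prefix rule can be modelled in a few lines — an in-memory sketch only; the real lookup cascades request_id to task_id to trace_id with SQL pattern matching, and the function name here is invented:

```rust
// Resolve a pasted prefix against known identifiers: exactly one match
// resolves, anything ambiguous or absent refuses to guess.
fn resolve_prefix<'a>(ids: &[&'a str], prefix: &str) -> Option<&'a str> {
    let mut matches = ids.iter().filter(|id| id.starts_with(prefix));
    match (matches.next(), matches.next()) {
        (Some(&id), None) => Some(id), // unique prefix: resolved
        _ => None,                     // no match, or more than one
    }
}

fn main() {
    let traces = ["trc-9f31ab", "trc-9f8820", "trc-1c0d44"];
    // A short prefix from a log snippet is enough when it is unique.
    assert_eq!(resolve_prefix(&traces, "trc-9f3"), Some("trc-9f31ab"));
    // An ambiguous prefix returns nothing rather than the wrong row.
    assert_eq!(resolve_prefix(&traces, "trc-9f"), None);
    println!("ok");
}
```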
Search and filtering close the surface. Log search supports pattern matching with time and level filters. Summary views return counts by level and top modules. A trace-list filter takes typed parameters for limit, since, agent, status, tool, and whether the trace has an MCP call. The same tables back the agent's own queries and the human auditor's, so the lineage from user click to agent response to tool execution to cost attribution lives in one place rather than being re-joined in a dashboard.
- One trace, one answer — The trace-query service fires concurrent queries against one trace identifier and returns events, summaries, and the task identifier in one response. A CISO answering 'what did the agent do?' reads one row, not five log streams stitched together.
- SOC 2 row, not a reconstruction — The audit lookup row carries provider, model, input tokens, output tokens, cost in microdollars, latency, task_id, and trace_id. Conversation and tool-call reconstruction calls fill in message content and MCP linkage from the same trace. An auditor reads the row directly, with no custom export script required.
- Prefix lookup resolves a request — Audit lookup cascades from request_id to task_id to trace_id with partial prefix matching, so a short prefix lands on the right row. A responder pasting a log snippet into the query gets the full lineage without a full identifier.
- service.rs The trace-query service with the single entry point running concurrent queries per trace identifier
- audit_queries.rs Cascading audit lookup, AI conversation reconstruction, and the linked MCP call join
- mcp_trace_queries.rs MCP execution queries linked to AI requests and task artefacts
- tool_queries.rs Tool execution listing with time, name, server, and status filters
- models.rs TraceListFilter, the audit lookup result, and the trace-event struct set
- log_entry.rs The structured log entry with correlation fields that make single-query audit possible
Founder-led. Self-service first.
No sales team. No demo theatre. The template is free to evaluate — if it solves your problem, we talk.
Who we are
One founder, one binary, full IP ownership. Every line of Rust, every governance rule, every MCP integration — written in-house. Two years of building AI governance infrastructure from first principles. No venture capital dictating roadmap. No advisory board approving features.
How to engage
Evaluate
Clone the template from GitHub. Run it locally with Docker or compile from source. The full governance pipeline is included.
Talk
Once you have seen the governance pipeline running, book a meeting to discuss your specific requirements — technical implementation, enterprise licensing, or custom integrations.
Deploy
The binary and extension code run on your infrastructure. Perpetual licence, source-available under BSL-1.1, with support and update agreements tailored to your compliance requirements.