
Sentinel

An automated testing & monitoring system that verifies every Sponic surface nightly, auto-extends coverage as new dev work lands, and surfaces activity, agents, and findings in a dedicated intranet section.
2026-05-06 | Build plan / RFC | Revised: Claude Agent SDK | ~4–4.5 weeks | Touches: Supabase · alpu.ca · Cloudflare · Anthropic Agent SDK · GitHub Actions · prompt-runner

How to read this doc

Sections 1–3 frame the goal and architecture. Section 4 lists the seven agents. Section 5 is the data model. Sections 6–8 cover models, the orchestrator, and how coverage auto-grows. Section 9 is the frontend specification; read this if you're touching UI. Sections 11–12 are the implementation map: files to create + a phased build sequence. Section 13 is the end-to-end verification checklist. Markdown source: docs/devtasks/sentinel.md.

Contents

  1. Context
  2. Architecture: four layers
  3. Multi-step transparency
  4. v1 agent set (7 agents)
  5. Data model: Supabase schema
  6. Reasoner & model strategy
  7. Orchestrator on alpu.ca
  8. Coverage extension: crawler + LLM-on-merge
  9. Frontend: Sentinel intranet section
  10. Notifications: daily digest
  11. Files to create / modify
  12. Build sequence: phased & sequenced
  13. Verification: end-to-end checklist
  14. Out of scope for v1
  15. Open questions

1. Context

The Sponic monorepo has grown a wide surface with near-zero automated verification: two deployed apps (apps/garden, apps/control), Supabase Postgres with 50+ migrations and ~15 critical tables, three pg_cron jobs, two Cloudflare Workers, ~15 external services, and a headless claude -p runner on Oracle Phoenix. CI runs only tsc + eslint + build + image-gen lint. There are no unit or e2e tests, no health endpoints, no uptime monitoring, no Sentry/Datadog, no dependency scanning, no pre-commit hooks. The only smoke test is a manual smoke-image-gen.mjs.

The tasks system already has a queue + headless claude -p runner + cost tracking + activity log on Oracle Phoenix. Sentinel reuses concepts (queueing, persisted runs, cost telemetry) but runs on a separate orchestrator on alpu.ca so a Sentinel failure can't break the tasks UI and vice versa.

Goal

Ship a system that:

  1. Runs automated checks across every surface daily.
  2. Auto-extends coverage as new dev work lands (without manual setup per feature).
  3. Presents activity, agents, findings, coverage in a dedicated intranet section called Sentinel, with first-class transparency into every multi-step run (prompts, models, handoff artifacts, costs).
  4. Generates structured recommendations without auto-acting on them. (Test-coverage findings can be approved & queued to the prompt-runner via a manual gate.)

2. Architecture: four layers

  1. Probes: deterministic checks. Cheap, scriptable, structured-output. Live in apps/control/src/lib/sentinel/probes/. Each probe is a TS function (ctx) => ProbeResult. Probes can attach artifacts (HTTP response bodies, screenshots, stack traces, log excerpts) which are stored either inline or in R2. Each probe is also exposed to the Agent SDK as a tool, so probes serve double duty: directly callable in deterministic agents, and tool-callable by LLM-driven agents.
  2. Reasoner: LLM execution layer with two modes:
    • One-shot (reasoner/oneshot.ts): direct Anthropic API call with structured output schema. Used by deterministic agents that just need a brief summary + verdict over already-collected probe results. Cheap, fast, no tool calling.
    • Agent-loop (reasoner/agent-loop.ts): Claude Agent SDK initialized with a custom tool catalog (the probes from layer 1, plus a few orchestrator-supplied tools like record_finding, query_supabase, read_file). Used by agents that need to navigate the codebase or chain probes based on what they find, primarily Test coverage and Security passive.
    • Per-agent config selects which mode + model. The mode is stored on the agent record and locked in at run time.
  3. Orchestrator: systemd service on alpu.ca. Triggers from cron + an HMAC-signed webhook for manual UI runs. Reads the agent registry from Supabase, dispatches each run to the configured execution mode, collects step events emitted by the SDK (or by the one-shot path), persists every step (prompts, context, output, cost, tool calls, model rationale) to monitoring_run_steps. Source: infra/alpu/sentinel-orchestrator/.
  4. UI: new intranet section Sentinel in apps/control. Seven tabs: Overview, Coverage, Agents, Activity, Findings, Runs, Docs. Detailed in §9.

Layers are decoupled: probes are TS functions usable from either execution mode; the orchestrator can be moved (e.g. to Oracle Phoenix later) without touching probes/UI; the SDK version can be upgraded without touching probes or UI.
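To make the layer-1 contract concrete, here is a minimal sketch of a probe written as a `(ctx) => ProbeResult` function and then wrapped in a tool-like shape. The fields of `ProbeResult` and `ProbeContext`, the `statusProbe` factory, and the `asTool` wrapper are all illustrative assumptions; the RFC names `ProbeResult` but does not define it. The fetcher is synchronous and injected purely so the sketch runs without network access.

```typescript
// Hypothetical shapes: the RFC names ProbeResult but does not define its fields.
type ProbeStatus = "pass" | "fail" | "warn" | "skip" | "error";

interface ProbeResult {
  probeName: string;
  targetRef: string;
  status: ProbeStatus;
  output: Record<string, unknown>;
  durationMs: number;
}

interface ProbeContext {
  targetRef: string;
  // Injected (and synchronous, for brevity) so the probe is testable offline.
  fetchStatus: (url: string) => number;
  now: () => number;
}

type Probe = (ctx: ProbeContext) => ProbeResult;

// A deterministic probe: compares the observed HTTP status with the expected one.
function statusProbe(expectedStatus = 200): Probe {
  return (ctx) => {
    const started = ctx.now();
    const observed = ctx.fetchStatus(ctx.targetRef);
    return {
      probeName: "http_status",
      targetRef: ctx.targetRef,
      status: observed === expectedStatus ? "pass" : "fail",
      output: { observed, expected: expectedStatus },
      durationMs: ctx.now() - started,
    };
  };
}

// The same probe exposed in a tool-like shape for an agent loop (assumed shape).
interface ToolLike {
  name: string;
  call: (args: { url: string }) => ProbeResult;
}

function asTool(toolName: string, probe: Probe, base: Omit<ProbeContext, "targetRef">): ToolLike {
  return { name: toolName, call: (args) => probe({ ...base, targetRef: args.url }) };
}

// Usage with a stubbed fetcher: one direct call, one call through the tool wrapper.
const base = { fetchStatus: (url: string) => (url.endsWith("/broken") ? 500 : 200), now: () => 0 };
const direct = statusProbe()({ ...base, targetRef: "https://example.test/" });
const viaTool = asTool("http_probe", statusProbe(), base).call({ url: "https://example.test/broken" });
console.log(direct.status, viaTool.status);
```

Because both call styles go through the same function, a fix to the probe reaches the deterministic agents and the agent-loop agents at once.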

Why the Agent SDK and not Anthropic Managed Agents?

We considered hosted Managed Agents and rejected them for v1 because (a) deterministic probes shouldn't pay token costs to wrap answers we already have, (b) active security probes need a stable origin so we don't trip our own WAF, (c) gitleaks/OSV/diff-readers need direct repo access, (d) the SDK preserves our v1.1 path to migrate select agents to a local Ollama model on alpu.ca. Using the SDK inside our own orchestrator gets the agentic-loop ergonomics without those tradeoffs.

3. Multi-step transparency

Every agent run is decomposed into discrete steps that are individually persisted. A step is anything with a clear input → output and an attributable model/cost. Examples:

For agent-loop runs, the SDK emits per-turn and per-tool-call events natively; the orchestrator subscribes to those and writes them straight into monitoring_run_steps. We do not hand-roll the multi-step machinery; we adopt it.

Each step records: kind, model used (for steps where a model was invoked), model rationale (locked in at run time from the agent config), the prompt template id + rendered prompt text or system prompt, the full input context (jsonb), the output (jsonb), tool name + arguments + result for tool calls, references to artifacts produced, references to artifacts consumed (handoff lineage), duration, tokens, cost, status.

In the UI, clicking into any run shows a stepper with all of this. For agent-loop runs, the stepper renders as alternating agent turns and tool calls with arrows showing which turn invoked which tool. For one-shot runs, it's a flat list of probes followed by a single reasoner step. Both surface identically in the UI; the underlying mode is just a chip on the run header.
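One way the stepper could derive both layouts from the same flat list of persisted steps is sketched below. The `Step` shape is an assumed subset of a monitoring_run_steps row, and `groupSteps` is a hypothetical helper, not something the RFC specifies.

```typescript
// Assumed subset of a monitoring_run_steps row; only the fields needed for grouping.
type StepKind = "probe" | "reasoner" | "aggregate" | "agent_turn" | "tool_call";

interface Step {
  stepIndex: number;
  kind: StepKind;
  stepName: string;
}

interface StepGroup {
  head: Step;        // a probe, a reasoner step, or an agent turn
  toolCalls: Step[]; // tool calls attached to that agent turn (empty otherwise)
}

// Attaches each tool_call to the most recent agent_turn. Every other step
// becomes its own group, so a one-shot run comes out as a flat list.
function groupSteps(steps: Step[]): StepGroup[] {
  const ordered = [...steps].sort((a, b) => a.stepIndex - b.stepIndex);
  const groups: StepGroup[] = [];
  for (const step of ordered) {
    const last = groups[groups.length - 1];
    if (step.kind === "tool_call" && last && last.head.kind === "agent_turn") {
      last.toolCalls.push(step);
    } else {
      groups.push({ head: step, toolCalls: [] });
    }
  }
  return groups;
}

const agentLoopGroups = groupSteps([
  { stepIndex: 0, kind: "agent_turn", stepName: "turn 1" },
  { stepIndex: 1, kind: "tool_call", stepName: "git_diff_since" },
  { stepIndex: 2, kind: "tool_call", stepName: "read_file" },
  { stepIndex: 3, kind: "agent_turn", stepName: "turn 2" },
  { stepIndex: 4, kind: "tool_call", stepName: "record_finding" },
]);

const oneShotGroups = groupSteps([
  { stepIndex: 0, kind: "probe", stepName: "http_status /" },
  { stepIndex: 1, kind: "probe", stepName: "cert_expiry" },
  { stepIndex: 2, kind: "reasoner", stepName: "summarize" },
]);

console.log(agentLoopGroups.length, oneShotGroups.length);
```

The renderer then only needs one component: a group with tool calls draws arrows from the turn to each call, and a group without them renders as a plain row.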

4. v1 agent set (7 agents)

Each agent owns a domain, has its own model config + execution mode, and registers its probes in monitoring_manifest. Mode = how the orchestrator runs it: one-shot (collect probe results, summarize once) or agent-loop (Claude Agent SDK with tool catalog, iterates).

| # | Agent | Mode | Probes / behavior |
|---|---|---|---|
| 1 | Uptime | one-shot | HTTP 200 on every garden + control route (auto-discovered from apps/*/src/app/**/page.tsx); HTTPS cert expiry on both domains; claude-sessions.sponicgarden.workers.dev reachable. Reasoner summarizes failures + emits findings. |
| 2 | Deploy verifier | one-shot | Latest CF Pages deploy state (both projects) via CF API; build-log scan for warnings; commit SHA freshness vs origin/main; version-bump file consistency. |
| 3 | Database integrity | one-shot | Row counts ± delta on critical tables (app_users, tasks, images, image_gen_jobs, event_payments, rental_payments, stripe_payments); R2 ↔ public.images reconciliation; pg_cron last-run age (3 jobs); migrations applied vs apps/control/migrations/ files. |
| 4 | Edge function health | one-shot | Each Supabase edge fn returns expected status; auth-required ones return 401 without keys; runs smoke-image-gen.mjs as a probe. |
| 5 | Security passive | agent-loop | Tool catalog: run_gitleaks, query_osv, read_file, git_diff_since, record_finding. Agent decides what to scan, follows up on suspicious diffs, can correlate (e.g. "this commit added an auth function; read it and check for bypass patterns"). |
| 6 | Security active | one-shot | Deterministic probes: anon-key calls that should fail on RLS-protected tables; public R2 bucket inventory check; unauth probe on every edge fn endpoint. Reasoner classifies any unexpected results. |
| 7 | Test coverage | agent-loop | Tool catalog: list_routes, list_edge_fns, list_critical_files, read_file, find_existing_tests, rank_criticality, draft_test_scaffold, record_finding. Agent inventories surfaces, decides what's worth testing, drafts scaffolds, emits findings with metadata.draft_task_payload. UI shows an Approve & queue button per finding; on click, it inserts a row into the existing tasks table for the Oracle Phoenix prompt-runner. Manual gate stays in v1. |

In addition, a tiny dead-man-switch Cloudflare Worker (apps/control/worker/sentinel-deadman/) runs outside alpu.ca and writes a finding row if no orchestrator heartbeat has arrived in 25h.
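The dead-man check reduces to a single time comparison. The sketch below assumes the Worker has already read the newest orchestrator heartbeat timestamp; `isHeartbeatStale` is a hypothetical helper, and treating "no heartbeat ever recorded" as stale is an assumption rather than something the RFC states.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Returns true when the newest orchestrator heartbeat is older than the
// threshold (25h in the RFC). Assumption: a missing heartbeat counts as stale.
function isHeartbeatStale(lastHeartbeat: Date | null, current: Date, thresholdHours = 25): boolean {
  if (lastHeartbeat === null) return true;
  return current.getTime() - lastHeartbeat.getTime() > thresholdHours * HOUR_MS;
}

const checkTime = new Date("2026-05-07T12:00:00Z");
const fourteenHoursOld = isHeartbeatStale(new Date("2026-05-06T22:00:00Z"), checkTime);
const twentySixHoursOld = isHeartbeatStale(new Date("2026-05-06T10:00:00Z"), checkTime);
console.log(fourteenHoursOld, twentySixHoursOld);
```

A 25h threshold against a 24h sweep cadence leaves roughly an hour of slack before a missed nightly run is flagged.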

Test-framework prerequisite

The project has no test framework today. Adding Vitest (unit/integration) + Playwright (e2e) is a prerequisite, delivered in Phase 0 as a prompt-runner task: Claude opens the PR; admin reviews and merges; the agent then operates against a real framework.

5. Data model: Supabase schema

New migration: apps/control/migrations/20260506_sentinel_schema.sql. All tables admin-only via RLS (role check against app_users.role IN ('oracle','admin')).

monitoring_agents
  id, name, slug, description, surface, owner,
  execution_mode text,            -- 'one-shot' | 'agent-loop'
  model_provider, model_id, model_params jsonb,
  model_rationale text,           -- short "why this mode + model" string
  system_prompt_template_id uuid, -- agent-loop only
  tool_catalog text[],            -- agent-loop only; subset of registered tool names
  max_turns int,                  -- agent-loop only; hard cap on SDK loop iterations
  schedule_cron text, enabled bool,
  daily_cost_ceiling_usd numeric, -- auto-pause if exceeded in 24h window
  timeout_seconds int default 600,
  last_run_id, last_status, last_run_at,
  created_at, updated_at

monitoring_runs
  id, agent_id, trigger ('cron'|'manual'|'deadman'),
  triggered_by uuid (app_users.id, nullable for cron),
  triggered_meta jsonb,
  started_at, completed_at,
  status ('queued'|'running'|'success'|'partial'|'failed'|'cancelled'|'timeout'),
  summary text, cost_usd numeric, total_tokens int,
  models_used text[], step_count int

monitoring_run_steps
  id, run_id, agent_id, step_index int, step_name,
  step_kind ('probe'|'reasoner'|'aggregate'|'agent_turn'|'tool_call'),
  model_provider text, model_id text, model_rationale text,
  prompt_template_id, prompt_text,
  input_context jsonb, output jsonb,
  artifact_ids uuid[],            -- references monitoring_artifacts
  consumed_artifact_ids uuid[],   -- handoff lineage
  started_at, completed_at, duration_ms int,
  prompt_tokens int, completion_tokens int, cost_usd numeric,
  status ('success'|'fail'|'timeout'|'error'|'skipped'),
  error text

monitoring_prompt_templates
  id, name, version int, agent_slug, step_kind,
  description text, content text, variables text[],
  created_at, deprecated_at,
  unique (name, version)

monitoring_probes
  id, run_id, run_step_id, agent_id, probe_name, target_kind, target_ref,
  status ('pass'|'fail'|'warn'|'skip'|'error'),
  output jsonb, duration_ms int, error text,
  baseline_duration_ms int        -- rolling p50 for regression hints

monitoring_artifacts
  id, run_id, run_step_id (nullable), agent_id,
  kind ('http_response'|'screenshot'|'stack_trace'|'log_excerpt'
       |'json_blob'|'diff'|'test_scaffold'),
  storage ('inline'|'r2'),
  inline_data text (nullable), r2_key text (nullable),
  size_bytes int, content_type text,
  created_at

monitoring_findings
  id, dedup_key (unique per agent), agent_id,
  first_seen_run_id, last_seen_run_id, occurrence_count int,
  consecutive_run_count int,
  severity ('critical'|'high'|'medium'|'low'|'info'),
  title text, description text, recommended_action text,
  surface text, target_ref text,
  metadata jsonb,                 -- carries draft_task_payload etc.
  status ('open'|'acknowledged'|'dismissed'|'resolved'),
  ack_by, ack_at, ack_notes,
  resolved_at,
  created_at, updated_at

monitoring_manifest
  id, agent_id, target_kind ('route'|'table'|'edge_fn'|'package'
                            |'file'|'cron_job'|'service'),
  target_ref text, probe_config jsonb,
  source ('crawler'|'llm_merge'|'manual'|'bootstrap'),
  approval_status ('active'|'pending_approval'|'rejected'),
  added_at, added_by, last_seen_at, enabled bool

monitoring_heartbeats
  id, source ('orchestrator'|'deadman'), recorded_at, meta jsonb

monitoring_audit_log
  id, actor_id (app_users.id, nullable for system),
  action text,                    -- 'agent_model_changed'|'agent_paused'|...
  target_kind, target_id,
  before jsonb, after jsonb,
  notes text,
  created_at

Dedup keyed on (agent_id, dedup_key). Each probe computes a stable dedup key (e.g. uptime:route:/en/relations/fundraising:status_5xx). On second occurrence the existing finding's last_seen_run_id and occurrence_count increment instead of creating a new row.
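The dedup rule can be sketched with an in-memory map standing in for monitoring_findings. `buildDedupKey` and `recordFinding` are hypothetical helpers, and the four-part key layout is inferred from the single example above; a real implementation would more likely be an upsert against a unique constraint on (agent_id, dedup_key).

```typescript
// Assumed in-memory stand-in for a few monitoring_findings columns.
interface FindingRow {
  agentId: string;
  dedupKey: string;
  firstSeenRunId: string;
  lastSeenRunId: string;
  occurrenceCount: number;
}

// Stable key: the same agent, target, and symptom always yield the same string.
function buildDedupKey(parts: { agentSlug: string; targetKind: string; targetRef: string; symptom: string }): string {
  return [parts.agentSlug, parts.targetKind, parts.targetRef, parts.symptom].join(":");
}

// First occurrence inserts a row; later occurrences update the existing one.
function recordFinding(store: Map<string, FindingRow>, agentId: string, dedupKey: string, runId: string): FindingRow {
  const storeKey = `${agentId}|${dedupKey}`;
  const existing = store.get(storeKey);
  if (existing) {
    existing.lastSeenRunId = runId;
    existing.occurrenceCount += 1;
    return existing;
  }
  const row: FindingRow = { agentId, dedupKey, firstSeenRunId: runId, lastSeenRunId: runId, occurrenceCount: 1 };
  store.set(storeKey, row);
  return row;
}

const findingStore = new Map<string, FindingRow>();
const dedupKey = buildDedupKey({
  agentSlug: "uptime",
  targetKind: "route",
  targetRef: "/en/relations/fundraising",
  symptom: "status_5xx",
});
recordFinding(findingStore, "agent-uptime", dedupKey, "run-1");
const secondSighting = recordFinding(findingStore, "agent-uptime", dedupKey, "run-2");
console.log(dedupKey, findingStore.size, secondSighting.occurrenceCount);
```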

Regression-aware severity. When a finding first opens, severity comes from the reasoner. When the same dedup key reopens after N successful runs (default 30), severity is bumped one level. When a probe's duration exceeds 3× its baseline_duration_ms, a performance_regression finding is auto-generated.
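Both rules are small enough to express as pure functions. This is a sketch under two stated assumptions: "after N successful runs" is read as "at least N passes since the key last failed", and a bump at critical stays at critical. The function names are hypothetical.

```typescript
type Severity = "info" | "low" | "medium" | "high" | "critical";
const SEVERITY_ORDER: Severity[] = ["info", "low", "medium", "high", "critical"];

// Raises severity by one level, capped at critical (assumption).
function bumpSeverity(severity: Severity): Severity {
  const index = SEVERITY_ORDER.indexOf(severity);
  return SEVERITY_ORDER[Math.min(index + 1, SEVERITY_ORDER.length - 1)];
}

// Severity for a finding that is opening now. passesSinceLastSeen is the number
// of successful runs since the same dedup key last failed (null = never seen).
function severityOnOpen(reasonerSeverity: Severity, passesSinceLastSeen: number | null, threshold = 30): Severity {
  const isRegression = passesSinceLastSeen !== null && passesSinceLastSeen >= threshold;
  return isRegression ? bumpSeverity(reasonerSeverity) : reasonerSeverity;
}

// True when a probe took more than `factor` times its rolling baseline.
function isPerformanceRegression(durationMs: number, baselineDurationMs: number | null, factor = 3): boolean {
  return baselineDurationMs !== null && baselineDurationMs > 0 && durationMs > factor * baselineDurationMs;
}

console.log(severityOnOpen("medium", null), severityOnOpen("medium", 30), isPerformanceRegression(950, 300));
```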

Retention. Rows in monitoring_run_steps, monitoring_probes, and monitoring_artifacts older than 90d are deleted by a pg_cron job; the matching R2 artifacts are pruned in the same pass. monitoring_runs, monitoring_findings, and monitoring_audit_log are retained indefinitely.

6. Reasoner & model strategy

The reasoner is the LLM execution layer. Two paths:

6.1 One-shot path (reasoner/oneshot.ts)

Direct Anthropic API call with a structured-output schema. The orchestrator collects probe results, hands them to the reasoner alongside diff context and manifest summary, and gets back a summary + findings array. No tool calling; one prompt, one response.

export interface OneShotInput {
  agent: { name: string; surface: string; description: string };
  runContext: { startedAt: Date; trigger: string; previousRunSummary?: string };
  probes: ProbeResult[];
  diffSinceLastRun?: { commits: string[]; files: string[]; patches: string };
  manifestSummary: { covered: string[]; new: string[] };
}

export interface ReasonerOutput {
  summary: string;
  findings: Array<{
    dedupKey: string;
    severity: 'critical'|'high'|'medium'|'low'|'info';
    title: string;
    description: string;
    recommendedAction: string;
    surface: string;
    targetRef?: string;
  }>;
  cost: { promptTokens: number; completionTokens: number; usd: number };
  modelUsed: string;
}

export interface OneShotReasoner { ask(input: OneShotInput): Promise<ReasonerOutput>; }

Used by: Uptime, Deploy verifier, DB integrity, Edge function health, Security active.
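Phase 1 calls for a mock adapter alongside the Anthropic one. The sketch below shows what such a mock might look like: it satisfies a trimmed copy of the interfaces above and derives findings deterministically instead of calling a model. The `ProbeResult` fields, the "high" default severity, and the dedup-key layout are assumptions made for illustration.

```typescript
// Trimmed copies of the interfaces above; ProbeResult's fields are an assumption.
type Severity = "critical" | "high" | "medium" | "low" | "info";
interface ProbeResult { probeName: string; targetRef: string; status: "pass" | "fail" | "warn" | "skip" | "error"; }
interface OneShotInput {
  agent: { name: string; surface: string; description: string };
  probes: ProbeResult[];
}
interface ReasonerOutput {
  summary: string;
  findings: Array<{
    dedupKey: string; severity: Severity; title: string; description: string;
    recommendedAction: string; surface: string; targetRef?: string;
  }>;
  cost: { promptTokens: number; completionTokens: number; usd: number };
  modelUsed: string;
}
interface OneShotReasoner { ask(input: OneShotInput): Promise<ReasonerOutput>; }

// Deterministic stand-in for the LLM: one finding per failed or errored probe.
function summarizeWithoutModel(input: OneShotInput): ReasonerOutput {
  const failed = input.probes.filter((p) => p.status === "fail" || p.status === "error");
  return {
    summary: `${input.agent.name}: ${failed.length} of ${input.probes.length} probes failed`,
    findings: failed.map((p) => ({
      dedupKey: `${p.probeName}:${p.targetRef}:${p.status}`,
      severity: "high" as const,
      title: `${p.probeName} failed for ${p.targetRef}`,
      description: `Probe status was ${p.status}.`,
      recommendedAction: "Inspect the artifacts attached to this probe.",
      surface: input.agent.surface,
      targetRef: p.targetRef,
    })),
    cost: { promptTokens: 0, completionTokens: 0, usd: 0 },
    modelUsed: "mock",
  };
}

// The mock adapter: same interface as the real reasoner, zero cost, no network.
const mockReasoner: OneShotReasoner = { ask: async (input) => summarizeWithoutModel(input) };

const sampleOutput = summarizeWithoutModel({
  agent: { name: "Uptime", surface: "garden", description: "HTTP checks" },
  probes: [
    { probeName: "http_status", targetRef: "/", status: "pass" },
    { probeName: "http_status", targetRef: "/en/relations/fundraising", status: "fail" },
  ],
});
console.log(sampleOutput.summary);
```

A mock of this kind lets the orchestrator, step recorder, and UI be exercised end to end before any API key is provisioned.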

6.2 Agent-loop path (reasoner/agent-loop.ts)

Built on the Claude Agent SDK (@anthropic-ai/claude-agent-sdk, same library as Claude Code). The orchestrator instantiates an SDK session per agent run with:

Used by: Security passive, Test coverage.

The SDK handles the loop (model → tool call → tool result → model → …) until the model emits no more tool calls or max_turns is reached. Final emitted findings come via the record_finding tool; the orchestrator collects these and persists them after dedup.

export interface AgentLoopRunner {
  run(input: {
    agent: AgentConfig;
    systemPrompt: string;
    tools: ToolCatalog;
    initialContext: { runId: string; diff?: GitDiff; manifest: ManifestSummary };
    maxTurns: number;
    onStep: (step: PersistableStep) => Promise<void>;  // streamed to monitoring_run_steps
  }): Promise<{ findings: Finding[]; cost: Cost; modelUsed: string; turns: number }>;
}
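The termination rule described above (stop when a turn has no tool calls, or when max_turns is hit) can be illustrated with a toy loop. This is emphatically not the Claude Agent SDK's API and not something Sentinel would ship, since the plan is to adopt the SDK's loop; every type and function here is invented for illustration, with a scripted function standing in for the model.

```typescript
// Toy types for illustration only; none of these come from the SDK.
interface ToolCall { tool: string; args: Record<string, unknown>; }
interface ModelTurn { text: string; toolCalls: ToolCall[]; }
type FakeModel = (transcript: string[]) => ModelTurn;
type ToolFn = (args: Record<string, unknown>) => string;

function runLoop(model: FakeModel, tools: Record<string, ToolFn>, maxTurns: number) {
  const transcript: string[] = [];
  let turns = 0;
  let stoppedBy: "no_tool_calls" | "max_turns" = "max_turns";
  while (turns < maxTurns) {
    const turn = model(transcript);
    turns += 1;
    transcript.push(`model: ${turn.text}`);
    if (turn.toolCalls.length === 0) {
      stoppedBy = "no_tool_calls"; // the model is done
      break;
    }
    for (const call of turn.toolCalls) {
      const fn = tools[call.tool];
      const result = fn ? fn(call.args) : `unknown tool ${call.tool}`;
      transcript.push(`tool ${call.tool}: ${result}`); // fed back on the next turn
    }
  }
  return { turns, stoppedBy, transcript };
}

// Scripted model: asks for a diff once, then finishes.
const scriptedModel: FakeModel = (transcript) =>
  transcript.length === 0
    ? { text: "Let me look at the diff.", toolCalls: [{ tool: "git_diff_since", args: { commit_sha: "abc123" } }] }
    : { text: "Nothing suspicious.", toolCalls: [] };

const toyTools: Record<string, ToolFn> = { git_diff_since: () => "1 file changed" };
const finishedRun = runLoop(scriptedModel, toyTools, 5);
// A model that never stops calling tools is cut off by the turn cap.
const cappedRun = runLoop(() => ({ text: "again", toolCalls: [{ tool: "git_diff_since", args: {} }] }), toyTools, 3);
console.log(finishedRun.turns, finishedRun.stoppedBy, cappedRun.turns, cappedRun.stoppedBy);
```

The second run shows why max_turns is a hard cap on the agent record: without it, a looping model would keep accruing cost until the timeout or the daily ceiling tripped.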

6.3 Tool catalog

Lives in apps/control/src/lib/sentinel/tools/. Each tool is a typed function with a Zod input schema, exposed to the SDK via the standard tool-definition shape. Tools delegate to the same probe library used by one-shot agents, so the capability is shared and only the calling style differs.

| Tool | Purpose | Used by |
|---|---|---|
| http_probe(url, expectedStatus?) | Fetch a URL; record artifact on failure | Uptime (one-shot calls directly) + agent-loop |
| query_supabase(sql, params) | Execute a read-only SQL query under service role | DB integrity, Test coverage, Security passive |
| read_file(path) | Read a file from the working tree | Test coverage, Security passive |
| git_diff_since(commit_sha) | Diff between commit and HEAD | Security passive |
| run_gitleaks(scope) | Run gitleaks against working tree; return JSON report | Security passive |
| query_osv(packageJsonPath) | Hit OSV.dev for advisories on a package.json | Security passive |
| list_routes() / list_edge_fns() / list_critical_files() | Inventory helpers built on the manifest | Test coverage |
| find_existing_tests(targetPath) | Look up Vitest/Playwright tests covering a target | Test coverage |
| draft_test_scaffold(target, framework) | Produce a starter test file body | Test coverage |
| cf_pages_status(project) | Latest deploy state via CF API | Deploy verifier (one-shot calls directly) + agent-loop |
| r2_list(bucket, prefix?) | List R2 objects | Security active, DB integrity |
| record_finding(payload) | Emit a finding (severity, title, description, recommendedAction, surface, dedupKey, metadata?) | All agent-loop agents |

Tools execute in the orchestrator process (not on Anthropic infra), so they have direct local network access to Supabase, R2, the working tree, and CF API.
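A rough sketch of how a registry might pair each tool with an input validator and then hand an agent only the tools named in its tool_catalog. To stay dependency-free, a hand-written `parse` function stands in for the Zod schema the RFC calls for; `defineTool`, `selectTools`, the stubbed file read, and the simple `..` path check are all hypothetical.

```typescript
interface Tool {
  name: string;
  description: string;
  run: (raw: unknown) => unknown;
}

// Pairs a validator with an implementation so malformed model-supplied
// arguments are rejected before the tool body runs (the role Zod would play).
function defineTool<I, O>(def: {
  name: string;
  description: string;
  parse: (raw: unknown) => I;
  execute: (input: I) => O;
}): Tool {
  return { name: def.name, description: def.description, run: (raw) => def.execute(def.parse(raw)) };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

const readFileTool = defineTool<{ path: string }, string>({
  name: "read_file",
  description: "Read a file from the working tree",
  parse: (raw) => {
    if (!isRecord(raw)) throw new Error("read_file: expected an object");
    const path = raw["path"];
    if (typeof path !== "string" || path.split("/").includes("..")) {
      throw new Error("read_file: path must be a string inside the working tree");
    }
    return { path };
  },
  // Stubbed for the sketch; a real tool would read from disk.
  execute: (input) => `contents of ${input.path}`,
});

const recorded: string[] = [];
const recordFindingTool = defineTool<{ title: string }, string>({
  name: "record_finding",
  description: "Emit a finding",
  parse: (raw) => {
    if (!isRecord(raw) || typeof raw["title"] !== "string") throw new Error("record_finding: title required");
    return { title: raw["title"] };
  },
  execute: (input) => {
    recorded.push(input.title);
    return "ok";
  },
});

// An agent only receives the tools listed in its tool_catalog column.
function selectTools(registry: Tool[], catalog: string[]): Tool[] {
  return catalog.map((toolName) => {
    const tool = registry.find((t) => t.name === toolName);
    if (!tool) throw new Error(`unknown tool in catalog: ${toolName}`);
    return tool;
  });
}

const selected = selectTools([readFileTool, recordFindingTool], ["read_file"]);
console.log(selected.map((t) => t.name), selected[0].run({ path: "README.md" }));
```

Failing loudly on an unknown catalog entry is a design choice in this sketch: a typo in an agent's tool_catalog surfaces at run start rather than as a silently missing capability.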

6.4 Default model + mode assignments for v1

| Agent | Mode | Model | Rationale stored on agent |
|---|---|---|---|
| Uptime | one-shot | Haiku 4.5 | "Low-complexity summarization of HTTP probe results; Haiku is sufficient." |
| Deploy verifier | one-shot | Haiku 4.5 | "Structured CF API output and build-log scan; Haiku handles deterministic formatting well." |
| DB integrity | one-shot | Haiku 4.5 | "Row-count delta interpretation; arithmetic + threshold judgment fits Haiku." |
| Edge function health | one-shot | Haiku 4.5 | "Status-code summarization; smoke-test pass/fail aggregation." |
| Security passive | agent-loop | Sonnet 4.6 | "Diff-driven security investigation needs multi-turn reasoning + targeted tool calls; agent-loop on Sonnet is the right shape." |
| Security active | one-shot | Sonnet 4.6 | "Probes are deterministic; reasoning is interpreting unexpected results. One-shot avoids agent-loop overhead." |
| Test coverage | agent-loop | Sonnet 4.6 | "Inventorying surfaces and ranking criticality is a navigation problem; agent-loop with file-reading tools fits." |
| LLM-on-merge (coverage extension) | one-shot | Haiku 4.5 | "Diff → manifest entry is structured-output; Haiku is fast + cheap." |

Both mode and model are mutable per agent from the UI. Mode changes are visible in the audit log alongside model changes. The rationale field surfaces in the activity log and run drill-down so anyone reviewing a run knows why this configuration was used at the time.
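A model change and its audit row could be produced together, so that one can never be written without the other. The sketch below assumes a rationale is mandatory on every change (the verification checklist asks the admin to provide one, but the RFC does not say it is enforced); `changeModel`, the field names, and the placeholder model ids are hypothetical.

```typescript
interface AgentConfig {
  executionMode: "one-shot" | "agent-loop";
  modelId: string;
  modelRationale: string;
}

// Mirrors a few monitoring_audit_log columns.
interface AuditRow {
  actorId: string;
  action: string;
  targetKind: string;
  targetId: string;
  before: AgentConfig;
  after: AgentConfig;
  notes: string;
  createdAt: string;
}

// Returns the updated config plus the audit row describing the change.
function changeModel(
  agentId: string,
  current: AgentConfig,
  nextModelId: string,
  rationale: string,
  actorId: string,
  at: Date,
): { agent: AgentConfig; audit: AuditRow } {
  // Assumption: changes without a rationale are rejected outright.
  if (rationale.trim() === "") throw new Error("a rationale is required when changing the model");
  const agent: AgentConfig = { ...current, modelId: nextModelId, modelRationale: rationale };
  const audit: AuditRow = {
    actorId,
    action: "agent_model_changed",
    targetKind: "agent",
    targetId: agentId,
    before: current,
    after: agent,
    notes: rationale,
    createdAt: at.toISOString(),
  };
  return { agent, audit };
}

// Model ids below are placeholders, not real API identifiers.
const uptimeBefore: AgentConfig = {
  executionMode: "one-shot",
  modelId: "haiku-placeholder",
  modelRationale: "Low-complexity summarization of HTTP probe results.",
};
const change = changeModel(
  "agent-uptime",
  uptimeBefore,
  "sonnet-placeholder",
  "Testing whether Sonnet reduces false positives.",
  "user-1",
  new Date("2026-05-06T12:00:00Z"),
);
console.log(change.audit.before.modelId, change.audit.after.modelId);
```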

Migration intent: once we have ~4 weeks of steady-state runs + cost data, evaluate which agents could downshift to a local Ollama model on alpu.ca without quality loss. The Agent SDK supports custom model providers, so the agent-loop path will accept Ollama too.

7. Orchestrator on alpu.ca

Lives at infra/alpu/sentinel-orchestrator/ (committed; deployed via rsync + systemd). Stack: Node 20 + TypeScript, runs as sentinel.service.

Schedule

Single sweep nightly inside the 10pm–8am CT window. Cron at 0 22 * * * America/Chicago kicks off the orchestrator; agents run sequentially in default order (cheapest/fastest first, most-expensive last):

  • 22:00: Crawler refreshes monitoring_manifest
  • 22:15: Deploy verifier
  • 22:30: Uptime
  • 23:00: Edge function health
  • 23:30: Database integrity
  • 00:00: Security passive (LLM-on-diff)
  • 01:00: Security active
  • 02:00: Test coverage

Whole sweep typically finishes by ~03:30 CT. Each agent records start/end so the activity log shows actual cadence; sequencing is config-driven.

Entry points

Run flow per agent

  1. Insert monitoring_runs row (status=running)
  2. Gather manifest entries; execute probes; collect outputs + artifacts
  3. Compute diff since last successful run for this agent (commits via git log, file changes)
  4. Call reasoner via configured adapter; persist findings (with dedup logic)
  5. Update monitoring_runs (status, cost, summary)
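The five steps above can be sketched as one function with its dependencies injected. Everything here is hypothetical: the `RunDeps` surface, the simplified row shapes, and the choice to mark a run as failed on any thrown error (the RFC does not say when a run is partial versus failed). In-memory fakes replace Supabase, git, and the reasoner so the sketch runs as-is.

```typescript
type RunStatus = "running" | "success" | "failed";
interface RunRow { id: string; agentId: string; status: RunStatus; summary: string; costUsd: number; }
interface ProbeOutcome { probeName: string; passed: boolean; }
interface Verdict { summary: string; findings: string[]; usd: number; }

// Hypothetical dependency surface; real versions would hit Supabase, git, etc.
interface RunDeps {
  insertRun(agentId: string): Promise<RunRow>;
  runProbes(agentId: string): Promise<ProbeOutcome[]>;
  commitsSinceLastSuccess(agentId: string): Promise<string[]>;
  reason(probes: ProbeOutcome[], commits: string[]): Promise<Verdict>;
  persistFindings(runId: string, findings: string[]): Promise<void>;
  updateRun(run: RunRow): Promise<void>;
}

async function runAgent(agentId: string, deps: RunDeps): Promise<RunRow> {
  const run = await deps.insertRun(agentId); // 1. row starts as status=running
  try {
    const probes = await deps.runProbes(agentId); // 2. probes + artifacts
    const commits = await deps.commitsSinceLastSuccess(agentId); // 3. diff context
    const verdict = await deps.reason(probes, commits); // 4. reasoner, then findings
    await deps.persistFindings(run.id, verdict.findings);
    run.status = "success";
    run.summary = verdict.summary;
    run.costUsd = verdict.usd;
  } catch (err) {
    // Assumption: any thrown error marks the whole run as failed.
    run.status = "failed";
    run.summary = err instanceof Error ? err.message : String(err);
  }
  await deps.updateRun(run); // 5. final status, cost, summary
  return run;
}

// In-memory fakes so the sketch runs without Supabase, git, or an API key.
const savedRuns: RunRow[] = [];
const fakeDeps: RunDeps = {
  insertRun: async (agentId) => ({ id: "run-1", agentId, status: "running", summary: "", costUsd: 0 }),
  runProbes: async () => [{ probeName: "http_status", passed: false }],
  commitsSinceLastSuccess: async () => ["abc123"],
  reason: async (probes) => ({
    summary: `${probes.filter((p) => !p.passed).length} probe(s) failed`,
    findings: probes.filter((p) => !p.passed).map((p) => `${p.probeName} failed`),
    usd: 0.0008,
  }),
  persistFindings: async () => {},
  updateRun: async (run) => { savedRuns.push({ ...run }); },
};

const pendingRun = runAgent("uptime", fakeDeps);
pendingRun.then((run) => console.log(run.status, run.summary, savedRuns.length));
```

Note that step 5 sits outside the try block, so the monitoring_runs row is always closed out, even when a probe or the reasoner throws.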

Auth & secrets. Bitwarden CLI (bw) unlocks at service start with a session token; service-role Supabase key + Anthropic key + R2 keys stored in ~/.config/sentinel/.env. Bootstrap recipe added to infra/runbook.md.

8. Coverage extension: crawler + LLM-on-merge

Crawler (runs at the start of each nightly sweep)

LLM-on-merge (GitHub Action)

Crawler = always-on default coverage; LLM = high-quality additions when code lands. Coverage grows automatically while the human stays in the loop on what gets monitored.
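As an illustration of the crawler's route discovery, the sketch below maps a page file matching apps/*/src/app/**/page.tsx (the glob named in the agent table) to the route it serves. `routeFromPagePath` is a hypothetical helper. It drops Next.js route groups such as `(marketing)`, leaves dynamic segments like `[locale]` untouched, and does not handle parallel routes or private folders, so the real crawler would need more than this.

```typescript
// Maps a Next.js App Router page file to the route it serves.
// Returns null for files that are not page.tsx under apps/<app>/src/app/.
function routeFromPagePath(filePath: string): { app: string; route: string } | null {
  const match = /^apps\/([^/]+)\/src\/app\/(?:(.*)\/)?page\.tsx$/.exec(filePath);
  if (!match) return null;
  const segments = (match[2] ?? "")
    .split("/")
    // Route groups like (marketing) organize files but do not appear in the URL.
    .filter((segment) => segment !== "" && !/^\(.*\)$/.test(segment));
  return { app: match[1], route: "/" + segments.join("/") };
}

const discovered = [
  "apps/garden/src/app/page.tsx",
  "apps/control/src/app/sentinel/runs/page.tsx",
  "apps/garden/src/app/(marketing)/about/page.tsx",
  "apps/garden/src/lib/util.ts",
].map(routeFromPagePath);
console.log(discovered);
```

Dynamic segments would still have to be expanded into concrete URLs (for example one per supported locale) before the Uptime agent can probe them; how that expansion works is left open here.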

9. Frontend: Sentinel intranet section

This is where someone unfamiliar with the system has to be able to land and orient. The frontend carries the entire weight of the "dashboard only" stance.

9.1 Information architecture

Sentinel (top-level intranet section)
├── Overview            ← landing
├── Coverage            ← surface × agent matrix
├── Agents              ← list of 7 agents (cards)
│   └─ Agent drawer     ← About / Probes / Prompts / Runs / Findings / Settings
├── Activity            ← chronological event feed
├── Findings            ← filterable list
│   └─ Finding drawer   ← description / recommended action / artifacts / actions
├── Runs                ← list of runs
│   └─ Run drawer       ← stepper with prompts, models, context, outputs, handoffs
└── Docs                ← system README + per-agent docs + glossary

Detail views are right-side drawers (mirroring the existing task-detail-drawer.tsx pattern). This keeps the static-export build workable: tab pages are pre-rendered, drawer content is fetched client-side from Supabase. Drawer state is stored in URL hash so detail views are linkable: /en/sentinel/runs#run=abc123.
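Hash-based drawer state can be read and written with the standard URLSearchParams API. In this sketch only the `run` key comes from the example URL above; the `finding` and `agent` keys, and the helper names, are assumptions.

```typescript
// Drawer state lives in the URL hash, e.g. /en/sentinel/runs#run=abc123.
const DRAWER_KINDS = ["run", "finding", "agent"] as const;
type DrawerKind = (typeof DRAWER_KINDS)[number];
type DrawerState = { kind: DrawerKind; id: string } | null;

// Accepts the raw location.hash (with or without the leading "#").
function parseDrawerHash(hash: string): DrawerState {
  const params = new URLSearchParams(hash.replace(/^#/, ""));
  for (const kind of DRAWER_KINDS) {
    const id = params.get(kind);
    if (id) return { kind, id };
  }
  return null; // no drawer open
}

// Inverse of parseDrawerHash; an empty string closes the drawer.
function toDrawerHash(state: DrawerState): string {
  if (state === null) return "";
  return "#" + new URLSearchParams({ [state.kind]: state.id }).toString();
}

const parsedDrawer = parseDrawerHash("#run=abc123");
console.log(parsedDrawer, toDrawerHash(parsedDrawer));
```

Because the hash never reaches the server, the pre-rendered tab page is served unchanged and the drawer opens purely client-side, which is what keeps this compatible with static export.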

9.2 Tab-by-tab spec

Overview

Coverage

The "at-a-glance" surface map.

Agents

Activity

Findings

Finding drawer:

Runs

Run drawer:

Docs

System-level documentation hub. Designed for a new collaborator to onboard cold. Three subsections:

All rendered with react-markdown; no MDX runtime needed.

9.3 Component catalog

Lives in apps/control/src/components/sentinel/. New shared components:

| Component | Purpose |
|---|---|
| <SeverityBadge severity> | Colored pill with icon: critical / high / medium / low / info |
| <StatusPill status variant> | Green/red/amber/gray dot with label. Variants: run, probe, agent, manifest |
| <RunSummaryCard run> | Card for runs list / agent's recent runs |
| <RunStepper run steps artifacts> | Vertical step timeline with handoff arrows |
| <PromptViewer template renderedText variables> | Collapsible prompt with version-history link |
| <ArtifactViewer artifact> | Auto-routes by kind: image (R2 fetch), HTTP body (syntax highlight), JSON tree, screenshot, stack trace, diff, scaffold |
| <JsonTree data collapsedDepth> | Expandable jsonb viewer for input/output |
| <CoverageMatrix surfaces agents results> | CSS-grid surface×agent matrix with cell popovers |
| <AgentCard agent> | Card with sparkline, model chip, "Run now" |
| <FindingsTable findings filterable bulkActionable> | Findings list with bulk actions |
| <ActivityEntry event> | One feed row, expandable |
| <MarkdownDoc source> | react-markdown wrapper with project styling |
| <SeverityHeatmap occurrences> | Tiny inline heatmap for finding history |
| <CostBadge cost> | USD pill ($0.0008 / $1.20) |
| <ModelChip provider model rationale> | Provider icon + model name; hover tooltip = rationale |
| <Sparkline data status> | 30-point pass/fail SVG (reuses devcontrol/context-tab.tsx technique) |
| <DrawerShell title onClose> | Right-side drawer wrapper; hash-state aware |

9.4 Data layer

9.5 Loading / empty / error

9.6 Visual design

9.7 Accessibility

9.8 Routing

Static-export-compatible:

9.9 Auth gating

10. Notifications: daily morning digest

A single Cloudflare Worker (apps/control/worker/sentinel-digest/) on a 7am CT cron. Queries last 24h of runs/findings/audit log via Supabase service role. Renders an HTML email via Resend.

Recipients: app_users with role IN ('oracle','admin').

Content:

11. Files to create / modify

Create

Modify

12. Build sequence: phased & sequenced

Five phases. Each item is a discrete merge-able PR. Items within a phase are mostly parallelizable; items across phases have dependencies (called out where they cross).

Phase 0: Prerequisites (~2 days)

Goal: unblock everything else. All items must complete before Phase 3's test-coverage agent can produce useful findings.

| # | Item | Notes |
|---|---|---|
| 0.1 | Create alpu.ca user account + ssh key + Bitwarden secret recipe | Documented in infra/runbook.md |
| 0.2 | Reserve Cloudflare Tunnel hostname for alpu.ca → control webhook | Used by sentinel-trigger Worker in Phase 1 |
| 0.3 | Generate HMAC shared secret; add to ~/.config/sentinel/.env on alpu.ca + as a Cloudflare Worker secret | |
| 0.4 | Seed task in tasks table: "Bootstrap Vitest + Playwright + sample tests in apps/garden + apps/control + CI hooks" | Manually inserted via SQL. Headless prompt-runner picks it up; opens PR. Admin reviews + merges. Dogfood moment. |
| 0.5 | Confirm Bitwarden CLI install + unlock-on-boot recipe on alpu.ca | |

Exit criteria: alpu.ca reachable; tunnel up; HMAC secret in place; Vitest + Playwright PR merged with at least one passing test per app; CI runs them.

Phase 1: Foundation (~1 week, depends on Phase 0)

Goal: a single agent runs end-to-end with full step transparency.

| # | Item | Depends on |
|---|---|---|
| 1.1 | Migration: monitoring_* schema (all tables in §5, including execution_mode + tool/turn fields on monitoring_agents) | 0.1 |
| 1.2 | Migration: RLS policies (admin-only) | 1.1 |
| 1.3 | Migration: seed monitoring_prompt_templates (placeholders; refined in Phase 2) | 1.1 |
| 1.4 | Add <AdminGuard> component | none |
| 1.5 | One-shot reasoner (reasoner/oneshot.ts) with Anthropic adapter + mock adapter | none |
| 1.6 | Artifact storage helper (inline vs R2 routing) | 1.1 |
| 1.7 | Step recorder (writes monitoring_run_steps + artifacts during execution; same module used by both modes) | 1.5, 1.6 |
| 1.8 | Probes: Uptime + Deploy verifier (deterministic, both produce artifacts on failure) | 1.6 |
| 1.9 | Orchestrator skeleton on alpu.ca: cron + heartbeat + HMAC webhook + per-step persistence + per-agent timeout + cost ceiling. Dispatches by execution_mode (only one-shot wired in this phase). | 0.2, 0.3, 1.7, 1.8 |
| 1.10 | sentinel-trigger Worker (HMAC-signs UI clicks → alpu.ca tunnel) | 0.2, 0.3 |
| 1.11 | Add sentinel section to intranet.ts + route shell + admin gating | 1.4 |
| 1.12 | Sentinel UI: Overview tab (skeleton) + Runs tab + Run drawer with stepper (renders probe + reasoner step kinds; agent_turn + tool_call rendering arrives in Phase 3) | 1.9, 1.11 |
| 1.13 | Sentinel UI: Agents tab (cards) + manual "Run now" wiring | 1.10, 1.12 |
| 1.14 | Shared components: SeverityBadge, StatusPill, ModelChip, CostBadge, JsonTree, PromptViewer, ArtifactViewer (image + http_response + json), DrawerShell, Sparkline | 1.11 |

Exit criteria: A nightly cron run on alpu.ca produces monitoring_runs + monitoring_run_steps rows for Uptime and Deploy verifier. UI Run drawer shows every step with model + rationale + prompt + input + output + artifacts. Manual "Run now" works from UI.

Phase 2: Full agent set + Coverage view (~1.5 weeks, depends on Phase 1)

Goal: all five one-shot agents firing nightly (the two agent-loop agents arrive in Phase 3); Coverage matrix lit up; severity bumps and audit log working.

| # | Item | Depends on |
|---|---|---|
| 2.1 | Probes (deterministic, also exposed as tools): DB integrity (row counts, R2 reconciliation, pg_cron freshness, migrations applied) | 1.7 |
| 2.2 | Probes: Edge function health (incl. existing smoke-image-gen.mjs as a probe) | 1.7 |
| 2.3 | Probes: Security active (RLS bypass attempts, public R2 inventory, unauth edge-fn probes) | 1.7 |
| 2.4 | Tool registry (tools/index.ts) + tool implementations: http_probe, query_supabase, read_file, git_diff_since, cf_pages_status, r2_list, record_finding. Each tool has a Zod input schema and shares code with the probe library. | 1.7 |
| 2.5 | Refine monitoring_prompt_templates per agent (versioned; v1 production prompts including system prompts for the future agent-loop agents) | 1.3 |
| 2.6 | Findings dedup logic (orchestrator-side: increment occurrence vs new row; also called from the record_finding tool) | 1.7 |
| 2.7 | Regression-aware severity bumps (bump after 30 consecutive passes; performance regression on 3× baseline) | 2.6 |
| 2.8 | Audit log emission for ack/dismiss/resolve, model change, execution-mode change, agent pause, manifest approval | 1.1 |
| 2.9 | Per-agent cost ceiling + timeout enforcement + auto-pause-with-finding (applies to both modes) | 1.9 |
| 2.10 | Sentinel UI: Findings tab + Finding drawer + bulk-action toolbar | 1.12 |
| 2.11 | Sentinel UI: Activity tab (chronological feed, filters, polling) | 1.12 |
| 2.12 | Sentinel UI: Coverage tab (matrix view + filters + Pending approvals section) | 2.10 |
| 2.13 | Per-agent in-UI docs: write six sentinel/docs/*.md files (Test coverage lands with that agent in Phase 3); render in Agent drawer About tab | 2.10 |
| 2.14 | Shared components: CoverageMatrix, FindingsTable, ActivityEntry, AgentCard, SeverityHeatmap, MarkdownDoc | 1.14 |

Exit criteria: Five of seven agents fire nightly in one-shot mode (Uptime, Deploy verifier, DB integrity, Edge function health, Security active). Findings dedupe correctly. Severity bumps after 30 successful runs. Coverage matrix is populated. Audit log captures admin actions. Cost ceiling auto-pauses an agent on test. Tool registry exists and is consumable.

Phase 3: Agent-loop path + Coverage extension + Test agent + Notifications (~1.5 weeks, depends on Phase 2)

Goal: Claude Agent SDK wired into the orchestrator; the two agent-loop agents (Security passive, Test coverage) running; coverage auto-grows; daily digest goes out.

| # | Item | Depends on |
|---|---|---|
| 3.1 | Add @anthropic-ai/claude-agent-sdk dependency; build reasoner/agent-loop.ts runner that initializes an SDK session per run with system prompt, tool catalog, model, max-turns; subscribes to step events and forwards to the step recorder | 1.7, 2.4, 2.5 |
| 3.2 | Orchestrator dispatcher: route runs by execution_mode; both modes fully wired | 1.9, 3.1 |
| 3.3 | Run drawer rendering: agent_turn + tool_call step kinds; visual grouping of turns and their tool calls; tool name/args/result display | 1.12, 3.1 |
| 3.4 | Agent-loop tool implementations specific to advanced agents: run_gitleaks, query_osv, list_routes, list_edge_fns, list_critical_files, find_existing_tests, draft_test_scaffold | 2.4 |
| 3.5 | Wire Security passive agent (agent-loop, Sonnet 4.6); write its system prompt; smoke-test on a known-safe diff | 3.1, 3.4 |
| 3.6 | Wire Test coverage agent (agent-loop, Sonnet 4.6); write its system prompt; smoke-test against the Vitest+Playwright setup from Phase 0; write Test-coverage in-UI doc | 3.1, 3.4, 0.4 |
| 3.7 | Finding drawer: Approve & queue button + draft task preview + confirmation modal | 2.10 |
| 3.8 | approveAndQueueTask mutation (inserts row into existing tasks table from finding metadata) | 3.7 |
| 3.9 | Crawler logic + manifest auto-population at start of each sweep | 1.9, 2.12 |
| 3.10 | LLM-on-merge GitHub Action (.github/workflows/sentinel-coverage.yml) | 3.9 |
| 3.11 | Supabase edge function sentinel-llm-merge (signature verify; writes pending rows) | 3.10 |
| 3.12 | Coverage tab: Pending approvals UI (Approve / Reject buttons inline) | 2.12, 3.11 |
| 3.13 | Dead-man-switch CF Worker (sentinel-deadman): flags finding if no heartbeat in 25h | 1.9 |
| 3.14 | Daily digest CF Worker (sentinel-digest): 7am CT cron → Resend | 2.10 |

Exit criteria: A new route added in a feature branch shows up in monitoring_manifest after merge. Test coverage agent flags a sample uncovered file with a recommended action and a draft task; clicking Approve & queue creates a tasks row that the prompt-runner picks up. Security passive agent runs against a sample diff and emits findings via tool calls. Daily digest email arrives at 7am CT. The dead-man Worker flags a finding within 25h of the service being stopped.

Phase 4: Documentation & polish (~3 days, depends on Phase 3)

Goal: an unfamiliar collaborator can land in the UI and onboard cold.

| # | Item | Depends on |
|---|---|---|
| 4.1 | Write infra/sentinel/README.md (system-level "how Sentinel works") | All prior |
| 4.2 | Write infra/sentinel/glossary.md | 4.1 |
| 4.3 | Sentinel UI: Docs tab (render README + per-agent docs + glossary) | 2.10, 2.13, 4.1, 4.2 |
| 4.4 | Update infra/runbook.md with Sentinel orchestrator ops recipe | 1.9 |
| 4.5 | Update root CLAUDE.md with Monitoring section | 4.1 |
| 4.6 | Update apps/control/scripts/lint-image-gen.sh whitelist for sentinel probes | 2.2 |
| 4.7 | Visual polish pass on all tabs (skeleton, empty, error states) | 2.12 |
| 4.8 | Accessibility audit (keyboard nav, screen reader, focus) | 4.7 |
| 4.9 | Cost-and-performance review of the first week of runs; tune defaults | post-launch |

Exit criteria: Docs tab renders cleanly. Visual polish complete. Accessibility audit passes. Runbook updated.

Total estimate

Phases 0–4 ≈ 4–4.5 weeks of focused work (slightly longer than the pre-SDK plan because Phase 3 now also covers the agent-loop runtime + tool catalog). Mostly sequential because the orchestrator + UI + agents stack on each other. Phase 0 should start immediately so the prompt-runner has time to deliver Vitest/Playwright bootstrap before Phase 3.

13. Verification: end-to-end checklist

  1. Manual UI trigger. Click "Run now" on the Uptime agent in Sentinel → Agents. Run appears in Activity within 30s. Run drawer shows every step with model, rationale, prompt, input context, output, artifacts.
  2. Nightly cron. journalctl -u sentinel.service -f on alpu.ca. Confirm 22:00 CT kickoff, sequential agent execution, full sweep complete by ~03:30 CT, exit 0.
  3. Coverage matrix. Open Sentinel → Coverage. Confirm every garden + control route, every critical Supabase table, every edge function, and every external service is listed. Each row shows which agents cover it.
  4. Dedup. Introduce a deliberate failure (e.g. delete a route temporarily). Run twice. Confirm finding has occurrence_count=2, not two rows.
  5. Regression severity. Simulate 30 consecutive passes on a probe, then a fail; confirm severity bumps one level vs a fresh-fail finding.
  6. Coverage extension. Create a branch with a new route + new edge function, merge to main. Within 2 min, GitHub Action posts manifest entries. Approve in UI. Next nightly run probes them.
  7. Dead-man. Stop sentinel.service on alpu.ca. Within 25h, the dead-man Worker writes a finding with agent='deadman'. Restart service; next heartbeat resolves the finding.
  8. Model swap with audit. Change Uptime agent's model_id from Haiku to Sonnet in the UI; provide a rationale. Confirm monitoring_audit_log has the change with before/after. Next run uses Sonnet and cost telemetry reflects it.
  9. Cost ceiling. Lower a test agent's daily_cost_ceiling_usd to $0.001 and trigger a run. Confirm the agent auto-pauses and a finding is generated.
  10. Daily digest. Confirm 7am CT email arrives at oracle/admin addresses with summary of last 24h.
  11. Docs tab. Open Sentinel → Docs. Confirm README renders; each agent's doc renders; glossary visible.
  12. Multi-step transparency (agent-loop). Open a Test coverage or Security passive run. Confirm the stepper shows alternating agent_turn + tool_call steps, with each tool call's name, arguments, and result visible. The execution-mode chip on the run header reads agent-loop.
  13. Multi-step transparency (one-shot). Open an Uptime run. Confirm the stepper shows a flat list of probe steps followed by a single reasoner step. Execution-mode chip reads one-shot.
  14. Approve & queue. Approve a test-write finding. Confirm a row appears in tasks with the draft payload; confirm the headless prompt-runner picks it up.
  15. RLS. Sign in as a demo-role user. Confirm the Sentinel tab returns 403 and is hidden from the nav.
  16. Webhook auth. POST to the alpu.ca trigger endpoint without a valid HMAC signature; confirm rejection.
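For item 16, the check on the alpu.ca side might look like the sketch below, using Node's built-in `createHmac` and `timingSafeEqual`. The RFC only says the webhook is HMAC-signed; the SHA-256 choice, the `timestamp.body` signing string, the hex encoding, and the five-minute replay window are all assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Signs `${timestamp}.${body}` with HMAC-SHA256 and returns a hex digest.
// The signing-string layout is an assumption, not part of the RFC.
function sign(secret: string, timestampMs: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestampMs}.${body}`).digest("hex");
}

function verify(
  secret: string,
  timestampMs: number,
  body: string,
  signature: string,
  nowMs: number,
  maxSkewMs = 5 * 60 * 1000, // assumed replay window
): boolean {
  if (Math.abs(nowMs - timestampMs) > maxSkewMs) return false; // stale or future-dated
  const expected = Buffer.from(sign(secret, timestampMs, body), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

const sharedSecret = "example-secret"; // placeholder; the real one lives in the .env file
const sentAt = 1_700_000_000_000;
const payload = JSON.stringify({ agent: "uptime", trigger: "manual" });
const goodSignature = sign(sharedSecret, sentAt, payload);

console.log(
  verify(sharedSecret, sentAt, payload, goodSignature, sentAt + 1000),
  verify(sharedSecret, sentAt, payload + " ", goodSignature, sentAt + 1000),
);
```

Binding the timestamp into the signed string means a captured request cannot simply be replayed later to trigger extra runs (and extra cost).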

14. Out of scope for v1

15. Open questions to revisit during build

Doc owner: Haydn. Drafted 2026-05-06. Markdown source: docs/devtasks/sentinel.md. Operational secrets in BW folder devops-sponic.