Observability — Function Reference
⚠ HERITAGE EXTRACTION — donor AMS observability surface (Wave 8 quarantine)
This file extracts the donor AMS observability tools from `src/tools/observability.js` (deleted R53). The 22-tool donor surface (`observe_*`, `alert_*`, `metric_*`) is donor accretion. Phase 0 Colibri ships none of these tools. Observability in Phase 0 is structured logs only: no MCP-exposed metrics surface, no alert engine, no `mcp_action`-style timeseries. The 19 Phase 0 tools are listed in ../mcp-tools-phase-0.md. Observability tooling is Phase 1+ territory. Read this file as donor genealogy only.
Core Algorithm
The observability layer provides 22 MCP tools grouped into 6 categories:
- Usage Tracking (3): claude usage, cost, token tracking
- Performance (3): latency, throughput, error rates
- Audit & Logs (3): realtime audit, log analysis, trace flow
- Health & Alerts (5): health check, system metrics, alert create/list + acknowledge
- Alert Management (7): rule create/update/delete/list, alert resolve/silence/list
- Export (2): metrics export, report generate
All metric data flows through the `mcp_action` table (SQLite) for historical queries. Real-time metrics use the `getMetrics()` middleware and `getMetricsCollector()`. Alerts use a pluggable engine (`getAlertEngine`, `getAlertRuleEngine`) with channel routing.
Tool Catalog
Usage Tracking
observe_claude_usage
Purpose: Aggregate action counts by time bucket with tool and session breakdowns.

Algorithm:
- Map `period` → `timeExpr` (hour/day/week/month → SQLite `-N <unit>` datetime modifier).
- Map `granularity` → `timeFormat` (raw/hourly/daily → `strftime` format).
- Apply optional filters (`project_id`, `user_id`, `tool_name`) to the WHERE clause.
- Query `mcp_action` with `strftime(timeFormat, timestamp)` GROUP BY time bucket → `{ actions, tools, sessions }`.
- Secondary query: `mcp_action` GROUP BY `tool_name` → top tools.
- Subquery: per-session action counts → average actions per session.

Returns: `{ period, granularity, filters, summary: { total_sessions, total_actions, avg_actions_per_session }, timeline[], tool_breakdown[] }`
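The period/granularity mapping in the steps above can be sketched as follows; the lookup values and the `buildUsageQuery` helper are illustrative, not the donor handler's actual internals.

```javascript
// Illustrative mapping tables; actual donor values are not preserved.
const TIME_EXPR = {
  hour: "-1 hour",
  day: "-1 day",
  week: "-7 days",
  month: "-30 days",
};

const TIME_FORMAT = {
  raw: "%Y-%m-%dT%H:%M:%S",
  hourly: "%Y-%m-%d %H:00",
  daily: "%Y-%m-%d",
};

function buildUsageQuery({ period = "day", granularity = "daily", filters = {} }) {
  const where = ["timestamp >= datetime('now', ?)"];
  const params = [TIME_EXPR[period]];
  // Optional filters land in the WHERE clause as bound parameters.
  for (const col of ["project_id", "user_id", "tool_name"]) {
    if (filters[col] !== undefined) {
      where.push(`${col} = ?`);
      params.push(filters[col]);
    }
  }
  const sql =
    `SELECT strftime('${TIME_FORMAT[granularity]}', timestamp) AS time_bucket, ` +
    "COUNT(*) AS actions, COUNT(DISTINCT tool_name) AS tools, " +
    "COUNT(DISTINCT session_id) AS sessions " +
    `FROM mcp_action WHERE ${where.join(" AND ")} GROUP BY time_bucket`;
  return { sql, params };
}
```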
observe_claude_cost
Purpose: Cost analytics by time range and dimension.

Algorithm:
- Determine `from`/`to` (default: last 30 days).
- Map `group_by` → SQL column: `day`|`hour` → `strftime`, `project` → `project_id`, `tool` → `tool_name`, `user` → `user_id`.
- Query `mcp_action` grouped by the selected dimension.
- Cost estimation: token-based pricing model (per tool/model rate table; implementation detail in the handler).

Returns: `{ time_range, group_by, format, groups[], total_cost_usd, breakdown }`
observe_token_tracking
Purpose: Token usage by session or global.
Data source: in-memory `tokenStore` Map (max 10000 entries) plus `usageStore`.
Filtering: by `session_id` and/or `model`.
Returns: `{ session_id?, model?, tokens: { input, output, total }, cost_estimate, history[] }`
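A minimal sketch of a bounded token store of this shape, assuming the oldest session is evicted once the 10000-entry cap is exceeded (the donor's exact eviction policy is not documented here):

```javascript
// Sketch of a bounded in-memory token store. Eviction policy (drop the
// oldest session when over the cap) is an assumption.
const MAX_TOKEN_HISTORY = 10000;
const tokenStore = new Map();

function recordTokens(sessionId, { input, output }) {
  const prev = tokenStore.get(sessionId) ?? { input: 0, output: 0 };
  // Delete + set moves the key to the end of the Map's insertion order,
  // so the first key is always the least-recently-updated session.
  tokenStore.delete(sessionId);
  tokenStore.set(sessionId, { input: prev.input + input, output: prev.output + output });
  if (tokenStore.size > MAX_TOKEN_HISTORY) {
    tokenStore.delete(tokenStore.keys().next().value);
  }
}
```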
Performance
observe_latency_dashboard
Purpose: Latency percentiles for system operations.

Algorithm:
- `getMetrics()` + `getSlowOperations()` from middleware.
- Calculate percentiles (default `[50, 95, 99]`) over operation durations.
- Apply `time_range` filter if provided.

Returns: `{ percentiles: { p50, p95, p99 }, slow_operations[], by_tool{}, time_range }`
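The percentile step can be sketched with a nearest-rank calculation; the donor's exact interpolation method is not preserved, so treat this as one plausible implementation:

```javascript
// Minimal nearest-rank percentile sketch over recorded operation
// durations, assumed to arrive as a plain number array.
function percentiles(durations, points = [50, 95, 99]) {
  if (durations.length === 0) return {};
  const sorted = [...durations].sort((a, b) => a - b);
  const out = {};
  for (const p of points) {
    // Nearest-rank: index of the smallest value covering p% of samples.
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    out[`p${p}`] = sorted[idx];
  }
  return out;
}
```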
observe_throughput_metrics
Purpose: Request throughput at specified intervals.

Algorithm:
- `getMetricsCollector()` → current metrics snapshot.
- Aggregate by `interval` (1m/5m/15m/1h).

Returns: `{ interval, rps: current requests-per-second, timeline[], peak_rps, avg_rps }`
observe_error_rates
Purpose: Error rates by severity or tool.

Algorithm:
- Query `mcp_action` (or the error log table) grouped by `tool_name`.
- Filter by `severity` and `tool` if provided.
- Calculate `error_rate = errors / total_calls` per group.

Returns: `{ severity, tool?, by_tool[], overall_rate, time_range }`
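The rate math in the last step, sketched over an assumed row shape of `{ tool_name, errors, total_calls }`:

```javascript
// Sketch of the per-group error-rate calculation; the row shape
// { tool_name, errors, total_calls } is an assumed query result.
function errorRates(rows) {
  const by_tool = rows.map((r) => ({
    tool_name: r.tool_name,
    error_rate: r.total_calls > 0 ? r.errors / r.total_calls : 0,
  }));
  const errors = rows.reduce((s, r) => s + r.errors, 0);
  const total = rows.reduce((s, r) => s + r.total_calls, 0);
  return { by_tool, overall_rate: total > 0 ? errors / total : 0 };
}
```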
Audit & Logs
observe_audit_realtime
Purpose: Real-time audit data with session/tool/user filters.

Algorithm:
- Build WHERE from `filters` (`session_id`, `tool_name`, `user_id`, `time_window_minutes`).
- Query `mcp_action` ORDER BY timestamp DESC.
- If `stream=true`: return a polling hint (true streaming is not supported in the MCP sync model).

Returns: `{ actions[], count, window_minutes, filters }`
observe_log_analyze
Purpose: Application log search and analysis.
Cache: `logAnalysisCache` Map with 5-min TTL.

Algorithm:
- Check the cache by `(query + time_range + level)` key.
- Read log files from the `AMS_ROOT` log directory (using `readdirSync`/`readFileSync`).
- Filter by `level` and keyword `query` (simple string search).
- Truncate to `limit` (max 1000, default 100).

Returns: `{ logs[], count, query, level, time_range }`
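A minimal sketch of the 5-minute cache, assuming lazy expiry on read (the donor's exact invalidation strategy is not documented; helper names are illustrative):

```javascript
// Sketch of the log-analysis cache: a plain Map keyed by the query
// parameters, with entries dropped lazily once older than the TTL.
const LOG_CACHE_TTL = 300000; // 5 minutes

const logAnalysisCache = new Map();

function cacheKey(query, timeRange, level) {
  return JSON.stringify([query, timeRange, level]);
}

function cacheGet(key, now = Date.now()) {
  const entry = logAnalysisCache.get(key);
  if (!entry) return undefined;
  if (now - entry.storedAt > LOG_CACHE_TTL) {
    logAnalysisCache.delete(key); // expired: drop and report a miss
    return undefined;
  }
  return entry.value;
}

function cacheSet(key, value, now = Date.now()) {
  logAnalysisCache.set(key, { value, storedAt: now });
}
```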
observe_trace_flow
Purpose: Trace a request session through the system.
Data source: `sessionTraces` Map plus the `mcp_action` and `mcp_thought` tables.

Algorithm:
- Fetch all actions for `session_id` from `mcp_action`.
- Fetch all thoughts linked to the session from `mcp_thought`.
- Build the trace timeline sorted by `step_index`.
- If `visualize=true`: build graph nodes/edges for rendering.

Returns: `{ session_id, actions[], thoughts[], timeline[], visualize?: { nodes, edges } }`
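The timeline-building step can be sketched as a merge of the two result sets ordered by `step_index`; the `kind` tag is an illustrative addition:

```javascript
// Sketch of the timeline merge: tag each record and sort by step_index.
// The `kind` field is illustrative, not a documented donor field.
function buildTimeline(actions, thoughts) {
  return [
    ...actions.map((a) => ({ kind: "action", ...a })),
    ...thoughts.map((t) => ({ kind: "thought", ...t })),
  ].sort((x, y) => x.step_index - y.step_index);
}
```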
Health & Alerts
observe_health_check
Purpose: System health check with optional deep inspection. Components checked:
| Component | Check |
|---|---|
| `db` | SQLite `SELECT 1`, table count, size on disk |
| `disk` | `statSync(AMS_ROOT)`, free space estimation |
| `memory` | `process.memoryUsage()` |
| `audit` | Row count in `mcp_action`, latest timestamp |
Deep check (`deep=true`): also checks index counts and the slow query log.
Returns: `{ status: "healthy"|"degraded"|"unhealthy", components: { db, disk, memory, audit }, checks[] }`
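One plausible rollup from component checks to the overall status, assuming each component reports a `state` of `ok`/`warn`/`fail` (the field name and thresholds are assumptions, not the donor's documented logic):

```javascript
// Assumed rollup: any failing component → unhealthy, any warning →
// degraded, otherwise healthy. The per-component `state` field is an
// assumption about the check result shape.
function rollupStatus(components) {
  const states = Object.values(components).map((c) => c.state);
  if (states.includes("fail")) return "unhealthy";
  if (states.includes("warn")) return "degraded";
  return "healthy";
}
```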
observe_system_metrics
Purpose: CPU, memory, database metrics snapshot.

Algorithm:
- `cpu=true`: `process.cpuUsage()` → user + system time in microseconds.
- `memory=true`: `process.memoryUsage()` → heapUsed, heapTotal, rss, external.
- `db=true`: `getDb()` → query `sqlite_master` for table/index counts + `mcp_action` row count.

Returns: `{ cpu?: { user_ms, system_ms }, memory?: { heap_used_mb, heap_total_mb, rss_mb }, db?: { tables, indexes, action_count } }`
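The `cpu` and `memory` branches map directly onto Node's process API; a sketch (the `db` branch is omitted because it needs a live SQLite handle, and the rounding is an assumption):

```javascript
// Sketch of the metrics snapshot assembled from Node's process API.
function systemMetrics({ cpu = true, memory = true } = {}) {
  const out = {};
  if (cpu) {
    const u = process.cpuUsage(); // microseconds since process start
    out.cpu = { user_ms: u.user / 1000, system_ms: u.system / 1000 };
  }
  if (memory) {
    const m = process.memoryUsage();
    const mb = (bytes) => Math.round((bytes / 1024 / 1024) * 100) / 100;
    out.memory = {
      heap_used_mb: mb(m.heapUsed),
      heap_total_mb: mb(m.heapTotal),
      rss_mb: mb(m.rss),
    };
  }
  return out;
}
```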
observe_alert_create
Purpose: Create a simple threshold-based alert rule (inline storage).

Algorithm:
- Validate `condition.metric`, `condition.operator`, `condition.threshold`.
- Increment `alertRuleId++`, store in the `alertRules` Map.
- Schedule a background check via `getAlertEngine().evaluate()` (polling-based).

Actions: `log` | `webhook` | `email`. A `channels` array is supported.
Returns: `{ alert_id, condition, action, created_at }`
observe_alert_list
Purpose: List firing/pending/acknowledged alerts.
Filter: status ∈ all/firing/pending/acknowledged.
Returns: { alerts[], count, status }
observe_alert_acknowledge
Purpose: Acknowledge an active alert.
Algorithm: `getAlertManager().acknowledge(alertId)` → sets status to acknowledged.
Returns: { acknowledged, alert_id }
Alert Management Tools (7)
alert_rule_create
Purpose: Create typed alert rule (threshold / rate / anomaly / composite). Rule types:
| Type | Condition fields | Notes |
|---|---|---|
| threshold | metric, operator (gt/lt/gte/lte/eq), value/threshold, duration | Standard level check |
| rate | metric, window, operator, value | Rate-of-change detection |
| anomaly | metric, method (zscore/iqr/mad) | Statistical outlier detection |
| composite | rules[] (sub-rule IDs) | Boolean AND of sub-rules |
Algorithm: `getAlertRuleEngine().createRule(config)` → persists the rule.
Returns: `{ rule_id, name, type, condition, severity, channels, enabled: true }`
alert_rule_update
`getAlertRuleEngine().updateRule(ruleId, updates)`.
Updatable fields: `name`, `description`, `enabled`, `severity`, `channels`, `condition`.
alert_rule_delete
`getAlertRuleEngine().deleteRule(ruleId)`.
alert_rule_list
`getAlertRuleEngine().listRules({ enabledOnly })`. Returns all rules with current state.
alert_resolve
`getAlertManager().resolve(alertId, resolution)` → sets status resolved and attaches a note.
alert_silence
`getAlertRuleEngine().silenceRule(ruleId, duration, reason)`.
Duration strings: "30m", "1h", "1d" parsed to milliseconds.
alert_list
`getAlertManager().listAlerts({ status, severity, timeRange, limit })`.
Status filter: active/acknowledged/resolved/all. Severity filter: critical/warning/info/all.
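The duration strings accepted by `alert_silence` ("30m", "1h", "1d") can be parsed with a small helper; the `s` unit here is an assumption beyond the three documented forms:

```javascript
// Sketch of "30m"/"1h"/"1d" duration parsing. The donor's actual parser
// is not preserved; this matches only the stated string format.
const UNIT_MS = { s: 1000, m: 60000, h: 3600000, d: 86400000 };

function parseDuration(str) {
  const m = /^(\d+)([smhd])$/.exec(str);
  if (!m) throw new Error(`invalid duration: ${str}`);
  return Number(m[1]) * UNIT_MS[m[2]];
}
```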
Export Tools
observe_metrics_export
Purpose: Export metrics in json/prometheus/csv format.

Algorithm:
- `queryMetrics(range)` from `db/timeseries.js`.
- Format:
  - json: raw objects array.
  - prometheus: `metric_name{labels} value timestamp`.
  - csv: header row + data rows.
- If `destination` is provided: write to the file path.

Returns: `{ format, count, data: string|object[], destination? }`
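The prometheus branch emits one `name{labels} value timestamp` line per point, per the Prometheus text exposition format; the point field names below are assumptions about the `queryMetrics` result shape:

```javascript
// Sketch of the prometheus formatting branch: one exposition-format
// line per data point. Field names { name, labels, value, timestamp }
// are assumed, not the donor's documented shape.
function toPrometheus(points) {
  return points
    .map(({ name, labels = {}, value, timestamp }) => {
      const pairs = Object.entries(labels)
        .map(([k, v]) => `${k}="${String(v).replace(/"/g, '\\"')}"`)
        .join(",");
      const labelStr = pairs ? `{${pairs}}` : "";
      return `${name}${labelStr} ${value} ${timestamp}`;
    })
    .join("\n");
}
```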
observe_report_generate
Purpose: Generate periodic observability reports. Report types: daily / weekly / monthly / custom.

Algorithm:
- Determine the period from `type` or an explicit `period.from`/`period.to`.
- Aggregate: usage, cost, error rates, top tools, health events.
- If `recipients[]` is provided: send via the configured email channel.

Returns: `{ type, period, summary: { calls, errors, cost_usd, top_tools }, sections[] }`
Time Series Module (src/db/timeseries.js)
Functions imported by observability:
`storeMetric(name, value, labels?, timestamp?): Promise<void>`
Insert a metric data point into the `ts_metrics` table.
`queryMetrics(options): Promise<object[]>`
Query time series data with range, name filter, label filter, downsampling.
`getMetricStats(name, range): Promise<object>`
Returns min/max/avg/count/p50/p95/p99 for a named metric in a time range.
`downsampleMetrics(name, resolution, range): Promise<object[]>`
Reduce metric resolution (LTTB or average-bucket algorithm).
`applyRetentionPolicies(): Promise<object>`
Delete metrics older than the configured retention. Called periodically.
`getTimeSeriesStats(): Promise<object>`
Table statistics: row counts, oldest/newest timestamps, size estimate.
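The average-bucket variant of `downsampleMetrics` can be sketched as fixed-width time buckets averaged into one point each (LTTB, the other option named above, is more involved and omitted here):

```javascript
// Sketch of average-bucket downsampling: group points into fixed-width
// time buckets and emit one averaged point per bucket, in time order.
function downsampleAvg(points, resolutionMs) {
  const buckets = new Map();
  for (const { timestamp, value } of points) {
    const bucket = Math.floor(timestamp / resolutionMs) * resolutionMs;
    const agg = buckets.get(bucket) ?? { sum: 0, count: 0 };
    agg.sum += value;
    agg.count += 1;
    buckets.set(bucket, agg);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([timestamp, { sum, count }]) => ({ timestamp, value: sum / count }));
}
```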
Configuration
| Config key | Default | Purpose |
|---|---|---|
| `MAX_TOKEN_HISTORY` | 10000 | In-memory token tracking limit |
| `LOG_CACHE_TTL` | 300000 (5 min) | Log analysis cache lifetime |
| `observe_health_check.deep` | false | Enables disk+index checks |
| `alert_rule_create.severity` | "warning" | Default rule severity |
| `alert_rule_create.type` | "threshold" | Default rule type |
| `observe_latency_dashboard.percentiles` | [50, 95, 99] | Default percentile array |
| `observe_throughput_metrics.interval` | "5m" | Default aggregation interval |
| `observe_metrics_export.format` | "json" | Default export format |
Alert Severity Levels
`ALERT_SEVERITY`: critical > warning > info
`ALERT_STATUS` / `MANAGER_ALERT_STATUS`: firing / pending / acknowledged / resolved
`CHANNEL_SEVERITY`: critical / warning / info (routes to different channels)
Alert Rule Type Keywords
`ALERT_RULE_TYPES`: threshold, rate, anomaly, composite
`ALERT_OPERATORS`: gt, lt, gte, lte, eq
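Threshold evaluation over `ALERT_OPERATORS` reduces to a small dispatch table; `evaluateThreshold` and its rule shape are illustrative, not the donor engine's actual API:

```javascript
// Dispatch table covering the ALERT_OPERATORS listed above.
const OPERATORS = {
  gt: (a, b) => a > b,
  lt: (a, b) => a < b,
  gte: (a, b) => a >= b,
  lte: (a, b) => a <= b,
  eq: (a, b) => a === b,
};

// Illustrative rule shape: { operator, threshold }.
function evaluateThreshold(rule, metricValue) {
  const op = OPERATORS[rule.operator];
  if (!op) throw new Error(`unknown operator: ${rule.operator}`);
  return op(metricValue, rule.threshold);
}
```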