Observability — Function Reference
⚠ HERITAGE EXTRACTION — donor AMS observability surface (Wave 8 quarantine)
This file extracts the donor AMS observability tools from `src/tools/observability.js` (deleted R53). The 22-tool donor surface (`observe_*`, `alert_*`, `metric_*`) is donor accretion. Phase 0 Colibri ships none of these tools. Observability in Phase 0 is structured logs only: no MCP-exposed metrics surface, no alert engine, no `mcp_action`-style timeseries. The 19 Phase 0 tools are listed in ../mcp-tools-phase-0.md. Observability tooling is Phase 1+ territory. Read this file as donor genealogy only.
Core Algorithm
The observability layer provides 22 MCP tools grouped into 6 categories:
- Usage Tracking (3): claude usage, cost, token tracking
- Performance (3): latency, throughput, error rates
- Audit & Logs (3): realtime audit, log analysis, trace flow
- Health & Alerts (5): health check, system metrics, alert create/list + acknowledge
- Alert Management (7): rule create/update/delete/list, alert resolve/silence/list
- Export (2): metrics export, report generate
All metric data flows through the `mcp_action` table (SQLite) for historical queries. Real-time metrics use the `getMetrics()` middleware and `getMetricsCollector()`. Alerts use a pluggable engine (`getAlertEngine`, `getAlertRuleEngine`) with channel routing.
Tool Catalog
Usage Tracking
observe_claude_usage
Purpose: Aggregate action counts by time bucket with tool and session breakdowns.

Algorithm:
- Map `period` → `timeExpr` (hour/day/week/month → SQLite `-N <unit>` datetime modifier).
- Map `granularity` → `timeFormat` (raw/hourly/daily → `strftime` format).
- Apply optional filters (`project_id`, `user_id`, `tool_name`) to the WHERE clause.
- Query `mcp_action` with `strftime(timeFormat, timestamp)` GROUP BY time bucket → `{ actions, tools, sessions }`.
- Secondary query: `mcp_action` GROUP BY `tool_name` → top tools.
- Subquery: per-session action counts → average actions per session.

Returns: `{ period, granularity, filters, summary: { total_sessions, total_actions, avg_actions_per_session }, timeline[], tool_breakdown[] }`
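The period/granularity mapping in the steps above can be sketched as follows; the lookup values and the `buildUsageQuery` helper are illustrative, not the donor handler's actual internals.

```javascript
// Illustrative mapping tables; actual donor values are not preserved.
const TIME_EXPR = {
  hour: "-1 hour",
  day: "-1 day",
  week: "-7 days",
  month: "-30 days",
};

const TIME_FORMAT = {
  raw: "%Y-%m-%dT%H:%M:%S",
  hourly: "%Y-%m-%d %H:00",
  daily: "%Y-%m-%d",
};

function buildUsageQuery({ period = "day", granularity = "daily", filters = {} }) {
  const where = ["timestamp >= datetime('now', ?)"];
  const params = [TIME_EXPR[period]];
  // Optional filters land in the WHERE clause as bound parameters.
  for (const col of ["project_id", "user_id", "tool_name"]) {
    if (filters[col] !== undefined) {
      where.push(`${col} = ?`);
      params.push(filters[col]);
    }
  }
  const sql =
    `SELECT strftime('${TIME_FORMAT[granularity]}', timestamp) AS time_bucket, ` +
    "COUNT(*) AS actions, COUNT(DISTINCT tool_name) AS tools, " +
    "COUNT(DISTINCT session_id) AS sessions " +
    `FROM mcp_action WHERE ${where.join(" AND ")} GROUP BY time_bucket`;
  return { sql, params };
}
```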
observe_claude_cost
Purpose: Cost analytics by time range and dimension.

Algorithm:
- Determine `from`/`to` (default: last 30 days).
- Map `group_by` → SQL column: `day`|`hour` → `strftime`, `project` → `project_id`, `tool` → `tool_name`, `user` → `user_id`.
- Query `mcp_action` grouped by the selected dimension.
- Cost estimation: token-based pricing model (per tool/model rate table; implementation detail in the handler).

Returns: `{ time_range, group_by, format, groups[], total_cost_usd, breakdown }`
observe_token_tracking
Purpose: Token usage by session or global.
Data source: in-memory `tokenStore` Map (max 10000 entries) plus `usageStore`.
Filtering: by `session_id` and/or `model`.
Returns: `{ session_id?, model?, tokens: { input, output, total }, cost_estimate, history[] }`
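A minimal sketch of a bounded token store of this shape, assuming the oldest session is evicted once the 10000-entry cap is exceeded (the donor's exact eviction policy is not documented here):

```javascript
// Sketch of a bounded in-memory token store. Eviction policy (drop the
// oldest session when over the cap) is an assumption.
const MAX_TOKEN_HISTORY = 10000;
const tokenStore = new Map();

function recordTokens(sessionId, { input, output }) {
  const prev = tokenStore.get(sessionId) ?? { input: 0, output: 0 };
  // Delete + set moves the key to the end of the Map's insertion order,
  // so the first key is always the least-recently-updated session.
  tokenStore.delete(sessionId);
  tokenStore.set(sessionId, { input: prev.input + input, output: prev.output + output });
  if (tokenStore.size > MAX_TOKEN_HISTORY) {
    tokenStore.delete(tokenStore.keys().next().value);
  }
}
```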
Performance
observe_latency_dashboard
Purpose: Latency percentiles for system operations.

Algorithm:
- `getMetrics()` + `getSlowOperations()` from middleware.
- Calculate percentiles (default `[50, 95, 99]`) over operation durations.
- Apply `time_range` filter if provided.

Returns: `{ percentiles: { p50, p95, p99 }, slow_operations[], by_tool{}, time_range }`
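The percentile step can be sketched with a nearest-rank calculation; the donor's exact interpolation method is not preserved, so treat this as one plausible implementation:

```javascript
// Minimal nearest-rank percentile sketch over recorded operation
// durations, assumed to arrive as a plain number array.
function percentiles(durations, points = [50, 95, 99]) {
  if (durations.length === 0) return {};
  const sorted = [...durations].sort((a, b) => a - b);
  const out = {};
  for (const p of points) {
    // Nearest-rank: index of the smallest value covering p% of samples.
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    out[`p${p}`] = sorted[idx];
  }
  return out;
}
```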
observe_throughput_metrics
Purpose: Request throughput at specified intervals.

Algorithm:
- `getMetricsCollector()` → current metrics snapshot.
- Aggregate by `interval` (1m/5m/15m/1h).

Returns: `{ interval, rps: current requests-per-second, timeline[], peak_rps, avg_rps }`
observe_error_rates
Purpose: Error rates by severity or tool.

Algorithm:
- Query `mcp_action` (or the error log table) grouped by `tool_name`.
- Filter by `severity` and `tool` if provided.
- Calculate `error_rate = errors / total_calls` per group.

Returns: `{ severity, tool?, by_tool[], overall_rate, time_range }`
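The rate math in the last step, sketched over an assumed row shape of `{ tool_name, errors, total_calls }`:

```javascript
// Sketch of the per-group error-rate calculation; the row shape
// { tool_name, errors, total_calls } is an assumed query result.
function errorRates(rows) {
  const by_tool = rows.map((r) => ({
    tool_name: r.tool_name,
    error_rate: r.total_calls > 0 ? r.errors / r.total_calls : 0,
  }));
  const errors = rows.reduce((s, r) => s + r.errors, 0);
  const total = rows.reduce((s, r) => s + r.total_calls, 0);
  return { by_tool, overall_rate: total > 0 ? errors / total : 0 };
}
```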
Audit & Logs
observe_audit_realtime
Purpose: Real-time audit data with session/tool/user filters.

Algorithm:
- Build WHERE from `filters` (`session_id`, `tool_name`, `user_id`, `time_window_minutes`).
- Query `mcp_action` ORDER BY timestamp DESC.
- If `stream=true`: return a polling hint (true streaming is not supported in the MCP sync model).

Returns: `{ actions[], count, window_minutes, filters }`
observe_log_analyze
Purpose: Application log search and analysis.
Cache: `logAnalysisCache` Map with 5-min TTL.

Algorithm:
- Check the cache by `(query + time_range + level)` key.
- Read log files from the `AMS_ROOT` log directory (using `readdirSync`/`readFileSync`).
- Filter by `level` and keyword `query` (simple string search).
- Truncate to `limit` (max 1000, default 100).

Returns: `{ logs[], count, query, level, time_range }`
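A minimal sketch of the 5-minute cache, assuming lazy expiry on read (the donor's exact invalidation strategy is not documented; helper names are illustrative):

```javascript
// Sketch of the log-analysis cache: a plain Map keyed by the query
// parameters, with entries dropped lazily once older than the TTL.
const LOG_CACHE_TTL = 300000; // 5 minutes

const logAnalysisCache = new Map();

function cacheKey(query, timeRange, level) {
  return JSON.stringify([query, timeRange, level]);
}

function cacheGet(key, now = Date.now()) {
  const entry = logAnalysisCache.get(key);
  if (!entry) return undefined;
  if (now - entry.storedAt > LOG_CACHE_TTL) {
    logAnalysisCache.delete(key); // expired: drop and report a miss
    return undefined;
  }
  return entry.value;
}

function cacheSet(key, value, now = Date.now()) {
  logAnalysisCache.set(key, { value, storedAt: now });
}
```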
observe_trace_flow
Purpose: Trace a request session through the system.
Data source: `sessionTraces` Map plus the `mcp_action` and `mcp_thought` tables.

Algorithm:
- Fetch all actions for `session_id` from `mcp_action`.
- Fetch all thoughts linked to the session from `mcp_thought`.
- Build the trace timeline sorted by `step_index`.
- If `visualize=true`: build graph nodes/edges for rendering.

Returns: `{ session_id, actions[], thoughts[], timeline[], visualize?: { nodes, edges } }`
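The timeline-building step can be sketched as a merge of the two result sets ordered by `step_index`; the `kind` tag is an illustrative addition:

```javascript
// Sketch of the timeline merge: tag each record and sort by step_index.
// The `kind` field is illustrative, not a documented donor field.
function buildTimeline(actions, thoughts) {
  return [
    ...actions.map((a) => ({ kind: "action", ...a })),
    ...thoughts.map((t) => ({ kind: "thought", ...t })),
  ].sort((x, y) => x.step_index - y.step_index);
}
```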
Health & Alerts
observe_health_check
Purpose: System health check with optional deep inspection. Components checked:
| Component | Check |
|---|---|
| `db` | SQLite `SELECT 1`, table count, size on disk |
| `disk` | `statSync(AMS_ROOT)`, free space estimation |
| `memory` | `process.memoryUsage()` |
| `audit` | Row count in `mcp_action`, latest timestamp |
Deep check (`deep=true`): also checks index counts and the slow query log.
Returns: `{ status: "healthy"|"degraded"|"unhealthy", components: { db, disk, memory, audit }, checks[] }`
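One plausible rollup from component checks to the overall status, assuming each component reports a `state` of `ok`/`warn`/`fail` (the field name and thresholds are assumptions, not the donor's documented logic):

```javascript
// Assumed rollup: any failing component → unhealthy, any warning →
// degraded, otherwise healthy. The per-component `state` field is an
// assumption about the check result shape.
function rollupStatus(components) {
  const states = Object.values(components).map((c) => c.state);
  if (states.includes("fail")) return "unhealthy";
  if (states.includes("warn")) return "degraded";
  return "healthy";
}
```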
observe_system_metrics
Purpose: CPU, memory, database metrics snapshot.

Algorithm:
- `cpu=true`: `process.cpuUsage()` → user + system time in microseconds.
- `memory=true`: `process.memoryUsage()` → heapUsed, heapTotal, rss, external.
- `db=true`: `getDb()` → query `sqlite_master` for table/index counts + `mcp_action` row count.

Returns: `{ cpu?: { user_ms, system_ms }, memory?: { heap_used_mb, heap_total_mb, rss_mb }, db?: { tables, indexes, action_count } }`
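The `cpu` and `memory` branches map directly onto Node's process API; a sketch (the `db` branch is omitted because it needs a live SQLite handle, and the rounding is an assumption):

```javascript
// Sketch of the metrics snapshot assembled from Node's process API.
function systemMetrics({ cpu = true, memory = true } = {}) {
  const out = {};
  if (cpu) {
    const u = process.cpuUsage(); // microseconds since process start
    out.cpu = { user_ms: u.user / 1000, system_ms: u.system / 1000 };
  }
  if (memory) {
    const m = process.memoryUsage();
    const mb = (bytes) => Math.round((bytes / 1024 / 1024) * 100) / 100;
    out.memory = {
      heap_used_mb: mb(m.heapUsed),
      heap_total_mb: mb(m.heapTotal),
      rss_mb: mb(m.rss),
    };
  }
  return out;
}
```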
observe_alert_create
Purpose: Create a simple threshold-based alert rule (inline storage).

Algorithm:
- Validate `condition.metric`, `condition.operator`, `condition.threshold`.
- Increment `alertRuleId++`, store in the `alertRules` Map.
- Schedule a background check via `getAlertEngine().evaluate()` (polling-based).

Actions: `log` | `webhook` | `email`. A `channels` array is supported.
Returns: `{ alert_id, condition, action, created_at }`
observe_alert_list
Purpose: List firing/pending/acknowledged alerts.
Filter: status ∈ all/firing/pending/acknowledged.
Returns: { alerts[], count, status }
observe_alert_acknowledge
Purpose: Acknowledge an active alert.
Algorithm: `getAlertManager().acknowledge(alertId)` → sets status to acknowledged.
Returns: { acknowledged, alert_id }
Alert Management Tools (7)
alert_rule_create
Purpose: Create typed alert rule (threshold / rate / anomaly / composite). Rule types:
| Type | Condition fields | Notes |
|---|---|---|
| threshold | metric, operator (gt/lt/gte/lte/eq), value/threshold, duration | Standard level check |
| rate | metric, window, operator, value | Rate-of-change detection |
| anomaly | metric, method (zscore/iqr/mad) | Statistical outlier detection |
| composite | rules[] (sub-rule IDs) | Boolean AND of sub-rules |
Algorithm: `getAlertRuleEngine().createRule(config)` → persists the rule.
Returns: `{ rule_id, name, type, condition, severity, channels, enabled: true }`
alert_rule_update
`getAlertRuleEngine().updateRule(ruleId, updates)`.
Updatable fields: `name`, `description`, `enabled`, `severity`, `channels`, `condition`.
alert_rule_delete
`getAlertRuleEngine().deleteRule(ruleId)`.
alert_rule_list
`getAlertRuleEngine().listRules({ enabledOnly })`. Returns all rules with current state.
alert_resolve
`getAlertManager().resolve(alertId, resolution)` → sets status resolved and attaches a note.
alert_silence
`getAlertRuleEngine().silenceRule(ruleId, duration, reason)`.
Duration strings: "30m", "1h", "1d" parsed to milliseconds.
alert_list
`getAlertManager().listAlerts({ status, severity, timeRange, limit })`.
Status filter: active/acknowledged/resolved/all. Severity filter: critical/warning/info/all.
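The duration strings accepted by `alert_silence` ("30m", "1h", "1d") can be parsed with a small helper; the `s` unit here is an assumption beyond the three documented forms:

```javascript
// Sketch of "30m"/"1h"/"1d" duration parsing. The donor's actual parser
// is not preserved; this matches only the stated string format.
const UNIT_MS = { s: 1000, m: 60000, h: 3600000, d: 86400000 };

function parseDuration(str) {
  const m = /^(\d+)([smhd])$/.exec(str);
  if (!m) throw new Error(`invalid duration: ${str}`);
  return Number(m[1]) * UNIT_MS[m[2]];
}
```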
Export Tools
observe_metrics_export
Purpose: Export metrics in json/prometheus/csv format.

Algorithm:
- `queryMetrics(range)` from `db/timeseries.js`.
- Format:
  - json: raw objects array.
  - prometheus: `metric_name{labels} value timestamp`.
  - csv: header row + data rows.
- If `destination` is provided: write to the file path.

Returns: `{ format, count, data: string|object[], destination? }`
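The prometheus branch emits one `name{labels} value timestamp` line per point, per the Prometheus text exposition format; the point field names below are assumptions about the `queryMetrics` result shape:

```javascript
// Sketch of the prometheus formatting branch: one exposition-format
// line per data point. Field names { name, labels, value, timestamp }
// are assumed, not the donor's documented shape.
function toPrometheus(points) {
  return points
    .map(({ name, labels = {}, value, timestamp }) => {
      const pairs = Object.entries(labels)
        .map(([k, v]) => `${k}="${String(v).replace(/"/g, '\\"')}"`)
        .join(",");
      const labelStr = pairs ? `{${pairs}}` : "";
      return `${name}${labelStr} ${value} ${timestamp}`;
    })
    .join("\n");
}
```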
observe_report_generate
Purpose: Generate periodic observability reports. Report types: daily / weekly / monthly / custom.

Algorithm:
- Determine the period from `type` or an explicit `period.from`/`period.to`.
- Aggregate: usage, cost, error rates, top tools, health events.
- If `recipients[]` is provided: send via the configured email channel.

Returns: `{ type, period, summary: { calls, errors, cost_usd, top_tools }, sections[] }`
Time Series Module (src/db/timeseries.js)
Functions imported by observability:
`storeMetric(name, value, labels?, timestamp?): Promise<void>`
Insert a metric data point into the `ts_metrics` table.
`queryMetrics(options): Promise<object[]>`
Query time series data with range, name filter, label filter, downsampling.
`getMetricStats(name, range): Promise<object>`
Returns min/max/avg/count/p50/p95/p99 for a named metric in a time range.
`downsampleMetrics(name, resolution, range): Promise<object[]>`
Reduce metric resolution (LTTB or average-bucket algorithm).
`applyRetentionPolicies(): Promise<object>`
Delete metrics older than the configured retention. Called periodically.
`getTimeSeriesStats(): Promise<object>`
Table statistics: row counts, oldest/newest timestamps, size estimate.
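The average-bucket variant of `downsampleMetrics` can be sketched as fixed-width time buckets averaged into one point each (LTTB, the other option named above, is more involved and omitted here):

```javascript
// Sketch of average-bucket downsampling: group points into fixed-width
// time buckets and emit one averaged point per bucket, in time order.
function downsampleAvg(points, resolutionMs) {
  const buckets = new Map();
  for (const { timestamp, value } of points) {
    const bucket = Math.floor(timestamp / resolutionMs) * resolutionMs;
    const agg = buckets.get(bucket) ?? { sum: 0, count: 0 };
    agg.sum += value;
    agg.count += 1;
    buckets.set(bucket, agg);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([timestamp, { sum, count }]) => ({ timestamp, value: sum / count }));
}
```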
Configuration
| Config key | Default | Purpose |
|---|---|---|
| `MAX_TOKEN_HISTORY` | 10000 | In-memory token tracking limit |
| `LOG_CACHE_TTL` | 300000 (5 min) | Log analysis cache lifetime |
| `observe_health_check.deep` | false | Enables disk+index checks |
| `alert_rule_create.severity` | "warning" | Default rule severity |
| `alert_rule_create.type` | "threshold" | Default rule type |
| `observe_latency_dashboard.percentiles` | [50, 95, 99] | Default percentile array |
| `observe_throughput_metrics.interval` | "5m" | Default aggregation interval |
| `observe_metrics_export.format` | "json" | Default export format |
Alert Severity Levels
`ALERT_SEVERITY`: critical > warning > info
`ALERT_STATUS` / `MANAGER_ALERT_STATUS`: firing / pending / acknowledged / resolved
`CHANNEL_SEVERITY`: critical / warning / info (routes to different channels)
Alert Rule Type Keywords
`ALERT_RULE_TYPES`: threshold, rate, anomaly, composite
`ALERT_OPERATORS`: gt, lt, gte, lte, eq
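Threshold evaluation over `ALERT_OPERATORS` reduces to a small dispatch table; `evaluateThreshold` and its rule shape are illustrative, not the donor engine's actual API:

```javascript
// Dispatch table covering the ALERT_OPERATORS listed above.
const OPERATORS = {
  gt: (a, b) => a > b,
  lt: (a, b) => a < b,
  gte: (a, b) => a >= b,
  lte: (a, b) => a <= b,
  eq: (a, b) => a === b,
};

// Illustrative rule shape: { operator, threshold }.
function evaluateThreshold(rule, metricValue) {
  const op = OPERATORS[rule.operator];
  if (!op) throw new Error(`unknown operator: ${rule.operator}`);
  return op(metricValue, rule.threshold);
}
```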