Observability — Function Reference

⚠ HERITAGE EXTRACTION — donor AMS observability surface (Wave 8 quarantine)

This file extracts the donor AMS observability tools from src/tools/observability.js (deleted R53). The 22-tool donor surface (observe_*, alert_*, metric_*) is donor accretion. Phase 0 Colibri ships none of these tools. Observability in Phase 0 is structured logs only — no MCP-exposed metrics surface, no alert engine, no mcp_action-style timeseries. The 19 Phase 0 tools are listed in ../mcp-tools-phase-0.md. Observability tooling is Phase 1+ territory.

Read this file as donor genealogy only.

Core Algorithm

The observability layer provides 22 MCP tools grouped into 6 categories:

  • Usage Tracking (3): Claude usage, cost, and token tracking
  • Performance (3): latency, throughput, error rates
  • Audit & Logs (3): realtime audit, log analysis, trace flow
  • Health & Alerts (5): health check, system metrics, alert create/list/acknowledge
  • Alert Management (7): rule create/update/delete/list, alert resolve/silence/list
  • Export (2): metrics export, report generate

All metric data flows through the mcp_action table (SQLite) for historical queries. Real-time metrics use the getMetrics() middleware and getMetricsCollector(). Alerts use a pluggable engine (getAlertEngine, getAlertRuleEngine) with channel routing.


Tool Catalog

Usage Tracking

observe_claude_usage

Purpose: Aggregate action counts by time bucket with tool and session breakdowns. Algorithm:

  1. Map period → timeExpr (hour/day/week/month → SQLite -N unit).
  2. Map granularity → timeFormat (raw/hourly/daily → strftime format).
  3. Apply optional filters (project_id, user_id, tool_name) to WHERE clause.
  4. Query mcp_action with strftime(timeFormat, timestamp) GROUP BY time_bucket → { actions, tools, sessions }.
  5. Secondary query: mcp_action GROUP BY tool_name → top tools.
  6. Subquery: per-session action counts → avg actions per session.

Returns: { period, granularity, filters, summary: { total_sessions, total_actions, avg_actions_per_session }, timeline[], tool_breakdown[] }
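The bucketing steps above can be sketched as a query builder. This is a minimal sketch, not the donor handler: the schema (mcp_action with timestamp, tool_name, session_id columns) and the exact period/granularity mappings are assumptions inferred from the step descriptions.

```javascript
// Assumed period → SQLite datetime modifier and granularity → strftime format maps.
const TIME_EXPR = { hour: '-1 hours', day: '-1 days', week: '-7 days', month: '-30 days' };
const TIME_FORMAT = { raw: '%Y-%m-%dT%H:%M:%S', hourly: '%Y-%m-%d %H:00', daily: '%Y-%m-%d' };

function buildUsageQuery({ period = 'day', granularity = 'hourly', tool_name } = {}) {
  const where = [`timestamp >= datetime('now', ?)`];
  const params = [TIME_EXPR[period]];
  if (tool_name) { where.push('tool_name = ?'); params.push(tool_name); }
  const sql =
    `SELECT strftime('${TIME_FORMAT[granularity]}', timestamp) AS time_bucket,
            COUNT(*) AS actions,
            COUNT(DISTINCT tool_name) AS tools,
            COUNT(DISTINCT session_id) AS sessions
       FROM mcp_action
      WHERE ${where.join(' AND ')}
      GROUP BY time_bucket
      ORDER BY time_bucket`;
  return { sql, params };
}
```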

observe_claude_cost

Purpose: Cost analytics by time range and dimension. Algorithm:

  1. Determine from/to (default: last 30 days).
  2. group_by → SQL dimension: day|hour → strftime bucket, project → project_id, tool → tool_name, user → user_id.
  3. Query mcp_action grouped by selected dimension.
  4. Cost estimation: token-based pricing model (per tool/model rate table — implementation detail in handler).

Returns: { time_range, group_by, format, groups[], total_cost_usd, breakdown }

observe_token_tracking

Purpose: Token usage by session or global. Data source: In-memory tokenStore Map (max 10000 entries) + usageStore. Filtering: By session_id and/or model. Returns: { session_id?, model?, tokens: { input, output, total }, cost_estimate, history[] }
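The bounded tokenStore above can be sketched as a Map with a hard size cap. Only the 10000-entry limit is documented; the eviction policy (drop the oldest entry, relying on Map insertion order) and the record shape are assumptions for illustration.

```javascript
const MAX_TOKEN_HISTORY = 10000; // documented in-memory limit

const tokenStore = new Map();

function recordTokens(sessionId, { input, output }) {
  const prev = tokenStore.get(sessionId) ?? { input: 0, output: 0 };
  tokenStore.delete(sessionId); // re-insert so this session counts as most recent
  tokenStore.set(sessionId, { input: prev.input + input, output: prev.output + output });
  if (tokenStore.size > MAX_TOKEN_HISTORY) {
    // Maps iterate in insertion order, so the first key is the oldest entry.
    tokenStore.delete(tokenStore.keys().next().value);
  }
}

function getTokens(sessionId) {
  const t = tokenStore.get(sessionId);
  return t ? { ...t, total: t.input + t.output } : null;
}
```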


Performance

observe_latency_dashboard

Purpose: Latency percentiles for system operations. Algorithm:

  1. getMetrics() + getSlowOperations() from middleware.
  2. Calculate percentiles (default [50, 95, 99]) over operation durations.
  3. Apply time_range filter if provided.

Returns: { percentiles: { p50, p95, p99 }, slow_operations[], by_tool{}, time_range }
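Step 2 can be sketched with a linear-interpolation percentile over the collected durations. The donor's exact interpolation method is not documented; this is one common choice.

```javascript
// Percentile with linear interpolation between closest ranks; input must be sorted.
function percentile(sorted, p) {
  if (sorted.length === 0) return null;
  const idx = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(idx), hi = Math.ceil(idx);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

function latencyPercentiles(durationsMs, ps = [50, 95, 99]) {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return Object.fromEntries(ps.map(p => [`p${p}`, percentile(sorted, p)]));
}
```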

observe_throughput_metrics

Purpose: Request throughput at specified intervals. Algorithm:

  1. getMetricsCollector() → current metrics snapshot.
  2. Aggregate by interval (1m/5m/15m/1h).

Returns: { interval, rps: current requests-per-second, timeline[], peak_rps, avg_rps }

observe_error_rates

Purpose: Error rates by severity or tool. Algorithm:

  1. Query mcp_action (or error log table) grouped by tool_name.
  2. Filter by severity and tool if provided.
  3. Calculate error_rate = errors / total_calls per group.

Returns: { severity, tool?, by_tool[], overall_rate, time_range }
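Step 3's rollup can be sketched as a reduce over the grouped query rows. The row shape ({ tool_name, total_calls, errors }) is an assumption inferred from the description.

```javascript
function errorRates(rows) {
  // rows: [{ tool_name, total_calls, errors }] from the GROUP BY tool_name query
  const by_tool = rows.map(r => ({
    tool: r.tool_name,
    error_rate: r.total_calls ? r.errors / r.total_calls : 0,
  }));
  const totals = rows.reduce(
    (a, r) => ({ calls: a.calls + r.total_calls, errors: a.errors + r.errors }),
    { calls: 0, errors: 0 }
  );
  return { by_tool, overall_rate: totals.calls ? totals.errors / totals.calls : 0 };
}
```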

Audit & Logs

observe_audit_realtime

Purpose: Real-time audit data with session/tool/user filters. Algorithm:

  1. Build WHERE from filters (session_id, tool_name, user_id, time_window_minutes).
  2. Query mcp_action ORDER BY timestamp DESC.
  3. If stream=true: return a polling hint (true streaming is not supported in the MCP sync model).

Returns: { actions[], count, window_minutes, filters }

observe_log_analyze

Purpose: Application log search and analysis. Cache: logAnalysisCache Map with 5-min TTL. Algorithm:

  1. Check cache by (query+time_range+level) key.
  2. Read log files from AMS_ROOT log directory (using readdirSync/readFileSync).
  3. Filter by level and keyword query (simple string search).
  4. Truncate to limit (max 1000, default 100).

Returns: { logs[], count, query, level, time_range }
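The documented 5-minute TTL cache (step 1) can be sketched as follows. The composite key format and the stored-entry shape are assumptions; only the Map, key inputs, and TTL are documented.

```javascript
const LOG_CACHE_TTL = 5 * 60 * 1000; // 300000 ms, matching the config default

const logAnalysisCache = new Map();

function cacheKey(query, timeRange, level) {
  return `${query}|${timeRange}|${level}`;
}

function cacheGet(key, now = Date.now()) {
  const hit = logAnalysisCache.get(key);
  if (!hit) return undefined;
  if (now - hit.at > LOG_CACHE_TTL) {
    logAnalysisCache.delete(key); // expired: drop lazily on read
    return undefined;
  }
  return hit.value;
}

function cacheSet(key, value, now = Date.now()) {
  logAnalysisCache.set(key, { value, at: now });
}
```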

observe_trace_flow

Purpose: Trace a request session through the system. Data source: sessionTraces Map + mcp_action + mcp_thought tables. Algorithm:

  1. Fetch all actions for session_id from mcp_action.
  2. Fetch all thoughts linked to session from mcp_thought.
  3. Build trace timeline sorted by step_index.
  4. If visualize=true: build graph nodes/edges for rendering.

Returns: { session_id, actions[], thoughts[], timeline[], visualize?: { nodes, edges } }
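Step 3's merge of actions and thoughts into one step_index-ordered timeline can be sketched as below; the record shapes are assumed from the table names.

```javascript
// Tag each record with its source, then sort the union by step_index.
function buildTimeline(actions, thoughts) {
  return [
    ...actions.map(a => ({ kind: 'action', ...a })),
    ...thoughts.map(t => ({ kind: 'thought', ...t })),
  ].sort((a, b) => a.step_index - b.step_index);
}
```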

Health & Alerts

observe_health_check

Purpose: System health check with optional deep inspection. Components checked:

  • db: SQLite SELECT 1, table count, size on disk
  • disk: statSync(AMS_ROOT), free space estimation
  • memory: process.memoryUsage()
  • audit: row count in mcp_action, latest timestamp

Deep check (deep=true): also checks index counts and the slow query log.

Returns: { status: "healthy"|"degraded"|"unhealthy", components: { db, disk, memory, audit }, checks[] }
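One plausible rollup from component states to the overall status is sketched below. The donor's actual thresholds and per-component state labels ("ok"/"warn"/"fail" here) are not documented; only the three overall statuses are.

```javascript
// Any failing component makes the system unhealthy; any warning degrades it.
function rollupHealth(components) {
  const states = Object.values(components).map(c => c.status);
  if (states.includes('fail')) return 'unhealthy';
  if (states.includes('warn')) return 'degraded';
  return 'healthy';
}
```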


observe_system_metrics

Purpose: CPU, memory, database metrics snapshot. Algorithm:

  1. cpu=true: process.cpuUsage() → user + system time in microseconds.
  2. memory=true: process.memoryUsage() → heapUsed, heapTotal, rss, external.
  3. db=true: getDb() → query sqlite_master for table/index counts + mcp_action row count.

Returns: { cpu?: { user_ms, system_ms }, memory?: { heap_used_mb, heap_total_mb, rss_mb }, db?: { tables, indexes, action_count } }
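The memory branch (step 2) reduces to a unit conversion over Node's process.memoryUsage() output; the two-decimal rounding here is an assumption about the donor's formatting.

```javascript
// Convert the byte counts from process.memoryUsage() into the documented MB fields.
function memorySnapshot(mu = process.memoryUsage()) {
  const mb = (bytes) => Math.round((bytes / 1024 / 1024) * 100) / 100;
  return {
    heap_used_mb: mb(mu.heapUsed),
    heap_total_mb: mb(mu.heapTotal),
    rss_mb: mb(mu.rss),
  };
}
```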

observe_alert_create

Purpose: Create a simple threshold-based alert rule (inline storage). Algorithm:

  1. Validate condition.metric, condition.operator, condition.threshold.
  2. Increment alertRuleId++, store in alertRules Map.
  3. Schedule background check via getAlertEngine().evaluate() (polling-based).

Actions: log | webhook | email. Channels array supported.

Returns: { alert_id, condition, action, created_at }
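The condition check in step 1 can be sketched with a small operator table (operators per ALERT_OPERATORS); this is a minimal evaluation sketch, not the donor alert engine.

```javascript
// Map each ALERT_OPERATORS keyword to a comparison function.
const OPS = {
  gt:  (v, t) => v > t,
  lt:  (v, t) => v < t,
  gte: (v, t) => v >= t,
  lte: (v, t) => v <= t,
  eq:  (v, t) => v === t,
};

function evaluateCondition(value, { operator, threshold }) {
  const op = OPS[operator];
  if (!op) throw new Error(`unknown operator: ${operator}`);
  return op(value, threshold);
}
```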

observe_alert_list

Purpose: List firing/pending/acknowledged alerts. Filter: status ∈ all/firing/pending/acknowledged. Returns: { alerts[], count, status }


observe_alert_acknowledge

Purpose: Acknowledge an active alert. Algorithm: getAlertManager().acknowledge(alertId) → set status to acknowledged. Returns: { acknowledged, alert_id }


Alert Management Tools (7)

alert_rule_create

Purpose: Create typed alert rule (threshold / rate / anomaly / composite). Rule types:

  • threshold: metric, operator (gt/lt/gte/lte/eq), value/threshold, duration. Standard level check.
  • rate: metric, window, operator, value. Rate-of-change detection.
  • anomaly: metric, method (zscore/iqr/mad). Statistical outlier detection.
  • composite: rules[] (sub-rule IDs). Boolean AND of sub-rules.

Algorithm: getAlertRuleEngine().createRule(config) → persists rule.

Returns: { rule_id, name, type, condition, severity, channels, enabled: true }


alert_rule_update

getAlertRuleEngine().updateRule(ruleId, updates). Updatable fields: name, description, enabled, severity, channels, condition.


alert_rule_delete

getAlertRuleEngine().deleteRule(ruleId).


alert_rule_list

getAlertRuleEngine().listRules({ enabledOnly }). Returns all rules with current state.


alert_resolve

getAlertManager().resolve(alertId, resolution) → set status resolved + note.


alert_silence

getAlertRuleEngine().silenceRule(ruleId, duration, reason). Duration strings: “30m”, “1h”, “1d” parsed to milliseconds.
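The duration parsing can be sketched as below. Only "30m", "1h", and "1d" are documented; the seconds unit and the regex shape are assumptions.

```javascript
// Millisecond multipliers for each supported unit suffix.
const UNIT_MS = { s: 1000, m: 60_000, h: 3_600_000, d: 86_400_000 };

function parseDuration(str) {
  const m = /^(\d+)([smhd])$/.exec(str);
  if (!m) throw new Error(`bad duration: ${str}`);
  return Number(m[1]) * UNIT_MS[m[2]];
}
```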


alert_list

getAlertManager().listAlerts({ status, severity, timeRange, limit }). Status filter: active/acknowledged/resolved/all. Severity filter: critical/warning/info/all.


Export Tools

observe_metrics_export

Purpose: Export metrics in json/prometheus/csv format. Algorithm:

  1. queryMetrics(range) from db/timeseries.js.
  2. Format:
    • json: raw objects array.
    • prometheus: metric_name{labels} value timestamp.
    • csv: header row + data rows.
  3. If destination provided: write to file path.

Returns: { format, count, data: string|object[], destination? }
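The prometheus branch of step 2 emits one exposition-format line per data point. A sketch, with the record shape assumed:

```javascript
// Render one metric record as: metric_name{label="value",...} value timestamp
function toPrometheusLine({ name, labels = {}, value, timestamp }) {
  const labelStr = Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(',');
  const body = labelStr ? `${name}{${labelStr}}` : name;
  return `${body} ${value} ${timestamp}`;
}
```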

observe_report_generate

Purpose: Generate periodic observability reports. Report types: daily / weekly / monthly / custom. Algorithm:

  1. Determine period from type or explicit period.from/to.
  2. Aggregate: usage, cost, error rates, top tools, health events.
  3. If recipients[] provided: send via configured email channel.

Returns: { type, period, summary: { calls, errors, cost_usd, top_tools }, sections[] }

Time Series Module (src/db/timeseries.js)

Functions imported by observability:

storeMetric(name, value, labels?, timestamp?): Promise<void>

Insert metric data point into ts_metrics table.

queryMetrics(options): Promise<object[]>

Query time series data with range, name filter, label filter, downsampling.

getMetricStats(name, range): Promise<object>

Returns min/max/avg/count/p50/p95/p99 for a named metric in time range.

downsampleMetrics(name, resolution, range): Promise<object[]>

Reduce metric resolution (LTTB or average bucket algorithm).
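The average-bucket variant (the simpler of the two methods named above) can be sketched as a single pass over sorted points; the point shape ({ t, v }) is an assumption.

```javascript
// Group points into fixed-width time buckets and emit the per-bucket average.
function downsampleAvg(points, bucketMs) {
  const buckets = new Map();
  for (const { t, v } of points) {
    const key = Math.floor(t / bucketMs) * bucketMs; // bucket start timestamp
    const b = buckets.get(key) ?? { sum: 0, n: 0 };
    b.sum += v;
    b.n += 1;
    buckets.set(key, b);
  }
  return [...buckets.entries()].map(([t, { sum, n }]) => ({ t, v: sum / n }));
}
```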

applyRetentionPolicies(): Promise<object>

Delete metrics older than configured retention. Called periodically.

getTimeSeriesStats(): Promise<object>

Table statistics: row counts, oldest/newest timestamps, size estimate.


Configuration

Config defaults:

  • MAX_TOKEN_HISTORY = 10000: in-memory token tracking limit
  • LOG_CACHE_TTL = 300000 (5 min): log analysis cache lifetime
  • observe_health_check.deep = false: enables disk+index checks
  • alert_rule_create.severity = "warning": default rule severity
  • alert_rule_create.type = "threshold": default rule type
  • observe_latency_dashboard.percentiles = [50, 95, 99]: default percentile array
  • observe_throughput_metrics.interval = "5m": default aggregation interval
  • observe_metrics_export.format = "json": default export format

Alert Severity Levels

ALERT_SEVERITY: critical > warning > info
ALERT_STATUS / MANAGER_ALERT_STATUS: firing / pending / acknowledged / resolved
CHANNEL_SEVERITY: critical / warning / info (routes to different channels)

Alert Rule Type Keywords

ALERT_RULE_TYPES: threshold, rate, anomaly, composite
ALERT_OPERATORS: gt, lt, gte, lte, eq



Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.
