CogniMesh Documents — Algorithm Extraction

Algorithmic content extracted from projects/ckamal/src/domains/documents/ (P0.5) for Colibri absorption. Implementation target: src/domains/documents/.

VectorStore API

Core semantic search and indexing interface:

class VectorStore {

  // Index a document for semantic search
  async index(doc)
  // doc: { id, content, metadata?, companyId? }
  // → Generates embedding for doc.content
  // → Stores (id, embedding, metadata) in vector store
  // → Returns: { id, indexed: true }

  // Search the vector store
  async search(query, options)
  // query:   plain text search query
  // options: { topK?: number, threshold?: number, filters?: object }
  //   topK:      number of results to return (default: 10)
  //   threshold: minimum similarity score to include (default: 0.0)
  //   filters:   metadata filters (e.g., { companyId, type })
  // → Returns: { results: [{ id, score, metadata, content }] }
  //   score: hybrid_score (semantic * 0.7 + keyword * 0.3)

  // Remove a document from the index
  async remove(id)
  // → Deletes embedding and metadata from vector store
  // → Returns: { id, removed: true }
}
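
The contract above can be exercised with a minimal in-memory sketch. This is not the extracted implementation: the toy bag-of-words embedder and plain cosine score stand in for the real EmbeddingProvider and the hybrid score described below, and every internal name here is illustrative.

```javascript
// Toy embedder: sparse "vector" of token counts (stand-in for a real provider).
function toyEmbed(text) {
  const counts = {}
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    counts[token] = (counts[token] || 0) + 1
  }
  return counts
}

// Cosine similarity over sparse token-count objects.
function cosine(a, b) {
  let dot = 0, magA = 0, magB = 0
  for (const k in a) { magA += a[k] * a[k]; if (k in b) dot += a[k] * b[k] }
  for (const k in b) magB += b[k] * b[k]
  return magA && magB ? dot / Math.sqrt(magA * magB) : 0
}

class InMemoryVectorStore {
  constructor() { this.docs = new Map() }  // id -> { embedding, metadata, content }

  async index({ id, content, metadata = {} }) {
    this.docs.set(id, { embedding: toyEmbed(content), metadata, content })
    return { id, indexed: true }
  }

  async search(query, { topK = 10, threshold = 0.0, filters = {} } = {}) {
    const q = toyEmbed(query)
    const results = [...this.docs.entries()]
      .filter(([, d]) => Object.entries(filters).every(([k, v]) => d.metadata[k] === v))
      .map(([id, d]) => ({ id, score: cosine(q, d.embedding), metadata: d.metadata, content: d.content }))
      .filter((r) => r.score >= threshold)
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
    return { results }
  }

  async remove(id) {
    this.docs.delete(id)
    return { id, removed: true }
  }
}
```

Note that filtering happens before scoring here; a production store would typically push metadata filters down into the index rather than scan every document.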

Hybrid Scoring Formula

Every search result score is computed as a weighted combination:

hybrid_score = (semantic_score * 0.7) + (keyword_score * 0.3)

Where:

  • semantic_score: cosine similarity between query embedding and document embedding (range: 0.0–1.0)
  • keyword_score: BM25 or TF-IDF keyword match score, normalized to 0.0–1.0
  • Weights: semantic dominates (70%) but keyword relevance contributes (30%)

Results below the threshold are dropped; the remainder are returned ordered by hybrid_score descending.
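
The formula, thresholding, and ordering above can be sketched in a few lines; the helper names here are illustrative, not part of the extracted API.

```javascript
// Weighted combination from the formula above.
function hybridScore(semanticScore, keywordScore) {
  return semanticScore * 0.7 + keywordScore * 0.3
}

// Score candidates, drop those below the threshold, order by score descending,
// and keep the top K — mirroring the search() options.
function rankResults(candidates, { topK = 10, threshold = 0.0 } = {}) {
  return candidates
    .map((c) => ({ ...c, score: hybridScore(c.semanticScore, c.keywordScore) }))
    .filter((c) => c.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}
```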

When to use each component

Signal                          Component       Strength
Conceptual/semantic similarity  semantic_score  Catches paraphrasing, synonyms
Exact term match                keyword_score   Catches specific names, codes, identifiers
Combined                        hybrid_score    Balances both

Cosine Similarity Computation

from math import sqrt

def cosine_similarity(vec_a, vec_b):
  # Both vectors are L2-normalized (unit vectors), so ||v|| = 1 for all v.
  # Therefore: cosine_similarity = dot(v_a, v_b) / (||v_a|| * ||v_b||)
  #                              = dot(v_a, v_b)  (since both are unit vectors)
  dot_product = sum(a_i * b_i for a_i, b_i in zip(vec_a, vec_b))
  return dot_product  # simplified because vectors are pre-normalized

# Full form (when vectors may not be normalized):
def cosine_similarity_general(vec_a, vec_b):
  dot = sum(a * b for a, b in zip(vec_a, vec_b))
  mag_a = sqrt(sum(a**2 for a in vec_a))
  mag_b = sqrt(sum(b**2 for b in vec_b))
  if mag_a == 0 or mag_b == 0:
    return 0.0  # zero vector: define similarity as 0 to avoid division by zero
  return dot / (mag_a * mag_b)

Postgres Mirror Pattern

The documents service uses a dual-storage architecture:

Write path:
  client.createDocument(data)
    │
    ├─► Primary: Postgres (or SQLite in Colibri)
    │     Full document stored: id, content, metadata, revisions, company
    │     Returns document record
    │
    └─► Cache: Memory (in-process)
          LRU cache by document id
          TTL: configurable (default 5 minutes)
          Eviction: LRU on capacity exceeded

Read path:
  client.getDocument(id)
    │
    ├─ Check memory cache → HIT: return immediately
    │
    └─ MISS: Query primary Postgres/SQLite → populate cache → return

Search path (vector):
  client.searchDocuments(query)
    │
    └─► VectorStore.search(query) → returns ids + scores
          │
          └─► Batch fetch full docs from primary by ids
                → Return enriched results

In Colibri’s absorption, Postgres is replaced by SQLite (the existing DB layer). The mirror pattern means: SQLite is the source of truth for document content; the memory cache accelerates repeated reads; the vector store (SQLite-backed TF-IDF) handles semantic queries.
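
The read path can be sketched as a read-through cache over the primary store. This is a minimal sketch under stated assumptions: a Map-based LRU with TTL (Map iteration order doubles as recency order), and a hypothetical async SQLite wrapper with a get(sql, param) method; the class and function names are illustrative.

```javascript
// LRU cache with TTL, keyed by document id.
class DocumentCache {
  constructor({ capacity = 500, ttlMs = 5 * 60 * 1000 } = {}) {
    this.capacity = capacity
    this.ttlMs = ttlMs
    this.entries = new Map()  // insertion order serves as LRU order
  }
  get(id) {
    const entry = this.entries.get(id)
    if (!entry) return undefined
    if (Date.now() - entry.storedAt > this.ttlMs) {  // TTL expired
      this.entries.delete(id)
      return undefined
    }
    this.entries.delete(id)        // re-insert to mark as most recently used
    this.entries.set(id, entry)
    return entry.doc
  }
  set(id, doc) {
    if (this.entries.size >= this.capacity) {
      const oldest = this.entries.keys().next().value  // evict least recently used
      this.entries.delete(oldest)
    }
    this.entries.set(id, { doc, storedAt: Date.now() })
  }
}

// Read path: cache HIT returns immediately; MISS queries the primary and
// populates the cache. Soft-deleted rows are excluded (deleted_at IS NULL).
async function getDocument(id, cache, db) {
  const hit = cache.get(id)
  if (hit) return hit
  const doc = await db.get(
    "SELECT * FROM documents WHERE id = ? AND deleted_at IS NULL", id)
  if (doc) cache.set(id, doc)
  return doc
}
```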

EmbeddingProvider Interface

The documents service is decoupled from a specific embedding implementation:

// Interface (duck typing in JS)
interface EmbeddingProvider {
  async embed(text)
  // → Returns: Float32Array or number[] (the embedding vector)
  // → Vector must be L2-normalized
}

// Concrete providers:
class OpenAIEmbeddingProvider {
  constructor({ apiKey, model = "text-embedding-ada-002" }) {
    this.openai = new OpenAI({ apiKey })  // openai-node client
    this.model = model
  }
  async embed(text) {
    const response = await this.openai.embeddings.create({ input: text, model: this.model })
    return response.data[0].embedding  // already normalized by OpenAI
  }
}

class LocalEmbeddingProvider {
  // Uses TF-IDF (src/services/embeddings.js) — no external API
  constructor({ dimensions = 128 }) {
    this.dimensions = dimensions
  }
  async embed(text) {
    const tokens = wordTokenizer.tokenize(text)
    const vector = tfidf_vectorize(tokens, this.dimensions)
    return l2_normalize(vector)
  }
}

// Chain provider: try primary, fall back to secondary
class ChainEmbeddingProvider {
  constructor(providers) {  // [primary, fallback, ...]
    this.providers = providers
  }
  async embed(text) {
    for (const provider of this.providers) {
      try {
        return await provider.embed(text)
      } catch (error) {
        // swallow and try the next provider
      }
    }
    throw new Error("All embedding providers failed")
  }
}

In Colibri: LocalEmbeddingProvider is the default (no external API dependency). OpenAIEmbeddingProvider is available when OPENAI_API_KEY is set. ChainEmbeddingProvider enables graceful fallback.
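
Because the interface is duck-typed, the fallback behavior can be demonstrated with stub providers. The chainEmbed function below is an illustrative function-form restatement of the try-each-provider loop, not part of the extracted API, and the stubs are hypothetical.

```javascript
// Try each provider in order; return the first successful embedding.
async function chainEmbed(providers, text) {
  for (const provider of providers) {
    try {
      return await provider.embed(text)
    } catch (error) {
      // swallow and try the next provider
    }
  }
  throw new Error("All embedding providers failed")
}

// Stub providers: any object with an async embed(text) method qualifies.
const flaky = { async embed() { throw new Error("API down") } }
const local = { async embed(text) { return [text.length, 0, 0] } }
```

With these stubs, `await chainEmbed([flaky, local], "hello")` falls through the failing provider and returns the local stub's vector.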

DocumentService API (Full)

class DocumentService {

  async createDocument(data, companyId, userId)
  // data: { title, content, type, metadata }
  // → Stores document in primary DB
  // → Indexes embedding in VectorStore
  // → Creates initial revision (revision_num = 1)
  // → Returns: full document record

  async updateDocument(id, data, userId)
  // → Creates new revision record (increment revision_num)
  // → Does NOT delete previous revision (full history retained)
  // → Re-indexes updated content embedding
  // → Returns: updated document with new revision_num

  async listRevisions(docId)
  // → Returns: revision[] ordered by revision_num asc
  // Each revision: { revision_num, content, updatedBy, updatedAt }

  async restoreRevision(docId, revisionNum, userId)
  // → Copies revision content to current document
  // → Creates new revision record (current becomes latest)
  // → Re-indexes embedding
  // → Returns: restored document

  async shareDocument(id, { targetCompanyId, permission }, userId)
  // permission: "read" | "comment" | "edit"
  // → Creates cross-company share record
  // → Returns: { shareId, documentId, targetCompanyId, permission }

  async deleteDocument(id, userId)
  // → SOFT DELETE: sets deleted_at timestamp, not actually removed
  // → Removes from VectorStore index
  // → Returns: { id, deleted: true }

  async searchDocuments(query, companyId, options)
  // → Calls VectorStore.search(query, options)
  // → Filters by companyId (includes shared documents)
  // → Returns: enriched results with full document metadata
}
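
The revision rules above — every update appends a revision, history is never deleted, and restoring copies old content forward as a brand-new revision — can be sketched in memory. All names here are illustrative; the real service persists to the primary DB and re-indexes embeddings.

```javascript
// Append a new revision and advance the document; revision_num is monotonic.
function updateDocument(store, id, content, userId) {
  const doc = store.documents.get(id)
  const nextRev = doc.revision_num + 1
  store.revisions.push({
    document_id: id, revision_num: nextRev,
    content, updated_by: userId, updated_at: Date.now(),
  })
  doc.content = content
  doc.revision_num = nextRev
  return doc
}

// Restore = copy the old revision's content forward as the latest revision.
function restoreRevision(store, id, revisionNum, userId) {
  const rev = store.revisions.find(
    (r) => r.document_id === id && r.revision_num === revisionNum
  )
  return updateDocument(store, id, rev.content, userId)
}
```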

Colibri Absorption Target

  • Implementation: src/domains/documents/
  • Files to create: document-service.js, document-repository.js, vector-store.js, embedding-providers.js, semantic-search-service.js
  • Note: Postgres mirror → SQLite (existing getDb())
  • Database tables:
    • documents (id, title, content, type, metadata, company_id, created_by, created_at, deleted_at)
    • document_revisions (id, document_id, revision_num, content, updated_by, updated_at)
    • document_shares (id, document_id, target_company_id, permission, granted_by, granted_at)
    • document_embeddings (document_id, embedding blob, dimensions, provider)
  • Integration points:
    • ζ Decision Trail — long-form documentation attached to thought chains
    • RAG system (P0.2 merge) — connects to src/analysis/ hybrid search
  • See implementation guide: [[guides/implementation/ckamal-extraction-guide CogniMesh Extraction Guide]] P0.5

See Also

  • [[extractions/ckamal-approvals-extraction CogniMesh Approvals Extraction]] — companion P0.3 module
  • [[extractions/ckamal-routines-extraction CogniMesh Routines Extraction]] — companion P0.4 module
  • [[extractions/nu-integrations-extraction ν Integrations Extraction]] — TF-IDF embedding service details
  • [[concepts/ζ-decision-trail ζ Decision Trail]] — thought records that reference documents
  • [[guides/implementation/ckamal-extraction-guide CogniMesh Extraction Guide]] — full absorption plan

Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.
