CogniMesh Documents — Algorithm Extraction
Algorithmic content extracted from projects/ckamal/src/domains/documents/ (P0.5) for Colibri absorption. Implementation target: src/domains/documents/.
VectorStore API
Core semantic search and indexing interface:
class VectorStore {
  // Index a document for semantic search
  async index(doc)
    // doc: { id, content, metadata?, companyId? }
    // → Generates embedding for doc.content
    // → Stores (id, embedding, metadata) in vector store
    // → Returns: { id, indexed: true }

  // Search the vector store
  async search(query, options)
    // query: plain-text search query
    // options: { topK: number, threshold: number, filters?: object }
    //   topK: number of results to return (default: 10)
    //   threshold: minimum similarity score to include (default: 0.0)
    //   filters: metadata filters (e.g., { companyId, type })
    // → Returns: { results: [{ id, score, metadata, content }] }
    //   score: hybrid_score (semantic * 0.7 + keyword * 0.3)

  // Remove a document from the index
  async remove(id)
    // → Deletes embedding and metadata from vector store
    // → Returns: { id, removed: true }
}
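Under the assumptions above (pre-normalized embeddings, metadata filters), the API can be sketched as a minimal in-memory store. Python is used for illustration; the class name, the synchronous methods, and the `embed_fn` parameter are hypothetical, and for brevity this sketch scores with the semantic component only rather than the full hybrid blend:

```python
class InMemoryVectorStore:
    """Hypothetical in-memory sketch of the VectorStore API."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # text -> L2-normalized vector
        self.docs = {}            # id -> (embedding, metadata, content)

    def index(self, doc):
        embedding = self.embed_fn(doc["content"])
        self.docs[doc["id"]] = (embedding, doc.get("metadata", {}), doc["content"])
        return {"id": doc["id"], "indexed": True}

    def search(self, query, top_k=10, threshold=0.0, filters=None):
        q = self.embed_fn(query)
        results = []
        for doc_id, (emb, meta, content) in self.docs.items():
            if filters and any(meta.get(k) != v for k, v in filters.items()):
                continue
            # Unit vectors, so the dot product is the cosine similarity.
            score = sum(a * b for a, b in zip(q, emb))
            if score >= threshold:
                results.append({"id": doc_id, "score": score,
                                "metadata": meta, "content": content})
        results.sort(key=lambda r: r["score"], reverse=True)
        return {"results": results[:top_k]}

    def remove(self, doc_id):
        self.docs.pop(doc_id, None)
        return {"id": doc_id, "removed": True}
```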
Hybrid Scoring Formula
Every search result score is computed as a weighted combination:
hybrid_score = (semantic_score * 0.7) + (keyword_score * 0.3)
Where:
- semantic_score: cosine similarity between query embedding and document embedding (range: 0.0–1.0)
- keyword_score: BM25 or TF-IDF keyword match score, normalized to 0.0–1.0
- Weights: semantic dominates (70%) but keyword relevance contributes (30%)
Results are filtered by threshold, then ordered by hybrid_score descending.
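A quick worked example of the blend (the input scores here are illustrative, not from the source):

```python
def hybrid_score(semantic_score, keyword_score):
    # Weighted blend: semantic dominates (70%), keyword contributes (30%).
    return semantic_score * 0.7 + keyword_score * 0.3

# A document that matches the query semantically but shares few exact terms:
score = hybrid_score(0.9, 0.2)  # 0.63 + 0.06 ≈ 0.69
```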
When to use each component
| Signal | Component | Strength |
|---|---|---|
| Conceptual/semantic similarity | semantic_score | Catches paraphrasing, synonyms |
| Exact term match | keyword_score | Catches specific names, codes, identifiers |
| Combined | hybrid_score | Balances both |
Cosine Similarity Computation
def cosine_similarity(vec_a, vec_b):
    # Both vectors are L2-normalized (unit vectors):
    # after L2 normalization, ||v|| = 1 for all v.
    # Therefore: cosine_similarity = dot(v_a, v_b) / (||v_a|| * ||v_b||)
    #                              = dot(v_a, v_b)  (since both are unit vectors)
    dot_product = sum(a_i * b_i for a_i, b_i in zip(vec_a, vec_b))
    # Simplified because vectors are pre-normalized:
    return dot_product
# Full form (when vectors may not be normalized):
from math import sqrt

def cosine_similarity_general(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    mag_a = sqrt(sum(a**2 for a in vec_a))
    mag_b = sqrt(sum(b**2 for b in vec_b))
    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot / (mag_a * mag_b)
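Once both vectors are normalized, the shortcut and the general form return the same value. A quick check (the `l2_normalize` helper here is a hypothetical stand-in for the one in src/services/embeddings.js):

```python
from math import sqrt

def l2_normalize(vec):
    # Scale a vector to unit length (hypothetical helper).
    mag = sqrt(sum(x * x for x in vec))
    return [x / mag for x in vec] if mag else vec

a = l2_normalize([3.0, 4.0])  # -> [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # -> [0.8, 0.6]

# Shortcut: plain dot product on unit vectors
dot = sum(x * y for x, y in zip(a, b))

# General form: divide by the (unit) magnitudes; same result
general = sum(x * y for x, y in zip(a, b)) / (
    sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))
```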
Postgres Mirror Pattern
The documents service uses a dual-storage architecture:
Write path:
client.createDocument(data)
│
├─► Primary: Postgres (or SQLite in Colibri)
│ Full document stored: id, content, metadata, revisions, company
│ Returns document record
│
└─► Cache: Memory (in-process)
LRU cache by document id
TTL: configurable (default 5 minutes)
Eviction: LRU on capacity exceeded
Read path:
client.getDocument(id)
│
├─ Check memory cache → HIT: return immediately
│
└─ MISS: Query primary Postgres/SQLite → populate cache → return
Search path (vector):
client.searchDocuments(query)
│
└─► VectorStore.search(query) → returns ids + scores
│
└─► Batch fetch full docs from primary by ids
→ Return enriched results
In Colibri’s absorption, Postgres is replaced by SQLite (the existing DB layer). The mirror pattern means: SQLite is the source of truth for document content; the memory cache accelerates repeated reads; the vector store (SQLite-backed TF-IDF) handles semantic queries.
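The read path's cache behavior (LRU eviction plus a 5-minute TTL) can be sketched as follows; the class name, the capacity value, and the `db_fetch` callback are illustrative assumptions, not the actual implementation:

```python
import time
from collections import OrderedDict

class DocumentCache:
    """Sketch of the read-path cache: LRU eviction plus per-entry TTL."""

    def __init__(self, capacity=128, ttl_seconds=300):  # default TTL: 5 minutes
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.entries = OrderedDict()  # id -> (document, expires_at)

    def get(self, doc_id):
        entry = self.entries.get(doc_id)
        if entry is None:
            return None                       # MISS
        doc, expires_at = entry
        if time.monotonic() > expires_at:     # expired: treat as MISS
            del self.entries[doc_id]
            return None
        self.entries.move_to_end(doc_id)      # mark as recently used
        return doc                            # HIT

    def put(self, doc_id, doc):
        self.entries[doc_id] = (doc, time.monotonic() + self.ttl)
        self.entries.move_to_end(doc_id)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

def get_document(doc_id, cache, db_fetch):
    # Read path: cache first, fall through to the primary DB on a miss.
    doc = cache.get(doc_id)
    if doc is None:
        doc = db_fetch(doc_id)  # primary (Postgres/SQLite)
        cache.put(doc_id, doc)
    return doc
```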
EmbeddingProvider Interface
The documents service is decoupled from a specific embedding implementation:
// Interface (duck typing in JS)
interface EmbeddingProvider {
  async embed(text)
    // → Returns: Float32Array or number[] (the embedding vector)
    // → Vector must be L2-normalized
}

// Concrete providers:
class OpenAIEmbeddingProvider {
  constructor({ apiKey, model = "text-embedding-ada-002" })
  async embed(text) {
    response = await openai.embeddings.create({ input: text, model })
    return response.data[0].embedding // already normalized by OpenAI
  }
}

class LocalEmbeddingProvider {
  // Uses TF-IDF (src/services/embeddings.js) — no external API
  constructor({ dimensions = 128 })
  async embed(text) {
    tokens = wordTokenizer.tokenize(text)
    vector = tfidf_vectorize(tokens, dimensions)
    return l2_normalize(vector)
  }
}
// Chain provider: try primary, fall back to secondary
class ChainEmbeddingProvider {
  constructor(providers) // [primary, fallback, ...]
  async embed(text) {
    for (const provider of providers) {
      try {
        return await provider.embed(text)
      } catch (error) {
        continue // try next provider
      }
    }
    throw new Error("All embedding providers failed")
  }
}
In Colibri: LocalEmbeddingProvider is the default (no external API dependency). OpenAIEmbeddingProvider is available when OPENAI_API_KEY is set. ChainEmbeddingProvider enables graceful fallback.
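The fallback chain can also be sketched as runnable Python (synchronous for brevity; the real interface is async, and the provider classes below are stand-ins):

```python
class ChainEmbeddingProvider:
    """Sketch of the fallback chain: try each provider in order."""

    def __init__(self, providers):
        self.providers = providers  # [primary, fallback, ...]

    def embed(self, text):
        for provider in self.providers:
            try:
                return provider.embed(text)
            except Exception:
                continue  # this provider failed; try the next one
        raise RuntimeError("All embedding providers failed")
```

Usage mirrors the Colibri setup: put the API-backed provider first and the local TF-IDF provider last, so a network failure degrades to local embeddings instead of an error.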
DocumentService API (Full)
class DocumentService {
  async createDocument(data, companyId, userId)
    // data: { title, content, type, metadata }
    // → Stores document in primary DB
    // → Indexes embedding in VectorStore
    // → Creates initial revision (revision_num = 1)
    // → Returns: full document record

  async updateDocument(id, data, userId)
    // → Creates new revision record (increments revision_num)
    // → Does NOT delete previous revision (full history retained)
    // → Re-indexes updated content embedding
    // → Returns: updated document with new revision_num

  async listRevisions(docId)
    // → Returns: revision[] ordered by revision_num asc
    //   Each revision: { revision_num, content, updatedBy, updatedAt }

  async restoreRevision(docId, revisionNum, userId)
    // → Copies revision content to current document
    // → Creates new revision record (restored content becomes latest)
    // → Re-indexes embedding
    // → Returns: restored document

  async shareDocument(id, { targetCompanyId, permission }, userId)
    // permission: "read" | "comment" | "edit"
    // → Creates cross-company share record
    // → Returns: { shareId, documentId, targetCompanyId, permission }

  async deleteDocument(id, userId)
    // → SOFT DELETE: sets deleted_at timestamp; the row is not removed
    // → Removes from VectorStore index
    // → Returns: { id, deleted: true }

  async searchDocuments(query, companyId, options)
    // → Calls VectorStore.search(query, options)
    // → Filters by companyId (includes shared documents)
    // → Returns: enriched results with full document metadata
}
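The revision semantics above (append-only history, restore creates a new revision rather than rewinding) can be sketched minimally; the class and method names are hypothetical:

```python
class RevisionStore:
    """Sketch of the revision model: every update appends, nothing is deleted."""

    def __init__(self):
        self.current = {}    # doc_id -> current content
        self.revisions = {}  # doc_id -> [{revision_num, content}, ...]

    def create(self, doc_id, content):
        self.current[doc_id] = content
        self.revisions[doc_id] = [{"revision_num": 1, "content": content}]

    def update(self, doc_id, content):
        # Append a new revision; earlier revisions are retained.
        history = self.revisions[doc_id]
        history.append({"revision_num": history[-1]["revision_num"] + 1,
                        "content": content})
        self.current[doc_id] = content

    def restore(self, doc_id, revision_num):
        # Copy old content forward as a *new* revision; history stays intact.
        old = next(r for r in self.revisions[doc_id]
                   if r["revision_num"] == revision_num)
        self.update(doc_id, old["content"])
```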
Colibri Absorption Target
- Implementation: src/domains/documents/
- Files to create: document-service.js, document-repository.js, vector-store.js, embedding-providers.js, semantic-search-service.js
- Note: Postgres mirror → SQLite (existing getDb())
- Database tables:
  - documents(id, title, content, type, metadata, company_id, created_by, created_at, deleted_at)
  - document_revisions(id, document_id, revision_num, content, updated_by, updated_at)
  - document_shares(id, document_id, target_company_id, permission, granted_by, granted_at)
  - document_embeddings(document_id, embedding blob, dimensions, provider)
- Integration points:
  - ζ Decision Trail — long-form documentation attached to thought chains
  - RAG system (P0.2 merge) — connects to src/analysis/ hybrid search
- See implementation guide: [[guides/implementation/ckamal-extraction-guide CogniMesh Extraction Guide]] (P0.5)
See Also
- [[extractions/ckamal-approvals-extraction CogniMesh Approvals Extraction]] — companion P0.3 module
- [[extractions/ckamal-routines-extraction CogniMesh Routines Extraction]] — companion P0.4 module
- [[extractions/nu-integrations-extraction ν Integrations Extraction]] — TF-IDF embedding service details
- [[concepts/ζ-decision-trail ζ Decision Trail]] — thought records that reference documents
- [[guides/implementation/ckamal-extraction-guide CogniMesh Extraction Guide]] — full absorption plan