
Global search

One header-level search box that reaches everything internal — and finds it by meaning, not just keywords.
2026-06-04 · Build plan / RFC · 5 phases · internal sources only · Touches: Supabase · pgvector · Gemini · Cloudflare

How to read this doc

If you want the shape in one read: §4 (two-layer architecture) and §9 (the five build phases). The decision that actually gates the harder work is §6 (per-row access control) — read that before assuming this is “just search.” Markdown source: docs/devtasks/global-search.md.

Contents

  1. Context & goal
  2. Scope — what’s in, what’s out
  3. The corpus — sources to index
  4. Architecture — two layers, one box
  5. The unified search index & ingestion pipeline
  6. Access control — per-row visibility
  7. The search endpoint
  8. UI — global header search + indocs upgrade
  9. Build sequence
  10. Open questions
  11. Risks & non-goals

1. Context & goal

Today the intranet has exactly one search surface: a Cmd-K command palette (apps/control/src/components/indocs/CommandPalette.tsx) that only searches the docs registry, only by title / category / tag, and only via client-side substring matching over rows already loaded in the browser. Everything else we own — tasks, the CRM, prompts, voice-note transcripts, the image catalog, spaces, people — has no global search at all. If you can’t remember which doc a thing lives in, you can’t find it.

The goal: one search box, available everywhere, that finds anything internal — even when you don’t type the exact words. Searching “how do we keep documents fresh” should surface the doc that says “registry backfill cadence.” It should feel instant as you type, and it should never be a wall of text.

This is grounded in a discovery pass (6 read-only scouts, 2026-06-04) that mapped the real corpus and infrastructure. The headline finding: “everything” is mostly already in our Supabase database — eight tables we own and can reach today, no new logins or external auth. That is the entire scope of this build.

Goal in one line

Promote search from an indocs-only, keyword-only, client-side palette to a global, semantic, server-backed search over every internal source — fast on every keystroke, smart enough to match meaning.

2. Scope — what’s in, what’s out

In scope (this build)

Explicitly out of scope (deferred)

3. The corpus — sources to index

Every realistic internal source, ranked by indexing priority. All but the research vault (local markdown files) live in Supabase and are reachable today.

| Source | Table | What it holds | Priority |
|---|---|---|---|
| Intranet docs | public.indocs | ~119 docs across 20 categories; markdown in body_md | Must-have |
| Tasks / work items | public.tasks + task_activity | titles, descriptions, comments, history | Must-have |
| Relations / CRM | public.relations | names, organizations, notes, email | Must-have |
| Prompts registry | public.prompts | prompt library (already has a working full-text index) | Must-have (free win) |
| Image catalog | public.images | prompt, alt text, tags (already indexed); blobs in R2 | Nice-to-have |
| Voice notes | public.voice_notes | transcripts + extracted action items | Nice-to-have |
| Spaces / property | public.spaces | name, description, features | Nice-to-have |
| People / team | public.app_users | names, emails | Nice-to-have |
| Research vault | obsid-sponic/ (local md) | Sonia’s BD/partner research; 57 files, sensitive | Nice-to-have, access-scoped only |

4. Architecture — two layers, one box

Good search feels both instant and smart. Those pull in opposite directions (keyword matching is instant; semantic matching needs a model call), so we run two layers and merge them into one ranked list.

Layer A — instant keyword (every keystroke)

Matches the actual words in titles, descriptions, tags, body text. Returns in milliseconds. Handles the obvious cases (“that doc about R2 backups”).

Layer B — semantic (catches what the words miss)

Matches by meaning — compares the math-fingerprint (“embedding”) of the query against each item’s fingerprint. Finds the “registry backfill cadence” doc when you typed “keep documents fresh.”
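To make "compares the fingerprint" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. It is an illustration only: in the real build the comparison would presumably run inside Postgres via pgvector rather than in application code, and the toy 3-dimensional vectors below are made up (real fingerprints would have 768 dimensions).

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; higher means closer in meaning.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: no direction to compare
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy fingerprints, invented for illustration.
const queryVector = [1, 0, 1];
const freshnessDocVector = [2, 0, 2]; // points the same way as the query
const unrelatedDocVector = [0, 1, 0]; // orthogonal to the query

const closeScore = cosineSimilarity(queryVector, freshnessDocVector);
const farScore = cosineSimilarity(queryVector, unrelatedDocVector);
```

The doc about freshness scores close to 1 and the unrelated doc scores 0, which is the ordering the semantic layer relies on.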

Why Gemini and not Azure or local

Azure embeddings are marginally cheaper per token but need fresh setup and burn a credit expiring 2026-07-31; ALPUCA’s local Ollama has no text-embedding endpoint ready. Gemini is already wired, free at our volume, and dimension-compatible — the clear default.

How they merge

The keyword layer returns immediately so the box never feels slow. The semantic layer (query fingerprint ~100–400ms) folds in a beat later and re-ranks. The user sees one ranked list that gets smarter as it settles — no mode toggle, no two separate result sets.

keystroke ──▶ keyword hits (tsvector + trgm) ──▶ shown instantly
          └─▶ embed query (Gemini ~100-400ms) ──▶ semantic hits ──▶ merge + re-rank ──▶ one list
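This doc does not fix a merge formula. One plausible option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the chosen design. The `Hit` shape, the `id` format, and the constant `k = 60` are all illustrative.

```typescript
interface Hit {
  id: string; // hypothetical key, e.g. "<source_type>:<source_id>"
  title: string;
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per hit,
// so hits ranked well in either list, or present in both, float up.
function mergeRanked(keyword: Hit[], semantic: Hit[], k = 60): Hit[] {
  const scores = new Map<string, { hit: Hit; score: number }>();
  for (const list of [keyword, semantic]) {
    list.forEach((hit, index) => {
      const entry = scores.get(hit.id) ?? { hit, score: 0 };
      entry.score += 1 / (k + index + 1); // rank is 1-based
      scores.set(hit.id, entry);
    });
  }
  return [...scores.values()]
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.hit);
}

const keywordHits: Hit[] = [
  { id: "doc:1", title: "R2 backups" },
  { id: "doc:2", title: "Registry backfill cadence" },
];
const semanticHits: Hit[] = [
  { id: "doc:2", title: "Registry backfill cadence" },
  { id: "task:9", title: "Refresh stale docs" },
];

const firstPaint = mergeRanked(keywordHits, []); // before the semantic layer answers
const settled = mergeRanked(keywordHits, semanticHits); // after it folds in
```

Calling the same function with an empty semantic list gives the instant keyword-only view; calling it again once the semantic hits arrive produces the settled list, with the doc found by both layers on top.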

5. The unified search index & ingestion pipeline

The real work isn’t the query — it’s getting every source into one searchable place and keeping it fresh. Recommended: a single search_index table with a consistent shape regardless of source.

| Column | Purpose |
|---|---|
| source_type | doc / task / contact / prompt / image / voice-note / space / person / research |
| source_id | points back to the original row |
| title, snippet | what shows in results |
| url | where clicking takes the user |
| search_text (tsvector) | the instant keyword layer |
| embedding (vector(768)) | the semantic layer |
| acl / visibility | who’s allowed to see this result (§6) |
| updated_at | freshness + sort |
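As a sketch of what "a consistent shape regardless of source" could look like in code, the columns above map onto one TypeScript type, plus one mapper per source. The `TaskRow` fields, the `/tasks/<id>` route, the 160-character snippet, and the `["member"]` default role are assumptions for illustration, not the actual schema.

```typescript
type SourceType =
  | "doc" | "task" | "contact" | "prompt" | "image"
  | "voice-note" | "space" | "person" | "research";

// One row of the proposed search_index table, as application code might see it.
interface SearchIndexRow {
  source_type: SourceType;
  source_id: string;
  title: string;
  snippet: string;
  url: string;
  search_text: string;        // a tsvector in Postgres; plain text in this sketch
  embedding: number[] | null; // vector(768); null until a fingerprint exists
  visibility: string[];       // hypothetical encoding: roles allowed to see the row
  updated_at: string;         // ISO timestamp
}

// Hypothetical shape of a public.tasks row; the real columns may differ.
interface TaskRow {
  id: string;
  title: string;
  description: string;
  updated_at: string;
}

function taskToIndexRow(task: TaskRow): SearchIndexRow {
  return {
    source_type: "task",
    source_id: task.id,
    title: task.title,
    snippet: task.description.slice(0, 160), // assumed snippet length
    url: `/tasks/${task.id}`,                // assumed route
    search_text: `${task.title} ${task.description}`,
    embedding: null,
    visibility: ["member"],                  // assumed default role
    updated_at: task.updated_at,
  };
}

const exampleRow = taskToIndexRow({
  id: "42",
  title: "Rotate R2 keys",
  description: "Quarterly rotation of the R2 access keys.",
  updated_at: "2026-06-04T00:00:00Z",
});
```

Each additional source would get its own small mapper into the same `SearchIndexRow`, which is what lets the endpoint query one table instead of nine.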

How each source stays fresh

Queries run through a single search endpoint (a Supabase edge function) so the static frontend never loads the whole corpus into the browser — it asks the endpoint, which runs keyword + semantic + visibility filtering and returns a ranked, paginated, grouped list.

6. Access control — per-row visibility (the real hidden cost)

This is the hard part, not the search

Today the intranet authorizes at the page level: any signed-in user can see any doc. A global search box would surface HR, Legal, CRM, and partner research to everyone who searches. Per-row visibility filtering does not exist yet — building it is the genuinely non-trivial engineering in this project, more than the search itself.

This must land before any sensitive source is added to the index. It’s safe to ship Phase 1–3 over non-sensitive sources first and add the restricted sources once ACL filtering is in (§9).
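A minimal sketch of what per-row filtering could look like, assuming a role-based policy. The policy itself is still an open question (§10), and the `allowedRoles` encoding and role names here are hypothetical. The sketch is deny-by-default: a row with no matching role is hidden.

```typescript
interface IndexedHit {
  title: string;
  source_type: string;
  allowedRoles: string[]; // hypothetical acl encoding: roles that may see this row
}

interface Requester {
  id: string;
  roles: string[];
}

// Keep only the hits the requester may see. Deny by default:
// a hit with no role in common with the user is dropped.
function filterVisible(hits: IndexedHit[], user: Requester): IndexedHit[] {
  return hits.filter((hit) =>
    hit.allowedRoles.some((role) => user.roles.includes(role)),
  );
}

const sampleHits: IndexedHit[] = [
  { title: "Onboarding guide", source_type: "doc", allowedRoles: ["member", "admin"] },
  { title: "HR salary bands", source_type: "doc", allowedRoles: ["admin"] },
  { title: "Unclassified draft", source_type: "doc", allowedRoles: [] },
];

const memberView = filterVisible(sampleHits, { id: "u1", roles: ["member"] });
const adminView = filterVisible(sampleHits, { id: "u2", roles: ["admin"] });
```

Deny-by-default means a row that was indexed without any visibility set never leaks; it simply stays invisible until someone classifies it.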

7. The search endpoint

A single Supabase edge function (search) is the one entry point:

  1. Takes the query string + the requesting user.
  2. Runs the keyword query (tsvector + pg_trgm) across search_index.
  3. Embeds the query via Gemini and runs the pgvector similarity query.
  4. Merges + re-ranks (keyword exact-ish hits and high-similarity semantic hits both float up), filters by the user’s visibility, groups by source_type, paginates.
  5. Returns a compact ranked list — title, snippet with the match highlighted, source type, url.

Keeping it server-side means the browser never holds the whole corpus (the current palette’s weakness) and visibility filtering can’t be bypassed client-side.
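The grouping and pagination in step 4 can be sketched as a pure function over an already merged, already visibility-filtered list. This version takes a page first and then groups within it; the doc does not specify the order of those two operations, so treat that and the type names as assumptions.

```typescript
interface RankedHit {
  title: string;
  source_type: string;
  url: string;
}

interface ResultGroup {
  source_type: string;
  hits: RankedHit[];
}

// Take one page from an already-ranked list, then group that page by source_type.
// Groups appear in the order of their best-ranked hit; rank order is kept inside each group.
function pageAndGroup(ranked: RankedHit[], page: number, pageSize: number): ResultGroup[] {
  const start = (page - 1) * pageSize; // pages are 1-based
  const groups = new Map<string, RankedHit[]>();
  for (const hit of ranked.slice(start, start + pageSize)) {
    const bucket = groups.get(hit.source_type) ?? [];
    bucket.push(hit);
    groups.set(hit.source_type, bucket);
  }
  return [...groups.entries()].map(([source_type, hits]) => ({ source_type, hits }));
}

const rankedList: RankedHit[] = [
  { title: "Registry backfill cadence", source_type: "doc", url: "/docs/a" },
  { title: "Refresh stale docs", source_type: "task", url: "/tasks/b" },
  { title: "Doc freshness policy", source_type: "doc", url: "/docs/c" },
  { title: "Sonia", source_type: "contact", url: "/relations/d" },
];

const pageOne = pageAndGroup(rankedList, 1, 3);
const pageTwo = pageAndGroup(rankedList, 2, 3);
```

Because a `Map` keeps insertion order, the group containing the top-ranked hit is always listed first, so grouping never buries the best result.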

8. UI — global header search + indocs upgrade

Recommended direction: a Cmd-K command palette promoted to the global header, with an optional full /search results page for big searches.

Why command-palette: the intranet already has a CommandPalette component, so we extend what exists rather than inventing a new section — consistent with the “integrate, don’t reinvent” rule. It’s the least overwhelming pattern: one box, Escape to dismiss, reachable from anywhere via the header and the Cmd-K shortcut.

The experience

The indocs search is folded into this, not left parallel. The current CommandPalette.tsx (indocs-only, title-only, client-side) is rebuilt on the new endpoint: same component, now backed by the server search and the richer result UI, scoped to docs when invoked from the indocs section and global when invoked from the header. One engine, two entry points — no second search to maintain.
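"One engine, two entry points" could come down to a single request builder whose only difference is a source-type filter. The sketch below assumes a request shape (`q`, `sourceTypes`) that this doc does not define; the names are illustrative.

```typescript
type EntryPoint = "header" | "indocs";

interface SearchRequest {
  q: string;
  sourceTypes: string[] | null; // null means "search every source"
}

// Same endpoint for both entry points; the indocs palette just narrows the scope to docs.
function buildSearchRequest(rawQuery: string, entryPoint: EntryPoint): SearchRequest {
  return {
    q: rawQuery.trim(),
    sourceTypes: entryPoint === "indocs" ? ["doc"] : null,
  };
}

const fromHeader = buildSearchRequest("  keep documents fresh ", "header");
const fromIndocs = buildSearchRequest("keep documents fresh", "indocs");
```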

Keep it plain: no imported product vocabulary. Match the existing Drive Browser’s visual language; keep result rows consistent in width and layout.

9. Build sequence

Five phases. Order is by dependency, not duration. Each phase is shippable and verifiable on its own.

Phase 1 — Unified index + keyword layer over core sources

Create search_index. Add tsvector columns + DB triggers for the must-have sources (docs, tasks, relations, prompts — reuse the prompt_registry pattern). Backfill existing rows. No UI yet; verify via direct queries.

Phase 2 — Search endpoint + global header search UI (keyword)

Build the search edge function (keyword only for now). Promote the command palette to the global header (intranet-header.tsx) with grouped results and the Cmd-K shortcut. First user-visible milestone: instant keyword search across core sources from anywhere.

Phase 3 — Semantic layer

Add embedding vector(768) to search_index. Build the background fingerprint job (Gemini, mirror ingest-thought). Add the similarity query to the endpoint and the merge/re-rank logic. Now search catches meaning, not just spelling — the actual stated goal.
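The bookkeeping side of the fingerprint job might look like the sketch below: pick rows that have no embedding yet, and refuse vectors that would not fit the vector(768) column. The Gemini call itself is left out, and the `IndexRow` shape and function names are hypothetical; the real job is meant to mirror ingest-thought.

```typescript
const EMBEDDING_DIMENSIONS = 768; // must match the vector(768) column

interface IndexRow {
  source_id: string;
  search_text: string;
  embedding: number[] | null; // null until the job has processed the row
}

// Pick the next rows that still need a fingerprint, in the order given.
function nextEmbeddingBatch(rows: IndexRow[], batchSize: number): IndexRow[] {
  return rows.filter((row) => row.embedding === null).slice(0, batchSize);
}

// Return a copy of the row with its fingerprint attached, rejecting wrong-sized vectors.
function withEmbedding(row: IndexRow, vector: number[]): IndexRow {
  if (vector.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(`expected ${EMBEDDING_DIMENSIONS} dimensions, got ${vector.length}`);
  }
  return { ...row, embedding: vector };
}

const placeholderVector = Array.from({ length: EMBEDDING_DIMENSIONS }, () => 0);
const backlog: IndexRow[] = [
  { source_id: "1", search_text: "R2 backups", embedding: placeholderVector },
  { source_id: "2", search_text: "Registry backfill cadence", embedding: null },
  { source_id: "3", search_text: "Rotate R2 keys", embedding: null },
];

const batch = nextEmbeddingBatch(backlog, 1);
const embedded = withEmbedding(batch[0], placeholderVector);
```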

Phase 4 — Per-row access control

Add visibility/acl to search_index and enforce it in the endpoint (mirror indoc-acl.ts). This unlocks safely adding sensitive sources.

Phase 5 — Full internal coverage + indocs search replacement

Extend the index to the remaining internal sources (images, voice notes, spaces, people, and — if opted in — the research vault behind ACL). Rebuild the indocs CommandPalette on the new endpoint so there’s one search engine, and retire its client-side title-only matching.

Recommended starting point

Phases 1–3 are the core deliverable — they produce exactly what was asked for: a smart, semantic, header-level search across the sources we own, entirely on infrastructure that already works (pgvector enabled, Gemini in production, the prompt full-text pattern proven), with no external logins and no new costs. Phases 4–5 broaden coverage and harden access once the experience is proven.

10. Open questions

Real decisions for the team, not implementation details:

  1. Sensitive content — searchable by whom? HR docs, Legal docs, the CRM, and partner research all contain sensitive material. We can filter by role (e.g. only admins see HR/Legal; partner research limited to Sonia + core team) — but the team needs to set the policy. (Drives Phase 4.)
  2. Include the research vault at all? obsid-sponic/ is Sonia’s personal BD research with confidential contacts. Index it (scoped to her + approved people), index only the non-sensitive parts, or leave it out entirely?
  3. Search analytics? We can log what people search for (and what returns nothing) to find content gaps and improve ranking. Useful, or a privacy concern for this small team?

11. Risks & non-goals

Build journal

No entries yet. Entries are appended by /feature, /journal, and /wrap.