Global search

One header-level search box that reaches everything internal — and finds it by meaning, not just keywords.

2026-06-04 · Build plan / RFC · 5 phases · internal sources only · Touches: Supabase · pgvector · Gemini · Cloudflare

How to read this doc

If you want the shape in one read: §4 (two-layer architecture) and §9 (the five build phases). The decision that actually gates the harder work is §6 (per-row access control) — read that before assuming this is “just search.” Markdown source: docs/devtasks/global-search.md.

Context & goal
Scope — what’s in, what’s out
The corpus — sources to index
Architecture — two layers, one box
The unified search index & ingestion pipeline
Access control — per-row visibility
The search endpoint
UI — global header search + indocs upgrade
Build sequence
Open questions
Risks & non-goals

1. Context & goal

Today the intranet has exactly one search surface: a Cmd-K command palette (apps/control/src/components/indocs/CommandPalette.tsx) that only searches the docs registry, only by title / category / tag, and only via client-side substring matching over rows already loaded in the browser. Everything else we own — tasks, the CRM, prompts, voice-note transcripts, the image catalog, spaces, people — has no global search at all. If you can’t remember which doc a thing lives in, you can’t find it.

The goal: one search box, available everywhere, that finds anything internal — even when you don’t type the exact words. Searching “how do we keep documents fresh” should surface the doc that says “registry backfill cadence.” It should feel instant as you type, and it should never be a wall of text.

This is grounded in a discovery pass (6 read-only scouts, 2026-06-04) that mapped the real corpus and infrastructure. The headline finding: “everything” is mostly already in our Supabase database — eight tables we own and can reach today, no new logins or external auth. That is the entire scope of this build.

Goal in one line

Promote search from an indocs-only, keyword-only, client-side palette to a global, semantic, server-backed search over every internal source — fast on every keystroke, smart enough to match meaning.

2. Scope — what’s in, what’s out

In scope (this build)

A global header-level search bar, available from anywhere in the intranet (lives in the intranet shell — apps/control/src/components/intranet/intranet-header.tsx), not nested inside the indocs section.
A server-backed search endpoint (Supabase edge function) doing keyword + semantic search across all internal sources, with results grouped by type.
Upgrading the existing indocs search (CommandPalette.tsx) to use the new endpoint and UI rather than its current title-only client-side match — the indocs search and the global search share one engine.
Internal sources only (§3).

Explicitly out of scope (deferred)

Google Drive, Gmail, Calendar. The only connected Google account is a personal Alpaca Playhouse mailbox (mixed with photos, family calendars, promo mail), reachable only inside an agent session, not from our servers. Making it useful needs a dedicated company Workspace account, folder/label scoping, a saved login, and OCR for PDFs. Not now.
Slack. Blocked entirely until someone logs it in and we save a bot token.
Recorded in §11 as a possible later phase, but the user has confirmed: keep this build to internal sources only.

3. The corpus — sources to index

Every realistic internal source, ranked by indexing priority. The eight Supabase tables are reachable today; the research vault is the one exception, living as markdown on local disk.

Source	Table	What it holds	Priority
Intranet docs	`public.indocs`	~119 docs across 20 categories; markdown in `body_md`	Must-have
Tasks / work items	`public.tasks` + `task_activity`	titles, descriptions, comments, history	Must-have
Relations / CRM	`public.relations`	names, organizations, notes, email	Must-have
Prompts registry	`public.prompts`	prompt library — already has a working full-text index	Must-have (free win)
Image catalog	`public.images`	prompt, alt text, tags (already indexed); blobs in R2	Nice-to-have
Voice notes	`public.voice_notes`	transcripts + extracted action items	Nice-to-have
Spaces / property	`public.spaces`	name, description, features	Nice-to-have
People / team	`public.app_users`	names, emails	Nice-to-have
Research vault	`obsid-sponic/` (local md)	Sonia’s BD/partner research — 57 files, sensitive	Nice-to-have, access-scoped only

public.prompts already ships a full-text index (apps/control/migrations/20260512_prompt_registry.sql) — it’s the proven tsvector pattern to copy for every other table, and the one source already searchable today.
The research vault (obsid-sponic/) is markdown on disk, easy to sync technically, but contains confidential partner notes — only index it behind access control (§6), and only if the user opts in (Open Question 2).
Exact row counts for tasks / relations / images are unconfirmed (scouts read schema, not live data), but at this team’s scale that doesn’t change the approach.

4. Architecture — two layers, one box

Good search feels both instant and smart. Those pull in opposite directions (keyword matching is instant; semantic matching needs a model call), so we run two layers and merge them into one ranked list.

Layer A — instant keyword (every keystroke)

Matches the actual words in titles, descriptions, tags, body text. Returns in milliseconds. Handles the obvious cases (“that doc about R2 backups”).

Tech: Postgres full-text search (tsvector) + fuzzy matching (pg_trgm) for typo tolerance. Already proven on public.prompts. Free, no new service.
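
To make the typo tolerance concrete, here is a minimal TypeScript sketch of the trigram idea behind pg_trgm. It is an illustration written for this doc, not the extension's code: the names `trigrams` and `trigramSimilarity` are hypothetical, the word splitting is simplified to ASCII letters and digits, and the real matching would run inside Postgres.

```typescript
// Illustrative approximation of pg_trgm-style similarity (assumption: ASCII-only words).
// Each word is lowercased, padded with two leading spaces and one trailing space,
// then cut into overlapping 3-character windows.
function trigrams(text: string): Set<string> {
  const grams = new Set<string>();
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  for (const word of words) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      grams.add(padded.slice(i, i + 3));
    }
  }
  return grams;
}

// Similarity = shared trigrams / all distinct trigrams, in the range 0..1.
function trigramSimilarity(a: string, b: string): number {
  const ga = trigrams(a);
  const gb = trigrams(b);
  if (ga.size === 0 && gb.size === 0) return 0;
  let shared = 0;
  ga.forEach((g) => {
    if (gb.has(g)) shared++;
  });
  return shared / (ga.size + gb.size - shared);
}
```

A transposed-letter typo such as "bakcup" still shares the leading and trailing trigrams with "backup", so it scores above an unrelated word. That is the property that lets the keyword layer forgive small typos.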

Layer B — semantic (catches what the words miss)

Matches by meaning — compares the math-fingerprint (“embedding”) of the query against each item’s fingerprint. Finds the “registry backfill cadence” doc when you typed “keep documents fresh.”

Tech: pgvector (already enabled — used by Open Brain and dinner-profile matching) stores 768-dim fingerprints, queried by cosine similarity. Fingerprints are made with Google Gemini gemini-embedding-001 — already in production here (ingest-thought, open-brain-mcp), effectively free at our volume, already 768-dim to match existing indexes. No Azure, no new account, no new cost.
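
As a rough illustration of what "queried by cosine similarity" means, the sketch below computes the score in TypeScript. In the real system pgvector does this in the database (its `<=>` operator returns cosine distance, which is 1 minus this value); the function names and the tiny example vectors here are assumptions for illustration only.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Close to 1 means the vectors point the same way; 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank candidate items by similarity to the query embedding, best first.
function rankBySimilarity<T extends { embedding: number[] }>(
  query: number[],
  items: T[],
): Array<{ item: T; score: number }> {
  return items
    .map((item) => ({ item, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score);
}
```

The dimension check matters in practice: a query embedded at a different size than the stored 768-dim fingerprints should fail loudly rather than return nonsense scores.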

Why Gemini and not Azure or local

Azure embeddings are marginally cheaper per token but need fresh setup and burn a credit expiring 2026-07-31; ALPUCA’s local Ollama has no text-embedding endpoint ready. Gemini is already wired, free at our volume, and dimension-compatible — the clear default.

How they merge

The keyword layer returns immediately so the box never feels slow. The semantic layer (query fingerprint ~100–400ms) folds in a beat later and re-ranks. The user sees one ranked list that gets smarter as it settles — no mode toggle, no two separate result sets.

keystroke ──▶ keyword hits (tsvector + trgm) ──▶ shown instantly
    └─▶ embed query (Gemini ~100-400ms) ──▶ semantic hits ──▶ merge + re-rank ──▶ one list
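
This doc does not fix a merge formula. One common option, shown here purely as an assumption, is reciprocal rank fusion: each layer contributes a score based only on an item's position in its list, so the two layers never need comparable raw scores. The `Hit` shape, the function name `mergeRanked`, and the constant `k = 60` are illustrative choices.

```typescript
type Hit = { id: string; title: string };

// Reciprocal rank fusion: every list adds 1 / (k + rank) for each item it contains,
// so items ranked well by either layer, or present in both, float to the top.
function mergeRanked(keywordHits: Hit[], semanticHits: Hit[], k = 60): Hit[] {
  const scores = new Map<string, { hit: Hit; score: number }>();
  const add = (hits: Hit[]): void => {
    hits.forEach((hit, index) => {
      const entry = scores.get(hit.id) ?? { hit, score: 0 };
      entry.score += 1 / (k + index + 1); // rank is 1-based
      scores.set(hit.id, entry);
    });
  };
  add(keywordHits);
  add(semanticHits);
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.hit);
}
```

Calling `mergeRanked(keywordHits, [])` returns the keyword order unchanged, which fits the instant-then-smart flow: render once with keyword hits only, then call it again when the semantic hits arrive.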

5. The unified search index & ingestion pipeline

The real work isn’t the query — it’s getting every source into one searchable place and keeping it fresh. Recommended: a single search_index table with a consistent shape regardless of source.

Column	Purpose
`source_type`	doc / task / contact / prompt / image / voice-note / space / person / research
`source_id`	points back to the original row
`title`, `snippet`	what shows in results
`url`	where clicking takes the user
`search_text` (`tsvector`)	the instant keyword layer
`embedding` (`vector(768)`)	the semantic layer
`acl` / `visibility`	who’s allowed to see this result (§6)
`updated_at`	freshness + sort

How each source stays fresh

Supabase tables (docs, tasks, relations, prompts, images, voice notes, spaces, people): a database trigger updates search_text automatically on every insert/edit — Postgres does this natively, so keyword freshness is instant and free. The embedding is computed by a small background job (mirror the existing image-gen-runner / prompt-runner pattern, or a Cloudflare cron) that picks up rows changed since its last run, calls Gemini, and stores the fingerprint. A few API calls per changed item.
Research vault (obsid-sponic/): a sync job reads changed markdown files and upserts them with a restricted visibility tag. Technically easy; gated on the access decision (Open Question 2).
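
The background fingerprint job described above can be sketched as follows. This is a minimal outline under stated assumptions, not the existing runner pattern: it tracks staleness with a hypothetical per-row `embedded_at` timestamp rather than a global last-run marker, and the Gemini call and the database write are injected as `embed` and `store` functions so no real API signature is implied.

```typescript
interface IndexedRow {
  source_id: string;
  text: string;
  updated_at: string;          // ISO timestamp of the last content change
  embedded_at: string | null;  // hypothetical column: when the embedding was last computed
}

// A row needs a fresh embedding if it was never embedded,
// or if its content changed after the last embedding.
function needsEmbedding(row: IndexedRow): boolean {
  return row.embedded_at === null ||
    Date.parse(row.embedded_at) < Date.parse(row.updated_at);
}

type EmbedFn = (text: string) => Promise<number[]>;
type StoreFn = (sourceId: string, embedding: number[], embeddedAt: string) => Promise<void>;

// One pass of the job: embed every stale row, record failures, keep going.
async function runEmbeddingPass(
  rows: IndexedRow[],
  embed: EmbedFn,
  store: StoreFn,
  now: () => Date = () => new Date(),
): Promise<{ embedded: number; failed: string[] }> {
  let embedded = 0;
  const failed: string[] = [];
  for (const row of rows.filter(needsEmbedding)) {
    try {
      const vector = await embed(row.text);
      await store(row.source_id, vector, now().toISOString());
      embedded++;
    } catch {
      failed.push(row.source_id); // left stale; the next pass retries it
    }
  }
  return { embedded, failed };
}
```

Because a failed row simply stays stale, a later pass picks it up again without any extra retry bookkeeping.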

Queries run through a single search endpoint (a Supabase edge function) so the static frontend never loads the whole corpus into the browser — it asks the endpoint, which runs keyword + semantic + visibility filtering and returns a ranked, paginated, grouped list.

6. Access control — per-row visibility (the real hidden cost)

This is the hard part, not the search

Today the intranet authorizes at the page level: any signed-in user can see any doc. A global search box would surface HR, Legal, CRM, and partner research to everyone who searches. Per-row visibility filtering does not exist yet — building it is the genuinely non-trivial engineering in this project, more than the search itself.

Each search_index row carries a visibility / acl value (mirror the category ACL model already in apps/control/src/lib/indoc-acl.ts: intranet / admin+staff / admin).
The search endpoint filters results by the requesting user’s role before returning them — a result the user can’t open never appears.
Sensitive sources (HR, Legal docs, the CRM, the research vault) get restrictive defaults; ordinary docs/tasks default to intranet.

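This must land before any sensitive source is added to the index. It’s safe to ship Phases 1–3 over non-sensitive sources first and add the restricted sources once ACL filtering is in (§9). Note that two of the Phase 1 must-have sources overlap with the sensitive list above: HR/Legal docs (inside public.indocs) and the CRM (public.relations). Those rows need to stay out of the index, or at least out of results, until Phase 4.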
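
The role-based filter described above could look roughly like this. The three visibility levels come from the ACL model named in this section; the role names (`admin`, `staff`, `member`) and the mapping between roles and levels are assumptions for illustration and are not taken from indoc-acl.ts.

```typescript
type Visibility = "intranet" | "admin+staff" | "admin";
type Role = "admin" | "staff" | "member"; // assumed role names

// Which visibility levels each role may see (assumed mapping).
const VISIBLE_TO: Record<Role, Visibility[]> = {
  admin: ["intranet", "admin+staff", "admin"],
  staff: ["intranet", "admin+staff"],
  member: ["intranet"],
};

// Drop every row the requesting role is not allowed to open.
function filterByVisibility<T extends { visibility: Visibility }>(rows: T[], role: Role): T[] {
  const allowed = VISIBLE_TO[role];
  return rows.filter((row) => allowed.indexOf(row.visibility) !== -1);
}
```

In practice the same condition would more safely live in the SQL `WHERE` clause, so restricted rows never leave the database. The in-memory version is shown only because it is easy to read and test.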

7. The search endpoint

A single Supabase edge function (search) is the one entry point:

Takes the query string + the requesting user.
Runs the keyword query (tsvector + pg_trgm) across search_index.
Embeds the query via Gemini and runs the pgvector similarity query.
Merges + re-ranks (keyword exact-ish hits and high-similarity semantic hits both float up), filters by the user’s visibility, groups by source_type, paginates.
Returns a compact ranked list — title, snippet with the match highlighted, source type, url.

Keeping it server-side means the browser never holds the whole corpus (the current palette’s weakness) and visibility filtering can’t be bypassed client-side.
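
The grouping step in the list above can be sketched as a pure function over an already-ranked, already-filtered list. The `RankedHit` and `ResultGroup` shapes and the default of three hits per group are assumptions; the doc only says "a few top hits each."

```typescript
interface RankedHit {
  source_type: string;
  title: string;
  snippet: string;
  url: string;
}

interface ResultGroup {
  source_type: string;
  hits: RankedHit[]; // top hits shown in the palette
  total: number;     // all matches of this type, for a "See all results" count
}

// Groups a ranked list by source_type. Rank order is preserved inside each group,
// and groups appear in the order of their best-ranked hit.
function groupResults(ranked: RankedHit[], perGroup = 3): ResultGroup[] {
  const groups = new Map<string, ResultGroup>();
  for (const hit of ranked) {
    let group = groups.get(hit.source_type);
    if (!group) {
      group = { source_type: hit.source_type, hits: [], total: 0 };
      groups.set(hit.source_type, group);
    }
    group.total++;
    if (group.hits.length < perGroup) group.hits.push(hit);
  }
  return Array.from(groups.values());
}
```

Ordering groups by their best hit means the most relevant type leads the list for each query, rather than a fixed "Documents first" layout. Whether that or a fixed order reads better is a UI call this sketch does not settle.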

8. UI — global header search + indocs upgrade

Recommended direction: a Cmd-K command palette promoted to the global header, with an optional full /search results page for big searches.

Why command-palette: the intranet already has a CommandPalette component, so we extend what exists rather than inventing a new section — consistent with the “integrate, don’t reinvent” rule. It’s the least overwhelming pattern: one box, Escape to dismiss, reachable from anywhere via the header and the Cmd-K shortcut.

The experience

One box, results grouped by type — “Documents,” “Tasks,” “Contacts,” “Prompts,” etc., a few top hits each. Grouping stops a broad search feeling like a wall of text.
Instant-then-smart: keyword hits appear as you type; semantic hits fold in a beat later and re-rank. No mode toggle — it just gets smarter.
Each result shows a title, a short snippet with the matching text highlighted, and the source type. Click goes straight to the doc / task / contact.
Light filters (by type, category, status) as simple chips for when a search returns a lot — not a heavy faceted sidebar.
“See all results” → a dedicated /search page for deep searches, same grouping, full pagination.
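
For the highlighted snippet in each result row, one simple approach is to split the snippet into matched and unmatched parts and let the UI style the matched ones. The sketch below is an assumption about how that could work, with hypothetical names. It highlights literal query words only, so a purely semantic hit may come back with nothing highlighted.

```typescript
interface SnippetPart {
  text: string;
  match: boolean; // true when this part should be rendered highlighted
}

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Split a snippet into parts, flagging case-insensitive matches of any query word.
function highlight(snippet: string, query: string): SnippetPart[] {
  const terms = query
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .map(escapeRegExp);
  if (terms.length === 0) return [{ text: snippet, match: false }];
  const pattern = new RegExp(`(${terms.join("|")})`, "gi");
  // With one capturing group, split() puts matches at the odd indices.
  return snippet
    .split(pattern)
    .map((text, index) => ({ text, match: index % 2 === 1 }))
    .filter((part) => part.text.length > 0);
}
```

Returning structured parts instead of an HTML string keeps rendering in the component, which avoids injecting unescaped snippet text into the page.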

The indocs search is folded into this, not left parallel. The current CommandPalette.tsx (indocs-only, title-only, client-side) is rebuilt on the new endpoint: same component, now backed by the server search and the richer result UI, scoped to docs when invoked from the indocs section and global when invoked from the header. One engine, two entry points — no second search to maintain.

Keep it plain: no imported product vocabulary. Match the existing Drive Browser’s visual language; keep result rows consistent in width and layout.

9. Build sequence

Five phases. Order is by dependency, not duration. Each phase is shippable and verifiable on its own.

Phase 1 — Unified index + keyword layer over core sources

Create search_index. Add tsvector columns + DB triggers for the must-have sources (docs, tasks, relations, prompts — reuse the prompt_registry pattern). Backfill existing rows. No UI yet; verify via direct queries.

Phase 2 — Search endpoint + global header search UI (keyword)

Build the search edge function (keyword only for now). Promote the command palette to the global header (intranet-header.tsx) with grouped results and the Cmd-K shortcut. First user-visible milestone: instant keyword search across core sources from anywhere.

Phase 3 — Semantic layer

Add embedding vector(768) to search_index. Build the background fingerprint job (Gemini, mirror ingest-thought). Add the similarity query to the endpoint and the merge/re-rank logic. Now search catches meaning, not just spelling — the actual stated goal.

Phase 4 — Per-row access control

Add visibility/acl to search_index and enforce it in the endpoint (mirror indoc-acl.ts). This unlocks safely adding sensitive sources.

Phase 5 — Full internal coverage + indocs search replacement

Extend the index to the remaining internal sources (images, voice notes, spaces, people, and — if opted in — the research vault behind ACL). Rebuild the indocs CommandPalette on the new endpoint so there’s one search engine, and retire its client-side title-only matching.

Recommended starting point

Phases 1–3 are the core deliverable — they produce exactly what was asked for: a smart, semantic, header-level search across the sources we own, entirely on infrastructure that already works (pgvector enabled, Gemini in production, the prompt full-text pattern proven), with no external logins and no new costs. Phases 4–5 broaden coverage and harden access once the experience is proven.

10. Open questions

Real decisions for the team, not implementation details:

Sensitive content — searchable by whom? HR docs, Legal docs, the CRM, and partner research all contain sensitive material. We can filter by role (e.g. only admins see HR/Legal; partner research limited to Sonia + core team) — but the team needs to set the policy. (Drives Phase 4.)
Include the research vault at all? obsid-sponic/ is Sonia’s personal BD research with confidential contacts. Index it (scoped to her + approved people), index only the non-sensitive parts, or leave it out entirely?
Search analytics? We can log what people search for (and what returns nothing) to find content gaps and improve ranking. Useful, or a privacy concern for this small team?

11. Risks & non-goals

The embedding layer can quietly rot. The keyword layer maintains itself via DB triggers; the semantic layer depends on a background job that re-fingerprints changed rows. It’s cheap, but it’s the piece that silently goes stale if the job breaks — needs a health check.
Access control is the hard part, not search. Per-row visibility is new engineering (§6). Don’t let the search work hide that this is where the real risk and effort sit.
Non-goal: external sources. Drive / Gmail / Slack / Calendar are explicitly deferred. This build does not touch them; a future phase can, once a real company Google account and Slack token exist and the privacy questions are answered.
Non-goal: a new search “section.” This integrates into the existing header and command palette — it does not add a new nav destination or reorganize the intranet.
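
The health check called for in the first risk above could be as small as the following sketch. It assumes a hypothetical `embedded_at` timestamp stored next to each embedding and an arbitrary 60-minute tolerance; both are illustrative and would need to match however the real job records its progress.

```typescript
interface EmbeddingState {
  source_id: string;
  updated_at: string;          // ISO timestamp of the last content change
  embedded_at: string | null;  // hypothetical: when the embedding was last computed
}

interface HealthReport {
  staleCount: number;            // rows whose embedding is missing or out of date
  oldestStaleAgeMinutes: number; // how long the oldest stale row has been waiting
  healthy: boolean;
}

// Reports how far the semantic layer has fallen behind the keyword layer.
function checkEmbeddingHealth(
  rows: EmbeddingState[],
  now: Date,
  maxLagMinutes = 60, // assumed tolerance
): HealthReport {
  let staleCount = 0;
  let oldestStaleAgeMinutes = 0;
  for (const row of rows) {
    const updated = Date.parse(row.updated_at);
    const isStale = row.embedded_at === null || Date.parse(row.embedded_at) < updated;
    if (!isStale) continue;
    staleCount++;
    const ageMinutes = (now.getTime() - updated) / 60000;
    oldestStaleAgeMinutes = Math.max(oldestStaleAgeMinutes, ageMinutes);
  }
  return {
    staleCount,
    oldestStaleAgeMinutes,
    healthy: oldestStaleAgeMinutes <= maxLagMinutes,
  };
}
```

A few stale rows are normal between runs, so the report keys on how long the oldest one has waited rather than on the count alone.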

Build journal

No entries yet — appended by /feature, /journal, and /wrap.