TL;DR
- YT Knowledge Base is a single-user, local-first knowledge base for YouTube videos. The app takes a YouTube URL, pulls the caption track in-process via youtubei.js, cleans and chunks the transcript, and runs it through a local Ollama model with a Zod outputSchema to produce a structured summary — title, description, key takeaways, chronological sections, and action steps.
- The stack: `@tanstack/react-form` for the share form, `createServerFn` for RPC-style data fetching, and TanStack AI with the `@tanstack/ai-ollama` adapter for both structured summary generation and tool-calling chat.
- Chat is grounded in the transcript and exposes a `web_search` tool for questions the transcript doesn't cover.
- The biggest detour: the model emitted [mm:ss] markers that pointed to the wrong parts of the video. We tried stricter prompts, stricter schemas, and in-prompt timecode anchors before landing on the fix: forbid the model from emitting timecodes at all, then recover them deterministically from the transcript via BM25. Detailed in section 4.
The /learn/:videoId route renders that summary next to the embedded player; section headings and in-chat citations are [mm:ss] chips that call player.seekTo(). The chat panel is a TanStack Start file route that streams AG-UI-format SSE, runs BM25 over the stored transcript index for retrieval, and exposes a single web_search tool the model can invoke when the retrieved chunks don't cover the question.
Three questions we wanted to answer while building this, and where we landed:
- Can you pull transcripts fully locally? Yes: caption tracks are fetched in-process via youtubei.js. No third-party summarization or transcription service — though how well this keeps working for the edge cases (gated videos, videos with no captions) is something we're still figuring out.
- How do you summarize a 90-minute video on local hardware? Past a token budget, `generateSummaryWithAI()` switches to a map step that summarizes ~17-minute windows in parallel, followed by a reduce step that synthesizes the final schema-validated output from those bullet notes. The cutover value (15K) came from trial and error.
- How do you get trustworthy timecodes? Don't let the model write them; recover them deterministically from the transcript with BM25 (section 4).

Two things I wanted to explore with this project.
The first was a question I kept circling around: how realistic is it, right now, to build something useful on local models and local hardware — nothing calling out to a hosted frontier model, and everything working offline?
Most of my hands-on AI work so far had been against hosted APIs, and I wanted to know what actually changes when you don't have one. Does the same kind of pipeline (structured outputs, tool use, map-reduce over long inputs, retrieval with grounding) still hold together against a local Ollama instance? A few specific things I wanted to learn by actually building it:
- Whether structured outputs, tool use, and retrieval stay reliable when the model is a local Ollama instance instead of a hosted API.
- What the TanStack AI surface (`chat()` + adapter + tools) looks like end-to-end when you own every layer.

The second was practical. I wanted something that let me quickly skim a YouTube video to tell whether it was worth my time, with the ability to ask follow-up questions, and keep the summary around as a learning resource I could come back to later.
| Layer | Tech |
|---|---|
| Client framework | TanStack Start (Vite + Nitro), React 19 |
| Routing | TanStack Router (file-based) |
| Forms | TanStack React Form + Zod validators |
| Server functions | createServerFn from @tanstack/react-start |
| AI / LLM | TanStack AI + @tanstack/ai-ollama adapter |
| Local model | Ollama running gemma4-kb:latest (custom Modelfile, Q4) |
| Backend / CMS | Strapi 5, SQLite for dev, Postgres-ready |
| Transcripts | youtubei.js directly against caption tracks |
| Styling | Tailwind v4, Radix UI, shadcn primitives |
| Markdown rendering | react-markdown + remark-gfm |
| Validation | Zod (forms, server functions, structured AI outputs) |
The whole monorepo is two yarn workspaces — client/ and server/ — wired together by a root package.json and a start.sh that brings up Ollama, Strapi, and the TanStack client in one command.
Three Strapi content types are used. The one worth calling out stores the transcript, keyed by `youtubeVideoId`: caption segments + duration + title. Created once, reused across every regeneration so YouTube is never re-hit.

The remaining sections walk through the four pieces of the app where the interesting stuff lives:
1. The share form and the `createServerFn` data layer.
2. The summary pipelines (single-pass and map-reduce).
3. The chat route (streaming AG-UI events and a `web_search` tool over SSE).
4. Timecode grounding with BM25.

The /new-post form is a small useForm setup with Zod validators. On submit it calls a createServerFn RPC handler that creates the Strapi Video row immediately and kicks off summary generation in a fire-and-forget background task. By the time the router navigates the user to /learn/$videoId, the row exists and the summary is already running on the server.
```tsx
// client/src/components/NewPostForm.tsx
const form = useForm({
  defaultValues: {
    url: '',
    caption: '',
    tags: '',
    mode: 'auto' as GenerationMode,
  } satisfies ShareVideoFormValues,
  validators: { onChange: ShareVideoFormSchema as never },
  onSubmit: async ({ value }) => {
    const parsed = ShareVideoFormSchema.safeParse(value);
    if (!parsed.success) {
      setServerError('Fix the highlighted fields and try again');
      return;
    }
    const result = await shareVideo({
      data: {
        url: parsed.data.url,
        caption: parsed.data.caption || undefined,
        tags: parsed.data.tags || undefined,
        mode: parsed.data.mode,
      },
    });
    if (result.status === 'error') return setServerError(result.error);

    await router.invalidate();
    router.navigate({
      to: '/learn/$videoId',
      params: { videoId: result.video.youtubeVideoId },
    });
  },
});
```

The server function on the other end is a createServerFn with a Zod input validator and a typed handler — there's no separate REST route, no manual fetch, no axios. Type information flows end-to-end:
```ts
// client/src/data/server-functions/videos.ts
export const shareVideo = createServerFn({ method: 'POST' })
  .inputValidator((data: z.input<typeof ShareVideoInputSchema>) =>
    ShareVideoInputSchema.parse(data),
  )
  .handler(async ({ data }): Promise<ShareVideoResult> => {
    const videoId = extractYouTubeVideoId(data.url);
    if (!videoId) return { status: 'error', error: "Doesn't look like a YouTube URL" };

    const alreadyExists = await fetchVideoByVideoIdService(videoId);
    if (alreadyExists) return { status: 'exists', video: alreadyExists };

    const meta = await fetchYouTubeMeta(videoId);
    const result = await createVideoService({
      videoId,
      url: data.url,
      caption: data.caption,
      tagNames: parseTagInput(data.tags ?? ''),
      videoTitle: meta.title,
      videoAuthor: meta.author,
      videoThumbnailUrl:
        meta.thumbnailUrl ?? `https://i.ytimg.com/vi/${videoId}/hqdefault.jpg`,
    });
    if (!result.success) return { status: 'error', error: result.error };

    kickoffSummaryGeneration(videoId, data.mode); // fire-and-forget
    return { status: 'created', video: result.video };
  });
```

kickoffSummaryGeneration adds the videoId to a shared in-memory Set to dedupe parallel triggers, then kicks off the real work in an async IIFE so the user sees the response in a few hundred milliseconds rather than waiting minutes for inference.
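A minimal sketch of that dedupe-and-detach pattern — hypothetical signature: the real function takes the generation mode and calls the pipeline directly, not an injected `run` callback:

```typescript
// Sketch of Set-based dedupe + fire-and-forget, assuming an injected `run`
// callback for testability (the real code invokes the summary pipeline).
const inFlight = new Set<string>();

function kickoffSummaryGeneration(
  videoId: string,
  run: (videoId: string) => Promise<void>,
): boolean {
  if (inFlight.has(videoId)) return false; // a run for this video is already going
  inFlight.add(videoId);
  void (async () => {
    try {
      await run(videoId);
    } finally {
      inFlight.delete(videoId); // always release the slot, even on failure
    }
  })();
  return true; // caller returns to the user immediately
}
```

The `finally` release is the important part: without it, one crashed generation would permanently block regeneration for that video until a server restart.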
The same createServerFn pattern powers the rest of the data layer — getFeed, getVideoByVideoId, getGenerationProgress, triggerSummaryGeneration, regenerateSummary, updateSectionTimecode, searchTags. Every TanStack Router route loader calls these directly:
```tsx
// client/src/routes/feed.tsx
export const Route = createFileRoute('/feed')({
  validateSearch: FeedSearchSchema,
  loaderDeps: ({ search }) => ({ q: search.q, tag: search.tag, page: search.page }),
  loader: async ({ deps }) => {
    const result = await getFeed({
      data: { q: deps.q, tag: deps.tag, page: deps.page ?? 1, pageSize: 20 },
    });
    return { result };
  },
  component: FeedPage,
});
```

The route validates URL search params with Zod, declares them as loader deps, and gets full type-safe access to result in the component via Route.useLoaderData().
Most of the real work on the summary side goes through TanStack AI. We create the Ollama adapter once per model and reuse it across calls:
```ts
// client/src/lib/services/learning.ts
import { chat } from '@tanstack/ai';
import { createOllamaChat } from '@tanstack/ai-ollama';

const OLLAMA_HOST = (process.env.OLLAMA_BASE_URL ?? 'http://localhost:11434/v1')
  .replace(/\/v1\/?$/, '');
const SUMMARY_MODEL = process.env.OLLAMA_MODEL ?? 'gemma4-kb:latest';

const ollamaAdapter = createOllamaChat(SUMMARY_MODEL, OLLAMA_HOST);
```

A YouTube transcript is wildly variable in length — a 5-minute explainer might be 1,000 tokens; a 90-minute interview can hit 22,000+. We could try to handle both with one strategy, but each end of the spectrum punishes the other.
So we run two pipelines and route between them based on token count. The cutover lives at SINGLE_PASS_TOKEN_BUDGET = 15_000 — below it, single-pass; above it, map-reduce. (auto mode picks for you; single and mapreduce are user overrides for the rare cases the heuristic gets it wrong.)
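The routing itself is tiny. A sketch — `pickPipeline` is a hypothetical name; the real decision lives inside `generateSummaryWithAI()`:

```typescript
type GenerationMode = 'auto' | 'single' | 'mapreduce';

const SINGLE_PASS_TOKEN_BUDGET = 15_000;

// Sketch of the pipeline routing: explicit user overrides win,
// otherwise the token count decides.
function pickPipeline(
  tokenCount: number,
  mode: GenerationMode,
): 'single' | 'mapreduce' {
  if (mode !== 'auto') return mode; // user override wins
  return tokenCount <= SINGLE_PASS_TOKEN_BUDGET ? 'single' : 'mapreduce';
}
```

Everything downstream of this branch converges again on the same output schema, so the choice stays invisible to the rest of the app.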
For transcripts under ~15K tokens, we hand the whole cleaned transcript to one `chat({ outputSchema })` call, passing SummarySchema as the outputSchema. TanStack AI translates the Zod schema into Ollama's native JSON-mode format parameter, so the model is constrained at decode time to produce JSON that matches the shape — no markdown fences, no preamble, no "Sure, here's your summary!".

```ts
const SummarySchema = z.object({
  title: z.string().describe('Short punchy title. MAX 200 characters.'),
  description: z.string(),
  overview: z.string(),
  keyTakeaways: z.array(z.object({ text: z.string() })),
  sections: z.array(z.object({
    heading: z.string(),
    body: z.string(),
  })).min(2).max(15),
  actionSteps: z.array(z.object({ title: z.string(), body: z.string() })),
});

const object = (await chat({
  adapter: ollamaAdapter,
  messages: [
    { role: 'system', content: SUMMARY_SYSTEM },
    { role: 'user', content: userPrompt },
  ] as never,
  outputSchema: SummarySchema,
  temperature: 0.3,
})) as GeneratedSummary;
```

A few things worth calling out about this single call:
- `temperature: 0.3` suppresses creative drift. Ollama's default of 1.0 is great for chat, but for structured summarization we want grounded prose, not invention — especially in the action steps, where confabulated specifics ("install the X plugin") look authoritative when they're actually guesses.
- `.describe()` doubles as a soft constraint. Each schema field's `.describe()` text gets surfaced to the model alongside the JSON shape, which is why we put hard-limit hints (MAX 280 characters) and ordering rules ("sections IN CHRONOLOGICAL ORDER from start to end") right in the schema.
- The system prompt carries a "leave timeSec unset" rule, because timecodes get recovered deterministically from the transcript afterwards (see section 4) — anything the model writes there would just be discarded.

Map-reduce is borrowed from classic distributed-data-processing — split the input across N workers, summarize each piece independently, then combine the partial results. LangChain popularized the pattern for LLM summarization; we run our own minimal implementation in pure JS rather than pulling in a framework.
For our case, the trade-off is exactly the one map-reduce was invented for: a single LLM call can't pay attention to 90 minutes of transcript at once, but it can pay great attention to a 17-minute window. So we split the transcript into windows, summarize each one in parallel, and then do a final synthesis pass that turns the bullet notes into the same structured SummarySchema as the single-pass path.
Step 1 — Chunk. chunkForSummary splits the cleaned transcript into ~2,500-word windows (~17 minutes of speech) with a 50-word overlap so we don't cut a thought clean in half across the seam. Each window carries the real timeSec of its first word, sourced from the caption-segment timestamps we preserved during cleaning.
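A sketch of that windowing, under the assumption that chunks are plain word windows (the real `chunkForSummary` also carries the `timeSec` of each window's first word):

```typescript
// Sketch: sliding word windows of `size` words, with `overlap` words
// shared across each seam so no thought is cut clean in half.
function windowWords(
  words: string[],
  size: number,
  overlap: number,
): { start: number; words: string[] }[] {
  const step = size - overlap;
  const chunks: { start: number; words: string[] }[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push({ start, words: words.slice(start, start + size) });
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

With the article's numbers (size 2,500, overlap 50), a 6,000-word transcript yields three windows starting at words 0, 2,450, and 4,900.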
Step 2 — Map (in parallel). Each window goes to its own chat() call with a tight system prompt: "You read one window of a YouTube transcript and produce concise bullet notes on what was said." The map model gets a different, simpler instruction than the reduce model — its only job is faithful note-taking, not synthesis. We use stream: false here because we don't surface map output to the UI, only the aggregate progress.
The parallelism uses a classic worker-pool pattern rather than Promise.all(chunks.map(...)). The difference matters: Promise.all(map) would fire all N requests at once, which Ollama would just queue (it serves requests against OLLAMA_NUM_PARALLEL slots). The worker pool gives us a constant in-flight count that matches the configured concurrency, which is honest about what's actually happening on the GPU and lets us report "map 4/9 done · 2 running" truthfully:
```ts
let cursor = 0;
const partialNotes: string[] = new Array(chunks.length);

const workers = Array.from({ length: MAP_CONCURRENCY }, async () => {
  while (true) {
    const i = cursor++; // atomic in JS's single-threaded event loop
    if (i >= chunks.length) return;
    await processChunk(i); // writes into partialNotes[i] by index
  }
});
await Promise.all(workers);
```

MAP_CONCURRENCY defaults to 1 (safe on any laptop). Bumping it to 2-4 helps on machines with RAM headroom — but it must match OLLAMA_NUM_PARALLEL on the Ollama server, or extra requests just queue. Each extra slot costs ~3GB of KV cache on an 8B model at num_ctx=32768, so on a 24GB Mac with Chrome/editor open, 2 can push you into swap and end up slower than 1.
Writing results into partialNotes[i] by index — rather than pushing in completion order — guarantees the reduce step sees windows chronologically, which matters because the reduce prompt explicitly tells the model "these notes are in chronological order; produce sections that span from the start of the video to the end".
Step 3 — Reduce. Once all windows have produced bullet notes, we concatenate them and run one more chat({ outputSchema: SummarySchema }) call — the same call as the single-pass path, with the same schema, the same temperature: 0.3, and the same anti-confabulation system prompt. The only difference is the input: instead of a 22K-token raw transcript, the reduce step sees ~5K tokens of pre-digested bullet notes. The model has no trouble paying attention to all of it, so the resulting sections actually cover the back half of the video instead of trailing off after the opening:
```ts
const reduceUser = [
  `Video duration: ${formatTimecode(transcript.durationSec)}.`,
  `You are summarizing a ${formatTimecode(transcript.durationSec)}-long video from per-window bullet notes (each window covers a distinct portion of the video).`,
  `CRITICAL: Your sections MUST cover the ENTIRE video — including the FINAL windows. ${partialNotes.length} windows of notes → produce sections that collectively reference all of them.`,
  '',
  'Window notes:',
  partialNotes.join('\n\n'),
].join('\n');

const object = await chat({
  adapter: ollamaAdapter,
  messages: [
    { role: 'system', content: SUMMARY_SYSTEM },
    { role: 'user', content: reduceUser },
  ] as never,
  outputSchema: SummarySchema,
  temperature: 0.3,
});
```

Because the reduce step lands on the exact same schema as single-pass, everything downstream — clamping, BM25 grounding, Strapi save, learn-page rendering — is identical between the two pipelines. The branching is fully contained inside generateSummaryWithAI; the rest of the system has no idea which path produced the summary.
A few non-obvious knobs that shaped this design:
- `SINGLE_PASS_TOKEN_BUDGET = 15_000` (originally 25K). At 25K, a 100-minute video would technically fit in single-pass — but with under 10K tokens of headroom for the system prompt + structured-output bookkeeping against num_ctx=32768, the model produced shallow, generic sections. Lowering the cutover to 15K means anything past ~60 minutes goes through map-reduce, which gives more coherent per-section attention.
- The map step uses `stream: false`, the reduce step uses structured output. Map output is plain prose for internal consumption, so streaming buys nothing; reduce output is the user-visible structured summary, so we want JSON-mode constraint decoding.
- Retries (`withRetry`) wrap every model call with 2 attempts. Local models occasionally produce malformed JSON or empty completions under memory pressure; retrying once costs less than failing the whole 10-minute generation run.

Throughout the run, a server-side progress map (videoId → { step, detail, elapsedMs }) is updated so the /learn/$videoId page can poll a getGenerationProgress server function and show a live label like "map 4/9 done · 2 running".
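The retry wrapper mentioned above can be this small — a sketch, since the real `withRetry`'s signature may differ:

```typescript
// Sketch of a two-attempt retry wrapper: rerun the call on any failure,
// rethrow the last error if every attempt fails.
async function withRetry<T>(fn: () => Promise<T>, attempts = 2): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err; // malformed JSON / empty completion: try once more
    }
  }
  throw lastErr;
}
```

For model calls this deliberately retries on *any* error; a production version might skip retries for non-transient failures like schema-definition bugs.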
The chat endpoint is a TanStack Router file route that returns a Server-Sent Events stream in AG-UI format. The whole thing fits in a single handler:
```tsx
// client/src/routes/api.chat.tsx
export const Route = createFileRoute('/api/chat')({
  server: {
    handlers: {
      POST: async ({ request }) => {
        const body = await request.json();
        const video = await fetchVideoByVideoIdService(body.videoId);
        if (!video || video.summaryStatus !== 'generated') {
          return new Response('Summary not ready', { status: 409 });
        }

        const { system } = await prepareChatPrompt(video, body.messages);
        const expanded = expandHistoryForModel(body.messages);

        const adapter = createOllamaChat(CHAT_MODEL, OLLAMA_HOST);
        const stream = chat({
          adapter,
          messages: [{ role: 'system', content: system }, ...expanded] as never,
          tools: [webSearchTool], // agent loop: model can call this
        });

        return toServerSentEventsResponse(stream);
      },
    },
  },
});
```

The web_search tool is defined declaratively with TanStack AI's toolDefinition helper. It defines its own input/output Zod schemas and an execute() function that runs server-side when the model emits a tool call — no custom plumbing needed:
```ts
// client/src/lib/services/chat-tools.ts
export const webSearchTool = toolDefinition({
  name: 'web_search',
  description:
    'Search the public web for additional context when the video transcript ' +
    "doesn't answer the user's question. Use sparingly — cite each result inline.",
  inputSchema: z.object({
    query: z.string().min(2).max(200),
  }),
  outputSchema: z.object({
    results: z.array(z.object({
      title: z.string(), snippet: z.string(), url: z.string(),
    })),
  }),
}).server(async ({ query }) => {
  const results = await webSearch(query, 5); // scrapes DDG HTML endpoint
  return { results };
});
```

On the client, the chat UI consumes the SSE stream incrementally. Each event is one `data: <json>\n\n` line; the client splits on the blank-line delimiter, parses the JSON, and routes by type:
```tsx
// client/src/components/VideoChat.tsx
async function* streamChatResponse(videoId, messages): AsyncGenerator<StreamEvent> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ videoId, messages }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    let idx;
    while ((idx = buffer.indexOf('\n\n')) !== -1) {
      const event = parseSseEventBlock(buffer.slice(0, idx));
      buffer = buffer.slice(idx + 2);
      if (event) yield event;
    }
  }
}
```

The component appends TEXT_MESSAGE_CONTENT deltas to the assistant message in real time and renders TOOL_CALL_START / TOOL_CALL_END events as an expandable accordion above the message body — so when the model calls web_search, the user sees a panel with the query, the results, and the model's natural-language follow-up underneath.
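`parseSseEventBlock` isn't shown in the post; a plausible sketch, assuming each block is a standard SSE event whose `data:` lines carry one JSON payload:

```typescript
// Assumed event shape: every AG-UI event has at least a string `type`.
type StreamEvent = { type: string; [key: string]: unknown };

// Sketch: extract the `data:` lines from one SSE block and parse them as JSON.
// Returns null for comment/heartbeat blocks and malformed payloads instead of
// throwing, so one bad event can't kill the stream.
function parseSseEventBlock(block: string): StreamEvent | null {
  const dataLines = block
    .split('\n')
    .filter((line) => line.startsWith('data:'))
    .map((line) => line.slice('data:'.length).trimStart());
  if (dataLines.length === 0) return null; // e.g. ": keep-alive" comment blocks
  try {
    return JSON.parse(dataLines.join('\n')) as StreamEvent;
  } catch {
    return null; // malformed JSON: drop the event rather than crash
  }
}
```

Joining multiple `data:` lines with `\n` before parsing follows the SSE spec's multi-line data semantics.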
We added the /web slash command mostly as a testing aid. Small local models are generally less reliable at tool calling than hosted frontier ones — the Berkeley Function Calling Leaderboard is the standard place to compare if you want numbers, though Gemma 4 isn't on it yet (we'd expect it to score better than earlier Gemma releases, but that's a guess until someone runs the eval). Either way, we wanted a way to force a web_search call without depending on the model's discretion.
Anecdotally though, in my own testing the model picked up web_search on its own most of the time the question clearly wasn't covered by the transcript.
I needed /web less often than I expected. Still useful to have for reproducible testing, and for the cases where the model tries to answer from transcript passages that don't really cover the question. The client rewrites the message into an explicit instruction when /web is used:
```ts
function transformSlashCommand(input: string): string {
  const webMatch = input.match(/^\/web\s+(.+)$/i);
  if (webMatch) {
    const query = webMatch[1].trim();
    return `Use the web_search tool with the exact query "${query}", ` +
      `then summarize the top results in 2-3 short paragraphs. Cite each ` +
      `source URL inline. Do NOT answer from the transcript for this request.`;
  }
  return input;
}
```

The first version of this pipeline let the summary model generate timecodes directly. The schema had a timeSec field on each section, and the system prompt asked the model to fill it in from the transcript.
On short videos it mostly worked. On anything past 30 minutes it fell apart — the model emitted confidently wrong timestamps.
A section about "testing" would point to a timestamp from an earlier product-demo segment.
Chat answers cited [28:14] when the relevant content was at 10:02. The model wasn't reading timestamps off the transcript; it was pattern-matching what a timestamp looked like and producing one that felt plausible.
We tried fixes in the order we thought of them:
- Stricter prompts: "the timeSec field must be the real caption start from the transcript; do not guess." The model obeyed the format of the rule and kept guessing.
- Stricter schemas: we tightened timeSec to `z.number().int().nonnegative()` and added schema-level `.describe()` hints. Same result — well-formed numbers that pointed to the wrong place.
- In-prompt timecode anchors: we injected [mm:ss] markers into the transcript at 15-second intervals before sending it, hoping the model would copy one into timeSec instead of inventing. It helped for sections near the front of the video, then got worse toward the back as the model's attention thinned.

Each of these looked like a fix on short test clips and then quietly fell apart the moment we ran a real 45-minute video through it.
The underlying problem is that asking a language model to emit a factual pointer into its own input is asking it to do the one thing language models are worst at: precise factual recall over a long context.
You end up with confident wrong answers, and the user has no way to tell which timecodes are real and which aren't.
The design we landed on flips the problem. The model isn't allowed to produce timecodes at all. The system prompt explicitly forbids them; the timeSec field is omitted from generation; any [mm:ss] that slips through is stripped. The model's job is purely semantic — write a good section heading, a good section body, a good chat answer.
A separate, deterministic pass then attaches real caption-segment start times to those outputs by asking: given this snippet of text the model wrote, which window of the transcript was it actually talking about? That's a classic information-retrieval problem, and the algorithm we picked for it is BM25 — lexical, cheap, and deterministic. Same text plus same transcript always returns the same chunk, which is exactly the property a grounding pass needs.
BM25 ("Best Matching 25") is a lexical ranking function from the Okapi family, developed by Stephen Robertson and Karen Spärck Jones in the 1990s. It's the same algorithm that's powered Lucene, Elasticsearch, OpenSearch, and Solr for the better part of two decades — every time you've used the GitHub search box or typed into a wiki, BM25 (or a close cousin) was probably ranking the results.
It's built on top of two much older ideas:

- Term frequency (TF): the more often a document mentions a query term, the more relevant that document probably is.
- Inverse document frequency (IDF): terms that are rare across the corpus carry more signal than terms that appear everywhere.
BM25 combines those two with a couple of saturation knobs that fix the obvious problems with naïve TF-IDF — namely, that doubling the term frequency shouldn't double the score (saturation via k1), and that long documents shouldn't be penalized purely for being long (length normalization via b).
The scoring formula, applied per document for each query term and summed:
```
                                      f(term, doc) · (k1 + 1)
score(doc, query) =    Σ    IDF(term) · ─────────────────────────────────────────────
                   term ∈ query       f(term, doc) + k1 · (1 − b + b · |doc| / avgdl)
```

where f(term, doc) is the raw count of the term in the document, |doc| is the document length, avgdl is the average length across the corpus, and k1 and b are tuning constants (Lucene's defaults — k1=1.2, b=0.75 — are what we use). The whole thing fits in a few dozen lines of pure JavaScript.
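As a concrete sketch, here's the scoring math in TypeScript, assuming a hypothetical `Doc` shape with precomputed per-document term frequencies (the app's `StoredTranscriptIndex` stores equivalent data in its own layout):

```typescript
// Assumed index shape: one Doc per transcript chunk.
interface Doc { id: number; terms: Map<string, number>; length: number }

const K1 = 1.2;  // term-frequency saturation
const B = 0.75;  // document-length normalization

// Lucene-style IDF with +0.5 smoothing so scores stay positive.
function idf(term: string, docs: Doc[]): number {
  const n = docs.filter((d) => d.terms.has(term)).length;
  return Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
}

// BM25 score of one document against a tokenized query.
function bm25(query: string[], doc: Doc, docs: Doc[], avgdl: number): number {
  let score = 0;
  for (const term of query) {
    const f = doc.terms.get(term) ?? 0;
    if (f === 0) continue; // no overlap on this term
    score +=
      idf(term, docs) *
      (f * (K1 + 1)) /
      (f + K1 * (1 - B + (B * doc.length) / avgdl));
  }
  return score;
}
```

Ranking a query is then just scoring every chunk and sorting descending; at transcript scale (a few hundred chunks) that's effectively free.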
What you get is a lexical retriever: it matches on word overlap, weighted by rarity. It does not understand synonyms. It does not understand paraphrase. If the transcript says "shipped" and you ask about "launched", BM25 alone will miss it. We deal with that two ways below — but the core algorithm is just term-frequency math, which means zero model downloads, zero vector storage, and zero inference cost at retrieval time.
Each transcript becomes a corpus of small chunks that BM25 ranks against a query. Two sizes, sharing the same primitive:
| Purpose | Chunk size | Overlap |
|---|---|---|
| Chat retrieval (top-k) | 150 words (~60s) | 20 words |
| Summary map-reduce | 2,500 words (~17 min) | 50 words |
Each chunk gets a real timeSec by looking up the segment timestamp of its first word — youtubei.js gives us caption segments with millisecond-precise start times, and we keep a parallel wordStartMs[] array through the cleaning pass so chunks land on real timestamps instead of linear-interpolated estimates. Chunks also get inline [mm:ss] markers at 15-second intervals so the model can copy real timestamps into the prose it produces:
```ts
const text = wordStartMs
  ? annotateSpan(words, wordStartMs, i, end, 15) // "...we [01:23] then..."
  : words.slice(i, end).join(' ');
```

We index those chunks once at summary time. Tokenization is just lowercased word-boundary splits + a small English stopword filter — no stemmer. The full index (per-chunk term frequencies, global IDF, length stats, the chunks themselves) serializes to plain JSON and lives in Video.transcriptSegments, so the chat endpoint can load it on demand without rebuilding.
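The internals of `formatTimecode` and `annotateSpan` aren't shown; here's a guessed sketch consistent with the described behavior — a [mm:ss] marker dropped whenever `intervalSec` seconds of speech have elapsed, with timestamps read off the `wordStartMs` array:

```typescript
// mm:ss rendering used for the inline markers and seek chips.
function formatTimecode(sec: number): string {
  const m = Math.floor(sec / 60);
  const s = Math.floor(sec % 60);
  return `${String(m).padStart(2, '0')}:${String(s).padStart(2, '0')}`;
}

// Sketch of annotateSpan: walk words in [start, end) and insert a real
// [mm:ss] marker each time `intervalSec` seconds have passed.
function annotateSpan(
  words: string[],
  wordStartMs: number[],
  start: number,
  end: number,
  intervalSec: number,
): string {
  const out: string[] = [];
  let nextMarkMs = wordStartMs[start];
  for (let i = start; i < end; i++) {
    if (wordStartMs[i] >= nextMarkMs) {
      out.push(`[${formatTimecode(wordStartMs[i] / 1000)}]`);
      nextMarkMs = wordStartMs[i] + intervalSec * 1000;
    }
    out.push(words[i]);
  }
  return out.join(' ');
}
```

Because markers come straight from caption timestamps, anything the model later copies into its prose is a real timecode, not an invented one.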
To get around BM25's pure-lexical limitation, we layer two techniques on top for chat retrieval:
Multi-query rewriting + RRF: a small LLM call rewrites the user's question into 4 alternative phrasings using different vocabulary. We run BM25 against each one independently, then fuse the rankings with Reciprocal Rank Fusion (k=60):
```
RRF_score(chunk) =    Σ    1 / (k + rank_i(chunk))
                  i ∈ queries
```

Chunks that show up across multiple phrasings rise to the top. RRF is rank-based (not score-based), so it handles the score-scale differences between queries cleanly.
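In code, the fusion step is a few lines. A sketch, using 0-based rank positions and numeric chunk ids standing in for real chunk references:

```typescript
// Reciprocal Rank Fusion: each ranking contributes 1/(k + rank) per chunk,
// and chunks are re-sorted by their summed score.
function rrfFuse(rankings: number[][], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const ranking of rankings) {
    ranking.forEach((chunkId, idx) => {
      scores.set(chunkId, (scores.get(chunkId) ?? 0) + 1 / (k + idx + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because only rank positions matter, the four rewritten queries can produce BM25 scores on wildly different scales without skewing the fused order.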
The grounding pass that runs after the model finishes is the same primitive used in two places:
- Section timestamps: for each generated section we run `findEvidenceForQuote("${heading} ${body.slice(0, 200)}", index)` and take the top chunk's timeSec as the section's real timestamp.
- Chat citations: for every [mm:ss] marker in a chat response, we BM25-match the surrounding text against the index, snap the chip to the top chunk's real timeSec, dedupe across the response (±15s), and render an expandable "Sources" accordion with the actual transcript snippet behind each citation. If the model's emitted timestamp drifts more than 30s from the grounded one, we flag it with a "drift" badge — usually a tell that the model hallucinated the citation.

Honestly, I wanted to explore something other than building a standard RAG — embed everything with a sentence-transformer, stick it in a vector store, query with cosine similarity. That's the well-trodden path, and it felt worth seeing what else was out there before defaulting to it. That's how I stumbled on BM25.
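The two guardrails described above, the ±15s citation dedupe and the 30s drift flag, reduce to tiny helpers. A sketch (helper names are hypothetical):

```typescript
// Flag a citation when the model's emitted timecode is more than
// `toleranceSec` away from the BM25-grounded one.
function hasDrift(emittedSec: number, groundedSec: number, toleranceSec = 30): boolean {
  return Math.abs(emittedSec - groundedSec) > toleranceSec;
}

// Collapse grounded timecodes that land within `windowSec` of each other,
// keeping the earliest of each cluster.
function dedupeTimecodes(secs: number[], windowSec = 15): number[] {
  const kept: number[] = [];
  for (const s of [...secs].sort((a, b) => a - b)) {
    if (kept.length === 0 || s - kept[kept.length - 1] > windowSec) kept.push(s);
  }
  return kept;
}
```

Neither helper touches the model; both operate purely on numbers the grounding pass already produced, which keeps the whole check deterministic.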
Once I started reading into it, a few things clicked for this particular app: it's deterministic, it needs no model downloads or vector storage, and retrieval costs nothing at query time.
We did layer two things on top to deal with BM25's blind spot around paraphrases — contextual retrieval at the chunk level, query rewriting + RRF at the query level. Between those, "shipped" vs "launched" mostly stops being a problem for chat. It's possible that at some point we'd hit the wall where embeddings win anyway, we just haven't gotten there with what we've built.
If you want to try embeddings, the whole retrieval layer lives in a single function (retrieveChunks in learning.ts) — keep the StoredTranscriptIndex shape and swap the implementation. For this app BM25 was enough. Multi-video search or cross-corpus retrieval are different problems where we'd want to measure before assuming the same answer.
This whole thing runs on a single laptop. You'll need:
- Node and Yarn
- Ollama installed and running (the macOS app, or `ollama serve` on Linux/Windows)

Then:
```bash
# 1. Clone the repo
git clone https://github.com/codingafterthirty/yt-knowledge-base.git
cd yt-knowledge-base

# 2. Install deps + copy .env files for both client and server
yarn setup

# 3. Pull a chat-capable model. The default Modelfile is gemma4-kb,
ollama pull gemma4-kb:latest

# 4. (Optional) Load example videos so the feed isn't empty.
#    Run BEFORE starting Strapi — `strapi import` needs exclusive
#    write access to the SQLite DB.
yarn seed

# 5. Start Ollama + Strapi + the TanStack client together.
yarn start
```

Open http://localhost:3000 for the frontend; the Strapi admin lives at http://localhost:1337/admin.
A few useful environment variables to know about (in client/.env):
| Variable | Default | Purpose |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434/v1` | Ollama endpoint (the `/v1` is stripped for the TanStack AI adapter, kept for backwards compat) |
| `OLLAMA_MODEL` | `gemma4-kb:latest` | Model used for summary generation |
| `OLLAMA_CHAT_MODEL` | inherits `OLLAMA_MODEL` | Use a separate model for chat if you want |
| `MAP_CONCURRENCY` | `1` | Parallel map-step workers. Bump to 2-4 if you have RAM headroom — must match `OLLAMA_NUM_PARALLEL` |
| `STRAPI_URL` | `http://localhost:1337` | Local Strapi |
If your IP hits a YouTube "confirm you're not a bot" wall (rare on residential IPs, common on datacenter ones), set TRANSCRIPT_PROXY_URL to a residential proxy.
The other useful scripts:
```bash
yarn dev                # Strapi + client (skips Ollama env setup)
yarn start:fresh        # Like yarn start but force-restarts Ollama
yarn export             # Export the SQLite DB to server/seed-data/seed.tar.gz
yarn --cwd client test  # Run the vitest suite
```

The architecture is deliberately swappable. Want embeddings instead of BM25? retrieveChunks in learning.ts is the single injection point — keep the StoredTranscriptIndex shape and nothing downstream has to change.
Want a new tool? Define it with toolDefinition, export it from chat-tools.ts, add it to the tools: [...] array in api.chat.tsx. Want a different model or provider? Change OLLAMA_MODEL, or replace the adapter with any other TanStack AI adapter.
The project is MIT. Questions and PRs welcome.
Thanks for checking out the post! If you have any questions, join us during our open office hours on Discord, Monday through Friday at 12:30pm CST, or join the GitHub discussions.