Traditional keyword search fails when users phrase queries differently from your content. For example, a user searches for "how to handle user login," but your article is titled "Authentication and Authorization in Strapi." Zero results. Frustrated user.
Retrieval-Augmented Generation (RAG) addresses this by converting your content into vector embeddings that capture meaning, not just keywords. When a user asks a question, the system finds semantically relevant content chunks, feeds them to a Large Language Model (LLM), and returns a grounded answer based on your actual documentation.
This guide walks you through building a complete RAG pipeline over your Strapi content using OpenAI embeddings, Qdrant as a vector database, and GPT-4o-mini for answer generation. By the end, you'll have a working /rag-search endpoint in your Strapi backend that answers natural-language questions using your own headless CMS content.
In brief:

- Extract content from Strapi's REST API, chunk it, embed each chunk with OpenAI's text-embedding-3-small model, and store the vectors in Qdrant.
- Expose a custom /rag-search endpoint that embeds the user's question, retrieves the most similar chunks, and asks GPT-4o-mini for a grounded, cited answer.

Before starting, you need:

- A Strapi project with published content to index (this guide assumes an `articles` content type)
- An OpenAI API key
- A Qdrant instance, either local via Docker (`docker run -p 6333:6333 qdrant/qdrant`) or a free cloud cluster from qdrant.tech

The pipeline has two phases. During ingestion, you pull content from Strapi, split it into chunks, generate embeddings, and store them in a vector database. At query time, the user's question gets embedded with the same model, the vector database returns the most semantically similar chunks, and those chunks get injected into a prompt sent to GPT-4o-mini.
1. Ingestion: Strapi Content → Chunks → Embeddings → Qdrant
2. Query: User Question → Embedding → Qdrant Search → Context + Prompt → GPT-4o-mini → Answer

Vector embeddings are arrays of floating-point numbers that represent text as points in high-dimensional space. The embedding model maps semantically similar phrases to nearby points, so "user login" and "authentication" can end up close together even though they share no keywords. This is what makes semantic search possible. You're comparing meaning, not strings.
When comparing these vectors, cosine similarity measures how closely aligned two vectors are, regardless of their magnitude: the higher the score, the more similar the texts. It works well for text because it focuses on the orientation of the vector rather than its length, which makes it robust across varying document sizes.
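As an illustration (not part of the pipeline, since Qdrant computes this for you), cosine similarity can be sketched in a few lines of JavaScript:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1 = same direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

Real embedding vectors have hundreds or thousands of components, but the arithmetic is identical.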
You might wonder why not just fine-tune an LLM on your headless CMS content instead. The problem is that content changes constantly. Articles get published, updated, and archived daily. Fine-tuning is expensive, slow, and produces a static model that's immediately outdated when someone edits a page. RAG lets you update the knowledge base by re-embedding changed content, with no retraining required.
Grounding is central to why RAG reduces hallucination. When you ask an LLM a question directly, it generates answers from its training data, which may be outdated or wrong for your specific content. By retrieving actual chunks from your documentation and injecting them into the prompt, you constrain the model to answer based on evidence it can see. The LLM becomes a reasoning engine over your content rather than a guessing machine.
One critical constraint: use the same embedding model at both ingestion and query time. Vectors produced by different models live in different spaces, so comparing them yields meaningless similarity scores.
Start by pulling content from Strapi's REST API. You need a read-only API token for server-to-server extraction.
A common mistake is assuming relations are populated by default. Strapi's REST API omits them unless you explicitly request them with the populate parameter.
Install dependencies first:
```bash
npm install openai @qdrant/js-client-rest qs marked
```

Create a script called `scripts/ingest.js` to fetch and chunk your content:
```javascript
const qs = require('qs');
const { marked } = require('marked');

const STRAPI_URL = process.env.STRAPI_URL || 'http://localhost:1337';
const API_TOKEN = process.env.STRAPI_API_TOKEN;

// Fetch all entries with pagination
async function fetchAllContent(contentType) {
  let page = 1;
  let allEntries = [];
  let hasMore = true;

  while (hasMore) {
    const query = qs.stringify({
      pagination: { page, pageSize: 25 },
      populate: { blocks: true, author: true, categories: true },
      sort: ['publishedAt:desc']
    }, { encodeValuesOnly: true });

    const response = await fetch(`${STRAPI_URL}/api/${contentType}?${query}`, {
      headers: { 'Authorization': `Bearer ${API_TOKEN}` }
    });

    const data = await response.json();
    allEntries = [...allEntries, ...data.data];

    const { page: currentPage, pageCount } = data.meta.pagination;
    hasMore = currentPage < pageCount;
    page++;
  }

  return allEntries;
}
```

Paginating the bulk extraction keeps each request small and avoids pulling the entire dataset in a single call.
Next, convert rich text to plain text. Strapi's newer Block Editor stores content as structured JSON blocks, while the legacy editor uses Markdown. Neither format is ready for embedding as-is:
```javascript
// For Markdown content (legacy editor)
function markdownToPlainText(markdown) {
  const html = marked.parse(markdown);
  return html.replace(/<[^>]*>/g, '').trim();
}

// For JSON Block content (new Block Editor)
function blocksToPlainText(blocks) {
  if (!blocks || !Array.isArray(blocks)) return '';
  return blocks
    .filter(block => block.type === 'paragraph' || block.type === 'heading')
    .map(block => block.children?.map(child => child.text).join('') || '')
    .join('\n\n');
}
```

This plain-text conversion step helps embedding quality. Embedding models work best with clean, readable text. HTML tags, Markdown syntax characters, and JSON structure add noise that can affect the resulting vectors.
If you embed raw Markdown like ## Authentication, the model spends part of its capacity encoding the ## characters instead of just the word "Authentication." Stripping formatting helps the vector focus on the semantic meaning of the words rather than structural markup, which can improve similarity matches at query time.
Now chunk the content. For RAG, a practical starting point is chunks of roughly 200 to 800 tokens. Split by semantic boundaries first, then by paragraphs:
```javascript
function chunkContent(plainText, metadata, maxChars = 3000) {
  // Split on Markdown-style headings first. This only matches if your
  // plain-text conversion preserves heading markers; otherwise the whole
  // document is one section and falls through to the paragraph split below.
  const sections = plainText.split(/\n#{1,3}\s/);
  const chunks = [];

  for (const section of sections) {
    if (section.length <= maxChars) {
      chunks.push({ content: section.trim(), metadata });
    } else {
      const paragraphs = section.split(/\n\n+/);
      let currentChunk = '';
      for (const para of paragraphs) {
        if ((currentChunk + para).length > maxChars) {
          if (currentChunk) chunks.push({ content: currentChunk.trim(), metadata });
          currentChunk = para;
        } else {
          currentChunk += '\n\n' + para;
        }
      }
      if (currentChunk) chunks.push({ content: currentChunk.trim(), metadata });
    }
  }
  return chunks;
}
```

That chunk range is a trade-off: smaller chunks match queries more precisely but carry less surrounding context, while larger chunks preserve context at the cost of blurrier similarity scores.
The metadata attached to each chunk (title, URL, and documentId) is what enables source citation in the final answer. Without it, you'd have no way to link the user back to the original article.
Note that Strapi 5 uses a flattened format where attributes sit directly on the data object, unlike v4's nested attributes key. The extraction code below accounts for this:
```javascript
async function extractAndChunkContent() {
  const articles = await fetchAllContent('articles');

  const chunks = articles.flatMap(article => {
    const content = article.attributes || article; // v4 vs v5
    const plainText = blocksToPlainText(content.blocks)
      || markdownToPlainText(content.body || '');

    return chunkContent(plainText, {
      documentId: article.documentId || article.id,
      title: content.title,
      url: `/articles/${content.slug}`,
      updatedAt: content.updatedAt,
      contentType: 'article'
    });
  });

  return chunks;
}
```

With chunks ready, generate embeddings using OpenAI's text-embedding-3-small model and store them in Qdrant. This model outputs 1,536-dimensional vectors by default, but you can reduce dimensions to save storage:
```javascript
const OpenAI = require('openai');
const { QdrantClient } = require('@qdrant/js-client-rest');

const openai = new OpenAI(); // reads OPENAI_API_KEY from env
const qdrant = new QdrantClient({ url: process.env.QDRANT_URL || 'http://localhost:6333' });

const COLLECTION_NAME = 'strapi-content';

async function createCollection() {
  await qdrant.createCollection(COLLECTION_NAME, {
    vectors: { size: 1536, distance: 'Cosine' }
  });
}
```

Why text-embedding-3-small over text-embedding-3-large? The small model uses fewer dimensions, and for many headless CMS search use cases, it's a practical default. The large model outputs 3,072-dimensional vectors, which increases storage requirements and can make similarity search heavier.
The large model is worth considering if your content is highly technical or domain-specific, where finer semantic distinctions may matter. For general articles, blog posts, and documentation, the small model is a sensible starting point.
The Embeddings API accepts arrays of strings, so you can batch many chunks per request instead of sending one call per chunk. Each input has an 8,192-token limit, but if you're chunking to about 3,000 characters, you'll stay well under that.
```javascript
const crypto = require('crypto');

// Qdrant point IDs must be unsigned integers or UUIDs, so derive a
// deterministic UUID-shaped ID from each chunk key. Re-ingesting the
// same chunk then overwrites the existing point instead of duplicating it.
function pointIdFor(chunkKey) {
  const hex = crypto.createHash('md5').update(chunkKey).digest('hex');
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}

async function generateAndStoreEmbeddings(chunks) {
  // Process in batches during ingestion
  const batchSize = 20;

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);

    const embeddingResponse = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch.map(c => c.content.replace(/\n/g, ' ').trim()),
    });

    const points = batch.map((chunk, idx) => {
      const chunkId = `${chunk.metadata.documentId}-${i + idx}`;
      return {
        id: pointIdFor(chunkId),
        vector: embeddingResponse.data[idx].embedding,
        payload: {
          chunkId,
          text: chunk.content.substring(0, 1000),
          title: chunk.metadata.title,
          url: chunk.metadata.url,
          updatedAt: chunk.metadata.updatedAt,
        }
      };
    });

    await qdrant.upsert(COLLECTION_NAME, { points });
    console.log(`Indexed batch ${i / batchSize + 1}`);
  }
}
```

Add a retry wrapper for rate limits. HTTP 429 is the status you'll see when you hit them:
```javascript
async function createEmbeddingWithRetry(input, model = "text-embedding-3-small", maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await openai.embeddings.create({ model, input });
    } catch (error) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        const delay = 1000 * Math.pow(2, attempt); // exponential backoff
        console.log(`Rate limited. Retrying in ${delay}ms...`);
        await new Promise(resolve => setTimeout(resolve, delay));
      } else {
        throw error;
      }
    }
  }
}
```

To use it, swap the direct openai.embeddings.create call in generateAndStoreEmbeddings for this wrapper. Tie it all together and run the ingestion:
```javascript
async function ingest() {
  await createCollection();
  const chunks = await extractAndChunkContent();
  console.log(`Extracted ${chunks.length} chunks. Generating embeddings...`);
  await generateAndStoreEmbeddings(chunks);
  console.log('Ingestion complete.');
}

ingest().catch(console.error);
```

Run it with:

```bash
STRAPI_API_TOKEN=your-token OPENAI_API_KEY=sk-... node scripts/ingest.js
```

Now build the search endpoint inside Strapi using a custom controller. This follows Strapi's standard createCoreController pattern.
First, configure your API keys using Strapi environment configuration:
```bash
# .env
OPENAI_API_KEY=sk-...
QDRANT_URL=http://localhost:6333
```

Create the route file:
```javascript
// src/api/rag-search/routes/rag-search.js
module.exports = {
  routes: [
    {
      method: 'POST',
      path: '/rag-search',
      handler: 'rag-search.search',
      config: {
        // Assumes a custom global policy at src/policies/is-authenticated.js;
        // use an empty array to leave the endpoint public while testing.
        policies: ['global::is-authenticated'],
        middlewares: [],
      },
    },
  ],
};
```

Strapi organizes custom APIs under `src/api/[api-name]/` with separate directories for routes, controllers, and services. The route file maps HTTP methods and paths to controller actions. Here, a POST to /rag-search calls the search method on the rag-search controller. The config.policies array lets you attach authentication or authorization checks that run before the handler executes.
For production use, consider adding authentication policies or rate limiting via the Upstash rate limit plugin, though the plugin is currently marked as experimental.
Now the controller. This is where the RAG pipeline comes together:
```javascript
// src/api/rag-search/controllers/rag-search.js
const { createCoreController } = require('@strapi/strapi').factories;
const OpenAI = require('openai');
const { QdrantClient } = require('@qdrant/js-client-rest');

// Note: createCoreController expects a matching content-type UID. If your
// rag-search API has no content type, export a plain controller object
// ({ async search(ctx) { ... } }) instead.
module.exports = createCoreController('api::rag-search.rag-search', ({ strapi }) => ({

  async search(ctx) {
    try {
      const { query } = ctx.request.body;

      if (!query || typeof query !== 'string') {
        return ctx.throw(400, 'Valid query string is required');
      }
      if (query.length > 1000) {
        return ctx.throw(400, 'Query exceeds maximum length of 1000 characters');
      }

      const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
      const qdrant = new QdrantClient({ url: process.env.QDRANT_URL });

      // 1. Embed the query with the same model used at ingestion
      const embeddingResponse = await openai.embeddings.create({
        model: 'text-embedding-3-small',
        input: query.replace(/\n/g, ' ').trim(),
      });
      const queryEmbedding = embeddingResponse.data[0].embedding;

      // 2. Search for relevant chunks
      const searchResults = await qdrant.search('strapi-content', {
        vector: queryEmbedding,
        limit: 5,
        with_payload: true,
      });

      if (!searchResults.length) {
        return ctx.send({
          data: { answer: 'No relevant content found.', sources: [] }
        });
      }

      // 3. Build context from retrieved chunks
      const context = searchResults
        .map((m, i) => `[Source ${i + 1}: ${m.payload.title}]\n${m.payload.text}`)
        .join('\n\n---\n\n');

      // 4. Generate grounded answer
      const completion = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          {
            role: 'system',
            content: `You are a helpful assistant that answers questions based on the provided documentation.

IMPORTANT RULES:
1. Answer ONLY using information from the context below
2. If the context doesn't contain enough information, say so
3. Cite which source(s) you used by referencing [Source N]
4. Do not make up facts not present in the context

Context:
${context}`,
          },
          { role: 'user', content: query },
        ],
        temperature: 0.1,
        max_tokens: 800,
      });

      return ctx.send({
        data: {
          answer: completion.choices[0].message.content,
          sources: searchResults.map(m => ({
            score: m.score,
            title: m.payload.title,
            url: m.payload.url,
            excerpt: m.payload.text?.substring(0, 200) + '...',
          })),
        },
      });

    } catch (error) {
      if (error.status === 429) {
        return ctx.throw(429, 'AI service rate limit exceeded. Please retry shortly.');
      }
      strapi.log.error('RAG search error:', error);
      return ctx.throw(500, 'An error occurred during search');
    }
  },
}));
```

The limit: 5 parameter in the Qdrant search controls how many chunks are retrieved as context for the LLM. Five chunks is a useful starting point because it gives the model multiple relevant passages without making the prompt unnecessarily large. Depending on your chunk size and content density, you can experiment with nearby values: if your chunks are small, you may need more; if they're larger, fewer may be enough.
In production, it helps to filter out low-relevance results before feeding them to the LLM. Each result from Qdrant includes a score field, and lower-scoring results may be noise rather than signal. Adding a score filter can help keep the context cleaner:
```javascript
const relevant = searchResults.filter(r => r.score > 0.7);
```

Treat 0.7 as an example starting point and adjust based on your content. If you're getting too few results, lower it. If answers seem off-topic, raise it.
The prompt engineering here is deliberate. Setting temperature: 0.1 keeps responses more constrained, which is useful for retrieval-based answers. Temperature affects how variable the model's output is. For RAG, where you want answers grounded in retrieved content, a lower temperature helps reduce improvisation beyond what the context supports. The system prompt's explicit rules may help reduce hallucination, but their effect is limited and not reliably established. The [Source N] citation pattern lets your frontend link back to original content.
Test it with:
```bash
curl -X POST http://localhost:1337/api/rag-search \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I authenticate users?"}'
```

A few things are worth considering before shipping this:
- Reduce dimensions: pass `dimensions: 512` to openai.embeddings.create() with text-embedding-3-small to cut vector storage by 66%. Update your Qdrant collection's size accordingly.
- Shrink payloads: request embeddings with `encoding_format: "base64"`. This can reduce API response payload size for embedding requests.
- Rate limiting: the express-rate-limit package or the Upstash rate limit plugin can prevent abuse and control OpenAI costs.

For existing Strapi plugins in this space, the Strapi blog post on building a semantic search plugin with OpenAI is worth reading if you want to package your implementation as a reusable Strapi plugin.
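To sketch the dimension-reduction point (the helper names here are illustrative, not part of any API), the only hard requirement is that the collection size and the `dimensions` parameter agree, plus a full re-ingest, since 1,536-dimension vectors can't be compared with 512-dimension ones:

```javascript
// Both the Qdrant collection and every embeddings call must use the
// same reduced dimension count; changing it requires re-ingesting.
const EMBEDDING_DIMENSIONS = 512;

// Options object for qdrant.createCollection(name, ...)
function collectionConfig(size = EMBEDDING_DIMENSIONS) {
  return { vectors: { size, distance: 'Cosine' } };
}

// Request body for openai.embeddings.create(...)
function embeddingRequest(input, dimensions = EMBEDDING_DIMENSIONS) {
  return { model: 'text-embedding-3-small', input, dimensions };
}

console.log(collectionConfig().vectors.size === embeddingRequest(['x']).dimensions); // true
```

Centralizing the constant keeps the two sides from drifting apart.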
Your vector index becomes outdated whenever someone publishes or updates content in the Admin Panel, unless you have automatic synchronization in place. Lifecycle hooks can trigger embedding updates on content create, update, and delete events, though in Strapi v5 they fire based on Document Service API methods, and Strapi now generally recommends document service middleware for most use cases.
Register hooks in the bootstrap function:
```javascript
// src/index.js
const crypto = require('crypto');
const OpenAI = require('openai');
const { QdrantClient } = require('@qdrant/js-client-rest');

// Qdrant point IDs must be unsigned integers or UUIDs; derive a
// deterministic UUID from the chunk key so repeated upserts overwrite
// rather than duplicate. Use the same scheme in your ingestion script
// so hook updates overwrite the vectors created during ingestion.
function pointIdFor(chunkKey) {
  const hex = crypto.createHash('md5').update(chunkKey).digest('hex');
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}

module.exports = {
  async bootstrap({ strapi }) {
    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
    const qdrant = new QdrantClient({ url: process.env.QDRANT_URL });

    strapi.db.lifecycles.subscribe({
      models: ['api::article.article'],

      async afterCreate(event) {
        const { result } = event;
        await upsertEmbedding(openai, qdrant, result);
      },

      async afterUpdate(event) {
        const { result } = event;
        await upsertEmbedding(openai, qdrant, result);
      },

      async beforeDelete(event) {
        const { params } = event;
        const id = params.where.id.toString();
        // scroll returns { points, next_page_offset }; for collections
        // larger than 100 points, keep scrolling with the returned offset.
        const existing = await qdrant.scroll('strapi-content', {
          limit: 100,
          with_payload: true,
        });
        const pointIds = existing.points
          .filter(point => point.payload?.chunkId?.startsWith(`${id}-`))
          .map(point => point.id);
        if (pointIds.length) {
          await qdrant.delete('strapi-content', { points: pointIds });
        }
        strapi.log.info(`Deleted embedding for article ${id}`);
      },
    });
  },
};

// blocksToPlainText is the same helper used in the ingestion script;
// extract it into a shared module so both files can import it.
async function upsertEmbedding(openai, qdrant, entry) {
  try {
    const textToEmbed = [entry.title, entry.description, entry.body || blocksToPlainText(entry.blocks)]
      .filter(Boolean)
      .join('\n\n');

    if (!textToEmbed.trim()) return;

    const embeddingResponse = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: textToEmbed.replace(/\n/g, ' ').trim(),
    });

    const chunkId = `${entry.documentId || entry.id.toString()}-0`;
    await qdrant.upsert('strapi-content', {
      points: [{
        id: pointIdFor(chunkId),
        vector: embeddingResponse.data[0].embedding,
        payload: {
          chunkId,
          text: textToEmbed.substring(0, 1000),
          title: entry.title,
          url: `/articles/${entry.slug}`,
          updatedAt: entry.updatedAt,
        },
      }]
    });

    strapi.log.info(`Upserted embedding for: ${entry.title}`);
  } catch (error) {
    strapi.log.error('Embedding upsert failed:', error.message);
  }
}
```

An important detail is the try/catch in upsertEmbedding. It swallows a failed embedding call so the content operation itself still succeeds, unless you re-throw the error. Handle that trade-off intentionally: for some teams, failing silently is acceptable; others might prefer a webhook approach that processes embeddings asynchronously in a separate service.
The beforeDelete hook attempts to use the entry's ID to remove the corresponding vector from Qdrant, helping keep the index clean. Without it, deleted articles would continue appearing in search results, which is a confusing experience for users.
Be aware of bulk operations. In Strapi v5, bulk actions like createMany, updateMany, and deleteMany do not trigger lifecycles at all when called through the Document Service API, so bulk imports that use them silently skip your embedding sync. Conversely, importing records one at a time fires a hook (and an OpenAI call) per record, which can quickly hit rate limits. For bulk imports, consider your embedding provider's batch-ingestion features or a dedicated import workflow designed for bulk operations.
If you're on Strapi 5, also review the Document Service Middleware docs. Lifecycle hooks still work, but Document Service Middleware may be a better fit depending on your architecture.
Empty search results. Check that your Qdrant collection name matches between ingestion and query. It should be strapi-content in both the ingestion script and the controller. Verify embeddings were actually stored by hitting Qdrant's REST API directly: GET http://localhost:6333/collections/strapi-content should return a points_count greater than zero. If the count is zero, your ingestion script didn't complete successfully.
Irrelevant results returned. Your chunks may be too large or contain too much boilerplate. Review the plain text output of your blocksToPlainText or markdownToPlainText functions and ensure they strip navigation elements, footers, and repeated content. Try reducing maxChars in the chunking function from 3000 to 1500 and re-running ingestion.
OpenAI 401 errors. Your OPENAI_API_KEY environment variable is missing or invalid. Verify it's set in your .env file and that Strapi is loading it. Restart the server after any .env changes. You can test the key independently with a simple curl call to the OpenAI API.
Lifecycle hooks not firing. In Strapi 5, ensure you're subscribing to the correct model UID format (api::article.article). Check the Strapi server logs on startup for any errors in the bootstrap function. If you don't see your strapi.log.info messages after creating or updating content, the subscription may not have registered.
Qdrant connection refused. Make sure your Qdrant Docker container is running (docker ps) and the port mapping is correct (-p 6333:6333). If you're using Qdrant Cloud, verify the QDRANT_URL includes the full URL with protocol and that any API key is configured correctly.
Slow search responses. Most latency comes from the two OpenAI API calls, embedding generation and chat completion. To reduce perceived latency, consider caching frequent query embeddings so repeated or similar questions skip the embedding step. You can also use streaming responses so users see the answer forming in real time rather than waiting for the full completion to finish before displaying anything.
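A minimal sketch of that embedding cache (in-process with a simple size cap, which is fine for a single Strapi instance but not for a multi-node deployment):

```javascript
// Cache query embeddings keyed by normalized query text, so repeated
// questions skip the OpenAI embeddings call entirely.
const MAX_CACHE_ENTRIES = 500;
const embeddingCache = new Map();

async function getQueryEmbedding(query, embedFn) {
  // Normalize so trivially different phrasings share a cache entry.
  const key = query.replace(/\s+/g, ' ').trim().toLowerCase();
  if (embeddingCache.has(key)) return embeddingCache.get(key);

  const vector = await embedFn(key); // e.g. wraps openai.embeddings.create

  if (embeddingCache.size >= MAX_CACHE_ENTRIES) {
    // Evict the oldest entry (Map preserves insertion order).
    embeddingCache.delete(embeddingCache.keys().next().value);
  }
  embeddingCache.set(key, vector);
  return vector;
}
```

In the controller, you would call getQueryEmbedding with a small wrapper around the OpenAI client instead of calling the Embeddings API directly.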
Stale results after content updates. If lifecycle hooks are configured but search results still show old content, check that the documentId used during ingestion matches the ID used in lifecycle hook upserts. A mismatch means updates create new vectors instead of overwriting existing ones, leaving outdated vectors in the index alongside the new ones.
You now have a working RAG pipeline that turns your Strapi content into an intelligent search system. The pieces are straightforward: extract content via the REST API, embed it with OpenAI, store vectors in Qdrant, and query the whole thing through a custom Strapi endpoint. Content create, update, and delete events can be used to keep external search indexes in sync without full manual re-indexing, though in Strapi 5 the recommended approach for most cases is document service middleware rather than lifecycle hooks.
From here, you could extend this with hybrid search, combining keyword and semantic matching, multi-language content support, or a frontend chat interface using the Vercel AI SDK. The core architecture stays the same.
If you're evaluating how Strapi fits into your AI content stack, the integrations page covers how it connects with frontends like Next.js, and the SDK comparison on the Strapi blog can help you choose the right abstraction layer for your frontend and headless CMS workflow.
Run npx create-strapi-app@latest in your terminal and follow our Quick Start Guide to build your first Strapi project.