Powering Fast and Relevant Documentation Search

Documentation is the lifeblood of any technical product or platform. It empowers users, reduces support load, and accelerates adoption. But even the most comprehensive documentation fails if users can't find the information they need quickly and easily. A slow, inaccurate, or frustrating search experience within docs can lead to user abandonment, duplicated support requests, and a general sense of friction.

While basic keyword search seems simple on the surface, delivering a good experience – one that's fast, understands user queries reasonably well, and surfaces relevant results from potentially thousands of pages – involves significant engineering effort. Many teams resort to basic, often inadequate, built-in search functions or face the daunting task of implementing and maintaining a dedicated search engine just for their documentation site.

The Standard Approach: Building and Maintaining a Doc Search Engine

Setting up a robust keyword search system for documentation typically requires navigating these complex steps:

Step 1: Selecting and Deploying Search Infrastructure

You need a dedicated search engine to index and query your documentation content.

  • Technology Choice: Evaluate and choose a search engine like Elasticsearch, OpenSearch, Solr, MeiliSearch, or Typesense. Each has its own operational complexities.
  • Deployment & Configuration: Set up the chosen engine, configure clusters for availability and performance, manage security, and handle version upgrades.
  • Resource Provisioning: Allocate sufficient compute, memory, and storage resources, and plan for scaling as your documentation grows.

The Challenge: Significant operational overhead in setting up, configuring, and maintaining the core search infrastructure. Requires specialized expertise.

Step 2: Data Preparation, Scraping, and Indexing

Your documentation content (often in Markdown, MDX, HTML, etc.) needs to be processed and loaded into the search engine.

  • Content Scraping/Parsing: Build reliable scripts (like the Python example shown later in this post, using MarkdownHeaderTextSplitter) to crawl your documentation source files, parse content, extract relevant sections (headers, code blocks, paragraphs), and handle different formats.
  • Data Structuring: Define a clear schema for your search index (e.g., fields for title, content, url, headers, chunk_id) and chunk large documents appropriately for better relevance (a sample chunk record is sketched below).
  • Indexing Pipeline: Create and maintain a pipeline that automatically detects documentation changes, re-scrapes/re-parses content, and updates the search index efficiently.

The Challenge: Parsing diverse content formats reliably, structuring data effectively for search, and keeping the index constantly up-to-date requires careful engineering and ongoing maintenance.
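
To make the data-structuring step concrete, here is a minimal sketch of what a single indexed chunk might look like, expressed as a Python record. The field names mirror the schema used later in this post; the values are purely illustrative.

# Hypothetical example of one searchable chunk produced by a doc-processing pipeline.
# Field names follow the schema used later in this post; the values are illustrative.
example_chunk = {
    "chunk_id": 42,                 # unique ID for this chunk
    "doc_id": 7,                    # ID of the source document
    "header_1": "Connecting Data",  # top-level heading the chunk falls under
    "header_2": "Datasets",         # nested heading, if any
    "header_3": "",                 # deeper headings left empty when absent
    "content": "Datasets let you upload structured content such as CSV files...",
    "url": "connecting-data/datasets#datasets",
}

# Keeping chunks small (roughly one heading's worth of text) tends to improve
# relevance, because BM25 scores short, focused passages more sharply than whole pages.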

Step 3: Implementing Query Parsing and Relevance Tuning

Making the search relevant goes beyond simple text matching.

  • Query Understanding: Implement logic to handle typos, synonyms, phrase matching, and potentially boolean operators (AND, OR).
  • Relevance Algorithm Tuning: Configure and tune the underlying relevance algorithm (like BM25) by adjusting field weights (e.g., boosting title matches over content matches), configuring text analysis (stemming, stop words), and potentially adding custom ranking logic (a query sketch follows below).
  • Result Snippeting & Highlighting: Generate relevant snippets from matching documents and highlight the user's query terms within the results.

The Challenge: Achieving good search relevance is an iterative process requiring deep understanding of search algorithms and experimentation.
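
For teams that do run their own engine, relevance tuning usually means expressing boosts and analysis settings in that engine's query DSL. As a rough sketch (the field names title, headers, and content are illustrative, not tied to this post's schema), an Elasticsearch/OpenSearch-style query body with field boosting might look like this, written as a Python dict:

# Rough sketch of an Elasticsearch/OpenSearch-style query body with field boosts.
# The field names (title, headers, content) are illustrative.
search_body = {
    "query": {
        "multi_match": {
            "query": "how to connect data",
            # Boost title matches 3x and header matches 2x over body content.
            "fields": ["title^3", "headers^2", "content"],
            # Tolerate small typos in the user's query.
            "fuzziness": "AUTO",
        }
    },
    # Ask the engine for highlighted snippets from the matching content field.
    "highlight": {"fields": {"content": {}}},
}

Getting weights like these right for your corpus is exactly the kind of iterative experimentation this step demands.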

Step 4: Building the Search UI and API Integration

Users need an interface to interact with the search engine.

  • Frontend Component: Develop a search input component for your documentation site (like the React DocSearch component used in the example).
  • Backend API Layer (Optional but common): Often involves creating an intermediary API endpoint on your web server that translates frontend requests into search engine queries and formats the results (a minimal sketch follows below).
  • API Communication: Handle communication between the UI/backend and the search engine API.

The Challenge: Requires frontend and potentially backend development effort to integrate the search functionality smoothly into the documentation site.
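
To make the optional backend layer concrete, here is a minimal Flask sketch of an intermediary endpoint that forwards the user's query to whatever search backend you run and returns trimmed-down results. query_search_engine is a hypothetical helper standing in for your engine's client; this is an illustration, not a production design.

# Minimal sketch of an intermediary documentation-search endpoint (Flask).
# query_search_engine is a hypothetical helper wrapping your search backend.
from flask import Flask, jsonify, request

app = Flask(__name__)

def query_search_engine(text, limit=10):
    # Placeholder: call your search engine (or a managed service) here.
    return []

@app.route("/api/docs-search")
def docs_search():
    query = request.args.get("q", "").strip()
    if not query:
        return jsonify({"results": []})
    results = query_search_engine(query, limit=10)
    # Return only the fields the frontend needs to render results.
    return jsonify({"results": [
        {
            "title": r.get("header_1", ""),
            "url": r.get("url", ""),
            "snippet": r.get("content", "")[:200],
        }
        for r in results
    ]})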

Step 5: Ongoing Monitoring, Maintenance, and Scaling

Search is not a "set it and forget it" system.

  • Performance Monitoring: Track query latency, indexing speed, and resource utilization.
  • Log Analysis: Analyze search logs to understand user queries, identify common failure points (zero results), and inform relevance tuning (a small example follows below).
  • Infrastructure Scaling: Scale the search cluster as documentation volume or query load increases.
  • Software Updates: Keep the search engine and supporting libraries updated.

The Challenge: Continuous operational burden and cost associated with keeping the search system healthy, performant, and relevant.
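
Log analysis is often the highest-leverage part of this step. As a small sketch, assuming you log each search as a JSON line with query and result_count fields (a hypothetical format), surfacing the most common zero-result queries takes only a few lines of Python:

# Sketch: surface the most frequent zero-result queries from a search log.
# Assumes a hypothetical JSON-lines log with "query" and "result_count" fields.
import json
from collections import Counter

zero_result_queries = Counter()

with open("search_queries.log", encoding="utf-8") as log_file:
    for line in log_file:
        event = json.loads(line)
        if event.get("result_count", 0) == 0:
            zero_result_queries[event["query"].lower()] += 1

# The top offenders are good candidates for new docs pages, synonyms, or boosts.
for query, count in zero_result_queries.most_common(10):
    print(f"{count:>5}  {query}")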

The Shaped Approach: Simplified Keyword Search with retrieve

Building and managing a dedicated search engine just for documentation is often overkill and distracts from core product development. Shaped provides a dramatically simpler solution for high-quality keyword search using its purpose-built retrieve endpoint.

Shaped handles the complexities of the underlying search engine infrastructure, indexing, and core relevance, allowing you to focus on getting your content in and querying it easily.

How Shaped Streamlines Documentation Search:

  • Managed Search Engine: Under the hood, Shaped uses Tantivy, a high-performance, Rust-based search library with industry-standard BM25 relevance scoring. You don't need to manage Elasticsearch/Solr clusters.
  • Simple Data Integration: Connect your prepared documentation data (e.g., the CSV generated by your scraping script) to Shaped using datasets.
  • Automated Indexing: Once data is connected to a model, Shaped handles the process of indexing it for efficient keyword retrieval.
  • Powerful retrieve API: A single API endpoint (/models/{model_name}/retrieve) handles keyword queries, relevance scoring (BM25), filtering (if needed, though not the focus here), and returns structured results including metadata.
  • Performance and Scalability: Shaped manages the infrastructure to ensure fast query responses and scales automatically with your needs.
  • No Search Expertise Required: Get state-of-the-art keyword search relevance without needing to become an expert in search engine internals or relevance tuning.

Implementing Documentation Search with Shaped: A Conceptual Example

Let's illustrate how to use Shaped's retrieve endpoint, mirroring the process used for the Docusaurus plugin example.

Goal: Implement fast, relevant keyword search for a documentation site, using pre-processed document chunks.

1. Prepare and Connect Your Data:

This crucial first step involves transforming your raw documentation source files (like Markdown) into a structured format that Shaped can easily index. This typically involves scraping, parsing, and chunking the content.

  • Step A (Offline - Data Preparation Script): Use a script to parse your .md/.mdx files. The goal is to break down large documents into smaller, searchable chunks, often based on headings, and extract relevant metadata like URLs and headers.

Here are snippets illustrating key parts of such a Python script (drawn from the example Docusaurus scraping script):

  • Setting up the splitter: Using a library like langchain_text_splitters helps break down Markdown based on headers.

from langchain_text_splitters import MarkdownHeaderTextSplitter
import os
import csv
import re  # Used for cleaning paths, headers etc.

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

csv_file = 'shaped_docs_index_data.csv'
csv_columns = ['chunk_id', 'doc_id', 'header_1', 'header_2', 'header_3', 'content', 'url']

chunk_id_counter = 0
doc_id_counter = 0

  • Iterating through files and splitting: The script walks through your documentation directory, reads each file, and splits it into chunks.

with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=csv_columns)
    writer.writeheader()

    for root, dirs, files in os.walk('docs'):
        for filename in files:
            if filename.endswith(('.md', '.mdx')):
                file_path = os.path.join(root, filename)
                print(f"Processing file: {file_path}")

                with open(file_path, 'r', encoding='utf-8') as md_file:
                    markdown_document = md_file.read()

                if not markdown_document.strip():
                    continue

                md_header_splits = markdown_splitter.split_text(markdown_document)

                if not md_header_splits:
                    continue

                doc_id_counter += 1

  • Extracting data and writing to CSV: For each chunk, extract the content, associated header metadata, generate a URL, and write it as a row in the CSV file. This structured CSV is what you'll upload to Shaped.

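# Continuing inside the per-file loop from the previous snippet: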
for doc in md_header_splits:
    metadata = doc.metadata
    content = doc.page_content

    relative_path = os.path.relpath(file_path, 'docs').replace('.mdx', '').replace('.md', '')
    header_anchor = ''

    for header_level in ['Header 3', 'Header 2', 'Header 1']:
        if metadata.get(header_level):
            cleaned_header = re.sub(r'[^a-zA-Z0-9\s-]', '', metadata[header_level]).lower()
            header_anchor = f"#{cleaned_header.replace(' ', '-')}"
            break

    url = f"{relative_path}{header_anchor}"

    writer.writerow({
        'chunk_id': chunk_id_counter,
        'doc_id': doc_id_counter,
        'header_1': metadata.get('Header 1', ''),
        'header_2': metadata.get('Header 2', ''),
        'header_3': metadata.get('Header 3', ''),
        'content': content.strip(),
        'url': url
    })

    chunk_id_counter += 1

print(f"CSV preparation complete: {csv_file}")

This script produces a CSV file (e.g., shaped_docs_index_data.csv) where each row represents a searchable chunk of your documentation.

  • Step B (Shaped - Upload Data): Upload the generated CSV file as a Shaped Dataset. This makes the structured content available within the platform.

shaped create-dataset --file shaped_docs_index_data.csv --name docs_content_dataset --id-col chunk_id

2. Define Your Shaped Model (YAML):

For keyword search using retrieve, the model definition primarily tells Shaped which data fields to index and make searchable. No complex ML configuration is needed.


model:
  name: docs_search_engine

connectors:
  - type: Dataset
    name: docs_content
    id: docs_data

fetch:
  items: |
    SELECT
      chunk_id AS item_id,
      header_1,
      header_2,
      header_3,
      content,
      url
    FROM docs_data

3. Create the Model:

This triggers Shaped to index the data specified in the fetch.items query.


shaped create-model --file docs_search_model.yaml

4. Monitor Indexing:

Wait for the model docs_search_engine to become ACTIVE, indicating indexing is complete.


shaped view-model --model-name docs_search_engine
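
If you'd rather script this wait than re-run the command by hand, a simple poll works. This is a rough sketch that assumes the shaped view-model output contains the word ACTIVE once indexing finishes, as described above.

# Rough sketch: poll the Shaped CLI until the model reports ACTIVE.
# Assumes the `shaped view-model` output includes the model status string.
import subprocess
import time

def wait_until_active(model_name, poll_seconds=30, timeout_seconds=3600):
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        output = subprocess.run(
            ["shaped", "view-model", "--model-name", model_name],
            capture_output=True, text=True, check=True,
        ).stdout
        if "ACTIVE" in output:
            print(f"Model '{model_name}' is ACTIVE and ready to query.")
            return
        print("Still indexing... checking again shortly.")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Model '{model_name}' did not become ACTIVE in time.")

wait_until_active("docs_search_engine")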

5. Query via the retrieve API (Application Backend / Frontend Logic):

This is where your documentation site's search component (like the ShapedDocsSearch React component) would call the Shaped API.

  • Step A (Your Code): Get the user's search query (e.g., "how to connect data").
  • Step B (Your Code): Call Shaped's retrieve API endpoint.

Python Example:


from shaped import Shaped

# Assumes API credentials are already configured for the client
# (for example, via environment variables).
shaped_client = Shaped()
model_name = 'docs_search_engine'
user_query = "how to connect data"
search_limit = 10

try:
    response = shaped_client.retrieve(
        model_name=model_name,
        text_query=user_query,
        limit=search_limit,
        return_metadata=True
    )

    if response and response.ids:
        search_results = response.metadata or [{'id': id} for id in response.ids]
        print(f"Found {len(search_results)} results for '{user_query}'")
    else:
        print(f"No results found for '{user_query}'")

except Exception as e:
    print(f"Error calling Shaped retrieve API: {e}")

Node.js Example:


const { Shaped } = require('@shaped/shaped');

const shapedClient = new Shaped();

const modelName = 'docs_search_engine';
const userQuery = "how to connect data";
const searchLimit = 10;

async function getDocumentationSearchResults(query) {
  try {
    const response = await shapedClient.retrieve({
      modelName: modelName,
      textQuery: query,
      limit: searchLimit,
      returnMetadata: true
    });

    if (response && response.ids && response.ids.length > 0) {
      const searchResults = response.metadata || response.ids.map(id => ({ id }));
      console.log(`Found ${searchResults.length} results for '${query}'`);
      return searchResults;
    } else {
      console.log(`No results found for '${query}'`);
      return [];
    }
  } catch (error) {
    console.error(`Error calling Shaped retrieve API: ${error}`);
    return [];
  }
}

  • Step C (Your Code): Process the response.metadata (which contains the url, headers, content, etc. you selected in the YAML) to display the search results in your UI, adding links and potentially highlighting matched terms (similar to the highlightMainWords function in the example React component). A small formatting sketch follows below.
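
As a minimal sketch of that formatting step (your site would more likely do this in the frontend, as the React example does), here is one way to turn the returned metadata into display-ready results with the query terms wrapped in <mark> tags:

# Minimal sketch: turn retrieve() metadata into display-ready results,
# wrapping matched query terms in <mark> tags. Purely illustrative.
import html
import re

def format_results(metadata, query):
    terms = [t for t in query.lower().split() if len(t) > 2]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE) if terms else None

    formatted = []
    for item in metadata:
        snippet = html.escape(item.get("content", "")[:200])
        if pattern:
            snippet = pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", snippet)
        formatted.append({
            "title": item.get("header_1") or item.get("header_2") or "Untitled",
            "url": item.get("url", ""),
            "snippet": snippet,
        })
    return formatted

Each formatted entry can then be rendered as a link plus highlighted snippet in your results dropdown.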

6. Clean Up (Optional):


shaped delete-model --model-name docs_search_engine

Connecting to the Docusaurus Example

The conceptual steps above directly mirror the workflow implied by the example Docusaurus plugin and scraping script:

  1. Data Prep: The Python script processes Markdown into structured chunks (CSV).
  2. Data Connection: The CSV is uploaded as a Shaped Dataset.
  3. Model Definition: A simple Shaped model (docs_search_engine) is defined to point to this dataset and specify searchable fields (content, headers, etc.).
  4. API Call: The ShapedDocsSearch React component makes a fetch request to the Shaped /retrieve API endpoint (${SHAPED_API_BASE_URL}/models/${SHAPED_MODEL_ID}/retrieve) with the text_query.
  5. UI Rendering: The component receives the response.metadata from Shaped, formats it (using functions like cleanedHTMLString and highlightMainWords), and renders the results using the Algolia DocSearch UI components.

Shaped acts as the managed backend search engine, replacing the need to run and manage Elasticsearch/Algolia/etc., while you provide the structured content and the frontend experience.

Conclusion: Stop Wrestling with Search Infrastructure

Providing excellent keyword search for your documentation doesn't have to mean taking on the burden of building and managing a complex search engine. By preparing your content effectively and leveraging Shaped's retrieve endpoint, you get the benefits of a high-performance, relevant keyword search engine (powered by Tantivy and BM25) without the associated operational complexity.

Focus on creating great documentation, let Shaped handle the search indexing and querying, and give your users the fast, relevant search experience they need to succeed.

Ready to upgrade your documentation search?

Request a demo of Shaped today to see how easy it is to implement powerful keyword search. Or, start exploring immediately with our free trial sandbox.


