While basic keyword search seems simple on the surface, delivering a good experience – one that's fast, understands user queries reasonably well, and surfaces relevant results from potentially thousands of pages – involves significant engineering effort. Many teams resort to basic, often inadequate, built-in search functions or face the daunting task of implementing and maintaining a dedicated search engine just for their documentation site.
The Standard Approach: Building and Maintaining a Doc Search Engine
Setting up a robust keyword search system for documentation typically requires navigating these complex steps:
Step 1: Selecting and Deploying Search Infrastructure
You need a dedicated search engine to index and query your documentation content.
- Technology Choice: Evaluate and choose a search engine like Elasticsearch, OpenSearch, Solr, MeiliSearch, or Typesense. Each has its own operational complexities.
- Deployment & Configuration: Set up the chosen engine, configure clusters for availability and performance, manage security, and handle version upgrades.
- Resource Provisioning: Allocate sufficient compute, memory, and storage resources, and plan for scaling as your documentation grows.
The Challenge: Significant operational overhead in setting up, configuring, and maintaining the core search infrastructure. Requires specialized expertise.
Step 2: Data Preparation, Scraping, and Indexing
Your documentation content (often in Markdown, MDX, HTML, etc.) needs to be processed and loaded into the search engine.
- Content Scraping/Parsing: Build reliable scripts (like the Python example provided, which uses MarkdownHeaderTextSplitter) to crawl your documentation source files, parse content, extract relevant sections (headers, code blocks, paragraphs), and handle different formats.
- Data Structuring: Define a clear schema for your search index (e.g., fields for title, content, url, headers, chunk_id). Chunk large documents appropriately for better relevance.
- Indexing Pipeline: Create and maintain a pipeline that automatically detects documentation changes, re-scrapes/re-parses content, and updates the search index efficiently.
The Challenge: Parsing diverse content formats reliably, structuring data effectively for search, and keeping the index constantly up-to-date requires careful engineering and ongoing maintenance.
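To make the "detects documentation changes" part of the indexing pipeline concrete, here is a minimal sketch that compares content hashes against a stored manifest. The function names and the manifest shape (relative path mapped to SHA-256 digest) are assumptions for illustration, not part of any particular search engine's tooling.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_changes(
    docs_dir: Path, previous: dict[str, str]
) -> tuple[dict[str, str], list[str], list[str]]:
    """Compare the docs on disk with a previous {relative_path: digest} manifest.

    Returns (new_manifest, changed_or_added_paths, removed_paths).
    Only .md and .mdx files are considered (an assumption for this sketch).
    """
    current = {
        p.relative_to(docs_dir).as_posix(): file_digest(p)
        for p in sorted(docs_dir.rglob("*"))
        if p.is_file() and p.suffix in {".md", ".mdx"}
    }
    changed = [path for path, digest in current.items() if previous.get(path) != digest]
    removed = [path for path in previous if path not in current]
    return current, changed, removed
```

A pipeline built this way would re-parse and re-index only the changed paths, delete chunks belonging to the removed paths, and persist the new manifest for the next run.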
Step 3: Implementing Query Parsing and Relevance Tuning
Making the search relevant goes beyond simple text matching.
- Query Understanding: Implement logic to handle typos, synonyms, phrase matching, and potentially boolean operators (AND, OR).
- Relevance Algorithm Tuning: Configure and tune the underlying relevance algorithm (like BM25) by adjusting field weights (e.g., boosting title matches over content matches), configuring text analysis (stemming, stop words), and potentially adding custom ranking logic.
- Result Snippeting & Highlighting: Generate relevant snippets from matching documents and highlight the user's query terms within the results.
The Challenge: Achieving good search relevance is an iterative process requiring deep understanding of search algorithms and experimentation.
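As a rough illustration of what tuning BM25 with field weights involves, the sketch below scores titles and content separately and combines them with a title boost. It uses the Lucene-style IDF variant and a simple weighted sum of per-field scores; the tokenizer, the default parameters, and the boost value are assumptions for the example, and real engines add stemming, stop words, and far more efficient index structures.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, texts: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Return one BM25 score per text for the given query."""
    docs = [tokenize(t) for t in texts]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))  # count each term once per document
    query_terms = tokenize(query)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def rank(query: str, docs: list[dict], title_boost: float = 2.0) -> list[int]:
    """Return document indices ordered by boosted title score plus content score."""
    title_scores = bm25_scores(query, [d["title"] for d in docs])
    content_scores = bm25_scores(query, [d["content"] for d in docs])
    combined = [title_boost * t + c for t, c in zip(title_scores, content_scores)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
```

Changing title_boost, k1, or b shifts the ordering, which is why relevance tuning usually means repeated experiments against real queries.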
Step 4: Building the Search UI and API Integration
Users need an interface to interact with the search engine.
- Frontend Component: Develop a search input component for your documentation site (like the React DocSearch component used in the example).
- Backend API Layer (Optional but common): Often involves creating an intermediary API endpoint on your web server that translates frontend requests into search engine queries and formats the results.
- API Communication: Handle communication between the UI/backend and the search engine API.
The Challenge: Requires frontend and potentially backend development effort to integrate the search functionality smoothly into the documentation site.
Step 5: Ongoing Monitoring, Maintenance, and Scaling
Search is not a "set it and forget it" system.
- Performance Monitoring: Track query latency, indexing speed, and resource utilization.
- Log Analysis: Analyze search logs to understand user queries, identify common failure points (zero results), and inform relevance tuning.
- Infrastructure Scaling: Scale the search cluster as documentation volume or query load increases.
- Software Updates: Keep the search engine and supporting libraries updated.
The Challenge: Continuous operational burden and cost associated with keeping the search system healthy, performant, and relevant.
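The log analysis step can start very small. The sketch below computes the zero-result rate and the most frequent failing queries; the log format (an iterable of query and result-count pairs) is an assumption, since every search stack records this differently.

```python
from collections import Counter
from typing import Iterable


def zero_result_report(
    log_entries: Iterable[tuple[str, int]], top_n: int = 5
) -> tuple[float, list[tuple[str, int]]]:
    """Return (zero_result_rate, most_common_failing_queries).

    Queries are normalized (trimmed, lowercased) so near-duplicates are grouped.
    """
    total = 0
    misses = Counter()
    for query, result_count in log_entries:
        total += 1
        if result_count == 0:
            misses[query.strip().lower()] += 1
    rate = sum(misses.values()) / total if total else 0.0
    return rate, misses.most_common(top_n)
```

Queries that repeatedly return nothing are good candidates for new synonyms or for missing documentation pages.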
The Shaped Approach: Simplified Keyword Search with retrieve
Building and managing a dedicated search engine just for documentation is often overkill and distracts from core product development. Shaped provides a dramatically simpler solution for high-quality keyword search using its purpose-built retrieve endpoint.
Shaped handles the complexities of the underlying search engine infrastructure, indexing, and core relevance, allowing you to focus on getting your content in and querying it easily.
How Shaped Streamlines Documentation Search:
- Managed Search Engine: Under the hood, Shaped uses Tantivy, a high-performance search library written in Rust that ranks results with the industry-standard BM25 relevance algorithm. You don't need to manage Elasticsearch/Solr clusters.
- Simple Data Integration: Connect your prepared documentation data (e.g., the CSV generated by your scraping script) to Shaped using datasets.
- Automated Indexing: Once data is connected to a model, Shaped handles the process of indexing it for efficient keyword retrieval.
- Powerful retrieve API: A single API endpoint (/models/{model_name}/retrieve) handles keyword queries, relevance scoring (BM25), and filtering (if needed, though not the focus here), and returns structured results including metadata.
- Performance and Scalability: Shaped manages the infrastructure to ensure fast query responses and scales automatically with your needs.
- No Search Expertise Required: Get state-of-the-art keyword search relevance without needing to become an expert in search engine internals or relevance tuning.
Implementing Documentation Search with Shaped: A Conceptual Example
Let's illustrate how to use Shaped's retrieve endpoint, mirroring the process used for the Docusaurus plugin example.
Goal: Implement fast, relevant keyword search for a documentation site, using pre-processed document chunks.
1. Prepare and Connect Your Data:
This crucial first step involves transforming your raw documentation source files (like Markdown) into a structured format that Shaped can easily index. This typically involves scraping, parsing, and chunking the content.
- Step A (Offline - Data Preparation Script): Use a script to parse your .md/.mdx files. The goal is to break down large documents into smaller, searchable chunks, often based on headings, and extract relevant metadata like URLs and headers.
The key parts of such a Python script (based on the example provided) are:
- Setting up the splitter: Using a library like langchain_text_splitters helps break down Markdown based on headers.
- Iterating through files and splitting: The script walks through your documentation directory, reads each file, and splits it into chunks.
- Extracting data and writing to CSV: For each chunk, extract the content, associated header metadata, generate a URL, and write it as a row in the CSV file. This structured CSV is what you'll upload to Shaped.
This script produces a CSV file (e.g., shaped_docs_index_data.csv) where each row represents a searchable chunk of your documentation.
- Step B (Shaped - Upload Data): Upload the generated CSV file as a Shaped Dataset. This makes the structured content available within the platform.
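The data preparation script from Step A can be sketched as follows. This is a standard-library stand-in for a header-based splitter such as MarkdownHeaderTextSplitter, not the original script: the column names follow the schema mentioned earlier (chunk_id, url, headers, content), while the URL scheme, the chunk_id format, and the choice to split on the first three heading levels are assumptions.

```python
import csv
import io
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")  # split on h1-h3 only
FENCE = "`" * 3  # Markdown code fence marker


def split_by_headers(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (header_path, content) chunks, one per section."""
    chunks: list[tuple[str, str]] = []
    path: dict[int, str] = {}  # heading level -> heading text
    lines: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            chunks.append((" > ".join(path[k] for k in sorted(path)), body))
        lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code blocks
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            for k in [k for k in path if k >= level]:
                del path[k]  # leave the previous section at this depth
            path[level] = match.group(2)
        else:
            lines.append(line)
    flush()
    return chunks


def chunks_to_csv(doc_path: str, markdown: str, base_url: str) -> str:
    """Render one document's chunks as CSV text with a header row."""
    url = base_url.rstrip("/") + "/" + re.sub(r"\.mdx?$", "", doc_path)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["chunk_id", "url", "headers", "content"])
    writer.writeheader()
    for i, (headers, content) in enumerate(split_by_headers(markdown)):
        writer.writerow(
            {"chunk_id": f"{doc_path}#{i}", "url": url, "headers": headers, "content": content}
        )
    return out.getvalue()
```

In a full script you would walk the documentation directory, call chunks_to_csv (or write rows to a single shared writer) for each file, and save the result as the CSV you upload.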
2. Define Your Shaped Model (YAML):
For keyword search using retrieve, the model definition primarily tells Shaped which data fields to index and make searchable. No complex ML configuration is needed.
3. Create the Model:
This triggers Shaped to index the data specified in the fetch.items query.
4. Monitor Indexing:
Wait for the model docs_search_engine to become ACTIVE, indicating indexing is complete.
5. Query via the retrieve API (Application Backend / Frontend Logic):
This is where your documentation site's search component (like the ShapedDocsSearch React component) would call the Shaped API.
- Step A (Your Code): Get the user's search query (e.g., "how to connect data").
- Step B (Your Code): Call Shaped's retrieve API endpoint.
Python Example:
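A minimal sketch of this call using only the standard library. The endpoint path (/models/{model_name}/retrieve) and the text_query field come from the description above; the POST method, the JSON body shape, the x-api-key header name, and the base URL you pass in are assumptions to verify against the Shaped API reference.

```python
import json
import urllib.parse
import urllib.request


def build_retrieve_request(
    base_url: str, model_name: str, api_key: str, text_query: str
) -> urllib.request.Request:
    """Build (but do not send) a request to the retrieve endpoint."""
    model = urllib.parse.quote(model_name, safe="")
    url = f"{base_url.rstrip('/')}/models/{model}/retrieve"
    body = json.dumps({"text_query": text_query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",  # assumption: retrieve accepts a JSON POST
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumption: API key header name
        },
    )


def retrieve(
    base_url: str, model_name: str, api_key: str, text_query: str, timeout: float = 10.0
) -> dict:
    """Send the query and return the decoded JSON response."""
    request = build_retrieve_request(base_url, model_name, api_key, text_query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the call easy to unit test without network access; in production, run it from your backend so the API key never reaches the browser.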
Node.js Example:
- Step C (Your Code): Process the response.metadata (which contains the url, headers, content, etc. you selected in the YAML) to display the search results in your UI, adding links and potentially highlighting matched terms (similar to the highlightMainWords function in your React component).
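Step C can be sketched like this. It assumes response.metadata is a list of objects with url, headers, and content keys, matching the fields described above; the function names, the snippet length, and the mark tags are illustrative choices, and this is not the highlightMainWords implementation from the React component.

```python
import html
import re


def highlight(text: str, query: str, open_tag: str = "<mark>", close_tag: str = "</mark>") -> str:
    """HTML-escape text and wrap whole-word matches of query terms in tags."""
    terms = sorted(
        {t for t in re.findall(r"\w+", query.lower()) if len(t) > 2},
        key=len,
        reverse=True,
    )
    if not terms:
        return html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    # With one capturing group, split() alternates plain text and matches,
    # so each piece can be escaped before any tags are added.
    parts = pattern.split(text)
    return "".join(
        f"{open_tag}{html.escape(p)}{close_tag}" if i % 2 else html.escape(p)
        for i, p in enumerate(parts)
    )


def format_results(response: dict, query: str, snippet_chars: int = 160) -> list[dict]:
    """Turn the metadata entries into display-ready search results."""
    results = []
    for item in response.get("metadata", []):
        results.append(
            {
                "url": item.get("url", ""),
                "title": item.get("headers", ""),
                "snippet": highlight(item.get("content", "")[:snippet_chars], query),
            }
        )
    return results
```

Escaping each piece before wrapping it keeps markup in the documentation content from being interpreted as HTML in the results list.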
6. Clean Up (Optional):
Connecting to the Docusaurus Example
The conceptual steps above directly mirror the workflow implied by your Docusaurus plugin and scraping script:
- Data Prep: The Python script processes Markdown into structured chunks (CSV).
- Data Connection: The CSV is uploaded as a Shaped Dataset.
- Model Definition: A simple Shaped model (docs_search_engine) is defined to point to this dataset and specify searchable fields (content, headers, etc.).
- API Call: The ShapedDocsSearch React component makes a fetch request to the Shaped /retrieve API endpoint (${SHAPED_API_BASE_URL}/models/${SHAPED_MODEL_ID}/retrieve) with the text_query.
- UI Rendering: The component receives the response.metadata from Shaped, formats it (using functions like cleanedHTMLString and highlightMainWords), and renders the results using the Algolia DocSearch UI components.
Shaped acts as the managed backend search engine, replacing the need to run and manage Elasticsearch/Algolia/etc., while you provide the structured content and the frontend experience.
Conclusion: Stop Wrestling with Search Infrastructure
Providing excellent keyword search for your documentation doesn't have to mean taking on the burden of building and managing a complex search engine. By preparing your content effectively and leveraging Shaped's retrieve endpoint, you get the benefits of a high-performance, relevant keyword search engine (powered by Tantivy and BM25) without the associated operational complexity.
Focus on creating great documentation, let Shaped handle the search indexing and querying, and give your users the fast, relevant search experience they need to succeed.
Ready to upgrade your documentation search?
Request a demo of Shaped today to see how easy it is to implement powerful keyword search. Or, start exploring immediately with our free trial sandbox.