A write-up on the paper: Lin, J., et al., "Vector Search with OpenAI Embeddings: Lucene Is All You Need," University of Waterloo and Roma Tre University, 2023.
Lucene's Search Engine Origins
Apache Lucene is a high-performance, full-featured text search engine library written in Java. Developed by Doug Cutting in 1999, it has become the backbone of many popular search applications and platforms. Lucene's primary function is to index and search through large volumes of text efficiently, using an inverted index structure.
Key features of Lucene include the following (a minimal indexing-and-search sketch appears after the list):
- Full-text indexing and searching capabilities
- Support for various query types, including phrase queries, wildcard queries, and proximity queries
- Fielded searching (e.g., title, author, contents)
- Sorting by any field
- Multiple-index searching with merged results
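To ground these features, here is a minimal sketch of indexing and searching one text field through Lucene's Java API (assuming Lucene 9.x; the documents, field name, and in-memory directory are illustrative, and a disk-backed index would use FSDirectory instead):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TextSearchSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();  // in-memory for the demo; use FSDirectory for disk
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // Index two toy documents into the inverted index.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      for (String text : new String[] {
          "Lucene builds an inverted index over text",
          "Vector search now lives alongside keyword search"}) {
        Document doc = new Document();
        doc.add(new TextField("contents", text, Field.Store.YES));
        writer.addDocument(doc);
      }
    }

    // Parse a free-text query against the "contents" field and print the hits.
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new QueryParser("contents", analyzer).parse("inverted index");
      for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        System.out.println(hit.score + "  " + searcher.doc(hit.doc).get("contents"));
      }
    }
  }
}
```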
Historically, Lucene was not designed for vector search, but it has evolved to meet the demands of modern search applications. In December 2021, with the Lucene 9.0 release, the library gained support for hierarchical navigable small-world (HNSW) graphs, the dominant algorithm for approximate nearest-neighbor search. This addition has positioned Lucene as a viable alternative to dedicated vector stores, challenging the notion that specialized databases are necessary for AI-powered search applications.
Lucene's HNSW Indexing Explained
Lucene's implementation of Hierarchical Navigable Small World (HNSW) indexing is central to its vector search capabilities, offering a scalable and efficient method for approximate nearest neighbor (ANN) search. HNSW organizes vectors into a graph in which similar vectors are connected, enabling quick traversal to the most relevant neighbors. The structure is particularly effective for high-dimensional data, using a hierarchy of graph layers to balance accuracy and speed.
Key features of Lucene's HNSW indexing include:
- Segment-based storage: Vectors are stored in immutable segments, which are periodically merged. This design ensures efficient memory usage and supports incremental updates without disrupting ongoing searches.
- Disk-based architecture: In keeping with Lucene's principle of keeping data on disk, vector data lives in index files and relies on the operating system's page cache for rapid access, reducing memory overhead while maintaining performance.
- Configurable parameters: Users can fine-tune settings such as the maximum number of connections per node (M) and the beam width used during graph construction to trade indexing cost against search quality for specific workloads.
- Parallel search optimization: Each segment's HNSW graph can be searched independently, so queries can fan out across multiple threads, significantly reducing latency by utilizing available CPU cores (the sketch at the end of this section passes an executor to enable this).
- Pre-joining for parent-child relationships: Recent enhancements allow pre-joining against parent documents during vector searches, ensuring that results are returned at the document level rather than individual passages, improving both relevance and diversity.
This robust implementation positions Lucene as a competitive alternative to dedicated vector databases, blending ANN search with its traditional text-based capabilities in a unified framework.
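These pieces come together in relatively little code. Below is a minimal sketch of indexing and querying dense vectors via Lucene's HNSW support (assuming Lucene 9.5+ class names; earlier 9.x releases use KnnVectorField and KnnVectorQuery, and the four-dimensional toy vectors merely stand in for real embeddings):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class VectorSearchSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();

    // Index a few toy vectors; each becomes a node in the segment's HNSW graph.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      float[][] vectors = {{0.1f, 0.9f, 0.0f, 0.2f}, {0.8f, 0.1f, 0.3f, 0.0f}};
      for (int i = 0; i < vectors.length; i++) {
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("embedding", vectors[i],
            VectorSimilarityFunction.COSINE));
        doc.add(new StoredField("id", i));
        writer.addDocument(doc);
      }
    }

    ExecutorService pool = Executors.newFixedThreadPool(4);
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      // Passing an executor lets Lucene search each segment's graph in parallel.
      IndexSearcher searcher = new IndexSearcher(reader, pool);
      float[] queryVector = {0.2f, 0.8f, 0.1f, 0.1f};
      ScoreDoc[] hits =
          searcher.search(new KnnFloatVectorQuery("embedding", queryVector, 5), 5).scoreDocs;
      for (ScoreDoc hit : hits) {
        System.out.println("doc=" + hit.doc + " score=" + hit.score);
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

Note that the HNSW tuning parameters (M and beam width) are not set here; they are configured through a per-field KnnVectorsFormat on the codec, as sketched later in this write-up.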
MS MARCO Benchmark Results with Lucene
The MS MARCO passage ranking task serves as a crucial benchmark for evaluating vector search performance. Using OpenAI's ada2 embedding model (text-embedding-ada-002) with Lucene for indexing and retrieval, the researchers achieved strong results on this dataset. On the MS MARCO development queries, the system attained a reciprocal rank at 10 (RR@10) of 0.343 and a recall at 1000 of 0.984; a short sketch of how RR@10 is computed appears after the list below. These scores are competitive with other state-of-the-art models, demonstrating Lucene's capability to handle vector search effectively.
- Performance on TREC queries: The system also performed well on queries from the TREC 2019 and 2020 Deep Learning Tracks, with nDCG@10 scores of 0.704 and 0.676 respectively.
- Efficiency: Using 16 threads, the system achieved a query throughput of about 22 queries per second, retrieving 1000 hits per query.
- Index size: The resulting index occupied 51 GB, distributed across 25 index segments.
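To make the headline metric concrete, here is a sketch of how reciprocal rank at a cutoff of 10 is computed for a single query; the reported RR@10 is this value averaged over all evaluation queries (the class and its toy inputs are illustrative, not taken from the paper's code):

```java
import java.util.List;
import java.util.Set;

public class Metrics {
  /** Reciprocal rank @10: 1/rank of the first relevant hit in the top 10, else 0. */
  static double reciprocalRankAt10(List<String> rankedDocIds, Set<String> relevantDocIds) {
    int cutoff = Math.min(10, rankedDocIds.size());
    for (int rank = 1; rank <= cutoff; rank++) {
      if (relevantDocIds.contains(rankedDocIds.get(rank - 1))) {
        return 1.0 / rank;
      }
    }
    return 0.0;  // no relevant passage in the top 10
  }

  public static void main(String[] args) {
    // First relevant hit is at rank 3, so RR@10 = 1/3.
    System.out.println(reciprocalRankAt10(
        List.of("d7", "d2", "d5", "d9"), Set.of("d5", "d9")));
  }
}
```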
These results challenge the notion that specialized vector databases are necessary for high-performance vector search, showcasing Lucene's potential as a versatile solution for both traditional and AI-enhanced search applications.
Explore Lucene's capabilities further on its official GitHub repository: https://github.com/apache/lucene
Cost-Benefit Analysis of Vector Stores vs. Lucene
The cost-benefit analysis of dedicated vector stores versus Lucene for vector search reveals compelling arguments for leveraging existing Lucene-based infrastructure. While vector databases offer specialized performance, the benefits may not outweigh the costs of increased architectural complexity for many organizations. Lucene's recent advancements in vector search capabilities, including HNSW indexing, have significantly narrowed the performance gap.
Key considerations in this analysis include:
- Infrastructure investment: Many enterprises have already invested heavily in Lucene-based search solutions, making it more cost-effective to extend these systems rather than introduce new components.
- Performance trade-offs: While dedicated vector stores may offer faster indexing and query times, Lucene's performance is rapidly improving and may be sufficient for many use cases.
- Scalability: Lucene's segment-based architecture allows for efficient management of large datasets, potentially exceeding available RAM size while maintaining performance.
- Integration: Lucene integrates vector search with traditional text-based search in a single system, enabling hybrid approaches without additional architectural complexity (see the sketch after this list).
- Future-proofing: Ongoing developments in the Lucene ecosystem, such as scalar quantization for memory optimization, suggest continued improvements in vector search capabilities.
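As an illustration of that hybrid integration, a BM25 keyword clause and a kNN vector clause can be combined in a single BooleanQuery (a sketch with illustrative field names; more sophisticated score fusion, such as reciprocal rank fusion, would sit on top of this):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class HybridQuerySketch {
  /** Builds a query whose score blends BM25 keyword matching with vector similarity. */
  static Query hybrid(String keyword, float[] queryEmbedding) {
    return new BooleanQuery.Builder()
        .add(new TermQuery(new Term("contents", keyword)), BooleanClause.Occur.SHOULD)
        .add(new KnnFloatVectorQuery("embedding", queryEmbedding, 100),
            BooleanClause.Occur.SHOULD)
        .build();
  }
}
```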
Ultimately, the decision between a dedicated vector store and Lucene depends on specific organizational needs, but for many, Lucene may indeed be all they need for effective vector search implementation.
Lucene’s Potential and Future Evolution
While the study presents a compelling case for Lucene, it also acknowledges some current limitations:
1. Performance: Lucene still lags behind some alternatives in terms of indexing speed and query latency.
2. Implementation Quirks: The study highlights practical hurdles encountered while using Lucene's vector search features. Most notably, Lucene at the time capped vectors at 1024 dimensions, which posed a problem for OpenAI's 1536-dimensional embeddings. To work around this, the researchers modified the dimension limit in Lucene's source, built a custom version of the library, and implemented a custom codec to index the higher-dimensional vectors. These admittedly "janky" workarounds underscore both the ingenuity required and the gaps, at the time of the study, in Lucene's support for high-dimensional vector search. (A codec-level alternative available in newer Lucene releases is sketched after the next paragraph.)
However, the authors express optimism about Lucene's future, citing ongoing developments and investments in the ecosystem that are likely to address these issues.
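One such development has in fact already landed: as of roughly Lucene 9.8, the vector dimension cap is enforced per codec rather than hard-coded, so the source-patching described above can be replaced by a thin KnnVectorsFormat wrapper. Below is a sketch of that approach, assuming Lucene 9.9 class names (format class names vary across 9.x releases, and this is not the paper's code):

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public class HighDimCodecSketch {
  /** Delegates all I/O to the stock HNSW format but reports a higher dimension cap. */
  static KnnVectorsFormat highDimFormat(int maxConn, int beamWidth, int maxDims) {
    KnnVectorsFormat delegate = new Lucene99HnswVectorsFormat(maxConn, beamWidth);
    return new KnnVectorsFormat(delegate.getName()) {
      @Override public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state);
      }
      @Override public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state);
      }
      @Override public int getMaxDimensions(String fieldName) {
        return maxDims;  // e.g., 2048 comfortably fits ada2's 1536 dimensions
      }
    };
  }

  /** Writer config whose codec accepts 1536-dimensional vectors on every field. */
  static IndexWriterConfig configFor1536Dims() {
    IndexWriterConfig config = new IndexWriterConfig();
    config.setCodec(new Lucene99Codec() {
      @Override public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return highDimFormat(16, 100, 2048);  // maxConn (M) = 16, beamWidth = 100
      }
    });
    return config;
  }
}
```

This is also where the HNSW parameters from the earlier section are set: maxConn (M) and beamWidth are passed to the per-field vectors format rather than to the query or field classes.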
A Paradigm Shift in Vector Search
The study's assertion that "Lucene is all you need" for vector search with OpenAI embeddings represents a potential paradigm shift in how we approach the implementation of advanced search capabilities. By demonstrating that existing, widely deployed infrastructure can be adapted to meet the demands of modern AI-driven search, this research opens up new possibilities for organizations looking to stay at the cutting edge without completely overhauling their tech stacks.
As the field of AI and search continues to evolve, this work serves as a reminder that sometimes, the most revolutionary solutions come not from building entirely new systems, but from cleverly leveraging and extending the tools we already have at our disposal.