Hybrid Search is Enriching the Context of Search Queries
To improve the search experience, hybrid search combines keyword-based search strategies with vector search methods.
Weaviate, an open-source vector database, implements this approach, using sparse and dense vectors to enrich the context of search queries and documents.
Hybrid search brings together the advantages of multiple search paradigms. It harnesses the power of distinct algorithms such as BM25 and SPLADE, used to compute sparse vectors, and machine learning models like GloVe and Transformers, utilized for dense embeddings.
Weaviate's hybrid search relies mainly on two components:
- BM25/BM25F
- Vector search.
The first of these, BM25, is a ranking algorithm that refines the keyword scoring mechanism of TF-IDF (Term Frequency-Inverse Document Frequency).
- It does so by adding a normalization penalty that weighs a document’s length against the average document length in the database.
- It also exposes tunable parameters (commonly called k1 and b) that can be adjusted for a specific dataset to improve ranking quality.
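To make the two points above concrete, here is a minimal sketch of BM25 scoring in plain Python. The function name `bm25_score`, the toy corpus, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration (they are commonly used values), and this is one common BM25 variant rather than Weaviate's actual implementation.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with a common BM25 variant."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    # Length penalty: longer-than-average documents get a larger denominator.
    length_norm = 1 - b + b * len(doc) / avg_len
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score


# Hypothetical toy corpus: a short and a long document both mention the query terms once.
corpus = [
    "python script debug tips".split(),
    "python is a popular language used for many kinds of script and tool".split(),
    "cats and dogs".split(),
]
scores = [bm25_score(["python", "script"], d, corpus) for d in corpus]
print(scores)
```

Both of the first two documents contain each query term once, yet the shorter one scores higher because of the length penalty, and the unrelated document scores zero.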
Moreover, Weaviate employs BM25F, a BM25 variant that allows differential weighting of text fields within an object during ranking calculations. This adaptive weighting proves particularly useful when certain fields, like a title, may hold more value than others, such as an abstract.
BM25F offers more flexibility and customization than its progenitor, elevating its potential for user-oriented search experiences.
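The field weighting can be sketched as follows. This is a simplified illustration, assuming the field weights are applied to term frequencies before the BM25 saturation step and leaving out per-field length normalization; the function name `bm25f_term_score`, the weights, and the fixed `idf` value are hypothetical and not taken from Weaviate.

```python
def bm25f_term_score(term, doc_fields, field_weights, idf=1.0, k1=1.2):
    """Simplified BM25F contribution of one term for one object.

    doc_fields maps a field name to its list of tokens;
    field_weights maps a field name to its boost (default 1.0).
    """
    # Combine the per-field term frequencies into one weighted frequency.
    weighted_tf = sum(
        field_weights.get(name, 1.0) * tokens.count(term)
        for name, tokens in doc_fields.items()
    )
    # Apply the usual BM25 saturation to the combined frequency.
    return idf * weighted_tf * (k1 + 1) / (weighted_tf + k1)


weights = {"title": 3.0, "abstract": 1.0}  # assumed boosts: title counts three times as much

title_hit = bm25f_term_score(
    "hybrid",
    {"title": ["hybrid", "search"], "abstract": ["an", "overview"]},
    weights,
)
abstract_hit = bm25f_term_score(
    "hybrid",
    {"title": ["search", "basics"], "abstract": ["hybrid", "methods"]},
    weights,
)
print(title_hit, abstract_hit)
```

With these assumed weights, an object that mentions the term in its title outranks one that mentions it only in the abstract.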
Weaviate’s Vectorstore represents data as dense vectors: arrays of mostly non-zero values produced by machine learning models. These vectors serve as condensed representations of a variety of data types, including text and images.
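With LangChain, a hybrid search retriever backed by Weaviate can be set up and queried as follows:

import os

import weaviate
from langchain.retrievers.weaviate_hybrid_search import WeaviateHybridSearchRetriever
from langchain.schema import Document

# Connection details are read from environment variables.
WEAVIATE_URL = os.getenv("WEAVIATE_URL")
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthApiKey(api_key=os.getenv("WEAVIATE_API_KEY")),
    additional_headers={
        "X-Openai-Api-Key": os.getenv("OPENAI_API_KEY"),
    },
)

retriever = WeaviateHybridSearchRetriever(
    client=client, index_name="LangChain", text_key="text"
)

# Placeholder document for illustration only; replace with your own data.
docs = [
    Document(
        page_content="A placeholder article on the ethical implications of AI.",
        metadata={"author": "Prof. Jonathan K. Sterling"},
    ),
]
retriever.add_documents(docs)

# Plain hybrid search.
retriever.get_relevant_documents("the ethical implications of AI")

# Hybrid search restricted to a single author with a where filter.
retriever.get_relevant_documents(
    "AI integration in society",
    where_filter={
        "path": ["author"],
        "operator": "Equal",
        "valueString": "Prof. Jonathan K. Sterling",
    },
)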
By calculating the distances between dense vectors, Weaviate can quantify similarities and differences in the underlying data.
Weaviate maintains the semantic value of data by representing each object as a vector, or point, within a multi-dimensional space. To make this concrete, consider that in such a system the vector for bananas would sit near the vector for apples, not the one for cats, reflecting the attributes the two fruits share.
To conduct a search in this high-dimensional space, your query is transformed into a vector analogous to the vectors of your data. By computing the similarities between your query vector and the existing data points, the vector database swiftly identifies relevant matches.
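The idea can be sketched with a brute-force search over a handful of vectors. The three-dimensional vectors below are hand-made for illustration (real embeddings come from a model and have hundreds of dimensions), cosine similarity is assumed as the measure, and the helper names are hypothetical. A vector database would typically use an approximate nearest neighbor index instead of scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vector, vectors, top_k=2):
    """Return the names of the top_k stored vectors most similar to the query."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hand-made toy vectors: the two fruits point in a similar direction, the cat does not.
vectors = {
    "banana": [0.9, 0.8, 0.1],
    "apple": [0.8, 0.9, 0.2],
    "cat": [0.1, 0.2, 0.9],
}

query_vector = [0.85, 0.85, 0.1]  # stands in for an embedded fruit-related query
results = nearest(query_vector, vectors)
print(results)
```

With these toy values, the query lands next to the two fruit vectors and far from the cat vector.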
The standout feature of vector databases lies in their impressive speed. Even when contending with massive datasets of tens to hundreds of millions of objects, they are capable of responding to queries in fractions of a second. This potent combination of nuanced understanding and rapid response time positions vector databases at the frontier of data management technology.
Hybrid search presents an innovative methodology that leverages both dense and sparse vectors, integrating the advantages of contextual understanding and keyword matching.
For instance, take a query like “How to Debug a Python Script Conundrum”. Dense vector search interprets “debug” as finding the root cause of a problem, while sparse vector search focuses on the exact terms “Python Script”. Hybrid search blends the strengths of both.
The results from the different search methods, such as BM25 and dense vector search, are then combined into one ranked list using Reciprocal Rank Fusion (RRF). Inspired by the work of Benham and Culpepper, RRF sums each document's reciprocal ranks across the result lists, effectively penalizing lower-ranked documents.
To illustrate, suppose three documents, A, B, and C, are each ranked by both a BM25 search and a dense vector search. RRF consolidates the two rankings into one: document B might emerge as the leader with a score of 1.5, followed by A at 1.3 and C at 0.83.
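The text does not give the underlying rankings, so the sketch below assumes one pair that reproduces these scores: BM25 ranks the documents A, B, C and the dense search ranks them B, C, A. It also uses the plain 1/rank form; many RRF formulations add a constant to the rank in the denominator, which is omitted here to match the numbers above. The function name is hypothetical.

```python
def reciprocal_rank_fusion(rankings):
    """Sum 1/rank for every document across all rankings, best score first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / rank
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Assumed rankings (not given in the text), best result first.
bm25_ranking = ["A", "B", "C"]
dense_ranking = ["B", "C", "A"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc, score in fused:
    print(doc, round(score, 2))
```

Here B scores 1/2 + 1/1 = 1.5, A scores 1/1 + 1/3 ≈ 1.33, and C scores 1/3 + 1/2 ≈ 0.83.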
Hybrid search also relies on a re-ranking step guided by an alpha parameter. Alpha sets the weight assigned to each algorithm and so steers the re-ranking of search results: in Weaviate, an alpha of 0 gives a pure keyword search and an alpha of 1 a pure vector search.
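One way to picture the effect of alpha is a weighted variant of rank fusion. The sketch below is an assumption for illustration, not Weaviate's exact scoring: it weights each reciprocal rank by `alpha` for the dense list and by `1 - alpha` for the sparse list, so that 0 means keyword-only and 1 means vector-only. The function name and the rankings are hypothetical.

```python
def weighted_fusion(sparse_ranking, dense_ranking, alpha=0.5):
    """Fuse two rankings; alpha weights the dense list, 1 - alpha the sparse list."""
    scores = {}
    for rank, doc in enumerate(sparse_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / rank
    for rank, doc in enumerate(dense_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + alpha / rank
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked]


sparse_ranking = ["A", "B", "C"]  # assumed BM25 order
dense_ranking = ["B", "C", "A"]   # assumed dense vector order

keyword_only = weighted_fusion(sparse_ranking, dense_ranking, alpha=0.0)
vector_only = weighted_fusion(sparse_ranking, dense_ranking, alpha=1.0)
balanced = weighted_fusion(sparse_ranking, dense_ranking, alpha=0.5)
print(keyword_only, vector_only, balanced)
```

At the two extremes the fused list simply reproduces one of the input rankings, and values in between trade one off against the other.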
Operating hybrid search with Vectorstore is designed to be simple and straightforward.
It is controlled by five core parameters:
- A marker for hybrid search use (‘hybrid’)
- The search query (‘query’)
- An optional parameter for algorithm weighting (‘alpha’)
- An optional custom vector (‘vector’)
- Supplementary data about the result (‘score’)
With these, users can tailor the search to their specific needs.
Running a hybrid search query takes only a few lines of code. The sample GraphQL query below searches for “Python whisperer who artfully untangles intricate code knots” with ‘alpha’ set to 0.5, which weights sparse and dense vector results equally:
{
  Get {
    Article (
      hybrid: {
        query: "Python whisperer who artfully untangles intricate code knots"
        alpha: 0.5
      })
    {
      title
      summary
      _additional {score}
    }
  }
}
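For readers who prefer to assemble such a query in code, here is a small sketch that builds the same GraphQL text from Python values and wraps it in a JSON request body. The helper name `build_hybrid_query` is hypothetical, and the sketch only constructs the request; it does not contact a server.

```python
import json


def build_hybrid_query(class_name, query, alpha, properties):
    """Return the text of a GraphQL hybrid search query."""
    fields = "\n      ".join(properties)
    return (
        "{\n"
        "  Get {\n"
        f"    {class_name}(\n"
        # json.dumps quotes and escapes the search string safely.
        f"      hybrid: {{query: {json.dumps(query)}, alpha: {alpha}}}\n"
        "    ) {\n"
        f"      {fields}\n"
        "      _additional {score}\n"
        "    }\n"
        "  }\n"
        "}"
    )


graphql = build_hybrid_query(
    "Article",
    "Python whisperer who artfully untangles intricate code knots",
    0.5,
    ["title", "summary"],
)
payload = json.dumps({"query": graphql})  # JSON body for an HTTP POST
print(graphql)
```

The resulting payload can be sent as a POST request to the `/v1/graphql` endpoint of a Weaviate instance, using whatever HTTP client and authentication your setup requires.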
Moreover, Weaviate extends its commitment to user-centric design by providing detailed documentation for those eager to delve deeper into the intricacies of hybrid search.
As you take your first steps with a vectorstore of your choice, remember that Weaviate is open-source, so you are free to explore and adapt it to your needs.
Start building remarkable applications today and embark on an exhilarating tech adventure.