Tutorial: Creating a Hybrid Retrieval Pipeline


This tutorial uses Haystack 2.0. To learn more, read the Haystack 2.0 announcement or visit the Haystack 2.0 Documentation.

Overview

Hybrid Retrieval combines keyword-based and embedding-based retrieval techniques, leveraging the strengths of both approaches. In essence, dense embeddings are good at capturing the contextual meaning of a query, while keyword-based methods are good at matching exact terms.

There are many cases where a simple keyword-based approach like BM25 performs better than dense retrieval (for example, in a specific domain like healthcare), because a dense model needs to be trained on data from that domain. For more details about Hybrid Retrieval, check out Blog Post: Hybrid Document Retrieval.
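To make the keyword side of this trade-off concrete, here is a minimal sketch of BM25 scoring in plain Python. It is an illustration under stated assumptions, not the code Haystack runs: the function name bm25_scores, the whitespace tokenizer, the toy documents, and the parameter values k1=1.5 and b=0.75 are all chosen for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25 formula.

    Hypothetical helper for illustration; tokenization is a plain whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer documents need more matches to score as high.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "sleep apnea in premature infants",
    "treatment of asthma in adults",
    "breathing pauses observed in newborn babies",
]
scores = bm25_scores("apnea infants", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(toy_docs[best])
```

Only the first toy document shares terms with the query, so it is the only one with a non-zero score. The third document describes the same topic in different words and still scores zero, which is the kind of gap dense retrieval is meant to cover.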

Preparing the Colab Environment

Installing Haystack

Install Haystack 2.0 and other required packages with pip:

%%bash

pip install haystack-ai
pip install "datasets>=2.6.1"
pip install "sentence-transformers>=3.0.0"
pip install accelerate

Enabling Telemetry

Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.

from haystack.telemetry import tutorial_running

tutorial_running(33)

Initializing the DocumentStore

You’ll start creating your question answering system by initializing a DocumentStore. A DocumentStore stores the Documents that your system uses to find answers to your questions. In this tutorial, you’ll be using the InMemoryDocumentStore.

from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

InMemoryDocumentStore is the simplest DocumentStore to get started with. It requires no external dependencies and it’s a good option for smaller projects and debugging. But it doesn’t scale up so well to larger Document collections, so it’s not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see DocumentStore Integrations.

Fetching and Processing Documents

As Documents, you will use the PubMed Abstracts. There are a lot of datasets from PubMed on Hugging Face Hub; you will use anakin87/medrag-pubmed-chunk in this tutorial.

Then, you will create Documents from the dataset with a simple for loop. Each data point in the PubMed dataset has 4 features:

  • id: the PubMed ID (stored below as pmid)
  • title
  • content: the abstract
  • contents: abstract + title

For searching, you will use the contents feature. The other features will be stored as metadata, and you will use them to have a pretty print of the search results or for metadata filtering.

from datasets import load_dataset
from haystack import Document

dataset = load_dataset("anakin87/medrag-pubmed-chunk", split="train")

docs = []
for doc in dataset:
    docs.append(
        Document(content=doc["contents"], meta={"title": doc["title"], "abstract": doc["content"], "pmid": doc["id"]})
    )

Indexing Documents with a Pipeline

Create a pipeline that stores the documents in the document store together with their embeddings. For this pipeline, you need a DocumentSplitter to split documents into chunks of 512 words, a SentenceTransformersDocumentEmbedder to create document embeddings for dense retrieval, and a DocumentWriter to write the documents to the document store.

As an embedding model, you will use BAAI/bge-small-en-v1.5 on Hugging Face. Feel free to test other models on Hugging Face or use another Embedder to switch the model provider.

If this step takes too long for you, replace the embedding model with a smaller one such as sentence-transformers/all-MiniLM-L6-v2. Whichever model you choose, make sure that split_length is updated according to that model’s token limit.

from haystack.components.writers import DocumentWriter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack import Pipeline
from haystack.utils import ComponentDevice

document_splitter = DocumentSplitter(split_by="word", split_length=512, split_overlap=32)
# "cuda:0" assumes a GPU runtime; use ComponentDevice.from_str("cpu") if none is available
document_embedder = SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5", device=ComponentDevice.from_str("cuda:0")
)
document_writer = DocumentWriter(document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("document_splitter", document_splitter)
indexing_pipeline.add_component("document_embedder", document_embedder)
indexing_pipeline.add_component("document_writer", document_writer)

indexing_pipeline.connect("document_splitter", "document_embedder")
indexing_pipeline.connect("document_embedder", "document_writer")

indexing_pipeline.run({"document_splitter": {"documents": docs}})
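To see what split_length and split_overlap mean in practice, the following plain-Python sketch splits a text into overlapping word windows. It only approximates the idea; the helper name split_by_words and the tiny sample are assumptions made for this example, and the real DocumentSplitter may handle edge cases differently.

```python
def split_by_words(text, split_length, split_overlap):
    """Split text into chunks of up to split_length words.

    Consecutive chunks share split_overlap words. Hypothetical helper for illustration.
    """
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = split_by_words(sample, split_length=4, split_overlap=1)
for chunk in chunks:
    print(chunk)
```

With a window of 4 words and an overlap of 1, each chunk starts on the last word of the previous one. Keep in mind that a word count is not a token count: a model's tokenizer can produce more than one token per word, so a 512-word chunk can exceed a 512-token limit.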

Documents are stored in the InMemoryDocumentStore with their embeddings. Now it’s time to create the hybrid retrieval pipeline ✅

Creating a Pipeline for Hybrid Retrieval

Hybrid retrieval refers to the combination of multiple retrieval methods to enhance overall performance. In the context of search systems, a hybrid retrieval pipeline executes both traditional keyword-based search and dense vector search, later ranking the results with a cross-encoder model. This combination allows the search system to leverage the strengths of different approaches, providing more accurate and diverse results.

Here are the required steps for a hybrid retrieval pipeline:

1) Initialize Retrievers and the Embedder

Initialize an InMemoryEmbeddingRetriever and an InMemoryBM25Retriever to perform both dense and keyword-based retrieval. For dense retrieval, you also need a SentenceTransformersTextEmbedder that computes the embedding of the search query by using the same embedding model, BAAI/bge-small-en-v1.5, that was used in the indexing pipeline:

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_embedder = SentenceTransformersTextEmbedder(
    model="BAAI/bge-small-en-v1.5", device=ComponentDevice.from_str("cuda:0")
)
embedding_retriever = InMemoryEmbeddingRetriever(document_store)
bm25_retriever = InMemoryBM25Retriever(document_store)
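Conceptually, the dense side compares the query embedding with every stored document embedding and keeps the closest matches. The sketch below shows that idea in plain Python with hand-written three-dimensional vectors. The helper names, the vectors, and the choice of cosine similarity are assumptions for illustration; the document store may be configured with a different similarity function, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_by_embedding(query_embedding, doc_embeddings, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for this example.
doc_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.7, 0.3, 0.1],
}
query_embedding = [1.0, 0.0, 0.0]
top = retrieve_by_embedding(query_embedding, doc_embeddings, top_k=2)
print(top)
```

This is also why the query must be embedded with the same model as the documents: the comparison only makes sense when both vectors live in the same embedding space.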

2) Join Retrieval Results

Haystack offers several joining methods in DocumentJoiner for different use cases, such as merge and reciprocal_rank_fusion. In this example, you will use the default concatenate mode to join the documents coming from the two Retrievers, since the Ranker will be the main component that ranks the documents by relevance.

from haystack.components.joiners import DocumentJoiner

document_joiner = DocumentJoiner()
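For comparison with the concatenate mode, the idea behind reciprocal rank fusion can be sketched in a few lines of plain Python. This follows the commonly cited formulation, in which each document earns 1 / (k + rank) from every list it appears in. It is not Haystack's implementation, and the function name, the constant k=60, and the document IDs are assumptions for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) per list, with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused ranking, without needing the BM25 and embedding scores to be on a comparable scale. In this tutorial that job is left to the Ranker instead, so plain concatenation is enough.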

3) Rank the Results

Use the TransformersSimilarityRanker, which scores the relevance of all retrieved documents for the given search query by using a cross-encoder model. In this example, you will use the BAAI/bge-reranker-base model to rank the retrieved documents, but you can replace it with other cross-encoder models on Hugging Face.

from haystack.components.rankers import TransformersSimilarityRanker

ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

4) Create the Hybrid Retrieval Pipeline

Add all initialized components to your pipeline and connect them.

from haystack import Pipeline

hybrid_retrieval = Pipeline()
hybrid_retrieval.add_component("text_embedder", text_embedder)
hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
hybrid_retrieval.add_component("document_joiner", document_joiner)
hybrid_retrieval.add_component("ranker", ranker)

hybrid_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_retrieval.connect("bm25_retriever", "document_joiner")
hybrid_retrieval.connect("embedding_retriever", "document_joiner")
hybrid_retrieval.connect("document_joiner", "ranker")

5) Visualize the Pipeline (Optional)

To understand how you formed the hybrid retrieval pipeline, use the draw() method of the pipeline. If you’re running this notebook on Google Colab, the generated file will be saved in the “Files” section on the sidebar.

hybrid_retrieval.draw("hybrid-retrieval.png")

Testing the Hybrid Retrieval

Pass the query to text_embedder, bm25_retriever and ranker and run the retrieval pipeline:

query = "apnea in infants"

result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
)

Pretty Print the Results

Create a function that prints the results in the style of a simple search results page.

def pretty_print_results(prediction):
    # Print each ranked document's title and score, followed by its abstract
    for doc in prediction["documents"]:
        print(doc.meta["title"], "\t", doc.score)
        print(doc.meta["abstract"])
        print("\n", "\n")


pretty_print_results(result["ranker"])

What’s next

🎉 Congratulations! You’ve created a hybrid retrieval pipeline!

If you’d like to use this retrieval method in a RAG pipeline, check out Tutorial: Creating Your First QA Pipeline with Retrieval-Augmentation to learn about the next steps.

To stay up to date on the latest Haystack developments, you can sign up for our newsletter or join the Haystack Discord community.

Thanks for reading!