Advanced RAG: Query Expansion
Expand keyword queries to improve recall and provide more context to RAG.
August 14, 2024This is part one of the Advanced Use Cases series:
1ď¸âŁ Extract Metadata from Queries to Improve Retrieval
2ď¸âŁ Query Expansion
3ď¸âŁ Query Decomposition
4ď¸âŁ Automated Metadata Enrichment
The quality of RAG (retrieval augmented generation) highly depends on the quality of the first step in the process: retrieval. The generation step can only be as good as the context its working on, which it will receive as a result of a retrieval step.
However, retrieval is also in turn dependent on the query that it receives. There are multiple types of retrieval: keyword based, semantic search (embedding) based, hybrid, or even in some cases simply based on the results of a query to an API (for example, the results of websearch and so on). But at the end of the day, in the majority of cases, thereâs a human behind a keyboard typing a query, and humans are not guaranteed to produce good quality queries for the results they intend to get.
In this article, weâll walk you through a very simple yet effective technique that allows us to make sure we are retrieving more of, and more relevant bits of context to a given query: query expansion.
TL;DR: Query expansion increases the number of results, so it increases recall (vs precision). In general, BM25 favors precision while embedding retrieval favors recall (See this explanation by Nils Reimers). So, it makes sense to use BM25+query expansion to increase recall in cases where you want to rely on keyword search.
Query Expansion
Query expansion is a technique where we take the user query, and generate a certain number of similar queries. For example:
User Query: âopen source NLP frameworksâ
After Query Expansion: [ânatural language processing toolsâ, âfree nlp librariesâ, âopen-source language processing platformsâ, âNLP software with open-source codeâ, âopen source NLP frameworksâ]
This helps improve retrieval results, and in turn the quality of RAG results in cases where:
- The user query is vague or badly formed.
- In cases of keyword based retrieval, it also allows you to cover your bases with queries of similar meaning or synonyms.
Take âglobal warmingâ as an example, query expansion would allow us to make sure weâre also doing keyword search for âclimate changeâ or similar queries.
Letâs start by building a simple QueryExpander
. This component is using an OpenAI model (gpt-3.5-turbo
in this case) to generate a certain number
of additional queries:
@component
class QueryExpander:
def __init__(self, prompt: Optional[str] = None, model: str = "gpt-3.5-turbo"):
self.query_expansion_prompt = prompt
self.model = model
if prompt == None:
self.query_expansion_prompt = """
You are part of an information system that processes users queries.
You expand a given query into {{ number }} queries that are similar in meaning.
Structure:
Follow the structure shown below in examples to generate expanded queries.
Examples:
1. Example Query 1: "climate change effects"
Example Expanded Queries: ["impact of climate change", "consequences of global warming", "effects of environmental changes"]
2. Example Query 2: ""machine learning algorithms""
Example Expanded Queries: ["neural networks", "clustering", "supervised learning", "deep learning"]
Your Task:
Query: "{{query}}"
Example Expanded Queries:
"""
builder = PromptBuilder(self.query_expansion_prompt)
llm = OpenAIGenerator(model = self.model)
self.pipeline = Pipeline()
self.pipeline.add_component(name="builder", instance=builder)
self.pipeline.add_component(name="llm", instance=llm)
self.pipeline.connect("builder", "llm")
@component.output_types(queries=List[str])
def run(self, query: str, number: int = 5):
result = self.pipeline.run({'builder': {'query': query, 'number': number}})
expanded_query = json.loads(result['llm']['replies'][0]) + [query]
return {"queries": list(expanded_query)}
To replicate the example user query and expanded queries as you see above, you would run the component as follows:
expander = QueryExpander()
expander.run(query="open source nlp frameworks", number=4)
This would result in the component returning queries
that include the original query + 4 expanded queries:
{'queries': ['natural language processing tools',
'free nlp libraries',
'open-source language processing platforms',
'NLP software with open-source code',
'open source nlp frameworks']}
Retrieval With Query Expansion
Letâs take a look at what happens if we use query expansion as a step in our retrieval pipeline. Letâs look at this through a very simple and small demo. To this end, I used some dummy data. Hereâs the list of documents
I used:
documents = [
Document(content="The effects of climate are many including loss of biodiversity"),
Document(content="The impact of climate change is evident in the melting of the polar ice caps."),
Document(content="Consequences of global warming include the rise in sea levels."),
Document(content="One of the effects of environmental changes is the change in weather patterns."),
Document(content="There is a global call to reduce the amount of air travel people take."),
Document(content="Air travel is one of the core contributors to climate change."),
Document(content="Expect warm climates in Turkey during the summer period."),
]
When asking to retrieve the top 3 documents to the query “climate changeâ using the InMemoryBM25Retriever
(so, weâre doing keyword search) hereâs what we get as our top 3 candidates:
'Air travel is one of the core contributors to climate change.'
'The impact of climate change is evident in the melting of the polar ice caps.'
'The effects of climate are many including loss of biodiversity'
There are 2 things to notice here:
- Weâre only asking for 3 documents, and weâre getting 3 relevant documents to the query âclimate changeâ. In this sense, this retrieval is completely valid and has done a good job.
- But, because weâre using the query âclimate changeâ in combination with a keyword retriever, we are actually missing out on some documents that may be even more relevant to the query. For example, the document with âglobal warmingâ is completely left out.
You can start to see how this could impact the results you get in cases where users are typing vague queries or keywords into the search box.
Now, letâs add query expansion to the mix. We will be using a custom retriever this time called the MultiQueryInMemoryBM25Retriever
which can accept a list of queries
instead of a single query
(see the cookbook for the full code). Hereâs the retrieval pipeline that we create:
query_expander = QueryExpander()
retriever = MultiQueryInMemoryBM25Retriever(InMemoryBM25Retriever(document_store=doc_store))
expanded_retrieval_pipeline = Pipeline()
expanded_retrieval_pipeline.add_component("expander", query_expander)
expanded_retrieval_pipeline.add_component("keyword_retriever", retriever)
expanded_retrieval_pipeline.connect("expander.queries", "keyword_retriever.queries")
Now, we can run this pipeline, again with the same query âclimate changeâ
expanded_retrieval_pipeline.run({"expander": {"query": "climate change"}},
include_outputs_from=["expander"])
And we get the following results. The query expander has created the following queries
:
'expander': {'queries': ['global warming consequences',
'environmental impact of climate change',
'effects of climate variability',
'implications of climate crisis',
'consequences of greenhouse gas emissions',
'climate change']}}
Note that you may get different results because your
QueryExpander
may generate differentqueries
And weâve received the following documents from the retrieval pipeline:
'Consequences of global warming include the rise in sea levels.'
'The impact of climate change is evident in the melting of the polar ice caps.',
'There is a global call to reduce the amount of air travel people take.'
'The effects of climate are many including loss of biodiversity'
'One of the effects of environmental changes is the change in weather patterns.'
'Air travel is one of the core contributors to climate change.'
Notice how weâre able to add context about âglobal warmingâ and âeffects of environmental changeâ.
Using Query Expansion for RAG
In the example cookbook, weâve also added a section on using query expansion for RAG on Wikipedia pages. We index the following wikipedia pages into an InMemoryDocumentStore
:
"Electric_vehicle", "Dam", "Electric_battery", "Tree", "Solar_panel", "Nuclear_power",
"Wind_power", "Hydroelectricity", "Coal", "Natural_gas",
"Greenhouse_gas", "Renewable_energy", "Fossil_fuel"
And then, we construct a RAG pipeline. For our resulting prompt to the LLM, we also indicate what the original query from the user was.
template = """
You are part of an information system that summarises related documents.
You answer a query using the textual content from the documents retrieved for the
following query.
You build the summary answer based only on quoting information from the documents.
You should reference the documents you used to support your answer.
###
Original Query: "{{query}}"
Retrieved Documents: {{documents}}
Summary Answer:
"""
query_expander = QueryExpander()
retriever = MultiQueryInMemoryBM25Retriever(InMemoryBM25Retriever(document_store=doc_store))
prompt_builder = PromptBuilder(template = template)
llm = OpenAIGenerator()
query_expanded_rag_pipeline = Pipeline()
query_expanded_rag_pipeline.add_component("expander", query_expander)
query_expanded_rag_pipeline.add_component("keyword_retriever", retriever)
query_expanded_rag_pipeline.add_component("prompt", prompt_builder)
query_expanded_rag_pipeline.add_component("llm", llm)
query_expanded_rag_pipeline.connect("expander.queries", "keyword_retriever.queries")
query_expanded_rag_pipeline.connect("keyword_retriever.documents", "prompt.documents")
query_expanded_rag_pipeline.connect("prompt", "llm")
Running this pipeline with the simple query âgreen energy sourcesâ with the query expander, weâre able to get a response constructed from Wikipedia pages including âElectric Vehicleâ, âWind Powerâ, âRenewable Energyâ, âFossil Fuelâ and âNuclear Powerâ. Without the MultiQueryInMemoryBM25Retriever
, we rely on the top k results from a single pass of BM25 retrieval on the query âgreen energy sourcesâ resulting in a response constructed from the pages âRenewable energyâ, âWind Powerâ and âFossil Fuelâ
Wrapping Up
Query Expansion is a great technique that will allow you to get a wider range of relevant resources while still using keyword search. While semantic search is a great option, it does require the use of an embedding model, and the existence of embeddings for the data source we will perform search on. This makes keyword based search quite an attractive option for faster, cheaper retrieval.
This does however mean that we heavily rely on the quality of the provided query. Query expansion allows you to navigate this issue by generating similar queries to the user query.
In my opinion, one of the main advantages of this technique is that it allows you to avoid embedding documentation at each update, while still managing to increase the relevance of retrieved documents at query time. Keyword retrieval doesnât require any extra embedding step, so the only inferencing happening at retrieval time in this scenario is when we ask an LLM to generate a certain number of similar queries.