Agentic RAG with Llama 3.2 3B



In their Llama 3.2 collection, Meta released two small yet powerful Language Models.

In this notebook, we’ll use the 3B model to build an Agentic Retrieval Augmented Generation application.

🎯 Our goal is to create a system that answers questions using a knowledge base focused on the Seven Wonders of the Ancient World. If the retrieved documents don’t contain the answer, the application will fall back to web search for additional context.

Stack:

  • Llama 3.2 (meta-llama/Llama-3.2-3B-Instruct), loaded with Hugging Face Transformers
  • Haystack, an orchestration framework for building LLM applications
  • DuckDuckGo API web search, for the fallback when the knowledge base is not enough

Setup

! pip install haystack-ai duckduckgo-api-haystack transformers sentence-transformers datasets

Create our knowledge base

In this section, we download a dataset about the Seven Wonders of the Ancient World, enrich each document with a semantic vector, and store the documents in an in-memory database.

To better understand this process, you can explore the introductory Haystack tutorial.

from datasets import load_dataset
from haystack import Document

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

document_store = InMemoryDocumentStore()

dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()

docs_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])
151
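
The 151 above is the return value of write_documents: the number of documents indexed. As a quick sanity check, we can also ask the document store directly:

# The store should now contain all 151 embedded documents
print(document_store.count_documents())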

Load and try Llama 3.2

We will use Hugging Face Transformers to load the model in Colab.

There are plenty of other options for using open models with Haystack, including, for example, Ollama for local inference or serving with Groq (πŸ“• Choosing the Right Generator).
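
For instance, here is a minimal sketch with the Ollama integration (an assumption: it requires pip install ollama-haystack and a running Ollama server with the llama3.2:3b model pulled):

from haystack_integrations.components.generators.ollama import OllamaGenerator

# Drop-in alternative for local inference; not used in the rest of this notebook
ollama_generator = OllamaGenerator(model="llama3.2:3b")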

Authorization

import getpass, os

os.environ["HF_TOKEN"] = getpass.getpass("Your Hugging Face token")
import torch
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="meta-llama/Llama-3.2-3B-Instruct",
    huggingface_pipeline_kwargs={"device_map":"auto",
                                 "torch_dtype":torch.bfloat16},
    generation_kwargs={"max_new_tokens": 256})

generator.warm_up()
prompt = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
  What is the capital of France?<|eot_id|>
  <|start_header_id|>assistant<|end_header_id|>"""

generator.run(prompt)
{'replies': ['\n\nThe capital of France is Paris.']}
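
Instead of writing the special tokens by hand, we could build the same prompt with the tokenizer's chat template (a sketch; it assumes your token has access to the gated meta-llama repository):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
messages = [{"role": "user", "content": "What is the capital of France?"}]
# add_generation_prompt appends the assistant header so the model starts answering
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)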

Build the πŸ•΅πŸ» Agentic RAG Pipeline

Here’s the idea πŸ‘‡

  • Perform a vector search on our knowledge base using the query.
  • Pass the top 5 retrieved documents to Llama, injecting them into a specific prompt.
  • In the prompt, instruct the model to reply with “no_answer” if it cannot infer the answer from the documents; otherwise, provide the answer.
  • If “no_answer” is returned, run a web search and inject the results into a new prompt.
  • Let Llama generate a final answer based on the web search results.

For a detailed explanation of a similar use case, take a look at this tutorial: Building Fallbacks to Websearch with Conditional Routing.

Retrieval part

Let’s initialize the components to use for the initial retrieval phase.

from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
retriever = InMemoryEmbeddingRetriever(document_store, top_k=5)

Prompt template

Let’s define the first prompt template, which instructs the model to:

  • answer the query based on the retrieved documents, if possible
  • reply with 'no_answer' otherwise

from haystack.components.builders import PromptBuilder

prompt_template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Answer the following query given the documents.
If the answer is not contained within the documents reply with 'no_answer'.
If the answer is contained within the documents, start the answer with "FROM THE KNOWLEDGE BASE: ".

Documents:
{% for document in documents %}
  {{document.content}}
{% endfor %}

Query: {{query}}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""

prompt_builder = PromptBuilder(template=prompt_template)
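
To inspect the rendered prompt, we can run the builder on its own with a toy document (a hypothetical example; in the pipeline, the documents will come from the retriever):

# Render the template with a dummy document to see the final prompt
rendered = prompt_builder.run(
    documents=[Document(content="The Colossus of Rhodes depicted the sun god Helios.")],
    query="Who does the Colossus of Rhodes depict?",
)
print(rendered["prompt"])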

Conditional Router

This is the component that will perform data routing, depending on the reply given by the Language Model.

from haystack.components.routers import ConditionalRouter

routes = [
    {
        "condition": "{{'no_answer' in replies[0]}}",
        "output": "{{query}}",
        "output_name": "go_to_websearch",
        "output_type": str,
    },
    {
        "condition": "{{'no_answer' not in replies[0]}}",
        "output": "{{replies[0]}}",
        "output_name": "answer",
        "output_type": str,
    },
]

router = ConditionalRouter(routes)
router.run(replies=["this is the answer!"])
{'answer': 'this is the answer!'}
router.run(replies=["no_answer"], query="my query")
{'go_to_websearch': 'my query'}
Web search

If the answer is not in the knowledge base, the pipeline falls back to the web. Let's first try the DuckDuckGo web search component on its own.

from duckduckgo_api_haystack import DuckduckgoApiWebSearch

websearch = DuckduckgoApiWebSearch(top_k=5)
# Perform a search
results = websearch.run(query="Where is Tanzania?")

# Access the search results
documents = results["documents"]
links = results["links"]

print("Found documents:")
for doc in documents:
    print(f"Content: {doc.content}")

print("\nSearch Links:")
for link in links:
    print(link)
Found documents:
Content: Tanzania is a country in East Africa within the African Great Lakes region. It is bordered by Uganda, Kenya, the Indian Ocean, Mozambique, Malawi, Zambia, Rwanda, Burundi, and the Democratic Republic of the Congo.
Content: Tanzania is a country in East Africa's Great Lakes Region, located just below the Equator. It is bordered by eight countries and the Indian Ocean, and has diverse geographical features such as mountains, lakes, rivers, and islands.
Content: Tanzania is an East African country formed by the union of Tanganyika and Zanzibar in 1964. It has diverse landscapes, including Mount Kilimanjaro, Lake Victoria, and the Great Rift Valley, and a rich cultural heritage.
Content: Tanzania is the largest and most populous country in East Africa, with a total area of 947,300 sq km and a coastline of 1,424 km. It has diverse natural features, including mountains, lakes, rivers, and islands, and borders eight other countries.
Content: Tanzania is a country in Eastern Africa, bordering the Indian Ocean, between Kenya and Mozambique. It has many lakes, national parks, and mountains, including Mount Kilimanjaro, the highest point in Africa.

Search Links:
https://en.wikipedia.org/wiki/Tanzania
https://www.worldatlas.com/maps/tanzania
https://www.britannica.com/place/Tanzania
https://www.cia.gov/the-world-factbook/countries/tanzania/
https://en.wikipedia.org/wiki/Geography_of_Tanzania
Prompt template after web search

The second prompt template instructs the model to answer the query based on the documents retrieved from the web.

prompt_template_after_websearch = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Answer the following query given the documents retrieved from the web.
Start the answer with "FROM THE WEB: ".

Documents:
{% for document in documents %}
  {{document.content}}
{% endfor %}

Query: {{query}}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""

prompt_builder_after_websearch = PromptBuilder(template=prompt_template_after_websearch)

Assembling the Pipeline

Now that we have all the components, we can assemble the full pipeline.

To handle the different prompt sources, we’ll use a BranchJoiner. This allows us to connect multiple output sockets (with prompts) to our language model. In our case, the prompt will either come from the initial prompt_builder or from prompt_builder_after_websearch.

from haystack.components.joiners import BranchJoiner

prompt_joiner = BranchJoiner(str)
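
We can check the joiner's behavior in isolation (a quick sketch; BranchJoiner receives its variadic input as a list and forwards the single value it contains):

# Whichever branch produces a prompt, BranchJoiner passes it through unchanged
print(prompt_joiner.run(value=["some prompt"]))  # {'value': 'some prompt'}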

from haystack import Pipeline

pipe = Pipeline()
pipe.add_component("text_embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("prompt_joiner", prompt_joiner)
pipe.add_component("llm", generator)
pipe.add_component("router", router)
pipe.add_component("websearch", websearch)
pipe.add_component("prompt_builder_after_websearch", prompt_builder_after_websearch)

pipe.connect("text_embedder", "retriever")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "prompt_joiner")
pipe.connect("prompt_joiner", "llm")
pipe.connect("llm.replies", "router.replies")
pipe.connect("router.go_to_websearch", "websearch.query")
pipe.connect("router.go_to_websearch", "prompt_builder_after_websearch.query")
pipe.connect("websearch.documents", "prompt_builder_after_websearch.documents")
pipe.connect("prompt_builder_after_websearch", "prompt_joiner")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7cd028903ca0>
πŸš… Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - prompt_joiner: BranchJoiner
  - llm: HuggingFaceLocalGenerator
  - router: ConditionalRouter
  - websearch: DuckduckgoApiWebSearch
  - prompt_builder_after_websearch: PromptBuilder
πŸ›€οΈ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> prompt_joiner.value (str)
  - prompt_joiner.value -> llm.prompt (str)
  - llm.replies -> router.replies (List[str])
  - router.go_to_websearch -> websearch.query (str)
  - router.go_to_websearch -> prompt_builder_after_websearch.query (str)
  - websearch.documents -> prompt_builder_after_websearch.documents (List[Document])
  - prompt_builder_after_websearch.prompt -> prompt_joiner.value (str)
pipe.show()

Agentic RAG in action! πŸ”Ž

def get_answer(query):
    result = pipe.run(
        {
            "text_embedder": {"text": query},
            "prompt_builder": {"query": query},
            "router": {"query": query},
        }
    )
    print(result["router"]["answer"])

query = "Why did people build Great Pyramid of Giza?"

get_answer(query)
FROM THE KNOWLEDGE BASE: The Great Pyramid of Giza was built as the tomb of Fourth Dynasty pharaoh Khufu, and its construction is believed to have taken around 27 years to complete.
query = "Where is Munich?"

get_answer(query)
FROM THE WEB: Munich is located in the south of Germany, and is the capital of the federal state of Bavaria. It is connected to other major cities in Germany and Austria, and has direct access to Italy.
query = "What does Rhodes Statue look like?"

get_answer(query)
FROM THE KNOWLEDGE BASE: The head of the Colossus of Rhodes was of a standard rendering at the time, with curly hair and evenly spaced spikes of bronze or silver flame radiating from it, similar to the images found on contemporary Rhodian coins.
query = "Was the the Tower of Pisa part of the 7 wonders of the ancient world?"

get_answer(query)
FROM THE WEB: No, the Leaning Tower of Pisa was one of the Seven Wonders of the Medieval World, but not of the ancient world.
query = "Who was general Muawiyah?"

get_answer(query)
FROM THE KNOWLEDGE BASE: Muawiyah I was a Muslim general who conquered Rhodes in 653.

(Notebook by Stefano Fiorucci)