RAG Pipeline Evaluation Using DeepEval

Last Updated: October 3, 2024

DeepEval is a framework to evaluate Retrieval Augmented Generation (RAG) pipelines. It supports metrics like context relevance, answer correctness, faithfulness, and more.

For more information about evaluators, supported metrics and usage, check out:

This notebook shows how to use the DeepEval-Haystack integration to evaluate a RAG pipeline against various metrics.

Prerequisites:

OpenAI key
- DeepEval uses OpenAI models to compute some metrics, so we need an OpenAI key.

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Enter OpenAI API key:··········
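If the key is already exported in your environment, prompting for it again is unnecessary. A minimal sketch, assuming a hypothetical helper named `ensure_env_var` (it is not part of Haystack or DeepEval) that only prompts when the variable is missing:

```python
import os
from getpass import getpass


def ensure_env_var(name, prompt_fn=getpass):
    """Return the value of environment variable `name`, prompting only if it is unset or empty.

    `ensure_env_var` is a hypothetical helper written for this illustration.
    """
    value = os.environ.get(name)
    if not value:
        value = prompt_fn(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_env_var("OPENAI_API_KEY")` would then leave an existing key untouched and ask for one otherwise.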

Install dependencies

!pip install "pydantic<1.10.10"
!pip install haystack-ai
!pip install "datasets>=2.6.1"
!pip install deepeval-haystack

Create a RAG pipeline

We’ll first need to create a RAG pipeline. Refer to this link for a detailed tutorial on how to create RAG pipelines.

In this notebook, we’re using the SQuAD v2 dataset to get the contexts, questions and ground truth answers.

Initialize the document store

from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

dataset = load_dataset("rajpurkar/squad_v2", split="validation")
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)

1204
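`list(set(...))` removes the duplicate contexts (SQuAD repeats a passage once for every question asked about it), but the resulting order can differ between Python sessions because string hashing is randomized by default. If a reproducible document order matters, one option is an order-preserving de-duplication. The sketch below uses a small stand-in list instead of the real `dataset["context"]` column, and `unique_in_order` is a hypothetical helper name:

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item in its original position."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(items))


# Stand-in for dataset["context"], where the same passage appears several times.
contexts_column = ["passage A", "passage A", "passage B", "passage A", "passage C"]
unique_contexts = unique_in_order(contexts_column)
print(unique_contexts)  # ['passage A', 'passage B', 'passage C']
```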

from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retriever = InMemoryBM25Retriever(document_store, top_k=3)

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-4o-mini-2024-07-18")
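To make the template concrete, here is a plain-Python sketch that assembles roughly the same text for a given list of document contents and a question. It does not use Jinja, so the exact whitespace will differ from what `PromptBuilder` produces; `render_prompt` is a hypothetical helper and the two example passages are made up for illustration.

```python
def render_prompt(documents, question):
    """Assemble approximately the prompt text: instruction, each document's content, then the question."""
    lines = ["Given the following information, answer the question.", "", "Context:"]
    lines.extend(f"    {content}" for content in documents)
    lines.extend(["", f"Question: {question}", "Answer:"])
    return "\n".join(lines)


# Made-up passages standing in for retrieved documents.
prompt = render_prompt(
    ["Example passage one about Normandy.", "Example passage two about Normandy."],
    "In what country is Normandy located?",
)
print(prompt)
```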

Build the RAG pipeline

from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder

rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

# Now, connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7d2b1da93b20>
🚅 Components
  - retriever: InMemoryBM25Retriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
  - llm.replies -> answer_builder.replies (List[str])
  - llm.meta -> answer_builder.meta (List[Dict[str, Any]])

Running the pipeline

question = "In what country is Normandy located?"

response = rag_pipeline.run(
    {"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)

print(response["answer_builder"]["answers"][0].data)

France
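The retriever in this pipeline ranks documents with BM25, a term-overlap scoring function. The sketch below implements the textbook Okapi BM25 formula on three toy documents to show the idea. It is an illustration only: Haystack's retriever may use a different BM25 variant, tokenizer and default parameters, and the function names and the values `k1=1.5`, `b=0.75` are assumptions of this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the classic Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "normandy is a region in france",
    "the rhine flows north through the upper rhine plain",
    "bm25 ranks documents by term overlap with the query",
]
scores = bm25_scores("where is normandy", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # normandy is a region in france
```

Because the score depends only on shared terms, a question whose wording overlaps with an unrelated passage can retrieve that passage instead of the right one.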

We’re done building our RAG pipeline. Let’s evaluate it now!

Get questions, contexts, responses and ground truths for evaluation

For computing most metrics, we will need to provide the following to the evaluator:

Questions
Generated responses
Retrieved contexts
Ground truth (Specifically, this is needed for context precision, context recall and answer correctness metrics)

We’ll start with three random questions from the dataset (see below) and get the matching contexts and responses for them.

Helper function to get context and responses for our questions

def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )

        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses
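The shape of the helper's return values can be checked without calling OpenAI by passing in a stand-in pipeline. In the sketch below, `FakeDocument`, `FakeAnswer` and `FakePipeline` are hypothetical classes that mimic only the attributes the helper reads; the helper itself is repeated so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    content: str


@dataclass
class FakeAnswer:
    data: str
    documents: list = field(default_factory=list)


class FakePipeline:
    """Hypothetical stand-in that mimics only the output shape used by the helper."""

    def run(self, inputs):
        query = inputs["retriever"]["query"]
        docs = [FakeDocument(f"context 1 for: {query}"), FakeDocument(f"context 2 for: {query}")]
        return {"answer_builder": {"answers": [FakeAnswer(data=f"answer to: {query}", documents=docs)]}}


def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
        answer = response["answer_builder"]["answers"][0]
        contexts.append([d.content for d in answer.documents])
        responses.append(answer.data)
    return contexts, responses


contexts, responses = get_contexts_and_responses(["q1", "q2"], FakePipeline())
print(contexts)   # one list of context strings per question
print(responses)  # one answer string per question
```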

question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)

Ground truths, review all fields

Now that we have questions, contexts, and responses, we’ll also get the matching ground truth answers.

ground_truths = [""] * len(question_map)

for question, index in question_map.items():
    idx = dataset["question"].index(question)
    ground_truths[index] = dataset["answers"][idx]["text"][0]

print("Questions:\n")
print("\n".join(questions))

Questions:

Which mountain range influenced the split of the regions?
What is the prize offered for finding a solution to P=NP?
Which Californio is located in the upper part?
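The ground-truth lookup above calls `.index(question)` once per question, which scans the question column each time. That is fine for three questions. For a larger evaluation set, one alternative is to build a dictionary once. The sketch below uses small stand-in lists in place of `dataset["question"]` and `dataset["answers"]`; `build_ground_truth_lookup` is a hypothetical helper. SQuAD v2 also contains unanswerable questions whose answer list is empty, so the sketch stores `None` for those instead of failing on `[0]`.

```python
def build_ground_truth_lookup(questions_column, answers_column):
    """Map each question to its first reference answer (None if it has no answers).

    When a question text appears more than once, the first occurrence wins,
    matching the behavior of list.index().
    """
    lookup = {}
    for question, answers in zip(questions_column, answers_column):
        if question in lookup:
            continue
        texts = answers["text"]
        lookup[question] = texts[0] if texts else None
    return lookup


# Stand-in columns with the same layout as the dataset's "question" and "answers" fields.
questions_column = ["Q1?", "Q2?", "Q1?"]
answers_column = [{"text": ["A1", "A1 alt"]}, {"text": []}, {"text": ["later duplicate"]}]
lookup = build_ground_truth_lookup(questions_column, answers_column)
print(lookup)  # {'Q1?': 'A1', 'Q2?': None}
```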

print("Contexts:\n")
for c in contexts:
    print(c[0])

Contexts:

The state is most commonly divided and promoted by its regional tourism groups as consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters by dividing the state along the lines where their jurisdictions for membership apply, as either northern or southern California, in contrast to the three-region point of view. Another influence is the geographical phrase South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would be included in the southern California region due to their remoteness from the central valley and interior desert landscape.
If a problem X is in C and hard for C, then X is said to be complete for C. This means that X is the hardest problem in C. (Since many problems could be equally hard, one might say that X is one of the hardest problems in C.) Thus the class of NP-complete problems contains the most difficult problems in NP, in the sense that they are the ones most likely not to be in P. Because the problem P = NP is not solved, being able to reduce a known NP-complete problem, Π2, to another problem, Π1, would indicate that there is no known polynomial-time solution for Π1. This is because a polynomial-time solution to Π1 would yield a polynomial-time solution to Π2. Similarly, because all NP problems can be reduced to the set, finding an NP-complete problem that can be solved in polynomial time would mean that P = NP.
In the centre of Basel, the first major city in the course of the stream, is located the "Rhine knee"; this is a major bend, where the overall direction of the Rhine changes from West to North. Here the High Rhine ends. Legally, the Central Bridge is the boundary between High and Upper Rhine. The river now flows North as Upper Rhine through the Upper Rhine Plain, which is about 300 km long and up to 40 km wide. The most important tributaries in this area are the Ill below of Strasbourg, the Neckar in Mannheim and the Main across from Mainz. In Mainz, the Rhine leaves the Upper Rhine Valley and flows through the Mainz Basin.

print("Responses:\n")
print("\n".join(responses))

Responses:

The Tehachapi mountain range influenced the split of the regions in California.
The prize offered for finding a solution to P=NP is US$1,000,000.
The Californio located in the upper part is the Ill below of Strasbourg.

print("Ground truths:\n")
print("\n".join(ground_truths))

Ground truths:

Tehachapis
$1,000,000
Monterey

Evaluate the RAG pipeline

Now that we have the questions, contexts, responses and ground truths, we can begin our pipeline evaluation and compute all the supported metrics.
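Since the evaluator receives these as parallel lists, a quick alignment check before spending API calls can catch mistakes early. A minimal sketch, assuming a hypothetical `check_evaluation_inputs` helper (not part of the integration):

```python
def check_evaluation_inputs(questions, contexts, responses, ground_truths=None):
    """Raise ValueError if the per-question lists are misaligned or badly shaped.

    Returns the number of test cases when everything lines up.
    """
    n = len(questions)
    named = {"contexts": contexts, "responses": responses}
    if ground_truths is not None:
        named["ground_truths"] = ground_truths
    for name, values in named.items():
        if len(values) != n:
            raise ValueError(f"{name} has {len(values)} items, expected {n}")
    # Each question needs its own list of context strings.
    for i, ctx in enumerate(contexts):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] must be a list of strings")
    return n


print(check_evaluation_inputs(["q"], [["c1", "c2"]], ["r"], ["g"]))  # 1
```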

Metrics computation

In addition to evaluating the final responses of the LLM, it is important that we also evaluate the individual components of the RAG pipeline, as they can significantly impact the overall performance. Therefore, there are different metrics to evaluate the retriever, the generator and the overall pipeline. For a full list of available metrics and their expected inputs, check out the DeepEvalEvaluator Docs.

The DeepEval documentation provides an explanation of the individual metrics with simple examples for each of them.

Contextual Precision

The contextual precision metric measures our RAG pipeline’s retriever by evaluating whether items in our contexts that are relevant to the given input are ranked higher than irrelevant ones.
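To see how ranking affects such a score, here is a sketch that assumes the weighted cumulative precision formula described in the DeepEval documentation: average precision@k over the positions that hold a relevant item. In DeepEval the relevance verdicts come from an LLM judge. Here they are hand-written 0/1 lists, and `contextual_precision` is a hypothetical function name.

```python
def contextual_precision(relevance):
    """Weighted cumulative precision over a ranked list of 0/1 relevance verdicts."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # precision@k, counted only at positions holding a relevant item
            score += relevant_so_far / k
    return score / total_relevant


print(contextual_precision([1, 0, 0]))  # 1.0  (relevant item ranked first)
print(contextual_precision([0, 1, 0]))  # 0.5  (relevant item ranked second)
print(contextual_precision([0, 0, 0]))  # 0.0  (nothing relevant retrieved)
```

With three retrieved contexts and one relevant item, the score drops from 1.0 to 0.5 as soon as an irrelevant context is ranked above the relevant one.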

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-4"})
context_precision_pipeline.add_component("evaluator", evaluator)

evaluation_results = context_precision_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

======================================================================

Metrics Summary

  - ✅ Contextual Precision (score: 1.0, threshold: 0.0, evaluation model: gpt-4, reason: The score is 1.00 because
all the relevant information is perfectly ranked. The first node in the retrieval context accurately mentions the 
'Tehachapis' influencing the split of regions, while the second and third nodes, discussing structural geology 
experiments and Victoria's warmest regions respectively, do not contain any specific influence on the regional 
split. Well done!)

For test case:

  - input: Which mountain range influenced the split of the regions?

  - actual output: The Tehachapi mountain range influenced the split of the regions in California.

  - expected output: Tehachapis

  - context: None

  - retrieval context: ['The state is most commonly divided and promoted by its regional tourism groups as 
consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the 
California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters 
by dividing the state along the lines where their jurisdictions for membership apply, as either northern or 
southern California, in contrast to the three-region point of view. Another influence is the geographical phrase 
South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in 
that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would
be included in the southern California region due to their remoteness from the central valley and interior desert 
landscape.', 'Among the most well-known experiments in structural geology are those involving orogenic wedges, 
which are zones in which mountains are built along convergent tectonic plate boundaries. In the analog versions of 
these experiments, horizontal layers of sand are pulled along a lower surface into a back stop, which results in 
realistic-looking patterns of faulting and the growth of a critically tapered (all angles remain the same) orogenic
wedge. Numerical models work in the same way as these analog models, though they are often more sophisticated and 
can include patterns of erosion and uplift in the mountain belt. This helps to show the relationship between 
erosion and the shape of the mountain range. These studies can also give useful information about pathways for 
metamorphism through pressure, temperature, space, and time.', "The Mallee and upper Wimmera are Victoria's warmest
regions with hot winds blowing from nearby semi-deserts. Average temperatures exceed 32 °C (90 °F) during summer 
and 15 °C (59 °F) in winter. Except at cool mountain elevations, the inland monthly temperatures are 2–7 °C (4–13 
°F) warmer than around Melbourne (see chart). Victoria's highest maximum temperature since World War II, of 48.8 °C
(119.8 °F) was recorded in Hopetoun on 7 February 2009, during the 2009 southeastern Australia heat wave."]

======================================================================

Metrics Summary

  - ✅ Contextual Precision (score: 0.5, threshold: 0.0, evaluation model: gpt-4, reason: The score is 0.50 because
only the second node in the retrieval contexts directly answers the question about the prize for solving P=NP, 
stating 'The P versus NP problem is one of the Millennium Prize Problems proposed by the Clay Mathematics 
Institute. There is a US$1,000,000 prize for resolving the problem.' The first and third nodes, despite discussing 
aspects of the P=NP problem, do not mention the prize, making them less relevant to the input. Therefore, they 
should be ranked lower.)

For test case:

  - input: What is the prize offered for finding a solution to P=NP?

  - actual output: The prize offered for finding a solution to P=NP is US$1,000,000.

  - expected output: $1,000,000

  - context: None

  - retrieval context: ['If a problem X is in C and hard for C, then X is said to be complete for C. This means
that X is the hardest problem in C. (Since many problems could be equally hard, one might say that X is one of the
hardest problems in C.) Thus the class of NP-complete problems contains the most difficult problems in NP, in the
sense that they are the ones most likely not to be in P. Because the problem P = NP is not solved, being able to
reduce a known NP-complete problem, Π2, to another problem, Π1, would indicate that there is no known
polynomial-time solution for Π1. This is because a polynomial-time solution to Π1 would yield a polynomial-time
solution to Π2. Similarly, because all NP problems can be reduced to the set, finding an NP-complete problem that
can be solved in polynomial time would mean that P = NP.', 'The question of whether P equals NP is one of the most
important open questions in theoretical computer science because of the wide implications of a solution. If the
answer is yes, many important problems can be shown to have more efficient solutions. These include various types
of integer programming problems in operations research, many problems in logistics, protein structure prediction in
biology, and the ability to find formal proofs of pure mathematics theorems. The P versus NP problem is one of the
Millennium Prize Problems proposed by the Clay Mathematics Institute. There is a US$1,000,000 prize for resolving
the problem.', 'What intractability means in practice is open to debate. Saying that a problem is not in P does not
imply that all large cases of the problem are hard or even that most of them are. For example, the decision problem
in Presburger arithmetic has been shown not to be in P, yet algorithms have been written that solve the problem in
reasonable times in most cases. Similarly, algorithms can solve the NP-complete knapsack problem over a wide range
of sizes in less than quadratic time and SAT solvers routinely handle large instances of the NP-complete Boolean
satisfiability problem.']

======================================================================

Metrics Summary

  - ✅ Contextual Precision (score: 0, threshold: 0.0, evaluation model: gpt-4, reason: The score is 0.00 because 
all the nodes in the retrieval context are irrelevant to the input query. The first node talks about the geography 
of the Rhine and its surrounding areas, the second node discusses the Germanic frontier and its relationship with 
Rome, and the third node explains the concept of tectonic plates and their movement, none of which provide 
information about the location of Californio.)

For test case:

  - input: Which Californio is located in the upper part?

  - actual output: The Californio located in the upper part is the Ill below of Strasbourg.

  - expected output: Monterey

  - context: None

  - retrieval context: ['In the centre of Basel, the first major city in the course of the stream, is located the
"Rhine knee"; this is a major bend, where the overall direction of the Rhine changes from West to North. Here the
High Rhine ends. Legally, the Central Bridge is the boundary between High and Upper Rhine. The river now flows
North as Upper Rhine through the Upper Rhine Plain, which is about 300 km long and up to 40 km wide. The most
important tributaries in this area are the Ill below of Strasbourg, the Neckar in Mannheim and the Main across from
Mainz. In Mainz, the Rhine leaves the Upper Rhine Valley and flows through the Mainz Basin.', 'From the death of
Augustus in AD 14 until after AD 70, Rome accepted as her Germanic frontier the water-boundary of the Rhine and
upper Danube. Beyond these rivers she held only the fertile plain of Frankfurt, opposite the Roman border fortress
of Moguntiacum (Mainz), the southernmost slopes of the Black Forest and a few scattered bridge-heads. The northern
section of this frontier, where the Rhine is deep and broad, remained the Roman boundary until the empire fell. The
southern part was different. The upper Rhine and upper Danube are easily crossed. The frontier which they form is
inconveniently long, enclosing an acute-angled wedge of foreign territory between the modern Baden and Württemberg.
The Germanic populations of these lands seem in Roman times to have been scanty, and Roman subjects from the modern
Alsace-Lorraine had drifted across the river eastwards.', "In the 1960s, a series of discoveries, the most
important of which was seafloor spreading, showed that the Earth's lithosphere, which includes the crust and rigid
uppermost portion of the upper mantle, is separated into a number of tectonic plates that move across the
plastically deforming, solid, upper mantle, which is called the asthenosphere. There is an intimate coupling
between the movement of the plates on the surface and the convection of the mantle: oceanic plate motions and
mantle convection currents always move in the same direction, because the oceanic lithosphere is the rigid upper
thermal boundary layer of the convecting mantle. This coupling between rigid plates moving on the surface of the
Earth and the convecting mantle is called plate tectonics."]

----------------------------------------------------------------------

✅ Tests finished! Run "deepeval login" to view evaluation results on the web.

[[{'name': 'contextual_precision', 'score': 1.0, 'explanation': "The score is 1.00 because all the relevant information is perfectly ranked. The first node in the retrieval context accurately mentions the 'Tehachapis' influencing the split of regions, while the second and third nodes, discussing structural geology experiments and Victoria's warmest regions respectively, do not contain any specific influence on the regional split. Well done!"}], [{'name': 'contextual_precision', 'score': 0.5, 'explanation': "The score is 0.50 because only the second node in the retrieval contexts directly answers the question about the prize for solving P=NP, stating 'The P versus NP problem is one of the Millennium Prize Problems proposed by the Clay Mathematics Institute. There is a US$1,000,000 prize for resolving the problem.' The first and third nodes, despite discussing aspects of the P=NP problem, do not mention the prize, making them less relevant to the input. Therefore, they should be ranked lower."}], [{'name': 'contextual_precision', 'score': 0, 'explanation': 'The score is 0.00 because all the nodes in the retrieval context are irrelevant to the input query. The first node talks about the geography of the Rhine and its surrounding areas, the second node discusses the Germanic frontier and its relationship with Rome, and the third node explains the concept of tectonic plates and their movement, none of which provide information about the location of Californio.'}]]
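The printed result is a list with one entry per question, each entry being a list of dictionaries with `name`, `score` and `explanation` keys. A short sketch of how this structure could be flattened into per-question rows and averaged; `summarize_results` is a hypothetical helper, and the scores are copied from the output above with the explanations left out for brevity.

```python
from statistics import mean


def summarize_results(results, questions):
    """Pair each question with its metric score and compute the mean score.

    Assumes `results` holds one list of {'name', 'score', ...} dicts per question.
    """
    rows = []
    for question, metric_results in zip(questions, results):
        for item in metric_results:
            rows.append({"question": question, "metric": item["name"], "score": float(item["score"])})
    return rows, mean(row["score"] for row in rows)


results = [
    [{"name": "contextual_precision", "score": 1.0}],
    [{"name": "contextual_precision", "score": 0.5}],
    [{"name": "contextual_precision", "score": 0}],
]
questions = [
    "Which mountain range influenced the split of the regions?",
    "What is the prize offered for finding a solution to P=NP?",
    "Which Californio is located in the upper part?",
]
rows, average = summarize_results(results, questions)
for row in rows:
    print(f'{row["score"]:.2f}  {row["question"]}')
print(average)  # 0.5
```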

Contextual Recall

Contextual recall measures the extent to which the retrieved contexts align with the ground truth.
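As a rough intuition, recall can be thought of as the fraction of ground-truth statements that can be attributed to at least one retrieved passage. The sketch below assumes that reading and replaces DeepEval's LLM-based attribution with a crude case-insensitive substring check, so it is an illustration of the ratio and not the library's implementation. `contextual_recall` is a hypothetical name, and the passages are short excerpts of the contexts printed earlier.

```python
def contextual_recall(expected_statements, retrieval_context):
    """Fraction of expected-output statements found in at least one retrieved passage.

    Attribution is a case-insensitive substring check, a crude stand-in
    for the LLM judgment DeepEval relies on.
    """
    if not expected_statements:
        return 0.0
    attributed = sum(
        any(statement.lower() in passage.lower() for passage in retrieval_context)
        for statement in expected_statements
    )
    return attributed / len(expected_statements)


california = ["Another influence is the geographical phrase South of the Tehachapis"]
rhine = ["The most important tributaries in this area are the Ill below of Strasbourg"]

print(contextual_recall(["Tehachapis"], california))  # 1.0
print(contextual_recall(["Monterey"], rhine))         # 0.0
```

With single-statement ground truths like the ones used here, this ratio can only be 0 or 1 per question.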

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4"})
context_recall_pipeline.add_component("evaluator", evaluator)

evaluation_results = context_recall_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

======================================================================

Metrics Summary

  - ✅ Contextual Recall (score: 1.0, threshold: 0.0, evaluation model: gpt-4, reason: The score is 1.00 because 
every part of the expected output, 'Tehachapis', is successfully found and matches perfectly with the context 
provided by the 1st node in the retrieval context. Well done!)

For test case:

  - input: Which mountain range influenced the split of the regions?

  - actual output: The Tehachapi mountain range influenced the split of the regions in California.

  - expected output: Tehachapis

  - context: None

  - retrieval context: ['The state is most commonly divided and promoted by its regional tourism groups as 
consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the 
California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters 
by dividing the state along the lines where their jurisdictions for membership apply, as either northern or 
southern California, in contrast to the three-region point of view. Another influence is the geographical phrase 
South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in 
that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would
be included in the southern California region due to their remoteness from the central valley and interior desert 
landscape.', 'Among the most well-known experiments in structural geology are those involving orogenic wedges, 
which are zones in which mountains are built along convergent tectonic plate boundaries. In the analog versions of 
these experiments, horizontal layers of sand are pulled along a lower surface into a back stop, which results in 
realistic-looking patterns of faulting and the growth of a critically tapered (all angles remain the same) orogenic
wedge. Numerical models work in the same way as these analog models, though they are often more sophisticated and 
can include patterns of erosion and uplift in the mountain belt. This helps to show the relationship between 
erosion and the shape of the mountain range. These studies can also give useful information about pathways for 
metamorphism through pressure, temperature, space, and time.', "The Mallee and upper Wimmera are Victoria's warmest
regions with hot winds blowing from nearby semi-deserts. Average temperatures exceed 32 °C (90 °F) during summer 
and 15 °C (59 °F) in winter. Except at cool mountain elevations, the inland monthly temperatures are 2–7 °C (4–13 
°F) warmer than around Melbourne (see chart). Victoria's highest maximum temperature since World War II, of 48.8 °C
(119.8 °F) was recorded in Hopetoun on 7 February 2009, during the 2009 southeastern Australia heat wave."]

======================================================================

Metrics Summary

  - ✅ Contextual Recall (score: 1.0, threshold: 0.0, evaluation model: gpt-4, reason: The score is 1.00 because 
the expected output "$1,000,000" perfectly matches with the information given in the 2nd node in the retrieval 
context. Well done!)

For test case:

  - input: What is the prize offered for finding a solution to P=NP?

  - actual output: The prize offered for finding a solution to P=NP is US$1,000,000.

  - expected output: $1,000,000

  - context: None

======================================================================

Metrics Summary

  - ✅ Contextual Recall (score: 0.0, threshold: 0.0, evaluation model: gpt-4, reason: The score is 0.00 because 
none of the nodes in the retrieval context mention 'Monterey'.)

For test case:

  - input: Which Californio is located in the upper part?

  - actual output: The Californio located in the upper part is the Ill below of Strasbourg.

  - expected output: Monterey

  - context: None

----------------------------------------------------------------------

✅ Tests finished! Run "deepeval login" to view evaluation results on the web.

[[{'name': 'contextual_recall', 'score': 1.0, 'explanation': "The score is 1.00 because every part of the expected output, 'Tehachapis', is successfully found and matches perfectly with the context provided by the 1st node in the retrieval context. Well done!"}], [{'name': 'contextual_recall', 'score': 1.0, 'explanation': 'The score is 1.00 because the expected output "$1,000,000" perfectly matches with the information given in the 2nd node in the retrieval context. Well done!'}], [{'name': 'contextual_recall', 'score': 0.0, 'explanation': "The score is 0.00 because none of the nodes in the retrieval context mention 'Monterey'."}]]

Contextual Relevancy

The contextual relevancy metric measures the quality of our RAG pipeline’s retriever by evaluating the overall relevance of the context for a given question.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RELEVANCE, metric_params={"model":"gpt-4"})
context_relevancy_pipeline.add_component("evaluator", evaluator)

evaluation_results = context_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

======================================================================

Metrics Summary

  - ✅ Contextual Relevancy (score: 0.09090909090909091, threshold: 0.0, evaluation model: gpt-4, reason: The score
is 0.09 because the sentences provided do not directly address the influence of a mountain range on the division of
regions. They discuss the division of California, temperature data for Victoria, and experiments in structural 
geology, but do not provide information pertinent to the input question.)

For test case:

  - input: Which mountain range influenced the split of the regions?

  - actual output: The Tehachapi mountain range influenced the split of the regions in California.

  - expected output: None

  - context: None

  - retrieval context: ['The state is most commonly divided and promoted by its regional tourism groups as 
consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the 
California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters 
by dividing the state along the lines where their jurisdictions for membership apply, as either northern or 
southern California, in contrast to the three-region point of view. Another influence is the geographical phrase 
South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in 
that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would
be included in the southern California region due to their remoteness from the central valley and interior desert 
landscape.', 'Among the most well-known experiments in structural geology are those involving orogenic wedges, 
which are zones in which mountains are built along convergent tectonic plate boundaries. In the analog versions of 
these experiments, horizontal layers of sand are pulled along a lower surface into a back stop, which results in 
realistic-looking patterns of faulting and the growth of a critically tapered (all angles remain the same) orogenic
wedge. Numerical models work in the same way as these analog models, though they are often more sophisticated and 
can include patterns of erosion and uplift in the mountain belt. This helps to show the relationship between 
erosion and the shape of the mountain range. These studies can also give useful information about pathways for 
metamorphism through pressure, temperature, space, and time.', "The Mallee and upper Wimmera are Victoria's warmest
regions with hot winds blowing from nearby semi-deserts. Average temperatures exceed 32 °C (90 °F) during summer 
and 15 °C (59 °F) in winter. Except at cool mountain elevations, the inland monthly temperatures are 2–7 °C (4–13 
°F) warmer than around Melbourne (see chart). Victoria's highest maximum temperature since World War II, of 48.8 °C
(119.8 °F) was recorded in Hopetoun on 7 February 2009, during the 2009 southeastern Australia heat wave."]

======================================================================

Metrics Summary

  - ✅ Contextual Relevancy (score: 0.5384615384615384, threshold: 0.0, evaluation model: gpt-4, reason: The score 
is 0.54 because the majority of the sentences extracted from the retrieval context, particularly from nodes 2 and 
3, focus on explaining the complexities and characteristics of the P=NP problem, rather than directly addressing 
the specific question about the prize offered for finding a solution to this problem.)

For test case:

  - input: What is the prize offered for finding a solution to P=NP?

  - actual output: The prize offered for finding a solution to P=NP is US$1,000,000.

  - expected output: None

  - context: None

======================================================================

Metrics Summary

  - ✅ Contextual Relevancy (score: 0.0, threshold: 0.0, evaluation model: gpt-4, reason: The score is 0.00 because
none of the sentences in the retrieval context provide any information related to the queried Californio's 
location. They mostly discuss geological and geographical matters like tectonic plates and river courses, which are
irrelevant to the original question.)

For test case:

  - input: Which Californio is located in the upper part?

  - actual output: The Californio located in the upper part is the Ill below of Strasbourg.

  - expected output: None

  - context: None

----------------------------------------------------------------------

✅ Tests finished! Run "deepeval login" to view evaluation results on the web.

[[{'name': 'contextual_relevance', 'score': 0.09090909090909091, 'explanation': 'The score is 0.09 because the sentences provided do not directly address the influence of a mountain range on the division of regions. They discuss the division of California, temperature data for Victoria, and experiments in structural geology, but do not provide information pertinent to the input question.'}], [{'name': 'contextual_relevance', 'score': 0.5384615384615384, 'explanation': 'The score is 0.54 because the majority of the sentences extracted from the retrieval context, particularly from nodes 2 and 3, focus on explaining the complexities and characteristics of the P=NP problem, rather than directly addressing the specific question about the prize offered for finding a solution to this problem.'}], [{'name': 'contextual_relevance', 'score': 0.0, 'explanation': "The score is 0.00 because none of the sentences in the retrieval context provide any information related to the queried Californio's location. They mostly discuss geological and geographical matters like tectonic plates and river courses, which are irrelevant to the original question."}]]
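
The printed result is a nested list: one inner list per question, holding one dictionary per metric. A minimal sketch of pulling the scores out of that structure follows. The helper name `scores_for` is hypothetical, the scores are copied from the output above, and the explanations are shortened to a placeholder.

```python
from statistics import mean

# Same shape as evaluation_results["evaluator"]["results"] printed above.
# Scores are copied from that output; explanations are replaced by a placeholder.
results = [
    [{"name": "contextual_relevance", "score": 0.09090909090909091, "explanation": "(omitted)"}],
    [{"name": "contextual_relevance", "score": 0.5384615384615384, "explanation": "(omitted)"}],
    [{"name": "contextual_relevance", "score": 0.0, "explanation": "(omitted)"}],
]


def scores_for(results, metric_name):
    """Collect the score of one metric for every question, in question order."""
    return [
        entry["score"]
        for per_question in results
        for entry in per_question
        if entry["name"] == metric_name
    ]


scores = scores_for(results, "contextual_relevance")
print("per-question scores:", scores)
print("mean score:", round(mean(scores), 3))
```

Averaging over three questions is only illustrative; a larger sample would be needed before drawing conclusions about the retriever.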

Answer relevancy

The answer relevancy metric measures the quality of our RAG pipeline’s response by evaluating how relevant the response is to the question asked.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model": "gpt-4"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)

evaluation_results = answer_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])

======================================================================

Metrics Summary

  - ✅ Answer Relevancy (score: 0.3333333333333333, threshold: 0.0, evaluation model: gpt-4, reason: The score is 
0.33 because the answer correctly identifies the Tehachapi mountain range as the influence for the split of the 
regions in California. However, the score is not higher because the majority of the points presented in the answer,
including details about AAA Auto Clubs, orogenic wedges, numerical models, and Victoria's temperature records, are 
irrelevant to the original question.)

For test case:

  - input: Which mountain range influenced the split of the regions?

  - actual output: The Tehachapi mountain range influenced the split of the regions in California.

  - expected output: None

  - context: None

  - retrieval context: ['The state is most commonly divided and promoted by its regional tourism groups as 
consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the 
California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters 
by dividing the state along the lines where their jurisdictions for membership apply, as either northern or 
southern California, in contrast to the three-region point of view. Another influence is the geographical phrase 
South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in 
that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would
be included in the southern California region due to their remoteness from the central valley and interior desert 
landscape.', 'Among the most well-known experiments in structural geology are those involving orogenic wedges, 
which are zones in which mountains are built along convergent tectonic plate boundaries. In the analog versions of 
these experiments, horizontal layers of sand are pulled along a lower surface into a back stop, which results in 
realistic-looking patterns of faulting and the growth of a critically tapered (all angles remain the same) orogenic
wedge. Numerical models work in the same way as these analog models, though they are often more sophisticated and 
can include patterns of erosion and uplift in the mountain belt. This helps to show the relationship between 
erosion and the shape of the mountain range. These studies can also give useful information about pathways for 
metamorphism through pressure, temperature, space, and time.', "The Mallee and upper Wimmera are Victoria's warmest
regions with hot winds blowing from nearby semi-deserts. Average temperatures exceed 32 °C (90 °F) during summer 
and 15 °C (59 °F) in winter. Except at cool mountain elevations, the inland monthly temperatures are 2–7 °C (4–13 
°F) warmer than around Melbourne (see chart). Victoria's highest maximum temperature since World War II, of 48.8 °C
(119.8 °F) was recorded in Hopetoun on 7 February 2009, during the 2009 southeastern Australia heat wave."]

======================================================================

Metrics Summary

  - ✅ Answer Relevancy (score: 0.5, threshold: 0.0, evaluation model: gpt-4, reason: The score is 0.50 because 
while the answer did provide the correct information about the prize amount for solving the P=NP problem, it also 
included unnecessary details about the significance of the P=NP problem itself, which was not asked for in the 
question.)

For test case:

  - input: What is the prize offered for finding a solution to P=NP?

  - actual output: The prize offered for finding a solution to P=NP is US$1,000,000.

  - expected output: None

  - context: None

======================================================================

Metrics Summary

  - ✅ Answer Relevancy (score: 0.2, threshold: 0.0, evaluation model: gpt-4, reason: The score is 0.20 because 
while the answer does mention a location in the upper part, it is not related to the original question about a 
Californio. The answer also includes irrelevant information about the Ill, the Upper Rhine Plain, and the Rhine, 
which are not related to the question. The score is not zero because the concept of a location in the upper part is
addressed.)

For test case:

  - input: Which Californio is located in the upper part?

  - actual output: The Californio located in the upper part is the Ill below of Strasbourg.

  - expected output: None

  - context: None

✅ Tests finished! Run "deepeval login" to view evaluation results on the web.

[[{'name': 'answer_relevancy', 'score': 0.3333333333333333, 'explanation': "The score is 0.33 because the answer correctly identifies the Tehachapi mountain range as the influence for the split of the regions in California. However, the score is not higher because the majority of the points presented in the answer, including details about AAA Auto Clubs, orogenic wedges, numerical models, and Victoria's temperature records, are irrelevant to the original question."}], [{'name': 'answer_relevancy', 'score': 0.5, 'explanation': 'The score is 0.50 because while the answer did provide the correct information about the prize amount for solving the P=NP problem, it also included unnecessary details about the significance of the P=NP problem itself, which was not asked for in the question.'}], [{'name': 'answer_relevancy', 'score': 0.2, 'explanation': 'The score is 0.20 because while the answer does mention a location in the upper part, it is not related to the original question about a Californio. The answer also includes irrelevant information about the Ill, the Upper Rhine Plain, and the Rhine, which are not related to the question. The score is not zero because the concept of a location in the upper part is addressed.'}]]
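
Every test case in the log above is reported with `threshold: 0.0`, so even a score of 0.2 is marked ✅. A minimal sketch of applying a stricter cutoff afterwards, in plain Python, is shown below. The value `0.5` and the helper name `below_threshold` are assumptions made for illustration, not DeepEval settings; the questions and scores are copied from the output above.

```python
# Questions and answer relevancy scores copied from the output above.
questions = [
    "Which mountain range influenced the split of the regions?",
    "What is the prize offered for finding a solution to P=NP?",
    "Which Californio is located in the upper part?",
]
scores = [0.3333333333333333, 0.5, 0.2]

# Hypothetical cutoff chosen for this sketch only.
THRESHOLD = 0.5


def below_threshold(questions, scores, threshold):
    """Return (question, score) pairs whose score is strictly below the threshold."""
    return [(q, s) for q, s in zip(questions, scores) if s < threshold]


flagged = below_threshold(questions, scores, THRESHOLD)
for question, score in flagged:
    print(f"{score:.2f}  {question}")
```

With this cutoff, the first and third questions are flagged, while the P=NP question sits exactly at the threshold and passes.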

Note

When this notebook was created, deepeval version 0.20.57 required contexts to calculate Answer Relevancy. Future versions will no longer require the context field; specifically, the upcoming release of deepeval-haystack will drop it as a mandatory input.
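
Because this behaviour depends on the installed release, it can help to check which versions are present before running the evaluation. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical, and the package names are the ones installed earlier in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("deepeval", "deepeval-haystack"):
    print(name, installed_version(name) or "not installed")
```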

Faithfulness

The faithfulness metric measures the quality of our RAG pipeline’s responses by evaluating whether the response factually aligns with the contents of the context we provided.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model": "gpt-4"})
faithfulness_pipeline.add_component("evaluator", evaluator)

evaluation_results = faithfulness_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

======================================================================

Metrics Summary

  - ✅ Faithfulness (score: 0.2631578947368421, threshold: 0.0, evaluation model: gpt-4, reason: The score is 0.26 
because the actual output persistently mentions the Tehachapi mountain range influencing the split of regions in 
California, which is not addressed in the context of the discussions on orogenic wedges, numerical models, and 
their role in mountain building in the second node of the retrieval context. Additionally, the output is also 
unrelated to the third node of the retrieval context, which discusses about Victoria's warmest regions, the Mallee,
and upper Wimmera, and their weather patterns.)

For test case:

  - input: Which mountain range influenced the split of the regions?

  - actual output: The Tehachapi mountain range influenced the split of the regions in California.

  - expected output: None

  - context: None

  - retrieval context: ['The state is most commonly divided and promoted by its regional tourism groups as 
consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the 
California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters 
by dividing the state along the lines where their jurisdictions for membership apply, as either northern or 
southern California, in contrast to the three-region point of view. Another influence is the geographical phrase 
South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in 
that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would
be included in the southern California region due to their remoteness from the central valley and interior desert 
landscape.', 'Among the most well-known experiments in structural geology are those involving orogenic wedges, 
which are zones in which mountains are built along convergent tectonic plate boundaries. In the analog versions of 
these experiments, horizontal layers of sand are pulled along a lower surface into a back stop, which results in 
realistic-looking patterns of faulting and the growth of a critically tapered (all angles remain the same) orogenic
wedge. Numerical models work in the same way as these analog models, though they are often more sophisticated and 
can include patterns of erosion and uplift in the mountain belt. This helps to show the relationship between 
erosion and the shape of the mountain range. These studies can also give useful information about pathways for 
metamorphism through pressure, temperature, space, and time.', "The Mallee and upper Wimmera are Victoria's warmest
regions with hot winds blowing from nearby semi-deserts. Average temperatures exceed 32 °C (90 °F) during summer 
and 15 °C (59 °F) in winter. Except at cool mountain elevations, the inland monthly temperatures are 2–7 °C (4–13 
°F) warmer than around Melbourne (see chart). Victoria's highest maximum temperature since World War II, of 48.8 °C
(119.8 °F) was recorded in Hopetoun on 7 February 2009, during the 2009 southeastern Australia heat wave."]

======================================================================

Metrics Summary

  - ✅ Faithfulness (score: 0.03571428571428571, threshold: 0.0, evaluation model: gpt-4, reason: The score is 0.04
because the actual output is significantly unfaithful to the retrieval context. It completely ignores the series of
discoveries made in the 1960s, tectonic plates, and their movement, seafloor spreading, components of the Earth's 
lithosphere, and other related concepts as stated in the first node. Additionally, it incorrectly references 
'Californio' and 'Strasbourg' which contradict the context of the Rhine knee and the geographical information 
provided in the second node. It also inaccurately mentions 'Californio' and 'Strasbourg' in the context of Rome and
its historical and geographical facts from the third node.)

For test case:

  - input: Which Californio is located in the upper part?

  - actual output: The Californio located in the upper part is the Ill below of Strasbourg.

  - expected output: None

  - context: None

----------------------------------------------------------------------

✅ Tests finished! Run "deepeval login" to view evaluation results on the web.

[[{'name': 'faithfulness', 'score': 0.2631578947368421, 'explanation': "The score is 0.26 because the actual output persistently mentions the Tehachapi mountain range influencing the split of regions in California, which is not addressed in the context of the discussions on orogenic wedges, numerical models, and their role in mountain building in the second node of the retrieval context. Additionally, the output is also unrelated to the third node of the retrieval context, which discusses about Victoria's warmest regions, the Mallee, and upper Wimmera, and their weather patterns."}], [{'name': 'faithfulness', 'score': 1.0, 'explanation': 'The score is 1.00 because the actual output perfectly aligns with all the nodes in the retrieval context, without any contradictions.'}], [{'name': 'faithfulness', 'score': 0.03571428571428571, 'explanation': "The score is 0.04 because the actual output is significantly unfaithful to the retrieval context. It completely ignores the series of discoveries made in the 1960s, tectonic plates, and their movement, seafloor spreading, components of the Earth's lithosphere, and other related concepts as stated in the first node. Additionally, it incorrectly references 'Californio' and 'Strasbourg' which contradict the context of the Rhine knee and the geographical information provided in the second node. It also inaccurately mentions 'Californio' and 'Strasbourg' in the context of Rome and its historical and geographical facts from the third node."}]]
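
To compare the metrics side by side, the per-question scores can be lined up in one table. The sketch below does this for the contextual relevance, answer relevancy, and faithfulness results printed in this part of the notebook. The helper name `build_rows` and the idea of ranking questions by the sum of their scores are assumptions made for illustration; the scores themselves are copied from the printed results.

```python
# Questions and scores copied from the printed results in this notebook.
questions = [
    "Which mountain range influenced the split of the regions?",
    "What is the prize offered for finding a solution to P=NP?",
    "Which Californio is located in the upper part?",
]
metric_scores = {
    "contextual_relevance": [0.09090909090909091, 0.5384615384615384, 0.0],
    "answer_relevancy": [0.3333333333333333, 0.5, 0.2],
    "faithfulness": [0.2631578947368421, 1.0, 0.03571428571428571],
}


def build_rows(questions, metric_scores):
    """Build one dict per question holding its score for every metric."""
    rows = []
    for index, question in enumerate(questions):
        row = {"question": question}
        for name, scores in metric_scores.items():
            row[name] = scores[index]
        rows.append(row)
    return rows


rows = build_rows(questions, metric_scores)

# Print one line per question: the three scores, then the question text.
for row in rows:
    cells = "  ".join(f"{row[name]:.2f}" for name in metric_scores)
    print(f"{cells}  {row['question']}")

# Hypothetical ranking: the question with the lowest total score across metrics.
weakest = min(rows, key=lambda row: sum(row[name] for name in metric_scores))
print("weakest:", weakest["question"])
```

Summing unrelated metrics is a crude heuristic. Here it simply surfaces the Californio question, where retrieval, answer, and faithfulness all scored poorly.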

Our pipeline evaluation using DeepEval is now complete!

Haystack 2.0 Useful Sources