Day 10: Jingle Metrics All the Way
Haystack Elves worked tirelessly this year to make the holiday season stress-free and joyful. Determined to innovate, they tackled challenges with cutting-edge AI solutions.
They enhanced pipelines with speech-to-text models, explored various LLM providers, and customized Haystack pipelines for unique needs. They built AI Agents with tool-calling and self-reflection, added tracing mechanisms, and developed faster with deepset Studio. To ensure a top-notch tech stack, they partnered with tools like Weaviate, AssemblyAI, NVIDIA NIMs, Arize Phoenix, and MongoDB.
However, there’s one crucial step remaining before taking anything into production: Evaluation!
Haystack equips the elves with the tools they need, including integrations with evaluation frameworks and built-in evaluators. Adding to this, the Haystack ecosystem now features a powerful new tool: EvaluationHarness. This tool streamlines the evaluation process for Haystack pipelines by eliminating the need to create a separate evaluation pipeline while also making it easier to compare configurations using overrides.
For this challenge, you need to help the Haystack elves evaluate a simple RAG pipeline using RAGEvaluationHarness, a specialized extension of EvaluationHarness designed to simplify and optimize evaluation specifically for RAG pipelines.
Requirements:
- A Hugging Face API Key with access to the free gated models meta-llama/Llama-3.2-1B-Instruct and meta-llama/Llama-3.2-3B-Instruct. Visit the model pages to request access. More details are in the Starter Colab below.
- An OpenAI API Key to use LLM-based evaluators with EvaluationHarness, such as FaithfulnessEvaluator and ContextRelevanceEvaluator.
Some Hints:
- Explore the Walkthrough: Evaluation to learn all about evaluation in Haystack.
- For practical examples, check out Cookbook: Evaluating RAG Pipelines with EvaluationHarness and Cookbook: Evaluating AI with Haystack.
Bonus Task: Take it a step further by incorporating hybrid retrieval into your pipeline. Use EvaluationHarness with customizations to test whether hybrid retrieval improves Recall and MRR.
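If you want a quick refresher on what the bonus metrics actually measure, here is a minimal, self-contained sketch of Recall@k and MRR (mean reciprocal rank) computed over retrieved-document IDs. The helper names and toy data below are hypothetical illustrations, not part of the challenge or the Haystack API:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant document IDs found in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant hit per query (0 if no hit)."""
    scores = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        reciprocal = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                reciprocal = 1.0 / rank
                break
        scores.append(reciprocal)
    return sum(scores) / len(scores)


# Toy example: two queries with hypothetical document IDs.
retrieved = [["d1", "d3", "d2"], ["d9", "d4", "d7"]]
relevant = [{"d2", "d3"}, {"d4"}]

print(recall_at_k(retrieved[0], relevant[0], k=3))  # 1.0: both relevant docs retrieved
print(mean_reciprocal_rank(retrieved, relevant))    # 0.5: first hit at rank 2 for each query
```

When comparing a plain retriever against a hybrid one, higher Recall means more of the ground-truth documents show up at all, while higher MRR means they show up earlier in the ranking.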
Here is the Starter Colab.