Day 10: Jingle Metrics All the Way
Haystack Elves worked tirelessly this year to make the holiday season stress-free and joyful. Determined to innovate, they tackled challenges with cutting-edge AI solutions.
They enhanced pipelines with speech-to-text models, explored various LLM providers, and customized Haystack pipelines for unique needs. They built AI Agents with tool-calling and self-reflection, added tracing mechanisms, and developed faster with deepset Studio. To ensure a top-notch tech stack, they partnered with tools like Weaviate, AssemblyAI, NVIDIA NIMs, Arize Phoenix, and MongoDB.
However, there’s one crucial step remaining before taking anything into production: Evaluation!
Haystack equips the elves with the tools they need, including integrations with evaluation frameworks and built-in evaluators. Adding to this, the Haystack ecosystem now features a powerful new tool: EvaluationHarness. This tool streamlines the evaluation process for Haystack pipelines by eliminating the need to create a separate evaluation pipeline while also making it easier to compare configurations using overrides.
For this challenge, you need to help the Haystack elves evaluate a simple RAG pipeline using RAGEvaluationHarness, a specialized extension of EvaluationHarness designed to simplify and optimize evaluation specifically for RAG pipelines.
Requirements:
- A Hugging Face API Key with access to the free gated models meta-llama/Llama-3.2-1B-Instruct and meta-llama/Llama-3.2-3B-Instruct. Visit the model pages to request access. More details are in the Starter Colab below.
- An OpenAI API Key to use LLM-based evaluators with EvaluationHarness, such as FaithfulnessEvaluator and ContextRelevanceEvaluator.
Some Hints:
- Explore the Walkthrough: Evaluation to learn all about evaluation in Haystack.
- For practical examples, check out Cookbook: Evaluating RAG Pipelines with EvaluationHarness and Cookbook: Evaluating AI with Haystack.
Bonus Task: Take it a step further by incorporating hybrid retrieval into your pipeline. Use EvaluationHarness with customizations to test whether hybrid retrieval improves Recall and MRR.
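If you want a quick refresher on what the bonus metrics actually measure, here is a minimal, self-contained sketch of Recall@k and MRR (mean reciprocal rank) computed over retrieved-document IDs. The helper names and toy data below are hypothetical illustrations, not part of the challenge or the Haystack API:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant document IDs found in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant hit per query (0 if no hit)."""
    scores = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        reciprocal = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                reciprocal = 1.0 / rank
                break
        scores.append(reciprocal)
    return sum(scores) / len(scores)


# Toy example: two queries with hypothetical document IDs.
retrieved = [["d1", "d3", "d2"], ["d9", "d4", "d7"]]
relevant = [{"d2", "d3"}, {"d4"}]

print(recall_at_k(retrieved[0], relevant[0], k=3))  # 1.0: both relevant docs retrieved
print(mean_reciprocal_rank(retrieved, relevant))    # 0.5: first hit at rank 2 for each query
```

When comparing a plain retriever against a hybrid one, higher Recall means more of the ground-truth documents show up at all, while higher MRR means they show up earlier in the ranking.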
Here is the Starter Colab.