Integration: INSTRUCTOR Embedders

A component for computing embeddings using INSTRUCTOR embedding models - built for Haystack 2.0.

Authors

Ashwin Mathur

Varun Mathur


This custom component for Haystack 2.0 can be used to create embeddings using the INSTRUCTOR Embedding Models.

INSTRUCTOR is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation) and domain (e.g., science, finance) simply by providing a task instruction, without any further finetuning. INSTRUCTOR achieves state-of-the-art results on 70 diverse embedding tasks (MTEB leaderboard). For more details, check out the paper and project page. The model checkpoints can be found on Hugging Face.

The INSTRUCTOR models can be used to create domain-specific and task-aware embeddings, by passing an instruction along with the text to be encoded.

Unified Template for Creating Instructions

To create customized embeddings for specific sentences, you should follow the unified template to write instructions:

    Represent the 'domain' 'text_type' for 'task_objective':

domain is optional, and it specifies the domain of the text, e.g., science, finance, medicine, etc.
text_type is required, and it specifies the encoding unit, e.g., sentence, document, paragraph, etc.
task_objective is optional, and it specifies the objective of embedding, e.g., retrieve a document, classify the sentence, etc.
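
The template can also be assembled programmatically. The following is a minimal sketch, not part of this package: build_instruction is a hypothetical helper name, and it only assumes the three template fields described above, with text_type required and the other two optional.

```python
from typing import Optional


def build_instruction(
    text_type: str,
    domain: Optional[str] = None,
    task_objective: Optional[str] = None,
) -> str:
    """Assemble 'Represent the <domain> <text_type> for <task_objective>:'.

    Hypothetical helper for illustration; it is not provided by the package.
    """
    if not text_type.strip():
        raise ValueError("text_type is required")

    parts = ["Represent the"]
    if domain:
        # Optional: the domain of the text, e.g. "Wikipedia" or "Financial"
        parts.append(domain.strip())
    # Required: the encoding unit, e.g. "sentence" or "document"
    parts.append(text_type.strip())
    if task_objective:
        # Optional: the objective of the embedding, e.g. "retrieval"
        parts.append(f"for {task_objective.strip()}")
    return " ".join(parts) + ":"


print(build_instruction("document", domain="Wikipedia", task_objective="retrieval"))
print(build_instruction("question", domain="Financial"))
```

The resulting string can be passed as the instruction parameter of either embedder.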

Examples:

Document Text - ‘Capitalism has been dominant in the Western world since the end of feudalism, but most feel that the term “mixed economies” more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.’
Document Embedding Instruction - ‘Represent the Wikipedia document for retrieval:’

Query - ‘In a mixed economy, what are the key factors that determine whether a particular enterprise is privately owned or state-owned?’
Query Embedding Instruction - ‘Represent the Wikipedia question for retrieving supporting documents:’

Document Text - ‘The Federal Reserve on Wednesday raised its benchmark interest rate. The funds rose less than 0.5 per cent on Friday.’
Document Embedding Instruction - ‘Represent the Financial statement:’

Query - ‘What was the impact of the interest rate hike?’
Query Embedding Instruction - ‘Represent the Financial question:’

This integration provides two components:

InstructorTextEmbedder, a component that embeds a string into a vector.
InstructorDocumentEmbedder, a component that embeds a list of Haystack Documents. The embedding of each Document is stored in the embedding field of the Document.

You can use these embedders as standalone components or within a pipeline.

Installation

To use this component, install the instructor-embedders-haystack package.

pip install instructor-embedders-haystack

Usage

To initialize the InstructorTextEmbedder or InstructorDocumentEmbedder, pass the local path or the name of the model on the Hugging Face model hub, such as 'hkunlp/instructor-base', using the model parameter.
Pass the instruction string used to compute domain-specific embeddings using the instruction parameter.

Using the Text Embedder

from haystack.utils.device import ComponentDevice
from haystack_integrations.components.embedders.instructor_embedders import InstructorTextEmbedder

# Example text from the Amazon Reviews Polarity Dataset (https://huggingface.co/datasets/amazon_polarity)
text = "It clearly says online this will work on a Mac OS system. The disk comes and it does not, only Windows. Do Not order this if you have a Mac!!"
instruction = (
    "Represent the Amazon comment for classifying the sentence as positive or negative:"
)

text_embedder = InstructorTextEmbedder(
    model="hkunlp/instructor-base", instruction=instruction,
    device=ComponentDevice.from_str("cpu"),
)
text_embedder.warm_up()
result = text_embedder.run(text)
print(f"Embedding: {result['embedding']}")
print(f"Embedding Dimension: {len(result['embedding'])}")

Using the Document Embedder

from haystack.utils.device import ComponentDevice
from haystack.dataclasses import Document
from haystack_integrations.components.embedders.instructor_embedders import InstructorDocumentEmbedder


doc_embedding_instruction = "Represent the Medical Document for retrieval:"

doc_embedder = InstructorDocumentEmbedder(
    model="hkunlp/instructor-base",
    instruction=doc_embedding_instruction,
    batch_size=32,
    device=ComponentDevice.from_str("cpu"),
)

doc_embedder.warm_up()

# Text taken from PubMed QA Dataset (https://huggingface.co/datasets/pubmed_qa)
document_list = [
    Document(
        content="Oxidative stress generated within inflammatory joints can produce autoimmune phenomena and joint destruction. Radical species with oxidative activity, including reactive nitrogen species, represent mediators of inflammation and cartilage damage.",
        meta={
            "pubid": "25,445,628",
            "long_answer": "yes",
        },
    ),
    Document(
        content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake. Although other pancreatic islet hormones, such as insulin and glucagon, have been extensively investigated, PP secretion and actions are still poorly understood.",
        meta={
            "pubid": "25,445,712",
            "long_answer": "yes",
        },
    ),
    Document(
        content="Disturbed sleep is associated with mood disorders. Both depression and insomnia may increase the risk of disability retirement. The longitudinal links among insomnia, depression and work incapacity are poorly known.",
        meta={
            "pubid": "25,451,441",
            "long_answer": "yes",
        },
    ),
]

result = doc_embedder.run(document_list)
print(f"Document Text: {result['documents'][0].content}")
print(f"Document Embedding: {result['documents'][0].embedding}")
print(f"Embedding Dimension: {len(result['documents'][0].embedding)}")

Using the Embedders in a Semantic Search Pipeline

# Import necessary modules and classes
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import Document
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils.device import ComponentDevice
from datasets import load_dataset

# Import custom INSTRUCTOR Embedders
from haystack_integrations.components.embedders.instructor_embedders import InstructorDocumentEmbedder
from haystack_integrations.components.embedders.instructor_embedders import InstructorTextEmbedder

# Initialize an InMemoryDocumentStore, which will be used to store and retrieve documents
# It uses cosine similarity to compare document embeddings
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

# Define an instruction for document embedding
doc_embedding_instruction = "Represent the News Article for retrieval:"
# Create an InstructorDocumentEmbedder instance with specified parameters
doc_embedder = InstructorDocumentEmbedder(
    model="hkunlp/instructor-base",
    instruction=doc_embedding_instruction,
    batch_size=32,
    device=ComponentDevice.from_str("cpu"),
)
# Warm up the embedder (loading the pre-trained model)
doc_embedder.warm_up()

# Create an indexing pipeline
indexing_pipeline = Pipeline()
# Add the document embedder component to the pipeline
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
# Add a DocumentWriter component to the pipeline that writes documents to the Document Store
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store), name="DocWriter"
)
# Connect the output of DocEmbedder to the input of DocWriter
indexing_pipeline.connect("DocEmbedder", "DocWriter")

# Load the 'XSum' dataset from HuggingFace (https://huggingface.co/datasets/xsum)
dataset = load_dataset("xsum", split="train")

# Create Document objects from the dataset and add them to the document store using the indexing pipeline
docs = [
    Document(
        content=doc["document"],
        meta={
            "summary": doc["summary"],
            "doc_id": doc["id"],
        },
    )
    for doc in dataset
]
indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

# Print the first document and its embedding from the document store
print(doc_store.filter_documents()[0])
print(doc_store.filter_documents()[0].embedding)

# Define an instruction for query embedding
query_embedding_instruction = (
    "Represent the news question for retrieving supporting articles:"
)
# Create an InstructorTextEmbedder instance for query embedding
text_embedder = InstructorTextEmbedder(
    model="hkunlp/instructor-base",
    instruction=query_embedding_instruction,
    device=ComponentDevice.from_str("cpu"),
)
# Load the text embedding model
text_embedder.warm_up()

# Create a query pipeline
query_pipeline = Pipeline()
# Add the text embedder component to the pipeline
query_pipeline.add_component("TextEmbedder", text_embedder)
# Add an InMemoryEmbeddingRetriever component to the pipeline that retrieves documents from the doc_store
query_pipeline.add_component(
    "Retriever", InMemoryEmbeddingRetriever(document_store=doc_store)
)
# Connect the embedding output of TextEmbedder to the query_embedding input of Retriever
query_pipeline.connect("TextEmbedder.embedding", "Retriever.query_embedding")

# Run the query pipeline with a sample query text
results = query_pipeline.run(
    {
        "TextEmbedder": {
            "text": "What were the concerns expressed by Jeanette Tate regarding the response to the flooding in Newton Stewart?"
        }
    }
)

# Print information about retrieved documents
for doc in results["Retriever"]["documents"]:
    print(f"Text:\n{doc.content[:150]}...\n")
    print(f"Metadata: {doc.meta}")
    print(f"Score: {doc.score}")
    print("-" * 10 + "\n")
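
The document store above is configured with embedding_similarity_function="cosine", so retrieval ranks documents by the cosine similarity between the query embedding and each document embedding. The following is a rough sketch of that idea in plain Python on toy three-dimensional vectors. It is not the retriever's actual implementation, which may also scale or post-process scores; cosine_similarity and rank_by_cosine are hypothetical names used only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    query_embedding: Sequence[float],
    doc_embeddings: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the top_k most similar documents."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(doc_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real INSTRUCTOR embeddings
query = [1.0, 0.0, 0.0]
docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
for index, score in rank_by_cosine(query, docs):
    print(f"doc {index}: {score:.3f}")
```

Because cosine similarity depends only on the direction of the vectors, documents are ranked by how closely their embeddings align with the query embedding, regardless of vector length.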

License

instructor-embedders-haystack is distributed under the terms of the Apache-2.0 license.