- Retrieval: a vector search is performed over a knowledge base, retrieving the top-k document chunks that are closest to the input query by distance in the embedding space.
- Generation: the retrieved contexts and the original question are supplied to a Large Language Model (LLM), which generates the final answer for the end user. A minimal sketch of this two-stage pipeline is shown below.
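The sketch below is an illustration of the two stages, not Vijil code: it assumes an OpenAI embedding model and chat model, and a plain NumPy array standing in for the knowledge base. The document snippets and model choices are made up for the example.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    # Embed a list of strings; the embedding model is an illustrative choice
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Toy "knowledge base" of pre-chunked documents
chunks = [
    "FY2022 revenue was $4.2 billion, up 8% year over year.",
    "Operating margin declined to 12% due to higher input costs.",
    "Headcount grew to 9,500 employees at the end of FY2022.",
]
chunk_vectors = embed(chunks)

def retrieve(query, k=2):
    # Retrieval: rank chunks by cosine similarity to the query embedding, keep the top-k
    q = embed([query])[0]
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def generate(query, contexts):
    # Generation: supply the retrieved contexts and the original question to the LLM
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(contexts) + "\n\nQuestion: " + query
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

question = "What was FY2022 revenue?"
print(generate(question, retrieve(question)))
```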
Retrieval Metrics
To measure the quality of the retrieved contexts, Vijil uses two LLM-based metrics. Each produces a score between 0 and 1.
- Contextual Precision: measures whether the contexts relevant to the input question are ranked higher in the full set of retrieved contexts than irrelevant ones. A higher score indicates that relevant contexts appear closer to the top of the ranking.
- Contextual Recall: measures the extent to which the retrieved contexts cover the information in the golden answer. A higher score indicates greater coverage of the golden answer.
Vijil uses gpt-4o as the judge LLM for both of these metrics.
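To make the precision score concrete: one common formulation (used by several open-source RAG evaluators; Vijil's exact formula is not shown here) averages precision@k over the positions that hold relevant contexts. The judge LLM supplies the binary relevance labels; the function below is only the aggregation step.

```python
def contextual_precision(relevance):
    """relevance: list of 0/1 flags, one per retrieved context, in ranked order.
    Returns the mean of precision@k over positions k holding a relevant context."""
    score, relevant_seen = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            relevant_seen += 1
            score += relevant_seen / k  # precision@k at this relevant position
    return score / relevant_seen if relevant_seen else 0.0

# The same set of relevant contexts scores higher when ranked first
print(contextual_precision([1, 1, 0, 0]))  # 1.0
print(contextual_precision([0, 0, 1, 1]))  # ~0.42
```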
Generation Metrics
Vijil’s generation metrics are divided into three categories, each measuring a different capability of the LLM within the RAG pipeline.
Correctness
To measure correctness of the LLM-generated answers, we use the following traditional NLP metrics (a computation sketch follows the list).
- BLEU
- METEOR
- BERTScore
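Vijil computes these scores for you; the sketch below only illustrates what they measure, using the open-source nltk and bert-score packages on a made-up golden answer and model answer.

```python
# pip install nltk bert-score
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score

nltk.download("wordnet")  # METEOR relies on WordNet for synonym matching

reference = "The company's FY2022 revenue was $4.2 billion."  # golden answer (made up)
candidate = "Revenue for FY2022 came to $4.2 billion."        # model answer (made up)

# nltk >= 3.6.6 expects pre-tokenized input for METEOR
ref_tokens, cand_tokens = reference.split(), candidate.split()

bleu = sentence_bleu([ref_tokens], cand_tokens)            # exact n-gram overlap
meteor = meteor_score([ref_tokens], cand_tokens)           # overlap with stemming/synonyms
_, _, f1 = bert_score([candidate], [reference], lang="en") # semantic similarity via BERT embeddings
print(bleu, meteor, f1.item())
```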
Relevancy
Vijil’s LLM-based Answer Relevancy metric measures the degree to which the final generated output is relevant to the original input. It produces a score between 0 and 1, with a higher score indicating higher relevancy.
Hallucination
Vijil uses an LLM-based Faithfulness metric to measure how faithfully the generated response sticks to the retrieved contexts, i.e. the opposite of hallucination. This metric produces scores from 0 to 1, where a higher score means that the response is more faithful to the context (has fewer hallucinations).
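Both Answer Relevancy and Faithfulness follow an LLM-as-judge pattern: a judge model scores the output rather than computing string overlap. The sketch below shows the general shape of such a faithfulness check with gpt-4o as the judge; the prompt and rubric are illustrative assumptions, not Vijil's actual judge prompt.

```python
import json
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(response, contexts):
    # Ask the judge LLM whether every claim in the response is supported by the contexts.
    # Prompt and rubric are illustrative only.
    context_block = "\n".join(contexts)
    prompt = (
        "You are grading faithfulness. Return JSON {\"score\": <float between 0 and 1>}, "
        "where 1 means every claim in the response is supported by the context "
        "and 0 means none are.\n\n"
        f"Context:\n{context_block}\n\nResponse:\n{response}"
    )
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)["score"]
```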
Evaluating Domain-specific Question Answering
In the example below, you will use the FinanceBench benchmark dataset to evaluate how reliably gpt-4o-mini can produce answers from a dataset of financial documents.
Note that the Vijil Python client uses an API token, loaded from the environment variable VIJIL_API_KEY. Make sure you have fetched an API key from the UI and stored it in your .env file.
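A sketch of the setup and kickoff might look like the following. The exact argument names (model hub, model name, and the FinanceBench harness identifier) are assumptions here; check the Vijil client documentation for the parameters your version accepts.

```python
# pip install vijil python-dotenv
import os
from dotenv import load_dotenv
from vijil import Vijil

load_dotenv()  # loads VIJIL_API_KEY from your .env file

client = Vijil(api_key=os.environ["VIJIL_API_KEY"])

# Kick off an evaluation of gpt-4o-mini on the FinanceBench harness.
# Argument names and the harness identifier are illustrative assumptions.
evaluation = client.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o-mini",
    harnesses=["financebench"],
)
print(evaluation)  # typically includes the evaluation id used below
```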
You can use the get_status method to keep track of the progress of the evaluation.
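For example, with a simple polling loop. The shape of the returned status object and the evaluation id field are assumptions; print the object once to inspect what your client version returns.

```python
import time

# Poll every 30 seconds until the reported status is COMPLETE.
while True:
    status = client.evaluations.get_status(evaluation_id=evaluation["id"])
    print(status)
    if "COMPLETE" in str(status):
        break
    time.sleep(30)
```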
Once the status is COMPLETE, you can aggregate the values of all metrics.
To do so, you first download all inputs, outputs, and metric values.
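Assuming the per-prompt records have been pulled into a pandas DataFrame with one row per (prompt, metric) pair, aggregation is a groupby. The download call below is a placeholder assumption; substitute the export or describe call your version of the client provides.

```python
import pandas as pd

# Placeholder: replace with the download/describe call your client version provides.
records = client.evaluations.describe(evaluation_id=evaluation["id"])

# Assumes `records` can be turned into rows of {prompt, response, metric, score}.
df = pd.DataFrame(records)

# Average each metric across all test prompts.
summary = df.groupby("metric")["score"].mean()
print(summary)
```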
As per the Answer Relevancy metric, the generated answers are moderately relevant to the input question. Contextual precision is high, since the benchmark dataset does not contain any contexts irrelevant to the golden answers. Contextual recall is around 77%, indicating that the retrieved contexts sometimes do not cover all the information in the golden answers. As per the Faithfulness metric, the generated responses may have some hallucinations.