LLM evaluation methods
Evaluation metrics and starter code snippets
Overview and evaluation techniques
LLM evaluation is as difficult as training an LLM. For predictive models, metrics such as MSE, MAE, RMSE, the F1-score, and the confusion matrix do a great job. LLM evaluation, however, differs from the evaluation of traditional predictive models in several key aspects because of the nature of the tasks LLMs are designed for and the complexity of language processing.
Evaluation Metrics
One of the crucial tasks in LLM evaluation is deciding the set of metrics you want to evaluate your LLM application on. Some examples are:
- factual consistency (how factually correct your LLM application is based on the provided context)
- answer relevancy (how relevant your LLM application’s outputs are based on the respective inputs in your evals dataset)
- coherence and context understanding (how logical and consistent your LLM application’s outputs are)
- toxicity (whether your LLM application is outputting harmful content)
- bias and fairness
There are two types of LLM evaluation metrics:
Context-less:
n-gram-based metrics that evaluate the predicted response against the provided reference and do not need any additional document for context. Examples include BLEU and ROUGE.
BLEU (Bilingual Evaluation Understudy):
The BLEU score — Used primarily for evaluating machine translation quality; measures the similarity between the LLM’s output and one or more reference translations or texts provided by humans. It does this by calculating the n-gram overlap between the generated text and the reference texts.
Specifically, BLEU computes the precision of n-grams (sequences of n words) in the generated text by comparing them to the n-grams in the reference texts. The final BLEU score is a weighted geometric mean of these n-gram precisions, multiplied by a brevity penalty that penalizes shorter outputs.
import evaluate
predictions = ["Thank you for reaching out to discuss your concerns.", "foo is an AI Company"]
references = [["Thank you for reaching out to discuss your concerns.", "hello there !"], ["foo ia a foobar"]]
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(results)
# {'bleu': 0.7320459696259976, 'precisions': [0.7333333333333333, 0.6923076923076923, 0.7272727272727273, 0.7777777777777778], 'brevity_penalty': 1.0, 'length_ratio': 2.142857142857143, 'translation_length': 15, 'reference_length': 7}
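To see where the final number comes from, here is a minimal sketch that reproduces the 'bleu' value above from the reported n-gram precisions and brevity penalty (BLEU's default uniform weights of 1/4 per n-gram order are assumed):
import math
# n-gram precisions and brevity penalty copied from the output above
precisions = [0.7333333333333333, 0.6923076923076923,
              0.7272727272727273, 0.7777777777777778]
brevity_penalty = 1.0  # no penalty: the candidate is not shorter than the reference
# BLEU = BP * exp(sum_n w_n * log(p_n)) with uniform weights w_n = 1/4
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(brevity_penalty * geo_mean)  # ~0.732, matching the 'bleu' value above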
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
ROUGE — Commonly used for evaluating text summarization tasks; compares machine-generated text against human-written reference texts and measures the overlap of n-grams (sequences of n words) between them. The most commonly used ROUGE metrics are:
- ROUGE-N: Measures the overlap of n-grams between the generated and reference texts. ROUGE-1 considers unigram overlap, ROUGE-2 considers bigram overlap, and so on.
- ROUGE-L: Evaluates the longest common subsequence (LCS) between the generated and reference texts. It captures sentence-level structure similarity.
- ROUGE-W: Weighted LCS-based metric that favours consecutive LCSes.
import evaluate
predictions = ["Thank you for reaching out to discuss your concerns.", "foo is an AI Company"]
references = [["Thank you for reaching out to discuss your concerns.", "hello there !"], ["foo ia a foobar"]]
rouge = evaluate.load("rouge")
results = rouge.compute(predictions=predictions, references=references)
print(results)
# {'rouge1': 0.6111111111111112, 'rouge2': 0.5, 'rougeL': 0.6111111111111112, 'rougeLsum': 0.6111111111111112}
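As a rough check of the rouge1 value above (a simplified sketch that ignores repeated tokens, which full ROUGE counts properly): the first prediction exactly matches one of its references (F1 = 1.0), and the second pair shares only the token "foo".
# Rough ROUGE-1 check for the second prediction/reference pair
pred = "foo is an AI Company".lower().split()
ref = "foo ia a foobar".lower().split()
overlap = len(set(pred) & set(ref))                 # only "foo" matches
precision = overlap / len(pred)                     # 1 / 5
recall = overlap / len(ref)                         # 1 / 4
f1 = 2 * precision * recall / (precision + recall)  # ~0.222
print((1.0 + f1) / 2)                               # ~0.611, matching rouge1 above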
Limitations of context-less metrics:
- Focuses solely on surface-level similarities, not capturing semantic equivalence or coherence.
- Sensitive to minor variations in wording, even if the overall meaning is preserved; they cannot detect paraphrasing and are thrown off by punctuation changes.
- Relies on the quality and representativeness of the reference summaries.
Context-based:
When contextual embeddings are used to evaluate generated text, the results become more meaningful: the embeddings supply the context against which the metric scores the output. Some examples are BERTScore, perplexity, and human evaluation.
BERTScore:
BERTScore — Suitable when deep semantic understanding and context matching are critical, such as in complex question-answering tasks; leverages the contextual embeddings from the pre-trained BERT model to calculate the semantic similarity between the generated text and the reference text. Here’s how it works:
1. It computes the cosine similarity between each token in the generated text and each token in the reference text using their BERT embeddings.
2. It then calculates precision by finding the maximum similarity score for each token in the generated text against all tokens in the reference.
3. Recall is calculated by finding the maximum similarity for each token in the reference against all tokens in the generated text.
4. The final BERTScore combines these precision and recall values into an F1 score, as sketched below.
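The greedy matching in steps 1-3 can be illustrated with a toy sketch. The random vectors below are stand-ins for real BERT token embeddings, so only the mechanics (not the numbers) are meaningful:
import numpy as np
# Toy stand-ins for contextual token embeddings
rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 8))  # 5 candidate tokens, 8-dim embeddings
ref = rng.normal(size=(6, 8))   # 6 reference tokens
# Normalise rows so dot products become cosine similarities
cand /= np.linalg.norm(cand, axis=1, keepdims=True)
ref /= np.linalg.norm(ref, axis=1, keepdims=True)
sim = cand @ ref.T              # (5, 6) token-to-token similarity matrix
precision = sim.max(axis=1).mean()  # best reference match for each candidate token
recall = sim.max(axis=0).mean()     # best candidate match for each reference token
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)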
Advantages of BERTScore:
- It captures semantic similarity better than n-gram based metrics like BLEU by using contextual embeddings.
- It accounts for paraphrases and synonyms, which n-gram metrics often fail to capture.
- It can handle long-range dependencies and word order changes better than n-gram metrics.
- It provides precision, recall, and F1 scores, giving more insight into the evaluation.
Limitations:
- It requires more computational resources compared to string-based metrics due to the use of BERT embeddings.
- It may not always align perfectly with human judgments, especially when coherence and structure of the text are important.
from evaluate import load
bertscore = load("bertscore")
predictions = ["Thank you for reaching out to discuss your concerns.", "foo is an AI Company"]
references = [["Thank you for reaching out to discuss your concerns.", "hello there !"], ["foo ia a foobar"]]
results = bertscore.compute(predictions=predictions, references=references, lang="en")
print(results)
# {'precision': [1.0, 0.8442325592041016], 'recall': [1.0, 0.8272688388824463], 'f1': [1.0, 0.8356646299362183], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0)'}
# Comparing BERTScore with n-gram based metrics
bertscore = load("bertscore")
predictions = ["Thank you for reaching out to discuss your concerns"]
references = ["Thanks for reaching out to me discussing your concerns."]
bert = bertscore.compute(predictions=predictions, references=references, lang="en")
r = rouge.compute(predictions=predictions, references=references)
print("BERT = ", bert)
print("ROUGE = ", r)
# BERT = {'precision': [0.962797999382019], 'recall': [0.9634072184562683], 'f1': [0.9631025195121765], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0)'}
# ROUGE = {'rouge1': 0.6666666666666666, 'rouge2': 0.5, 'rougeL': 0.6666666666666666, 'rougeLsum': 0.6666666666666666}
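Note how BERTScore stays high (around 0.96) for this paraphrase while ROUGE drops to about 0.67, illustrating the semantic-similarity advantage discussed above.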
Perplexity:
It measures how well a language model can predict the next word in a sequence, based on the probability distribution it assigns to the words in its vocabulary.
Perplexity quantifies the “surprise” or uncertainty of a language model when predicting the next word. A lower perplexity indicates that the model is more confident in its predictions, while a higher perplexity means the model is more “perplexed” or uncertain. Mathematically, perplexity is defined as the exponentiated average negative log-likelihood of a sequence.
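In formula form, PPL(X) = exp(-(1/t) * sum_i log p(x_i | x_<i)). The sketch below computes this directly with GPT-2 via the transformers library; it assumes torch and transformers are installed, and the exact value may differ slightly from the evaluate metric shown later because of tokenization details:
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
enc = tokenizer("Thank you for reaching out to discuss your concerns.",
                return_tensors="pt")
with torch.no_grad():
    # With labels supplied, the model returns the mean negative log-likelihood
    # per predicted token (cross-entropy) over the sequence
    nll = model(**enc, labels=enc["input_ids"]).loss
print(torch.exp(nll).item())  # perplexity of GPT-2 on this sentence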
Limitations:
- It does not capture the semantic quality or diversity of generated text.
- Models with similar perplexity can produce vastly different outputs.
- Perplexity only measures the difference between the model’s distribution and the ground truth, not the actual quality of the predictions.
perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["Thank you for reaching out to discuss your ", "concerns"]
results = perplexity.compute(model_id='gpt2', add_start_token=False,
                             predictions=input_texts)
print(results)
# {'perplexities': [145.87115478515625, 14432.1455078125], 'mean_perplexity': 7289.008331298828}
Human Evaluation: While automated metrics provide scalability, human judgments remain the gold standard for evaluating context understanding and relevance. Human evaluators can better assess subtleties like humor, sarcasm, and cultural relevance. However, it is not a practical option for large-scale projects.
Conclusion
In this article, we discussed widely used LLM evaluation metrics. In practice, state-of-the-art large language models are evaluated with a broader set of metrics and benchmarks that assess their performance across many dimensions.