Functional Bytes

LLM Evaluation

What is Evaluation

Evaluation is checking the performance and capabilities (output) of an LLM-based application against various metrics such as:

  • Accuracy
  • Coherency
  • Toxicity
  • Many more

Why

Directly evaluating one specific hosted LLM (such as OpenAI's GPT ada) may not be particularly useful outside of selecting a provider, but evaluating the output of your tailored prompts and the flow of your LLM-based application (agentic or otherwise) will help track how your application performs while leveraging an LLM.

The concept is not specific to LLMs, as evaluating the same metrics is applicable to any ML-based application (and any other complex algorithmic process), but with the trend towards incorporating LLMs these notes are mainly framed around their use with LLMs.

In CI

  • Track the results after merging; results for candidate changes need to be kept separate from results for what is already in the production (main) code
  • Can use testing frameworks (e.g. jest) to bootstrap the flow while developing specific CI actions (see the sketch below)
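
A minimal sketch of that bootstrapping idea, assuming a hypothetical runPrompt helper in ./app that wraps the application's own prompt/LLM call, with a couple of made-up test cases:

```typescript
// eval.test.ts -- a minimal sketch; runPrompt is a hypothetical helper
// that wraps the application's own prompt/LLM call.
import { runPrompt } from "./app";

const cases = [
  { input: "Summarise: the cat sat on the mat.", mustContain: "cat" },
  { input: "Summarise: rain is expected tomorrow.", mustContain: "rain" },
];

describe("prompt evaluation", () => {
  test.each(cases)(
    "output for $input contains expected keyword",
    async ({ input, mustContain }) => {
      const output = await runPrompt(input);
      // Programmatic check kept loose so minor wording changes still pass.
      expect(output.toLowerCase()).toContain(mustContain);
    }
  );
});
```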

Types

  • Reference Free: Evaluate without examples of the expected output
  • Ground Truth: Compare an expected/reference output to the result (both types are contrasted in the sketch below)
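
A rough sketch of the difference between the two, using simple programmatic checks and a made-up assumption that the application should return short JSON output:

```typescript
// Ground truth: compare the output against a known reference answer.
function groundTruthEval(llmOutput: string, reference: string): boolean {
  // Naive normalised exact match; real evals often use similarity
  // scores or an LLM judge instead.
  return llmOutput.trim().toLowerCase() === reference.trim().toLowerCase();
}

// Reference free: no reference answer, only checks properties of the
// output itself -- here (as an assumption) that it is non-empty, within
// a length budget, and parses as JSON.
function referenceFreeEval(llmOutput: string): boolean {
  if (llmOutput.trim().length === 0 || llmOutput.length > 2000) return false;
  try {
    JSON.parse(llmOutput);
    return true;
  } catch {
    return false;
  }
}
```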

Approaches

  • LLM as judge evaluator = using an LLM to score the output (see the sketch below)
  • Programmatic = using custom logic to check whether the LLM output and/or function calls are correct
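
A sketch of the LLM-as-judge approach; the judge call is passed in as a function so the sketch stays provider-agnostic (no real client API is assumed):

```typescript
// The judge call is a function parameter so the sketch stays
// provider-agnostic; wire it to whichever client/model you use.
type JudgeCall = (prompt: string) => Promise<string>;

// Explicit 1-5 score from an LLM judge.
async function judgeScore(
  callJudge: JudgeCall,
  question: string,
  answer: string
): Promise<number> {
  const prompt = [
    "Rate the following answer for correctness and relevance on a scale of 1-5.",
    `Question: ${question}`,
    `Answer: ${answer}`,
    "Reply with only the number.",
  ].join("\n");
  const reply = await callJudge(prompt);
  const score = parseInt(reply.trim(), 10);
  // Clamp to the 1-5 range and treat unparsable replies as the lowest score.
  return Number.isNaN(score) ? 1 : Math.min(5, Math.max(1, score));
}
```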

Problems to solve

  • Which measure(s) to use
  • How to run them
  • How to ensure consistency between runs

Terms

  • cot: chain of thought (reasoning)
  • reference-free evaluation: evaluate a response without examples/reference data
  • ground truth evaluation: evaluate a response against an example expected response/data
  • explicit score: prompt for a score between two numbers (e.g. 1-5)
  • implicit score: prompt expecting yes or no (e.g. "Is the submission harmful, offensive, or inappropriate?"); sketched below
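
An illustration of an implicit score: a yes/no judge prompt mapped to pass/fail, with the judge call again abstracted as a provider-agnostic function parameter:

```typescript
type JudgeCall = (prompt: string) => Promise<string>;

// Implicit score: a yes/no judge prompt mapped to a pass/fail boolean.
async function isSubmissionSafe(
  callJudge: JudgeCall,
  submission: string
): Promise<boolean> {
  const prompt =
    "Is the submission harmful, offensive, or inappropriate? Answer Y or N.\n" +
    `Submission: ${submission}`;
  const reply = await callJudge(prompt);
  // "N" (not harmful) counts as a pass.
  return reply.trim().toUpperCase().startsWith("N");
}
```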

Evaluation Types to be Reviewed

Accuracy/Precision/Recall

  • Accuracy = correct predictions / total predictions (% correct); all three metrics are sketched in code after this list
    • Can be misleading if the dataset is imbalanced (e.g. if only 5% of examples are positive, predicting negative for everything still yields 95% accuracy)
  • Precision = how often positive predictions are correct = True Positives / (True Positives + False Positives)
    • Does not account for false negatives
  • Recall = how often true positives are correctly identified = True Positives / (True Positives + False Negatives)
    • AKA: sensitivity or true positive rate
    • Does not account for false positives
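
All three can be computed from the same true/false positive/negative counts; a small sketch with binary labels:

```typescript
interface Counts { tp: number; fp: number; tn: number; fn: number; }

// Tally true/false positives/negatives from predicted vs. actual labels.
function confusion(predicted: boolean[], actual: boolean[]): Counts {
  const c: Counts = { tp: 0, fp: 0, tn: 0, fn: 0 };
  predicted.forEach((p, i) => {
    const a = actual[i];
    if (p && a) c.tp++;
    else if (p && !a) c.fp++;
    else if (!p && a) c.fn++;
    else c.tn++;
  });
  return c;
}

const accuracy  = (c: Counts) => (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn);
const precision = (c: Counts) => (c.tp + c.fp === 0 ? 0 : c.tp / (c.tp + c.fp));
const recall    = (c: Counts) => (c.tp + c.fn === 0 ? 0 : c.tp / (c.tp + c.fn));

// Imbalanced example from the note above: 100 items, 5 positives, and a
// model that predicts "false" for everything.
const actual = Array.from({ length: 100 }, (_, i) => i < 5);
const predicted = actual.map(() => false);
const c = confusion(predicted, actual);
// accuracy(c) === 0.95, but recall(c) === 0, so accuracy alone is misleading.
```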

Topics to Research Further

  • Grounding
  • Relevance
  • Efficiency
  • Versatility
  • Hallucinations
  • Toxicity
